Tools for analysis and evaluation of CPU Performance

Analysis tools for Evaluation and
Performance

Mourad Bouache
PhD, Computer Architecture
bouache@gmail.com

Oracle - Nov, 14-2011

Introduction

Processors are increasingly complex
• Microarchitecture design is more difficult.

Simulator: a very important tool
• Understand how instructions behave while they execute in the
processor.

Complex simulator:
• Takes time to prepare and modify.

Simulator

Tool
• Simulator : very important tool
• test new concepts

Three characteristics

Complexity of microarchitectures

Speed decreases as complexity increases

Contribution : vectorization methodology

Monolithic Simulation

SimpleScalar is the most widely used simulator (in 70% of articles).
Like most other simulators, it has a serious drawback: it is
monolithic.

Advantage
• simulation speed

Disadvantages
• Difficult to update.
• Difficult to extract and compare the simulator components.

Modular simulation

Advantages
• Reuse, exchange and compare simulator modules,
• Better confidence in simulation (closer to HW),
• Easier to read.

Main drawback :
• Simulation speed slowdown

Outline

1 Modular simulation environment
2 Acceleration techniques
3 Vectorization of Simulator Modules
4 Experimental framework
5 Results
6 Scheduling process in SystemC
7 Conclusion & future works.

Modular simulation environments

• A modular simulation environment describes the system to simulate
hierarchically and structurally.
• To simulate the entire system, the environment includes a
scheduler that controls the execution of the different components.
• Key benefits:
  • Reuse ...
  • Compare ...
  • Share ...
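
As a rough illustration of the description above, the following sketch shows a small hierarchy of modules driven by a scheduler. It is plain C++ and not UNISIM code: the names Module, Counter, Core, Scheduler and run_two_module_system are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: one module per hardware block.
class Module {
public:
    explicit Module(std::string name) : name_(std::move(name)) {}
    virtual ~Module() = default;
    virtual void cycle() = 0;  // work performed during one simulated cycle
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// Leaf module: only counts the cycles it has executed.
class Counter : public Module {
public:
    explicit Counter(std::string name) : Module(std::move(name)) {}
    void cycle() override { ++ticks; }
    int ticks = 0;
};

// Composite module: the hierarchy is expressed by owning sub-modules.
class Core : public Module {
public:
    explicit Core(std::string name) : Module(std::move(name)) {}
    void cycle() override { fetch.cycle(); execute.cycle(); }
    Counter fetch{"fetch"};
    Counter execute{"execute"};
};

// Hypothetical scheduler: runs every registered top-level module once per cycle.
class Scheduler {
public:
    void add(Module* m) { assert(m != nullptr); modules_.push_back(m); }
    void run(int cycles) {
        for (int c = 0; c < cycles; ++c)
            for (Module* m : modules_) m->cycle();
    }
private:
    std::vector<Module*> modules_;
};

// Builds a two-module system (core + DRAM) and returns the total
// number of leaf-module activations after the given number of cycles.
int run_two_module_system(int cycles) {
    Core core("core0");
    Counter dram("dram");
    Scheduler sched;
    sched.add(&core);
    sched.add(&dram);
    sched.run(cycles);
    return core.fetch.ticks + core.execute.ticks + dram.ticks;
}
```

Because every block sits behind the same Module interface, a different Counter or Core implementation could be swapped in without touching the scheduler, which is the reuse and compare benefit listed above.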

Acceleration techniques
• reduction of inputs and simulation programs: MinneSPEC,
• simulation engine optimization: FastSysC [1] (2x speedup),
• distribution of simulation: DisT,
• sampling techniques: representative, periodic and
random sampling,
• transition to TTLM (Timed Transaction Level Modeling).

[1] Daniel Gracia Perez et al. FastSysC: a fast SystemC engine.


• Compromise between accuracy and simulation speed,
• Vectorization [2] is a methodology that can be combined with
any of these acceleration techniques.

[2] David Parello, Mourad Bouache, and Bernard Goossens. Improving cycle-level modular simulation by
vectorization. In Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'09).

Modular simulation environment

UNISIM [3]: a modular simulation framework

• UNISIM is a modular simulation framework: each simulator
is divided into several modules, each module corresponding to
a hardware block.
• A module is composed of two parts: state and processes.
• A process is defined in a .sim file as a C++ class.

[3] http://www.unisim.org/

UNISIM: communication protocol

• Ports: inports and outports
• Signals

3 signals:
• Processes can be sensitive to the data, accept and enable
signals.
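
The three signals can be read as a per-cycle handshake. The sketch below is an interpretation, not UNISIM code: it assumes that data travels from sender to receiver, accept travels back, and enable confirms the transfer, which is inferred from the FunctionalUnit example (out.enable = out.accept). Signal, Link and handshake_cycle are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Hypothetical signal: a value plus a flag telling whether it has
// been written during the current cycle ("known").
template <typename T>
struct Signal {
    T value{};
    bool is_known = false;
    void write(const T& v) { value = v; is_known = true; }
    void reset() { is_known = false; }
};

// One sender-to-receiver link made of the three signals named above.
struct Link {
    Signal<int>  data;    // sender -> receiver: the payload
    Signal<bool> accept;  // receiver -> sender: "I have room for it"
    Signal<bool> enable;  // sender -> receiver: "the transfer happens"
};

// Runs one cycle of the handshake and returns true if the payload
// ended up in the receiver's FIFO.
bool handshake_cycle(Link& link, int payload,
                     std::queue<int>& receiver_fifo, std::size_t capacity) {
    // Start of cycle: nothing is known yet.
    link.data.reset();
    link.accept.reset();
    link.enable.reset();

    // 1. The sender publishes its data.
    link.data.write(payload);

    // 2. The receiver, sensitive to data, answers with accept.
    if (link.data.is_known)
        link.accept.write(receiver_fifo.size() < capacity);

    // 3. The sender, sensitive to accept, confirms with enable.
    if (link.accept.is_known)
        link.enable.write(link.accept.value);

    // 4. End of cycle: the receiver latches the data only if enabled.
    assert(link.enable.is_known);
    if (link.enable.value) {
        receiver_fifo.push(link.data.value);
        return true;
    }
    return false;
}
```

With a receiver FIFO of capacity 1, a first call transfers its payload and a second call in the same state is refused, because accept (and therefore enable) is false.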


UNISIM : signals

• The simulation engine (SystemC) wakes up the modules'
processes.


Communication between modules
Scalability is difficult with modular simulation, for two reasons:
• Communication costs between the simulator modules.
• Waking up a process for each communicating module.
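
To make the two costs concrete, the sketch below counts writes and process wake-ups for the same traffic in a monolithic style (plain variables) and in a modular style (signals that wake a sensitive process on every write). It is a simplified model under stated assumptions, not the SystemC or UNISIM engine: WakeSignal, Counters, simulate_modular and simulate_monolithic are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Counters {
    long writes = 0;   // values communicated
    long wakeups = 0;  // processes activated by the engine
};

// Hypothetical signal: every write wakes all sensitive processes.
class WakeSignal {
public:
    explicit WakeSignal(Counters* counters) : counters_(counters) {}
    void on_change(std::function<void()> process) {
        sensitive_.push_back(std::move(process));
    }
    void write(int v) {
        value_ = v;
        ++counters_->writes;
        for (auto& process : sensitive_) {
            ++counters_->wakeups;
            process();
        }
    }
    int read() const { return value_; }
private:
    Counters* counters_;
    int value_ = 0;
    std::vector<std::function<void()>> sensitive_;
};

// Modular style: one signal per core, one receiving process per signal.
Counters simulate_modular(std::size_t cores, int cycles) {
    Counters counters;
    std::vector<int> latches(cores, 0);
    std::vector<WakeSignal> signals(cores, WakeSignal(&counters));
    for (std::size_t i = 0; i < cores; ++i)
        signals[i].on_change([&latches, &signals, i] {
            latches[i] = signals[i].read();  // the receiver copies the value
        });
    for (int c = 0; c < cycles; ++c)
        for (std::size_t i = 0; i < cores; ++i)
            signals[i].write(c);
    return counters;
}

// Monolithic style: communication is a plain variable write.
Counters simulate_monolithic(std::size_t cores, int cycles) {
    Counters counters;
    std::vector<int> shared(cores, 0);
    for (int c = 0; c < cycles; ++c)
        for (std::size_t i = 0; i < cores; ++i) {
            shared[i] = c;
            ++counters.writes;
        }
    return counters;
}
```

For 4 cores and 10 cycles both versions perform 40 writes, but only the modular version also pays for 40 wake-ups, and that number grows with every duplicated module.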

Communication costs

Monolithic Simulator
• Write/read a variable.

Modular Simulator

A New Communication Protocol

Signals Array
• Reduces the number of signals.
• Several data, accept and enable values are temporarily stored in
a signals array.
• Extending the communication protocol between modules in this
way is one solution for accelerating simulation.
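
A minimal sketch of the signals-array idea, assuming that element writes are only buffered and that a single send() wakes the sensitive processes once for all values. SignalArray, RunResult and run_vectorized are hypothetical names chosen for this example; the actual UNISIM classes may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical vectorized signal: N values share one notification.
template <typename T, std::size_t N>
class SignalArray {
public:
    void on_send(std::function<void()> process) {
        sensitive_.push_back(std::move(process));
    }
    // Element access only buffers the value; nobody is woken up.
    T& operator[](std::size_t cfg) { assert(cfg < N); return values_[cfg]; }
    const T& operator[](std::size_t cfg) const { assert(cfg < N); return values_[cfg]; }
    // One send() wakes each sensitive process once for all N values.
    void send() {
        for (auto& process : sensitive_) {
            ++wakeups_;
            process();
        }
    }
    long wakeups() const { return wakeups_; }
private:
    std::array<T, N> values_{};
    std::vector<std::function<void()>> sensitive_;
    long wakeups_ = 0;
};

struct RunResult {
    long wakeups;   // process activations over the whole run
    long last_sum;  // sum of the values seen by the receiver in the last cycle
};

// Drives NBCFG configurations for a number of cycles through one signal array.
template <std::size_t NBCFG>
RunResult run_vectorized(int cycles) {
    SignalArray<int, NBCFG> data;
    long last_sum = 0;
    data.on_send([&data, &last_sum] {
        last_sum = 0;
        for (std::size_t cfg = 0; cfg < NBCFG; ++cfg)
            last_sum += data[cfg];  // the receiver handles every configuration
    });
    for (int c = 0; c < cycles; ++c) {
        for (std::size_t cfg = 0; cfg < NBCFG; ++cfg)
            data[cfg] = c + static_cast<int>(cfg);
        data.send();
    }
    return RunResult{data.wakeups(), last_sum};
}
```

With 4 configurations and 10 cycles, the receiver is woken 10 times instead of the 40 times that one signal per configuration would require, while still seeing all four values each cycle.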

Module Vectorization

A simple and systematic procedure
1. vectorize module state and ports,
2. add a loop around the process,
3. add send() method calls after the added for loops.

Example : Functional Unit

    class FunctionalUnit : public module {
    public:
        inclock clock;
        inport<instr> in;
        outport<instr> out;

        FunctionalUnit(const char *name) : module(name) {
            // sensitivity list
            sensitive_pos_method(start_of_cycle) << clock;
            sensitive_neg_method(end_of_cycle) << clock;
            sensitive_method(on_data_accept) << in.data << out.accept;
        }

        // Rising clock edge: offer the next ready instruction, if any.
        void start_of_cycle() {
            if (pipeline.is_ready())
                out.data = pipeline.get();
            else
                out.data.nothing();
        }

        // Once input data and output accept are both known,
        // answer upstream (accept) and downstream (enable).
        void on_data_accept() {
            if (in.data.know() && out.accept.know()) {
                if (!pipeline.is_full() || out.accept)
                    in.accept = true;
                else
                    in.accept = false;
                out.enable = out.accept;
            }
        }

        // Falling clock edge: update the pipeline state.
        void end_of_cycle() {
            if (out.accept) pipeline.pop();
            if (in.enable) pipeline.push(in.data);
            pipeline.run();
        }

    private:
        Fifo<instr> pipeline;
    };


Vectorization procedure
1. Vectorize module state and ports.

Before:

    class FunctionalUnit : public module {
    public:
        inclock clock;
        inport<instr> in;
        outport<instr> out;
        ...
    private:
        Fifo<instr> pipeline;

After:

    class FunctionalUnit : public module {
    public:
        inclock clock;
        inport<instr, NBCFG> in;
        outport<instr, NBCFG> out;
        ...
    private:
        Fifo<instr> pipeline[NBCFG];


Vectorization procedure
2. Add a loop around the process.

Before:

    ...
    void start_of_cycle() {
        if (pipeline.is_ready())
            out.data = pipeline.get();
        else
            out.data.nothing();
    }
    void on_data_accept() {
        if (in.data.know() && out.accept.know()) {
            if (!pipeline.is_full() || out.accept)
                in.accept = true;
            else
                in.accept = false;
            out.enable = out.accept;
        }
    }
    ...

After:

    ...
    void start_of_cycle() {
        for (int cfg = 0; cfg < NBCFG; cfg++) {
            if (pipeline[cfg].is_ready())
                out.data[cfg] = pipeline[cfg].get();
            else
                out.data[cfg].nothing();
            ...
        }
    }
    void on_data_accept() {
        if (in.data.know() && out.accept.know()) {
            for (int cfg = 0; cfg < NBCFG; cfg++) {
                if (!pipeline[cfg].is_full() || out.accept[cfg])
                    in.accept[cfg] = true;
                else
                    in.accept[cfg] = false;
                out.enable[cfg] = out.accept[cfg];
                ...
            }
        }
    }
    ...


Vectorization procedure
3. Add send() method calls after the added for loops.

Before:

    ...
    void start_of_cycle() {
        if (pipeline.is_ready())
            out.data = pipeline.get();
        else
            out.data.nothing();
    }
    void on_data_accept() {
        if (in.data.know() && out.accept.know()) {
            if (!pipeline.is_full() || out.accept)
                in.accept = true;
            else
                in.accept = false;
            out.enable = out.accept;
        }
    }
    ...

After:

    ...
    void start_of_cycle() {
        for (int cfg = 0; cfg < NBCFG; cfg++) {
            if (pipeline[cfg].is_ready())
                out.data[cfg] = pipeline[cfg].get();
            else
                out.data[cfg].nothing();
        }
        out.data.send();
    }
    void on_data_accept() {
        if (in.data.know() && out.accept.know()) {
            for (int cfg = 0; cfg < NBCFG; cfg++) {
                if (!pipeline[cfg].is_full() || out.accept[cfg])
                    in.accept[cfg] = true;
                else
                    in.accept[cfg] = false;
                out.enable[cfg] = out.accept[cfg];
            }
            in.accept.send();
            out.enable.send();
        }
    }
    ...

Example : Vectorized Functional Unit

    class FunctionalUnit : public module {
    public:
        inclock clock;
        inport<instr, NBCFG> in;
        outport<instr, NBCFG> out;

        FunctionalUnit(const char *name) : module(name) {
            // sensitivity list
            sensitive_pos_method(start_of_cycle) << clock;
            sensitive_neg_method(end_of_cycle) << clock;
            sensitive_method(on_data_accept) << in.data << out.accept;
        }

        void start_of_cycle() {
            for (int cfg = 0; cfg < NBCFG; cfg++) {
                if (pipeline[cfg].is_ready())
                    out.data[cfg] = pipeline[cfg].get();
                else
                    out.data[cfg].nothing();
            }
            out.data.send();  // one notification for all configurations
        }

        void on_data_accept() {
            if (in.data.know() && out.accept.know()) {
                for (int cfg = 0; cfg < NBCFG; cfg++) {
                    if (!pipeline[cfg].is_full() || out.accept[cfg])
                        in.accept[cfg] = true;
                    else
                        in.accept[cfg] = false;
                    out.enable[cfg] = out.accept[cfg];
                }
                in.accept.send();
                out.enable.send();
            }
        }

        void end_of_cycle() {
            for (int cfg = 0; cfg < NBCFG; cfg++) {
                if (out.accept[cfg]) pipeline[cfg].pop();
                // each configuration pushes its own input element
                if (in.enable[cfg]) pipeline[cfg].push(in.data[cfg]);
                pipeline[cfg].run();
            }
        }

    private:
        Fifo<instr> pipeline[NBCFG];
    };

Simulator Vectorization

Multi-cores Simulation
• In our study, we simulated multi-core configurations with 2, 4, 8,
16, 32 and 64 cores.

OoOSim : Out of Order Simulator

OoOSim [4] models a generic superscalar out-of-order processor.
The baseline simulator includes a 4-way superscalar core with an L1
instruction cache, an L1 write-back data cache, a bus and a DRAM.

[4] Mourad Bouache, David Parello, Bernard Goossens. Acceleration of Modular Simulation. In International
Supercomputing Conference (ISC09), Hamburg, Germany, June 2009.


OoOSim: 12 modules
1. Fetcher,
2. AllocatorRenamer,
3. Dispatcher,
4. Scheduler,
5. RegisterFile,
6. Ret-Broadcast and CDBA (Common Data Bus Arbiter),
7. IntegerUnit, FloatingPointUnit and AddressGenerationUnit,
8. LoadStoreQueue,
9. Data caches L1 and L2,
10. Instruction cache L1,
11. Memory DRAM,
12. Reorder Buffer.

More than 15,000 lines of code; 12 modules connected through 187 signals.

Benchmarks: MiBench

• Simulations were carried out with MiBench, which is divided into six
suites, each targeting a specific market area for embedded
applications:
Automotive, Network, Security, Consumer Devices,
Office Automation, and Telecommunications.

Auto./Industrial    Consumer  Office        Network   Security  Telecomm.
susan (edges)       jpeg      stringsearch  dijkstra  sha       FFT
susan (corners)     -         -             -         rijndael  -
susan (smoothing)   -         -             -         -         -

Performance evaluation

Simulation machine
• Performance evaluation was carried out on a cluster of
30 Intel Xeon 5148 dual-core processors clocked at
2.33 GHz with a 4 MB L2 cache.

Results: simulation speed (without vectorization)

Results: simulation speed (with vectorization)

Why ... ?

Instrumentation of the FastSysC code

• Cycle counters (RDTSC: Read Time Stamp Counter) measure:
1. The time spent in the FastSysC scheduler (transit time).
2. The process time.
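
The split between scheduler time and process time can be sketched as follows. This is an illustration, not the instrumentation actually added to FastSysC: read_ticks, TickAccount and run_instrumented are hypothetical names, the loop stands in for the real scheduler, and on non-x86 targets the sketch falls back to std::chrono::steady_clock in place of RDTSC.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>
#define SKETCH_HAVE_RDTSC 1
#endif

// Reads a tick counter: RDTSC where available, otherwise a
// steady clock used as a portable stand-in.
inline std::uint64_t read_ticks() {
#ifdef SKETCH_HAVE_RDTSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

struct TickAccount {
    std::uint64_t scheduler = 0;  // ticks spent outside the processes
    std::uint64_t process = 0;    // ticks spent inside the processes
    unsigned long calls = 0;      // number of process activations
};

// Hypothetical scheduler loop: activates `process` a given number of
// times and attributes the elapsed ticks to one of the two counters.
template <typename Process>
TickAccount run_instrumented(Process process, int activations) {
    assert(activations >= 0);
    TickAccount account;
    const std::uint64_t loop_start = read_ticks();
    for (int i = 0; i < activations; ++i) {
        const std::uint64_t before = read_ticks();
        process();
        const std::uint64_t after = read_ticks();
        account.process += after - before;
        ++account.calls;
    }
    const std::uint64_t total = read_ticks() - loop_start;
    // Whatever is not process time is counted as scheduler transit time.
    account.scheduler = total - account.process;
    return account;
}
```

Comparing the scheduler counter between a non-vectorized and a vectorized run is the kind of measurement the transit-time figure refers to; the tick values themselves depend on the machine and are not reproducible.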

FastSysC transit time (without/with vectorization)

Conclusion

Results
• To address the need for faster simulation, we
proposed a methodology for developing modules in a modular
simulator.
• This methodology is based on a new signal communication
protocol.

Vectorized simulation improves scalability.

Results Discussion

Vectorization ...
• improves simulation speed,
• allows resources to be duplicated while limiting the scheduler
overhead in simulation time,
• can be used in conjunction with other speed-up techniques, such
as sampling or reduction of test programs.

Conclusion

Our contribution aims to improve simulation speed in
modular simulators by offering a simple and systematic
development procedure based on the vectorization of the simulator
modules.

SimpleScalar is not a multi-core simulator.

In focus

Other idea ...
• Vectorization
We wish to compare the results of this methodology with
TTLM (Timed Transaction Level Modeling).

Merci, Thank you, Tack

QUESTIONS ?

Back-up slides

Post-doc research work
• Instruction Level Parallelism (ILP)
  Goal: understand the general structure of an execution and
  the parallelism it offers.
• PerPi: a tool to measure instruction-level parallelism
  • http://kenny.univ-perp.fr/PerPi/
  • A Pin tool (Pin is a free programmable tool from Intel),
  • computes the instruction dependency graph,
  • computes, for each instruction in the run, its instruction cycle in the ideal
    machine.
• Analysis of the structure of instruction-level parallelism
  • Parallelism on loops,
  • Local and global parallelism,
  • Parallelism on function calls ("CALL").

SystemC and FastSysC
SystemC contains a scheduler that manages signals and decides which
processes to start. It contains sequential processes (sensitive to
the clock) and combinational processes (sensitive to input ports).
FastSysC uses a mixture of static and dynamic scheduling to avoid
waking processes unnecessarily, thereby optimizing the simulation
engine.
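
The benefit of avoiding unnecessary wake-ups can be shown with a toy model. The sketch below is not the FastSysC algorithm: it only contrasts an engine that wakes every combinational process until nothing changes with one that wakes only the processes sensitive to a changed input. Bank, settle_wake_all and settle_wake_sensitive are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A bank of independent combinational processes: process i drives
// out[i] = in[i] + 1 and is sensitive only to in[i].
struct Bank {
    std::vector<int> in;
    std::vector<int> out;
    explicit Bank(std::size_t n) : in(n, 0), out(n, 1) {}  // starts stable

    // Runs process i and reports whether its output changed.
    bool run(std::size_t i) {
        const int next = in[i] + 1;
        const bool changed = (next != out[i]);
        out[i] = next;
        return changed;
    }
};

// Naive engine: wake every process, and repeat until a full sweep
// changes nothing. Returns the number of process activations.
long settle_wake_all(Bank& bank) {
    long activations = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < bank.in.size(); ++i) {
            ++activations;
            if (bank.run(i)) changed = true;
        }
    }
    return activations;
}

// Sensitivity-based engine: wake only the processes whose input changed.
long settle_wake_sensitive(Bank& bank,
                           const std::vector<std::size_t>& changed_inputs) {
    long activations = 0;
    for (std::size_t i : changed_inputs) {
        assert(i < bank.in.size());
        ++activations;
        bank.run(i);
    }
    return activations;
}
```

With 8 processes and a single changed input, the naive engine needs 16 activations (one sweep that propagates the change and one that confirms stability), while the sensitivity-based engine needs 1, and both reach the same outputs.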
