Performance Analysis of Multithreaded Applications
based on Hardware Simulations with Dimemas and the Pin Tool


                                       Maria Stylianou
                         Universitat Politecnica de Catalunya (UPC)
                                     marsty5@gmail.com
                                        June 15, 2012




Abstract
It is widely accepted that application development is advancing rapidly, while hardware design
progresses at a slower pace and with less success. The high cost of buying the best hardware,
as well as the absence of machines that can support the latest application developments, leads
scientists to look for other ways to examine the performance of their applications. Hardware
simulation thus becomes crucial for application analysis. This project simulates hardware
using two simulators: one for exploring the effect of interconnect parameters such as latency,
network bandwidth and congestion, and the other for studying the effect of parameters related
to cache memory, such as cache size, cluster size and number of processors. Our predictions
about the strong effect of these parameters are supported by the results of our experiments.




1 Introduction
Simulation is defined as the imitation of the operation of a real-world process or system over
time [1]. In engineering, an architecture simulation models a real-life or hypothetical situation
on a computer so that it can later be studied and analyzed. Through repeated simulations with
varied parameters, researchers predict and draw conclusions about the behavior of the system.
Going a step further, simulation becomes crucial when the real computer hardware is either
inaccessible or prohibited from use, or when the hardware has not yet been built [2].

In this report, the attention revolves around hardware simulations, and more specifically
simulations using two tools: Dimemas and Pin. Both tools are used in this study to analyze
and predict hardware behavior during the execution of parallel programs. The major
differences that distinguish the two tools are briefly described below.

Dimemas is a performance analysis tool for message-passing programs [3], also characterized
as a trace-driven simulator. Taking as inputs a machine's configuration and an application
trace file, Dimemas can reconstruct the time behavior of a parallel application, opening the
door to experimenting with the performance of the modeled hardware.

Like Dimemas, the Pin tool can be used for program analysis. More precisely, Pin is a
dynamic binary instrumentation tool [4]: instrumentation takes place on compiled binaries at
runtime, so recompiling the source code is not needed. As will be described later, the two
tools are used in different scenarios to achieve different goals.

This paper continues in the next section with the methodology followed for setting up the proper
environments for both tools and performing the simulations. In Section 3, the results are
presented and discussed. Finally, in Section 4, conclusions are drawn from our observations.



2 Methodology
The study consists of two main parts: simulation with Dimemas and simulation with Pin. In
this section, both parts are explained in more detail.


2.1 Dimemas simulator
As previously explained, Dimemas is used for developing and analysing the performance of
parallel programs. Several message-passing libraries are supported [3], but for this work an
MPI application was chosen. Starting from the configuration parameters of a machine, several
simulations were performed to test and identify the sensitivity of the application's
performance to interconnect parameters.

2.1.1 Pre-process
The MPI application used is MG from the NAS Parallel Benchmarks [5], and it was run on the
boada server offered by the Barcelona Supercomputing Center. This server has a dual Intel
Xeon E5645 with 24 cores. Traces were generated by running the program with 2, 4, 8, 16, 32
and 64 threads using Extrae, a dynamic instrumentation package that traces programs running
under a shared-memory or message-passing programming model. More details on how to set up
Extrae can be found in [6].

Traces generated by Extrae can be visualized with Paraver, a performance analysis tool that
allows visual inspection of an application and a more detailed quantitative analysis of the
problems observed [7]. To be used as input to Dimemas, traces must first be converted with
prv2trf, which translates them from Paraver format to Dimemas format. This trace translator
can be found in [8]. The command line for running prv2trf is:
./prv2trf paraver_trace.prv dimemas_trace.trf

The second input to Dimemas is a configuration file describing an architecture model
idealized from MareNostrum. This machine is ideal, with zero latency, unlimited bandwidth and
no limit on the number of concurrent communications.

2.1.2 Simulations
The objective is to test the application under different situations and characteristics. The
parameters changed in the simulations are the latency, the network bandwidth, the number of
buses and the relative processor speed. They are studied one by one in the order given above;
for each parameter, a range of values is specified over which the application is tested. After
choosing the best value for a parameter, we move on to the next parameter, keeping the value
just decided fixed. This process is performed 6 times, once for each trace generated with
Extrae for 2, 4, 8, 16, 32 and 64 threads. Using this methodology, it becomes easier to
observe how the application behaves in each circumstance.

The first step after installing Dimemas is to run the Dimemas GUI located in
Dimemas_directory/bin/dimemas-gui.sh. In the window that opens, we choose Load Configuration
from the Configuration menu in order to load the machine's configuration file. Afterwards, we
specify the trace file converted by prv2trf by clicking Configuration → Initial Machine, and
we Compute the number of application tasks. We are then able to make changes to the machine
characteristics.

The parameters mentioned before can be changed from Configuration → Target Configuration.
Specifically, under Node information, latency and relative CPU performance can be changed
through the values of Startup on Remote Communication and Relative Processor Speed
respectively. Under Environment information, the network bandwidth and the number of buses can
be modified. It is important to mention that after each change to a parameter, the button “Do
all the same” has to be pressed in order for the change to be applied to all nodes.

2.1.2.1 Latency
The first parameter studied was latency and its impact when increased or decreased. In
Dimemas, latency represents the local overhead of an MPI implementation. We ran simulations
with different values of latency, starting from 1 ns and multiplying by 10 each time, up to
100,000 ns. After each change, the new configuration was saved as a new configuration file.

2.1.2.2 Network Bandwidth
Another important parameter to study is the network bandwidth. In the ideal machine the
bandwidth is unlimited, so the impact of reducing it is interesting. We ran simulations
starting from 1 Mbyte/s and multiplying by 10 in each scenario, up to 1,000,000 Mbytes/s.

2.1.2.3 Number of Buses
An important question that needs to be answered concerns the impact of contention on the
application. Contention can be modeled by the number of buses, although this is not the only
way; these simulations also examine the possibility that bad routing causes contention and
negatively affects performance. The number of buses defines how many transfers can take place
concurrently. Initially, the machine has no limit on the number of concurrent communications.
We then ran simulations for 1, 2, 4, 8, 16 and 32 buses.

2.1.2.4 Relative CPU performance
The last parameter examined was the Relative Processor Speed, to assess the impact of having a
faster processor in the machine. By faster we mean the speed of executing the sequential
computation bursts between MPI calls. Initially, the speed is at the minimum of 1%. In our
simulations, we tried values from 0.5 times faster up to 5 times faster, increasing by 0.5 in
each simulation.

2.1.3 Post-process
As already explained, each parameter is studied in isolation, without changing any other
parameter. Once all configuration files for the same parameter have been generated, they are
studied and compared, and the best value is chosen based on its impact on execution time and
the cost it entails. The configuration file with this value is then loaded in subsequent
simulations, where a new parameter is studied.

For each configuration saved during simulations, a Paraver file should be produced. This is
done with the command below:
./Dimemas3 -S -32K -pa new_paraver_trace.prv new_config_file.cfg
where we specify the name of the configuration file we have saved and the name we want the
new Paraver file to have.

The traces generated are opened with Paraver along with the initial trace files in order to
compare, observe performance characteristics and examine any problems indicated by the
simulator. In Section 3, the results of these simulations are presented and discussed.




2.2 Pin simulator
Pin analyzes programs by inserting arbitrary code inside executables [4]. In this project, a
pin-tool was designed to simulate a three-level cache hierarchy with a per-processor L1 data
cache, a cluster-shared L2 data cache and a globally-shared L3 data cache. Each processor
first uses its dedicated L1 data cache, which is the fastest but usually the smallest. On an
L1 miss, the L2 data cache is consulted; each L2 cache serves a cluster of processors and is
usually slower than L1 but larger. When the data is not found in L2 either, the L3 data cache
is accessed. L3 is the largest and slowest of the three caches and is shared by all processors.

The objective is to perform multiprocessor cache simulations with a pthread parallel application,
changing several parameters like the number of processors, the size of L1, L2 and L3 caches
and the number of processors per cluster.

2.2.1 Pre-process
The pthread application chosen is called dotprod.c and was found in a list of sample pthread
programs provided on the website of the course [9]. The program must be recompiled after
every change with the command: gcc dotprod.c -o dotprod -lpthread. After
downloading Pin, we chose an existing pin-tool, called dcache.cpp and located in
pin_directory/source/tools/Memory/, as the basis of our final pin-tool. This pin-tool
simulates an L1 cache, and was therefore a helpful starting point for building the L2 and L3
caches. The final pin-tool was named mycache.cpp.

2.2.2 Simulations
The first and largest series of simulations studies the impact of cache size, line size and
associativity. The idea is to study the three parameters for each cache and find which values
increase the hit rate the most. The cluster size and the number of processors were kept fixed
throughout these experiments, at 2 and 4 respectively. All parameters are changed inside the
pin-tool, and every change requires a recompilation, done with the command:
<pin_directory> source/tools/Memory/make
Afterwards, the command below is executed to run the pthread program using the cache
configuration given in mycache.cpp:
<pin_directory> ./pin -t ./source/tools/Memory/obj-intel64/mycache.so --
./dotprod

Table 1 shows the initial values given to the parameters. Starting with L1, we fixed the best
values for the three parameters, then proceeded to L2 and finally to L3. We call the set of
simulations related to a single parameter a stage. For each cache, after a stage of
simulations was complete, the best value of the parameter under study was chosen and used in
the following stages.
Parameters/Cache          L1                       L2                       L3
Cache Size               128 KB                     1 MB                    4 MB
Line Size (bytes)        32                         32                      32
Associativity            1                          1                       1
                                  Table 1: Initial Parameters Values


As previously said, L2 is a cluster-shared cache, meaning it is shared among a set of
processors. The second series of simulations focused on the cluster size and how it affects
the L2 hit rate.

Finally, in the third series of simulations we studied how the number of processors devoted to
the execution of the pthread application affects the program's execution time. This parameter
is set in two places: the pin-tool and the pthread program.

The parameters examined during the Pin simulations are briefly explained below.
2.2.2.1 Cache Size
Cache size is the maximum amount of data, in kilobytes (KB), that a cache can hold at a time.
It is expected that increasing the cache size will increase the hit rate as well. Simulations
were performed for L1 cache sizes of 1, 2, 4, 8, 16, 32 and 64 KB in order to confirm our
expectations. The L2 cache size range depends on the size chosen for L1, since it should be at
least double; similarly, the L3 cache size should be at least double the L2 cache size.

2.2.2.2 Line Size
Line size is the number of bytes transferred into the cache at once. All three caches are
tested with values of 32, 64 and 128 bytes.

2.2.2.3 Associativity
The associativity parameter sets the number of cache locations to which a given memory address
can map. Three simulations were run with associativity values of 1, 2 and 4 for all caches.

2.2.2.4 Cluster size
Cluster size is the number of processors sharing an L2 cache. For this study we tried 1, 2, 4
and 8 processors per L2.

2.2.3 Post-process
After each run of the pin-tool, the execution time is printed on the screen, while the L1, L2
and L3 hit rates and other relevant information are written to an output text file called
mycache.out.



3 Results
In this section, the results of both Dimemas and Pin simulations are presented and explained.

3.1 Dimemas Simulations
3.1.1 Latency
The first simulations tested latency and how it affects the execution time of the program.
Several simulations with different values of latency were performed, from 1 ns up to
100,000 ns, increasing exponentially each time. Figure 1 presents, for different numbers of
processors, the values of latency on the x-axis and the change of the execution time relative
to the ideal one on the y-axis. This ratio is calculated as Current Execution Time / Ideal
Execution Time. Small values of latency do not affect the time, while after 10,000 ns the
ratio clearly starts to increase. Excluding the last latency value, we can see that the
execution time rises after 1,000 ns, and we therefore chose 1,000 ns as the best latency that
our application can handle.




                              Figure 1: Time Ratio based on Latency



3.1.2 Network Bandwidth
With latency fixed at 1,000 ns, we moved on to the network bandwidth. Beginning with the ideal
(unlimited) bandwidth, we tried several values from 1 to 100,000 Mbytes/s, increasing
exponentially, with the change in execution time shown on the y-axis of Figure 2. As expected,
small amounts of bandwidth cause traffic and lead to longer execution times. The value of
1,000 Mbytes/s was chosen as the best one, since the improvement in time with larger bandwidth
was minimal and the cost of more bandwidth would be higher.




                             Figure 2: Time Ratio based on Bandwidth




3.1.3 Number of Buses
After fixing latency and bandwidth at 1,000 ns and 1,000 Mbytes/s, we studied which number of
buses would give the best results. Running simulations for 1, 2, 4, 8, 16 and 32 buses, it is
clear that with more buses the execution time approaches the ideal one. However, having many
buses is infeasible, or at least very difficult to implement. We also notice that the
application still performs well with a very small number of concurrent transfers, and
therefore 2 buses were chosen.




                          Figure 3: Time Ratio based on Number of Buses

3.1.4 Relative CPU performance
Having fixed values for latency, bandwidth and number of buses, we tested how the application
performs with faster processors. With values from 0.5 times faster up to 5 times faster,
increasing by 0.5 in each simulation, it is clearly observed that the speedup is proportional
to the increase. This time the ratio was calculated as Ideal Execution Time / Current
Execution Time, for easier reading of the graph.




Figure 4: Time Ratio based on Relative Processor Speed



3.2 Pin Simulations

3.2.1 Cache Size
Figure 5 presents the hit rate as a function of cache size for all three caches. For L1
(Figure 5-a), the sizes 1, 2, 4, 8, 16, 32 and 64 KB were tested, with 64 KB chosen as the
best value. For L2, the size should be at least double the L1 size, so the range goes from
128 KB to 1024 KB; 256 KB was chosen, since the difference from larger sizes was not very
high. For the same reason, the L3 range goes from 512 KB up to 4096 KB, with 2048 KB selected
as the fixed value.




                 (a)                                                             (b)




                                                 (c)
              Figure 5: Hit Rate based on Cache Size (a) for L1, (b) for L2 and (c) for L3




3.2.2 Line Size
After observing the effects of cache size, the line size was tested for three values: 32, 64
and 128 bytes. As can be seen in Figure 6, for all caches the hit rate rises as the line size
increases, and therefore 128 bytes was chosen.




                     Figure 6: Hit Rate based on Line Size, for L1, L2, L3 caches


3.2.3 Associativity
With the cache size and line size fixed, we studied the impact of associativity. Simulations
were performed for the values 1, 2 and 4. From Figure 7, it can be observed that associativity
does not significantly affect the hit rate in any of the caches.




                    Figure 7: Hit Rate based on Associativity for L1, L2, L3 caches




3.2.4 Cluster Size
The second series of simulations studied the cluster size. In Figure 8 we show the hit rate for 1,
2, 4 and 8 processors per L2 cache. With more processors devoted to one L2 cache, it is
expected that cache accesses will increase and therefore the hit rate will drop. Indeed, in Figure
8, this decrease can be seen.




                            Figure 8: Hit Rate based on Cluster Size (L2)


3.2.5 Number of Processors
The final series of simulations concerns the number of processors working on the application.
As observed in Figure 9, with a larger number of processors the execution time of the pthread
application increases. The parameters used for these simulations are the ones chosen in the
first series of simulations and are shown in Table 2.




                      Figure 9: Execution Time based on number of Processors



Parameters/Cache        L1                          L2                         L3
Cache Size (KB)         64                          256                        1024
Line Size (bytes)       128                         128                        128
Associativity           1                           1                          1
                   Table 2: Values of parameters for the last series of simulations



4 Conclusions
This project focused on how hardware simulations are performed using two simulators: Dimemas
and the Pin tool. Hardware behavior was analyzed by examining the performance impact of
various parameters on multithreaded applications.

Simulations with Dimemas show how latency, network bandwidth, number of buses and CPU speed
can affect the execution time of a parallel application. Simulations with Pin on a multi-level
cache hierarchy show how cache size, line size and associativity affect a given cache as well
as the caches below it. Additionally, cluster size proves to be an important factor for the L2
hit rate; since L2 misses proceed to L3, this parameter ends up being important for the L3 hit
rate as well. Lastly, we examined how increasing the number of threads raises the execution
time. With Pin, we were able to measure the performance of the application on hardware that we
do not have.

This last series of experiments with Pin opens the door to further experimentation and
analysis of applications without access to the hardware, whether because it is costly or
prohibited to use. Going further, such simulations can help scientists examine the pros and
cons of implementing hardware the way it is proposed or theoretically designed.




References
[1] J. Banks, J. Carson, B. Nelson, D. Nicol (2001). Discrete-Event System Simulation. Prentice
Hall. p. 3.

[2] J.A. Sokolowski, C.M. Banks (2009). Principles of Modeling and Simulation. Hoboken, NJ:
Wiley. p. 6.

[3] Barcelona Supercomputing Center. Dimemas. [Online]. Available:
http://www.bsc.es/computer-sciences/performance-tools/dimemas.

[4] Intel Software Network. Pin - A Dynamic Binary Instrumentation Tool. [Online]. Available:
http://www.pintool.org.

[5] J. Dunbar (2012, Mar.). NAS Parallel Benchmarks. [Online]. Available:
http://www.nas.nasa.gov/publications/npb.html

[6] H. S. Gelabert, G. L. Sánchez (2011, Nov.). Extrae: User guide manual for version 2.2.0.
Barcelona Supercomputing Center. [Online]. Available:
http://www.bsc.es/ssl/apps/performanceTools/files/docs/extrae-userguide.pdf

[7] Barcelona Supercomputing Center. Paraver [Online]. Available: http://www.bsc.es/computer-
sciences/performance-tools/paraver

[8] Barcelona Supercomputing Center. Software Modules [Online]. Available:
http://www.bsc.es/ssl/apps/performanceTools/

[9] A. Ramirez (2012, Jan). Primavera 2012. Tools and Measurement Techniques [Online].
Available: http://pcsostres.ac.upc.edu/eitm/doku.php/pri12

[10] Wikipedia, the free encyclopedia. CPU cache [Online]. Available:
http://en.wikipedia.org/wiki/CPU_cache




 

Viewers also liked (20)

IV Torneo de Voleibol "CIUTAT DE GRANOLLERS"
IV Torneo de Voleibol "CIUTAT DE GRANOLLERS"IV Torneo de Voleibol "CIUTAT DE GRANOLLERS"
IV Torneo de Voleibol "CIUTAT DE GRANOLLERS"
 
Cuentos y fabulas de marine
Cuentos y fabulas de marineCuentos y fabulas de marine
Cuentos y fabulas de marine
 
PROYECTO SEDE LA BOTICA
PROYECTO SEDE LA BOTICAPROYECTO SEDE LA BOTICA
PROYECTO SEDE LA BOTICA
 
Ensayo ecotecnologias (1)
Ensayo ecotecnologias (1)Ensayo ecotecnologias (1)
Ensayo ecotecnologias (1)
 
Biografía de tetsu
Biografía de tetsuBiografía de tetsu
Biografía de tetsu
 
Fanon, frantz racismo e cultura
Fanon, frantz   racismo e culturaFanon, frantz   racismo e cultura
Fanon, frantz racismo e cultura
 
Mobile plus
Mobile plusMobile plus
Mobile plus
 
Convocatoria 2â° carrera nocturna normalistica neã³n.
Convocatoria 2â° carrera nocturna normalistica neã³n.Convocatoria 2â° carrera nocturna normalistica neã³n.
Convocatoria 2â° carrera nocturna normalistica neã³n.
 
Location plan
Location planLocation plan
Location plan
 
Introducing mysoft 10
Introducing mysoft 10Introducing mysoft 10
Introducing mysoft 10
 
Learning Disability Qualification (LDQ) Training Courses
Learning Disability Qualification (LDQ) Training Courses Learning Disability Qualification (LDQ) Training Courses
Learning Disability Qualification (LDQ) Training Courses
 
Una breve introducción de la marca de moda
Una breve introducción de la marca de modaUna breve introducción de la marca de moda
Una breve introducción de la marca de moda
 
02 1 ju fra jmj-ficha...
02 1 ju fra jmj-ficha...02 1 ju fra jmj-ficha...
02 1 ju fra jmj-ficha...
 
Evolucion de la presencia de metales pesados en
Evolucion de la presencia de metales pesados enEvolucion de la presencia de metales pesados en
Evolucion de la presencia de metales pesados en
 
Trinity Kings World Leadership Services: Family Franchise Systems Data solves...
Trinity Kings World Leadership Services: Family Franchise Systems Data solves...Trinity Kings World Leadership Services: Family Franchise Systems Data solves...
Trinity Kings World Leadership Services: Family Franchise Systems Data solves...
 
TRIUMPH BOARD PORTABLE SLIM USB / WR
TRIUMPH BOARD PORTABLE SLIM USB / WRTRIUMPH BOARD PORTABLE SLIM USB / WR
TRIUMPH BOARD PORTABLE SLIM USB / WR
 
Agroindustria seminario
Agroindustria seminarioAgroindustria seminario
Agroindustria seminario
 
@TalentoGo. KnowtoGo: "Lidera tu proyecto" Reflexiones
@TalentoGo. KnowtoGo: "Lidera tu proyecto" Reflexiones@TalentoGo. KnowtoGo: "Lidera tu proyecto" Reflexiones
@TalentoGo. KnowtoGo: "Lidera tu proyecto" Reflexiones
 
El genero rap
El genero rapEl genero rap
El genero rap
 
Forever Living Notiziario novembre 2011
Forever Living Notiziario novembre 2011Forever Living Notiziario novembre 2011
Forever Living Notiziario novembre 2011
 

Similar to Performance Analysis of multithreaded applications based on Hardware Simulations with Dimemas and Pin tool

Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
 
HW/SW Partitioning Approach on Reconfigurable Multimedia System on Chip
HW/SW Partitioning Approach on Reconfigurable Multimedia System on ChipHW/SW Partitioning Approach on Reconfigurable Multimedia System on Chip
HW/SW Partitioning Approach on Reconfigurable Multimedia System on ChipCSCJournals
 
A3: application-aware acceleration for wireless data networks
A3: application-aware acceleration for wireless data networksA3: application-aware acceleration for wireless data networks
A3: application-aware acceleration for wireless data networksZhenyun Zhuang
 
Parallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPParallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPIJSRED
 
Performance Evaluation of a Network Using Simulation Tools or Packet Tracer
Performance Evaluation of a Network Using Simulation Tools or Packet TracerPerformance Evaluation of a Network Using Simulation Tools or Packet Tracer
Performance Evaluation of a Network Using Simulation Tools or Packet TracerIOSRjournaljce
 
Traffic Simulator
Traffic SimulatorTraffic Simulator
Traffic Simulatorgystell
 
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Nikhil Jain
 
Lecture 1 - Introduction.pptx
Lecture 1 - Introduction.pptxLecture 1 - Introduction.pptx
Lecture 1 - Introduction.pptxaida alsamawi
 
Network Analyzer and Report Generation Tool for NS-2 using TCL Script
Network Analyzer and Report Generation Tool for NS-2 using TCL ScriptNetwork Analyzer and Report Generation Tool for NS-2 using TCL Script
Network Analyzer and Report Generation Tool for NS-2 using TCL ScriptIRJET Journal
 
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridadMarketing Donalba
 
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP IJCSEIT Journal
 
Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Damir Delija
 
Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Damir Delija
 
Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Damir Delija
 
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...IDES Editor
 
Cloud Module 3 .pptx
Cloud Module 3 .pptxCloud Module 3 .pptx
Cloud Module 3 .pptxssuser41d319
 
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYSPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYcsandit
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 

Similar to Performance Analysis of multithreaded applications based on Hardware Simulations with Dimemas and Pin tool (20)

Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
HW/SW Partitioning Approach on Reconfigurable Multimedia System on Chip
HW/SW Partitioning Approach on Reconfigurable Multimedia System on ChipHW/SW Partitioning Approach on Reconfigurable Multimedia System on Chip
HW/SW Partitioning Approach on Reconfigurable Multimedia System on Chip
 
A3: application-aware acceleration for wireless data networks
A3: application-aware acceleration for wireless data networksA3: application-aware acceleration for wireless data networks
A3: application-aware acceleration for wireless data networks
 
Parallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPParallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MP
 
Performance Evaluation of a Network Using Simulation Tools or Packet Tracer
Performance Evaluation of a Network Using Simulation Tools or Packet TracerPerformance Evaluation of a Network Using Simulation Tools or Packet Tracer
Performance Evaluation of a Network Using Simulation Tools or Packet Tracer
 
Traffic Simulator
Traffic SimulatorTraffic Simulator
Traffic Simulator
 
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
 
Lecture 1 - Introduction.pptx
Lecture 1 - Introduction.pptxLecture 1 - Introduction.pptx
Lecture 1 - Introduction.pptx
 
Network Analyzer and Report Generation Tool for NS-2 using TCL Script
Network Analyzer and Report Generation Tool for NS-2 using TCL ScriptNetwork Analyzer and Report Generation Tool for NS-2 using TCL Script
Network Analyzer and Report Generation Tool for NS-2 using TCL Script
 
p850-ries
p850-riesp850-ries
p850-ries
 
FrackingPaper
FrackingPaperFrackingPaper
FrackingPaper
 
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
 
Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...
 
Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...
 
Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...Communication network simulation on the unix system trough use of the remote ...
Communication network simulation on the unix system trough use of the remote ...
 
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
 
Cloud Module 3 .pptx
Cloud Module 3 .pptxCloud Module 3 .pptx
Cloud Module 3 .pptx
 
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYSPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 

More from Maria Stylianou

SPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication MiddlewareSPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication MiddlewareMaria Stylianou
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksMaria Stylianou
 
Scaling Online Social Networks (OSNs)
Scaling Online Social Networks (OSNs)Scaling Online Social Networks (OSNs)
Scaling Online Social Networks (OSNs)Maria Stylianou
 
Green Optical Networks with Signal Quality Guarantee
Green Optical Networks with Signal Quality Guarantee Green Optical Networks with Signal Quality Guarantee
Green Optical Networks with Signal Quality Guarantee Maria Stylianou
 
Cano projectGreen Optical Networks with Signal Quality Guarantee
Cano projectGreen Optical Networks with Signal Quality Guarantee Cano projectGreen Optical Networks with Signal Quality Guarantee
Cano projectGreen Optical Networks with Signal Quality Guarantee Maria Stylianou
 
A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...
A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...
A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...Maria Stylianou
 
Automatic Energy-based Scheduling
Automatic Energy-based SchedulingAutomatic Energy-based Scheduling
Automatic Energy-based SchedulingMaria Stylianou
 
Intelligent Placement of Datacenters for Internet Services
Intelligent Placement of Datacenters for Internet ServicesIntelligent Placement of Datacenters for Internet Services
Intelligent Placement of Datacenters for Internet ServicesMaria Stylianou
 
Instrumenting the MG applicaiton of NAS Parallel Benchmark
Instrumenting the MG applicaiton of NAS Parallel BenchmarkInstrumenting the MG applicaiton of NAS Parallel Benchmark
Instrumenting the MG applicaiton of NAS Parallel BenchmarkMaria Stylianou
 
Low-Latency Multi-Writer Atomic Registers
Low-Latency Multi-Writer Atomic RegistersLow-Latency Multi-Writer Atomic Registers
Low-Latency Multi-Writer Atomic RegistersMaria Stylianou
 
How Companies Learn Your Secrets
How Companies Learn Your SecretsHow Companies Learn Your Secrets
How Companies Learn Your SecretsMaria Stylianou
 
EEDC - Why use of REST for Web Services
EEDC - Why use of REST for Web Services EEDC - Why use of REST for Web Services
EEDC - Why use of REST for Web Services Maria Stylianou
 
EEDC - Distributed Systems
EEDC - Distributed SystemsEEDC - Distributed Systems
EEDC - Distributed SystemsMaria Stylianou
 

More from Maria Stylianou (16)

SPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication MiddlewareSPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication Middleware
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible Attacks
 
Scaling Online Social Networks (OSNs)
Scaling Online Social Networks (OSNs)Scaling Online Social Networks (OSNs)
Scaling Online Social Networks (OSNs)
 
Erlang in 10 minutes
Erlang in 10 minutesErlang in 10 minutes
Erlang in 10 minutes
 
Pregel - Paper Review
Pregel - Paper ReviewPregel - Paper Review
Pregel - Paper Review
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
Green Optical Networks with Signal Quality Guarantee
Green Optical Networks with Signal Quality Guarantee Green Optical Networks with Signal Quality Guarantee
Green Optical Networks with Signal Quality Guarantee
 
Cano projectGreen Optical Networks with Signal Quality Guarantee
Cano projectGreen Optical Networks with Signal Quality Guarantee Cano projectGreen Optical Networks with Signal Quality Guarantee
Cano projectGreen Optical Networks with Signal Quality Guarantee
 
A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...
A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...
A Survey on Large-Scale Decentralized Storage Systems to be used by Volunteer...
 
Automatic Energy-based Scheduling
Automatic Energy-based SchedulingAutomatic Energy-based Scheduling
Automatic Energy-based Scheduling
 
Intelligent Placement of Datacenters for Internet Services
Intelligent Placement of Datacenters for Internet ServicesIntelligent Placement of Datacenters for Internet Services
Intelligent Placement of Datacenters for Internet Services
 
Instrumenting the MG applicaiton of NAS Parallel Benchmark
Instrumenting the MG applicaiton of NAS Parallel BenchmarkInstrumenting the MG applicaiton of NAS Parallel Benchmark
Instrumenting the MG applicaiton of NAS Parallel Benchmark
 
Low-Latency Multi-Writer Atomic Registers
Low-Latency Multi-Writer Atomic RegistersLow-Latency Multi-Writer Atomic Registers
Low-Latency Multi-Writer Atomic Registers
 
How Companies Learn Your Secrets
How Companies Learn Your SecretsHow Companies Learn Your Secrets
How Companies Learn Your Secrets
 
EEDC - Why use of REST for Web Services
EEDC - Why use of REST for Web Services EEDC - Why use of REST for Web Services
EEDC - Why use of REST for Web Services
 
EEDC - Distributed Systems
EEDC - Distributed SystemsEEDC - Distributed Systems
EEDC - Distributed Systems
 

Recently uploaded

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Recently uploaded (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Performance Analysis of multithreaded applications based on Hardware Simulations with Dimemas and Pin tool

  • 1. Performance Analysis of multithreaded applications based on Hardware Simulations with Dimemas and Pin tool Maria Stylianou Universitat Politecnica de Catalunya (UPC) marsty5@gmail.com June 15, 2012 Abstract It is widely accepted that application development is rapidly growing while hardware design occurs in a smaller pace and success. The high cost of buying the best hardware as well as the non-existence of machines that can support the latest application developments lead scientists to look for other ways to examine the performance of their applications. Hardware simulation appears to be crucial for an application analysis. This project attempts to simulate hardware using two simulators; one for exploring the outcome of parameters like latency, network bandwidth, congestion and the other for studying the effect of parameters related to cache memory, such as cache size, cluster size, number of processors. Our predictions for the great effect of these parameters are justified with the results extracted from our experimentation. 1
simulations using two tools: Dimemas and Pin. Both tools are used in this study to analyze and predict hardware behavior during the execution of parallel programs. The major differences between the two tools are briefly described below.

Dimemas is a performance analysis tool for message-passing programs [3], otherwise characterized as a trace-driven simulator. Taking as inputs a machine configuration and an application trace file, Dimemas can reconstruct the time behavior of a parallel application and opens the door to experimenting with the performance of the modeled hardware. Pin can similarly be used for program analysis. More precisely, Pin is a dynamic binary instrumentation tool [4]: instrumentation takes place on compiled binaries at runtime, so recompiling the source code is not needed. As described later, the two tools are used in different scenarios to achieve different goals.

This paper continues in the next section with the methodology followed for setting up the proper environments for both tools and performing the simulations. In Section 3, the results are presented and discussed. Finally, in Section 4, conclusions are drawn from our observations.
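The kind of prediction a trace-driven simulator like Dimemas performs can be illustrated with the linear communication cost model such tools commonly assume: the time to deliver a message is roughly a fixed latency plus the message size divided by the bandwidth. The sketch below is illustrative only; the message sizes and network figures are hypothetical, not Dimemas's actual internals:

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed latency plus serialization time."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

# Replaying a hypothetical trace of message sizes under two network configurations:
trace = [1_000, 64_000, 1_000_000]  # message sizes in bytes
slow = sum(transfer_time(m, 50e-6, 100e6) for m in trace)  # 50 us latency, 100 MB/s
fast = sum(transfer_time(m, 1e-6, 10e9) for m in trace)    # 1 us latency, 10 GB/s
```

Replaying the same trace under different (latency, bandwidth) pairs is what lets the simulator predict how the application would behave on hardware that is not physically available.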
2 Methodology

The study consists of two main parts: simulation with Dimemas and simulation with Pin. In this section, both parts are explained in more detail.

2.1 Dimemas simulator

As previously explained, Dimemas is used for developing and analyzing the performance of parallel programs. Several message-passing libraries are supported [3], but for this work an MPI application was chosen. Starting from the configuration parameters of a machine, several simulations were performed to test and identify the sensitivity of the application's performance to interconnect parameters.

2.1.1 Pre-process

The MPI application used is MG from the NAS Parallel Benchmarks [5], and it was run on the boada server offered by the Barcelona Supercomputing Center. This server has a dual Intel
Xeon E5645 with 24 cores. Traces were generated by running the program with 2, 4, 8, 16, 32 and 64 threads, using Extrae, a dynamic instrumentation package that traces programs running under a shared-memory or message-passing programming model. Details on how to set up Extrae can be found in [6]. Traces generated by Extrae can be visualized with Paraver, a performance analysis tool that allows visual inspection of an application and a more detailed quantitative analysis of the problems observed [7]. To be used as input to Dimemas, traces must be translated from Paraver format to Dimemas format with prv2trf, a trace translator available at [8]. The command line for running prv2trf is:

./prv2trf paraver_trace.prv dimemas_trace.trf

The second input to Dimemas is a configuration file describing an architecture model idealized from MareNostrum. This machine is ideal, with zero latency, unlimited bandwidth and no limit on the number of concurrent communications.

2.1.2 Simulations

The objective is to test the application under different conditions. The parameters changed in the simulations are the latency, the network bandwidth, the number of buses and the relative processor speed. They are studied one by one, in the order given above; for each parameter, a range of values is specified over which the application is tested. After choosing the best value for a parameter, we move on to the next parameter, keeping the value just decided fixed. This process is performed six times, once for each trace generated with Extrae (2, 4, 8, 16, 32 and 64 threads). With this methodology, it becomes easier to observe how the application behaves in each circumstance.

The first step after installing Dimemas is to run the Dimemas GUI, located in Dimemas_directory/bin/dimemas-gui.sh.
In the window that opens, from the Configuration menu we choose Load Configuration to load the machine's configuration file. Afterwards, we specify the trace file converted by prv2trf by clicking Configuration → Initial Machine, and we press Compute to set the number of application tasks. We are then able to change the machine's characteristics. The parameters mentioned above can be changed from Configuration → Target Configuration. Specifically, under Node information, the latency and the relative CPU performance can be changed through the values Startup on Remote Communication and Relative Processor Speed, respectively. Under Environment information, the network bandwidth and the number of buses can be modified. Note that after each change to a parameter, the button "Do all the same" has to be pressed for the change to be applied to all nodes.

2.1.2.1 Latency

The first parameter studied was latency and its impact when this value is increased or decreased. In Dimemas, latency represents the local overhead of an MPI implementation. We ran simulations with different values of latency, beginning from 1 ns and increasing each time the
latency by a factor of 10, up to 100,000 ns. After each change, the new configuration was saved as a new configuration file.

2.1.2.2 Network Bandwidth

Another important parameter studied is the network bandwidth. In the ideal machine the bandwidth is unlimited, so the impact of reducing it is of interest. We ran simulations starting from 1 Mbyte/s and increasing by a factor of 10 in each scenario, up to 1,000,000 Mbytes/s.

2.1.2.3 Number of Buses

An important question concerns the impact of contention on the application. Congestion can be modeled by the number of buses, although this is not the only possible cause: these simulations also cover the possibility that bad routing causes contention and degrades performance. Initially, the machine has no limit on the number of concurrent communications. We then ran simulations for 1, 2, 4, 8, 16 and 32 buses. In other words, the number of buses defines how many transfers can take place at any given time.

2.1.2.4 Relative CPU performance

The last parameter examined was the relative processor speed, that is, the impact of having a faster processor in the machine. By faster we mean the speed of execution of the sequential computation bursts between MPI calls. Initially, the relative speed is at its minimum value of 1. In our simulations we tried values from 0.5 times up to 5 times faster, increasing by 0.5 in each simulation.

2.1.3 Post-process

As already explained, each parameter is studied in isolation, without changing any other parameter. Once all configuration files for a given parameter have been generated, they are studied and compared, and the best value is chosen based on its impact on the execution time and the cost it implies. The configuration file with this value becomes the starting configuration for the simulations of the next parameter.
For each configuration saved during the simulations, a Paraver file is produced. This is done with the command below:

./Dimemas3 -S -32K -pa new_paraver_trace.prv new_config_file.cfg

where we specify the name of the saved configuration file and the name we want the new Paraver file to have. The generated traces are opened with Paraver alongside the initial trace files, in order to compare them, observe performance characteristics and examine any problems indicated by the simulator. In Section 3, the results of these simulations are presented and discussed.
2.2 Pin simulator

Pin analyzes programs by inserting arbitrary code inside executables [4]. In this project, a pin-tool was designed to simulate a three-level cache hierarchy with a per-processor L1 data cache, a cluster-shared L2 data cache and a globally shared L3 data cache. Each processor first uses its dedicated L1 data cache, which is the fastest but usually the smallest. When a reference misses in L1, the L2 data cache is used; an L2 cache serves a cluster of processors and is usually slower than L1 but larger. When a reference also misses in L2, the L3 data cache is consulted; L3 has the highest access cost of the three caches and can be used by all processors.

The objective is to perform multiprocessor cache simulations with a pthread parallel application, changing several parameters such as the number of processors, the sizes of the L1, L2 and L3 caches, and the number of processors per cluster.

2.2.1 Pre-process

The pthread application chosen is called dotprod.c and was taken from a list of sample pthread programs provided on the website of the course [9]. The program must be recompiled after every change, with the command:

gcc dotprod.c -o dotprod -lpthread

After downloading Pin, we chose an existing pin-tool called dcache.cpp, located in pin_directory/source/tools/Memory/, as the basis for our final pin-tool. This pin-tool simulates an L1 cache, and was therefore helpful for building the L2 and L3 caches. The final pin-tool was named mycache.cpp.

2.2.2 Simulations

The first, and largest, series of simulations studies the impact of cache size, line size and associativity. The idea is to study the three parameters for each cache and find which values increase the hit rate the most. The cluster size and the number of processors were kept fixed throughout these experiments, at 2 and 4 respectively.
All parameters can be changed inside the pin-tool; every change requires a new compilation, done with the command:

<pin_directory> source/tools/Memory/make

Afterwards, the command below is executed to run the pthread program using the cache configuration given in mycache.cpp:

<pin_directory> ./pin -t ./source/tools/Memory/obj-intel64/mycache.so -- ./dotprod

Table 1 shows the initial values given to the parameters. Starting with L1, we fixed the best values for the three parameters, then proceeded to L2 and finally to L3. We call the set of simulations related to a single parameter a stage of simulations. For each cache, after a stage of simulations was complete, the best value of the parameter under study was chosen and used in the subsequent stages.

Parameters/Cache    L1       L2     L3
Cache Size          128 KB   1 MB   4 MB
Line Size (bytes)   32       32     32
Associativity       1        1      1

Table 1: Initial parameter values
As said previously, L2 is a cluster-shared cache, meaning it is shared among a set of nodes. The second series of simulations focused on the cluster size and how it affects the L2 hit rate. Finally, in the third series of simulations we studied how the number of processors devoted to the execution of the pthread application affects the execution time of the program. This parameter is set in two places: the pin-tool and the pthread program. The parameters examined during the Pin simulations are briefly explained below.

2.2.2.1 Cache Size

The cache size is the maximum number of kilobytes (KB) a cache can hold at a time. It is expected that increasing the cache size will increase the hit rate as well. Simulations were performed for L1 cache sizes of 1, 2, 4, 8, 16, 32 and 64 KB in order to confirm our expectations. The L2 cache size range depends on the size chosen for L1, since it should be at least double; similarly, the L3 cache size should be at least double the L2 cache size.

2.2.2.2 Line Size

The line size is the number of bytes brought into the cache at once. All three caches were tested with values of 32, 64 and 128 bytes.

2.2.2.3 Associativity

The associativity is the number of possible locations in the cache to which a memory block can be mapped. Simulations were run with three values of associativity, 1, 2 and 4, for all caches.

2.2.2.4 Cluster size

The cluster size is the number of nodes sharing an L2 cache. For this study we tried 1, 2, 4 and 8 processors per L2.

2.2.3 Post-process

After each run of the pin-tool, the execution time is printed on the screen, while the L1, L2 and L3 hit rates, together with other relevant information, are printed in an output text file called mycache.out.

3 Results

In this section, the results of both the Dimemas and the Pin simulations are presented and explained.

3.1 Dimemas Simulations

3.1.1 Latency

The first simulations tested latency and how it affects the execution time of the program.
Several simulations with different values of latency were performed, from 1 ns up to 100,000 ns, increasing exponentially each time. Figure 1 presents, for different numbers of processors, the values of latency on the x-axis against the change in execution time relative to the ideal one on the y-axis. This ratio is calculated as Current Execution Time / Ideal
Execution Time. Small values of latency do not affect the time, while after 10,000 ns the ratio clearly starts to increase. Excluding the last value of latency, we can see that the execution time already rises after 1,000 ns, and we therefore chose 1,000 ns as the highest latency the application can comfortably handle.

Figure 1: Time Ratio based on Latency

3.1.2 Network Bandwidth

With the latency fixed at 1,000 ns, we moved on to the network bandwidth. Beginning with the ideal (unlimited) bandwidth, we tried several values from 1 to 100,000 Mbytes/s, increasing exponentially; the x-axis shows the bandwidth and the y-axis the change in execution time. As expected, small amounts of bandwidth cause traffic and lead to longer execution times. The value of 1,000 Mbytes/s was chosen, since the improvement in time with larger bandwidth was minimal while the cost of additional bandwidth would be higher.

Figure 2: Time Ratio based on Bandwidth
3.1.3 Number of Buses

After fixing the latency and bandwidth at 1,000 ns and 1,000 Mbytes/s, we studied which number of buses gives better results. Running simulations for 1, 2, 4, 8, 16 and 32 buses shows that with more buses the execution time approaches the ideal one. However, having many buses is not feasible, or at least very difficult to implement. We also notice that the application still performs well with a very small number of concurrent transfers, and therefore 2 buses were chosen.

Figure 3: Time Ratio based on Number of Buses

3.1.4 Relative CPU performance

With fixed values for latency, bandwidth and number of buses, we tested how the application performs with faster processors. With values from 0.5 times up to 5 times faster, increasing by 0.5 in each simulation, the speed-up is clearly proportional to the increase. This time the ratio was calculated as Ideal Execution Time / Current Execution Time, for easier reading of the graph.
Figure 4: Time Ratio based on Relative Processor Speed

3.2 Pin Simulations

3.2.1 Cache Size

Figure 5 presents the hit rate as a function of cache size for all three caches. For L1 (Figure 5-a), sizes of 1, 2, 4, 8, 16, 32 and 64 KB were tested, with 64 KB the best choice. For L2, the size should be at least double the L1 size, so the range runs from 128 KB to 1024 KB; 256 KB was chosen, since the difference from larger sizes was not very high. For the same reason, the L3 range runs from 512 KB to 4096 KB, and 2048 KB was selected as the fixed value.

Figure 5: Hit Rate based on Cache Size (a) for L1, (b) for L2 and (c) for L3
3.2.2 Line Size

After observing the cache size effects, the line size was tested for three values: 32, 64 and 128 bytes. As can be seen in Figure 6, for all caches the hit rate rises as the line size increases, and 128 bytes was therefore chosen.

Figure 6: Hit Rate based on Line Size, for the L1, L2 and L3 caches

3.2.3 Associativity

With the cache size and line size fixed, we studied the impact of associativity. Simulations were performed for the values 1, 2 and 4. As Figure 7 shows, associativity does not significantly affect the hit rate in any of the caches.

Figure 7: Hit Rate based on Associativity for the L1, L2 and L3 caches
3.2.4 Cluster Size

The second series of simulations studied the cluster size. Figure 8 shows the hit rate for 1, 2, 4 and 8 processors per L2 cache. With more processors sharing one L2 cache, cache accesses are expected to increase and the hit rate to drop; this decrease can indeed be seen in Figure 8.

Figure 8: Hit Rate based on Cluster Size (L2)

3.2.5 Number of Processors

The final series of simulations concerns the number of processors working on the application. As observed in Figure 9, the execution time of the pthread application increases with the number of processors. The parameters used for these simulations are the ones chosen in the first series of simulations and are shown in Table 2.

Figure 9: Execution Time based on Number of Processors
Parameters/Cache    L1      L2      L3
Cache Size (KB)     64      256     1024
Line Size (bytes)   128     128     128
Associativity       1       1       1

Table 2: Parameter values for the last series of simulations

4 Conclusions

This project focused on how hardware simulations are performed using two simulators: Dimemas and the Pin tool. Hardware behavior was analyzed by examining the performance impact of various parameters on multithreaded applications. The simulations with Dimemas show how latency, network bandwidth, number of buses and CPU speed affect the execution time of a parallel application. The simulations with Pin, covering a multi-level cache memory, show how cache size, line size and associativity affect each cache as well as the caches behind it. Additionally, the cluster size proves to be an important factor for the L2 hit rate; since L2 misses proceed to L3, this parameter ends up being important for the L3 hit rate as well. Lastly, we examined how increasing the number of threads raises the execution time.

With Pin, we were able to measure the performance of the application on hardware we do not own. This last series of experiments opens the door to further experimentation and analysis of applications without access to the hardware, when it is either costly or prohibited to use. Going a step further, such simulations can help scientists examine the pros and cons of implementing hardware as proposed or theoretically designed.
References

[1] J. Banks, J. Carson, B. Nelson and D. Nicol (2001). Discrete-Event System Simulation. Prentice Hall, p. 3.
[2] J. A. Sokolowski and C. M. Banks (2009). Principles of Modeling and Simulation. Hoboken, NJ: Wiley, p. 6.
[3] Barcelona Supercomputing Center. Dimemas. [Online]. Available: http://www.bsc.es/computer-sciences/performance-tools/dimemas
[4] Intel Software Network. Pin - A Dynamic Binary Instrumentation Tool. [Online]. Available: http://www.pintool.org
[5] J. Dunbar (2012, Mar.). NAS Parallel Benchmarks. [Online]. Available: http://www.nas.nasa.gov/publications/npb.html
[6] H. S. Gelabert and G. L. Sánchez (2011, Nov.). Extrae: User guide manual for version 2.2.0. Barcelona Supercomputing Center. [Online]. Available: http://www.bsc.es/ssl/apps/performanceTools/files/docs/extrae-userguide.pdf
[7] Barcelona Supercomputing Center. Paraver. [Online]. Available: http://www.bsc.es/computer-sciences/performance-tools/paraver
[8] Barcelona Supercomputing Center. Software Modules. [Online]. Available: http://www.bsc.es/ssl/apps/performanceTools/
[9] A. Ramirez (2012, Jan.). Primavera 2012. Tools and Measurement Techniques. [Online]. Available: http://pcsostres.ac.upc.edu/eitm/doku.php/pri12
[10] Wikipedia, the free encyclopedia. CPU cache. [Online]. Available: http://en.wikipedia.org/wiki/CPU_cache