Performance Analysis of multithreaded applications based on Hardware Simulations with Dimemas and Pin tool




Course: Measurement Tools and Techniques (UPC)


Maria Stylianou
Universitat Politecnica de Catalunya (UPC)
marsty5@gmail.com
June 15, 2012

Abstract

It is widely accepted that application development is rapidly growing, while hardware design advances at a slower pace and with less success. The high cost of buying the best hardware, as well as the non-existence of machines that can support the latest application developments, leads scientists to look for other ways to examine the performance of their applications. Hardware simulation thus becomes crucial for application analysis. This project simulates hardware using two simulators: one for exploring the effect of parameters such as latency, network bandwidth and congestion, and the other for studying the effect of parameters related to cache memory, such as cache size, cluster size and number of processors. Our predictions about the strong effect of these parameters are justified by the results extracted from our experimentation.
1 Introduction

Simulation is defined as the imitation of the operation of a real-world process or system over time [1]. In engineering, an architecture simulation makes it possible to model a real-life or hypothetical situation on a computer so that it can later be studied and analyzed. Through several simulations and variable modifications, researchers predict and draw conclusions about the behavior of the system. Going a step further, simulation becomes crucial when the real computer hardware is either not accessible, prohibited from being engaged, or not yet built [2].

In this report, the attention revolves around hardware simulations, and more specifically simulations using two tools: Dimemas and Pin. Both tools are used in this study to analyze and predict hardware behavior during the execution of parallel programs. Major differences distinguish the two tools; they are shortly described below.

Dimemas is a performance analysis tool for message-passing programs [3], otherwise characterized as a trace-driven simulator. Taking as inputs a machine's configuration and an application trace file, Dimemas can reconstruct the time behavior of a parallel application and open the door to experimenting with the performance of the modeled hardware.

Like Dimemas, Pin can be used for program analysis. More precisely, Pin is a dynamic binary instrumentation tool [4]: instrumentation takes place on compiled binaries at runtime, so recompiling the source code is not needed. As described later, the two tools are used in different scenarios to achieve different goals.

This paper continues in the next section with the methodology followed for setting up the proper environments for both tools and performing the simulations. In Section 3, results are presented and discussed.
Finally, in Section 4, conclusions are drawn from our observations.

2 Methodology

The study consists of two main parts: simulation with Dimemas and simulation with Pin. In this section, both parts are explained in more detail and depth.

2.1 Dimemas simulator

As previously explained, Dimemas is used for developing and analyzing the performance of parallel programs. Several message-passing libraries are supported [3], but for this work an MPI application was chosen. With the configuration parameters of a machine, several simulations were performed to test and identify the sensitivity of the application performance to interconnect parameters.

2.1.1 Pre-process

The MPI application used is MG from the NAS Parallel Benchmarks [5], and it was run on the boada server offered by the Barcelona Supercomputing Center. This server has a dual Intel Xeon E5645 with 24 cores. Traces were generated by running the program with 2, 4, 8, 16, 32 and 64 threads, using Extrae, a dynamic instrumentation package that traces programs running with a shared-memory or a message-passing programming model. More details on how to set up Extrae can be found in [6].

Traces generated by Extrae can be visualized with Paraver, a performance analysis tool that allows visual inspection of an application and a more detailed quantitative analysis of the problems observed [7]. To be used as input to Dimemas, the traces must be translated from Paraver format to Dimemas format with an application called prv2trf, available at [8]. The command line for running prv2trf is:

./prv2trf paraver_trace.prv dimemas_trace.trf

The second input to Dimemas is a configuration file that describes an architecture model idealized from MareNostrum. This machine is ideal, with zero latency, unlimited bandwidth and no limit on the number of concurrent communications.

2.1.2 Simulations

The objective is to test the application under different situations and characteristics. The parameters changed in the simulations are the latency, the network bandwidth, the number of buses and the relative processor speed. They are studied one by one in the order given above, and for each parameter a range of values is specified over which the application is tested. After choosing the best value for a parameter, we move on to the next parameter, keeping the value just decided fixed. This process is performed six times, once for each trace generated with Extrae for 2, 4, 8, 16, 32 and 64 threads. Using this methodology, it becomes easier to observe how the application behaves in each circumstance.

The first step for using Dimemas, after installing it, is to run the Dimemas GUI located in Dimemas_directory/bin/dimemas-gui.sh.
In the window that opens, from the Configuration menu we choose Load Configuration in order to load the configuration file of the machine. Afterwards, we specify the trace file converted by prv2trf by clicking Configuration → Initial Machine, and we compute the number of application tasks. After that we are able to change the machine characteristics.

The parameters mentioned before can be changed from Configuration → Target Configuration. Specifically, under Node information, latency and relative CPU performance can be changed through the values of Startup on Remote Communication and Relative Processor Speed, respectively. Under Environment information, network bandwidth and number of buses can be modified. It is important to mention that after each change to a parameter, the button "Do all the same" has to be pressed for the change to be applied to all nodes.

Latency

The first parameter studied was latency and its impact when increased or decreased. In Dimemas, latency represents the local overhead of an MPI implementation. We ran simulations with different values of latency, beginning from 1 ns and multiplying by 10 each time, up to 100,000 ns. After each change, the new configuration was saved as a new configuration file.

Network Bandwidth

Another important parameter to study is the network bandwidth. In the ideal machine the bandwidth is unlimited, so the impact of reducing it is interesting. We ran simulations starting from 1 Mbyte/s and multiplying by 10 in each scenario, up to 1,000,000 Mbytes/s.

Number of Buses

An important question that needs to be answered refers to the impact of contention on the application. Congestion can be modeled by the number of buses, although this is not the only way. With these simulations, we examine the possibility that bad routing could cause contention and negatively affect performance. Initially, the machine has no limit on the number of concurrent communications. We then ran simulations for 1, 2, 4, 8, 16 and 32 buses. In other words, the number of buses defines how many transfers can take place at any time.

Relative CPU Performance

The last parameter examined was the relative processor speed, i.e., the impact of having a faster processor in the machine. By faster, we mean the speed of execution of the sequential computation bursts between MPI calls. Initially, the relative speed is the minimum of 1. In our simulations, we tried relative speeds from 0.5 up to 5 times the original, increasing by 0.5 in each simulation.

2.1.3 Post-process

As already explained, each parameter is studied in isolation, without changing any other parameter. When all configuration files for a given parameter have been generated, they are studied and compared, and the best value is chosen based on its impact on execution time and the cost it entails. The configuration file with this value becomes the starting configuration for the next simulations, in which a new parameter is studied.

For each configuration saved during the simulations, a Paraver file is produced with the command below, where we specify the name of the saved configuration file and the name we want for the new Paraver file:

./Dimemas3 -S 32K -pa new_paraver_trace.prv new_config_file.cfg

The generated traces are opened with Paraver along with the initial trace files in order to compare them, observe performance characteristics and examine any problems indicated by the simulator. In Section 3, the results of these simulations are presented and discussed.
2.2 Pin simulator

Pin analyzes programs by inserting arbitrary code into executables [4]. In this project, a pin-tool was designed to simulate a three-level cache hierarchy with a per-processor L1 data cache, a cluster-shared L2 data cache and a globally-shared L3 data cache. Each processor uses its dedicated L1 data cache, which is the fastest but usually the smallest. When a reference misses in L1, the L2 data cache is consulted. Each L2 cache serves a cluster of processors and is usually slower than L1 but larger in size. When a reference also misses in L2, the L3 data cache is consulted. L3 is the largest and slowest of the three caches and can be used by all processors.

The objective is to perform multiprocessor cache simulations with a pthreads parallel application, changing parameters such as the number of processors, the sizes of the L1, L2 and L3 caches, and the number of processors per cluster.

2.2.1 Pre-process

The pthreads application chosen is called dotprod.c and was found in a list of sample pthreads programs provided on the website of the course [9]. The program has to be recompiled after every change, with the command: gcc dotprod.c -o dotprod -lpthread. After downloading Pin, we took an existing pin-tool called dcache.cpp, located in pin_directory/source/tools/Memory/, as the basis of our final pin-tool. This pin-tool simulates an L1 cache, and was therefore helpful for building the L2 and L3 caches. The final pin-tool was named mycache.cpp.

2.2.2 Simulations

The first and largest series of simulations studies the impact of cache size, line size and associativity. The idea was to study the three parameters for each cache and find which values increase the hit rate the most. The cluster size and number of processors were kept fixed throughout these experiments, at 2 and 4 respectively.
All parameters can be changed inside the pin-tool, and every change requires a new compilation, done by running make inside <pin_directory>/source/tools/Memory. Afterwards, the command below is executed from <pin_directory> in order to run the pthreads program with the cache configuration given in mycache.cpp:

./pin -t ./source/tools/Memory/obj-intel64/mycache.so -- ./dotprod

Table 1 shows the initial values given to the parameters. Starting with L1, we fixed the best values for the three parameters and then proceeded to L2 and finally to L3. We call the set of simulations related to a single parameter a stage of simulations. For each cache, after a stage of simulations was complete, the best value of the parameter under study was chosen and used in the next stages.

Parameters/Cache     L1       L2     L3
Cache Size           128 KB   1 MB   4 MB
Line Size (bytes)    32       32     32
Associativity        1        1      1

Table 1: Initial parameter values
As previously said, L2 is a cluster-shared cache, meaning that it is shared among a set of nodes. The second series of simulations focused on the cluster size and how it affects the L2 hit rate.

Finally, in the third series of simulations we studied how the number of processors devoted to the execution of the pthreads application affects the execution time of the program. This parameter is set in two places: the pin-tool and the pthreads program.

The parameters examined during the Pin simulations are briefly explained below.

Cache Size

Cache size is the maximum number of kilobytes (KB) that a cache can hold at a time. It is expected that increasing the cache size will increase the hit rate as well. Simulations were performed for 1, 2, 4, 8, 16, 32 and 64 KB of L1 cache size in order to confirm our expectations. The L2 cache size range depends on the size chosen for L1, since it should be at least double; similarly, the L3 cache size should be at least double the L2 cache size.

Line Size

Line size is the number of bytes that can be stored in the cache at once, i.e., per cache line. All three caches were tested with values of 32, 64 and 128 bytes.

Associativity

Associativity is the number of possible locations in the cache to which a memory block can be mapped. Three simulations were run with associativity values of 1, 2 and 4 for all caches.

Cluster Size

Cluster size is the number of nodes sharing an L2 cache. For this study we tried 1, 2, 4 and 8 processors per L2.

2.2.3 Post-process

After each run of the pin-tool, the execution time is printed on the screen, while the L1, L2 and L3 hit rates and other relevant information are printed to an output text file called mycache.out.

3 Results

In this section, the results of both the Dimemas and Pin simulations are presented and explained.

3.1 Dimemas Simulations

3.1.1 Latency

The first simulations tested latency and how it affects the execution time of the program.
Several simulations with different values of latency were performed, from 1 ns up to 100,000 ns, increasing exponentially each time. In Figure 1 we present, for different numbers of processors, the values of latency on the x-axis and the change in execution time relative to the ideal one on the y-axis. This ratio is calculated as Current Execution Time / Ideal Execution Time. Small values of latency do not affect the time, while it becomes obvious that after 10,000 ns the ratio starts to increase. Excluding the last value of latency, 100,000 ns, we can see that the execution time rises after 1,000 ns, and we therefore chose 1,000 ns as the largest latency that our application can handle.

Figure 1: Time ratio based on latency

3.1.2 Network Bandwidth

With the latency fixed at 1,000 ns, we moved on to the network bandwidth. Beginning with the ideal bandwidth, which is unlimited, we tried several values from 1 to 100,000 Mbytes/s, increasing exponentially, shown on the x-axis; the change in execution time is shown on the y-axis. As expected, small amounts of bandwidth cause traffic and lead to longer execution times. The value of 1,000 Mbytes/s was chosen as the best one, since the improvement in time with larger bandwidth was minimal and the cost of more bandwidth would be higher.

Figure 2: Time ratio based on bandwidth
3.1.3 Number of Buses

After fixing the latency and bandwidth to 1,000 ns and 1,000 Mbytes/s, we studied which number of buses would give better results. Running simulations for 1, 2, 4, 8, 16 and 32 buses, it is obvious that with more buses the execution time tends to approach the ideal one. However, having many buses is not feasible, or at least very difficult to implement. We also notice that the application still performs well with a very small number of concurrent transfers, and therefore 2 buses were chosen.

Figure 3: Time ratio based on number of buses

3.1.4 Relative CPU Performance

Having fixed values for latency, bandwidth and number of buses, we tested how the application performs with faster processors. With relative speeds from 0.5 up to 5 times the original, increasing by 0.5 in each simulation, it is clearly observed that the speedup is proportional to the increase. This time the ratio was calculated as Ideal Execution Time / Current Execution Time, for easier reading of the graph.
Figure 4: Time ratio based on relative processor speed

3.2 Pin Simulations

3.2.1 Cache Size

In Figure 5, the hit rate as a function of cache size is presented for all three caches. For L1 (Figure 5a), sizes of 1, 2, 4, 8, 16, 32 and 64 KB were tested, and 64 KB was chosen as the best. For L2, the size should be at least double the L1 size, so the range goes from 128 KB to 1024 KB; 256 KB was chosen, since the difference from larger sizes was not very high. For the same reason, the L3 range goes from 512 KB up to 4096 KB, and 2048 KB was selected as the fixed value.

Figure 5: Hit rate based on cache size (a) for L1, (b) for L2 and (c) for L3
3.2.2 Line Size

After observing the effects of cache size, the line size was tested for three values: 32, 64 and 128 bytes. As can be seen in Figure 6, for all caches the hit rate rises as the line size increases, and therefore 128 bytes was chosen.

Figure 6: Hit rate based on line size, for the L1, L2 and L3 caches

3.2.3 Associativity

With the cache size and line size fixed, we studied the impact of associativity. Simulations were performed for the values 1, 2 and 4. From Figure 7, it can be observed that associativity does not significantly affect the hit rate of any of the caches.

Figure 7: Hit rate based on associativity for the L1, L2 and L3 caches
3.2.4 Cluster Size

The second series of simulations studied the cluster size. In Figure 8 we show the hit rate for 1, 2, 4 and 8 processors per L2 cache. With more processors sharing one L2 cache, cache accesses are expected to increase and the hit rate to drop. Indeed, this decrease can be seen in Figure 8.

Figure 8: Hit rate based on cluster size (L2)

3.2.5 Number of Processors

The final series of simulations concerns the number of processors working on the application. As observed in Figure 9, with a larger number of processors the execution time of the pthreads application increases. The parameters used for these simulations are the ones chosen in the first series of simulations and are shown in Table 2.

Figure 9: Execution time based on number of processors
Parameters/Cache     L1    L2    L3
Cache Size (KB)      64    256   1024
Line Size (bytes)    128   128   128
Associativity        1     1     1

Table 2: Parameter values for the last series of simulations

4 Conclusions

This project focused on how hardware simulations are performed using two simulators: Dimemas and the Pin tool. Hardware behavior was analyzed by examining the performance impact of various parameters on multithreaded applications.

The simulations with Dimemas show how latency, network bandwidth, number of buses and CPU speed can affect the execution time of a parallel application. The simulations with Pin, regarding a multi-level cache memory, show how cache size, line size and associativity affect a given cache as well as the caches behind it. Additionally, cluster size is proven to be an important factor for the L2 hit rate; considering that L2 misses proceed to L3, this parameter ends up being important for the L3 hit rate as well. Lastly, we examined how increasing the number of threads raises the execution time. With Pin, we were able to measure the performance of the application on hardware that we do not have.

This last series of experiments with Pin opens the door to further experimentation and analysis of applications without access to the hardware, because it is either costly or prohibited to use. Going even further, such simulations can help scientists examine the pros and cons of implementing hardware as proposed or theoretically designed.
References

[1] J. Banks, J. Carson, B. Nelson, and D. Nicol, Discrete-Event System Simulation. Prentice Hall, 2001, p. 3.
[2] J. A. Sokolowski and C. M. Banks, Principles of Modeling and Simulation. Hoboken, NJ: Wiley, 2009, p. 6.
[3] Barcelona Supercomputing Center. Dimemas. [Online]. Available: http://www.bsc.es/computer-sciences/performance-tools/dimemas
[4] Intel Software Network. Pin - A Dynamic Binary Instrumentation Tool. [Online]. Available: http://www.pintool.org
[5] J. Dunbar (2012, Mar.). NAS Parallel Benchmarks. [Online]. Available: http://www.nas.nasa.gov/publications/npb.html
[6] H. S. Gelabert and G. L. Sánchez (2011, Nov.). Extrae: User guide manual for version 2.2.0. Barcelona Supercomputing Center. [Online]. Available: http://www.bsc.es/ssl/apps/performanceTools/files/docs/extrae-userguide.pdf
[7] Barcelona Supercomputing Center. Paraver. [Online]. Available: http://www.bsc.es/computer-sciences/performance-tools/paraver
[8] Barcelona Supercomputing Center. Software Modules. [Online]. Available: http://www.bsc.es/ssl/apps/performanceTools/
[9] A. Ramirez (2012, Jan.). Primavera 2012. Tools and Measurement Techniques. [Online]. Available: http://pcsostres.ac.upc.edu/eitm/doku.php/pri12
[10] Wikipedia, the free encyclopedia. CPU cache. [Online]. Available: http://en.wikipedia.org/wiki/CPU_cache