Dimemas and Multi-Level Cache Simulations
 




Universitat Politecnica de Catalunya
Measurement and Tools Project Report

Dimemas and Multi-level Cache Simulations

Author: Mário Almeida
Supervisor: Alejandro Ramirez Bellido
June 22, 2012

Contents

1 Introduction
2 Methodology
  2.1 Dimemas Simulation
  2.2 Multi-Level Cache Simulation
3 Results
  3.1 Dimemas Simulation
  3.2 Multi-Level Cache Simulation
4 Conclusions
A Used Scripts
  A.1 Dimemas instrumentation
    A.1.1 Generating Dimemas Configuration
    A.1.2 Running experiments
    A.1.3 Graph generator
    A.1.4 Generating graphs
  A.2 Pin tool instrumentation
    A.2.1 Generate and Compile Application and DCache tool
    A.2.2 Running the experiments
    A.2.3 Importing the results to a database
    A.2.4 Generating graphs

Abstract

This report describes the simulation and benchmarking steps taken to predict the parallel performance of an application using Dimemas and cache-level simulations. Using Dimemas [3], the time behaviour of the NAS [1] integer sort benchmark was simulated for the architecture of the Barcelona supercomputer MareNostrum [4]. The performance was evaluated as a function of the architecture's latency, bandwidth, connectivity and CPU speed. For the cache-level simulations, Intel's Pin tool was used to benchmark a simple parallel application as a function of the cache and cluster sizes.

1 Introduction

This report describes the simulation and benchmarking steps taken to predict the parallel performance of an application using Dimemas [3] and cache-level simulations.

Previous work focused on benchmarking a PARSEC [2] ray-tracing application on the multi-processor Boada server. For that purpose, Extrae and Paraver [5] were used to instrument the application and provide a detailed quantitative analysis of its performance.

Following that study of measurement tools and techniques, this report describes the use of Dimemas to simulate the time behaviour of another benchmarking application on the Barcelona supercomputer MareNostrum. This time the traces were taken from a NAS benchmark application, also running on the Boada server. The performance of the application in this simulation environment was evaluated as a function of the architecture's latency, bandwidth, connectivity and CPU speed.

To conclude this study on performance analysis, cache-level simulations were performed using Intel's Pin tool. The chosen application was a simple parallel application that performs distributed arithmetic operations. It represents the typical master-slave paradigm with an embarrassingly parallel workload. To evaluate the cache architecture, the total cache miss rates per cache level were calculated as a function of the cache sizes, associativity, number of threads and cluster size.

2 Methodology

This section presents the two simulation configurations: Dimemas and multi-level cache simulations. Both subsections describe the tools, configuration values and metrics used.
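
One of the metrics referred to above is the miss rate per cache level. As a minimal sketch, assuming the rate is local to each level (misses at that level divided by the accesses that reach it), it could be computed from hit and miss counters as below. The function name `miss_rate` and the counter values are hypothetical and only serve as illustration; they are not measurements from this report.

```bash
#!/bin/bash
# miss_rate <hits> <misses>
# Prints the miss rate of one cache level as a percentage with two decimals.
# Assumption: the rate is local, i.e. misses / (hits + misses) at that level.
miss_rate() {
    LC_ALL=C awk -v h="$1" -v m="$2" 'BEGIN {
        total = h + m
        if (total == 0) { print "0.00"; exit }   # no accesses reached this level
        printf "%.2f\n", 100 * m / total
    }'
}

# Hypothetical counters: L2 only sees the L1 misses, L3 only the L2 misses.
miss_rate 900 100   # L1: 1000 accesses, 100 misses
miss_rate 60 40     # L2: the 100 L1 misses, 40 of which also miss in L2
miss_rate 4 36      # L3: the 40 L2 misses, 36 of which also miss in L3
```

With these made-up counters the three calls print 10.00, 40.00 and 90.00, which shows how a lower level can have a much higher local miss rate than the level above it.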

Boada server
Bandwidth          1 Gb/s
Latency            6-10 us
Number of cores    12
RAM                24 GB

Table 1: Boada server configuration.

2.1 Dimemas Simulation

The application chosen for this experiment was integer sort from the NAS Parallel Benchmarks. The NAS benchmarks are a set of programs designed to help evaluate the performance of parallel supercomputers. In this case, the benchmark was run on the Boada server, whose attributes are described in table 1.

To perform an architecture simulation, the MareNostrum supercomputer configuration was used; its parameters are shown in table 2. Note that a simplification was made: each processor is assumed to run a single thread. Starting from MareNostrum's original architecture, multiple simulations were performed changing its attributes. For this purpose, the script in section A.1.1 was created to generate Dimemas configuration files, and another (section A.1.2) to automate the variations. The attributes changed in the simulated architecture were latency, CPU speed, bandwidth and the number of buses. All measurements were stored in a sqlite3 database and then queried to automatically generate, using gnuplot, the graphs (section A.1.3) presented in section 3.

Finally, each changed attribute in turn was fixed at a chosen near-optimal value, in order to find a final architecture that needs fewer resources while achieving execution times similar to the original MareNostrum configuration.

2.2 Multi-Level Cache Simulation

To conclude this study on performance analysis, cache-level simulations were performed using Intel's Pin tool. The chosen application was a simple parallel application that performs distributed arithmetic operations. It represents the typical master-slave paradigm with an embarrassingly parallel workload.

To evaluate the cache architecture, the Pin tool dcache application was modified to support multiple levels of cache shared by parallel processors. The implemented cache architecture is represented in figure 1. As one might infer from the figure, the level-two cache is shared per cluster and the level-three cache is globally shared.

[Figure 1: diagram. Processors P0 to P7 and P8 to P15 each have a private L1 cache (16 KB); each group of 8 processors shares one L2 cache (1 MB); both L2 caches share a single L3 cache (4 MB).]

Figure 1: Cache architecture for a cluster size of 8 and a total of 16 processors.

For these experiments, the total cache miss rates per cache level were calculated as a function of the cache sizes, the number of processors and the cluster size. Some experiments also varied the cache associativity and the number of cache lines per cache set.

3 Results

This section describes the results of both experiments, together with the resulting charts and a discussion.

3.1 Dimemas Simulation

Starting from the initial architecture of MareNostrum, the first experiment consisted of varying the number of buses and observing its impact on the execution time of the application. The results of this experiment are depicted in figure 2.

[Figure 2: line plot "Execution time with variable buses". X axis: buses (0 to 20); Y axis: execution time (s); one series each for #Procs = 2, 4, 8, 16, 32.]

Figure 2: Execution time of IntegerSort depending on the number of buses.

As can be observed in figure 2, the execution time decreases as the number of buses increases. This result was expected, since this is a multi-threaded application in which data is transferred between threads, and adding more buses increases the amount of data that can be transferred in parallel. It can also be seen that from sixteen buses onwards the execution time starts to stabilize. This is probably because most of the data to be sent is already being sent in parallel, so additional buses no longer affect the performance.

The second experiment consisted of varying the available bandwidth from the initial MareNostrum configuration. The results are shown in figure 3.

[Figure 3: line plot "Execution time with variable bandwidth". X axis: bandwidth (170 to 250); Y axis: execution time (s); one series each for #Procs = 2, 4, 8, 16, 32.]

Figure 3: Execution time of IntegerSort depending on the bandwidth (MB/s).
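
The experiments in this section all follow the scheme from section 2.1: one attribute is varied while the others stay at the baseline. A minimal sketch of that scheme is shown below. The baseline values are the defaults set in the script of appendix A.1.2; the function name `sweep` and its output format are assumptions made for this illustration and are not part of the original scripts.

```bash
#!/bin/bash
# Baseline values, as set in the "default values" block of appendix A.1.2.
BASE_BUSES=0
BASE_LATENCY=0.000008
BASE_BANDWIDTH=250.0
BASE_CPU=1.0

# sweep <parameter> <value>...
# Prints one "buses,latency,bandwidth,cpu" line per value, changing only
# <parameter> and keeping every other attribute at its baseline.
sweep() {
    local param="$1"; shift
    local v buses latency bandwidth cpu
    for v in "$@"; do
        buses=$BASE_BUSES latency=$BASE_LATENCY bandwidth=$BASE_BANDWIDTH cpu=$BASE_CPU
        case "$param" in
            buses)     buses=$v ;;
            latency)   latency=$v ;;
            bandwidth) bandwidth=$v ;;
            cpu)       cpu=$v ;;
            *) echo "unknown parameter: $param" >&2; return 1 ;;
        esac
        echo "$buses,$latency,$bandwidth,$cpu"
    done
}

sweep bandwidth 250.0 210.0 170.0
sweep buses 1 16
```

Each printed line corresponds to one simulator run. Varying a single attribute per run keeps the curves in figures 2 and 3 attributable to that attribute alone, at the cost of not capturing interactions between attributes.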

Figure 3 shows that the bandwidth has a bigger impact on performance when the application runs on a smaller number of threads. For example, a variation of 40 MB/s can increase the execution time by 20 seconds for four threads, whereas for 32 threads the changes are almost unnoticeable. This is probably because the master thread has to send the initial data to all slaves: with more slaves, the data can be divided into smaller chunks that are sent in parallel and therefore take less time.

The third experiment consisted of varying the processing capacity of the CPU. As one can observe in figure 4, increasing the processing power of each processor decreases the execution time. The impact is most noticeable for processing capacities below 100%. CPU speed therefore leaves little room for saving resources, since even a small decrease has a big impact on the execution time.

[Figure 4: line plot "Execution time with variable cpu". X axis: cpu (0 to 5); Y axis: execution time (s); one series each for #Procs = 2, 4, 8, 16, 32.]

Figure 4: Execution time of IntegerSort depending on the available CPU (relative speed, 1.0 = 100%).

To conclude the experiments on the variation of the architecture parameters, figure 5 shows the impact of latency on the execution time.

For figure 5 a logarithmic scale was chosen for the x axis, since changes of the same order of magnitude as the initial MareNostrum latency do not have a significant impact on the execution time. The latency can be increased to significantly bigger values without much effect on performance, because the latency values in MareNostrum are very small. Only for latencies close to 0.01 seconds do bigger increases in execution time start to appear. This attribute should have a bigger impact on more communication-intensive applications.

[Figure 5: line plot "Execution time with variable latency". X axis: latency (1e-06 to 1, logarithmic); Y axis: execution time (s); one series each for #Procs = 2, 4, 8, 16, 32.]

Figure 5: Execution time of IntegerSort depending on the latency (s).

Figure 6: Execution time of IntegerSort depending on the number of threads.

To conclude, table 2 compares two less resource-demanding configurations with the original MareNostrum configuration, all achieving similar execution times. The chosen number of threads was 32, due to its better performance as shown in figure 6.

Parameters            MareNostrum   Config 1   Config 2
CPU (relative)        1.0           0.95       0.9
Latency (s)           0.000008      0.0001     0.001
Bandwidth (MB/s)      250.0         240        230
Number of buses       20+ *         16         16
Execution time (s)    12.506        13.150     13.779

Table 2: Comparison between the execution times of the initial MareNostrum configuration and its less resource-demanding configurations.

Table 2 confirms the predictions made in the previous experiments. The chosen values increase the execution time by about 0.64 s for Config 1 and 1.27 s for Config 2 (roughly 5% and 10%), while reducing CPU speed and bandwidth by between 4% and 10%, limiting the number of buses to 16, and increasing the latency significantly.

3.2 Multi-Level Cache Simulation

As previously mentioned, the chosen application was a simple parallel application that performs distributed arithmetic operations. It represents the typical master-slave paradigm with an embarrassingly parallel workload.

[Figure 7: line plot "MissRate of cache L2 per cluster size (Lsize=[16,1,4])". X axis: cluster size (2 to 8); Y axis: total miss rate (%); one series each for #procs = 2, 4, 8, 16.]

Figure 7: Miss rate of cache L2 for L1, L2 and L3 sizes of 16 KB, 1 MB and 4 MB respectively.

To evaluate the cache architecture, it was varied along multiple factors, such as the cluster size, cache sizes and cache line sizes. To start these experiments, the cache architecture was set up as shown in figure 1. It has 16 processors with one L1 cache of 16 KB each.
The level-two cache has 1 MB and is shared per cluster, with a cluster size of 8. Finally, the level-three cache is globally shared and has a size of 4 MB.

The first experiment consisted of varying the cluster size, as shown in figure 7, and checking its impact on the L2 miss rate. As can be seen, for the numbers of threads in this experiment the impact of the cluster size on the miss rate was not very significant. For up to 4 threads it has almost no impact at all, but when the system has more than 8 threads it can reduce the miss rate by 2%. It is interesting to note that in this experiment, the more threads share the same L2 cache, the lower the miss rate becomes.

Since most cache size configurations produced similar variations in the cluster size experiment, the next step consisted of checking the impact of the cache sizes on the miss rates. The first step was to vary the size of the non-shared L1 cache; the results are presented in figure 8.

[Figure 8: line plot "MissRate of cache 1 per L1 size". X axis: size of cache L1 (15 to 65); Y axis: total miss rate (%); one series each for #procs = 2, 4, 8, 16.]

Figure 8: Miss rate of cache L1 for a variable L1 cache size (KB).

Looking at figure 8, it might seem strange that a smaller number of threads has a much lower miss rate. This is because of the master/slave paradigm, which makes the accesses to data sparser as the number of threads increases. For bigger numbers of threads the miss rates can reach values close to 15%. As expected, bigger L1 caches achieve smaller miss rates, although the difference is no greater than 2%.

Although the experiments were performed for more L1 cache sizes, the L1 cache size was fixed at 16 KB in order to study the impact of the L2 cache size. The variation of the L2 cache size is presented in figure 9. As one can observe, the L2 miss rate for 2 threads is high, close to 50%. This is probably a consequence of the low miss rate of the L1 cache: the accesses that do not hit in L1 should be the less predictable ones. For bigger numbers of threads the miss rates are still high, although they do not exceed 33%.

[Figure 9: line plot "MissRate of cache 2 per L2 size (Lsize=[16,.,.])". X axis: size of cache L2 (1 to 4); Y axis: total miss rate (%); one series each for #procs = 2, 4, 8, 16.]

Figure 9: Miss rate of cache L2 for a variable L2 cache size (MB) and an L1 cache size of 16 KB.

[Figure 10: line plot "MissRate of cache L3 per L3 size (Lsize=[16,1,.])". X axis: size of cache L3 (4 to 16); Y axis: total miss rate (%); one series each for #procs = 2, 4, 8, 16.]

Figure 10: Miss rate of cache L3 for a variable L3 cache size (MB) and an L1 cache size of 16 KB.

Finally, the impact of the L3 cache size on the L3 miss rate is shown in figure 10. It seems that accesses that do not hit in the previous two cache levels will hardly hit in the third level either. The only exception is the 2-thread case, for which the set of accessed data is bigger. This probably shows that either the application does not justify the use of three levels of cache, or the data accessed by each thread at each moment is too small.

4 Conclusions

Dimemas made it possible to explore the theoretical performance of the application on the MareNostrum architecture. By varying each parameter in turn, it was possible to create graphs depicting its impact on the execution time. By the end of the experiment it was possible to suggest an architecture with fewer resources that achieves results similar to the initial MareNostrum architecture. This architecture is presented in table 2 and confirms the predictions made in the Dimemas experiments. The chosen values increase the execution time by about 0.6 s (Config 1) and 1.3 s (Config 2), while reducing CPU speed and bandwidth by between 4% and 10% and increasing the latency significantly.

For the second experiment, the impact of the cluster size and cache sizes was presented for a simple parallel arithmetic application. The experiments showed that the impact of the cluster size on the miss rate was not very significant; for more than 8 threads it can reduce the miss rate by 2%. Overall, the more threads share the same L2 cache, the lower the miss rate becomes. The L1 miss rate, in contrast, grows with the number of threads, because the master/slave paradigm makes the accesses to data sparser as threads are added. As expected, bigger L1 caches achieve smaller miss rates. For big numbers of threads, the L2 miss rates were high, although they did not exceed 33%. In general, accesses that did not hit in the first two cache levels hardly produced hits in the third level. The experiments showed that either the application does not justify the use of three levels of cache, or the data accessed by each thread at each moment is too small.

Scripting the experiments had a huge impact on the time needed to perform them. Some of the experiments produced thousands of results.
The technique that proved most efficient was to script the generation of results, output them to an SQL database and run queries to generate graphs with gnuplot.

References

[1] NAS benchmark. http://www.nas.nasa.gov/publications/npb.html
[2] PARSEC benchmark. http://parsec.cs.princeton.edu/
[3] Dimemas. http://www.bsc.es/computer-sciences/performance-tools/dimemas
[4] MareNostrum. http://en.wikipedia.org/wiki/MareNostrum
[5] Paraver. http://www.bsc.es/computer-sciences/performance-tools/paraver
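
The store-then-query technique described in the conclusions can be illustrated on a very small scale before turning to the full scripts in the appendix. The sketch below is not the author's pipeline: it swaps sqlite3 and gnuplot for plain awk so that it runs anywhere, and the function names `results_csv` and `overhead_report` are hypothetical. The data rows are the values of table 2.

```bash
#!/bin/bash
# Rows taken from table 2: name,cpu,latency_s,bandwidth_MBps,exec_time_s
results_csv() {
    cat <<'EOF'
MareNostrum,1.0,0.000008,250.0,12.506
Config 1,0.95,0.0001,240,13.150
Config 2,0.9,0.001,230,13.779
EOF
}

# overhead_report: reads CSV rows on stdin, treats the first row as the
# baseline and prints "name,extra_seconds,extra_percent" for each later row.
overhead_report() {
    LC_ALL=C awk -F',' '
        NR == 1 { base = $5; next }   # remember the baseline execution time
        { printf "%s,%.3f,%.1f\n", $1, $5 - base, 100 * ($5 - base) / base }
    '
}

results_csv | overhead_report
```

Run as is, this prints `Config 1,0.644,5.1` and `Config 2,1.273,10.2`, that is, the extra execution time of each reduced configuration in seconds and as a percentage of the MareNostrum baseline. Keeping raw results in one tabular store and deriving every figure from queries is what makes thousands of runs manageable.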

A Used Scripts

A.1 Dimemas instrumentation

A.1.1 Generating Dimemas Configuration

    #!/bin/bash

    if [ $# -ne 6 ]
    then
        echo "$0: Wrong number of arguments."
        echo "$0: <input.trf> <nthreads> <nbuses> <latency> <bandwidth> <%cpuspeed>"
        exit 1
    fi

    cat beginofconfig

    #Bandwidth and number of buses definition
    echo -e "\n\n\"environment information\" {\"\", 0, \"\", 128, $5, $3, 3};;\n"

    #Latency and %cpu speed definitions
    for (( i=0; i<=127; i++ ))
    do
        echo "\"node information\" {0, $i, \"\", 1, 1, 1, 0.0, $4, $6, 0.0, 0.0};;"
    done

    #File name and number of processors definitions
    echo ""
    echo -n "\"mapping information\" {\"$1\", $2, [$2]"
    echo -n " {0"

    for (( i=1; i<=$2-1; i++ ))
    do
        echo -n ",$i"
    done

    echo "}};;"

    cat endofconfig

A.1.2 Running experiments

    #!/bin/bash
    #
    # Script by aknahs (Mario Almeida)
    #

    cat logo

    echo "Removing out folder (force)"
    rm -rf out

    echo "Creating out folder"
    mkdir out
    mkdir out/cfg
    mkdir out/prv
    mkdir out/details
    mkdir out/results

    echo "Creating sqlite3 database"
    sqlite3 out/results/res.db 'CREATE TABLE dimemas (procs INTEGER, buses INTEGER, latency REAL, bandwidth REAL, cpu REAL, runtime REAL);'

    echo "Setting default values"
    LATENCY="0.000008"
    BANDWIDTH="250.0"
    BUSES="0"
    CPU="1.0"

    # run_case <trace> <nthreads> <buses> <latency> <bandwidth> <cpu> <csv>
    # Generates the configuration, runs Dimemas and appends one CSV row
    # "nthreads,buses,latency,bandwidth,cpu,runtime" to <csv>.
    run_case() {
        local tag="$2-$3-$4-$5-$6"
        ./configgen "$1" $2 $3 $4 $5 $6 > out/cfg/config-$tag.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$tag.prv out/cfg/config-$tag.cfg > out/details/detail-$tag
        echo -n "$2,$3,$4,$5,$6," >> "$7"
        # Single quotes keep the shell from expanding $3 before awk sees it.
        grep Execution out/details/detail-$tag | awk '{ print $3 }' >> "$7"
    }

    for i in 02 04 08 16 32
    do
        # Strip the leading zero of the trace suffix to get the thread count.
        if [ ${i:0:1} == 0 ]
        then
            nthreads=${i:1}
        else
            nthreads=$i
        fi

        trace=in/mpiping$i.trf
        csv=out/results/res-$nthreads.csv

        echo -n "Generating results for $nthreads"

        #BUSES
        for j in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
        do
            run_case $trace $nthreads $j $LATENCY $BANDWIDTH $CPU $csv
        done

        echo -n "."

        #LATENCY
        for j in 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0
        do
            run_case $trace $nthreads $BUSES $j $BANDWIDTH $CPU $csv
        done

        echo -n "."

        #BANDWIDTH
        for j in 250.0 245.0 240.0 235.0 230.0 225.0 220.0 215.0 210.0 205.0 200.0 195.0 190.0 185.0 180.0 175.0 170.0
        do
            run_case $trace $nthreads $BUSES $LATENCY $j $CPU $csv
        done

        echo -n "."

        #CPU SPEED
        for j in 5.0 4.0 3.0 2.0 1.0 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05
        do
            run_case $trace $nthreads $BUSES $LATENCY $BANDWIDTH $j $csv
        done

        echo "."
        echo "Importing to database"
        echo '.separator ","' > out/results/command
        echo ".import out/results/res-${nthreads}.csv dimemas" >> out/results/command
        sqlite3 out/results/res.db < out/results/command
        rm out/results/command
    done

    echo "Generating best configuration 1"
    run_case in/mpiping32.trf 32 16 0.0001 240.0 0.95 out/results/optimal.csv

    echo "Generating best configuration 2"
    run_case in/mpiping16.trf 16 16 0.0001 230.0 0.9 out/results/optimal.csv

    ./graphall buses
    ./graphall cpu
    ./graphall bandwidth
    ./graphall latency

    echo "All done!"

A.1.3 Graph generator

    #!/bin/bash
    #
    # Script by aknahs (Mario Almeida)
    #

    latency="0.000008"
    bandwidth="250.0"
    buses="0"
    cpu="1.0"
    aux=""
    aux2=""

    if [ "$1" == "latency" ]
    then
        comp=$latency
        aux="set log x"
        aux2="set mxtics 10"
    fi
    if [ "$1" == "bandwidth" ]
    then
        comp=$bandwidth
    fi
    if [ "$1" == "buses" ]
    then
        comp=$buses
    fi
    if [ "$1" == "cpu" ]
    then
        comp=$cpu
    fi


    echo "Generating Graph"
    gnuplot << EOF
    set datafile separator "|"

    # Line style for axes
    set style line 80 lt rgb "#808080"

    # Line style for grid
    set style line 81 lt 0 # dashed
    set style line 81 lt rgb "#808080" # grey

    set grid back linestyle 81
    set border 3 back linestyle 80 # Remove border on top and right. These
        # borders are useless and make it harder
        # to see plotted lines near the border.
        # Also, put it in grey; no need for so much emphasis on a border.
    set xtics nomirror
    set ytics nomirror

    #set log x
    #set mxtics 10 # Makes logscale look good.

    # Line styles: try to pick pleasing colors, rather
    # than strictly primary colors or hard-to-see colors
    # like gnuplot's default yellow. Make the lines thick
    # so they're easy to see in small plots in papers.
    set style line 1 lt rgb "#A00000" lw 2 pt 1
    set style line 2 lt rgb "#00A000" lw 2 pt 6
    set style line 3 lt rgb "#5060D0" lw 2 pt 2
    set style line 4 lt rgb "#F25900" lw 2 pt 9
    set style line 5 lw 2 pt 9

    #set key top right

    #set xrange [0:1]
    #set yrange [0:1]

    #set style data lines
    set key outside
    #set xtics rotate by -45
    #set size ratio 0.8
    set title "Execution time with variable $1"
    set xlabel "$1"
    $aux
    $aux2
    set ylabel "Execution time(s)"

    plot "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 2 UNION select $1, runtime from dimemas where procs = 2 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 1 title '#Procs = 2', \
    "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 4 UNION select $1, runtime from dimemas where procs = 4 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 2 title '#Procs = 4', \
    "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 8 UNION select $1, runtime from dimemas where procs = 8 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 3 title '#Procs = 8', \
    "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 16 UNION select $1, runtime from dimemas where procs = 16 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 with lines ls 4 title '#Procs = 16', \
    "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 32 UNION select $1, runtime from dimemas where procs = 32 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 5 title '#Procs = 32'

    set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
    #set terminal pdfcairo size 10cm,20cm
    set output "out/results/$1.pdf"
    replot
    EOF

    echo "Done"

A.1.4 Generating graphs

    ./graphall buses
    ./graphall latency
    ./graphall cpu
    ./graphall bandwidth

    echo "Generating Graph"
    gnuplot << EOF
    set datafile separator ","
    set nokey

    set title "Execution time depending on the number of threads"
    set xlabel "Number of threads"

    set xtics (0,2,4,8,16,32,34)

    set ylabel "Execution time(s)"

    set style line 1 lt rgb "#A00000" lw 50

    plot "out/results/comparisonThreads.csv" using 1:2 with imp ls 1

    # The postscript terminal writes EPS data, despite the .pdf file name.
    set term postscript eps enhanced color
    set output "out/results/comparison.pdf"
    replot
    EOF

A.2 Pin tool instrumentation

A.2.1 Generate and Compile Application and DCache tool

    #!/bin/bash

    #clusterSize
    #    const UINT32 cacheSize = 256*KILO;
    #    const UINT32 lineSize = 1;
    #    const UINT32 associativity = 256;
    # <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>
    #      $1            $2            $3          $4         $5            $6          $7         $8            $9         $10        $11

    if [ $# -ne 11 ]
    then
        echo "$0: Wrong number of arguments."
        echo "$0: <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>"
        exit 1
    fi

    threadsAndMaster=$((${11}-1))
    #echo "TreadsAndMaster = $threadsAndMaster"

    #echo -n "INPUT="
    #echo "$1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} ${11}"

    #echo "Saving backup of dcache file"
    mv -f dcache.cpp dcache_backup.cpp

# The string below is the source of the Pin tool; bash substitutes the
# positional parameters into it before it is written to dcache.cpp.
# Note: bash reads $11 as ${1} followed by the character 1, not as the
# eleventh argument (which would be written ${11}).
echo "
#include <iostream>
#include <fstream>
#include <cassert>
#include <cstdio>

#include \"pin.H\"


typedef UINT32 CACHE_STATS; // type of cache hit/miss counters

#include \"pin_cache.H\"

KNOB<string> KnobOutputFile(KNOB_MODE_WRITEONCE, \"pintool\",
    \"o\", \"allcache.out\", \"specify dcache file name\");

PIN_LOCK lock;

INT32 numThreads = 0;
const INT32 MaxNumThreads = $11;
const INT32 clusterSize = $1;

struct THREAD_DATA
{
    UINT64 Hits;
    UINT64 Miss;
};

THREAD_DATA l1count[MaxNumThreads];
THREAD_DATA l2count[clusterSize];

VOID ThreadStart(THREADID threadid, CONTEXT *ctxt, INT32 flags, VOID *v)
{
    GetLock(&lock, threadid+1);
    numThreads++;
    ReleaseLock(&lock);

    ASSERT(numThreads <= MaxNumThreads, \"Maximum number of threads exceeded\n\");
}

namespace DL1
{
    // 1st level data cache: size (kB), line size and associativity
    // are taken from the script arguments
    const UINT32 cacheSize = $2*KILO;
    const UINT32 lineSize = $3;
    const UINT32 associativity = $4;
    const CACHE_ALLOC::STORE_ALLOCATION allocation = CACHE_ALLOC::STORE_NO_ALLOCATE;

    const UINT32 max_sets = cacheSize / (lineSize * associativity);
    const UINT32 max_associativity = associativity;

    typedef CACHE_ROUND_ROBIN(max_sets, max_associativity, allocation) CACHE;
}
LOCALVAR DL1::CACHE dl1(\"L1 Data Cache\", DL1::cacheSize, DL1::lineSize, DL1::associativity);

namespace UL2
{
    // 2nd level unified cache: size (MB) and line size are taken
    // from the script arguments, direct mapped
    const UINT32 cacheSize = $5*MEGA;
    const UINT32 lineSize = $6;
    const UINT32 associativity = $7;
    const CACHE_ALLOC::STORE_ALLOCATION allocation = CACHE_ALLOC::STORE_ALLOCATE;

    const UINT32 max_sets = cacheSize / (lineSize * associativity);

    typedef CACHE_DIRECT_MAPPED(max_sets, allocation) CACHE;
}
LOCALVAR UL2::CACHE ul2(\"L2 Cluster-shared Cache\", UL2::cacheSize, UL2::lineSize, UL2::associativity);

namespace UL3
{
    // 3rd level unified cache: size (MB) and line size are taken
    // from the script arguments, direct mapped
    const UINT32 cacheSize = $8*MEGA;
    const UINT32 lineSize = $9;
    const UINT32 associativity = ${10};
    const CACHE_ALLOC::STORE_ALLOCATION allocation = CACHE_ALLOC::STORE_ALLOCATE;

    const UINT32 max_sets = cacheSize / (lineSize * associativity);

    typedef CACHE_DIRECT_MAPPED(max_sets, allocation) CACHE;
}
LOCALVAR UL3::CACHE ul3(\"L3 Globally-shared Cache\", UL3::cacheSize, UL3::lineSize, UL3::associativity);

LOCALFUN VOID Fini(int code, VOID *v)
{
    std::ofstream out(KnobOutputFile.Value().c_str());

    out <<
        \"#\n\"
        \"# DCACHE stats\n\"
        \"#\n\";

    out << dl1;
    out << ul2;
    out << ul3;

    out.close();

    // per-thread L1 counters (64-bit values, printed as such)
    for (int i=0; i<numThreads; i++)
    {
        printf(\"%d L1 Hits: %llu\n\", i, (unsigned long long) l1count[i].Hits);
        printf(\"%d L1 Miss: %llu\n\", i, (unsigned long long) l1count[i].Miss);
        printf(\"%d L1 Hit rate: %f\n\n\", i, (100.0*l1count[i].Hits/(l1count[i].Hits+l1count[i].Miss)));
    }

    // per-cluster L2 counters
    for (int i=0; i<clusterSize; i++)
    {
        printf(\"%d L2 Hits: %llu\n\", i, (unsigned long long) l2count[i].Hits);
        printf(\"%d L2 Miss: %llu\n\", i, (unsigned long long) l2count[i].Miss);
        printf(\"%d L2 Hit rate: %f\n\n\", i, (100.0*l2count[i].Hits/(l2count[i].Hits+l2count[i].Miss)));
    }
}

LOCALFUN VOID Ul2Access(ADDRINT addr, UINT32 size, CACHE_BASE::ACCESS_TYPE accessType, THREADID tid)
{
    // second level unified cache
    const BOOL dl2Hit = ul2.Access(addr, size, accessType);

    // third level unified cache
    int cid = tid / (MaxNumThreads/clusterSize);
    if (!dl2Hit)
    {
        GetLock(&lock, tid+1);
        l2count[cid].Miss++;
        ReleaseLock(&lock);
        ul3.Access(addr, size, accessType);
    } else
        l2count[cid].Hits++;
}

LOCALFUN VOID MemRefMulti(ADDRINT addr, UINT32 size, CACHE_BASE::ACCESS_TYPE accessType, THREADID tid)
{
    // first level D-cache
    const BOOL dl1Hit = dl1.Access(addr, size, accessType);

    if (!dl1Hit) {
        l1count[tid].Miss++;
        Ul2Access(addr, size, accessType, tid);
    }
    else
    {
        l1count[tid].Hits++;
    }
}

LOCALFUN VOID MemRefSingle(ADDRINT addr, UINT32 size, CACHE_BASE::ACCESS_TYPE accessType, THREADID tid)
{
    // first level D-cache
    const BOOL dl1Hit = dl1.AccessSingleLine(addr, accessType);

    if (!dl1Hit) {
        l1count[tid].Miss++;
        Ul2Access(addr, size, accessType, tid);
    }
    else
    {
        l1count[tid].Hits++;
    }
}

LOCALFUN VOID Instruction(INS ins, VOID *v)
{
    if (INS_IsMemoryRead(ins))
    {
        const UINT32 size = INS_MemoryReadSize(ins);
        const AFUNPTR countFun = (size <= 4 ? (AFUNPTR) MemRefSingle : (AFUNPTR) MemRefMulti);

        // only predicated-on memory instructions access D-cache
        INS_InsertPredicatedCall(
            ins, IPOINT_BEFORE, countFun,
            IARG_MEMORYREAD_EA,
            IARG_MEMORYREAD_SIZE,
            IARG_UINT32, CACHE_BASE::ACCESS_TYPE_LOAD,
            IARG_THREAD_ID,
            IARG_END);
    }

    if (INS_IsMemoryWrite(ins))
    {
        const UINT32 size = INS_MemoryWriteSize(ins);
        const AFUNPTR countFun = (size <= 4 ? (AFUNPTR) MemRefSingle : (AFUNPTR) MemRefMulti);

        // only predicated-on memory instructions access D-cache
        INS_InsertPredicatedCall(
            ins, IPOINT_BEFORE, countFun,
            IARG_MEMORYWRITE_EA,
            IARG_MEMORYWRITE_SIZE,
            IARG_UINT32, CACHE_BASE::ACCESS_TYPE_STORE,
            IARG_THREAD_ID,
            IARG_END);
    }
}

GLOBALFUN int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);

    for (INT32 t=0; t<MaxNumThreads; t++)
    {
        l1count[t].Hits = 0;
        l1count[t].Miss = 0;
    }

    for (int i=0; i<clusterSize; i++)
    {
        l2count[i].Hits = 0;
        l2count[i].Miss = 0;
    }

    PIN_AddThreadStartFunction(ThreadStart, 0);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);

    // Never returns
    PIN_StartProgram();

    return 0; // make compiler happy
}" > dcache.cpp

make > makeres

echo "
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef struct
{
    double *a;
    double *b;
    double sum;
    int veclen;
} DOTDATA;


#define NUMTHRDS $threadsAndMaster
#define VECLEN 1000000

DOTDATA dotstr;
pthread_t callThd[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg)
{
    int i, start, end, len;
    long offset;
    //printf(\"%d\n\", (int)arg);
    double mysum, *x, *y;
    offset = (long)arg;

    //Extrae_eventandcounters(1,8);

    len = dotstr.veclen;
    //printf(\"%d\n\", len);
    start = offset*(len/NUMTHRDS);
    end = start + (len/NUMTHRDS);
    x = dotstr.a;
    y = dotstr.b;

    // partial dot product over this thread's slice
    mysum = 0;
    for (i=start; i<end; i++)
        mysum += (x[i] * y[i]);

    //Extrae_eventandcounters(1,9);

    pthread_mutex_lock(&mutexsum);

    //Extrae_eventandcounters(1,10);
    dotstr.sum += mysum;
    //Extrae_eventandcounters(1,11);

    pthread_mutex_unlock(&mutexsum);

    //Extrae_eventandcounters(1,0);

    //pthread_exit((void *) 0);
    return NULL;
}


int main(int argc, char *argv[])
{
    long i;
    double *a, *b;
    void *status;
    pthread_attr_t attr;

    clock_t begin, end;
    double time_spent;

    begin = clock();
    //Extrae_init();

    //Extrae_eventandcounters(1,1);

    a = (double *) malloc(NUMTHRDS*VECLEN*sizeof(double));
    b = (double *) malloc(NUMTHRDS*VECLEN*sizeof(double));

    //Extrae_eventandcounters(1,2);

    for (i=0; i<VECLEN*NUMTHRDS; i++)
    {
        a[i] = 1;
        b[i] = a[i];
    }

    dotstr.veclen = VECLEN;
    dotstr.a = a;
    dotstr.b = b;
    dotstr.sum = 0;

    //Extrae_eventandcounters(1,3);

    pthread_mutex_init(&mutexsum, NULL);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i=0; i<NUMTHRDS; i++)
    {
        //Extrae_eventandcounters(1,4);
        pthread_create(&callThd[i], &attr, dotprod, (void *)i);
        //Extrae_eventandcounters(1,3);
    }

    pthread_attr_destroy(&attr);
    //Extrae_eventandcounters(1,5);

    for (i=0; i<NUMTHRDS; i++)
    {
        //Extrae_eventandcounters(1,6);
        pthread_join(callThd[i], &status);
        //Extrae_eventandcounters(1,7);
    }

    printf(\"Sum = %f \n\", dotstr.sum);
    free(a);
    free(b);

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf(\"Execution time: %f \n\", time_spent);

    //Extrae_fini();

    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}
" > dotprod.c

#echo "Compiling dotprod"
gcc -o dotprod dotprod.c -lpthread

#echo "Running pintool"
cd /scratch/boada-1/etm022/pin
./pin -t /scratch/boada-1/etm022/pin/source/tools/Memory/obj-intel64/dcache.so -- /scratch/boada-1/etm022/pin/source/tools/Memory/dotprod > /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.res

mv allcache.out /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.allcache

cd /scratch/boada-1/etm022/pin/source/tools/Memory
echo "done!"

A.2.2 Running the experiments

#!/bin/bash
#
#Script by aknahs (Mario Almeida)
#
#clusterSize
# const UINT32 cacheSize = 256*KILO;
# const UINT32 lineSize = 1;
# const UINT32 associativity = 256;
# <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>
#      $1            $2            $3          $4          $5            $6          $7          $8            $9         $10        $11

rm -rf results
mkdir results

# number of parameter combinations visited by the loops below
total=$((3*4*3*3*2*3*2))
n=0
res1=$(date +%s.%N)


#clusterSize
for cs in 2 4 8
do
  #nThreads
  for mt in 2 4 8 16
  do
    #L1cacheSize
    for l1c in 16 32 64
    do
      #L1lineSize
      for l1l in 32 #64 128
      do
        #L1assoc
        for l1a in 1 #2 4
        do
          #L2cacheSize
          for l2c in 1 2 4
          do
            #L2lineSize
            for l2l in 32 64 #128
            do
              #L2assoc
              for l2a in 1 #2 4
              do
                #L3cacheSize
                for l3c in 4 8 16
                do
                  #L3lineSize
                  for l3l in 32 64 #128
                  do
                    #L3assoc
                    for l3a in 1 #2 4
                    do
                      clear
                      cat logo
                      echo "--------------------------by aknahs"
                      echo -n "Generating [$n/$total]..."
                      res2=$(date +%s.%N)
                      printf "Elapsed: %.3F\n" $(echo "$res2 - $res1" | bc)

                      n=$(($n + 1))
                      #echo "----------------------------------------------------------------"
                      #echo "Generating CPP and Make"
                      #echo "./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt"
                      ./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt
                      echo "."
                    done
                  done
                done
              done
            done
          done
        done
      done
    done
  done
done
echo "all done."

# One CSV row per "Total Miss Rate" line: the cache level (1 to 3, cycling in
# file order), the eleven parameters encoded in the file name, and the miss rate.
grep "Total Miss Rate" results/*.allcache | awk 'BEGIN{n=0; printf "Cache Level, Cluster Size, L1 Cache Size, L1 Line Size, L1 Association, L2 Cache Size, L2 Line Size, L2 Association, L3 Cache Size, L3 Line Size, L3 Association, Number of threads, Total Miss Caches\n"} {split($1,a,"."); split(a[1],b,"-"); printf "%d,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", n%3 +1,b[2],b[3],b[4],b[5],b[6],b[7],b[8],b[9],b[10],b[11],b[12],$5;++n}' >> results/brutaldb.csv

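The awk command above recovers the experiment parameters from each result file name. As a minimal sketch of that step in isolation, the helper below does the same decoding in plain shell. The function name parse_result_name is hypothetical and not part of the original scripts; it only assumes the res-<arg1>-...-<arg11>.allcache naming used by genMakeCPP.

```shell
#!/bin/sh
# parse_result_name is a hypothetical helper, not part of the original scripts.
# It turns a result file name such as
#   results/res-2-16-32-1-1-32-1-4-32-1-2.allcache
# into the comma-separated list of the eleven genMakeCPP arguments.
parse_result_name() (
    base=${1##*/}        # drop the directory part
    base=${base%%.*}     # drop the extension
    base=${base#res-}    # drop the "res-" prefix
    printf '%s\n' "$base" | sed 's/-/,/g'
)

# Usage example
parse_result_name "results/res-2-16-32-1-1-32-1-4-32-1-2.allcache"
```

The function body runs in a subshell, so the temporary variable does not leak into the calling script.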
A.2.3 Importing the results to a database

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

rm -rf power
mkdir power

sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

echo "Importing to database"
echo ".separator \",\"" > power/command
echo ".import brutaldb.csv res" >> power/command
sqlite3 power/res.db < power/command
rm power/command

echo "done"

./graphall

A.2.4 Generating graphs

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

#sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

mkdir power/cluster2
mkdir power/cluster4
mkdir power/cluster8

#for instrumentation level
for set in 1 2 3
do
  #for each cluster size
  for cs in 2 4 8
  do
    #for each level of cache
    for l in 1 2 3
    do

      if [ $set == 1 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-L1Size16-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache ${l} per L${l} size (Lsize = [16,.,.])"
        xlabel="Size of cache L${l}"
      fi

      if [ $set == 2 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache ${l} per L${l} size"
        xlabel="Size of cache L${l}"
      fi

      if [ $set == 3 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-L1Size16-L2size1-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and l2size = 1 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache L${l} per L${l} size (Lsize = [16,1,.])"
        xlabel="Size of cache L${l}"
      fi

      if [[ $set = 1 && $l = 1 ]]
      then
        continue
      fi

      if [[ $set == 3 && ( $l == 1 || $l == 2 ) ]]
      then
        continue
      fi


      echo "Generating Graph for set $set on cache level $l"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0 # dashed
set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81
set border 3 back linestyle 80 # Remove border on top and right. These
# borders are useless and make it harder
# to see plotted lines near the border.
# Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10 # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 ps 1 pt 1
set style line 2 lt rgb "#00A000" lw 2 ps 1 pt 6
set style line 3 lt rgb "#5060D0" lw 2 ps 1 pt 2
set style line 4 lt rgb "#F25900" lw 2 ps 1 pt 9

#set key top right

#set xrange [0:1]
set yrange [0:100]

#plot "template.dat"
#index 0 title "Example line" w lp ls 1,
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with points ls 1 title '#procs = 2', \
"< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with points ls 2 title '#procs = 4', \
"< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with points ls 3 title '#procs = 8', \
"< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with points ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
    done
  done
done

echo "Done"

filename="power/L2MissRate-L1Size32-L2size4-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 32 and l2size = 4 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize = [32,4,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0 # dashed
set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81
set border 3 back linestyle 80 # Remove border on top and right. These
# borders are useless and make it harder
# to see plotted lines near the border.
# Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10 # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat"
#index 0 title "Example line" w lp ls 1,
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
"< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
"< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
"< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF

filename="power/L2MissRate-L1Size16-L2size1-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 16 and l2size = 1 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize = [16,1,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0 # dashed
set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81
set border 3 back linestyle 80 # Remove border on top and right. These
# borders are useless and make it harder
# to see plotted lines near the border.
# Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10 # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat"
#index 0 title "Example line" w lp ls 1,
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
"< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
"< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
"< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
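
The gnuplot sections in this script repeat the same plot clause once per thread count. As a sketch only, and not something the original scripts do, the clause could be generated by a small shell function. The name build_plot_clause and its argument order are assumptions made for illustration; the database path power/res.db and the "threads" column are the ones used above.

```shell
#!/bin/sh
# build_plot_clause is a hypothetical helper, not part of the original scripts.
# Arguments: $1 = SQL query prefix, $2 = gnuplot plotting style
# (for example "lp" or "points"), remaining arguments = thread counts.
# It prints one gnuplot "plot" command with one entry per thread count,
# using line styles 1, 2, 3, ... in order.
build_plot_clause() (
    sql=$1
    style=$2
    shift 2
    ls=1
    clause=""
    for t in "$@"
    do
        entry="\"< sqlite3 power/res.db '$sql and threads = $t'\" using 1:2 with $style ls $ls title '#procs = $t'"
        if [ -z "$clause" ]
        then
            clause="plot $entry"
        else
            clause="$clause, $entry"
        fi
        ls=$((ls + 1))
    done
    printf '%s\n' "$clause"
)

# Usage example: the clause for the cluster-size graphs
build_plot_clause "select cluster, missrate from res where cachelevel = 2" lp 2 4 8 16
```

The printed line could then be placed inside the heredoc as a command substitution in place of the hand-written plot lines, which keeps the three graph sections from drifting apart when a thread count is added.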