This report describes the simulation and benchmarking steps taken to predict the parallel performance of an application using Dimemas and cache-level simulations. Using Dimemas [3], the time behaviour of the NAS [1] integer sort benchmark was simulated for the architecture of the Barcelona supercomputer MareNostrum [4]. Performance was evaluated as a function of the architecture's latency, bandwidth, connectivity and CPU speed. For the cache-level simulations, Intel's Pin tool was used to benchmark a simple parallel application as a function of the cache and cluster sizes.
Dimemas and Multi-Level Cache Simulations
Universitat Politècnica de Catalunya

Measurement and Tools Project Report

Dimemas and Multi-Level Cache Simulations

Author: Mário Almeida
Supervisor: Alejandro Ramirez Bellido

June 22, 2012
Abstract

This report describes the simulation and benchmarking steps taken to predict the parallel performance of an application using Dimemas and cache-level simulations. Using Dimemas [3], the time behaviour of the NAS [1] integer sort benchmark was simulated for the architecture of the Barcelona supercomputer MareNostrum [4]. Performance was evaluated as a function of the architecture's latency, bandwidth, connectivity and CPU speed. For the cache-level simulations, Intel's Pin tool was used to benchmark a simple parallel application as a function of the cache and cluster sizes.
1 Introduction
This report describes the simulation and benchmarking steps taken to predict the parallel performance of an application using Dimemas [3] and cache-level simulations.

Previous work focused on benchmarking a PARSEC [2] ray-tracing application on the multi-processor Boada server. For that purpose, EXTRAE and Paraver [5] were used to instrument the application and provide a detailed quantitative analysis of its performance.

Following that study of measurement tools and techniques, this report describes the use of Dimemas to simulate the time behaviour of another benchmark application on the Barcelona supercomputer MareNostrum. This time the traces were taken from a NAS benchmark application, also run on the Boada server. The performance of the application in this simulation environment was evaluated as a function of the architecture's latency, bandwidth, connectivity and CPU speed.

To conclude this study on performance analysis, cache-level simulations were performed using Intel's Pin tool. The chosen application was a simple parallel application that performs distributed arithmetic operations; it represents the typical master-slave paradigm with an embarrassingly parallel workload. To evaluate the cache architecture, the total cache miss rates per cache level were calculated as a function of the cache sizes, associativity, number of threads and the cluster size.
2 Methodology
This section presents the two simulation configurations: Dimemas and multi-level cache simulations. Both subsections describe the tools, configuration values and metrics used.
Boada Server

  Bandwidth          1 Gb/s
  Latency            6-10 µs
  Number of cores    12
  RAM                24 GB

Table 1: Boada server configuration.
2.1 Dimemas Simulation

The application chosen for this experiment was integer sort from the NAS Parallel Benchmarks, a set of programs designed to help evaluate the performance of parallel supercomputers. In this case, the benchmark was run on the Boada server, whose attributes are described in Table 1.

To perform an architecture simulation, it was decided to use the MareNostrum supercomputer configuration, whose parameters are shown in Table 2. Note that a simplification was made: each processor was considered to run a single thread. Starting from MareNostrum's original architecture, multiple simulations were performed varying its attributes. For this purpose, a script was created that generates Dimemas configuration files (section A.1.1), and another to automate its variations. The attributes changed in the simulated architecture were the latency, CPU speed, bandwidth and number of buses. All measurements were stored in an sqlite3 database and then queried to automatically generate, using gnuplot, the graphs (section A.1.3) presented in Section 3.

To conclude, the changed attributes were iteratively fixed at chosen optimal values to find a final architecture that needs fewer resources while having execution times similar to the original MareNostrum configuration.
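The sweep can be sketched as follows. This is an illustration, not the scripts of appendix A.1.1: the `Dimemas` command line, its `Execution time:` output line and the table layout are assumptions made for the sketch.

```python
import sqlite3
import subprocess

def run_dimemas(cfg_path):
    """Run the Dimemas simulator on one configuration file and return the
    simulated execution time in seconds. The parsing below assumes the
    simulator prints a line such as 'Execution time: 123.45'."""
    out = subprocess.run(["Dimemas", cfg_path],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("Execution time"):
            return float(line.split(":")[1])
    raise RuntimeError("no execution time found in Dimemas output")

def sweep(param_values, make_cfg, run=run_dimemas, db_path="results.db"):
    """Vary one architecture parameter, simulate each value, and store the
    (parameter, execution time) pairs in an sqlite3 database, from which
    gnuplot input can later be queried."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS results (param REAL, exec_time REAL)")
    for value in param_values:
        cfg = make_cfg(value)  # writes a Dimemas .cfg for this value
        con.execute("INSERT INTO results VALUES (?, ?)", (value, run(cfg)))
    con.commit()
    return con.execute(
        "SELECT param, exec_time FROM results ORDER BY param").fetchall()
```

The `run` and `make_cfg` hooks are injected so the same loop serves the latency, bandwidth, bus and CPU sweeps described above.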
2.2 Multi-Level Cache Simulation

To conclude this study on performance analysis, cache-level simulations were performed using Intel's Pin tool. The chosen application was a simple parallel application that performs distributed arithmetic operations. It represents the typical master-slave paradigm with an embarrassingly parallel workload.

To evaluate the cache architecture, the Pin dcache example tool was modified to support multiple levels of cache shared by parallel processors. The implemented cache architecture is represented in Figure 1: the level-two cache is shared within a cluster, and the level-three cache is globally shared.
Figure 1: Cache architecture for a cluster size of 8 and a total of 16 processors. Each processor has a private 16 KB L1 cache, each cluster of eight processors shares a 1 MB L2 cache, and all processors share a 4 MB L3 cache.
For these experiments, the total cache miss rates per cache level were calculated as a function of the cache sizes, the number of processors and the cluster size. Some experiments were also performed varying the cache associativity, i.e. the number of cache lines per cache set.
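The modified hierarchy can be sketched as below. This is a minimal LRU model for illustration, not the actual Pin dcache code: the 64-byte line, 4-way associativity, and processor-by-processor (rather than interleaved) trace replay are simplifying assumptions.

```python
class Cache:
    """Minimal set-associative cache with LRU replacement."""
    def __init__(self, size, line=64, ways=4):
        self.sets = size // (line * ways)
        self.line, self.ways = line, ways
        self.lru = [[] for _ in range(self.sets)]  # per set, oldest tag first
        self.accesses = self.misses = 0

    def access(self, addr):
        """Look up one address; return True on hit, False on miss."""
        self.accesses += 1
        s = (addr // self.line) % self.sets
        tag = addr // (self.line * self.sets)
        hit = tag in self.lru[s]
        if hit:
            self.lru[s].remove(tag)
        else:
            self.misses += 1
            if len(self.lru[s]) >= self.ways:
                self.lru[s].pop(0)        # evict the least recently used tag
        self.lru[s].append(tag)
        return hit

def simulate(traces, n_procs=16, cluster=8):
    """Replay one address trace per processor through the hierarchy of
    Figure 1 (private L1s, one L2 per cluster, one globally shared L3)
    and return the average miss rate of each level."""
    l1 = [Cache(16 * 1024) for _ in range(n_procs)]
    l2 = [Cache(1024 * 1024) for _ in range(n_procs // cluster)]
    l3 = Cache(4 * 1024 * 1024)
    for p, trace in enumerate(traces):
        for addr in trace:
            if not l1[p].access(addr):                 # an L1 miss goes to
                if not l2[p // cluster].access(addr):  # the cluster L2, then
                    l3.access(addr)                    # to the global L3
    rate = lambda c: c.misses / c.accesses if c.accesses else 0.0
    return [sum(rate(c) for c in l1) / len(l1),
            sum(rate(c) for c in l2) / len(l2),
            rate(l3)]
```

With this skeleton, sweeping cache sizes, associativity, cluster sizes or thread counts amounts to looping over the constructor arguments, mirroring the experiments described above.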
3 Results

This section describes the results of both experiments, together with the resulting charts and a discussion.

3.1 Dimemas Simulation

Starting with the initial architecture of MareNostrum, the first experiment consisted of varying the number of buses and observing the impact on the execution time of the application. The results of this experiment are depicted in Figure 2.
Figure 2: Execution time of IntegerSort depending on the number of buses.

As can be observed from Figure 2, the execution time decreases as the number of buses increases. This result was expected: in this multi-threaded application data is transferred between threads, and adding more buses increases the amount of data that can be transferred in parallel. It can also be seen that from sixteen buses onwards the execution time starts to stabilise, probably because most of the data is already being sent in parallel, so additional buses no longer affect performance.
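The saturation point can be illustrated with a toy contention model (illustrative only; Dimemas' bus model is more detailed): with p simultaneous messages and b buses, the transfers complete in ⌈p/b⌉ rounds, so buses beyond the number of simultaneous messages buy nothing.

```python
from math import ceil

def transfer_rounds(messages, buses):
    """Sequential bus rounds needed when at most `buses` of the pending
    messages can occupy a bus at the same time."""
    return ceil(messages / buses)

# For 16 simultaneous messages the round count halves with each doubling
# of buses, then flattens once every message has its own bus.
curve = [transfer_rounds(16, b) for b in (1, 2, 4, 8, 16, 20)]
```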
The second experiment consisted of varying the available bandwidth from the initial MareNostrum configuration. The results are shown in Figure 3.
Figure 3: Execution time of IntegerSort depending on the bandwidth (MB/s).
7. Figure 3 shows that the bandwidth as a bigger impact on performance if
the application is run on a smaller set of threads. For example, a variation
of 40 MB/s can increase the execution time by 20 seconds for four threads,
but for 32 threads, the changes are almost unnoticeable. This is probably
due to the fact that the master thread has to send the initial data to all
slaves. This means that increasing the number of slaves, the data can be di-
vided in smaller chunks that can be sent in parallel and thus taking less time.
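This scaling argument can be sketched with a minimal scatter-time model (an illustrative assumption, not the simulator's model; the 4000 MB payload is hypothetical): the master's data is split into total/s chunks sent in parallel, so the term that is sensitive to bandwidth shrinks as the slave count grows.

```python
def scatter_time(total_mb: float, slaves: int, bw_mb_s: float) -> float:
    """Time to scatter `total_mb` in equal chunks sent in parallel,
    one chunk per slave (toy model, ignores latency)."""
    return (total_mb / slaves) / bw_mb_s

# Dropping bandwidth from 250 to 210 MB/s hurts 4 slaves far more
# than 32 slaves, because each of the 32 chunks is 8x smaller.
delta_4 = scatter_time(4000, 4, 210) - scatter_time(4000, 4, 250)
delta_32 = scatter_time(4000, 32, 210) - scatter_time(4000, 32, 250)
print(delta_4, delta_32)
```

The penalty is proportional to the chunk size, hence exactly 8 times larger for 4 slaves than for 32 in this sketch.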
The third experiment consisted of varying the processing capacity of the
CPU. As one can observe in figure 4, increasing the processing power of each
processor decreases the execution time. This impact is more noticeable for
processing capacities smaller than 100%. Decreasing the CPU power is not a
good way to save resources, since even a small decrease has a big impact on
the execution time.
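The asymmetry around 100% follows from a simple split of the runtime into a compute part, which scales with the relative CPU speed, and a communication part, which does not (a sketch under assumed component times of 100 s and 20 s, not measured values):

```python
def exec_time(t_comp: float, t_comm: float, cpu_factor: float) -> float:
    """Toy model: only the compute part scales with relative CPU speed."""
    return t_comp / cpu_factor + t_comm

base = exec_time(100.0, 20.0, 1.0)  # nominal speed
slow = exec_time(100.0, 20.0, 0.5)  # halving the speed
fast = exec_time(100.0, 20.0, 2.0)  # doubling the speed
print(base, slow, fast)
```

Halving the speed adds 100 s in this sketch, while doubling it saves only 50 s: the 1/f term blows up below f = 1 but yields diminishing returns above it.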
Figure 4: Execution time of IntegerSort depending on the available CPU
(%).
To conclude the experiments on the variation of the architecture parameters,
figure 5 shows the impact of latency on the execution time.
For figure 5 a logarithmic scale was chosen for the x axis, since changes in
the same order of magnitude as the initial MareNostrum configuration do not
have a significant impact on the execution time. The latency can be increased
to significantly larger values without much effect on performance, since the
latency values in MareNostrum are very small. Only for latency values close
to 0.01 seconds do we start seeing bigger increases in the execution time.
This attribute should have a bigger impact for applications that are more
communication intensive.

Figure 5: Execution time of IntegerSort depending on the latency (s).
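The shape of figure 5 matches the classic linear communication model, t = L + size/BW (a standard approximation, not the simulator's internal model; the 1 MB message size is a hypothetical value): latency is invisible until it becomes comparable to the transfer term.

```python
def message_time(latency_s: float, size_mb: float, bw_mb_s: float) -> float:
    """Linear communication cost model: t = latency + size / bandwidth."""
    return latency_s + size_mb / bw_mb_s

# For a 1 MB message at 250 MB/s the transfer term is 4 ms, so the
# default 8 us latency is negligible, while 0.01 s dominates the cost.
t_base = message_time(0.000008, 1.0, 250.0)
t_high = message_time(0.01, 1.0, 250.0)
print(t_base, t_high)
```

On a log-scale latency axis this gives exactly the flat region followed by a sharp rise near 0.01 s.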
Figure 6: Execution time of IntegerSort depending on the number of threads.
To conclude, table 2 compares two less resource-demanding configurations
against the original MareNostrum configuration, all achieving similar
execution times. The chosen number of threads was 32 due to its better
performance, as shown in figure 6.

Parameters           MareNostrum   Config 1   Config 2
CPU (%)              1.0           0.95       0.9
Latency (s)          0.000008      0.0001     0.001
Bandwidth (MB/s)     250.0         240        230
Number of buses      20+ *         16         16
Execution time (s)   12.506        13.150     13.779

Table 2: Comparison between the execution times of the initial MareNostrum
configuration and its less resource-demanding configurations.
Table 2 confirms the predictions made in the previous experiments. The
chosen values increase the execution time by at most 1 second while reducing
most parameters by around 10% and significantly increasing the latency.
3.2 Multi-Level Cache Simulation
As previously mentioned, the chosen application is a simple parallel
application that performs distributed arithmetic operations. It represents
the typical master-slave paradigm with an embarrassingly parallel workload.
Figure 7: Miss rate of cache L2 for L1, L2 and L3 sizes of 16 KB, 1 MB and
4 MB, respectively.
To evaluate the cache architecture, it was varied along multiple factors,
such as the cluster size, the cache sizes and the cache line sizes. To start
these experiments, the cache architecture was set as shown in figure 1: 16
processors, each with its own 16 KB L1 cache.
The level-two cache has 1 MB and is cluster-shared, with a cluster size of 8.
Finally, the level-three cache is globally shared and has a size of 4 MB.
The first experiment consisted of varying the cluster size, as shown in figure
7, and verifying its impact on the L2 cache miss rate. As can be seen, for
the thread counts of this experiment the impact of changing the cluster size
on the miss rates was not very significant. For up to 4 threads it has almost
no impact at all, but when the system has more than 8 threads it can reduce
the miss rate by 2%. It is interesting to notice that in this experiment, the
more threads share the same L2 cache, the lower the miss rate becomes.
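The assignment of threads to shared L2 caches can be sketched as a one-line mapping, mirroring the cid = tid / (MaxNumThreads / clusterSize) computation in the Pin tool of appendix A.2.1 (note that clusterSize there counts L2 clusters; the 16-thread examples below are illustrative):

```python
def cluster_id(tid: int, n_threads: int, n_clusters: int) -> int:
    """Id of the cluster (shared L2 cache) a thread belongs to,
    as computed in the Pin tool of appendix A.2.1."""
    return tid // (n_threads // n_clusters)

# 16 threads over 2 clusters: threads 0-7 share one L2 cache,
# threads 8-15 share the other.
groups = [cluster_id(t, 16, 2) for t in range(16)]
print(groups)
```

Consecutive thread ids land in the same cluster, which is what lets neighbouring slaves share lines in the cluster-shared L2.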
Since most cache size configurations produced similar variations in the
cluster size experiment, the next step consisted of verifying the impact of
the cache sizes on the miss rates. The first step was varying the size of the
non-shared L1 cache; the results are presented in figure 8.
Figure 8: MissRate of Cache L1 for a variable L1 cache size (KB).
Looking at figure 8 it might seem strange that smaller numbers of threads
have such lower miss rates. This is because of the master/slave paradigm,
which for an increasing number of threads makes the accesses to data more
sparse. For bigger numbers of threads the miss rates can reach values close
to 15%. As expected, bigger L1 caches achieve smaller miss rates, although
the difference isn't greater than 2%.
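The capacity effect behind these curves can be reproduced with a few-line direct-mapped cache simulator (a toy sketch, much simpler than the Pin dcache model used in the experiments; the trace below is a hypothetical sequential sweep):

```python
def miss_count(addresses, cache_lines: int, line_size: int = 64) -> int:
    """Misses of a direct-mapped cache over an address trace (toy model)."""
    cache = [None] * cache_lines  # one stored tag per line
    misses = 0
    for addr in addresses:
        tag = addr // line_size
        idx = tag % cache_lines
        if cache[idx] != tag:  # miss: fetch and replace the line
            cache[idx] = tag
            misses += 1
    return misses

# Sweep a 128-line working set twice: if the cache holds all 128
# lines, the second pass hits; if it holds only 64, it thrashes.
trace = list(range(0, 8192, 64)) * 2
small = miss_count(trace, 64)    # cold misses + full second-pass misses
large = miss_count(trace, 128)   # cold misses only
print(small, large)
```

Doubling the cache past the working-set size removes all non-cold misses, which is why the gains flatten once the cache is "big enough".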
Although the experiments were performed for more L1 cache sizes, in order to
study the impact of the L2 cache size the L1 cache size was fixed at 16 KB.
The variation of the L2 cache size is presented in figure 9. As one can
observe, the miss rate of the L2 cache for 2 threads is high, being close
to 50%. This is probably because, given the low miss rate of the L1 cache,
the accesses that don't produce hits in L1 have lower predictability. For
bigger numbers of threads the miss rates are still high, although they don't
reach values higher than 33%.

Figure 9: Miss rate of cache L2 for a variable L2 cache size (MB) and an L1
cache size of 16 KB.
Figure 10: Miss rate of cache L3 for a variable L3 cache size (MB) and an L1
cache size of 16 KB.
Finally, the impact of the L3 cache size on the L3 miss rate is shown in
figure 10. It seems that accesses that don't produce hits in the first two
levels of cache will hardly produce hits in the third level. The only
exception is the 2-thread case, for which the set of data accessed per thread
is bigger. This probably shows that either the application doesn't justify
the use of three levels of cache, or the data accessed by each thread at each
moment is too small.
4 Conclusions
Dimemas allowed to experiment the theoretical performance of the applica-
tion in the MareNostrum architecture. Through the variation of each dif-
ferent parameter it was possible to create graphs depicting their impact on
the execution time. By the end of the experiment it was possible to suggest
an architecture with less resources that achieves similar results to the initial
MareNostrum architecture. This architecture is presented in table 2 and con-
firms the predictions made in the Dimemas experiments. The chosen values
increase the execution time at most 1 second while reducing most parameters
by around 10% and increasing significantly the latency.
For the second experiment, the impact of the cluster size and cache sizes
was presented for a simple parallel arithmetic application. The experiments
showed that the impact of the cluster size on the miss rate was not very
significant; for more than 8 threads it can reduce the miss rate by 2%.
Overall, the more threads share the same L2 cache, the lower the miss rate
becomes. This is because of the master/slave paradigm, which for an
increasing number of threads makes the accesses to data more sparse. As
expected, bigger L1 caches achieve smaller miss rates. For big numbers of
threads the miss rates in the L2 cache were high, although they don't reach
values higher than 33%. In general, accesses that didn't produce hits in the
first two levels of cache hardly produced hits in the third level. The
experiments showed that either the application doesn't justify the use of
three levels of cache, or the data accessed by each thread at each moment
is too small.
Scripting the experiments had a huge impact on the time needed to perform
them, since some of the experiments produced thousands of results. The
technique that proved most efficient was to script the generation of results,
output them to an SQL database and run queries to generate graphs through
gnuplot.
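This pipeline can be sketched in a few lines, using the same table schema the experiment script creates in appendix A.1.2 (the inserted rows below are fabricated placeholders, not measured results):

```python
import sqlite3

# In-memory stand-in for out/results/res.db, with the schema from
# the experiment script of appendix A.1.2.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dimemas (procs INTEGER, buses INTEGER, "
           "latency REAL, bandwidth REAL, cpu REAL, runtime REAL)")

# Rows as imported from the per-thread-count CSV files (dummy values).
rows = [(32, b, 0.000008, 250.0, 1.0, 20.0 - b) for b in range(1, 5)]
db.executemany("INSERT INTO dimemas VALUES (?, ?, ?, ?, ?, ?)", rows)

# The kind of per-series query the graph generator feeds to gnuplot.
series = db.execute("SELECT buses, runtime FROM dimemas "
                    "WHERE procs = 32 ORDER BY buses").fetchall()
print(series)
```

Each (x, y) series extracted this way becomes one "#Procs = N" line in the gnuplot charts.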
References
[1] NAS Parallel Benchmarks, http://www.nas.nasa.gov/publications/npb.html
[2] PARSEC Benchmark Suite, http://parsec.cs.princeton.edu/
A Used Scripts
A.1 Dimemas instrumentation
A.1.1 Generating Dimemas Configuration
#!/bin/bash

if [ $# -ne 6 ]
then
    echo "$0: Wrong number of arguments."
    echo "$0: <input.trf> <nthreads> <nbuses> <latency> <bandwidth> <%cpuspeed>"
    exit 1
fi

cat beginofconfig

# Bandwidth definition
echo -e "\n\"environment information\" {\"\", 0, \"\", 128, $5, $3, 3};;\n"

# Latency and %cpu speed definitions
for (( i=0; i<=127; i++ ))
do
    echo "\"node information\" {0, $i, \"\", 1, 1, 1, 0.0, $4, $6, 0.0, 0.0};;"
done

# File name and number of processors definitions
echo ""
echo -n "\"mapping information\" {\"$1\", $2, [$2] "
echo -n "{0"

for (( i=1; i<=$2-1; i++ ))
do
    echo -n ",$i"
done

echo "}};;"

cat endofconfig
A.1.2 Running experiments
#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

cat logo

echo "Removing out folder (force)"
rm -rf out

echo "Creating out folder"
mkdir out
mkdir out/cfg
mkdir out/prv
mkdir out/details
mkdir out/results

echo "Creating sqlite3 database"
sqlite3 out/results/res.db 'CREATE TABLE dimemas (procs INTEGER, buses INTEGER, latency REAL, bandwidth REAL, cpu REAL, runtime REAL);'

echo "Setting default values"
LATENCY="0.000008"
BANDWIDTH="250.0"
BUSES="0"
CPU="1.0"

for i in 02 04 08 16 32
do
    # Strip the leading zero used in the trace file names
    if [ ${i:0:1} == 0 ]
    then
        nthreads=${i:1}
    else
        nthreads=$i
    fi

    echo -n "Generating results for $nthreads"

    # BUSES -----------------------------------------------------
    for j in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
    do
        ./configgen in/mpiping$i.trf $nthreads $j $LATENCY $BANDWIDTH $CPU > out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU
        echo -n "$nthreads,$j,$LATENCY,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # LATENCY ---------------------------------------------------
    for j in 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0
    do
        ./configgen in/mpiping$i.trf $nthreads $BUSES $j $BANDWIDTH $CPU > out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU
        echo -n "$nthreads,$BUSES,$j,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # BANDWIDTH -------------------------------------------------
    for j in 250.0 245.0 240.0 235.0 230.0 225.0 220.0 215.0 210.0 205.0 200.0 195.0 190.0 185.0 180.0 175.0 170.0
    do
        ./configgen in/mpiping$i.trf $nthreads $BUSES $LATENCY $j $CPU > out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$j-$CPU.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU
        echo -n "$nthreads,$BUSES,$LATENCY,$j,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # CPU SPEED -------------------------------------------------
    for j in 5.0 4.0 3.0 2.0 1.0 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05
    do
        ./configgen in/mpiping$i.trf $nthreads $BUSES $LATENCY $BANDWIDTH $j > out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j
        echo -n "$nthreads,$BUSES,$LATENCY,$BANDWIDTH,$j," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done
    echo "."

    echo "Importing to database"
    echo ".separator \",\"" > out/results/command
    echo ".import out/results/res-${nthreads}.csv dimemas" >> out/results/command
    sqlite3 out/results/res.db < out/results/command
    rm out/results/command
done

echo "Generating best configuration 1"
./configgen in/mpiping32.trf 32 16 0.0001 240.0 0.95 > out/cfg/config-32-16-0.0001-240.0-0.95.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-32-16-0.0001-240.0-0.95.prv out/cfg/config-32-16-0.0001-240.0-0.95.cfg > out/details/detail-32-16-0.0001-240.0-0.95
echo -n "32,16,0.0001,240.0,0.95," > out/results/optimal.csv
grep Execution out/details/detail-32-16-0.0001-240.0-0.95 | awk "{print \$3}" >> out/results/optimal.csv

echo "Generating best configuration 2"
./configgen in/mpiping16.trf 16 16 0.0001 230.0 0.9 > out/cfg/config-16-16-0.0001-230.0-0.9.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-16-16-0.0001-230.0-0.9.prv out/cfg/config-16-16-0.0001-230.0-0.9.cfg > out/details/detail-16-16-0.0001-230.0-0.9
echo -n "16,16,0.0001,230.0,0.9," >> out/results/optimal.csv
grep Execution out/details/detail-16-16-0.0001-230.0-0.9 | awk "{print \$3}" >> out/results/optimal.csv

./graphall buses
./graphall cpu
./graphall bandwidth
./graphall latency

echo "All done!"
A.1.3 Graph generator
#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

latency="0.000008"
bandwidth="250.0"
buses="0"
cpu="1.0"
aux=""
aux2=""

if [ "$1" == "latency" ]
then
    comp=$latency
    aux="set log x"
    aux2="set mxtics 10"
fi
if [ "$1" == "bandwidth" ]
then
    comp=$bandwidth
fi
if [ "$1" == "buses" ]
then
    comp=$buses
fi
if [ "$1" == "cpu" ]
then
    comp=$cpu
fi

echo "Generating Graph"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0             # dashed
set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81
# Remove border on top and right: these borders are useless and make
# it harder to see plotted lines near the border. Also put it in
# grey; no need for so much emphasis on a border.
set border 3 back linestyle 80
set xtics nomirror
set ytics nomirror

# Line styles: try to pick pleasing colors, rather than strictly
# primary colors or hard-to-see colors like gnuplot's default
# yellow. Make the lines thick so they're easy to see in small
# plots in papers.
set style line 1 lt rgb "#A00000" lw 2 pt 1
set style line 2 lt rgb "#00A000" lw 2 pt 6
set style line 3 lt rgb "#5060D0" lw 2 pt 2
set style line 4 lt rgb "#F25900" lw 2 pt 9
set style line 5 lw 2 pt 9

set key outside
set title "Execution time with variable $1"
set xlabel "$1"
$aux
$aux2
set ylabel "Execution time (s)"

plot "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 2 UNION select $1, runtime from dimemas where procs = 2 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 1 title '#Procs = 2', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 4 UNION select $1, runtime from dimemas where procs = 4 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 2 title '#Procs = 4', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 8 UNION select $1, runtime from dimemas where procs = 8 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 3 title '#Procs = 8', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 16 UNION select $1, runtime from dimemas where procs = 16 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 with lines ls 4 title '#Procs = 16', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 32 UNION select $1, runtime from dimemas where procs = 32 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 5 title '#Procs = 32'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
set output "out/results/$1.pdf"
replot
EOF

echo "Done"
A.1.4 Generating graphs
./graphall buses
./graphall latency
./graphall cpu
./graphall bandwidth

echo "Generating Graph"
gnuplot << EOF
set datafile separator ","
set nokey

set title "Execution time depending on the number of threads"
set xlabel "Number of threads"

set xtics (0, 2, 4, 8, 16, 32, 34)

set ylabel "Execution time (s)"

set style line 1 lt rgb "#A00000" lw 50

plot "out/results/comparisonThreads.csv" using 1:2 with imp ls 1

set term postscript eps enhanced color
set output "out/results/comparison.pdf"
replot
EOF
A.2 Pin tool instrumentation
A.2.1 Generate and Compile Application and DCache tool
1 #! / b i n / b a s h
2
3 #c l u s t e r S i z e
4 # c o n s t UINT32 c a c h e S i z e = 256∗KILO ;
5 # c o n s t UINT32 l i n e S i z e = 1 ;
6 # c o n s t UINT32 a s s o c i a t i v i t y = 2 5 6 ;
7 # l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc> <L 2 c a c h e s i z e
<c
> <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <L 3 l i n e S i z e > <L3assoc>
<nThreads>
8# $1 $2 $3 $4 $5
$6 $7 $8 $9
$10 $11
9
10 i f [ $# −ne 11 ]
11 then
12 echo ” $0 : Wrong number o f arguments . ”
13 echo ” $0 : <c l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc
> <L 2 c a c h e s i z e > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <
L 3 l i n e S i z e > <L3assoc> <nThreads>”
14 exit 1
15 f i
16
17 threadsAndMaster=$ ( ( $ {11} −1) )
18 #echo ” TreadsAndMaster = $threadsAndMaster ”
19
20 #echo −n ”INPUT=”
21 #echo ” $1 $2 $3 $4 $5 $6 $7 $8 $9 $ {10} $ {11}”
22
23 #echo ” S a v i n g backup o f dcache f i l e ”
24 mv −f dcache . cpp dcache backup . cpp
20
22. 25
26 echo ”
27 #i n c l u d e <i o s t r e a m >
28 #i n c l u d e <f s t r e a m >
29 #i n c l u d e <c a s s e r t >
30
31 #i n c l u d e ” p i n .H”
32
33
34 t y p e d e f UINT32 CACHE STATS ; // type o f c a c h e h i t / m i s s c o u n t e r s
35
36 #i n c l u d e ” p i n c a c h e .H”
37
38 KNOB t r i n g > KnobOutputFile (KNOB MODE WRITEONCE,
<s ” p i n t o o l ” ,
39 ” o ” , ” a l l c a c h e . out ” , ” s p e c i f y dcache f i l e name” ) ;
40
41 PIN LOCK l o c k ;
42
43 INT32 numThreads = 0 ;
44 c o n s t INT32 MaxNumThreads = $11 ;
45 c o n s t INT32 c l u s t e r S i z e = $1 ;
46
47 s t r u c t THREAD DATA
48 {
49 UINT64 H i t s ;
50 UINT64 Miss ;
51 };
52
53 THREAD DATA l 1 c o u n t [ MaxNumThreads ] ;
54 THREAD DATA l 2 c o u n t [ c l u s t e r S i z e ] ;
55
56 VOID T h r e a d S t a r t (THREADID t h r e a d i d , CONTEXT ∗ c t x t , INT32 f l a g s ,
VOID ∗v )
57 {
58 GetLock(& l o c k , t h r e a d i d +1) ;
59 numThreads++;
60 R e l e a s e L o c k (& l o c k ) ;
61
62 ASSERT( numThreads <= MaxNumThreads , ”Maximum number o f
t h r e a d s e x c e e d e d n” ) ;
63 }
64
65 namespace DL1
66 {
67 // 1 s t l e v e l data c a c h e : 32 kB , 32 B l i n e s , 32−way
associative
68 c o n s t UINT32 c a c h e S i z e = $2 ∗KILO ;
69 c o n s t UINT32 l i n e S i z e = $3 ;
70 c o n s t UINT32 a s s o c i a t i v i t y = $4 ;
21
23. 71 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
: : STORE NO ALLOCATE;
72
73 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
associativity ) ;
74 c o n s t UINT32 m a x a s s o c i a t i v i t y = a s s o c i a t i v i t y ;
75
76 t y p e d e f CACHE ROUND ROBIN( max sets , m a x a s s o c i a t i v i t y ,
a l l o c a t i o n ) CACHE;
77 }
78 LOCALVAR DL1 : : CACHE d l 1 ( ”L1 Data Cache ” , DL1 : : c a c h e S i z e , DL1 : :
l i n e S i z e , DL1 : : a s s o c i a t i v i t y ) ;
79
80 namespace UL2
81 {
82 // 2nd l e v e l u n i f i e d c a c h e : 2 MB, 64 B l i n e s , d i r e c t mapped
83 c o n s t UINT32 c a c h e S i z e = $5 ∗MEGA;
84 c o n s t UINT32 l i n e S i z e = $6 ;
85 c o n s t UINT32 a s s o c i a t i v i t y = $7 ;
86 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
: : STORE ALLOCATE;
87
88 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
associativity ) ;
89
90 t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE;
91 }
92 LOCALVAR UL2 : : CACHE u l 2 ( ”L2 C l u s t e r −s h a r e d Cache ” , UL2 : :
c a c h e S i z e , UL2 : : l i n e S i z e , UL2 : : a s s o c i a t i v i t y ) ;
93
94 namespace UL3
95 {
96 // 3 rd l e v e l u n i f i e d c a c h e : 16 MB, 64 B l i n e s , d i r e c t mapped
97 c o n s t UINT32 c a c h e S i z e = $8 ∗MEGA;
98 c o n s t UINT32 l i n e S i z e = $9 ;
99 c o n s t UINT32 a s s o c i a t i v i t y = $ { 1 0 } ;
100 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
: : STORE ALLOCATE;
101
102 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
associativity ) ;
103
104 t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE;
105 }
106 LOCALVAR UL3 : : CACHE u l 3 ( ”L3 G l o b a l l y −s h a r e d Cache ” , UL3 : :
c a c h e S i z e , UL3 : : l i n e S i z e , UL3 : : a s s o c i a t i v i t y ) ;
107
108 LOCALFUN VOID F i n i ( i n t code , VOID ∗ v )
109 {
22
24. 110 s t d : : o f s t r e a m out ( KnobOutputFile . Value ( ) . c s t r ( ) ) ;
111
112 out <<
113 ”#n”
114 ”# DCACHE s t a t s n ”
115 ”#n” ;
116
117 out << d l 1 ;
118 out << u l 2 ;
119 out << u l 3 ;
120
121 out . c l o s e ( ) ;
122
123 f o r ( i n t i =0; i <numThreads ; i ++)
124 {
125 p r i n t f ( ”%d L1 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . H i t s
);
126 p r i n t f ( ”%d L1 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . Miss
);
127 p r i n t f ( ”%d L1 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 1 c o u n t [ i ] . H i t s / (
l 1 c o u n t [ i ] . H i t s+l 1 c o u n t [ i ] . Miss ) ) ) ;
128 }
129
130 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++)
131 {
132 p r i n t f ( ”%d L2 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . H i t s
);
133 p r i n t f ( ”%d L2 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . Miss
);
134 p r i n t f ( ”%d L2 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 2 c o u n t [ i ] . H i t s / (
l 2 c o u n t [ i ] . H i t s+l 2 c o u n t [ i ] . Miss ) ) ) ;
135 }
136 }
137
138 LOCALFUN VOID U l 2 A c c e s s (ADDRINT addr , UINT32 size , CACHE BASE : :
ACCESS TYPE accessType , THREADID t i d )
139 {
140 // s e c o n d l e v e l u n i f i e d c a c h e
141 c o n s t BOOL d l 2 H i t = u l 2 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
142
143 // t h i r d l e v e l u n i f i e d c a c h e
144 i n t c i d = t i d / ( MaxNumThreads/ c l u s t e r S i z e ) ;
145 i f ( ! dl2Hit )
146 {
147 GetLock(& l o c k , t i d +1) ;
148 l 2 c o u n t [ c i d ] . Miss++;
149 R e l e a s e L o c k (& l o c k ) ;
150 u l 3 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
151 } else
23
25. 152 l 2 c o u n t [ c i d ] . H i t s ++;
153 }
154
155 LOCALFUN VOID MemRefMulti (ADDRINT addr , UINT32 size , CACHE BASE
: : ACCESS TYPE accessType , THREADID t i d )
156 {
157 // f i r s t l e v e l D−c a c h e
158 c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
159
160 i f ( ! dl1Hit ) {
161 l 1 c o u n t [ t i d ] . Miss++;
162 U l 2 A c c e s s ( addr , size , accessType , t i d ) ;
163 }
164 else
165 {
166 l 1 c o u n t [ t i d ] . H i t s ++;
167 }
168 }
169
170 LOCALFUN VOID MemRefSingle (ADDRINT addr , UINT32 size , CACHE BASE
: : ACCESS TYPE accessType , THREADID t i d )
171 {
172 // f i r s t l e v e l D−c a c h e
173 c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s S i n g l e L i n e ( addr , a c c e s s T y p e ) ;
174
175 i f ( ! dl1Hit ) {
176 l 1 c o u n t [ t i d ] . Miss++;
177 U l 2 A c c e s s ( addr , size , accessType , t i d ) ;
178 }
179 else
180 {
181 l 1 c o u n t [ t i d ] . H i t s ++;
182 }
183 }
184
185 LOCALFUN VOID I n s t r u c t i o n ( INS i n s , VOID ∗v )
186 {
187 i f ( INS IsMemoryRead ( i n s ) )
188 {
189 c o n s t UINT32 s i z e = INS MemoryReadSize ( i n s ) ;
190 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR)
MemRefSingle : (AFUNPTR) MemRefMulti ) ;
191
192 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e
193 INS InsertPredicatedCall (
194 i n s , IPOINT BEFORE , countFun ,
195 IARG MEMORYREAD EA,
196 IARG MEMORYREAD SIZE,
197 IARG UINT32 , CACHE BASE : : ACCESS TYPE LOAD,
24
26. 198 IARG THREAD ID ,
199 IARG END) ;
200 }
201
202 i f ( INS IsMemoryWrite ( i n s ) )
203 {
204 c o n s t UINT32 s i z e = INS MemoryWriteSize ( i n s ) ;
205 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR)
MemRefSingle : (AFUNPTR) MemRefMulti ) ;
206
207 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e
208 INS InsertPredicatedCall (
209 i n s , IPOINT BEFORE , countFun ,
210 IARG MEMORYWRITE EA,
211 IARG MEMORYWRITE SIZE,
212 IARG UINT32 , CACHE BASE : : ACCESS TYPE STORE,
213 IARG THREAD ID ,
214 IARG END) ;
215 }
216 }
217
218 GLOBALFUN i n t main ( i n t argc , c h a r ∗ argv [ ] )
219 {
220 P I N I n i t ( argc , argv ) ;
221
222 f o r ( INT32 t =0; t<MaxNumThreads ; t++)
223 {
224 l1count [ t ] . Hits = 0;
225 l 1 c o u n t [ t ] . Miss =0;
226 }
227
228 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++)
229 {
230 l 2 c o u n t [ i ] . H i t s =0;
231 l 2 c o u n t [ i ] . Miss =0;
232 }
233
234 PIN AddThreadStartFunction ( ThreadStart , 0 ) ;
235 INS AddInstrumentFunction ( I n s t r u c t i o n , 0 ) ;
236 PIN AddFiniFunction ( F i n i , 0 ) ;
237
238 // Never r e t u r n s
239 PIN StartProgram ( ) ;
240
241 return 0 ; // make c o m p i l e r happy
242 }” > dcache . cpp
243
244 make > makeres
245
25
echo "
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct
{
    double *a;
    double *b;
    double sum;
    int veclen;
} DOTDATA;

#define NUMTHRDS $threadsAndMaster
#define VECLEN 1000000

DOTDATA dotstr;
pthread_t callThd[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg)
{
    int i, start, end, len;
    long offset;
    double mysum, *x, *y;
    offset = (long) arg;

    // Extrae_event_and_counters(1, 8);

    len = dotstr.veclen;
    start = offset * (len / NUMTHRDS);
    end = start + (len / NUMTHRDS);
    x = dotstr.a;
    y = dotstr.b;

    mysum = 0;
    for (i = start; i < end; i++)
        mysum += (x[i] * y[i]);

    // Extrae_event_and_counters(1, 9);

    pthread_mutex_lock(&mutexsum);

    // Extrae_event_and_counters(1, 10);
    dotstr.sum += mysum;
    // Extrae_event_and_counters(1, 11);

    pthread_mutex_unlock(&mutexsum);

    // Extrae_event_and_counters(1, 0);

    return NULL;
}

int main(int argc, char *argv[])
{
    long i;
    double *a, *b;
    void *status;
    pthread_attr_t attr;

    clock_t begin, end;
    double time_spent;

    begin = clock();
    // Extrae_init();

    // Extrae_event_and_counters(1, 1);

    a = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));
    b = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));

    // Extrae_event_and_counters(1, 2);

    for (i = 0; i < VECLEN * NUMTHRDS; i++)
    {
        a[i] = 1;
        b[i] = a[i];
    }

    dotstr.veclen = VECLEN;
    dotstr.a = a;
    dotstr.b = b;
    dotstr.sum = 0;

    // Extrae_event_and_counters(1, 3);

    pthread_mutex_init(&mutexsum, NULL);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_event_and_counters(1, 4);
        pthread_create(&callThd[i], &attr, dotprod, (void *) i);
        // Extrae_event_and_counters(1, 3);
    }

    pthread_attr_destroy(&attr);
    // Extrae_event_and_counters(1, 5);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_event_and_counters(1, 6);
        pthread_join(callThd[i], &status);
        // Extrae_event_and_counters(1, 7);
    }

    printf(\"Sum = %f\n\", dotstr.sum);
    free(a);
    free(b);

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf(\"Execution time: %f\n\", time_spent);

    // Extrae_fini();

    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}" > dotprod.c

#echo "Compiling dotprod"
gcc -o dotprod dotprod.c -lpthread

#echo "Running pintool"
cd /scratch/boada-1/etm022/pin
./pin -t /scratch/boada-1/etm022/pin/source/tools/Memory/obj-intel64/dcache.so -- /scratch/boada-1/etm022/pin/source/tools/Memory/dotprod > /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.res

mv allcache.out /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.allcache

cd /scratch/boada-1/etm022/pin/source/tools/Memory
echo "done!"
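The dcache tool generated above dispatches MemRefSingle for accesses of at most 4 bytes and MemRefMulti for wider ones, because a wide access may straddle a cache-line boundary and must be looked up once per line it touches. The line-span computation behind that split can be sketched as follows (a hypothetical helper for illustration, not part of the Pin tool itself):

```python
def lines_touched(addr: int, size: int, line_size: int) -> int:
    """Number of cache lines an access of `size` bytes at `addr` spans."""
    first = addr // line_size               # line index of the first byte
    last = (addr + size - 1) // line_size   # line index of the last byte
    return last - first + 1

# A 4-byte load inside one 32-byte line touches a single line...
print(lines_touched(0, 4, 32))   # 1
# ...while the same load starting at offset 30 straddles two lines.
print(lines_touched(30, 4, 32))  # 2
```

An access of at most 4 bytes on a line of 32 bytes or more can still cross a boundary if it is misaligned, which is why the multi-line path exists at all.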
A.2.2 Running the experiments
#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#
# Cache configuration constants substituted into dcache.cpp, e.g.:
#   const UINT32 cacheSize = 256*KILO;
#   const UINT32 lineSize = 1;
#   const UINT32 associativity = 256;
#
# Arguments passed to genMakeCPP per configuration:
# <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>
#      $1            $2           $3          $4          $5           $6          $7          $8           $9          ${10}      ${11}

rm -rf results
mkdir results

total=$((3 * 4 * 3 * 3 * 2 * 3 * 2))
n=0
res1=$(date +%s.%N)

# clusterSize
for cs in 2 4 8
do
  # nThreads
  for mt in 2 4 8 16
  do
    # L1cacheSize
    for l1c in 16 32 64
    do
      # L1lineSize
      for l1l in 32 #64 128
      do
        # L1assoc
        for l1a in 1 #2 4
        do
          # L2cacheSize
          for l2c in 1 2 4
          do
            # L2lineSize
            for l2l in 32 64 #128
            do
              # L2assoc
              for l2a in 1 #2 4
              do
                # L3cacheSize
                for l3c in 4 8 16
                do
                  # L3lineSize
                  for l3l in 32 64 #128
                  do
                    # L3assoc
                    for l3a in 1 #2 4
                    do
                      clear
                      cat logo
                      echo "--------------------------by aknahs"
                      echo -n "Generating [$n/$total]..."
                      res2=$(date +%s.%N)
                      printf "Elapsed: %.3F\n" $(echo "$res2 - $res1" | bc)

                      n=$(($n + 1))
                      ./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt
                      echo "."
                    done
                  done
                done
              done
            done
          done
        done
      done
    done
  done
done
echo "all done."

grep "Total Miss Rate" results/*.allcache | awk 'BEGIN{n=0; printf "Cache Level,Cluster Size,L1 Cache Size,L1 Line Size,L1 Association,L2 Cache Size,L2 Line Size,L2 Association,L3 Cache Size,L3 Line Size,L3 Association,Number of threads,Total Miss Rate\n"} {split($1, a, "."); split(a[1], b, "-"); printf "%d,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", n%3 + 1, b[2], b[3], b[4], b[5], b[6], b[7], b[8], b[9], b[10], b[11], b[12], $5; ++n}' >> results/brutaldb.csv
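The eleven nested loops above sweep only the uncommented value of each parameter, which is why the grid size matches the total computed at the top of the script. A quick sanity check of that count (a sketch mirroring the loop bounds, not part of the original scripts):

```python
from itertools import product

# Values actually iterated by the script (commented-out values excluded).
grid = {
    "clusterSize": [2, 4, 8],
    "nThreads": [2, 4, 8, 16],
    "L1cacheSize": [16, 32, 64],
    "L1lineSize": [32],
    "L1assoc": [1],
    "L2cacheSize": [1, 2, 4],
    "L2lineSize": [32, 64],
    "L2assoc": [1],
    "L3cacheSize": [4, 8, 16],
    "L3lineSize": [32, 64],
    "L3assoc": [1],
}

# Cartesian product of all swept values = one genMakeCPP run each.
configs = list(product(*grid.values()))
print(len(configs))  # 1296, the same value as total=$((3*4*3*3*2*3*2))
```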
A.2.3 Importing the results to a database
#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

rm -rf power
mkdir power

sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

echo "Importing to database"
echo ".separator \",\"" > power/command
echo ".import brutaldb.csv res" >> power/command
sqlite3 power/res.db < power/command
rm power/command

echo "done"

./graphall
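Once imported, the res table can be queried directly; the plotting script in the next section does exactly this through sqlite3 pipes into gnuplot. The shape of those queries, sketched with Python's built-in sqlite3 module against an in-memory copy of the schema (the two sample rows are fabricated for illustration, not measured data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE res (cachelevel INTEGER, cluster INTEGER,
    l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER,
    l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER,
    l3assoc INTEGER, threads INTEGER, missrate REAL)""")

# Two made-up rows standing in for the imported brutaldb.csv data.
rows = [
    (2, 2, 16, 32, 1, 1, 32, 1, 4, 32, 1, 2, 41.5),
    (2, 2, 32, 32, 1, 1, 32, 1, 4, 32, 1, 2, 35.0),
]
conn.executemany("INSERT INTO res VALUES (" + ",".join("?" * 13) + ")", rows)

# The kind of query the plotting script builds: miss rate vs. L1 size for
# one cache level, cluster size and thread count.
result = conn.execute(
    "SELECT l1size, missrate FROM res "
    "WHERE cachelevel = 2 AND cluster = 2 AND threads = 2 "
    "ORDER BY l1size").fetchall()
print(result)  # [(16, 41.5), (32, 35.0)]
```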
A.2.4 Generating graphs
#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

mkdir power/cluster2
mkdir power/cluster4
mkdir power/cluster8

# for each graph set
for set in 1 2 3
do
  # for each cluster size
  for cs in 2 4 8
  do
    # for each level of cache
    for l in 1 2 3
    do

      if [ $set == 1 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-L1Size16-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache ${l} per L${l} size (Lsize = [16,.,.])"
        xlabel="Size of cache L${l}"
      fi

      if [ $set == 2 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache ${l} per L${l} size"
        xlabel="Size of cache L${l}"
      fi

      if [ $set == 3 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-L1Size16-L2size1-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and l2size = 1 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache L${l} per L${l} size (Lsize = [16,1,.])"
        xlabel="Size of cache L${l}"
      fi

      if [[ $set = 1 && $l = 1 ]]
      then
        continue
      fi

      if [[ $set == 3 && ( $l == 1 || $l == 2 ) ]]
      then
        continue
      fi

      echo "Generating Graph for set $set on cache level $l"
      gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
# Remove the border on top and right; those borders are useless and make
# it harder to see plotted lines near the border. Also put it in grey:
# no need for so much emphasis on a border.
set border 3 back linestyle 80
set xtics nomirror
set ytics nomirror

# Line styles: try to pick pleasing colors, rather than strictly primary
# colors or hard-to-see colors like gnuplot's default yellow. Make the
# lines thick so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 ps 1 pt 1
set style line 2 lt rgb "#00A000" lw 2 ps 1 pt 6
set style line 3 lt rgb "#5060D0" lw 2 ps 1 pt 2
set style line 4 lt rgb "#F25900" lw 2 ps 1 pt 9

set yrange [0:100]

set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with points ls 1 title '#procs = 2', \
     "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with points ls 2 title '#procs = 4', \
     "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with points ls 3 title '#procs = 8', \
     "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with points ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
set output "${filename}.pdf"
replot
EOF
    done
  done
done

echo "Done"

filename="power/L2MissRate-L1Size32-L2size4-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 32 and l2size = 4 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize = [32,4,4])"
xlabel="Cluster size"

echo "Generating Graph for variable cluster sizes"
gnuplot << EOF
set datafile separator "|"

set style line 80 lt rgb "#808080"   # axes
set style line 81 lt 0               # dashed grid
set style line 81 lt rgb "#808080"   # grey

set grid back linestyle 81
set border 3 back linestyle 80
set xtics nomirror
set ytics nomirror

set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
     "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
     "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
     "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
set output "${filename}.pdf"
replot
EOF

filename="power/L2MissRate-L1Size16-L2size1-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 16 and l2size = 1 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize = [16,1,4])"
xlabel="Cluster size"

echo "Generating Graph for variable cluster sizes"
gnuplot << EOF
set datafile separator "|"

set style line 80 lt rgb "#808080"   # axes
set style line 81 lt 0               # dashed grid
set style line 81 lt rgb "#808080"   # grey

set grid back linestyle 81
set border 3 back linestyle 80
set xtics nomirror
set ytics nomirror

set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
     "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
     "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
     "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
set output "${filename}.pdf"
replot
EOF