This document provides an overview of writing OpenMP programs on multi-core machines. It discusses:
1) Why OpenMP is useful for parallel programming and its main components like compiler directives and library routines.
2) Elements of OpenMP like parallel regions, work sharing constructs, data scoping, and synchronization methods.
3) Achieving scalable speedup through techniques like breaking data dependencies, avoiding synchronization overheads, and improving data locality with cache and page placement.
1. Writing OpenMP Programs on
Many and Multi Core Machines
Prof NB Venkateswarlu
ISTE Visiting Professor 2010-11
CSE, AITAM, Tekkali
venkat_ritch@yahoo.com
www.ritchcenter.com/nbv
2. Agenda
• Why OpenMP ?
• Elements of OpenMP
• Scalable Speedup and Data Locality
• Parallelizing Sequential Programs
• Breaking data dependencies
• Avoiding synchronization overheads
• Achieving Cache and Page Locality
• SGI Tools for Performance Analysis
and Tuning
3. Why OpenMP ?
• Parallel programming is more difficult
than sequential programming
• OpenMP is a scalable, portable,
incremental approach to designing
shared memory parallel programs
• OpenMP supports
– fine and coarse grained parallelism
– data and control parallelism
4. What is OpenMP ?
Three components:
• Set of compiler directives for
– creating teams of threads
– sharing the work among threads
– synchronizing the threads
• Library routines for setting and querying
thread attributes
• Environment variables for controlling run-
time behavior of the parallel program
5. Elements of OpenMP
• Parallel regions and work sharing
• Data scoping
• Synchronization
• Compiling and running OpenMP
programs
6. Parallelism in OpenMP
• The parallel region is the construct for
creating multiple threads in an
OpenMP program
• A team of threads is created at run
time for a parallel region
• A nested parallel region is allowed,
but may contain a team of one thread
• Nested parallelism is enabled with
setenv OMP_NESTED TRUE
8. Hello World in OpenMP
#include <omp.h>
int main() {
int iam =0, np = 1;
#pragma omp parallel private(iam, np)
{
#if defined (_OPENMP)
np = omp_get_num_threads();
iam = omp_get_thread_num();
#endif
printf("Hello from thread %d out of %d\n", iam, np);
}
}
parallel region directive
with data scoping clause
9. Specifying Parallel Regions
• Fortran
!$OMP PARALLEL [clause [clause...]]
! Block of code executed by all threads
!$OMP END PARALLEL
• C and C++
#pragma omp parallel [clause [clause...]]
{
/* Block executed by all threads */
}
11. Work sharing in OpenMP
• Two ways to specify parallel work:
– Explicitly coded in parallel regions
– Work-sharing constructs
» DO and for constructs: parallel loops
» sections
» single
• SPMD type of parallelism supported
12. Work and Data Partitioning
Loop parallelization
• distribute the work among the threads,
without explicitly distributing the data.
• scheduling determines which thread
accesses which data
• communication between threads is
implicit, through data sharing
• synchronization via parallel constructs
or is explicitly inserted in the code
13. Data Partitioning & SPMD
• Data is distributed explicitly among
processes
• With message passing, e.g., MPI,
where no data is shared, data is
explicitly communicated
• Synchronization is explicit or
embedded in communication
• With parallel regions in OpenMP, both
SPMD and data sharing are supported
14. Pros and Cons of SPMD
» Pros:
– Potentially higher parallel fraction
than with loop parallelism
– The fewer parallel regions, the less
overhead
» Cons:
– More explicit synchronization needed
than for loop parallelization
– Does not promote incremental
parallelization and requires manually
assigning data subsets to threads
15. SPMD Example
program mat_init
implicit none
integer, parameter::N=1024
real A(N,N)
integer :: iam, np
iam = 0
np = 1
!$omp parallel private(iam,np)
np = omp_get_num_threads()
iam = omp_get_thread_num()
! Each thread calls work
call work(N, A, iam, np)
!$omp end parallel
end
subroutine work(n, A, iam, np)
integer n, iam, np
real A(n,n)
integer :: chunk,low,high,i,j
chunk = (n + np - 1)/np
low = 1 + iam*chunk
high=min(n,(iam+1)*chunk)
do j = low, high
do I=1,n
A(I,j)=3.14 + &
sqrt(real(i*i*i+j*j+i*j*j))
enddo
enddo
return
end
A single parallel region, no scheduling needed,
each thread explicitly determines its work
16. Extent of directives
Most directives have as extent a
structured block, or basic block, i.e., a
sequence of statements with a flow of
control that satisfies:
• there is only one entry point in the
block, at the beginning of the block
• there is only one exit point, at the end
of the block; the exceptions are that
exit() in C and stop in Fortran are
allowed
17. Work Sharing Constructs
• DO and for : parallelizes a loop,
dividing the iterations among the
threads
• sections : defines a sequence of
contiguous blocks, the beginning of
each bock being marked by a section
directive. The block within each
section is assigned to one thread
• single: assigns a block of a parallel
region to a single thread
18. Specialized Parallel
Regions
Work-sharing can be specified
combined with a parallel region
• parallel DO and parallel for : a
parallel region which contains a
parallel loop
• parallel sections, a parallel region
that contains a number of section
constructs
19. Scheduling
• Scheduling assigns the iterations of a
parallel loop to the team threads
• The directives [parallel] do and
[parallel] for take the clause
schedule(type [,chunk])
• The optional chunk is a loop-invariant
positive integer specifying the number
of contiguous iterations assigned to a
thread
20. Scheduling
The type can be one of
• static threads are statically assigned
chunks of size chunk in a round-robin
fashion. The default for chunk is
ceiling(N/p) where N is the number of
iterations and p is the number of
threads in the team
• dynamic threads are dynamically
assigned chunks of size chunk, i.e.,
21. Scheduling
when a thread is ready to receive new
work, it is assigned the next pending
chunk. Default value for chunk is 1.
• guided a variant of dynamic
scheduling in which the size of the
chunk decreases exponentially from
chunk to 1. Default value for chunk is
ceiling(N/p)
22. Scheduling
• runtime indicates that the schedule
type and chunk are specified by the
environment variable OMP_SCHEDULE. A
chunk cannot be specified with
runtime.
• Example of run-time specified
scheduling
setenv OMP_SCHEDULE “dynamic,2”
23. Scheduling
• If the schedule clause is missing, an
implementation dependent schedule is
selected. MIPSpro selects by default
the static schedule
• Static scheduling has low overhead
and provides better data locality
• Dynamic and guided scheduling may
provide better load balancing
24. Work Sharing Constructs
A motivating example
Sequential code:
for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }
OpenMP parallel region:
#pragma omp parallel
{
int id, i, Nthrds, istart, iend;
id = omp_get_thread_num();
Nthrds = omp_get_num_threads();
istart = id * N / Nthrds;
iend = (id+1) * N / Nthrds;
for(i=istart;i<iend;i++) { a[i] = a[i] + b[i]; }
}
OpenMP parallel region and a work-sharing for construct:
#pragma omp parallel
#pragma omp for schedule(static)
for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }
25. Work-sharing Construct
Threads are assigned an independent set of iterations
Threads must wait at the end of the work-sharing construct (implicit barrier)
#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
c[i] = a[i] + b[i];
(figure: the iterations i = 1 through i = 12 are divided among the threads of the team)
26. Combining pragmas
These two code segments are equivalent
#pragma omp parallel
{
#pragma omp for
for (i=0; i < MAX; i++) {
res[i] = huge();
}
}
#pragma omp parallel for
for (i=0; i < MAX; i++) {
res[i] = huge();
}
27. Types of Extents
Two types for the extent of a directive:
• static or lexical extent: the code
textually enclosed between the
beginning and the end of the
structured block following the
directive
• dynamic extent: static extent as well
as the procedures called from within
the static extent
28. Orphaned Directives
A directive which is in the dynamic
extent of another directive but not in
its static extent is said to be orphaned
• Work sharing directives can be
orphaned
• This allows a work-sharing construct
to occur in a subroutine which can be
called both by serial and parallel code,
improving modularity
29. Directive Binding
• Work sharing directives (do, for,
sections, and single) as well as
master and barrier bind to the
dynamically closest parallel directive,
if one exists, and have no effect when
they are not in the dynamic extent of a
parallel region
• The ordered directive binds to the
enclosing do or for directive having
the ordered clause
30. Directive Binding
• critical (and atomic) provide
mutual exclusive execution (and
update) with respect to all the
threads in the program
31. Data Scoping
• Work-sharing and parallel
directives accept data scoping clauses
• Scope clauses apply to the static
extent of the directive and to variables
passed as actual arguments
• The shared clause applied to a
variable means that all threads will
access the single copy of that variable
created in the master thread
32. Data Scoping
• The private clause applied to a
variable means that an uninitialized
private copy of the variable is created
for each thread
• Semi-private data for parallel loops:
– reduction: variable that is the target of a reduction
operation performed by the loop, e.g., sum
– firstprivate: initialize the private copy from the
value of the shared variable
– lastprivate: upon loop exit, master thread holds
the value seen by the thread assigned the last loop
iteration
33. Threadprivate Data
• The threadprivate directive is
associated with the declaration of a
static variable (C) or common block
(Fortran) and specifies persistent data
(spans parallel regions) cloned, but
not initialized, for each thread
• To guarantee persistence, the dynamic
threads feature must be disabled
setenv OMP_DYNAMIC FALSE
34. Threadprivate Data
• threadprivate data can be initialized
in a thread using the copyin clause
associated with the parallel,
parallel do/for, and parallel
sections directives
• the value stored in the master thread
is copied into each team thread
• Syntax: copyin (name [,name])
where name is a variable or (in Fortran)
a named common block
35. Scoping Rules
• Data declared outside a parallel
region is shared by default, except for
– loop index variable of parallel do
– data declared as threadprivate
• Local data in the dynamic extent of a
parallel region is private:
– subroutine local variables, and
– C/C++ blocks within a parallel region
36. Scoping Restrictions
• The private clause for a directive in
the dynamic extent of a parallel region
can be specified only for variables that
are shared in the enclosing parallel
region
– That is, a privatized variable cannot
be privatized again
• The shared clause is not allowed for
the DO (Fortran) or for (C) directive
37. Shared Data
• Access to shared data must be
mutually exclusive: a thread at a time
• For shared arrays, when different
threads access mutually exclusive
subscripts, synchronization is not
needed
• For shared scalars, critical sections or
atomic updates must be used
• Consistency operation: flush directive
38. Synchronization
Explicit, via directives:
• critical, implements the critical
sections, providing mutual exclusion
• atomic, implements atomic update of
a shared variable
• barrier, a thread waits at the point
where the directive is placed until all
other threads reach the barrier
39. Synchronization
• ordered, preserves the order of the
sequential execution; can occur at
most once inside a parallel loop
• flush, creates consistent view of
thread-visible data
• master, block in a parallel region that
is executed by the master thread and
skipped by the other threads; unlike
single, there is no implied barrier
40. Implicit Synchronization
• There is an implied barrier at the
end of a parallel region, and of a work-
sharing construct for which a nowait
clause is not specified
• A flush is implied by an explicit or
implicit barrier as well as upon
entry and exit of a critical or
ordered block
41. Directive Nesting
• A parallel directive can appear in
the dynamic extent of another
parallel, i.e., parallel regions can be
nested
• Work-sharing directives binding to the
same parallel directive cannot be
nested
• An ordered directive cannot appear
in the dynamic extent of a critical
directive
42. Directive Nesting
• A barrier or master directive
cannot appear in the dynamic extent of
a work-sharing region ( DO or for,
sections, and single) or ordered
block
• In addition, a barrier directive cannot
appear in the dynamic extent of a
critical or master block
44. Library Routines
OpenMP defines library routines that
can be divided in three categories
1. Query and set multithreading
• get/set number of threads or processors
omp_set_num_threads,
omp_get_num_threads,
omp_in_parallel, …
• get thread ID:
omp_get_thread_num
45. Library Routines
2. Set and get execution environment
• Inquire/set nested parallelism:
omp_get_nested
omp_set_nested
• Inquire/set dynamic number of threads in
different parallel regions:
omp_set_dynamic
omp_get_dynamic
46. Library Routines
3. API for manipulating locks
• A lock variable provides thread
synchronization, has C type omp_lock_t
and Fortran type integer*8, and holds
a 64-bit address
• Locking routines: omp_init_lock,
omp_set_lock, omp_unset_lock, …
Man pages: omp_threads, omp_lock
47. Reality Check
Irregular and ambiguous aspects are
sources of language- and
implementation dependent behavior:
• nowait clause is allowed at the
beginning of [parallel] for (C/C++)
but at the end of [parallel] DO
(Fortran)
• default clause can specify private
scope in Fortran, but not in C/C++
48. Reality Check
• Can only privatize full objects, not
array elements or fields of data
structures
• For a threadprivate variable or
block one cannot specify any clause
except for the copyin clause
• In MIPSpro 7.3.1 one cannot specify in
the same directive both the
firstprivate and lastprivate
clauses for a variable
49. Reality Check
• With MIPSpro 7.3.1, when a loop is
parallelized with the do (Fortran) or for
(C/C++) directive, the indexes of the nested
loops are, by default, private in Fortran, but
shared in C/C++
Probably, this is a compiler issue
• Fortunately, the compiler warns about
unsynchronized accesses to shared
variables
• This does not occur for parallel do or
parallel for
50. Compiling and Running
• Use MIPSpro with the option -mp both for
compiling and linking
(the default -MP:open_mp=ON must be in effect)
• Fortran:
f90 [-freeform] [-cpp] -mp prog.f
-freeform needed for free-form source
-cpp needed when using #ifdefs
• C/C++:
cc -mp -O3 prog.c
CC -mp -O3 prog.C
51. Setting the Number of Threads
• Environment variables:
setenv OMP_NUM_THREADS 8
if OMP_NUM_THREADS is not set, but
MP_SET_NUMTHREADS is set, the latter
defines the number of threads
• Environment variables can be
overridden by the programmer:
omp_set_num_threads(int n)
52. Scalable Speedup
• Most often the memory is the limit to
the performance of a shared memory
program
• On scalable architectures, the latency
and bandwidth of memory accesses
depend on the locality of accesses
• In achieving good speedup of a shared
memory program, data locality is an
essential element
53. What Determines Data Locality
• Initial data distribution determines on
which node the memory is placed
– first touch or round-robin system policies
– data distribution directives
– explicit page placement
• Work sharing, e.g., loop scheduling,
determines which thread accesses
which data
• Cache friendliness determines how
often main memory is accessed
54. Cache Friendliness
For both serial loops and parallel loops
• locality of references
– spatial locality: use adjacent cache lines and all
items in a cache line
– temporal locality: reuse same cache line; may
employ techniques such as cache blocking
• low cache contention
– avoid the sharing of cache lines among different
objects; may resort to array padding or increasing
the rank of an array
55. Cache Friendliness
• Contention is an issue specific to
parallel loops, e.g., false sharing of
cache lines
cache friendliness =
high locality of references
+
low contention
56. NUMA machines
• Memory hierarchies exist in single-CPU
computers and Symmetric
Multiprocessors (SMPs)
• Distributed shared memory (DSM)
machines based on Non-Uniform
Memory Architecture (NUMA) add
levels to the hierarchy:
– local memory has low latency
– remote memory has high latency
57. Origin2000 memory hierarchy
Level                          Latency (cycles)
register                       0
primary cache                  2–3
secondary cache                8–10
local main memory & TLB hit    75
remote main memory & TLB hit   250
main memory & TLB miss         2000
page fault                     10^6
58. Page Level Locality
• An ideal application has full page
locality: pages accessed by a
processor are on the same node as the
processor, and no page is accessed by
more than one processor (no page
sharing)
• Twofold benefit:
» low memory latency
» scalability of memory bandwidth
59. Page Level Locality
• The benefits brought about by page
locality are more important for
programs that are not cache friendly
• We look at several data placement
strategies for improving page locality
» system based placement
» data initialization and directives
» combination of system and program
directed data placement
60. Page sharing due to alignment
• Consider an array whose size is twice the
size of a page, and which is distributed
between two nodes
• Pages 1 and 2 are located on node 1,
page 3 is on node 2
• Page 2 is shared by the two processors,
because the array does not start on a
page boundary
(Figure: the array layout spans pages 1–3; processor 1 accesses the first half, processor 2 the second, and both touch page 2.)
61. Achieving Page Locality
IRIX has two page placement policies:
• first-touch: the process which first
references a virtual address causes that
address to be mapped to a page on the
node where the process runs
• round-robin: pages allocated to a job are
selected from nodes traversed in round-
robin order
• IRIX uses first-touch, unless
setenv _DSM_ROUND_ROBIN
62. Achieving Page Locality
IRIX allows pages to be migrated between
nodes, to adjust the page placement
• a page is migrated based on the affinity of
data accesses to that page, which is
derived at run-time from the per-process
cache-miss pattern
• page migration follows the page affinity
with a delay whose magnitude depends
on the aggressiveness of migration
63. Achieving Page Locality
• To enable data migration, except for
explicitly placed data
setenv _DSM_MIGRATION ON
• To enable migration of all data
setenv _DSM_MIGRATION ALL_ON
• To set the aggressiveness of migration
setenv _DSM_MIGRATION_LEVEL n
where n is an integer between 0 (least
aggressive, disables migration) and 100 (most
aggressive, the default)
64. Achieving Page Locality
Methods, from best to worst
• Parallel data initialization, using
OpenMP parallel work constructs such
as parallel do, combined with
operating system’s first-touch
placement policy
» works with heap, local, global arrays
» no data distribution directives needed
» can be used with page migration
65. Achieving Page Locality
• IRIX round-robin page placement
» improves application’s memory
bandwidth
» no change of code needed
» allows both serial and parallel
initialization of data in the program
66. Achieving Page Locality
• Regular distribution directive
» allows serial initialization of data
» data has same layout as in a serial
program
» page granularity of distribution
» cannot distribute heap allocated and
assumed-size arrays
67. Achieving Page Locality
• Page Migration:
» makes initial data placement less
important, e.g., allows sequential data
initialization
» improves locality of a computation
whose data access pattern changes
during the computations
» it is useful for programs that have
stable affinity for long time intervals
68. Achieving Page Locality
» page migration can be combined
with other techniques such as first-
touch or round-robin
» page migration is expensive
» page migration implements CPU
affinity with a delay
69. Reshaped Distribution
• Reshaped distribution directive
» no page granularity limitation
» data layout is most likely different
from the layout in a serial program
» code bloating: each routine that is
passed a reshaped parameter must
have a version specialized for handling
reshaped arrays
70. Reshaped Distribution
» cannot reshape initialized data, heap-
allocated and assumed-size arrays
» overhead of indirect addressing
» side effect: a global structure or
Fortran common block that contains a
reshaped array cannot be declared
threadprivate, and cannot be
localized with the -Wl,-Xlocal option
71. SGI Data Placement Directives
• Regular distribution
!$SGI distribute a(d1[,d2]) [onto(p1[,p2])]
#pragma distribute a[d1][[d2]] [onto(p1[,p2])]
• Reshaped distribution
!$SGI distribute_reshape a(d1[,d2]) [onto(p1[,p2])]
#pragma distribute_reshape a[d1][[d2]]
• Distribution methods are denoted by d1, d2
• Optional clause onto specifies a processor
grid n1 x n2, such that n1/n2 = p1/p2
72. Distribution Specification
Three distribution methods:
• * means no distribution along the
direction in which it appears
• block distributes the elements of an
array in p contiguous chunks of size
ceiling(N/p), where N is the extent in
the distributed direction and p is the
number of processors
73. Distribution Specification
• cyclic(k) distributes the elements of
an array in chunks of size k, assigned to
the p processors in round-robin fashion:
– the first processor gets elements 1..k,
p*k+1..(p+1)*k, … (in Fortran), or
0..k-1, p*k..(p+1)*k-1, … (in C/C++)
– the second processor gets the next
chunk of k elements, and so on
• Interleaved distribution is obtained for
k=1 (the default) and block-cyclic
distribution for k>1
74. Regular Distribution Tip
For regular distribution, one should
distribute the outermost dimension of an
array, to minimize the effect of page
granularity:
• Distribute columns in Fortran
!$SGI distribute A(*, block)
• Distribute rows in C/C++
#pragma distribute a(block,*)
75. Distributed Arrays as Formal Parameters
• Assumed size is not allowed for array
formal parameters which are declared
as distributed
• Specify the array size:
void foo(int n, double a[n])
{
#pragma distribute_reshape a(block)
…
}
76. Reshaped Array Pitfall
If a reshaped array is declared as
threadprivate, the compiler silently
ignores the threadprivate directive:
double a[n];
#pragma omp threadprivate(a)
#pragma distribute_reshape a(block)
77. Parallelizing Code
• Optimize single-CPU performance
– maximize cache reuse
– eliminate cache misses
– compiler flags: -LNO:cache_size2=4m
-OPT:IEEE_arithmetic=3 -Ofast=ip27
• Parallelize as high a fraction of the
work as possible
– preserve cache friendliness
78. Parallelizing Code
– avoid synchronization and scheduling
overhead: partition in few parallel regions,
avoid reduction, single and critical
sections, make the code loop fusion
friendly, use static scheduling
– partition work to achieve load balancing
• Check correctness of parallel code
– run OpenMP compiled code first on one
thread, then on several threads
79. Synchronization Overhead
• Parallel regions, work-sharing, and
synchronization incur overhead
• Edinburgh OpenMP Microbenchmarks,
version 1.0, by J. Mark Bull, are used
to measure the cost of synchronization
on a 32 processor Origin 2000, with
300 MHz R12000 processors, and
compiling the benchmarks with the
MIPSpro Fortran 90 compiler, version
7.3.1.1m
82. Insights
• cost (DO) ~ cost(barrier)
• cost (parallel DO) ~ 2 * cost(barrier)
• cost (parallel) > cost (parallel DO)
• atomic is less expensive than critical
• bad scalability for
– reduction
– mutual exclusion: critical, (un)lock
– single
83. Loop Parallelization
• Identify the loops that are bottleneck to
performance
• Parallelize the loops, and ensure that
– no data races are created
– cache friendliness is preserved
– page locality is achieved
– synchronization and scheduling
overheads are minimized
84. Hurdles to Loop
Parallelization
• Data dependencies among iterations
caused by shared variables
• Input/Output operations inside the loop
• Calls to thread-unsafe code, e.g., the
intrinsic function rtc
• Branches out of the loop
• Insufficient work in the loop body
• The MIPSpro auto-parallelizer helps in
identifying these hurdles
85. Auto-Parallelizer
• The MIPSpro auto-parallelizer (APO)
can be used both for automatically
parallelizing loops and for determining
the reasons which prevent a loop from
being parallelized
• The auto-parallelizer is activated using
command line option apo to the f90,
f77, cc, and CC compilers
• Other auto-parallelizer options: apo
list and mplist
86. Auto-Parallelizer
• Example:
f90 -apo list -mplist myprog.f
• apo list enables APO and generates
the file myprog.list which describes
which loops have been parallelized,
which have not, and why not
• mplist generates the parallelized
source program myprog.w2f.f
(myprog.w2c.c for C) equivalent to the
original code myprog.f
87. For More Information
About the Auto-Parallelizer
• The ProMP Parallel Analyzer View
product consists of the program cvpav
• Cvpav analyzes files created by
compiling with the option –apo keep
• Try out the tutorial examples:
cd /usr/demos/ProMP/omp_tutorial
make
cvpav -f omp_demo.f
88. Data Races
• Parallelizing a loop with data
dependencies causes data races:
unordered or interfering accesses by
multiple threads to shared variables,
which make the values of these
variables different from the values
assumed in a serial execution
• A program with data races produces
unpredictable results, which depend on
thread scheduling and speed.
89. Types of Data Dependencies
• Reduction operations:
const int n = 4096;
int a[n], i, sum=0;
for (i = 0; i < n; i++) {
sum += a[i];
}
– Easy to parallelize using reduction
variables
90. Types of Data Dependencies
– Auto-parallelizer is able to detect
reduction and parallelize it
const int n = 4096;
int a[n], i, sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++) {
sum += a[i];
}
91. Types of Data Dependencies
• Carried dependence on a shared
array, e.g., recurrence:
const int n = 4096;
int a[n], i;
for (i = 0; i < n-1; i++) {
a[i] = a[i+1];
}
– Non-trivial to eliminate, the auto-
parallelizer cannot do it
92. Parallelizing the Recurrence
Idea: Segregate even and odd indices
#define N 16384
int a[N], work[N];
// Save the even-indexed elements at the
// odd indices of work, before the even
// update below overwrites them
#pragma omp parallel for
for ( i = 2; i < N; i+=2 )
{
  work[i-1] = a[i];
}
// Update even indices from odd
#pragma omp parallel for
for ( i = 0; i < N-1; i+=2 )
{
  a[i] = a[i+1];
}
// Update odd indices with the saved even
// values; a[N-1] keeps its value, as in
// the serial loop
#pragma omp parallel for
for ( i = 1; i < N-1; i+=2 )
{
  a[i] = work[i];
}
93. Performing Reduction
The bad scalability of the reduction clause
affects its usefulness, e.g., bad speedup
when summing the elements of a matrix:
#define N (1<<12)
#define M 16
int i, j;
double a[N][M], sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
sum += a[i][j];
94. Parallelizing the Sum
Idea: Use explicit partial sums and combine
them atomically
#define N (1<<12)
#define M 16
int main() {
  double a[N][M], sum = 0.0;
  #pragma distribute a[block][*]
  int i, j;
  // initialization of a not shown
  #pragma omp parallel private(i,j)
  {
    double mysum = 0.0;
    // compute the thread's partial sum
    #pragma omp for nowait
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
        mysum += a[i][j];
    // each thread adds its partial sum
    #pragma omp atomic
    sum += mysum;
  }
}
96. Loop Fusion
• Increases the work in the loop body
• Better serial programs: fusion
promotes software pipelining and
reduces the frequency of branches
• Better OpenMP programs: fusion
reduces synchronization and
scheduling overhead
– fewer parallel regions and work-
sharing constructs
97. Promoting Loop Fusion
• Loop fusion inhibited by statements
between loops which may have
dependencies with data accessed by
the loops
• Promote fusion: reorder the code to get
loops which are not separated by
statements creating data dependencies
• Use one parallel do construct for
several adjacent loops; may leave it to
the compiler to actually perform fusion
98. Fusion-friendly code
Unfriendly:
integer,parameter::n=4096
real :: sum, a(n)
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
sum = 0.d0
do i=1,n
  sum = sum + a(i)
enddo
Friendly (the loops are now adjacent and can be fused):
integer,parameter::n=4096
real :: sum, a(n)
sum = 0.d0
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
do i=1,n
  sum = sum + a(i)
enddo
99. Tradeoffs in Parallelization
• To increase parallel fraction of work
when parallelizing loops, it is best to
parallelize the outermost loop of a
nested loop
• However, doing so may require loop
transformations such as loop
interchanges, which can destroy cache
friendliness, e.g., defeat cache blocking
100. Tradeoffs in Parallelization
• Static loop scheduling in large chunks
per thread promotes cache and page
locality but may not achieve load
balancing
• Dynamic and interleaved scheduling
achieve good load balancing but cause
poor locality of data references
101. Tuning the Parallel Code
• Examine resource usage, e.g.,
execution time, number of floating
point operations, primary, secondary,
and TLB cache misses and identify
– the performance bottleneck
– the routines generating the
bottleneck
Useful SGI tools: perfex, ssrun, prof
• Correct the performance problem and
verify the desired speedup.
102. Investigating Data Placement
• In C, use the SGI function syssgi
#include <sys/syssgi.h>
ptrdiff_t syssgi (int request, …)
with a request value of SGI_PHYSP
• In Fortran, use the intrinsic function
integer dsm_home_threadnum
thread = dsm_home_threadnum(arr(i))
– see lab exercise
103. SGI Performance Tools
• MIPSpro compilers and libraries
• ProDev WorkShop (formerly CASE
Vision): cvd, cvperf, cvstatic
• ProMP: parallel analyzer: cvpav
• SpeedShop: profiling execution and
reporting profile data
• Perfex: per process event count statistics
• dprof: address space profiling
104. Performance Data
• Timing: time spent in various sections of
the program
• Events captured by the performance
counters of the R1X000 CPU.
- 32 events, divided in two equal sets
- Examples: clock cycles, L1 and L2 cache,
and TLB misses, floating point operations,
number of instructions
• I/O system calls, heap malloc and
free, floating point exceptions
105. Speedshop Profiling
• Speedshop is a tool package which
supports profiling at the function and
source-line level
• Uses several methods for collecting
information
– PC and call stack sampling
– basic block counting
– exception tracing
106. Perfex Tool
• Provides event statistics at the
process level
• Reports the number of occurrences of
the events captured by the R1X000
hardware counters in each process of
a parallel program
• In addition, reports information
derived from the event counts, e.g.
MFLOPS, memory bandwidth
107. Perfex Tool
perfex -mp [other options] a.out
• To profile secondary cache misses in the
data cache (event 26) and instruction
cache (event 10):
perfex -mp -e 26 -e 10 a.out
• To multiplex all 32 events (-a) , get time
estimates (-y) and trace exceptions (-x)
perfex -a -x -y a.out
108. Speedshop
• Data Collection
– ssrun main data collection tool.
Running it on a.out creates the files
a.out.experiment.mPID and a.out.experiment.pPID
– ssusage summary of resources
used, similar to the time commands
– ssapi API for caliper points
• Data Analysis
– prof
109. Ssrun sampling Experiments
Statistical sampling, triggered by a preset
time base or by overflow of hardware
counters
-pcsamp PC sampling gives user CPU
time
-usertime call stack sampling, gives
user and system CPU time
-totaltime call stack sampling, gives
walltime
110. Ssrun sampling
Sampling triggered by overflow of R1X000
hardware counters
-gi_hwc Graduated instructions
-gfp_hwc Floating point instructions
-ic_hwc Misses in L1 I-cache
-dc_hwc Misses in L1 D-cache
-dsc_hwc Data misses in L2 cache
-tlb_hwc TLB misses
-prof_hwc User selected event
111. User selected sampling
• Select a hardware counter, say
secondary cache misses (26), and an
overflow value
setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 26
setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 99
• Run the experiment
ssrun -prof_hwc a.out
• Default counter is L1 I-cache misses
(9) and default overflow is 2,053
112. Ssrun ideal and tracing
experiments
• Ideal Experiment: basic block counting
-ideal counts the number of times
each basic block is executed and
estimates the time. Descendant of pixie
• Tracing
-fpe floating point exceptions
-io file open, read, write, close
-heap malloc and free
113. Prof Tool
• Display event counts or time in routines
sorted in descending order of the counts
• Source line granularity with command line
option -h or -l
• For ideal and usertime experiments get
call hierarchy with -butterfly option
• For ideal experiment can get architecture
information with the -archinfo option
• Cut off report at top 100-p% with -quit p%
114. Address Space Profiling: dprof
• Gives per process histograms of page
accesses
• Sampling with a specified time base
– the current instruction is interrupted
– the address of the operand referenced by the
interrupted instruction is recorded
• Time base is either the interval timer or
an R1X000 hardware counter overflow
• R1X000 counters: man r10k_counters
115. Data Profiling: dprof
• Syntax
dprof [-hwpc [-cntr n] [-ovfl m]]
[-itimer [-ms t]] [-out profile_file]
a.out
• Default is interval timer ( -itimer )
with t=100 ms
• Can select hardware counter (-hwpc)
which has the defaults
n = 0 is the R1X000 cycle counter
m=10000 is the counter’s overflow value
116. The Future of OpenMP
• Data placement directives will
become part of OpenMP
– affinity scheduling may be a useful
feature
• It is desirable to add parallel
input/output to OpenMP
• Java binding of OpenMP
117. Image class
class Image {
public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short** mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
118. Threshold – OpenMP #1
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving the
// pragma, but must choose at compile time
119. Threshold – OpenMP #2
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than the previous code
120. References
Lawrence Livermore National Laboratory
www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html
Ohio Supercomputing Center
oscinfo.osc.edu/training/openmp/big
Minnesota Supercomputing Institute
www.msi.umn.edu/tutorials/shared_tutorials/openMP
Edinburgh OpenMP Microbenchmarks
www.epcc.ed.ac.uk/research/openmpbench
Mattson and Eigenmann Tutorial
dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00introOMP.pdf
Mattson and Eigenmann Advanced OpenMP
dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00advancedOMP.pdf