Report on High Performance Computing

Report on
High performance computing
Group:- return 0;
Name:- Prateek Sarangi

Table of content
Chapter
Number
Topic Page
1 Introduction
2 Pointers
3 OpenMP and MPI programs
4 QuantumEspresso, itac and
VTune

1. Introduction
Introduction to parallel computing
Parallel computing is the simultaneous execution of the same task, split into subtasks,
on multiple processors in order to obtain the results faster. Traditionally, a task sent to a
computer was accomplished one process at a time and this was termed as serial
computing. Parallel computing is a method in computation in which two or more
processors (or processor cores) handle different parts (processes) of an overall task
simultaneously.
How is parallel computing is done?
An operating system can ensure that different tasks and user programmes are run in
parallel on the available cores. However, for a serial software programme to take full
advantage of the multi-core architecture ,the programmer needs to restructure and
parallelise the code. A speed-up of application software runtime can no longer be
achieved through frequency scaling, hence programmers parallise their software code to
take advantage of the increasing computing power of multicore architectures.
Concept of Temporal Parallelism:
In order to explain what is meant by parallelism inherent in the solution of a problem,
let us discuss an example of submission of electricity bills. Suppose there are 10000
residents in a locality and they are supposed to submit their electricity bills in one
office. Let us assume the steps to submit the bill are as follows:
1) Go to the appropriate counter to take the form to submit the bill.
2) Submit the filled form along with cash.
3) Get the receipt of submitted bill. Assume that there is only one counter with just
single office person performing all the tasks of giving application forms, accepting the
forms, counting the cash, returning the cash if the need be, and giving the receipts.

This situation is an example of sequential execution. Let us the approximate time taken
by various of events be as follows:
Giving application form = 5 seconds Accepting filled application form and counting the
cash and returning, if required = 5mnts, i.e., 5 ×60= 300 sec. Giving receipts = 5
seconds. Total time taken in processing one bill = 5+300+5 = 310 seconds. Now, if we
have 3 persons sitting at three different counters with
i) One person giving the bill submission form
ii) One person accepting the cash and returning, if necessary and
iii) One person giving the receipt. The time required to process one bill will be 300
seconds because the first and third activity will overlap with the second activity which
takes 300 sec. whereas the first and last activity take only 10 secs each.
This is an example of a parallel processing method as here 3 persons work in parallel. As
three persons work in the same time, it is called temporal parallelism. However, this is
a poor example of parallelism in the sense that one of the actions i.e., the second action
takes 30 times of the time taken by each of the other two actions.
Concept of Data Parallelism :
Consider the situation where the same problem of submission of ‘electricity bill’ is
handled as follows: Again, three are counters. However, now every counter handles all
the tasks of a resident in respect of submission of his/her bill. Again, we assuming that
time required to submit one bill form is the same as earlier, i.e., 5+300+5=310 sec. We
assume all the counters operate simultaneously and each person at a counter takes 310
seconds to process one bill. Then, time taken to process all the 10,000 bills will be 310×
(9999 / 3) + 310×1sec. This time is comparatively much less as compared to time taken
in the earlier situations, viz. 3100000 sec. and 3000000 sec respectively. The situation
discussed here is the concept of data parallelism. In data parallelism, the complete set
of data is divided into multiple blocks and operations on the blocks are applied parallely.
As is clear from this example, data parallelism is faster as compared to earlier
situations. Here, no synchronisation is required between counters(or processers). It is
more tolerant of faults. The working of one person does not effect the other. There is no
communication required between processors. Thus, interprocessor communication is
less. Data parallelism has certain disadvantages.

These are as follows:
i) The task to be performed by each processor is predecided i.e., asssignment of load is
static.
ii) It should be possible to break the input task into mutually exclusive tasks. In the
given example, space would be required counters. This requires multiple hardware
which may be costly.
The estimation of speedup achieved by using the above type of parallel processing is as
follows: Let the number of jobs = m Let the time to do a job = p If each job is divided
into k tasks, Assuming task is ideally divisible into activities, as mentioned above then,
Time to complete one task = p/k Time to complete n jobs without parallel processing =
n.p Time to complete n jobs with parallel processing = k n * p time to complete the task
if parallelism is not used Speed up time to complete the task if parallelism is used.
Pros of Parallel Computing:
• Parallel processing saves time and is better accustomed to time limited tasks.
• Parallel processing is better suited to model or simulate complex real world
phenomena.
• Parallel processing provides theoretically infinite scalability provided enough
resources.
• Many problems are so large and/or complex that it is impractical or impossible to
solve them on a single computer, especially given limited computer resources.
Limitations of Parallel Computing:
• Programming to target parallel architecture is a more time consuming and requires
additional expenditure.
• The overhead (due to data transfer, synchronization, communication & coordination,
thread creation/destruction, etc) associated with parallel programming can
sometimes be quite large and exceed the gains of parallelization.
• Parallel solution are harder to implement, test, debug and support compared to
simpler serially programmed solutions.
Fundamentals of Computing and Multithreading:
» Computer architecture is a set of rules and methods that describe the functionality,
organization, and implementation of computer systems. Examples – the von Neumann
architecture.
» A thread of execution is the smallest sequence of programmed instructions that can
be managed independently by a scheduler. The implementation of threads and
processes differs between operating systems, but in most cases a thread is a
component of a process.

The von Neumann Architecture:
» It is the computer architecture based on descriptions by the Hungarian
mathematician and physicist John von Neumann and others who authored the
general requirements for an electronic computer in their 1945 papers - First Draft of
a Report on the EDVAC.
» The term "von Neumann architecture" has evolved to mean any stored-program
computer in which an instruction fetch and a data operation are kept in a shared
memory and cannot occur at the same time because they share a common bus.
» Differs from earlier computers which were programmed through "hard wiring".
Von Neumann Architecture Scheme:
The von Neumann Bottleneck:
» Due to the data memory and the program memory sharing a single bus in this von
Neumann architecture, the limited throughput between the CPU and the memory
compared to the memory available in most cases.
» Because the single bus can only access one of the two classes of memory at a time,
throughput is lower than the rate at which the CPU can work.
Havard Architechture
The Harvard architecture is a computer architecture with separate storage and signal
pathways for instructions and data. It contrasts with the von Neumann architecture,
where program instructions and data share the same memory and pathways. In a
Harvard architecture, there is no need to make the two memories share characteristics.
In particular, the word width, timing, implementation technology, and memory address
structure can differ. In some systems, instructions for pre-programmed tasks can be
stored in read-only memory while data memory generally requires read-write memory.
Also, a Harvard architecture machine has distinct code and data address spaces:

instruction address zero is not the same as data address zero. Instruction address zero
might identify a twenty-four-bit value, while data address zero might indicate an eight-
bit byte that is not part of that twenty-four-bit value.
» Flynn’s Classical Taxonomy
» Is one of the most widely used classification of parallel computers.
SISD:
» Short for single instruction, single data. A type of parallel computing architecture
that is classified under Flynn's taxonomy. A single processor executes a single
instruction stream, to operate on data stored in a single memory. There is often a
central controller that broadcasts the instruction stream to all the processing
elements.
MISD:
» Short for multiple instruction, single data. A type of parallel computing architecture
that is classified under Flynn's taxonomy. Each processor owns its control unit and its

local memory, making them more powerful than those used in SIMD computers. Each
processor operates under the control of an instruction stream issued by its control
unit, therefore the processors are potentially all executing different programs on
single data while solving different sub-problems of a single problem. This means that
the processors usually operate asynchronously.
SIMD:
» Short for single instruction, multiple data. A type of parallel computing architecture
that is classified under Flynn's taxonomy. A single computer instruction performs the
same identical action (retrieve, calculate, or store) simultaneously on two or more
pieces of data. Typically, this consists of many simple processors, each with a local
memory in which it keeps the data which it will work on. Each processor
simultaneously performs the same instruction on its local data progressing through
the instructions in lock-step, with the instructions issued by the controller processor.
The processors can communicate with each other in order to perform shifts and other
array operations.
MIMD :
• Short for multiple instruction, multiple data. A type of parallel computing
architecture that is classified under Flynn's taxonomy. Multiple computer
instructions, which may or may not be the same, and which may or may not be
synchronized with each other, perform actions simultaneously on two or more
pieces of data. The class of distributed memory MIMD machines is the fastest growing
segment of the family of high-performance computers.
Shared Memory:
» Shared memory generally have in common the ability for all processors to access the
memory as a shared address space.
» Changes in a memory location effected by one processor are visible to all other
processors.

Distributed Memory:
» Distributed memory generally requires a communication network to connect inter-
processor memory.
» Because each processor has its own local memory, it operates independently. Changes
it makes to its local memory have no effect on the memory of other processors.
Hence, the concept of cache coherency does not apply.
Hybrid Distributed Memory:
» Hybrid Distributed Memory implements both shared and distributed memory
architectures.
» Current trends seem to indicate that this type of memory architecture will continue
to prevail and increase at the high end of computing for the foreseeable future.

Why use Multithreading?
With the introduction of multiple cores, multithreading has become extremely important
in terms of the efficiency of your application. With multiple threads and a single core,
your application would have to transition back and forth to give the illusion of
multitasking.
With multiple cores, your application can take advantage of the underlying hardware to
run individual threads through a dedicated core, thus making your application more
responsive and efficient. Again, multithreading basically allows you to take full
advantage of your CPU and the multiple cores, so you don’t waste the extra horsepower.
Developers should make use of multithreading for a few reasons:
• Higher throughput
• Responsive applications that give the illusion of multitasking.
• Efficient utilization of resources. Thread creation is light-weight in comparison to
spawning a brand new process and for web servers that use threads instead of
creating a new process when fielding web requests, consume far fewer resources.
Fundamentals of Hyperthreading?
Modern processors can only handle one instruction from one program at any given point
in time. Each instruction that is sent to the processor is called a thread. What I mean is
that even though it looks like you're multitasking with your computer (running more than
one program at a time) you're really not .

Computer architecture is a set of rules and methods that describe the functionality,
organization, and implementation of computer systems. Examples – the von Neumann
architecture and the Harvard architecture
A thread of execution is the smallest sequence of programmed instructions that can be
managed independently by a scheduler. The implementation of threads and processes
differs between operating systems, but in most cases a thread is a component of a
process.
The CPU will divide it's time and power evenly between all the programs by switching
back and forth. This little charade of switching back and forth tricks the end user (you
and me) and gives us the sense of multitasking.
Dual CPU based systems can work on two independent threads of information from the
software but each processor is still limited at working on one thread at any given
moment though. The software must be able to dish out two separate pieces of
information like Win2000 or Adobe Photoshop for a dual processor system to be really
used, by the way.
For each processor core that is physically present, the operating system addresses two
virtual (logical) cores and shares the workload between them when possible. The main
function of hyper-threading is to increase the number of independent instructions in
the pipeline; it takes advantage of superscalar architecture, in which multiple
instructions operate on separate data in parallel. With HTT, one physical core appears as
two processors to the operating system, allowing concurrent scheduling of two processes
per core. In addition, two or more processes can use the same resources: If resources for
one process are not available, then another process can continue if its resources are
available.
Simultaneous multithreading
The most advanced type of multithreading applies to superscalar processors. Whereas a
normal superscalar processor issues multiple instruction from a single thread every CPU
cycle, in simultaneous multithreading (SMT) a superscalar processor can issue
instructions from multiple threads every CPU cycle. Recognizing that any single thread
has a limited amount of instruction-level parallelism, this type of multithreading tries to
exploit parallelism available across multiple threads to decrease the waste associated
with unused issue slots.
For example:
Cycle i: instructions j and j + 1 from thread A and instruction k from thread B are
simultaneously issued.
Cycle i + 1: instruction j + 2 from thread A, instruction k + 1 from thread B, and
instruction m from thread C are all simultaneously issued.
Cycle i + 2: instruction j + 3 from thread A and instructions m + 1 and m + 2 from thread
C are all simultaneously issued.

To distinguish the other types of multithreading from SMT, the term "temporal
multithreading" is used to denote when instructions from only one thread can be issued
at a time. In addition to the hardware costs discussed for interleaved multithreading,
SMT has the additional cost of each pipeline stage tracking the thread ID of each
instruction being processed. Again, shared resources such as caches and TLBs have to be
sized for the large number of active threads being processed. Implementations include
DEC (later Compaq) EV8 (not completed), Intel Hyper-Threading Technology, IBM
POWER5, Sun Microsystems UltraSPARC T2, Cray XMT, and AMD Bulldozer and Zen
microarchitectures.
PARALLELALGORITHMIC PARADIGMS
A major advance in parallel algorithms has been the identiﬁcation of fundamental algorithmic
techniques. Some of these techniques are also used by sequential algorithms,but play a more
prominent role in parallel algorithms.While others are unique to parallelism.
TYPES OF PARALLELALGORITHM METHODS:
1. DIVIDE AND CONQUER :
• Divide and conquer is a natural paradigm for parallel algorithms.  
• After dividing a problem into two or more sub-problems, the sun-problems can
be solved  
in parallel manner  
• Generally, the sub-problems are solved recursively and thus the next divide step
yields  
even more sub problems to be solved in parallel.  
• Divide and conquer is proven to be one of the most powerful techniques for
solving  
problems in parallel with applications ranging from linear systems to computer
graphics  
and from factoring large numbers to N-Body simulations.  
• For Example- While computing the convex -hull of a set of n points in the plane
(to  
compute the smallest convex hull that enclose all of the points). This can be
implemented by splitting the points into the leftmost and rightmost, recursively

ﬁnding the convex-hull of each set in parallel and then merging the two
resulting hulls.  
2. GREEDY ALGORITHM
• In greedy algorithm of optimising solution, the best solution is at any moment.  
• It is easy to apply for complex problems. I t decides which step will provide the
most  
accurate solution in the next step.  
• This algorithm is called greedy because when the optimal solution to the smaller instances
provided the algorithm does not consider the total program as a whole. Once a solution is
considered, this algorithm never considers the same solution again.  
• This algorithm works recursively by creating a group of objects from the smallest possible
component parts. Recursion is a procedure to solve a problem in which the solution to the
speciﬁc problem is dependent on the solution of the smaller instance of that problem.  
Clusters: What Is an HPC Cluster?
An HPC cluster consists of hundreds or thousands of compute servers that are
networked together. Each server is called a node. The nodes in each cluster work in
parallel with each other, boosting processing speed to deliver high-performance
computing.
HPC Use Cases
Deployed on premises, at the edge, or in the cloud, HPC solutions are used for a variety
of purposes across multiple industries. Examples include:
1. Research labs. HPC is used to help scientists find sources of renewable energy,
understand the evolution of our universe, predict and track storms, and create new
materials.
2. Media and entertainment. HPC is used to edit feature films, render mind-blowing
special effects, and stream live events around the world.
3. Oil and gas. HPC is used to more accurately identify where to drill for new wells and
to help boost production from existing wells.

4. Artificial intelligence and machine learning. HPC is used to detect credit card fraud,
provide self-guided technical support, teach self-driving vehicles, and improve cancer
screening techniques.
5. Financial services. HPC is used to track real-time stock trends and automate trading.
6. HPC is used to design new products, simulate test scenarios, and make sure that parts
are kept in stock so that production lines aren’t held up.
7. HPC is used to help develop cures for diseases like diabetes and cancer and to enable
faster, more accurate patient diagnosis.
Performance. Delivers up to 1 million random read IOPS and 13GB/sec sustained
(maximum burst) write bandwidth per scalable building block. Optimized for both flash
and spinning media, the NetApp HPC solution includes built-in technology that monitors
workloads and automatically adjusts configurations to maximize performance.
Reliability. Fault-tolerant design delivers greater than 99.9999% availability, proven by
more than 1 million systems deployed. Built-in Data Assurance features help make sure
that data is accurate with no drops, corruption, or missed bits. Easy to deploy and
manage. Modular design, on-the-fly (“cut and paste”) replication of storage blocks,
proactive monitoring, and automation scripts all add up to easy, fast and flexible
management.
Scalability. A granular, building-block approach to growth that enables seamless
scalability from terabytes to petabytes by adding capacity in any increment—one or
multiple drives at a time. Lower TCO. Price/performance-optimized building blocks and
the industry’s best density per delivers low power, cooling, and support costs, and 4-
times lower failure rates than commodity HDD and SSD devices.
Green Computing:
Green computing is a contemporary research topic to address climate and energy
challenges. For instance, in order to provide electricity for large-scale cloud
infrastructures and to reach exascale computing, we need huge amounts of energy.
Thus, green computing is a challenge for the future of cloud computing and HPC.
Alternatively, clouds and HPC provide solutions for green computing and climate
change.Green Computing provides an incentive for computing engineers to come up with
such an method so that HPCs can be highly efficient and green for the environment at
the same time by being energy efficient.
For a data center like the MGHPCC, energy efficiency means minimizing the amount of
non-computing “overhead” energy used for cooling, lighting, and power distribution.
Energy modeling during the design phase estimated a 43% reduction in energy costs
compared to the baseline standard (ASHRAE Standard 90.1-2007), and a 44% reduction in
lighting power density for building exteriors below the baseline standard.
Low Environmental Impact
A second major requirement for LEED certification seeks to reduce negative
environmental impacts. As the MGHPCC LEED Certification Review Report shows,

environmental design for LEED Certification requires attention to numerous details,
including construction methods and materials, landscape and site design, and water
conservation. For example, 97% of the construction waste generated while building the
MGHPCC was recycled or reused instead of going to landfills; materials high in recycled
content were used wherever possible, and landscaping was designed to minimize water
use and storm water runoff.
Algorithm Paradigm
An algorithm is a step by step procedure for solving a problem. Paradigm refers to the
“pattern of thought” which governs scientific apprehension during a certain period of
time. A paradigm can be viewed as a very high level algorithm for solving a class of
problems. Various algorithms paradigms include Brute Force Paradigm, Divide and
Conquer, Backtracking, Greedy Algorithm and Dynamic Programming Paradigm. Various
Algorithms Paradigms are used to solve many types of problem according to the type of
problem faced.
Overview on Parallel Programming Paradigms
Paradigm as Shared Memory:
Usually indicated as Multithreading Programming
• Commonly implemented in scientific computing using the OpenMP standard (directive
based)
• Thread management overhead
• Limited scalability
• Write access to shared data can easily lead to race conditions and incorrect data
Total Parallel Overhead:
• The overheads incurred by a parallel program are encapsulated into a single
expression referred to as the overhead function. We define overhead function
or total overhead of a parallel system as the total time collectively spent by all the
processing elements over and above that required by the fastest known sequential
algorithm for solving the same problem on a single processing element. We denote
the overhead function of a parallel system by the symbol To.
• The total time spent in solving a problem summed over all processing elements
is pTP . TS units of this time are spent performing useful work, and the remainder is
overhead. Therefore, the overhead function (To) is given by

Open MP
OpenMP is an Application Program Interface (API). It provides a portable, scalable model
for developers of shared memory parallel applications. The API supports C/C++ and
Fortran on a wide variety of architectures.
It consists of 3 main parts –
1. Compiler directives (eg #pragma omp parallel)
2. Runtime Library Routines (eg omp_get_num_threads())
3. Environment Variables (eg OMP_NUM_THREADS)
Why Open MP?
It provides a standard among a variety of shared memory architectures/platforms and
establishes a simple and limited set of directives for programming shared memory
machines. Significant parallelism can be implemented by using just 3 or 4 directives. It
provides capability to incrementally parallelize a serial program, unlike message-passing
libraries which typically require an all or nothing approach. For High Performance
Computing (HPC) applications, OpenMP is combined with MPI for the distributed memory
parallelism. This is often referred to as Hybrid Parallel Programming. OpenMP is used for
computationally intensive work on each node. MPI is used to accomplish communications
and data sharing between nodes.
How Open MP works-
1. Provides a standard among a variety of shared memory architectures/platforms.
2. Establish a simple and limited set of directives for programming shared memory
machines.
3. Significant parallelism can be implemented by using just 3 or 4 directives.
4. Provide capability to incrementally parallelize a serial program, unlike message-
passing libraries which typically require an all or nothing approach
5. For High Performance Computing (HPC) applications, OpenMP is combined with MPI
for the distributed memory parallelism. This is often referred to as Hybrid Parallel
Programming. OpenMP is used for computationally intensive work on each node
6. MPI is used to accomplish communications and data sharing between nodes.

Clauses for Parallel constructs
#pragma omp parallel [clause, clause, ...]
• Shared
• Private
• Firstprivate
• Lastprivate
• Nowait
• If
• Reduction
• Schedule
• Default
Private Clause
• The values of private data are undefined upon entry to and exit from the specific construct.
• Private copies of the variable are initialized from the original object when entering the
region.
• Enables to affect the data-scope attributes of variables.
Firstprivate Clause
• The clause combines behavior of private clause with automatic initialization of the variables
in its list.

• Specifies that each thread should have its own instance of a variable, and that the variable
should be initialized with the value of the variable, because it exists before the parallel
construct.
Lastprivate Clause
• Performs finalization of private variables.
• Each thread has its own copy.
Shared Clause
• Shared among team of threads.
• Each thread can modify shared variables.
BENEFITS
• Incrementally Parallelization of sequential code.
• Leave thread management to compiler.
• Directly supported by compiler.
1. Compiler Directives- It’s an instruction to the compiler to change how it’s compiling
the code, rather than a piece of the code itself. #include and #define in C/C++ are
considered directives, but they’re instructions to another program - the preprocessor. A
true compiler directive might be something like a pragma, which is a compiler-specific
command for changing what the compiler does, typically error handling.
OpenMP compiler directives are used for various purposes:
i. Spawning a parallel region
ii. Dividing blocks of code among threads
iii. Distributing loop iterations between threads
iv. Serializing sections of code
v. Synchronization of work among threads.
vi. Syntax- #pragma omp parallel default (shared)
2. Run time Libraries- These routines are used for a variety of purposes:
Setting and querying the number of threads. Setting and querying the
dynamic threads features and querying if in a parallel region, and at what
level. for eg.
#include<omp.h>
int omp_get_num_threads(void)

3.Environment Variables- OpenMP provides several environment variables for
controlling the execution of parallel code at run-time.These environment variables can
be used to control such things as:
Setting the number of threads.
Specifying how loop interations are divided
Binding threads to processors.
For eg. OMP_GET_THREADS, OMP_STACKSIZE.
Message Parsing Interface
The Message Passing Interface (MPI) is a library specification that allows HPC to pass
information between its various nodes and clusters. HPC uses OpenMPI, an open-source,
portable implementation of the MPI standard. OpenMPI contains a complete
implementation of version 1.2 of the MPI standard and also MPI-2. Compilers used by MPI
include GNU implementation of C, C++ and Fortran.
Moore's Law:
Moore's Law refers to Moore's perception that the number of transistors on a microchip
doubles every two years, though the cost of computers is halved. Moore's Law states that
we can expect the speed and capability of our computers to increase every couple of
years, and we will pay less for them. Another tenet of Moore's Law asserts that this
growth is exponential. Today, however, the doubling of installed transistors on silicon
chips occurs closer to every 18 months instead of every two years.
• Moore's Law states that the number of transistors on a microchip doubles about
every two years, though the cost of computers is halved.
• In 1965, Gordon E. Moore, the co-founder of Intel, made this observation that
became Moore's Law.
• Another tenet of Moore's Law says that the growth of microprocessors is
exponential.

• Moore's Law states that the number of transistors on a microchip doubles about
every two years, though the cost of computers is halved.
• In 1965, Gordon E. Moore, the co-founder of Intel, made this observation that
became Moore's Law.
• Another statement of Moore's Law says that the growth of microprocessors is
exponential.
Moore's Law has been a driving force of technological and social change, productivity,
and economic growth that are hallmarks of the late-twentieth and early twenty-first
centuries.
Parallel Architecture:
MPI follows the SPMD style, i.e., it splits the workload into different tasks that are
executed on multiple processors. Originally, MPI was designed for distributed memory
architectures, which were popular at that time. Figure below illustrates the
characteristics of these traditional systems, with several CPUs connected to a network
and one memory module per CPU. A parallel MPI program consists of several processes
with associated local memory. In the traditional point of view each process is
associated with one core. Communication among processes is carried out through the
interconnection network by using send and receive routines.
As architectural trends changed, the majority of current clusters contain shared-
memory nodes that are interconnected through a network forming a hybrid
distributed-memory/shared-memory system. Modern clusters could even include
manycore accelerators attached to the nodes. Nowadays MPI implementations are able
to spawn several processes on the same machine. However, in order to improve
performance, many parallel applications use the aforementioned hybrid approach: one
MPI process per node that calls multithreaded [3,10] or CUDA [1,13] functions to fully
exploit the compute capabilities of the existing CPUs and accelerators cards within each
node.

Collective communication and synchronization points:
One of the things to remember about collective communication is that it implies a
synchronization point among processes. This means that all processes must reach a point
in their code before they can all begin executing again.
MPI has a special function that is dedicated to synchronizing processes:
MPI_Barrier(MPI_Comm communicator)
The name of the function is quite descriptive - the function forms a barrier, and no
processes in the communicator can pass the barrier until all of them call the function.
Here’s an illustration. Imagine the horizontal axis represents execution of the program
and the circles represent different processes.
Process zero first calls MPI_Barrier at the first time snapshot (T 1). While process zero is
hung up at the barrier, process one and three eventually make it (T 2). When process
two finally makes it to the barrier (T 3), all of the processes then begin execution again
(T 4).
MPI_Barrier can be useful for many things. One of the primary uses of MPI_Barrier is to
synchronize a program so that portions of the parallel code can be timed accurately.
Want to know how MPI_Barrier is implemented? Sure you do :-) Do you remember the
ring program from the sending and receiving tutorial? To refresh your memory, we wrote
a program that passed a token around all processes in a ring-like fashion. This type of
program is one of the simplest methods to implement a barrier since a token can’t be
passed around completely until all processes work together.
One final note about synchronization - Always remember that every collective call you
make is synchronized. In other words, if you can’t successfully complete an MPI_Barrier,
then you also can’t successfully complete any collective call. If you try to call
MPI_Barrier or other collective routines without ensuring all processes in the
communicator will also call it, your program will idle.

Trapezoidal Rule:
The Trapezoidal Rule for approximating b∫af(x)dx is given by
b∫af(x)dx≈Tn=Δx2[f(x0)+2f(x1)+2f(x2)+…+2f(xn−1)+f(xn)],
where Δx=b−an and xi=a+iΔx.
As n→∞, the right-hand side of the expression approaches the definite
integral b∫af(x)dx.

The program that follows: 
Trapezoidal approach:
The integrating factors for the trapezoidal estimation:

Syntax:
1. MPI_Comm_rank(.....)
MPI_Comm_rank (MPI_Comm communicator , int * rank) ;
2. MPI_Comm_rank (MPI_Comm communicator , int * rank) ;
MPI_Comm_size(.....):
MPI_Comm_size (MPI_Comm communicator , int * size) ;

3. MPI_Send(.....)
MPI_Send (void* msg_buffer , Int msg_size, MPI_Datatype msg_type, Int
destination, Int tag , MPI_Comm communicator ) ;
4. MPI_Recv(.....)
MPI_Recv (void* msg_buffer , Int buf_size, MPI_Datatype buf_type, Int source, Int
tag , MPI_Comm communicator, MPI_Status* );

Successful transmission of Message:
MPI_Send (void* msg_buffer , Int msg_size, MPI_Datatype msg_type, Int
destination, Int tag , MPI_Comm communicator ) ;
MPI_Recv (void* msg_buffer , Int buf_size, MPI_Datatype buf_type, Int source, Int
tag , MPI_Comm communicator, MPI_Status* );

Types of communication for different cases:
Point-to-Point Communication
The most elementary form of message-passing communication involves two nodes,
one passing a message to the other. Although there are several ways that this might
happen in hardware, logically the communication is point-to-point: one node calls a
send routine and the other calls a receive.
A message sent from a sender contains two parts: data (message content) and the
message envelope. The data part of the message consists of a sequence of successive
items of the type indicated by the variable datatype. MPI supports all the basic C
datatypes and allows a more elaborate application to construct new datatypes at
runtime (discussed in an advanced topic tutorial). The basic MPI datatypes for C are
MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_COMPLEX, MPI_CHAR. The message envelope
contains information such as the source (sender), destination (receiver), tag and
communicator.
Order:
Messages are non-overtaking: if a sender sends two messages in succession to the
same destination and both match the same receive, then this operation cannot
receive the second message while the first is still pending. If a receiver posts two
receives in succession and both match the same message, then this message cannot
satisfy the second receive operation, as long as the first one is still pending. This
requirement facilitates matching sends to receives. It guarantees that message-
passing code is deterministic if processes are single-threaded and the wildcard
MPI_ANY_SOURCE is not used in receives.
Progress:
If a pair of matching send and receives have been initiated on two processes, then at
least one of these two operations will complete, independent of other action in the
system. The send operation will complete unless the receive is satisfied and
completed by another message. The receive operation will complete unless the
message sent is consumed by another matching receive that was posted at the same
destination process.
Avoid a Deadlock:
It is possible to get into a deadlock situation if one uses blocking send and receive.
Here is a fragment of code to illustrate the deadlock situation:
MPI_Comm_rank(comm,&rank);
if (rank == 0) {
MPI_Recv(recvbuf,count,MPI_REAL,1,tag,comm,&status);
MPI_Send(sendbuf,count,MPI_REAL,1,tag,comm);
}
elseif (rank == 1) {

}
The receive operation of the first process must complete before its send, and can
complete only if the matching send of the second process is executed. The receive
operation of the second process must complete before its send and can complete
only if the matching send of the first process is executed. This program will always
deadlock. To avoid deadlock, one can use one of the following two examples:
if (rank == 0) {
MPI_Send(sendbuf,count,MPI_REAL,1,tag,comm,);
}
elseif(rank == 1) {
}
or
if (rank == 0) {
}
elseif(rank == 1) {
}
Synchronous Blocking - Point to Point Communication: A nonblocking send call
initiates the send operation, but does not complete it. The send start call will return
before the message is copied out of the send buffer. A separate send complete call is
needed to complete the communication, i.e., to verify that the data have been
copied out of the send buffer. Here is the syntax of the nonblocking send operation:
MPI_Isend(void* buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm,
MPI_Request *request);

Non-blocking communications:
For the moment, we have only seen blocking point-to-point communication. That
means that when a process sends or receive information, it has to wait (Animation
from Cornell virtual workshop) for the transmission to end to get back to what it was
doing.
In this case, process 0 has some information to send to process 1. But both are
working on very different things and, as such, take different time to finish their
computations. Process 0 is ready to send its data first, but since process 1 has not
finished its own computations, process 0 has to wait for process 1 to be ready before
getting back to its own work. Process 1 finishes treating the data really quickly and
now waits for process 0 to finish for getting new data. This way of sending messages
is possible in MPI and called non-blocking communications.
What is happening in MPI is a bit different. Non-blocking communications always
require to be initialised and completed. What that means is that now, we will call a
send and a receive commands to initialise the communication. Then, instead of
waiting to complete the send (or the receive), the process will continue working, and
will check once in a while to see if the communication is completed. This might be a
bit obscure so let's work an example together. First in pseudo-code and then in C++.
Let's imagine that process 0 has first to work for 3 seconds, then for 6. At the same
time, process 1 has to work for 5 seconds, then for 3. They must synchronise some
time in the middle, and at the end.

Network buffer mechanism:
 
Non - Blocking point to point communication :
● MPI_Isend (&buf,count,datatype,dest,tag,comm,&request)
● MPI_Irecv (&buf,count,datatype,source,tag,comm,&request)
● MPI_Issend (&buf,count,datatype,dest,tag,comm,&request) ○ Synchronous non-
blocking send.
● Check for Asynchronous Transfer :
○ MPI_Test(MPI_Request *request, int *flag, MPI_Status * status) ■
Flag: ● if flag == 0, the send/receive operation is not yet complete
● if flag != 0, the send/receive operation is complete and the variable status
contains information about the messag
■ status: contains information about the message (use the information only if flag !=
0
Collective Communication Routines:
MPI_Bcast ( void* data , Int
count , MPI_Datatype datatype , Int
source_process , MPI_Comm comm ) ;

Eg : MPI_Bcast(a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD) ;
MPI_Scatter ( void* send_buffer , Int
send_count , MPI_Datatype send_datatype , void*
recv_buffer , Int recv_count ,
MPI_Datatype recv_datatype , Int source_process ,
MPI_Comm comm ) ;
Eg : MPI_Scatter ( a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm) ;
SCHEDULER
Scheduling parallel jobs has been an active investigation area. The scheduler has
to deal with heterogeneous workloads and try to obtain throughputs and response
times such that ensures good performance.
SLURM
-It stands for Simple Linux Utility for Resource Management . 
-It is a scheduling software that controls all the jobs running on the Hipergator
cluster . -It uses a best ﬁt algorithm based on Hilbert curve sceduling 
-It needs
-How many CPUs you want and how you want them grouped -How long your job
will run 
-Commands that will be run 
-How much RAM your job will use
-It Provides three functions 
-Allocating exclusive and non exclusive access to resources to users for some
duration. -Providing a framework for starting,executing,monitoring work(parallel
job) 
- Arbitrating contention for resources by managing a queue of pending jobs.

2. Pointers
How branch predictor algorithm works?
It is a digital circuit which tries to predict in which way a branch would go before the
proper command is to be executed. Its purpose is to improve the flow in the instruction pipeline.
Branch predictors play a critical role in achieving highly effective performance in modern
pipelined microprocessors such as x86. It is based on speculative algorithm. Branch predictor
predicts what the next line of code would be, and speculative algorithm predicts what the
output of that code would be, before it is actually being executed.
Implementation of Branch Prediction.
Static Branch Prediction- Static prediction is the simplest branch prediction technique
because it does not rely on information about the dynamic history of code executing. Instead, it
predicts the outcome of a branch based solely on the branch instruction. Evaluates branches in
decode stage and have a single cycle instruction fetch.
Dynamic Branch Prediction- Uses information about taken or not taken branches
gathered at run time to predict the outcome of a branch.
Random Branch Prediction – Random Branch Predictor uses algorithm which predicts
what the next branch would come to the process cycle to be executed at run time. It has a
prediction rate of around 50% .
Next line prediction – fetches each line of instruction with a pointer to the next line.
The next line predictor points to aligned pointers and predicts its outcome so that the execution
time is less.
What is Cache Coherency?
Cache coherence is the uniformity of shared resource data that ends up stored in
multiple local caches. When clients in a system maintain caches of a common memory resource,
problems may arise with incoherent data, which is particularly the case with CPUs in a
multiprocessing system.
When more than 1 cahce are connected with each other, there is a ambiguity of data, and this
leads to inconsistency of data. This is called cache coherency.
Process vs Threads
A process usually represents an independent execution unit with its own memory area,
system resources and scheduling slot.
A thread is typically a "division" within the process which usually share the same memory and
operating system resources, and share the time allocated to that process.

Process operations are controlled by PCB which is a kernel data structure. PCB uses the three
kinds of functions which are scheduling, dispatching and context save.
For thread,the kernel allocates a stack and a thread control block (TCB) to each thread.
Threads are implemented in three different ways:kernel-level threads, user-level threads,
hybrid threads. Threads can have three states running, ready and blocked.
A thread can't have individual existence whereas process can exit individually.
A process is heavy weighted, but a thread is light weighted.
Loosely Coupled vs Tightly coupled
Cache Memory and Levels of Cache Memory
A Cache is used by the CPU to access data from the main memory in short time. It is a
small and very fast temporary storage memory. It is designed to speed up the transfer of data or
instructions. CPU Cache is located inside or near to the CPU chip. The data/instructions which
are most recently or frequently used by the CPU are stored in CPU. A copy of data/instructions
is stored as a cache when the CPU uses them for the first time which retrieved from RAM. The
next time when CPU needs the data/instruction, it looks in the cache. If the required data/
instruction is found there, then it is retrieved from the cache memory instead of main memory.
SERIALNO. LOOSELY
COUPLED
TIGHTLY
COUPLED
1 It has distributed
memory.
It has shared
memory.
2 Efficient when
tasks running on
different
processors ,has
minimal
interaction
Efficient for high
speed or real-
time processing.
3 It generally , do
not encounter
memory conflict.
It experience
more memory
conflict.
4 Data rate is low. Data rate is high
as compared to
loosely coupled.

Types/Levels of cache memory
A computer has several different levels of cache memory. All levels of cache memory are faster
than the RAM. The cache which is closer to the CPU is always faster than the other levels but it
costs more and stores less data than other levels. As multiple processors operate in parallel, and
independently multiple caches may possess different copies of the same memory block, this
creates cache coherence problem. Cache coherence schemes help to avoid this problem by
maintaining a uniform state for each cached block of data.
Types of Cache Memory in a CPU
Level 1 or L1 Cache Memory
The L1 cache memory is built on processor chip and it is very fast because it runs on the speed
of the processor. It is also called primary or internal cache. It has less memory compared to
other levels of cache and can store up to the 64kb cache memory. This cache is made of SRAM
(Static RAM). Each time the processor requests information from memory, the cache controller
on the chip uses special circuitry to first check if the memory data is already in the cache. If it is
present, then the system is spared from the time-consuming access to the main memory. L1
cache is also usually split two ways, into the instruction cache and the data cache. The
instruction cache deals with the information about the operation that the CPU has to perform,
while the data cache holds the data on which the operation is to be performed.
Examples of L1 cache are accumulator, Program counter and address register, etc
The L2 cache memory is larger but slower than L1 cache. It is used to see recent accesses that
are not picked by the L1 cache and it usually stores 64kb to the 2MB cache memory. An L2 cache
is also found on the CPU. If L1 and L2 cache are used together, then the missing information that
is not present in the L1 cache can be retrieved quickly from the L2 cache. Like L1 caches, L2
caches are composed of SRAM but they are larger. L2 is usually a separate static RAM (SRAM) chip
and it is located between the CPU and DRAM (Main memory).
The L3 Cache memory is an enhanced form of memory present on the motherboard of the
computer. It is an extra cache built into the motherboard between the processor and main
memory to speed up the processing operations. It reduces the time gap between request and
retrieving of the data and instructions much more quickly than the main memory. L3 cache is
being used with processors nowadays, having more than 3MB of storage in it.

What are sockets?
Sockets allow communication between two different processes on the same or different
machines. To be more precise, it's a way to talk to other computers using standard Unix file
descriptors
To a programmer, a socket looks and behaves much like a low-level file descriptor. This is
because commands such as read() and write() work with sockets in the same way they do with
files and pipes.
There are four types of sockets available to the users. The first two are most commonly used and
the last two are rarely used.
Stream Sockets − Delivery in a networked environment is guaranteed. If you send
through the stream socket three items "A, B, C", they will arrive in the same order − "A, B, C".
These sockets use TCP (Transmission Control Protocol) for data transmission. If delivery is
impossible, the sender receives an error indicator. Data records do not have any boundaries.
Datagram Sockets − Delivery in a networked environment is not guaranteed. They're
connectionless because you don't need to have an open connection as in Stream Sockets − you
build a packet with the destination information and send it out. They use UDP (User Datagram
Protocol).
Raw Sockets − These provide users access to the underlying communication protocols,
which support socket abstractions. These sockets are normally datagram oriented, though their
exact characteristics are dependent on the interface provided by the protocol. Raw sockets are
not intended for the general user; they have been provided mainly for those interested in
developing new communication protocols, or for gaining access to some of the more cryptic
facilities of an existing protocol.
Sequenced Packet Sockets − They are similar to a stream socket, with the exception
that record boundaries are preserved. This interface is provided only as a part of the Network
Systems (NS) socket abstraction, and is very important in most serious NS applications.
Sequenced-packet sockets allow the user to manipulate the Sequence Packet Protocol (SPP) or
Internet Datagram Protocol (IDP) headers on a packet or a group of packets, either by writing a
prototype header along with whatever data is to be sent, or by specifying a default header to be
used with all outgoing data, and allows the user to receive the headers on incoming packets.

What is context Switching?
Context Switching involves storing the context or state of a process so that it can be
reloaded when required and execution can be resumed from the same point as earlier. This is a
feature of a multitasking operating system and allows a single CPU to be shared by multiple
processes.
Clustering and how Clusters handle Load Balancing
A cluster is a group of resources that are trying to achieve a common objective, and are
aware of one another. Clustering usually involves setting up the resources (servers usually) to
exchange details on a particular channel (port) and keep exchanging their states, so a resource’s
state is replicated at other places as well. It usually also includes load balancing, wherein, the
request is routed to one of the resources in the cluster as per the load balancing policy.
Load balancing can also happen without clustering when we have multiple independent servers
that have same setup, but other than that, are unaware of each other. Then, we can use a load
balancer to forward requests to either one server or other, but one server does not use the other
server’s resources. Also, one resource does not share its state with other resources. Each load
balancer basically does following tasks: Continuously check which servers are up. When a new
request is received, send it to one of the servers as per the load balancing policy. When a
request is received for a user who already has a session, send the user to the same server.

What is Multithreading?
Multithreading is the capability of a processor or a single core in a multicore processor to
be able to produce thread of execution concurrently.
In a multithreaded application, the threads share the resources of a single or multiple cores,
which include the computing units, the CPU caches, and other essential resources.
Simultaneous Multithreading- Simultaneous multithreading (SMT) is a technique for
improving the overall efficiency of CPUs with hardware multithreading. SMT permits multiple
independent threads of execution to better utilize the resources provided by modern processor
architectures.
The name multithreading is ambiguous, because not only can multiple threads be executed
simultaneously on one CPU core, but also multiple tasks/processes (with different page tables,
different task state segments, different protection rings, different I/O permissions, etc).
Two concurrent hardware threads per CPU core are the most common, but some processors
support up to eight concurrent threads per core.
Hyperthreading- Hyper-threading is Intel’s proprietary simultaneous multithreading
implementation used to improve parallelization of computations performed on x86
microprocessors.
What is Interconnect and its types used in HPCs?
Interconnect is the way by which various computers communicate with each other. The
biggest advantage HPCs have over ordinary consumer level computers is the Interconnect
technology used by them, which allows them to significantly boost efficiency, and allows them to
utilise the resources of other computers. High performance system interconnect technology can
be divided into three categories: Ethernet, InfiniBand, and vendor specific interconnects, which
includes custom interconnects the recently introduced Intel Omni-Path technology.
Ethernet as an Interconnect- Ethernet is established as the dominant low level
interconnect standard for mainstream commercial computing requirements. Above the physical
level, the software layers to coordinate communication resulted in TCP/IP becoming widely
adopted as the primary commercial networking protocol. Ethernet is established as the dominant
low level interconnect standard for mainstream commercial computing requirements. Above the
physical level, the software layers to coordinate communication resulted in TCP/IP becoming
widely adopted as the primary commercial networking protocol.
Infiniband as an Interconnect- InfiniBand is designed for scalability, using a switched
fabric network topology together with remote direct memory access (RDMA) to reduce CPU
overhead. The InfiniBand protocol stack is considered less burdensome than the TCP protocol
required for Ethernet. This enables InfiniBand to maintain a performance and latency edge in
comparison to Ethernet in many high performance workloads, and is generally used in Cluster
Computers

3. OpenMP and MPI Programs
1. Write a MPI program that should print your name only if number of processes is even
otherwise return error message.
#include<stdio.h>
#include<mpi.h>
int main() {
int comm_sz;
int my_rank;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); if(comm_sz%2==0 && my_rank==0)
printf(“Subratn");
else if(my_rank==0)
printf("Errorn");
MPI_Finalize();
}

2. Write a MPI program that should determine partner process and then send and receive
message (Your name and number) with it.
#include<stdio.h>
#include<string.h>
#include<mpi.h>
int main() {
int comm_sz;
int my_rank;
char name[100];
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); if(my_rank==0) {
int n;
strcpy(name, "Subrat"); for(int i=1; i<comm_sz; i++) {
MPI_Send(name, 100, MPI_CHAR, i, 0, MPI_COMM_WORLD); MPI_Recv(&n, 1, MPI_INT, i,
1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("%dn", n);
}
}
else {
MPI_Recv(name, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Send(&my_rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD); printf("%sn", name);
}
MPI_Finalize();
}

3. Observe the difference between blocking and non-blocking communication
I) Blocking communication:-
#include<stdio.h>
#include<string.h>
#include<mpi.h>
#define me 0
#define partner 1
#define MAX_STRING 1000
int main(void)
{
char
greeting[MAX_STRING],greeting1[MAX_STRING],greeting2[MAX_STRING],greeting3[MAX_STRING];
int comm_sz ;
int my_rank ;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if(my_rank==partner)
{
sprintf(greeting, "Welcome to the world of Parallel Computing. greeting2 I am Process
no %d out of %dn", my_rank, comm_sz);
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, me, 0, MPI_COMM_WORLD) ;
MPI_Recv(greeting, MAX_STRING, MPI_CHAR,me, 0,
MPI_COMM_WORLD,MPI_STATUS_IGNORE);
printf("%sn",greeting);
}
else
{
MPI_Recv(greeting, MAX_STRING, MPI_CHAR,partner, 0,
MPI_COMM_WORLD,MPI_STATUS_IGNORE);

sprintf(greeting, "Welcome to the world of Parallel Computing.greeting I am Process
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, partner, 0, MPI_COMM_WORLD) ;
}
MPI_Finalize();
return 0;
}

II) Non-blocking communication:-
#include<stdio.h>
#include<string.h>
#include<mpi.h>
#define me 0
#define partner 1
int main(void)
{
char
greeting[MAX_STRING],greeting1[MAX_STRING],greeting2[MAX_STRING],greeting3[MAX_STRING];
int comm_sz ;
int my_rank ;
MPI_Request request,request2;
if(my_rank==partner)
{
sprintf(greeting, "Welcome to the world of Parallel Computing. greeting2 I am Process
MPI_Isend(greeting, strlen(greeting)+1, MPI_CHAR, me, 0,
MPI_COMM_WORLD,&request) ;
}
else
{
MPI_Irecv(greeting, MAX_STRING, MPI_CHAR,partner, 0,
MPI_COMM_WORLD,&request2);

sprintf(greeting, "Welcome to the world of Parallel Computing.greeting I am Process
MPI_Isend(greeting, strlen(greeting)+1, MPI_CHAR, partner, 0,
MPI_COMM_WORLD,&request) ;
printf("%s",greeting);
}
MPI_Finalize();
return 0;
}

4. Write a C program to calculate the value of pi. (DartBoard Algorithm) hint- divide no of
darts.
#include<stdio.h>
#include<mpi.h>
#define R 97
#define NUM_SQUARE 10
int main() {
int px, py;
int limit = 2*R + 1;
int comm_sz;
int my_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); long ncircle = 0;
for(long i=0; i<NUM_SQUARE/comm_sz; i++) {
px = rand()%limit - R;
py = rand()%limit - R;
printf("(%d, %d)n", px, py); if((px*px + py*py) < R*R) ncircle++;
}
if(my_rank==0) {
int a[comm_sz];
long num_circle = ncircle; for(int q=1; q < comm_sz; q++) MPI_Recv(&a[q], 1, MPI_INT, q, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(int i=1; i<comm_sz; i++) num_circle += a[i];
printf("pi = %Lfn", (long double) (4 * ((long double) num_circle/NUM_SQUARE)));
}
else
MPI_Send(&ncircle, 1, MPI_INT, 0, 0, MPI_COMM_WORLD); MPI_Finalize();
}

5. Write a C-Program that initialise matrix A and B (user’s size), multiply both matrix and
store result in matrix C.
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define NRA 62 /* number of rows in matrix A */
#define NCA 15 /* number of columns in matrix A */
#define NCB 7 /* number of columns in matrix B */
#define MASTER 0 /* taskid of first task */
#define FROM_MASTER 1 /* setting a message type */
#define FROM_WORKER 2 /* setting a message type */
int main (int argc, char *argv[])
{
int numtasks, /* number of tasks in partition */
taskid, /* a task identifier */
numworkers, /* number of worker tasks */
source, /* task id of message source */
dest, /* task id of message destination */
mtype, /* message type */
rows, /* rows of matrix A sent to each worker */
averow, extra, offset, /* used to determine rows sent to each worker */
i, j, k, rc; /* misc */
double a[NRA][NCA], /* matrix A to be multiplied */
b[NCA][NCB], /* matrix B to be multiplied */
c[NRA][NCB]; /* result matrix C */
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
if (numtasks < 2 ) {
printf("Need at least two MPI tasks. Quitting...n");
MPI_Abort(MPI_COMM_WORLD, rc);
exit(1);
}
numworkers = numtasks-1;
if (taskid == MASTER)
{
printf("mpi_mm has started with %d tasks.n",numtasks);
printf("Initializing arrays...n");
for (i=0; i<NRA; i++)
for (j=0; j<NCA; j++)
a[i][j]= i+j;
for (i=0; i<NCA; i++)
for (j=0; j<NCB; j++)
b[i][j]= i*j;
/* Send matrix data to the worker tasks */
averow = NRA/numworkers;
extra = NRA%numworkers;
offset = 0;
mtype = FROM_MASTER;
for (dest=1; dest<=numworkers; dest++)
{
rows = (dest <= extra) ? averow+1 : averow;
printf("Sending %d rows to task %d offset=%dn",rows,dest,offset);
MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,
MPI_COMM_WORLD);

MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;
}
/* Receive results from worker tasks */
mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++)
{
source = i;
MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
MPI_COMM_WORLD, &status);
printf("Received results from task %dn",source);
}
/* Print results */
printf("******************************************************n");
printf("Result Matrix:n");
for (i=0; i<NRA; i++)
{
printf("n");
for (j=0; j<NCB; j++)
printf("%6.2f ", c[i][j]);
}
printf("n******************************************************n");
printf ("Done.n");
}
if (taskid > MASTER)
{
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD,
&status);
MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
for (k=0; k<NCB; k++)
for (i=0; i<rows; i++)
{
c[i][k] = 0.0;
for (j=0; j<NCA; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
mtype = FROM_WORKER;
MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
}

Output 4 nodes:-
Output 5 nodes:-

6. Write a MPI program that take data(name or number), send that to all the processor and
print them.
#include<stdio.h>
#include<string.h>
#include<mpi.h>
int main() {
int comm_sz;
int my_rank;
char name[100];
int num;
strcpy(name, "Subrat"); num = 1729;
}
MPI_Bcast(name, 100, MPI_CHAR, 0, MPI_COMM_WORLD); MPI_Bcast(&num, 1, MPI_INT,
0, MPI_COMM_WORLD); printf("Name: %snNum: %dn", name, num); MPI_Finalize();
}

7. Write a MPI program that should return the sum of all processes involved note:-reduce.
#include<stdio.h>
#include<string.h>
#include<mpi.h>
#define n 10
int main() {
int comm_sz;
int my_rank;
int sum = 0, num[100], a[1000]; MPI_Init(NULL, NULL);
printf("Enter the numbers:n"); for (int i = 0; i < n; ++i) scanf("%d", &a[i]);
}
//MPI_Barrier(MPI_COMM_WORLD); MPI_Scatter(a, n/comm_sz, MPI_INT, num, n/comm_sz,
MPI_INT, 0, MPI_COMM_WORLD);
MPI_Reduce(num, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); if(my_rank==0) {
printf("Sum = %dn", sum);
}
MPI_Finalize();
}

8. Write a MPI program that should return the sum of all processes involved note:-reduceall.
#include<stdio.h>
#include<string.h>
#include<mpi.h>
#define n 10
int main() {
int comm_sz;
int my_rank;
}
//MPI_Barrier(MPI_COMM_WORLD); MPI_Scatter(a, n/comm_sz, MPI_INT, num, n/comm_sz,
MPI_INT, 0, MPI_COMM_WORLD);
MPI_Allreduce(num, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); if(my_rank==0) {
printf("Sum = %dn", sum);
}
MPI_Finalize();
}

9. WAP such that it should initilize an array of 1 - 25 and divide these values among 5 process
equally note:- use 2d array.
#include<stdio.h>
#include<string.h>
#include<mpi.h>
#define n 10
int main() {
int comm_sz;
int my_rank;
int a[5][5], num[5];
printf("Enter the numbers:n"); for (int i = 0; i < 5; i++) {
for(int j=0; j<5; j++) {
scanf("%d", &a[i][j]);
}
}
}
//MPI_Barrier(MPI_COMM_WORLD); MPI_Scatter(a, 5, MPI_INT, num, 5, MPI_INT, 0,
MPI_COMM_WORLD); for(int i=0; i<5; i++)
printf("Process %d: %dn", my_rank, num[i]); MPI_Finalize();
}

10. WAP to decomposition simple data such that the master task should 1st initialize an array
and then distribute an equal portion of that array to the other tasks. After the other task
receive their portion of data, they should perform an ADDITION Operation on elements of
array.
#include<stdio.h>
#include<string.h>
#include<mpi.h>
#define n 10
int main() {
int comm_sz;
int my_rank;
}
MPI_Scatter(a, n/comm_sz, MPI_INT, num, n/comm_sz, MPI_INT, 0, MPI_COMM_WORLD);
for(int i=0; i<n/comm_sz; i++) sum += num[i];
int allsum[comm_sz];
MPI_Gather(&sum, 1, MPI_INT, allsum, 1, MPI_INT, 0, MPI_COMM_WORLD); if(my_rank==0)
{
for(int i=0; i<comm_sz; i++) printf("Process%d sum = %dn", i, allsum[i]);
}
MPI_Finalize();
}

Write a program to generate all the permutations given the number of
characters to be taken and the size of the string.
1. Implementation in serial program:-
#include<iostream>
#include <omp.h>
using namespace std;
int main(int argc, char const *argv[])
{
int n, i, j, k;
char ch[36], t = 'a';
for(i = 0; i<36; i++){
ch[i] = t;
t++;
if(t == '{')
t = '0';
}
cout<<"Enter the numbers of characters:- ";
cin>>n;
double before = omp_get_wtime();
for(i = 0; i < n*n*n; i++)
printf("Result:- %c%c%cnExecution in thread number:-
%dn", ch[(i/(n*n))%n], ch[(i/n)%n], ch[i%n],
omp_get_thread_num());
double after = omp_get_wtime();
cout<<"Time total:- "<<after - before<<endl;
return 0;
}

2. Implementation in OpenMP:-
#include<iostream>
#include <omp.h>
{
int n, i, j, k;
char ch[36], t = 'a';
for(i = 0; i<36; i++){
ch[i] = t;
t++;
if(t == '{')
t = '0';
}
cin>>n;
double before = omp_get_wtime();
#pragma omp parallel for num_threads(32)
for(i = 0; i < n*n*n; i++)
double after = omp_get_wtime();
cout<<"Time total:- "<<after - before<<endl;
return 0;
}

3. Implementation in MPI:-
#include<iostream>
#include<omp.h>
#include<mpi.h>
#define MASTER 0
#define send_tag 2001
#define return_tag 2002
{
int n, i, j, k, id, sender;
char ch[36], t = 'a';
MPI_Status status;
for(i = 0; i<36; i++){
ch[i] = t;
t++;
if(t == '{')
t = '0';
}
cin>>n;
int comm_sz;
int my_rank, div, end, start, to_send, to_recieve, ierr, partial_ans;
char greeting[MAX_STRING];
if(my_rank != MASTER){
div = (n*n*n) / comm_sz;
for(id=1; id<comm_sz; id++){

start = i*div + 1;
end = (i+1)*div;
if((n-end) < comm_sz)
end = n-1;
to_send = end - start + 1;
}
for(i=0; i < div; i++)
sprintf(greeting, "Result:- %c%c%cnExecution in thread
number:- %dn", ch[(i/(n*n))%n], ch[(i/n)%n], ch[i%n],
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0,
MPI_COMM_WORLD);
}
else{
for(int q=1; q < div; q++){
MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("%s n", greeting);
}
}
return 0;
}

4. VTune, ITAC and Quantum Espresso
Intel® VTune™
Intel VTune is an application created by Intel for software performance
analysis of serial and multithreaded applications on 32 and 64-bit x86 based
machines. VTune Profiler helps analyse the algorithm choices and identify
where and how your application can benefit from available hardware
resources.
Code optimization
VTune assists in various kinds of code profiling including stack sampling,
thread profiling and hardware event sampling. The profiler result consists of
details such as time spent in each sub routine which can be drilled down to
the instruction level. The time taken by the instructions are indicative of
any stalls in the pipeline during instruction execution. The tool can be also
used to analyse thread and storage performance.

Intel® Trace Analyzer and Collector
Intel Trace Collector is a tool for tracing MPI applications. It intercepts all
MPI calls and generates tracefiles that can be analysed with Intel Trace
Analyzer for understanding the application behaviour. Intel® Trace Collector
can also trace non-MPI applications, like socket communication in
distributed applications or serial programs.
Tracing
In software engineering, tracing essentially is a specialized form of logging
to record information about the execution of a program at runtime. This
information is typically used by programmers for debugging purposes, and
additionally, depending on the type and detail of information contained in a
trace log, by experienced system administrators or technical-support
personnel and by software monitoring tools to diagnose common problems
with software.

Quantum ESPRESSO is a suite for electronic-structure calculation and
materials modelling at the nanoscale, distributed free software. It is based
on density-functional theory, plane wave basis sets, and pseudopotentials.
ESPRESSO is an acronym for opEn-Source Package for Research in Electronic
Structure, Simulation, and Optimization.
Installation:-
Git clone https://github.com/QEF/q-e.git
Cd q-e
./configure
Make all

Cu.in
&control
prefix=''
outdir='temp'
pseudo_dir = '.',
/
&system
ibrav= 2, celldm(1) =6.824, nat= 1, ntyp= 1,
ecutwfc =30.0,
occupations='smearing', smearing='mp', degauss=0.06
/
&electrons
/
ATOMIC_SPECIES
Cu 63.546 Cu.pbesol-dn-kjpaw_psl.1.0.0.UPF
ATOMIC_POSITIONS
Cu 0.00 0.00 0.00
K_POINTS automatic
8 8 8 0 0 0

Report on High Performance Computing

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Report on High Performance Computing

Similar to Report on High Performance Computing (20)

Recently uploaded

Recently uploaded (20)

Report on High Performance Computing