Doing Parallel
Tools, Techniques and Scope of Parallel Computation
Jayanti Prasad
Inter-University Centre for Astronomy & Astrophysics
Pune, India (411007)
November 04, 2011
Department of Physics, Punjab University, Chandigarh
Plan of the Talk
Introduction
Why Parallel Computation
Parallel Problems
Performance
Solving Parallel Problems
Parallel Platforms
Pipeline and Vector Processors
Multicore Processors
Clusters
GPUs and APUs
Model of Parallel Programming
Shared Memory Programming
Distributed Memory Programming
GPU Programming
Summary and Conclusions
Why Parallel Computation?
Solving problems fast (saving time and money!).
Solving more problems (concurrency!).
Solving large problems, or problems which are not possible to solve on a single computer (discoveries!).
Examples of Parallel Problems
Pattern matching or search.
Processing large data volume.
Simulations.
SKA: imaging processor computation = 10^15 operations/second; final processed data output = 10 GB/second.
LHC, CERN: 15 petabytes (15 million gigabytes) of data annually.
Performance
Processor:
Higher clock rate (at present around 3 GHz).
More instructions per clock cycle.
More transistors (at present around 1 billion).
The performance (and complexity) of processors doubles roughly every 18 months (Moore's law), but making ever-faster processors is difficult due to heat dissipation and the speed-of-light limit.
Memory:
Latency.
Bandwidth.
Hierarchy (caches).
So far, the performance growth due to increases in clock rate has been about 55%, and that due to the growing number of transistors about 75%.
At present, many cores with a lower clock rate are preferred over a single core with a higher clock rate.
Parallel problems
Not all problems are parallel problems.
In general there are always some sections of a large problem which
can be solved in parallel.
The gain in computational time, or speed-up (the time taken by a single processing unit divided by the time taken by N processing units), due to parallel computation depends on the fraction of the total time the problem spends in its parallel sections (Amdahl's law).
In most cases the speed-up does not vary linearly with the problem size or with the number of processing units.
If the time taken in communication or input-output (IO) exceeds the computation time, the chances of a performance gain from parallel computation are small.
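Amdahl's law can be stated explicitly; the following is the standard textbook form, not taken from the original slides. If a fraction p of the run time is parallelizable and N processing units are used, the speed-up is

S(N) = \frac{1}{(1 - p) + p/N}

For example, with p = 0.9 and N = 16 the speed-up is about 6.4, and even as N goes to infinity it can never exceed 1/(1 - p) = 10.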
Examples of Parallel Problems
Scalar Product:

S = \sum_{i=1}^{N} A_i B_i    (1)

Linear Algebra (matrix multiplication):

C_{ij} = \sum_{k=1}^{M} A_{ik} B_{kj}    (2)

Integration:

y = 4 \int_0^1 \frac{dx}{1 + x^2}    (3)

Dynamical Simulations:

\vec{f}_i = \sum_{j=1}^{N} \frac{m_j (\vec{x}_j - \vec{x}_i)}{\left(|\vec{x}_j - \vec{x}_i|^2\right)^{3/2}}    (4)
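As an illustration of how such problems parallelize, here is a minimal sketch (not from the slides) of the scalar product of Eq. (1) with OpenMP; the array contents and the problem size are made up for the example. Each thread accumulates a partial sum over its share of the iterations, and the reduction clause combines the partial sums at the end.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const int N = 1000000;                      /* illustrative problem size */
    double *A = malloc(N * sizeof(double));
    double *B = malloc(N * sizeof(double));
    double S = 0.0;

    for (int i = 0; i < N; i++) {               /* fill the vectors with some data */
        A[i] = 1.0;
        B[i] = 2.0;
    }

    /* split the loop among threads; each keeps a private partial sum of S,
       and OpenMP adds the partial sums together at the end (reduction) */
    #pragma omp parallel for reduction(+:S)
    for (int i = 0; i < N; i++)
        S += A[i] * B[i];

    printf("S = %f (max threads = %d)\n", S, omp_get_max_threads());
    free(A);
    free(B);
    return 0;
}

Compiled with an OpenMP-aware compiler (e.g. gcc -fopenmp), the iterations are distributed over the available threads with no change to the serial logic.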
Two steps towards parallel solution
1. Identify the sections of your problem which are independent (asynchronous) and so can be solved in parallel (concurrently).
2. Map the parallel sections, following some efficient decomposition scheme, onto the hardware resources you have, using some software tools.
In general there is a many-to-one mapping between the multi-dimensional space of the parallel sections and the computing units,

f : I^n \rightarrow I^p    (5)

where n is the dimensionality of the "problem space" and p is the dimensionality of the "processing space".
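A common concrete choice for such a mapping is a one-dimensional block decomposition. The sketch below is illustrative only (the function name block_range and its arguments are made up for this example): it assigns a contiguous range of indices to each processing unit, spreading any remainder so the load stays balanced.

/* Block decomposition of n items over p processing units:
 * unit 'id' (0 <= id < p) gets the half-open index range [start, end).
 * The first (n % p) units receive one extra item each. */
void block_range(long n, int p, int id, long *start, long *end)
{
    long base = n / p;        /* minimum number of items per unit */
    long rem  = n % p;        /* leftover items to distribute     */

    if (id < rem) {
        *start = id * (base + 1);
        *end   = *start + base + 1;
    } else {
        *start = rem * (base + 1) + (id - rem) * base;
        *end   = *start + base;
    }
}

For n = 10 items on p = 3 units this gives the ranges [0,4), [4,7) and [7,10).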
Parallel Platforms: Flynn's Taxonomy
Single-Instruction, Single-Data (SISD) - von Neumann model.
Multiple-Instruction, Single-Data (MISD).
Single-Instruction, Multiple-Data (SIMD) - most common.
Multiple-Instruction, Multiple-Data (MIMD).
(1) Pipeline
In order to add two numbers, five computational steps are needed. If we have one functional unit for every step, a single addition still requires five clock cycles; however, since all the units are kept busy at the same time, one result is produced every cycle.
(2) Vector Processors
A processor that performs one instruction on several data sets is
called a vector processor.
The most common form of parallel computation is Single Instruction Multiple Data (SIMD), i.e., the same computational steps are applied to different data sets.
Problems which can be broken into small independent sub-problems, with little or no communication between them, are called embarrassingly parallel; SIMD computations are a typical example.
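A minimal sketch of SIMD-style data parallelism (illustrative; the function name saxpy and the compiler flags mentioned are only examples, not from the slides): the same operation is applied to every array element, and because the iterations are independent a vectorizing compiler can execute several of them per instruction using the processor's vector registers.

/* y[i] = a*x[i] + y[i] for all i: one instruction stream, many data elements.
 * With independent iterations a compiler can auto-vectorize this loop
 * (e.g. gcc -O3 -ftree-vectorize) onto the CPU's SIMD units. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}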
(3) Multi-core Processors
A single processor can have more than one computation unit, called cores, each with its own resources for executing instructions independently.
A multiprocessor system has many processors, and each of them can have more than one core. Note that just by looking at the motherboard you can count the processors but not the cores.
A single core can support more than one thread.
A multi-core processor presents multiple virtual CPUs to the user and
operating system.
Note that in general all the cores of a processor share the main
memory and some cache memory.
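Since the operating system sees the cores and hardware threads as virtual CPUs, their number can be queried at run time. A small sketch, assuming a Linux/glibc system (this example is not from the slides):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* number of logical processors currently online
       (cores times hardware threads per core) */
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Logical CPUs available: %ld\n", ncpu);
    return 0;
}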
(4) Clusters
(5) Graphical Processing Units (GPUs)
(6) Accelerated Processing Units (APUs)
With its Fusion technology, AMD combines multi-core CPU (x86) technology with a powerful, DirectX-capable, discrete-level graphics and parallel-processing engine on a single die to create the first Accelerated Processing Unit (APU).
Shared Memory Programming
The shared memory model, in which more than one processor shares the physical memory, is one of the easiest ways to achieve parallel computation.
There are many tools (Application Programming Interfaces, or APIs) such as OpenMP, pthreads and Intel Threading Building Blocks (TBB) available for shared memory programming.
Note that the shared address space model is different from the shared memory model.
Processes and threads
The building blocks of a Linux system are processes, and each process has its own data, instructions and memory space.
Threads are sub-units of processes; they are easy to create (because no data protection is needed) and they share their parent process's memory space.
On modern processors, threads can execute more than one task concurrently, due to the availability of resources (many cores on the same processor).
Threading APIs provide tools to assign work to threads, to share data and instructions, and to synchronize.
In general, multi-threaded programs follow the fork-join model.
Fork-Join
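A minimal fork-join sketch with pthreads (illustrative; the work function and the thread count are placeholders, not the author's code): the main thread forks worker threads and then joins them, waiting until all of them have finished.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

/* placeholder work function: each thread just reports its id */
void *work(void *arg)
{
    long id = (long) arg;
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    /* fork: create the worker threads */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, work, (void *) i);

    /* join: wait until every worker has finished */
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    printf("all threads joined\n");
    return 0;
}

Compile with -pthread; the fork and join phases bracket the parallel region exactly as in the diagram this slide refers to.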
Distributed Memory Programming
Data and instructions are shared among a set of homogeneous computing units by explicit communication, using tools (APIs) such as MPI.
Each computing unit is assigned a unique identification number (id), which is used to establish communication and share data.
Communication between computing units may be point-to-point (send/receive) or collective (broadcast, scatter, gather, etc.).
There is no upper limit on the number of computing units the system can have; however, communication complexity and overhead make it difficult to build a very large system.
In general the computation follows the master-slave paradigm.
Master-Slave Model
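A minimal MPI sketch of the master-slave pattern (illustrative only; the "work" each slave does and the message content are made up for the example): rank 0 acts as the master and collects one result from every other rank.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0) {
        /* master: receive one number from every slave */
        for (int src = 1; src < size; src++) {
            double result;
            MPI_Recv(&result, 1, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("master received %f from rank %d\n", result, src);
        }
    } else {
        /* slave: do some (fake) work and send the result to the master */
        double result = 100.0 * rank;
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Run with, e.g., mpirun -np 4 ./a.out; the ranks provide the unique ids mentioned above, and the send/receive pair is the point-to-point communication.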
GPU Programming
On a General Purpose Graphical Processing Unit (GP-GPU) a large number of processing units (cores) are available which can work simultaneously.
The sections of a program which take a lot of time, and can easily be split into tasks that run in parallel, can be offloaded to the GPU.
GPUs are very good for SIMD-type problems.
The GPU and CPU do not share memory space, so the data has to be explicitly copied from the CPU to the GPU and back.
For Nvidia GPUs a C-like programming environment (CUDA) is available.
OpenCL can be used to program any GPGPU.
Summary and conclusions
It is not easy to make a super-fast single processor, so multi-processor computing is the only way to get more computing power.
When more than one processor (core) shares the same memory, shared memory programming is used, e.g., pthreads, OpenMP, Intel TBB, etc.
Shared memory programming is fast, and it is easier to get linear scaling since communication is not an issue.
When processors do not share memory, explicit communication is used, as in MPI and PVM.
Distributed memory programming is the main way to solve large
problems (when thousands of processors are needed).
General Purpose Graphical Processing Units (GP-GPUs) can provide very high performance at very low cost; however, programming them is somewhat complicated and the parallelism is essentially limited to SIMD.
Thank You!
http://www.iucaa.ernet.in/jayanti/