This document provides an overview of parallel computing. It discusses why parallel computation is needed due to limitations in increasing processor speed. It then covers various parallel platforms including shared and distributed memory systems. It describes different parallel programming models and paradigms including MPI, OpenMP, Pthreads, CUDA and more. It also discusses key concepts like load balancing, domain decomposition, and synchronization which are important for parallel programming.
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemCSCJournals
Exascale computing refers to a computing system which is capable to at least one exaflop in next couple of years. Many new programming models, architectures and algorithms have been introduced to attain the objective for exascale computing system. The primary objective is to enhance the system performance. In modern/super computers, GPU is being used to attain the high computing performance. However, it’s the objective of proposed technologies and programming models is almost same to make the GPU more powerful. But these technologies are still facing the number of challenges including parallelism, scale and complexity and also many more that must be fixed to achieve make computing system more powerful and efficient. In this paper, we have present a testing tool architecture for a parallel programming approach using two programming models as CUDA and OpenMP. Both CUDA and OpenMP could be used to program shared memory and GPU cores. The object of this architecture is to identify the static errors in the program that occurred during writing the code and cause absence of parallelism. Our architecture enforces the developers to write the feasible code through we can avoid from the essential errors in the program and run successfully.
What is high performance Computing, examples of OpenMP, MPI and Pthreads. How obtain HPC benefits using OpenSolaris.
This version was presented at OpenSolaris Day in Porto Alegre, Brazil, in April 16, 2008.
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemCSCJournals
Exascale computing refers to a computing system which is capable to at least one exaflop in next couple of years. Many new programming models, architectures and algorithms have been introduced to attain the objective for exascale computing system. The primary objective is to enhance the system performance. In modern/super computers, GPU is being used to attain the high computing performance. However, it’s the objective of proposed technologies and programming models is almost same to make the GPU more powerful. But these technologies are still facing the number of challenges including parallelism, scale and complexity and also many more that must be fixed to achieve make computing system more powerful and efficient. In this paper, we have present a testing tool architecture for a parallel programming approach using two programming models as CUDA and OpenMP. Both CUDA and OpenMP could be used to program shared memory and GPU cores. The object of this architecture is to identify the static errors in the program that occurred during writing the code and cause absence of parallelism. Our architecture enforces the developers to write the feasible code through we can avoid from the essential errors in the program and run successfully.
What is high performance Computing, examples of OpenMP, MPI and Pthreads. How obtain HPC benefits using OpenSolaris.
This version was presented at OpenSolaris Day in Porto Alegre, Brazil, in April 16, 2008.
"This deck is from the opening session of the "Introduction to Programming Pascal (P100) with CUDA 8" workshop at CSCS in Lugano, Switzerland. The three-day course is intended to offer an introduction to Pascal computing using CUDA 8."
Watch the video: http://wp.me/p3RLHQ-gsQ
Learn more: http://www.cscs.ch/events/event_detail/index.html?tx_seminars_pi1%5BshowUid%5D=155
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Please contact me to download this pres.A comprehensive presentation on the field of Parallel Computing.It's applications are only growing exponentially day by days.A useful seminar covering basics,its classification and implementation thoroughly.
Visit www.ameyawaghmare.wordpress.com for more info
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
In this paper we describe about the novel implementations of depth estimation from a stereo
images using feature extraction algorithms that run on the graphics processing unit (GPU) which is
suitable for real time applications like analyzing video in real-time vision systems. Modern graphics
cards contain large number of parallel processors and high-bandwidth memory for accelerating the
processing of data computation operations. In this paper we give general idea of how to accelerate the
real time application using heterogeneous platforms. We have proposed to use some added resources to
grasp more computationally involved optimization methods. This proposed approach will indirectly
accelerate a database by producing better plan quality.
Parallel computing is computing architecture paradigm ., in which processing required to solve a problem is done in more than one processor parallel way.
"This deck is from the opening session of the "Introduction to Programming Pascal (P100) with CUDA 8" workshop at CSCS in Lugano, Switzerland. The three-day course is intended to offer an introduction to Pascal computing using CUDA 8."
Watch the video: http://wp.me/p3RLHQ-gsQ
Learn more: http://www.cscs.ch/events/event_detail/index.html?tx_seminars_pi1%5BshowUid%5D=155
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Please contact me to download this pres.A comprehensive presentation on the field of Parallel Computing.It's applications are only growing exponentially day by days.A useful seminar covering basics,its classification and implementation thoroughly.
Visit www.ameyawaghmare.wordpress.com for more info
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
In this paper we describe about the novel implementations of depth estimation from a stereo
images using feature extraction algorithms that run on the graphics processing unit (GPU) which is
suitable for real time applications like analyzing video in real-time vision systems. Modern graphics
cards contain large number of parallel processors and high-bandwidth memory for accelerating the
processing of data computation operations. In this paper we give general idea of how to accelerate the
real time application using heterogeneous platforms. We have proposed to use some added resources to
grasp more computationally involved optimization methods. This proposed approach will indirectly
accelerate a database by producing better plan quality.
Parallel computing is computing architecture paradigm ., in which processing required to solve a problem is done in more than one processor parallel way.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
4. Why parallel computation ?
The speed with which a single processor can process data
(number of instructions per second, FLOPS) cannot be
increased indefinitely.
It is difficult to cool faster CPUs.
Faster CPUs demand smaller chip size which again creates
more heat.
Using a fast/parallel computer :
Can solve existing problems/more problems in less time.
Can solve completely new problems leading to new findings.
In Astronomy High Performance Computing is used for two
main purposes
Processing large volume of data
Dynamical simulations
References :
Ananth Grama et. al. 2003, Introduction to Parallel Computing, Addison Wesley
Kevin Dowd & Charles R. Severance, High Performance Computing, OReily
5. Performance
Computation time for different problems may depend on the
followings:
Input-Output:- problems which require a lot of disk
read/write.
Communication:- Problems, like dynamical simulations need
a lot of data to be communicated among processors. Fast
inter-connects (Gig-bit Ethernet, Infiniband, Myrinet etc.) are
highly recommended.
Memory:- Problems which need a large amount of data to
be available in the main memory (DRAM), demand more
memory. Latency and bandwidth are very important.
Processor:- Problems in which a large number or
computations have to be done.
Cache :- Better memory hierarchy (RAM,L1,L2,L3..) and
efficient use of that benefits all problems.
6. Performance measurement and benchmarking
FLOPS:- Computational power is measured in units of
Floating Point Operations Per Second or FLOPS (Mega, Giga,
Tera, Peta,.. FLOPS).
Speed UP:-
Speed Up =
T1
TN
(1)
where T1 is the time needed to solve a problem on one
processor and Tn is the time needed to solve that on N
processors.
7.
8. Pipeline
If we want to add two numbers we need five computational steps:
If we have one functional unit for every step then the addition still
requires five clock cycles, however, all the units being busy at the
same time, one result is produced every cycle. Reference : Tobias Wittwer 2005,
An Introduction to Parallel Programming
9. Vector Processors
A processor that performs one instruction on several data sets
is called a vector processor.
The most common form of parallel computation is in the form
of Single Instruction Multiple Data (SIMD) i.e., same
computational steps are applied on different data sets.
Problems which can be broken into small problems for
parallelization are called embarrassingly parallel problems e.g.,
SIMD.
10. Multi-core Processors
That part of a processor which execute instructions in an
application is called a core.
Note that in a multiprocessor system we can have each
processor having one core or more than one cores.
In general the number of hardware threads are equal to the
number of cores.
In some processors one core can support more than one
software threads.
A multi-core processor presents multiple virtual CPUs to the
user and operating system which are not physically countable
but OS can give you clue (check /proc).
Reference : Darryl Gove, 2011, Multi-core Application Programming, Addison-Wesley
14. Parallel Problems
Scalar Product
S =
N
X
i=1
Ai Bi (2)
Linear-Algebra: Matrix multiplication
Cij =
M
X
k=1
AikBkj (3)
Integration
y = 4
Z 1
0
dx
1 + x2
(4)
Dynamical simulations
fi =
N
X
j=1
mj (~
xj − ~
xi )
(|~
xj − ~
xi |)3/2
(5)
17. Synchronization
More than one processors working on a single problem must
coordinate (if the problem is not embarrassingly parallel
problem).
The OpenMp CRITICAL directive specifies a region of code
that must be executed by only one thread at a time.
The BARRIER (in OpenMp, MPI etc.,) directive
synchronizes all threads in the team.
In pthreads Mutexes are used to avoid data inconsistencies
due to simultaneous operations by multiple threads upon the
same memory area at the same time.
18. Partitioning or decomposing a problem: Divide and
conquer
Partitioning or decomposing a problem is related to the
opportunities for parallel execution.
On the basis of whether we are diving the executions or data
we can have two type of decompositions:
Functional decomposition
Domain decomposition
Load balance
19. Domain decomposition
Figure: The left figure shows the use of orthogonal recursive bisection
and the right one shows Peano-Hilbert space filling curve for domain
decomposition. Note that parts which have Peano-Hilbert keys close to
each other are physically also close to each other.
Examples: N-body, antenna-baselines, sky maps (l,m)
20. Mapping and indexing
Indexing in MPI
MPI Comm rank(MPI COMM WORLD, &id);
Indexing in OpenMP
id = omp get thread num();
Indexing in CUDA
1 tid=threadIdx.x + blockDim.x*( threadIdx.y+blockDim.y * threadIdx.z);
2 bid=blockIdx.x + gridDim.x * blockIdx.y;
3 nthreads = blockDim.x * blockDim.y * blockDim.z;
4 nblocks = gridDim.x * gridDim.y;
5 id = tid + nthreads * bid;
1 dim3 dimGrid(grszx ,grszy), dimBlock(blsz ,blsz ,blsz );
2 VecAdd $<<< dimGrid ,dimBlock >>>$ ();
21. Mapping
A problem of size N can be divided among NP processors in
the following ways:
A processor with identification number id get the data
between the data index istart and iend where
istart = id ×
N
Np
(6)
iend = (id + 1) ×
N
Np
(7)
Every processor can pick data skipping NP elements
for(i = 0; i < N; i+ = Np) (8)
Note that N/Np may not always an integer so keep load
balance in mind.
22. Shared Memory programming : Threading Building Blocks
Intel Threading building blocks (ITBB) is provided in the form
of C++ runtime library which and can run on any platform.
The main advantage of TBB is that it works at a higher level
than raw threads, yet does not require exotic languages or
compilers.
Most threading packages require you to create, join, and
manage threads. Programming directly in terms of threads
can be tedious and can lead to inefficient programs because
threads are low-level, heavy constructs that are close to the
hardware.
TBB runtime library automatically schedules tasks onto
threads in a way that makes efficient use of processor
resources. The runtime is very effective in load-balancing also.
23. Shared Memory programming : Pthreads
1 # include <pthread.h>
2 void * print_hello_world (void * param) {
3 long tid =( long)param;
4 printf("Hello world from %ld ! n",tid );
5 }
6
7 int main(int argc , char *argv []){
8 pthread_t mythread[NTHREADS ];
9 long i;
10 for(i=0; i < NTHREADS; i++)
11 pthread_create (& mythread[i],NULL ,& print_hello_world ,( void *)i);
12 pthread_exit (NULL );
13 return (0);
14 }
Reference : David R. Butenhof 1997, Programming with Posix threads, Addison-Wesley
24. Shared Memory programming : OpenMp
No need to make any change in the structure of the program.
Need only three things to be done :
#include< omp.h >
#pragma omp parallel for shared () private ()
-fopenmp when compiling
In general available on all GNU/Linux system by default
References :
Rohit Chandra et. al. 2001, Parallel Programming in OpenMP, Morgan Kaufmann
Barbara Chapman et. al. 2008, Using OpenMP, MIT Press
http://www.iucaa.ernet.in/∼jayanti/openmp.html
25. OpenMP: Example
1 # include <stdio.h>
2 # include <omp.h>
3 int main (int argc , char*argv []){
4 int nthreads , tid , numthrd;
5 // set the number of threads
6 omp_set_num_threads (atoi(argv [1]));
7
8 # pragma omp parallel private(tid)
9
10 {
11 tid = omp_get_thread_num (); // obtain the thread id
12
13 nthreads = omp_get_num_threads (); // find number of threads
14
15 printf("tHello World from thread %d of %dn", tid ,nthreads );
16
17 }
18 return (0);
19 }
26. Distributed Memory Programming : MPI
MPI is a solution for very large problems
Communication
Point to Point
Collective
Communication overhead can dominate computation and it
may be hard to get linear scaling.
Is distributed in the form of libraries e.g., libmpi,libmpich or in
the form of compilers e.g., mpicc,mpif90.
Programs have to be completely restructured.
References :
Gropp, William et. al. 1999, Using MPI 2, MIT Press
http://www.iucaa.ernet.in/∼jayanti/mpi.html
27. MPI : Example
1 # include <stdio.h>
2 # include <mpi.h>
3 int main(int argc , char *argv []){
4 int rank , size , len;
5 char name[ MPI_MAX_PROCESSOR_NAME ];
6 MPI_Init (&argc , &argv );
7 MPI_Comm_rank (MPI_COMM_WORLD , &rank );
8 MPI_Comm_size (MPI_COMM_WORLD , &size );
9 MPI_Get_processor_name (name , &len);
10 printf ("Hello world! I’m %d of %d on %sn", rank , size , name );
11 MPI_Finalize ();
12 return (0);
13 }
28. GPU Programming : CUDA
Programming language similar to C with few extra constructs.
Parallel section is written in the form of kernel which is
executed on GPU.
Two different memory spaces : one for CPU and another of
GPU. Data has to be explicitly copied back and forth.
A very large number of threads can be used. Note that GPU
cannot match the complexity of CPU. It is mostly used for
SIMD programming.
References :
David B. Kirk and Wen-mei W. Hwu 2010, Programming Massively Parallel Processors, Morgan Kaufmann
Jason Sanders & Edward Kandrot 2011, Cuda By examples, Addison-Wesley
http://www.iucaa.ernet.in/∼jayanti/cuda.html
29. CUDA : Example
1 __global__ void force_pp(float *pos_d , float *acc_d ,int n){
2
3 int tidx = threadIdx.x;
4 int tidy = threadIdx.y;
5 int tidz = threadIdx.z;
6
7 int myid = (blockDim.z * (tidy+blockDim.y *tidx )) + tidz;
8 int nthreads = blockDim.z * blockDim.y * blockDim.x;
9
10 for(int i = myid; i < n; i+= nthreads ){
11 for(int l=0; l < ndim; l++)
12 acc_d[l+ndim*i] = ...;
13
14 }//
15 __syncthreads ();
16 }
17 // this was the device part
1 dim3 dimGrid (1);
2 dim3 dimBlock(BLOCK_SIZE , BLOCK_SIZE ,BLOCK_SIZE );
3 cudaMemcpy(pos_d , pos ,npart*ndim*sizeof(float), cudaMemcpyHostToDevice );
4 force_pp <<< dimGrid ,dimBlock >>>( pos_d ,acc_d ,npart );
5 cudaMemcpy(acc ,acc_d ,npart*ndim*sizeof(float), cudaMemcpyDeviceToHost );
30. Nvidia Quadro FX 3700
— General Information for device 0 —
Name: Quadro FX 3700
Compute capability: 1.1
Clock rate: 1242000
Device copy overlap: Enabled
Kernel execition timeout : Disabled
— Memory Information for device 0 —
Total global mem: 536150016
Total constant Mem: 65536
Max mem pitch: 262144
Texture Alignment: 256
— MP Information for device 0 —
Multiprocessor count: 14
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512, 512, 64)
Max grid dimensions: (65535, 65535, 1)
31. Dynamic & static libraries
Most of the open source software are distributed in the form of
source codes and from which libraries are created for the use.
In general libraries are not portable.
One of the most common problems which a user face is due
to not linking libraries.
The most common way to create libraries is: source code −→
object code −→ library.
gcc -c first.c
gcc -c second.c
ar rc libtest.a first.o second.o
gcc -shared -Wl,-soname, libtest.so.0 -o libtest.so.0 first.o second.o -lc
The above library can be used in the following way
gcc program.c -L/LIBPATH -ltest
Dynamic library gets preference over static one.
32. Hyper threading
Hyper-Threading Technology used in Intel R
Xeon
TM
and
Intel R
Pentium
TM
4 processors, makes a single physical
processor appear as two logical processors to the operating
system.
Hyper-Threading duplicates the architectural state on each
processor, while sharing one set of execution resources.
sharing system resources, such as cache or memory bus, may
degrade system performance and Hyper-Threading can
improve the performance of some applications, but not all.
33. Summary
It is not easy to make a super-fast single processor so
multi-processor computing is the only way to get more
computing power.
When more than one processors (cores) share the same
memory shared memory programming is uded e.g.,
pthread,OpenMp, itbb etc.
Shared memory programming is fast and it easy to get linear
scaling since communication is not an issue.
When processors having their own memory are used for
parallel computation, distributed memory programming is
used e.g., MPI, PVM.
Distributed memory programming is the main way to solve
large problems (when thousands of processors are needed).
General Purpose Graphical Processing Units (GPGPU) can
provide very high performance at very low cost, however,
programming is somewhat complicated and parallelism is
limited to only SIMD.