Introduction To Parallel Computing

Outline

    • Overview
    • Theoretical background
    • Parallel computing systems
    • Parallel programming models
    • MPI/OpenMP examples

OVERVIEW

What is Parallel Computing?
    • Parallel computing: use of multiple processors or
       computers working together on a common task.
       - Each processor works on its section of the problem
       - Processors can exchange information

Why Do Parallel Computing?

Parallelism

Limits of Parallel Computing

    • Theoretical Upper Limits
       - Amdahl’s Law
       - Gustafson’s Law
    • Practical Limits
       - Load balancing
       - Non-computational sections
    • Other Considerations
       - time to re-write code

Amdahl’s Law
     • All parallel programs contain:
         - parallel sections (we hope!)
         - serial sections (we despair!)
     • Serial sections limit the parallel effectiveness
     • Amdahl’s Law states this formally
         - Effect of multiple processors on speed up
              S_P = T_S / T_P = 1 / (f_s + f_p / P)

          where
              • f_s = serial fraction of code
              • f_p = parallel fraction of code (f_s + f_p = 1)
              • P   = number of processors

          Example: f_s = 0.5, f_p = 0.5, P = 2
                   S_P,max = 1 / (0.5 + 0.5/2) = 1 / 0.75 = 1.333

Amdahl’s Law

Practical Limits: Amdahl’s Law vs. Reality

      • In reality, the situation is even worse than predicted by Amdahl’s
         Law due to:
          - Load balancing (waiting)
          - Scheduling (shared processors or memory)
          - Cost of Communications
          - I/O

PARALLEL SYSTEMS

“Old school” hardware classification

                            Single Instruction    Multiple Instruction
          Single Data              SISD                   MISD
          Multiple Data            SIMD                   MIMD

     SISD No parallelism in either instruction or data streams (mainframes)
     SIMD Exploit data parallelism (stream processors, GPUs)
     MISD Multiple instructions operating on the same data stream. Unusual,
        mostly for fault-tolerance purposes (space shuttle flight computer)

     MIMD Multiple instructions operating independently on multiple data
       streams (most modern general purpose computers, head nodes)
                                                    NOTE: GPU references frequently refer to
                                                    SIMT, or single instruction multiple thread

Hardware in parallel computing

     Memory access
     • Shared memory
         - SGI Altix
         - IBM Power series nodes
     • Distributed memory
         - Uniprocessor clusters
     • Hybrid/Multi-processor clusters (Ranger, Lonestar)
     • Flash based (e.g. Gordon)

     Processor type
     • Single core CPU
         - Intel Xeon (Prestonia, Wallatin)
         - AMD Opteron (Sledgehammer, Venus)
         - IBM POWER (3, 4)
     • Multi-core CPU (since 2005)
         - Intel Xeon (Paxville, Woodcrest, Harpertown, Westmere, Sandy Bridge…)
         - AMD Opteron (Barcelona, Shanghai, …)
         - IBM POWER (5, 6…)
         - Fujitsu SPARC64 VIIIfx (8 cores)
     • Accelerators
         - GPGPU
         - MIC

Shared and distributed memory




     Shared memory
     • All processors have access to a pool of shared memory
     • Access times vary from CPU to CPU in NUMA systems
     • Example: SGI Altix, IBM P5 nodes

     Distributed memory
     • Memory is local to each processor
     • Data exchange by message passing over a network
     • Example: Clusters with single-socket blades

Hybrid systems




     • A limited number, N, of processors have access to a common pool
        of shared memory

     • To use more than N processors requires data exchange over a
        network

     • Example: Cluster with multi-socket blades

Multi-core systems




     • Extension of hybrid model

     • Communication details increasingly complex
        - Cache access
        - Main memory access
        - Quick Path / Hyper Transport socket connections
        - Node to node connection via network

Accelerated (GPGPU and MIC) Systems




     • Calculations made in both CPU and accelerator

     • Provide abundance of low-cost flops

     • Typically communicate over PCI-e bus

     • Load balancing critical for performance

PROGRAMMING MODELS

Types of parallelism

     • Data Parallelism
       - Each processor performs the same task on
       different data (remember SIMD, MIMD)

     • Task Parallelism
       - Each processor performs a different task on the
       same data (remember MISD, MIMD)

     • Many applications incorporate both

Implementation: Single Program Multiple Data (SPMD)
     • Dominant programming model for shared and distributed
        memory machines
     • One source code is written
     • Code can have conditional execution based on which
        processor is executing the copy
     • All copies of code start simultaneously and communicate and
        synchronize with each other periodically

SPMD Model

Task Parallel Programming Example


     • One code will run on 2 CPUs
     • Program has 2 tasks (a and b) to be done by 2 CPUs

     program.f (one source):
        initialize
        if CPU=a then
           do task a
        elseif CPU=b then
           do task b
        end if
        end program

     CPU A executes:               CPU B executes:
        initialize                    initialize
        do task a                     do task b
        end program                   end program
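     A minimal MPI version of the same idea in C (a sketch; task_a and task_b
     are hypothetical routines, and the branching on rank plays the role of the
     "if CPU=a" test above):

     #include <stdio.h>
     #include <mpi.h>

     /* Hypothetical tasks; a real code would do actual work here */
     static void task_a(void) { printf("task a\n"); }
     static void task_b(void) { printf("task b\n"); }

     int main(int argc, char *argv[])
     {
         int rank;
         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);

         if (rank == 0)          /* "CPU A" */
             task_a();
         else if (rank == 1)     /* "CPU B" */
             task_b();

         MPI_Finalize();
         return 0;
     }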

Shared Memory Programming: pthreads


     • Shared memory systems (SMPs, ccNUMAs) have a single address space
     • Applications can be developed in which loop iterations (with no
       dependencies) are executed by different processors (see the sketch below)
     • Threads are ‘lightweight processes’ (same PID)
     • Allows ‘MIMD’ codes to execute in a shared address space
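     A minimal pthreads sketch of splitting independent loop iterations across
     threads (array size, thread count, and the work per iteration are
     illustrative):

     #include <pthread.h>
     #include <stdio.h>

     #define N        128
     #define NTHREADS 4

     static double a[N], b[N], c[N];

     /* Each thread handles a contiguous block of independent iterations */
     static void *worker(void *arg)
     {
         long t     = (long)arg;
         long chunk = N / NTHREADS;
         for (long i = t * chunk; i < (t + 1) * chunk; i++)
             b[i] = a[i] + c[i];
         return NULL;
     }

     int main(void)
     {
         pthread_t tid[NTHREADS];
         for (long t = 0; t < NTHREADS; t++)
             pthread_create(&tid[t], NULL, worker, (void *)t);
         for (long t = 0; t < NTHREADS; t++)
             pthread_join(tid[t], NULL);
         printf("b[0] = %f\n", b[0]);
         return 0;
     }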

Shared Memory Programming: OpenMP


     • Built on top of pthreads
     • Shared memory codes are mostly data parallel, ‘SIMD’ kinds of codes
     • OpenMP is a standard for shared memory programming (compiler directives)
     • Vendors offer native compiler directives

Accessing Shared Variables

     • If multiple processors want to write to a shared variable at the same
       time, there could be a conflict (a race condition):
        - Processes 1 and 2 both
        - compute X+1
        - write X
        - one of the two updates is lost
     • Programmer, language, and/or architecture must provide ways of resolving
       conflicts, e.g. mutexes and semaphores (see the mutex sketch below)
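     As a concrete illustration (not from the original deck), a pthread mutex
     serializing the X = X + 1 update; the shared counter and thread count are
     illustrative:

     #include <pthread.h>
     #include <stdio.h>

     static int x = 0;
     static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

     /* Both threads run this; the mutex makes the read-modify-write of x
        ("compute X+1, write X") atomic, so no update is lost */
     static void *increment(void *arg)
     {
         (void)arg;                   /* unused */
         pthread_mutex_lock(&lock);
         x = x + 1;
         pthread_mutex_unlock(&lock);
         return NULL;
     }

     int main(void)
     {
         pthread_t t1, t2;
         pthread_create(&t1, NULL, increment, NULL);
         pthread_create(&t2, NULL, increment, NULL);
         pthread_join(t1, NULL);
         pthread_join(t2, NULL);
         printf("x = %d\n", x);       /* always 2 with the mutex */
         return 0;
     }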

OpenMP Example #1: Parallel Loop

         !$OMP PARALLEL DO
           do i=1,128
            b(i) = a(i) + c(i)
           end do
         !$OMP END PARALLEL DO

     • The first directive specifies that the loop immediately following should be
        executed in parallel.

     • The second directive specifies the end of the parallel section (optional).

     • For codes that spend the majority of their time executing the content of
        simple loops, the PARALLEL DO directive can result in significant parallel
        performance.
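     For C codes, the analogous construct is #pragma omp parallel for; a
     minimal, self-contained sketch (array contents are illustrative):

     #include <stdio.h>

     int main(void)
     {
         double a[128], b[128], c[128];
         for (int i = 0; i < 128; i++) { a[i] = i; c[i] = 2 * i; }

         /* C equivalent of the Fortran PARALLEL DO directive above */
         #pragma omp parallel for
         for (int i = 0; i < 128; i++)
             b[i] = a[i] + c[i];

         printf("b[10] = %f\n", b[10]);
         return 0;
     }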

OpenMP Example #2: Private Variables

         !$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I,TEMP)
         do I=1,N
          TEMP = A(I)/B(I)
          C(I) = TEMP + SQRT(TEMP)
         end do
         !$OMP END PARALLEL DO

     • In this loop, each processor needs its own private copy of the variable
         TEMP.

     • If TEMP were shared, the result would be unpredictable since multiple
         processors would be writing to the same memory location.

Distributed Memory Programming: MPI

     • Distributed memory systems have separate address spaces
        for each processor

     • Local memory accessed faster than remote memory

     • Data must be manually decomposed

     • MPI is the de facto standard for distributed memory
        programming (library of subprogram calls)

     • Vendors typically have native libraries such as SHMEM
        (T3E) and LAPI (IBM)

Data Decomposition
     • For distributed memory systems, the ‘whole’ grid is
        decomposed to the individual nodes
        - Each node works on its section of the problem
        - Nodes can exchange information
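     One common way to compute each node's section is a block decomposition;
     the sketch below (function and variable names are illustrative, not from
     the deck) assigns a contiguous index range to each rank:

     #include <stdio.h>

     /* Sketch: block decomposition of n grid points over nranks processes.
        Rank 'rank' owns indices [lo, hi); leftover points go to the lowest ranks. */
     static void block_range(int n, int nranks, int rank, int *lo, int *hi)
     {
         int base = n / nranks;              /* minimum points per rank   */
         int rem  = n % nranks;              /* points left to distribute */
         *lo = rank * base + (rank < rem ? rank : rem);
         *hi = *lo + base + (rank < rem ? 1 : 0);
     }

     int main(void)
     {
         for (int rank = 0; rank < 3; rank++) {
             int lo, hi;
             block_range(10, 3, rank, &lo, &hi);
             printf("rank %d owns [%d, %d)\n", rank, lo, hi);
         }
         return 0;
     }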

Typical Data Decomposition

     • Example: integrate 2-D propagation problem:

MPI Example #1
     • Every MPI program needs these:

     #include "mpi.h"

     int main(int argc, char *argv[])
     {
       int nPEs, iam, ierr;
       /* Initialize MPI */
       ierr = MPI_Init(&argc, &argv);
       /* How many total PEs are there */
       ierr = MPI_Comm_size(MPI_COMM_WORLD, &nPEs);
       /* What node am I (what is my rank?) */
       ierr = MPI_Comm_rank(MPI_COMM_WORLD, &iam);

       ierr = MPI_Finalize();
       return 0;
     }

MPI Example #2

     #include <stdio.h>
     #include "mpi.h"

     int main(int argc, char *argv[])
     {
         int numprocs, myid;
         MPI_Init(&argc, &argv);
         MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
         MPI_Comm_rank(MPI_COMM_WORLD, &myid);
         /* print out my rank and this run's PE size */
         printf("Hello from %d of %d\n", myid, numprocs);
         MPI_Finalize();
         return 0;
     }

MPI: Sends and Receives

     • MPI programs must send and receive data between the
        processors (communication)

     • The most basic calls in MPI (besides the three initialization
        and one finalization calls) are:
        - MPI_Send
        - MPI_Recv


     • These calls are blocking: the source processor issuing the
        send/receive cannot move to the next statement until the
        target processor issues the matching receive/send.
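     A minimal blocking exchange between ranks 0 and 1 (a sketch; the value and
     tag are arbitrary, and the program should be run with at least two
     processes):

     #include <stdio.h>
     #include <mpi.h>

     int main(int argc, char *argv[])
     {
         int rank, value;
         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);

         if (rank == 0) {
             value = 42;
             /* blocking send to rank 1, tag 0 */
             MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
         } else if (rank == 1) {
             /* blocking receive from rank 0, tag 0 */
             MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             printf("rank 1 received %d\n", value);
         }

         MPI_Finalize();
         return 0;
     }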

Message Passing Communication
     • Processes in message passing programs communicate by
        passing messages


     • Messages carry typed data; basic MPI data types include MPI_CHAR, MPI_SHORT, …

     • Send (parameters list)

     • Receive (parameter list)

     • Parameters depend on the library used

     • Barriers

Final Thoughts

     • These are exciting and turbulent times in HPC.
     • Systems with multiple shared memory nodes and multiple cores per node
       are the norm.
     • Accelerators are rapidly gaining acceptance.
     • Going forward, the most practical programming paradigms to learn are:
         - Pure MPI
         - MPI plus multithreading (OpenMP or pthreads)
         - Accelerator models (MPI or multithreading for MIC, CUDA or OpenCL for GPU)

THANK YOU