Parallel Computing 2007: Overview

February 26-March 1 2007. Geoffrey Fox, Community Grids Laboratory, Indiana University, 505 N Morton Suite 224, Bloomington IN. [email_address] http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/

Introduction

Books For Lectures

Some Remarks

Job Mixes (on a Chip) [Figure labels: A1 A2 A3 A4 C B E D1 D2 F]

Three styles of “Jobs” [Figure labels: A1 A2 A3 A4 C B E D1 D2 F]

Data Parallelism in Algorithms

Functional Parallelism in Algorithms

Structure (Architecture) of Applications

Motivating Task

RMS: Recognition Mining Synthesis. Recognition: What is …? (Model). Mining: Is it …? (Find a model instance). Synthesis: What if …? (Create a model instance). Today: model-less; real-time streaming and transactions on static, structured datasets; very limited realism. Tomorrow: model-based multimodal recognition; real-time analytics on dynamic, unstructured, multimodal datasets; photo-realism and physics-based animation.

Recognition: What is a tumor? Mining: Is there a tumor here? Synthesis: What if the tumor progresses? It is all about dealing efficiently with complex multimodal datasets. Images courtesy: http://splweb.bwh.harvard.edu:8000/pages/images_movies.html

Why Parallel Computing is Hard

Structure of Complex Systems [Figure: chain of maps from S natural application to S computer; each system drawn with Space and Time axes]

Languages in Complex Systems Picture [Figure: chain of maps]

Structure of Modern Java System: GridSphere

Another Java Code; Batik Scalable Vector Graphics SVG Browser

Are Applications Parallel?

Seismic Simulation of Los Angeles Basin: Divide surface into 4 parts and assign calculation of waves in each part to a separate processor.

Parallelizable Software [Figure: map from S natural application to S computer; each system drawn with Space and Time axes]

Potential in a Vacuum Filled Rectangular Box

Basic Sequential Algorithm: $\phi_{\text{New}} = (\phi_{\text{Left}} + \phi_{\text{Right}} + \phi_{\text{Up}} + \phi_{\text{Down}}) / 4$ [Figure: stencil with points Up, Down, Left, Right around New]

Update on the Mesh 14 by 14 Internal Mesh
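
The update rule on the "Basic Sequential Algorithm" slide can be sketched as a Jacobi-style relaxation on a 14 by 14 internal mesh. This is a minimal sketch, not the lecture's own code: the function names, the boundary values (top edge held at 1.0, other edges at 0.0), the tolerance, and the choice of Jacobi rather than another relaxation order are all assumptions made for illustration.

```python
def jacobi_sweep(phi):
    """Return a new mesh in which each internal point is the average of its four neighbours."""
    n_rows, n_cols = len(phi), len(phi[0])
    new = [row[:] for row in phi]  # boundary values are copied unchanged
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            # New = (Left + Right + Up + Down) / 4
            new[i][j] = (phi[i][j - 1] + phi[i][j + 1] + phi[i - 1][j] + phi[i + 1][j]) / 4
    return new


def solve_laplace(internal=14, top=1.0, tol=1e-6, max_sweeps=10_000):
    """Relax an internal x internal mesh until no point changes by more than tol."""
    size = internal + 2  # one layer of fixed boundary points on every side
    phi = [[0.0] * size for _ in range(size)]
    phi[0] = [top] * size  # assumed boundary condition: top edge at `top`, the rest at 0
    sweeps = 0
    for sweeps in range(1, max_sweeps + 1):
        new = jacobi_sweep(phi)
        change = max(abs(new[i][j] - phi[i][j]) for i in range(size) for j in range(size))
        phi = new
        if change < tol:
            break
    return phi, sweeps


if __name__ == "__main__":
    mesh, sweeps = solve_laplace()
    print(f"converged after {sweeps} sweeps; centre value {mesh[7][7]:.4f}")
```

Every internal point depends only on its four neighbours, which is why the mesh can later be cut into regions and updated by separate processors.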

Parallelism is Straightforward

Communication is Needed

Communication Must be Reduced

Summary of Laplace Speed Up
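
The bullet text of the preceding slides did not survive extraction, so the following is only a hedged sketch of the standard edge-over-area argument for this kind of mesh problem. It assumes a square region of `n_per_proc` points per processor, one value exchanged per edge point on each of four edges, and four floating point operations per update. The names `t_comm`, `t_float`, and `flops_per_point` are illustrative parameters, not values from the lecture.

```python
import math


def laplace_overhead(n_per_proc, t_comm, t_float, flops_per_point=4):
    """Ratio of communication time to calculation time for one sweep of a square region."""
    edge = math.sqrt(n_per_proc)                    # points along one side of the region
    comm_time = 4 * edge * t_comm                   # assumed: one value per edge point, four edges
    calc_time = flops_per_point * n_per_proc * t_float
    return comm_time / calc_time


def laplace_speedup(n_procs, n_per_proc, t_comm, t_float):
    """Speed up on n_procs processors when the only overhead is edge communication."""
    return n_procs / (1 + laplace_overhead(n_per_proc, t_comm, t_float))


if __name__ == "__main__":
    for n in (16, 100, 10_000):
        print(n, round(laplace_overhead(n, t_comm=1.0, t_float=1.0), 4))
```

Under these assumptions the overhead falls as one over the square root of the number of points per processor, so larger regions per processor give speed up closer to the processor count.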

All systems have various Dimensions

Parallel Processing in Society It’s all well known ……

Divide problem into parts; one part for each processor 8-person parallel processor

Amdahl’s Law of Parallel Processing
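
As a reminder of what the law says, here is a minimal sketch of the usual textbook form: if a fraction `serial_fraction` of the work cannot be parallelized, the speed up on `n_procs` processors is 1 / (s + (1 - s) / N). The function and parameter names are illustrative and are not taken from the slide.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Textbook Amdahl's law: speed up when serial_fraction of the work stays sequential."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


if __name__ == "__main__":
    for n in (1, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.1, n), 3))
```

However many processors are used, the result stays below 1 / `serial_fraction`, which is the limit the law is usually quoted for.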

Typical modern application performance

Performance of Typical Science Code I: FLASH Astrophysics code from DoE Center at Chicago. Plotted as time as a function of number of nodes. Scaled Speedup as constant grain size as number of nodes increases.

Performance of Typical Science Code II: FLASH Astrophysics code from DoE Center at Chicago on Blue Gene. Note both communication and simulation time are independent of number of processors – again the scaled speedup scenario. [Plot legend: Communication, Simulation]

FLASH is a pretty serious code

Rich Dynamic Irregular Physics

FLASH Scaling at fixed total problem size: rollover occurs at increasing number of processors as problem size increases. [Plot annotation: Increasing Problem Size]

The Web is also just message passing [Figure label: Neural Network]

1984 Slide – today replace hypercube by cluster

Inside CPU, or Inner Parallelism. Between CPU’s, called Outer Parallelism.

Now we discuss classes of application

“Space-Time” Picture: “Internal” (to data chunk) application spatial dependence (n degrees of freedom) maps into time on the computer. [Figure: Application Space against Application Time (t0 to t4), mapped onto a 4-way Parallel Computer (CPU’s) against Computer Time (T0 to T4)]

Data Parallel Time Dependence: Synchronization on MIMD machines is accomplished by messaging. It is automatic on SIMD machines! [Figure: Application Space against Application Time (t0 to t4); Synchronous; Identical evolution algorithms]

Local Messaging for Synchronization [Figure: 8 Processors; Application Space against Application and Processor Time, alternating Communication Phase and Compute Phase]

Loosely Synchronous Applications: Distinct evolution algorithms for each data point in each processor. [Figure: Application Space against Application Time (t0 to t4)]

Irregular 2D Simulation -- Flow over an Airfoil

Heterogeneous Problems

Asynchronous Applications [Figure: Application Space against Application Time]

Computer Chess [Figure label: Increasing search depth]

Discrete Event Simulations [Figure label: Battle of Hastings]
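
The slide's own bullets are lost, so as a general illustration of the application class, here is a minimal sketch of a sequential discrete event simulator built on a priority queue. The `Simulator` class and the event names in the usage example are hypothetical. The sketch shows why this class is asynchronous: events fire at irregular times, and each one may schedule further events.

```python
import heapq
import itertools


class Simulator:
    """Process events in time order; an event's action may schedule further events."""

    def __init__(self):
        self.now = 0.0
        self.log = []
        self._queue = []
        self._counter = itertools.count()  # tie-breaker so equal times pop in insertion order

    def schedule(self, delay, name, action=None):
        heapq.heappush(self._queue, (self.now + delay, next(self._counter), name, action))

    def run(self):
        while self._queue:
            self.now, _, name, action = heapq.heappop(self._queue)
            self.log.append((self.now, name))
            if action is not None:
                action(self)


if __name__ == "__main__":
    sim = Simulator()
    sim.schedule(5.0, "charge")
    # The volley event schedules a retreat two time units after it fires.
    sim.schedule(1.0, "volley", lambda s: s.schedule(2.0, "retreat"))
    sim.run()
    print(sim.log)
```

Parallelizing such a loop is hard because the next event to process is only known after earlier events have run.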

Dataflow [Figure: Large Applications linked by a Communication Bus: Wing Airflow, Radar Signature, Engine Airflow, Structural Analysis, Noise, Optimization]

Grid Workflow Datamining in Earth Science [Figure: NASA GPS Earthquake Streaming Data Support; Transformations; Data Checking; Hidden Markov Datamining (JPL); Display (GIS); Real Time; Archival]

Grid Workflow Data Assimilation in Earth Science

Web 2.0 has Services of varied pedigree linked by Mashups – expect interesting developments as some of the services run on multicore clients

Mashups are Workflow?

Work/Dataflow and Parallel Computing I

Work/Dataflow and Parallel Computing II

Google MapReduce Simplified Data Processing on Large Clusters
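
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is not Google's API: every function name here is hypothetical, and a real system would run the mappers and reducers on many machines.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every input and stream out (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group values by key, as the runtime would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    for word in doc.lower().split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


def word_count(documents):
    return reduce_phase(shuffle(map_phase(documents, word_mapper)), sum_reducer)


if __name__ == "__main__":
    print(word_count(["the web is message passing", "the grid is workflow"]))
```

The mapper calls are independent of one another and so are the reducer calls, which is what lets a runtime spread them over a large cluster.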

Other Application Classes

Event-based “Dataflow” [Figure label: Event Broker]

A small discussion of hardware

Blue Gene/L Complex System with replicated chips and a 3D toroidal interconnect

1987 MPP: 1024 processors in full system with ten-dimensional hypercube interconnect

Discussion of Memory Structure and Applications

Parallel Architecture I [Figure: memory hierarchy repeated for four cores (Core, Cache, L2 Cache, L3 Cache, Main Memory), annotated with Dataflow, Performance, Bandwidth, Latency, Size]

Communication on Shared Memory Architecture

GPU Coprocessor Architecture

IBM Cell Processor: Applications running well on Cell or AMD GPU should run scalably on future mainline multicore chips. Focus on memory bandwidth key (dataflow not deltaflow).

Parallel Architecture II [Figure: four memory hierarchies (Core, Cache, L2 Cache, L3 Cache, Main Memory) joined by an Interconnection Network; labels: Dataflow, “Deltaflow” or Events]

Memory to CPU Information Flow

Cache and Distributed Memory Analogues [Figure: Core, Cache, L2 Cache, L3 Cache, Main Memory]

Space Time Structure of a Hierarchical Multicomputer

Cache v Distributed Memory Overhead

Space-Time Decompositions for the parallel one dimensional wave equation [Figure label: Standard Parallel Computing Choice]

Amdahl’s misleading law I

Amdahl’s misleading law II

Hierarchical Algorithms meet Amdahl [Figure: Level 0 Mesh to Level 4 Mesh assigned to Processors 0 1 2 3]

A Discussion of Software Models

Programming Paradigms

Parallel Software Paradigms I: Workflow

The Marine Corps Lack of Programming Paradigm Library Model

Parallel Software Paradigms II: Component Parallel and Program Parallel

Parallel Software Paradigms III: Component Parallel and Program Parallel continued

Parallel Software Paradigms IV: Component Parallel and Program Parallel continued

Data Structure Parallel I

Data Structure Parallel II

Data Structure Parallel III

Parallelizing Compilers I

Parallelizing Compilers II

OpenMP and Parallelizing Compilers

OpenMP Parallel Constructs [Figure: the Master Thread forks and joins a team for each construct, with an implicit barrier synchronization at the join: SECTIONS (Heterogeneous Team), SINGLE, DO/for loop (Homogeneous Team)]

Performance of OpenMP, MPI, CAF, UPC [Figure: plots comparing MPI, OpenMP, UPC and CAF on Multigrid and Conjugate Gradient]

Component Parallel I: MPI

MPI Execution Model [Figure: 8 fixed executing threads (processes)]

MPI Features I

MPI Features II

300 MPI2 routines from Argonne MPICH2

Why people like MPI! [Figure labels: cluster; After Optimization of UPC]

Component Parallel: PGAS Languages I

Ghost Cells
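
The slide's bullets did not survive, so this is a minimal one-dimensional sketch of the general ghost cell idea, under stated assumptions: each part of a decomposed array is padded with one extra cell per side, the padding is refreshed from the neighbouring part before each sweep, and the local update then needs no further remote access. All function names are hypothetical, and the list copies stand in for what would be messages or remote reads on a real machine.

```python
def split_with_ghosts(values, n_parts):
    """Split values into equal parts, each padded with one ghost cell per side."""
    chunk = len(values) // n_parts  # assumes len(values) is divisible by n_parts
    return [[0.0] + values[p * chunk:(p + 1) * chunk] + [0.0] for p in range(n_parts)]


def exchange_ghosts(parts, left_boundary, right_boundary):
    """Fill each ghost cell from the neighbouring part's edge cell (or the fixed boundary)."""
    last = len(parts) - 1
    for p, local in enumerate(parts):
        local[0] = parts[p - 1][-2] if p > 0 else left_boundary
        local[-1] = parts[p + 1][1] if p < last else right_boundary


def local_sweep(local):
    """Average the two neighbours of every owned cell; return only the owned cells."""
    return [(local[i - 1] + local[i + 1]) / 2 for i in range(1, len(local) - 1)]


def parallel_sweep(values, n_parts, left_boundary, right_boundary):
    parts = split_with_ghosts(values, n_parts)
    exchange_ghosts(parts, left_boundary, right_boundary)
    return [x for local in parts for x in local_sweep(local)]


def sequential_sweep(values, left_boundary, right_boundary):
    padded = [left_boundary] + values + [right_boundary]
    return [(padded[i - 1] + padded[i + 1]) / 2 for i in range(1, len(padded) - 1)]


if __name__ == "__main__":
    data = [float(i * i) for i in range(8)]
    print(parallel_sweep(data, 4, 0.0, 100.0) == sequential_sweep(data, 0.0, 100.0))
```

Because the exchange reads only owned cells, never ghosts, the order in which parts are refreshed does not matter.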

PGAS Languages II

Other Component Parallel Models: the appropriate mechanism depends on application structure.

Component Synchronization Patterns

Microsoft CCR

Pipeline, which is the simplest loosely synchronous execution in CCR. Note CCR supports thread spawning model; MPI usually uses fixed threads with message rendezvous. [Figure: Thread0 to Thread3, each with its own port (Port0 to Port3), passing messages from one stage to the next stage]
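
As a rough analogue of the pipeline pattern, and not the CCR API itself, the sketch below uses Python threads with `queue.Queue` objects standing in for ports. Each stage thread reads a message from its own port, processes it, and writes the result to the next port. The names `stage`, `run_pipeline`, and the stop marker are assumptions made for this illustration.

```python
import queue
import threading

_STOP = object()  # marker that tells each stage to shut down and pass the signal on


def stage(func, in_port, out_port):
    """Read messages from in_port, apply func, and write results to out_port."""
    while True:
        message = in_port.get()
        if message is _STOP:
            out_port.put(_STOP)
            return
        out_port.put(func(message))


def run_pipeline(funcs, messages):
    """Run one thread per stage, connected by a chain of ports."""
    ports = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, ports[i], ports[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()
    for message in messages:
        ports[0].put(message)
    ports[0].put(_STOP)
    results = []
    while True:
        out = ports[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, str], [1, 2, 3]))
```

This sketch keeps a fixed thread per stage, which is closer to the MPI style the slide mentions than to a thread spawning model.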

Idealized loosely synchronous endpoint (broadcast) in CCR: an example of MPI Collective in CCR. [Figure: Thread0 to Thread3, each with its own port (Port0 to Port3), sending messages to a single EndPort]

Exchanging Messages with 1D Torus Exchange topology for loosely synchronous execution in CCR. [Figure: Thread0 to Thread3 write exchanged messages to Port0 to Port3, then read messages]

Four Communication Patterns used in CCR Tests: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive. [Figure: Thread0 to Thread3 connected to Port0 to Port3 in each pattern]

Fixed amount of computation ($4 \times 10^{7}$ units) divided into 4 cores and from 1 to $10^{7}$ stages on HP Opteron Multicore. Each stage separated by reading and writing CCR ports in Pipeline mode. 8.04 microseconds per stage averaged from 1 to 10 million stages. [Plot: Time (Seconds) against Stages (millions); 4-way Pipeline Pattern, 4 Dispatcher Threads, HP Opteron; annotations: Overhead = Computation; Computation Component if no Overhead]

Fixed amount of computation ($4 \times 10^{7}$ units) divided into 4 cores and from 1 to $10^{7}$ stages on Dell Xeon Multicore. Each stage separated by reading and writing CCR ports in Pipeline mode. 12.40 microseconds per stage averaged from 1 to 10 million stages. [Plot: Time (Seconds) against Stages (millions); 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon; annotations: Overhead = Computation; Computation Component if no Overhead]

Summary of Stage Overheads for AMD 2-core 2-processor Machine

Summary of Stage Overheads for Intel 2-core 2-processor Machine

Summary of Stage Overheads for Intel 4-core 2-processor Machine

AMD 2-core 2-processor Bandwidth Measurements

Intel 2-core 2-processor Bandwidth Measurements

Typical Bandwidth measurements showing effect of cache with slope change. 5,000 stages with run time plotted against size of double array copied in each stage from thread to stepped locations in a large array on Dell Xeon Multicore. Total Bandwidth 1.0 Gigabytes/Sec up to one million double words and 1.75 Gigabytes/Sec up to 100,000 double words. [Plot: Time (Seconds) against Array Size (Millions of Double Words); 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon; annotation: Slope Change (Cache Effect)]

Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release) [Figure label: DSS Service Measurements]

Parallel Runtime

Horror of Hybrid Computing

A general discussion of some miscellaneous issues

Load Balancing Particle Dynamics [Figure: Equal Volume Decomposition of a Universe Simulation (Galaxy or Star or ...) over 16 Processors]

Reduce Communication [Figure: Block Decomposition; Cyclic Decomposition]

Minimize Load Imbalance [Figure: Block Decomposition; Cyclic Decomposition]
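
The two decompositions named in the figures can be compared with a small sketch. It assumes a one-dimensional list of items whose work grows linearly with index, a hypothetical workload chosen only to make the imbalance visible, and it counts neighbouring items that land on different processors as a stand-in for communication. The function names are illustrative.

```python
def block_owner(i, n_items, n_procs):
    """Block decomposition: contiguous runs of items go to the same processor."""
    block = -(-n_items // n_procs)  # ceiling division
    return i // block


def cyclic_owner(i, n_procs):
    """Cyclic decomposition: items are dealt out to processors in round-robin order."""
    return i % n_procs


def loads(owners, work, n_procs):
    """Total work assigned to each processor."""
    totals = [0] * n_procs
    for owner, w in zip(owners, work):
        totals[owner] += w
    return totals


def cut_edges(owners):
    """Number of neighbouring item pairs that sit on different processors."""
    return sum(1 for a, b in zip(owners, owners[1:]) if a != b)


if __name__ == "__main__":
    n_items, n_procs = 16, 4
    work = [i + 1 for i in range(n_items)]  # assumed workload: cost grows with index
    block = [block_owner(i, n_items, n_procs) for i in range(n_items)]
    cyclic = [cyclic_owner(i, n_procs) for i in range(n_items)]
    print("block ", loads(block, work, n_procs), cut_edges(block))
    print("cyclic", loads(cyclic, work, n_procs), cut_edges(cyclic))
```

With this workload the block decomposition cuts few neighbour links but loads the processors unevenly, while the cyclic decomposition evens out the load at the price of cutting every neighbour link.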

Parallel Irregular Finite Elements [Figure label: Processor]

Irregular Decomposition for Crack [Figure: Region assigned to 1 processor; Work Load Not Perfect!; Processor]

Further Decomposition Strategies [Figures: Computer Chess Tree with Current Position (node in Tree), First Set Moves, Opponents Counter Moves; “California gets its independence”]

Physics Analogy for Load Balancing

Physics Analogy to discuss Load Balancing: $C_i$ is compute time of i’th process. $V_{i,j}$ is communication needed between i and j, and is attractive as it is minimized when i and j are nearby. Processes are particles in analogy.

Forces are generated by constraints of minimizing H and they can be thought of as springs

Suppose we load balance by Annealing the physical analog system
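
The formula for H did not survive extraction, so the sketch below uses an assumed energy: the sum of squared processor loads (built from the compute times $C_i$) plus a weight `lam` times the communication $V_{i,j}$ between processes placed on different processors. That form, the Metropolis acceptance rule, the cooling schedule, and all names are assumptions for illustration, not the lecture's definition.

```python
import math
import random


def energy(assign, compute, comm, n_procs, lam=1.0):
    """Assumed H: squared load per processor plus weighted communication across processors."""
    proc_load = [0.0] * n_procs
    for i, p in enumerate(assign):
        proc_load[p] += compute[i]
    balance = sum(load * load for load in proc_load)
    cut = sum(v for (i, j), v in comm.items() if assign[i] != assign[j])
    return balance + lam * cut


def anneal(compute, comm, n_procs, steps=5000, t_start=5.0, t_end=0.01, seed=0):
    """Move one process at a time, accepting uphill moves with probability exp(-dH / T)."""
    rng = random.Random(seed)
    assign = [0] * len(compute)  # deliberately poor start: everything on processor 0
    current = energy(assign, compute, comm, n_procs)
    best, best_e = assign[:], current
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        i = rng.randrange(len(assign))
        old = assign[i]
        assign[i] = rng.randrange(n_procs)
        new_e = energy(assign, compute, comm, n_procs)
        if new_e <= current or rng.random() < math.exp((current - new_e) / temp):
            current = new_e
            if new_e < best_e:
                best, best_e = assign[:], new_e
        else:
            assign[i] = old  # reject the move
    return best, best_e


if __name__ == "__main__":
    compute = [1.0] * 8                          # eight equal processes
    comm = {(i, i + 1): 0.5 for i in range(7)}   # each talks to its chain neighbour
    print(anneal(compute, comm, n_procs=2))
```

In this toy chain the lowest possible energy is 32.5: four processes on each processor with a single communication link cut between them.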

Optimal v. stable scattered Decompositions [Figure label: Optimal overall]

Time Dependent domain (optimal) Decomposition compared to stable Scattered Decomposition

Use of Time averaged Energy for Adaptive Particle Dynamics
