Parallel Computing 2007: Overview. February 26 – March 1, 2007. Geoffrey Fox, Community Grids Laboratory, Indiana University, 505 N Morton, Suite 224, Bloomington IN. [email_address] http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/
Introduction
Books For Lectures
Some Remarks
Job Mixes (on a Chip) (figure: jobs A1–A4, B, C, D1, D2, E, F sharing one chip)
Three styles of “Jobs” (figure: jobs A1–A4, B, C, D1, D2, E, F)
Data Parallelism in Algorithms
Functional Parallelism in Algorithms
Structure (Architecture) of Applications
Motivating Task
RMS: Recognition Mining Synthesis. Recognition (What is …?): model-based multimodal recognition, giving a model; Mining (Is it …?): find a model instance, with real-time analytics on dynamic, unstructured, multimodal datasets; Synthesis (What if …?): create a model instance, with photo-realism and physics-based animation. Today: model-less, real-time streaming and transactions on static, structured datasets, very limited realism. Tomorrow: model-based.
What is a tumor? Is there a tumor here? What if the tumor progresses? It is all about dealing efficiently with complex multimodal datasets. Recognition Mining Synthesis. Images courtesy: http://splweb.bwh.harvard.edu:8000/pages/images_movies.html
Intel’s Application Stack
Why Parallel Computing is Hard
Structure of Complex Systems (figure: map from the space-time of the natural application, S natural, to the space-time of the computer, S computer)
Languages in Complex Systems Picture
Structure of Modern Java System: GridSphere
Another Java Code: Batik Scalable Vector Graphics SVG Browser
Are Applications Parallel?
Seismic Simulation of Los Angeles Basin: divide the surface into 4 parts and assign the calculation of waves in each part to a separate processor
Parallelizable Software (figure: map from the space-time of the natural application, S natural, to the space-time of the computer, S computer)
Potential in a Vacuum-Filled Rectangular Box
Basic Sequential Algorithm: φ_New = (φ_Left + φ_Right + φ_Up + φ_Down) / 4
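The update rule on this slide can be sketched directly. A minimal pure-Python Jacobi sweep follows; the grid size and sweep count in the usage note are illustrative:

```python
# Jacobi iteration for Laplace's equation: every interior point of the
# grid is repeatedly replaced by the average of its four neighbours,
# while boundary values stay fixed.
def jacobi_sweep(phi):
    n = len(phi)
    new = [row[:] for row in phi]          # copy so all updates use old values
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (phi[i][j - 1] + phi[i][j + 1] +
                         phi[i - 1][j] + phi[i + 1][j]) / 4.0
    return new

def solve(phi, sweeps=100):
    for _ in range(sweeps):
        phi = jacobi_sweep(phi)
    return phi
```

With all boundary values set to 1.0 and the interior started at 0.0, the interior converges toward 1.0 as sweeps accumulate.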
Update on the Mesh (figure: 14 by 14 internal mesh)
Parallelism is Straightforward
Communication is Needed
Communication Must be Reduced
Summary of Laplace Speed Up
All systems have various Dimensions
Parallel Processing in Society: it’s all well known …
Divide problem into parts; one part for each processor (an 8-person parallel processor)
Amdahl’s Law of Parallel Processing
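Amdahl’s law itself is a one-line formula; a sketch, where f is the fraction of the work that must run sequentially:

```python
def amdahl_speedup(serial_fraction, processors):
    # S(P) = 1 / (f + (1 - f) / P): the serial fraction f caps the
    # speedup at 1/f no matter how many processors are used.
    f = serial_fraction
    return 1.0 / (f + (1.0 - f) / processors)
```

For example, with a 10% serial fraction, even a million processors cannot exceed a speedup of 10.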
Typical modern application performance
Performance of Typical Science Code I: FLASH astrophysics code from the DoE Center at Chicago, plotted as run time as a function of number of nodes. Scaled speedup: constant grain size as the number of nodes increases.
Performance of Typical Science Code II: FLASH astrophysics code from the DoE Center at Chicago on Blue Gene. Note that both communication and simulation time are independent of the number of processors: again the scaled speedup scenario.
FLASH is a pretty serious code
Rich Dynamic Irregular Physics
FLASH Scaling at fixed total problem size: rollover occurs at an increasing number of processors as the problem size increases.
Back to Hadrian’s Wall
The Web is also just message passing (figure: neural network)
1984 Slide – today replace hypercube by cluster
Parallelism inside a CPU is called inner parallelism; parallelism between CPUs is called outer parallelism.
And today: sensors
Now we discuss classes of application
“Space-Time” Picture: “internal” (to data chunk) application spatial dependence (n degrees of freedom) maps into time on the computer. (figure: application space-time at t0–t4 mapped onto computer times T0–T4 on a 4-way parallel computer)
Data Parallel Time Dependence: synchronization on MIMD machines is accomplished by messaging; it is automatic on SIMD machines! (figure: synchronous, identical evolution algorithms across application space at t0–t4)
Local Messaging for Synchronization (figure: 8 processors alternating compute and communication phases across application space)
Loosely Synchronous Applications: distinct evolution algorithms for each data point in each processor (figure: application space-time at t0–t4)
Irregular 2D Simulation: Flow over an Airfoil
Heterogeneous Problems
Asynchronous Applications (figure: application space-time)
Computer Chess (figure: game tree with increasing search depth)
Discrete Event Simulations (example: Battle of Hastings)
Dataflow (figure: large application with components Wing Airflow, Radar Signature, Engine Airflow, Structural Analysis, Noise and Optimization linked by a communication bus)
Grid Workflow Datamining in Earth Science (figure: NASA GPS earthquake streaming data support, transformations, data checking, Hidden Markov datamining (JPL), display (GIS); real-time and archival paths)
Grid Workflow Data Assimilation in Earth Science
Web 2.0 has Services of varied pedigree linked by Mashups; expect interesting developments as some of these services run on multicore clients
Mashups are Workflow?
Work/Dataflow and Parallel Computing I
Work/Dataflow and Parallel Computing II
Google MapReduce: Simplified Data Processing on Large Clusters
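The MapReduce model named on this slide fits in a few lines. This toy sketch (not Google’s implementation) shows the three phases the paper describes: a user map function emits (key, value) pairs, the framework groups pairs by key (the shuffle), and a reduce function folds each group; the classic word-count example follows:

```python
from collections import defaultdict

# Toy MapReduce: map, shuffle-by-key, reduce.
def map_reduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):   # map phase emits (key, value)
            groups[key].append(value)     # shuffle: group values by key
    return {key: reducer(key, values)     # reduce phase folds each group
            for key, values in groups.items()}

# Word count: emit (word, 1) per occurrence, then sum per word.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)
```

On real clusters the map and reduce calls run in parallel across machines; the sequential loop here is only the shape of the model.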
Other Application Classes
Event-based “Dataflow” (figure: event broker)
A small discussion of hardware
Blue Gene/L: a complex system with replicated chips and a 3D toroidal interconnect
1987 MPP: 1024 processors in the full system with a ten-dimensional hypercube interconnect
Discussion of Memory Structure and Applications
Parallel Architecture I (figure: cores each with cache, L2 and L3 cache, and main memory; dataflow performance characterized by bandwidth, latency and size)
Communication on Shared Memory Architecture
GPU Coprocessor Architecture
IBM Cell Processor: applications running well on Cell or AMD GPU should run scalably on future mainline multicore chips; focus on memory bandwidth is key (dataflow not deltaflow)
Parallel Architecture II (figure: multicore nodes, each with cores, caches and main memory, joined by an interconnection network; dataflow within a node, “deltaflow” or events between nodes)
Memory to CPU Information Flow
Cache and Distributed Memory Analogues (figure: core with cache, L2 cache, L3 cache, main memory)
Space Time Structure  of a  Hierarchical  Multicomputer
Cache v Distributed Memory Overhead
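The usual grain-size analysis behind this cache-versus-distributed-memory comparison can be made concrete. In the standard estimate for domain-decomposed problems, a processor holding n points of a d-dimensional domain communicates across a surface while computing over a volume, so the fractional overhead falls as n to the power −1/d. The constant c and the t_comm/t_calc values below are placeholders, not measured numbers:

```python
# Standard overhead estimate for domain decomposition: communication
# scales with the surface of a processor's region, computation with
# its volume, so fractional overhead ~ (t_comm / t_calc) / n**(1/d).
def comm_overhead(n, d, t_comm, t_calc, c=1.0):
    return c * (t_comm / t_calc) / n ** (1.0 / d)

def efficiency(n, d, t_comm, t_calc):
    # Parallel efficiency drops as overhead grows.
    return 1.0 / (1.0 + comm_overhead(n, d, t_comm, t_calc))
```

Larger grain size n (more memory per processor) always improves efficiency, which is why caches, with their small effective grain size, pay a relatively higher "communication" price than distributed-memory nodes.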
Space-Time Decompositions for the parallel one-dimensional wave equation (figure: the standard parallel computing choice is marked)
Amdahl’s misleading law I
Amdahl’s misleading law II
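The scaled-speedup rebuttal these slides develop is usually written as Gustafson’s law: keep the work per processor fixed so the parallel part grows with the machine. A one-line sketch, assuming the serial fraction f is measured on the scaled problem:

```python
def gustafson_speedup(serial_fraction, processors):
    # Scaled speedup: S = f + (1 - f) * P. Because the parallel part
    # grows with P, speedup grows without bound even for nonzero f,
    # which is why the fixed-size Amdahl argument can mislead.
    f = serial_fraction
    return f + (1.0 - f) * processors
```

Contrast with the fixed-size law: a 10% serial fraction caps fixed-size speedup at 10, but gives scaled speedup 90.1 on 100 processors.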
Hierarchical Algorithms meet Amdahl (figure: processors 0–3 across mesh levels 0–4)
A Discussion of Software Models
Programming Paradigms
Parallel Software Paradigms I: Workflow
The Marine Corps Lack of Programming Paradigm Library Model
Parallel Software Paradigms II: Component Parallel and Program Parallel
Parallel Software Paradigms III: Component Parallel and Program Parallel continued
Parallel Software Paradigms IV: Component Parallel and Program Parallel continued
Data Structure Parallel I
Data Structure Parallel II
Data Structure Parallel III
Parallelizing Compilers I
Parallelizing Compilers II
OpenMP and Parallelizing Compilers
OpenMP Parallel Constructs (figure: fork/join of a master thread into a heterogeneous team for SECTIONS, a SINGLE region, and a homogeneous team for a DO/for loop, each rejoining with an implicit barrier synchronization)
Performance of OpenMP, MPI, CAF, UPC (figure: Multigrid and Conjugate Gradient benchmarks comparing MPI, OpenMP, UPC and CAF)
Component Parallel I: MPI
MPI Execution Model (figure: 8 fixed executing threads (processes))
MPI Features I
MPI Features II
300 MPI2 routines from Argonne MPICH2
MPICH2 Performance
Multicore MPI Performance
Why people like MPI! (figure: cluster performance, after optimization of UPC)
Component Parallel: PGAS Languages I
Ghost Cells
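Ghost cells can be illustrated without any MPI machinery. In this 1D sketch each "processor" owns a strip of the global array plus one ghost cell per side holding a copy of the neighbour’s edge value; refreshing the ghosts before each compute phase is the analogue of the message exchange (names and the zero boundary are illustrative):

```python
# 1D halo exchange sketch: strips[i] = [left ghost, owned..., right ghost].
def exchange_ghosts(strips):
    for i, strip in enumerate(strips):
        # copy neighbour edge values into the ghosts (0.0 at domain boundary)
        strip[0] = strips[i - 1][-2] if i > 0 else 0.0
        strip[-1] = strips[i + 1][1] if i < len(strips) - 1 else 0.0

def smooth_step(strips):
    # one loosely synchronous step: communicate, then compute locally
    exchange_ghosts(strips)
    for strip in strips:
        inner = [(strip[k - 1] + strip[k + 1]) / 2.0
                 for k in range(1, len(strip) - 1)]
        strip[1:-1] = inner
```

Each strip updates using only its own cells plus ghosts, so the result matches a serial sweep over the whole array.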
PGAS Languages II
Other Component Parallel Models: the appropriate mechanism depends on the application structure
Component Synchronization Patterns
Microsoft CCR
Pipeline, the simplest loosely synchronous execution in CCR. Note CCR supports a thread-spawning model; MPI usually uses fixed threads with message rendezvous. (figure: messages passing through Thread0–Thread3 via Port0–Port3, one stage feeding the next)
Idealized loosely synchronous endpoint (broadcast) in CCR: an example of an MPI collective in CCR (figure: Thread0–Thread3 writing through Port0–Port3 to an EndPort)
Exchanging Messages with a 1D Torus: exchange topology for loosely synchronous execution in CCR (figure: Thread0–Thread3 writing and reading exchanged messages through Port0–Port3)
Four Communication Patterns used in CCR Tests: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive.
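The pipeline pattern (a) has the same shape in plain Python threads and queues. This is a sketch of the pattern only, not of CCR: each stage is a thread reading from its input port (a queue) and writing to the next stage’s port:

```python
import queue
import threading

# One pipeline stage: read from inbox, apply fn, write to outbox.
# A None sentinel shuts the stage down and is forwarded downstream.
def stage(fn, inbox, outbox):
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(fn(item))

def run_pipeline(fns, items):
    ports = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(f, ports[i], ports[i + 1]))
               for i, f in enumerate(fns)]
    for t in threads:
        t.start()
    for item in items:          # feed the first port, then the sentinel
        ports[0].put(item)
    ports[0].put(None)
    results = []
    while (x := ports[-1].get()) is not None:
        results.append(x)
    for t in threads:
        t.join()
    return results
```

Because each port is a FIFO queue served by a single thread, message order is preserved end to end, just as in the figure.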
Fixed amount of computation (4·10^7 units) divided into 4 cores and from 1 to 10^7 stages on HP Opteron multicore, each stage separated by reading and writing CCR ports in pipeline mode: 8.04 microseconds per stage averaged from 1 to 10 million stages. (plot: time in seconds vs stages in millions; overhead = time above the computation component expected with no overhead; 4-way pipeline pattern, 4 dispatcher threads)
Fixed amount of computation (4·10^7 units) divided into 4 cores and from 1 to 10^7 stages on Dell Xeon multicore, each stage separated by reading and writing CCR ports in pipeline mode: 12.40 microseconds per stage averaged from 1 to 10 million stages. (plot: time in seconds vs stages in millions; overhead = time above the computation component expected with no overhead; 4-way pipeline pattern, 4 dispatcher threads)
Summary of Stage Overheads for AMD 2-core 2-processor Machine
Summary of Stage Overheads for Intel 2-core 2-processor Machine
Summary of Stage Overheads for Intel 4-core 2-processor Machine
AMD 2-core 2-processor Bandwidth Measurements
Intel 2-core 2-processor Bandwidth Measurements
Typical bandwidth measurements showing the effect of cache as a slope change: 5,000 stages with run time plotted against the size of the double array copied in each stage from a thread to stepped locations in a large array on Dell Xeon multicore. Total bandwidth 1.0 GB/s up to one million double words and 1.75 GB/s up to 100,000 double words. (plot: time in seconds vs array size in millions of double words; the slope change marks the cache effect; 4-way pipeline pattern, 4 dispatcher threads)
DSS Service Measurements: timing of HP Opteron multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release)
Parallel Runtime
Horror of Hybrid Computing
A general discussion of some miscellaneous issues
Load Balancing Particle Dynamics (figure: equal-volume decomposition of a universe simulation, galaxy or star or ..., 16 processors)
Reduce Communication (figure: block vs cyclic decomposition)
Minimize Load Imbalance (figure: block vs cyclic decomposition)
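The block and cyclic decompositions compared on these slides reduce to two index-to-processor maps; a sketch:

```python
# Two standard ways to hand out n indices to p processors.
def block_owner(i, n, p):
    # Contiguous blocks keep neighbouring indices on the same
    # processor, minimising communication for local stencils.
    return i * p // n

def cyclic_owner(i, p):
    # Round-robin spreads the domain across processors, balancing
    # load when the work per index varies systematically.
    return i % p
```

For 8 indices on 2 processors, block gives [0,0,0,0,1,1,1,1] while cyclic alternates [0,1,0,1,...]; the trade-off is communication volume against load balance.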
Parallel Irregular Finite Elements (figure: mesh regions assigned to processors)
Irregular Decomposition for Crack (figure: region assigned to 1 processor; work load not perfect!)
Further Decomposition Strategies (figure: computer chess tree with the current position as a node, the first set of moves, and the opponent’s counter moves; “California gets its independence”)
Physics Analogy for Load Balancing
Physics Analogy to discuss Load Balancing: processes are particles in the analogy; C_i is the compute time of the i’th process, and V_ij is the communication needed between i and j, attractive since it is minimized when i and j are nearby.
Forces are generated by the constraints of minimizing H, and they can be thought of as springs.
Suppose we load balance by Annealing the physical analog system
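A sketch of that annealing in Python. The energy function here is a hypothetical stand-in for the Hamiltonian H of the preceding slides: squared per-processor loads penalize compute imbalance, and a V_ij term is paid for every communicating pair of processes split across processors:

```python
import math
import random

# Hypothetical energy for the physical analog: H = sum over processors
# of (total compute time)^2 plus V_ij for each cut communicating pair.
def energy(assign, compute, comm, nproc):
    load = [0.0] * nproc
    for proc, c in zip(assign, compute):
        load[proc] += c
    h = sum(l * l for l in load)
    h += sum(v for (i, j), v in comm.items() if assign[i] != assign[j])
    return h

def anneal(compute, comm, nproc, steps=20000, temp=5.0, cool=0.9995, seed=0):
    rng = random.Random(seed)
    assign = [rng.randrange(nproc) for _ in compute]
    h = energy(assign, compute, comm, nproc)
    for _ in range(steps):
        i, new = rng.randrange(len(assign)), rng.randrange(nproc)
        old = assign[i]
        assign[i] = new                      # propose moving one process
        h_new = energy(assign, compute, comm, nproc)
        if h_new <= h or rng.random() < math.exp((h - h_new) / temp):
            h = h_new                        # accept (Metropolis rule)
        else:
            assign[i] = old                  # reject, restore
        temp *= cool                         # cool the system
    return assign, h
</```

As the temperature falls, the "particles" settle into a low-energy configuration: balanced loads with communicating processes kept together.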
Optimal v. stable scattered Decompositions (figure: the optimal overall decomposition)
Time Dependent domain (optimal) Decomposition compared to stable Scattered Decomposition
Use of Time averaged Energy for Adaptive Particle Dynamics
Big Data and Clouds: Research and Education
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 

Recently uploaded

Recently uploaded (20)

Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Agentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfAgentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdf
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
Buy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfBuy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdf
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 

Parallel Computing 2007: Overview

  • 1. Parallel Computing 2007: Overview February 26-March 1 2007 Geoffrey Fox Community Grids Laboratory Indiana University 505 N Morton Suite 224 Bloomington IN [email_address] http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/
  • 11. RMS: Recognition Mining Synthesis.
    Recognition (What is …?) finds a model. Tomorrow: model-based multimodal recognition. Today: model-less.
    Mining (Is it …?) finds a model instance. Tomorrow: real-time analytics on dynamic, unstructured, multimodal datasets. Today: real-time streaming and transactions on static, structured datasets.
    Synthesis (What if …?) creates a model instance. Tomorrow: photo-realism and physics-based animation. Today: very limited realism.
  • 12. What is a tumor? Is there a tumor here? What if the tumor progresses? It is all about dealing efficiently with complex multimodal datasets. Recognition, Mining, Synthesis. Images courtesy: http://splweb.bwh.harvard.edu:8000/pages/images_movies.html
  • 24. Update on the mesh: a 14 by 14 internal mesh.
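The mesh update here is the classic data-parallel kernel these lectures build on. A minimal sketch (not from the slides; the Jacobi averaging form and all names are my assumptions) of one update sweep on a 14 by 14 interior mesh held inside a 16 by 16 grid with a fixed boundary:

```python
import numpy as np

def jacobi_sweep(phi):
    """Replace each interior point by the average of its four neighbours."""
    new = phi.copy()
    new[1:-1, 1:-1] = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                              phi[1:-1, :-2] + phi[1:-1, 2:])
    return new

# 16x16 grid = 14x14 internal mesh plus a fixed boundary
phi = np.zeros((16, 16))
phi[0, :] = 1.0                      # hold one edge at potential 1
for _ in range(200):
    phi = jacobi_sweep(phi)
```

Each sweep touches only nearest neighbours, which is why such a mesh decomposes naturally into blocks that exchange just their edge values.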
  • 29. All systems have various Dimensions
  • 30. Parallel Processing in Society: it’s all well known …
  • 32. Divide the problem into parts, one part for each processor: an 8-person parallel processor.
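The division of labor on this slide can be sketched in code (illustrative, not from the lectures): give each of the P workers a contiguous, near-equal block of the problem.

```python
def block_decompose(n_items, n_workers):
    """Return (start, end) ranges assigning each worker a near-equal share."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)   # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges

# 100 units of work for the 8-person parallel processor
parts = block_decompose(100, 8)
```

Contiguous blocks keep each worker's data local, which matters once communication between parts has a cost.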
  • 40. Performance of Typical Science Code I: the FLASH astrophysics code from the DoE Center at Chicago. Time is plotted as a function of the number of nodes; this is scaled speedup, i.e. constant grain size as the number of nodes increases.
  • 41. Performance of Typical Science Code II: the FLASH astrophysics code from the DoE Center at Chicago on Blue Gene. Note that both communication and simulation time are independent of the number of processors – again the scaled speedup scenario. (Plot curves: Communication; Simulation.)
  • 42. FLASH is a pretty serious code
  • 44. FLASH scaling at fixed total problem size: as the problem size increases, the speedup rollover occurs at an increasing number of processors.
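A hedged sketch of why speedup rolls over at fixed total problem size (all constants are illustrative, not FLASH measurements): compute time shrinks as 1/p, surface-to-volume communication shrinks more slowly, and a per-step latency term grows with p until it dominates.

```python
import math

def step_time(p, n=10**6, t_f=1e-6, t_s=1e-5, t_lat=1e-3):
    calc = n * t_f / p                         # work divided among p processors
    comm = t_s * (n / p) ** (2.0 / 3.0)        # 3D surface-to-volume exchange
    latency = t_lat * math.log2(p)             # e.g. per-step startup/reduction
    return calc + comm + latency

def speedup(p, **kw):
    return step_time(1, **kw) / step_time(p, **kw)
```

Speedup rises, peaks, and then rolls over as the latency term dominates; raising n pushes the peak to a larger processor count, matching the slide.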
  • 46. The Web is also just message passing. (Figure: a neural network.)
  • 47. A 1984 slide – today, replace “hypercube” by “cluster”.
  • 50. Parallelism inside a CPU is called inner parallelism; parallelism between CPUs is called outer parallelism.
  • 53. Now we discuss classes of applications.
  • 66. Web 2.0 has services of varied pedigree linked by mashups – expect interesting developments as some of these services run on multicore clients.
  • 73. A small discussion of hardware
  • 74. Blue Gene/L: a complex system with replicated chips and a 3D toroidal interconnect.
  • 75. A 1987 MPP: 1024 processors in the full system with a ten-dimensional hypercube interconnect.
  • 76. Discussion of Memory Structure and Applications
  • 84. Space Time Structure of a Hierarchical Multicomputer
  • 86. Space-time decompositions for the parallel one-dimensional wave equation; the purely spatial decomposition is the standard parallel computing choice.
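A minimal sketch of the standard spatial decomposition for the 1D wave equation (illustrative code; the leapfrog update and block layout are my assumptions): each block updates its slice after a one-point ghost (halo) exchange with its neighbours, and the decomposed result matches the global update exactly.

```python
import numpy as np

def wave_step(u_prev, u_curr, c=0.5):
    """One leapfrog step of u_tt = c^2 u_xx with fixed endpoints."""
    u_next = u_curr.copy()
    u_next[1:-1] = (2 * u_curr[1:-1] - u_prev[1:-1]
                    + c**2 * (u_curr[2:] - 2 * u_curr[1:-1] + u_curr[:-2]))
    return u_next

def wave_step_decomposed(u_prev, u_curr, n_blocks=4, c=0.5):
    """Same update computed block by block with one-point halo exchange."""
    n = len(u_curr)
    u_next = u_curr.copy()
    bounds = np.linspace(0, n, n_blocks + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        glo, ghi = max(lo - 1, 0), min(hi + 1, n)     # halo from neighbours
        local = wave_step(u_prev[glo:ghi], u_curr[glo:ghi], c)
        u_next[lo:hi] = local[lo - glo:hi - glo]      # keep owned points only
    return u_next

x = np.linspace(0.0, 1.0, 64)
u0 = np.sin(np.pi * x)
u1 = wave_step(u0, u0)           # a simple first step
```

Because only one ghost point per boundary is needed per step, communication scales with the surface of each block while computation scales with its volume.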
  • 90. A Discussion of Software Models
  • 109. 300 MPI2 routines from Argonne MPICH2
  • 119. Pipeline, the simplest loosely synchronous execution in CCR. Note that CCR supports a thread-spawning model, while MPI usually uses fixed threads with message rendezvous. (Figure: messages flow through ports Port0–Port3 into threads Thread0–Thread3, each stage feeding the next.)
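The pipeline pattern can be sketched with Python threads and queues standing in for CCR threads and ports (an analogy, not CCR itself; the stage functions are illustrative):

```python
import queue
import threading

def stage(inport, outport, f):
    """One pipeline stage: receive a message, transform it, send it on."""
    while True:
        msg = inport.get()
        outport.put(None if msg is None else f(msg))   # forward shutdown token
        if msg is None:
            break

ports = [queue.Queue() for _ in range(4)]              # port i feeds stage i
funcs = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
threads = [threading.Thread(target=stage, args=(ports[i], ports[i + 1], f))
           for i, f in enumerate(funcs)]
for t in threads:
    t.start()
for x in range(10):                                    # stream messages in
    ports[0].put(x)
ports[0].put(None)                                     # end of stream
out = []
while (m := ports[3].get()) is not None:
    out.append(m)
for t in threads:
    t.join()
# each input x emerges as ((x + 1) * 2) - 3, in order
```

This mirrors the figure: each message flows Port0 → Thread0 → Port1 → Thread1 → …, with the threads overlapping work on successive messages.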
  • 120. Idealized loosely synchronous endpoint (broadcast) in CCR: an example of an MPI collective in CCR. (Figure: Thread0–Thread3 send messages from ports Port0–Port3 to a common EndPort.)
  • 121. Exchange topology for loosely synchronous execution in CCR: exchanging messages with a 1D torus. (Figure: Thread0–Thread3 write exchanged messages to ports Port0–Port3, then read the messages from their neighbours.)
  • 122. Four communication patterns used in CCR tests: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. (a) and (b) use CCR Receive, while (c) and (d) use CCR Multiple Item Receive.
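The data movement of these patterns can be sketched as index maps over a ring of four threads (pure functions here for clarity; in CCR each arrow is a port receive):

```python
def pipeline_pass(data):
    """(a) Pipeline: each thread passes to the next; no wraparound."""
    return [None] + data[:-1]

def shift(data):
    """(b) Shift: cyclic shift around the ring."""
    return data[-1:] + data[:-1]

def exchange(data):
    """(d) Exchange: each thread receives from both 1D-torus neighbours."""
    n = len(data)
    return [(data[(i - 1) % n], data[(i + 1) % n]) for i in range(n)]

def two_shifts(data):
    """(c) Two Shifts: the same data as (d), moved as two shift operations."""
    left = shift(data)
    right = data[1:] + data[:1]
    return list(zip(left, right))
```

Patterns (c) and (d) deliver identical data; the difference the slide highlights is in the receive mechanism (two single receives versus one Multiple Item Receive).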
  • 123. Fixed amount of computation (4×10^7 units) divided into 4 cores and from 1 to 10^7 stages on HP Opteron multicore, with each stage separated by reading and writing CCR ports in pipeline mode (4-way pipeline pattern, 4 dispatcher threads). Plot: time in seconds versus stages in millions. Result: 8.04 microseconds per stage, averaged from 1 to 10 million stages. (Figure annotations: Overhead; Computation component if no overhead.)
  • 124. Fixed amount of computation (4×10^7 units) divided into 4 cores and from 1 to 10^7 stages on Dell Xeon multicore, with each stage separated by reading and writing CCR ports in pipeline mode (4-way pipeline pattern, 4 dispatcher threads). Plot: time in seconds versus stages in millions. Result: 12.40 microseconds per stage, averaged from 1 to 10 million stages. (Figure annotations: Overhead; Computation component if no overhead.)
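The per-stage numbers on these two slides are the slope of total time versus stage count. A worked example (the 10-second compute time is illustrative; only the 8.04 and 12.40 microsecond slopes come from the slides):

```python
def per_stage_overhead(total_time, compute_time, stages):
    """Synchronization cost per stage, from T(S) = T_compute + S * t_stage."""
    return (total_time - compute_time) / stages

# Opteron: 10 s of computation split into 10 million pipeline stages
t_opteron = per_stage_overhead(10 + 1e7 * 8.04e-6, 10, 1e7)
# Xeon: the same computation, with a higher per-stage cost
t_xeon = per_stage_overhead(10 + 1e7 * 12.40e-6, 10, 1e7)
```

Because the slope is constant from 1 to 10 million stages, the synchronization cost really is per stage, not amortized.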
  • 130. Typical bandwidth measurements showing the effect of cache as a slope change: 5,000 stages, with run time plotted against the size of the double array copied in each stage from a thread to stepped locations in a large array on Dell Xeon multicore (4-way pipeline pattern, 4 dispatcher threads). Plot: time in seconds versus array size in millions of double words. Total bandwidth is 1.0 gigabytes/sec up to one million double words and 1.75 gigabytes/sec up to 100,000 double words; the slope change marks the cache boundary.
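The bandwidth figures follow from bytes moved per unit time. A hedged formula sketch (the run time below is constructed to reproduce the slide's 1.75 GB/s, so this is illustrative arithmetic, not a measurement):

```python
def bandwidth_gb_s(n_doubles, stages, run_time_s):
    """Total bandwidth: 8 bytes per double word, copied once per stage."""
    return n_doubles * 8 * stages / run_time_s / 1e9

# 100,000 double words per stage over 5,000 stages: the slide's in-cache
# 1.75 GB/s corresponds to a run time of a little over 2 seconds
run_time = 100_000 * 8 * 5_000 / 1.75e9
bw = bandwidth_gb_s(100_000, 5_000, run_time)
```

Past the cache boundary the same formula with a longer measured run time yields the lower 1.0 GB/s figure.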
  • 134. A general discussion of some miscellaneous issues.
  • 144. Suppose we load balance by annealing the physical analog system.
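A minimal sketch of load balancing by annealing (illustrative: the task costs, the sum-of-squares energy, and the cooling schedule are my choices, not from the lectures): tasks migrate between processors, and uphill moves are accepted with probability exp(-dE/T) while the temperature T cools.

```python
import math
import random

def anneal_balance(costs, n_procs, steps=20000, t0=10.0, seed=1):
    """Assign tasks to processors, annealing a sum-of-squared-loads energy."""
    rng = random.Random(seed)
    assign = [i % n_procs for i in range(len(costs))]   # start round-robin
    loads = [0.0] * n_procs
    for i, c in enumerate(costs):
        loads[assign[i]] += c
    energy = lambda l: sum(x * x for x in l)            # minimized at balance
    e = energy(loads)
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9           # linear cooling
        i = rng.randrange(len(costs))
        p_old, p_new = assign[i], rng.randrange(n_procs)
        if p_new == p_old:
            continue
        loads[p_old] -= costs[i]
        loads[p_new] += costs[i]
        e_new = energy(loads)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temp):
            assign[i], e = p_new, e_new                 # accept the move
        else:
            loads[p_old] += costs[i]                    # undo the move
            loads[p_new] -= costs[i]
    return assign, loads

costs = list(range(1, 41))
assign, loads = anneal_balance(costs, 4)
```

Minimizing the sum of squared loads at fixed total load is equivalent to equalizing the loads, which is the physical-analog picture the slide appeals to.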
  • 146. Time-dependent (optimal) domain decomposition compared with a stable scattered decomposition.
  • 147. Use of time-averaged energy for adaptive particle dynamics.