7/17/2009 1 Parallel and High Performance Computing Burton Smith Technical Fellow Microsoft
Agenda Introduction Definitions Architecture and Programming Examples Conclusions 7/17/2009 2
Introduction 7/17/2009 3
“Parallel and High Performance”? “Parallel computing is a form of computation in which many calculations are carried out simultaneously” G.S. Almasi and A. Gottlieb, Highly Parallel Computing. Benjamin/Cummings, 1994 A High Performance (Super) Computer is: One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark A computer that costs 200,000,000 rubles or more Necessarily parallel, at least since the 1970s 7/17/2009 4
Recent Developments For 20 years, parallel and high performance computing have been the same subject Parallel computing is now mainstream It reaches well beyond HPC into client systems: desktops, laptops, mobile phones HPC software once had to stand alone Now, it can be based on parallel PC software The result: better tools and new possibilities 7/17/2009 5
 The Emergence of the Parallel Client Uniprocessor performance is leveling off Instruction-level parallelism nears a limit (ILP Wall) Power is getting painfully high (Power Wall) Caches show diminishing returns (Memory Wall) Logic density continues to grow (Moore’s Law) So uniprocessors will collapse in area and cost Cores per chip need to increase exponentially We must all learn to write parallel programs So new “killer apps” will enjoy more speed
The ILP Wall Instruction-level parallelism preserves the serial programming model While getting speed from “undercover” parallelism For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, … At best, we get a few instructions/clock † Y.N. Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109−116.
The Power Wall In the old days, power was kept roughly constant Dynamic power, equal to CV²f, dominated Every shrink of 0.7 in feature size halved transistor area Capacitance C and voltage V also decreased by 0.7 Even with the clock frequency f increased by 1.4, power per transistor was cut in half Now, shrinking no longer reduces V very much So even at constant frequency, power density doubles Static (leakage) power is also getting worse Simpler, slower processors are more efficient And to conserve power, we can turn some of them off
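A worked check of the scaling arithmetic above, using the numbers on the slide (linear shrink of 0.7, clock frequency up 1.4):

$$P = CV^2 f \;\longrightarrow\; P' = (0.7C)(0.7V)^2(1.4f) \approx 0.48\,CV^2 f$$

so per-transistor power is roughly halved while transistor area also halves (0.7 × 0.7 ≈ 0.5), keeping power density constant. Once V stops scaling, that cancellation is lost, which is the power-density problem the slide describes.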
The Memory Wall We can get bigger caches from more transistors Does this suffice, or is there a problem scaling up? To speed up 2X without changing bandwidth below the cache, the miss rate must be halved How much bigger does the cache have to be?† For dense matrix multiply or dense LU, 4x bigger For sorting or FFTs, the square of its former size For sparse or dense matrix-vector multiply, impossible Deeper interconnects increase miss latency Latency tolerance needs memory access parallelism † H.T. Kung, “Memory requirements for balanced computer architectures,”   13th International Symposium on Computer Architecture, 1986, pp. 49−54.
Overcoming the Memory Wall Provide more memory bandwidth Increase DRAM I/O bandwidth per gigabyte Increase microprocessor off-chip bandwidth Use architecture to tolerate memory latency More latency → more threads or longer vectors No change in programming model is needed Use caches for bandwidth as well as latency Let compilers control locality Keep cache lines short Avoid mis-speculation
The End of The von Neumann Model “Instructions are executed one at a time…” We have relied on this idea for 60 years Now it (and things it brought) must change Serial programming is easier than parallel programming, at least for the moment But serial programs are now slow programs We need parallel programming paradigms that will make all programmers successful The stakes for our field’s vitality are high Computing must be reinvented
Definitions 7/17/2009 12
Asymptotic Notation Quantities are often meaningful only within a constant factor Algorithm performance analyses, for example f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |cg(n)| f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |cg(n)| f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n)) 7/17/2009 13
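A small worked instance of these definitions (my example, not from the slides):

$$3n^2 + 5n = \Theta(n^2), \quad\text{since}\quad 3n^2 \le 3n^2 + 5n \le 4n^2 \text{ for all } n \ge 5$$

i.e. c = 3 gives the Ω bound for any n0, and c = 4 with n0 = 5 gives the O bound.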
Speedup, Time, and Work The speedup of a computation is how much faster it runs in parallel compared to serially If one processor takes T1 and p of them take Tp then the p-processor speedup is Sp = T1/Tp The work done is the number of operations performed, either serially or in parallel W1 = O(T1) is the serial work, Wp the parallel work We say a parallel computation is work-optimal if Wp = O(W1) = O(T1) We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p) 7/17/2009 14
Latency, Bandwidth, & Concurrency In any system that moves items from input to output without creating or destroying them, latency × bandwidth = concurrency Queueing theory calls this result Little’s law In the slide’s picture: latency = 3, bandwidth = 2, so concurrency = 6
Architecture and Programming 7/17/2009 16
Parallel Processor Architecture SIMD: Each instruction operates concurrently on multiple data items MIMD: Multiple instruction sequences execute concurrently Concurrency is expressible in space or time Spatial: the hardware is replicated Temporal: the hardware is pipelined 7/17/2009 17
Trends in Parallel Processors Today’s chips are spatial MIMD at top level To get enough performance, even in PCs Temporal MIMD is also used SIMD is tending back toward spatial Intel’s Larrabee combines all three Temporal concurrency is easily “adjusted” Vector length or number of hardware contexts Temporal concurrency tolerates latency Memory latency in the SIMD case For MIMD, branches and synchronization also 7/17/2009 18
Parallel Memory Architecture A shared memory system is one in which any processor can address any memory location Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth A distributed memory system is one in which processors can’t address most of memory The disjoint memory regions and their associated processors are usually called nodes A cluster is a distributed memory system with more than one processor per node Nearly all HPC systems are clusters   7/17/2009 19
Parallel Programming Variations Data Parallelism and Task Parallelism Functional Style and Imperative Style Shared Memory and Message Passing …and more we won’t have time to look at A parallel application may use all of them 7/17/2009 20
Data Parallelism and Task Parallelism A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items Applying the same function to every element of a data sequence, for example A computation is task parallel when dissimilar independent sub-computations are done simultaneously Controlling the motions of a robot, for example It sounds like SIMD vs. MIMD, but isn’t quite Some kinds of data parallelism need MIMD 7/17/2009 21
Functional and Imperative Programs A program is said to be written in (pure) functional style if it has no mutable state Computing = naming and evaluating expressions  Programs with mutable state are usually called imperative because the state changes must be done when and where specified: while (z < x) { x = y; y = z; z = f(x, y);} return y; Often, programs can be written either way: let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y; 7/17/2009 22
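A minimal C++ rendering of the two fragments above, assuming some function f(x, y) is available (a sketch, not the slide's own code):

int f(int x, int y);                      // assumed to exist, as on the slide

int imperative(int x, int y, int z) {     // imperative style: mutable state updated in place
    while (z < x) { x = y; y = z; z = f(x, y); }
    return y;
}

int w(int x, int y, int z) {              // functional style: no mutation, rebinding via recursion
    return (z < x) ? w(y, z, f(x, y)) : y;
}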
Shared Memory and Message Passing Shared memory programs access data in a shared address space When to access the data is the big issue Subcomputations therefore must synchronize Message passing programs transmit data between subcomputations The sender computes a value and then sends it The receiver receives a value and then uses it Synchronization can be built into communication Message passing can be implemented very well on shared memory architectures 7/17/2009 23
Barrier Synchronization A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived It is named after the barrier used to start horse races It guarantees everything before the barrier finishes before anything after it begins It is a central feature in several data-parallel languages such as OpenMP 7/17/2009 24
Mutual Exclusion This type of synchronization ensures only one subcomputation can do a thing at any time If the thing is a code block, it is a critical section It classically uses a lock: a data structure with which subcomputations can stop and start Basic operations on a lock object L might be  Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership Release(L): yields L and unblocks some Acquire(L) A lot has been written on these subjects 7/17/2009 25
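A minimal sketch of Acquire/Release using C++'s standard mutex as the lock object L (my illustration, not from the slides):

#include <mutex>

std::mutex L;              // the lock object
long counter = 0;          // shared state protected by L

void add_one() {
    L.lock();              // Acquire(L): block until we have exclusive ownership
    ++counter;             // critical section: only one subcomputation runs this at a time
    L.unlock();            // Release(L): unblock some waiting Acquire(L)
}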
Non-Blocking Synchronization The basic idea is to achieve mutual exclusion using memory read-modify-write operations Most commonly used is compare-and-swap:  CAS(addr, old, new) reads memory at addr and if it contains old then old is replaced by new Arbitrary update operations at an addr require {read old; compute new; CAS(addr, old, new);}be repeated until the CAS operation succeeds If there is significant updating contention at addr, the repeated computation of new may be wasteful 7/17/2009 26
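A sketch of the retry loop described above, using std::atomic's compare_exchange as the compare-and-swap (an assumed mapping; the slide's CAS is the abstract operation):

#include <atomic>

// Apply an arbitrary update f to the int stored in `cell`, retrying until the CAS succeeds.
void atomic_update(std::atomic<int>& cell, int (*f)(int)) {
    int old = cell.load();
    int desired = f(old);
    while (!cell.compare_exchange_weak(old, desired)) {
        // On failure, `old` has been reloaded with the current value; recompute `new`.
        desired = f(old);
    }
}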
Load Balancing Some processors may be busier than others To balance the workload, subcomputations can be scheduled on processors dynamically A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations In guided self-scheduling, the chunk sizes shrink Analogous imbalances can occur in memory Overloaded memory locations are called hot spots Parallel algorithms and data structures must be designed to avoid them Imbalanced messaging is sometimes seen 7/17/2009 27
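A sketch of self-scheduling with a shared chunk counter (my illustration of the idea; the chunk size is fixed here, whereas guided self-scheduling would shrink it as iterations run out):

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Workers repeatedly grab `chunk` iterations from a shared counter until none remain.
void self_schedule(int n, int chunk, int nthreads, void (*body)(int)) {
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (;;) {
                int begin = next.fetch_add(chunk);        // grab the next chunk of iterations
                if (begin >= n) break;
                int end = std::min(begin + chunk, n);
                for (int i = begin; i < end; ++i) body(i);
            }
        });
    for (auto& w : workers) w.join();
}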
Examples 7/17/2009 28
A Data Parallel Example: Sorting 7/17/2009 29
void sort(int *src, int *dst, int size, int nvals) {
  int i, j, t1[nvals], t2[nvals];
  for (j = 0; j < nvals; j++) { t1[j] = 0; }
  for (i = 0; i < size; i++) { t1[src[i]]++; }
  // t1[] now contains a histogram of the values
  t2[0] = 0;
  for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
  // t2[j] now contains the origin for value j
  for (i = 0; i < size; i++) { dst[t2[src[i]]++] = src[i]; }
}
When Is a Loop Parallelizable? The loop instances must safely interleave A way to do this is to only read the data  Another way is to isolate data accesses Look at the first loop: The accesses to t1[] are isolated from each other  This loop can run in parallel “as is” 7/17/2009 30 for (j = 0 ; j < nvals ; j++) {     t1[j] = 0; }
Isolating Data Updates The second loop seems to have a problem: Two iterations may access the same t1[src[i]] If both reads precede both increments, oops! A few ways to isolate the iteration conflicts: Use an “isolated update” (lock prefix) instruction Use an array of locks, perhaps as big as t1[]  Use non-blocking updates Use a transaction 7/17/2009 31 for (i = 0 ; i < size ; i++) {     t1[src[i]]++; }
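For concreteness, here is the non-blocking-update option from the list above, sketched with the histogram declared as std::atomic<int> (the other options would use a lock-prefixed instruction, an array of locks, or a transaction instead):

#include <atomic>

// Histogram loop with isolated updates: concurrent increments of the same
// t1[src[i]] can no longer lose counts.
void histogram(const int* src, int size, std::atomic<int>* t1) {
    for (int i = 0; i < size; i++) {
        t1[src[i]].fetch_add(1, std::memory_order_relaxed);
    }
}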
Dependent Loop Iterations The 3rd loop is an interesting challenge: Each iteration depends on the previous one This loop is an example of a prefix computation If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, … Prefix computations are often known as scans Scan can be done efficiently in parallel 7/17/2009 32 for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
Cyclic Reduction Each vertical line represents a loop iteration The associated sequence element is to its right On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k 7/17/2009 33
Start:              a  b   c    d     e      f       g
After the 1st step: a  ab  bc   cd    de     ef      fg
After the 2nd step: a  ab  abc  abcd  bcde   cdef    defg
After the 3rd step: a  ab  abc  abcd  abcde  abcdef  abcdefg
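A serial simulation of the steps pictured above (my sketch): on each step, element j is prefixed with the value 2^k positions to its left; in a parallel implementation every j would take its step simultaneously, with a barrier between steps.

#include <vector>

// Inclusive prefix sum by repeated doubling, mirroring the diagram's steps.
std::vector<int> scan(std::vector<int> a) {
    const int n = static_cast<int>(a.size());
    for (int stride = 1; stride < n; stride *= 2) {   // stride = 2^k on step k
        std::vector<int> prev = a;                    // values from the previous step
        for (int j = stride; j < n; ++j)
            a[j] = prev[j - stride] + prev[j];        // prefix own value with the one from j - 2^k
    }
    return a;                                         // a[j] = x0 + x1 + ... + xj
}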
Applications of Scan Linear recurrences like the third loop Polynomial evaluation String comparison High-precision addition Finite automata Each xi is the next-state function given the ith input symbol and • is function composition APL compress When only the final value is needed, the computation is called a reduction instead It’s a little bit cheaper than a full scan
More Iterations n Than Processors p 7/17/2009 35 Wp = 3n + O(p log p), Tp = 3n/p + O(log p)
OpenMP OpenMP is a widely-implemented extension to C, C++, and Fortran for data† parallelism It adds directives to serial programs A few of the more important directives: #pragma omp parallel for <modifiers> <for loop> #pragma omp atomic <binary op=, ++ or -- statement> #pragma omp critical <name> <structured block> #pragma omp barrier 7/17/2009 36 †And perhaps task parallelism soon
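As a hedged illustration of these directives, the first two loops of the sorting example could be parallelized like this (my sketch; the slides only walk through the scan loop explicitly):

#include <omp.h>

// Clear the histogram and build it in parallel; src, t1, size, nvals as in sort().
void parallel_histogram(const int* src, int* t1, int size, int nvals) {
    #pragma omp parallel for
    for (int j = 0; j < nvals; j++)
        t1[j] = 0;

    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        #pragma omp atomic
        t1[src[i]]++;
    }
}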
The Sorting Example in OpenMP Only the third “scan” loop is a problem We can at least do this loop “manually”: 7/17/2009 37
nt = omp_get_num_threads();
int ta[nt], tb[nt];
#pragma omp parallel for
for (myt = 0; myt < nt; myt++) {
  // Set ta[myt] = local sum of nvals/nt elements of t1[]
  #pragma omp barrier
  for (k = 1; k <= myt; k *= 2) {
    tb[myt] = ta[myt];
    ta[myt] += tb[myt - k];
    #pragma omp barrier
  }
  fix = (myt > 0) ? ta[myt - 1] : 0;
  // Set nvals/nt elements of t2[] to fix + local scan of t1[]
}
Parallel Patterns Library (PPL) PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime It supports mixed data- and task-parallelism: parallel_for, parallel_for_each, parallel_invoke, agent, send, receive, choice, join, task_group Parallel loops use C++ lambda expressions: Updates can be isolated using intrinsic functions Microsoft and Intel plan to unify PPL and TBB 7/17/2009 38
parallel_for(1, nvals, [&t1](int j) {
  t1[j] = 0;
});
(void)_InterlockedIncrement(&t1[src[i]]);
Dynamic Resource Management PPL programs are written for an arbitrary number of processors, which could be just one Load balancing is mostly done by work stealing There are two kinds of work to steal: Work that is unblocked and waiting for a processor Work that is not yet started and is potentially parallel Work of the latter kind will be done serially unless it is first stolen by another processor This makes recursive divide and conquer easy There is no concern about when to stop parallelism 7/17/2009 39
A Quicksort Example 7/17/2009 40
void quicksort(vector<int>::iterator first,
               vector<int>::iterator last) {
  if (last - first < 2) { return; }
  int pivot = *first;
  auto mid1 = partition(first, last,
                        [=](int e) { return e < pivot; });
  auto mid2 = partition(mid1, last,
                        [=](int e) { return e == pivot; });
  parallel_invoke(
    [=] { quicksort(first, mid1); },
    [=] { quicksort(mid2, last); }
  );
}
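A possible call site for the routine above (assumed usage, not from the slides); parallel_invoke only exposes the parallelism, and the ConcRT scheduler decides how much of it actually runs concurrently:

#include <algorithm>     // std::partition, used by quicksort above
#include <vector>
#include <ppl.h>         // concurrency::parallel_invoke
using namespace concurrency;
using namespace std;
// (In a real source file these includes would precede the quicksort definition.)

int main() {
    vector<int> v = {5, 3, 8, 1, 9, 2, 7, 3};
    quicksort(v.begin(), v.end());    // sorts v in place
    return 0;
}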
LINQ and PLINQ LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F# A LINQ query is really just a functional monad It queries databases, XML, or any IEnumerable PLINQ is a parallel implementation of LINQ Non-isolated functions must be avoided Otherwise it is hard to tell the two apart 7/17/2009 41
A LINQ Example 7/17/2009 42 (the PLINQ version queries names.AsParallel() instead of names)
var q = from n in names
        where n.Name == queryInfo.Name &&
              n.State == queryInfo.State &&
              n.Year >= yearStart && n.Year <= yearEnd
        orderby n.Year ascending
        select n;
Message Passing Interface (MPI) MPI is a widely used message passing library for distributed memory HPC systems Some of its basic functions: A few of its “collective communication” functions: 7/17/2009 43 MPI_Init MPI_Comm_rank MPI_Comm_size MPI_Send MPI_Recv MPI_Reduce MPI_Allreduce MPI_Scan MPI_Exscan MPI_Barrier MPI_Gather MPI_Allgather MPI_Alltoall
Sorting in MPI Roughly, it could work like this on n nodes: Run the first two loops locally Use MPI_Allreduce to build a global histogram Run the third loop (redundantly) at every node Allocate n value intervals to nodes (redundantly) Balancing the data per node as well as possible Run the fourth loop using the local histogram Use MPI_Alltoall to redistribute the data Merge the n sorted subarrays on each node Collective communication is expensive But sorting needs it (see the Memory Wall slide) 7/17/2009 44
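A sketch of the global-histogram step in the outline above, using the MPI_Allreduce collective named on the previous slide (the wrapper function is my illustration):

#include <mpi.h>

// Sum the per-node histograms element-wise so that every rank ends up
// holding the same global histogram of nvals counts.
void global_histogram(const int* local_hist, int* global_hist, int nvals) {
    MPI_Allreduce(local_hist, global_hist, nvals,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}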
Another Way to Sort in MPI The Samplesort algorithm is like Quicksort It works like this on n nodes: Sort the local data on each node independently Take s samples of the sorted data on each node Use MPI_Allgather to send all nodes all samples Compute n − 1 splitters (redundantly) on all nodes Balancing the data per node as well as possible Use MPI_Alltoall to redistribute the data Merge the n sorted subarrays on each node 7/17/2009 45
Conclusions 7/17/2009 46
Parallel Computing Has Arrived We must rethink how we write programs And we are definitely doing that Other things will also need to change Architecture Operating systems Algorithms Theory Application software We are seeing the biggest revolution in computing since its very beginnings 7/17/2009 47

More Related Content

What's hot

Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
Dr Shashikant Athawale
 
Matrix Multiplication Report
Matrix Multiplication ReportMatrix Multiplication Report
Matrix Multiplication Report
International Islamic University
 
Introduction To Parallel Computing
Introduction To Parallel ComputingIntroduction To Parallel Computing
Introduction To Parallel Computing
Jörn Dinkla
 
2014 valat-phd-defense-slides
2014 valat-phd-defense-slides2014 valat-phd-defense-slides
2014 valat-phd-defense-slides
Sébastien Valat
 
Bulk-Synchronous-Parallel - BSP
Bulk-Synchronous-Parallel - BSPBulk-Synchronous-Parallel - BSP
Bulk-Synchronous-Parallel - BSP
Md Syed Ahamad
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platforms
Syed Zaid Irshad
 
Dryad Paper Review and System Analysis
Dryad Paper Review and System AnalysisDryad Paper Review and System Analysis
Dryad Paper Review and System Analysis
JinGui LI
 
Feng’s classification
Feng’s classificationFeng’s classification
Feng’s classification
Narayan Kandel
 
Lecture 2
Lecture 2Lecture 2
Lecture 2Mr SMAK
 
Solution(1)
Solution(1)Solution(1)
Solution(1)
Gopi Saiteja
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
Parallel computing
Parallel computingParallel computing
Parallel computingvirend111
 
Parallel Computing - Lec 5
Parallel Computing - Lec 5Parallel Computing - Lec 5
Parallel Computing - Lec 5
Shah Zaib
 
Chapter 1 pc
Chapter 1 pcChapter 1 pc
Chapter 1 pc
Hanif Durad
 
Modelling Adaptation Policies As Domain-Specific Constraints
Modelling Adaptation Policies As Domain-Specific ConstraintsModelling Adaptation Policies As Domain-Specific Constraints
Modelling Adaptation Policies As Domain-Specific Constraints
FoCAS Initiative
 
Chapter 4 pc
Chapter 4 pcChapter 4 pc
Chapter 4 pc
Hanif Durad
 
Chap5 slides
Chap5 slidesChap5 slides
Chap5 slides
BaliThorat1
 
Chap6 slides
Chap6 slidesChap6 slides
Chap6 slides
BaliThorat1
 

What's hot (20)

Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
 
Matrix Multiplication Report
Matrix Multiplication ReportMatrix Multiplication Report
Matrix Multiplication Report
 
Introduction To Parallel Computing
Introduction To Parallel ComputingIntroduction To Parallel Computing
Introduction To Parallel Computing
 
2014 valat-phd-defense-slides
2014 valat-phd-defense-slides2014 valat-phd-defense-slides
2014 valat-phd-defense-slides
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
Bulk-Synchronous-Parallel - BSP
Bulk-Synchronous-Parallel - BSPBulk-Synchronous-Parallel - BSP
Bulk-Synchronous-Parallel - BSP
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platforms
 
Dryad Paper Review and System Analysis
Dryad Paper Review and System AnalysisDryad Paper Review and System Analysis
Dryad Paper Review and System Analysis
 
Feng’s classification
Feng’s classificationFeng’s classification
Feng’s classification
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Solution(1)
Solution(1)Solution(1)
Solution(1)
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 
Parallel Computing - Lec 5
Parallel Computing - Lec 5Parallel Computing - Lec 5
Parallel Computing - Lec 5
 
Chapter 1 pc
Chapter 1 pcChapter 1 pc
Chapter 1 pc
 
Modelling Adaptation Policies As Domain-Specific Constraints
Modelling Adaptation Policies As Domain-Specific ConstraintsModelling Adaptation Policies As Domain-Specific Constraints
Modelling Adaptation Policies As Domain-Specific Constraints
 
CO Module 5
CO Module 5CO Module 5
CO Module 5
 
Chapter 4 pc
Chapter 4 pcChapter 4 pc
Chapter 4 pc
 
Chap5 slides
Chap5 slidesChap5 slides
Chap5 slides
 
Chap6 slides
Chap6 slidesChap6 slides
Chap6 slides
 

Viewers also liked

отладка Mpi приложений
отладка Mpi приложенийотладка Mpi приложений
отладка Mpi приложенийMichael Karpov
 
2009 10-31 есть ли жизнь после mpi
2009 10-31 есть ли жизнь после mpi2009 10-31 есть ли жизнь после mpi
2009 10-31 есть ли жизнь после mpiMichael Karpov
 
Ropa para Mujer
Ropa para MujerRopa para Mujer
Ropa para Mujerpulleiva
 
трасировка Mpi приложений
трасировка Mpi приложенийтрасировка Mpi приложений
трасировка Mpi приложенийMichael Karpov
 
якобовский - введение в параллельное программирование (1)
якобовский - введение в параллельное программирование (1)якобовский - введение в параллельное программирование (1)
якобовский - введение в параллельное программирование (1)Michael Karpov
 
российские суперкомпьютеры (современность)
российские суперкомпьютеры (современность)российские суперкомпьютеры (современность)
российские суперкомпьютеры (современность)Michael Karpov
 

Viewers also liked (8)

отладка Mpi приложений
отладка Mpi приложенийотладка Mpi приложений
отладка Mpi приложений
 
Les03
Les03Les03
Les03
 
2009 10-31 есть ли жизнь после mpi
2009 10-31 есть ли жизнь после mpi2009 10-31 есть ли жизнь после mpi
2009 10-31 есть ли жизнь после mpi
 
Ropa para Mujer
Ropa para MujerRopa para Mujer
Ropa para Mujer
 
трасировка Mpi приложений
трасировка Mpi приложенийтрасировка Mpi приложений
трасировка Mpi приложений
 
якобовский - введение в параллельное программирование (1)
якобовский - введение в параллельное программирование (1)якобовский - введение в параллельное программирование (1)
якобовский - введение в параллельное программирование (1)
 
российские суперкомпьютеры (современность)
российские суперкомпьютеры (современность)российские суперкомпьютеры (современность)
российские суперкомпьютеры (современность)
 
Test design print
Test design printTest design print
Test design print
 

Similar to 20090720 smith

parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
Jayanti Prasad Ph.D.
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
Zvi Avraham
 
Parallel computing persentation
Parallel computing persentationParallel computing persentation
Parallel computing persentation
VIKAS SINGH BHADOURIA
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
Sudarsun Santhiappan
 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Marcirio Chaves
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computing
Mehul Patel
 
Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
Malobe Lottin Cyrille Marcel
 
Introduction to parallel computing
Introduction to parallel computingIntroduction to parallel computing
Introduction to parallel computing
VIKAS SINGH BHADOURIA
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
Peter Lawrey
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear Models
Revolution Analytics
 
An Overview of Distributed Debugging
An Overview of Distributed DebuggingAn Overview of Distributed Debugging
An Overview of Distributed Debugging
Anant Narayanan
 
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
inside-BigData.com
 
Mainframe
MainframeMainframe
Mainframe
Kanika Kapoor
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
Joy Rahman
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
Ameya Waghmare
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
KRamasamy2
 
CAQA5e_ch1 (3).pptx
CAQA5e_ch1 (3).pptxCAQA5e_ch1 (3).pptx
CAQA5e_ch1 (3).pptx
SPOCSumaLatha
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
Anil Bohare
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
VAISHNAVI MADHAN
 

Similar to 20090720 smith (20)

parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Parallel computing persentation
Parallel computing persentationParallel computing persentation
Parallel computing persentation
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computing
 
Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
 
Introduction to parallel computing
Introduction to parallel computingIntroduction to parallel computing
Introduction to parallel computing
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear Models
 
An Overview of Distributed Debugging
An Overview of Distributed DebuggingAn Overview of Distributed Debugging
An Overview of Distributed Debugging
 
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
 
Mainframe
MainframeMainframe
Mainframe
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
CAQA5e_ch1 (3).pptx
CAQA5e_ch1 (3).pptxCAQA5e_ch1 (3).pptx
CAQA5e_ch1 (3).pptx
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 

More from Michael Karpov

EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...
EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...
EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...
Michael Karpov
 
Movement to business goals: Data, Team, Users (4C Conference)
Movement to business goals: Data, Team, Users (4C Conference)Movement to business goals: Data, Team, Users (4C Conference)
Movement to business goals: Data, Team, Users (4C Conference)
Michael Karpov
 
Save Africa: NASA hackathon 2016
Save Africa: NASA hackathon 2016 Save Africa: NASA hackathon 2016
Save Africa: NASA hackathon 2016
Michael Karpov
 
Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014)
Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014) Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014)
Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014)
Michael Karpov
 
Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...
Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...
Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...Michael Karpov
 
Поговорим про ошибки (Sumit)
Поговорим про ошибки (Sumit)Поговорим про ошибки (Sumit)
Поговорим про ошибки (Sumit)
Michael Karpov
 
(2niversity) проектная работа tips&tricks
(2niversity) проектная работа   tips&tricks(2niversity) проектная работа   tips&tricks
(2niversity) проектная работа tips&tricks
Michael Karpov
 
"Пользователи: сигнал из космоса". CodeFest mini 2012
"Пользователи: сигнал из космоса". CodeFest mini 2012"Пользователи: сигнал из космоса". CodeFest mini 2012
"Пользователи: сигнал из космоса". CodeFest mini 2012
Michael Karpov
 
(Analyst days2012) Как мы готовим продукты - вклад аналитиков
(Analyst days2012) Как мы готовим продукты - вклад аналитиков(Analyst days2012) Как мы готовим продукты - вклад аналитиков
(Analyst days2012) Как мы готовим продукты - вклад аналитиков
Michael Karpov
 
Как сделать команде приятное - Михаил Карпов (Яндекс)
Как сделать команде приятное - Михаил Карпов (Яндекс)Как сделать команде приятное - Михаил Карпов (Яндекс)
Как сделать команде приятное - Михаил Карпов (Яндекс)
Michael Karpov
 
Как мы готовим продукты
Как мы готовим продуктыКак мы готовим продукты
Как мы готовим продукты
Michael Karpov
 
Hpc Visualization with WebGL
Hpc Visualization with WebGLHpc Visualization with WebGL
Hpc Visualization with WebGLMichael Karpov
 
Hpc Visualization with X3D (Michail Karpov)
Hpc Visualization with X3D (Michail Karpov)Hpc Visualization with X3D (Michail Karpov)
Hpc Visualization with X3D (Michail Karpov)Michael Karpov
 
сбор требований с помощью Innovation games
сбор требований с помощью Innovation gamesсбор требований с помощью Innovation games
сбор требований с помощью Innovation games
Michael Karpov
 
Зачем нам Это? или Как продать agile команде
Зачем нам Это? или Как продать agile командеЗачем нам Это? или Как продать agile команде
Зачем нам Это? или Как продать agile команде
Michael Karpov
 
"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile команде"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile командеMichael Karpov
 
"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile команде"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile командеMichael Karpov
 
Высоконагруженая команда - AgileDays 2010
Высоконагруженая команда - AgileDays 2010Высоконагруженая команда - AgileDays 2010
Высоконагруженая команда - AgileDays 2010
Michael Karpov
 

More from Michael Karpov (20)

EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...
EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...
EdCrunch 2018 - Skyeng - EdTech product scaling: How to influence key growth ...
 
Movement to business goals: Data, Team, Users (4C Conference)
Movement to business goals: Data, Team, Users (4C Conference)Movement to business goals: Data, Team, Users (4C Conference)
Movement to business goals: Data, Team, Users (4C Conference)
 
Save Africa: NASA hackathon 2016
Save Africa: NASA hackathon 2016 Save Africa: NASA hackathon 2016
Save Africa: NASA hackathon 2016
 
Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014)
Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014) Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014)
Из третьего мира - в первый: ошибки в развивающихся продуктах (AgileDays 2014)
 
Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...
Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...
Один день из жизни менеджера. Тактика: хорошие практики, скрытые опасности и ...
 
Поговорим про ошибки (Sumit)
Поговорим про ошибки (Sumit)Поговорим про ошибки (Sumit)
Поговорим про ошибки (Sumit)
 
(2niversity) проектная работа tips&tricks
(2niversity) проектная работа   tips&tricks(2niversity) проектная работа   tips&tricks
(2niversity) проектная работа tips&tricks
 
"Пользователи: сигнал из космоса". CodeFest mini 2012
"Пользователи: сигнал из космоса". CodeFest mini 2012"Пользователи: сигнал из космоса". CodeFest mini 2012
"Пользователи: сигнал из космоса". CodeFest mini 2012
 
(Analyst days2012) Как мы готовим продукты - вклад аналитиков
(Analyst days2012) Как мы готовим продукты - вклад аналитиков(Analyst days2012) Как мы готовим продукты - вклад аналитиков
(Analyst days2012) Как мы готовим продукты - вклад аналитиков
 
Как сделать команде приятное - Михаил Карпов (Яндекс)
Как сделать команде приятное - Михаил Карпов (Яндекс)Как сделать команде приятное - Михаил Карпов (Яндекс)
Как сделать команде приятное - Михаил Карпов (Яндекс)
 
Как мы готовим продукты
Как мы готовим продуктыКак мы готовим продукты
Как мы готовим продукты
 
Hpc Visualization with WebGL
Hpc Visualization with WebGLHpc Visualization with WebGL
Hpc Visualization with WebGL
 
Hpc Visualization with X3D (Michail Karpov)
Hpc Visualization with X3D (Michail Karpov)Hpc Visualization with X3D (Michail Karpov)
Hpc Visualization with X3D (Michail Karpov)
 
сбор требований с помощью Innovation games
сбор требований с помощью Innovation gamesсбор требований с помощью Innovation games
сбор требований с помощью Innovation games
 
Зачем нам Это? или Как продать agile команде
Зачем нам Это? или Как продать agile командеЗачем нам Это? или Как продать agile команде
Зачем нам Это? или Как продать agile команде
 
"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile команде"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile команде
 
"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile команде"Зачем нам Это?" или как продать Agile команде
"Зачем нам Это?" или как продать Agile команде
 
HPC Visualization
HPC VisualizationHPC Visualization
HPC Visualization
 
Hpc Visualization
Hpc VisualizationHpc Visualization
Hpc Visualization
 
Высоконагруженая команда - AgileDays 2010
Высоконагруженая команда - AgileDays 2010Высоконагруженая команда - AgileDays 2010
Высоконагруженая команда - AgileDays 2010
 

Recently uploaded

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

20090720 smith

  • 1. 7/17/2009 1 Parallel and High Performance Computing Burton Smith Technical Fellow Microsoft
  • 2. Agenda Introduction Definitions Architecture and Programming Examples Conclusions 7/17/2009 2
  • 4. “Parallel and High Performance”? “Parallel computing is a form of computation in which many calculations are carried out simultaneously” G.S. Almasi and A. Gottlieb, Highly Parallel Computing. Benjamin/Cummings, 1994 A High Performance (Super) Computer is: One of the 500 fastest computers as measured by HPL: the High Performance Linpack benchmark A computer that costs 200.000.000 руб or more Necessarily parallel, at least since the 1970’s 7/17/2009 4
  • 5. Recent Developments For 20 years, parallel and high performance computing have been the same subject Parallel computing is now mainstream It reaches well beyond HPC into client systems: desktops, laptops, mobile phones HPC software once had to stand alone Now, it can be based on parallel PC software The result: better tools and new possibilities 7/17/2009 5
  • 6. The Emergence of the Parallel Client Uniprocessor performance is leveling off Instruction-level parallelism nears a limit (ILP Wall) Power is getting painfully high (Power Wall) Caches show diminishing returns (Memory Wall) Logic density continues to grow (Moore’s Law) So uniprocessors will collapse in area and cost Cores per chip need to increase exponentially We must all learn to write parallel programs So new “killer apps” will enjoy more speed
  • 7. The ILP Wall Instruction-level parallelism preserves the serial programming model While getting speed from “undercover” parallelism For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, … At best, we get a few instructions/clock † Y.N. Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture,“Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109−116.
  • 8. The Power Wall In the old days, power was kept roughly constant Dynamic power, equal to CV2f, dominated Every shrink of .7 in feature size halved transistor area Capacitance C and voltage V also decreased by .7 Even with the clock frequency f increased by 1.4, power per transistor was cut in half Now, shrinking no longer reduces V very much So even at constant frequency, power density doubles Static (leakage) power is also getting worse Simpler, slower processors are more efficient And to conserve power, we can turn some of them off
  • 9. The Memory Wall We can get bigger caches from more transistors Does this suffice, or is there a problem scaling up? To speed up 2X without changing bandwidth below the cache, the miss rate must be halved How much bigger does the cache have to be?† For dense matrix multiply or dense LU, 4x bigger For sorting or FFTs, the square of its former size For sparse or dense matrix-vector multiply, impossible Deeper interconnects increase miss latency Latency tolerance needs memory access parallelism † H.T. Kung, “Memory requirements for balanced computer architectures,” 13th International Symposium on Computer Architecture, 1986, pp. 49−54.
  • 10. Overcoming the Memory Wall Provide more memory bandwidth Increase DRAM I/O bandwidth per gigabyte Increase microprocessor off-chip bandwidth Use architecture to tolerate memory latency More latency  more threads or longer vectors No change in programming model is needed Use caches for bandwidth as well as latency Let compilers control locality Keep cache lines short Avoid mis-speculation
  • 11. The End of The von Neumann Model “Instructions are executed one at a time…” We have relied on this idea for 60 years Now it (and things it brought) must change Serial programming is easier than parallel programming, at least for the moment But serial programs are now slow programs We need parallel programming paradigms that will make all programmers successful The stakes for our field’s vitality are high Computing must be reinvented
  • 13. Asymptotic Notation Quantities are often meaningful only within a constant factor Algorithm performance analyses, for example f(n) = O(g(n)) means there exist constants c and n0 such that nn0 implies |f(n)|  |cg(n)| f(n) = (g(n)) means there exist constants c and n0 such that nn0 implies |f(n)|  |cg(n)| f(n) = (g(n)) means both f(n) = O(g(n)) and f(n) = (g(n)) 7/17/2009 13
  • 14. Speedup, Time, and Work The speedup of a computation is how much faster it runs in parallel compared to serially If one processor takes T1 and p of them take Tp then the p-processor speedup is Sp = T1/Tp The work done is the number of operations performed, either serially or in parallel W1 = O(T1) is the serial work, Wp the parallel work We say a parallel computation is work-optimal ifWp = O(W1) = O(T1) We say a parallel computation is time-optimal ifTp = O(W1/p) = O(T1/p) 7/17/2009 14
  • 15. Latency, Bandwidth, & Concurrency In any system that moves items from input to output without creating or destroying them, Queueing theory calls this result Little’s law latency × bandwidth = concurrency concurrency = 6 bandwidth = 2 latency = 3
  • 17. Parallel Processor Architecture SIMD: Each instruction operates concurrently on multiple data items MIMD: Multiple instruction sequences execute concurrently Concurrency is expressible in space or time Spatial: the hardware is replicated Temporal: the hardware is pipelined 7/17/2009 17
  • 18. Trends in Parallel Processors Today’s chips are spatial MIMD at top level To get enough performance, even in PCs Temporal MIMD is also used SIMD is tending back toward spatial Intel’s Larrabee combines all three Temporal concurrency is easily “adjusted” Vector length or number of hardware contexts Temporal concurrency tolerates latency Memory latency in the SIMD case For MIMD, branches and synchronization also 7/17/2009 18
  • 19. Parallel Memory Architecture A shared memory system is one in which any processor can address any memory location Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth A distributed memory system is one in which processors can’t address most of memory The disjoint memory regions and their associated processors are usually called nodes A cluster is a distributed memory system with more than one processor per node Nearly all HPC systems are clusters 7/17/2009 19
  • 20. Parallel Programming Variations Data Parallelism andTask Parallelism Functional Style and Imperative Style Shared Memory and Message Passing …and more we won’t have time to look at A parallel application may use all of them 7/17/2009 20
  • 21. Data Parallelism and Task Parallelism A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items Applying the same function to every element of a data sequence, for example A computation is task parallel when dissimilar independent sub-computations are done simultaneously Controlling the motions of a robot, for example It sounds like SIMD vs. MIMD, but isn’t quite Some kinds of data parallelism need MIMD 7/17/2009 21
  • 22. Functional and Imperative Programs A program is said to be written in (pure) functional style if it has no mutable state Computing = naming and evaluating expressions Programs with mutable state are usually called imperative because the state changes must be done when and where specified: while (z < x) { x = y; y = z; z = f(x, y);} return y; Often, programs can be written either way: let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y; 7/17/2009 22
  • 23. Shared Memory and Message Passing Shared memory programs access data in a shared address space When to access the data is the big issue Subcomputations therefore must synchronize Message passing programs transmit data between subcomputations The sender computes a value and then sends it The receiver recieves a value and then uses it Synchronization can be built in to communication Message passing can be implemented very well on shared memory architectures 7/17/2009 23
  • 24. Barrier Synchronization A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived It is named after the barrier used to start horse races It guarantees everything before the barrier finishes before anything after it begins It is a central feature in several data-parallel languages such as OpenMP 7/17/2009 24
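For concreteness, a minimal OpenMP sketch (not from the slides): no thread prints its "after" line until every thread has printed its "before" line.

  #include <omp.h>
  #include <stdio.h>

  int main() {
    #pragma omp parallel
    {
      int id = omp_get_thread_num();
      printf("before the barrier: thread %d\n", id);  // phase 1
      #pragma omp barrier            // no thread proceeds until all have arrived
      printf("after the barrier: thread %d\n", id);   // phase 2
    }
    return 0;
  }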
  • 25. Mutual Exclusion This type of synchronization ensures only one subcomputation can do a thing at any time If the thing is a code block, it is a critical section It classically uses a lock: a data structure with which subcomputations can stop and start Basic operations on a lock object L might be Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership Release(L): yields L and unblocks some Acquire(L) A lot has been written on these subjects 7/17/2009 25
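A minimal sketch of the Acquire/Release pattern, using C++11's std::mutex as the lock object L (the shared counter and thread count are hypothetical):

  #include <mutex>
  #include <thread>
  #include <vector>

  std::mutex L;        // the lock
  long counter = 0;    // shared state guarded by L

  void work() {
    for (int i = 0; i < 100000; ++i) {
      L.lock();        // Acquire(L): blocks until exclusive ownership is obtained
      ++counter;       // critical section: only one subcomputation at a time
      L.unlock();      // Release(L): yields L, unblocking some waiting Acquire(L)
    }
  }

  int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(work);
    for (auto &t : threads) t.join();
    return counter == 400000 ? 0 : 1;   // with the lock, no increments are lost
  }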
• 26. Non-Blocking Synchronization The basic idea is to achieve mutual exclusion using memory read-modify-write operations Most commonly used is compare-and-swap: CAS(addr, old, new) reads memory at addr and if it contains old then old is replaced by new Arbitrary update operations at an addr require {read old; compute new; CAS(addr, old, new);} be repeated until the CAS operation succeeds If there is significant updating contention at addr, the repeated computation of new may be wasteful 7/17/2009 26
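The retry idiom, sketched with C++'s std::atomic compare-exchange (the saturating increment is a hypothetical example of an "arbitrary update"):

  #include <atomic>

  std::atomic<int> value{0};

  // Repeat {read old; compute new; CAS(addr, old, new)} until the CAS succeeds.
  void saturating_increment(int limit) {
    int old_v = value.load();
    int new_v;
    do {
      new_v = (old_v < limit) ? old_v + 1 : old_v;   // compute new from old
      // On failure, compare_exchange_weak reloads the current value into old_v.
    } while (!value.compare_exchange_weak(old_v, new_v));
  }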
  • 27. Load Balancing Some processors may be busier than others To balance the workload, subcomputations can be scheduled on processors dynamically A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations In guided self-scheduling, the chunk sizes shrink Analogous imbalances can occur in memory Overloaded memory locations are called hot spots Parallel algorithms and data structures must be designed to avoid them Imbalanced messaging is sometimes seen 7/17/2009 27
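A sketch of self-scheduling: threads repeatedly grab fixed-size chunks of iterations from a shared atomic counter (the chunk size, iteration count, and loop body are hypothetical):

  #include <algorithm>
  #include <atomic>
  #include <thread>
  #include <vector>

  const int N = 1000000, CHUNK = 1000;
  std::atomic<int> next_iter{0};                 // next unclaimed iteration

  void body(int i) { /* iteration i's work, possibly uneven */ }

  void worker() {
    for (;;) {
      int start = next_iter.fetch_add(CHUNK);    // grab the next chunk
      if (start >= N) break;                     // no iterations left
      int end = std::min(start + CHUNK, N);
      for (int i = start; i < end; ++i) body(i);
    }
  }

  int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker);
    for (auto &t : threads) t.join();
  }

In guided self-scheduling, CHUNK would shrink as the remaining work shrinks rather than staying fixed.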
• 29. A Data Parallel Example: Sorting 7/17/2009 29
void sort(int *src, int *dst, int size, int nvals) {
  int i, j, t1[nvals], t2[nvals];
  for (j = 0; j < nvals; j++) { t1[j] = 0; }
  for (i = 0; i < size; i++) { t1[src[i]]++; }
  // t1[] now contains a histogram of the values
  t2[0] = 0;
  for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
  // t2[j] now contains the origin for value j
  for (i = 0; i < size; i++) { dst[t2[src[i]]++] = src[i]; }
}
  • 30. When Is a Loop Parallelizable? The loop instances must safely interleave A way to do this is to only read the data Another way is to isolate data accesses Look at the first loop: The accesses to t1[] are isolated from each other This loop can run in parallel “as is” 7/17/2009 30 for (j = 0 ; j < nvals ; j++) { t1[j] = 0; }
  • 31. Isolating Data Updates The second loop seems to have a problem: Two iterations may access the same t1[src[i]] If both reads precede both increments, oops! A few ways to isolate the iteration conflicts: Use an “isolated update” (lock prefix) instruction Use an array of locks, perhaps as big as t1[] Use non-blocking updates Use a transaction 7/17/2009 31 for (i = 0 ; i < size ; i++) { t1[src[i]]++; }
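For example, with non-blocking updates (one of the options above), the loop might be written like this in C++; the std::atomic histogram type is an assumption, not the slide's original code:

  #include <atomic>
  #include <vector>

  // Each t1[src[i]]++ becomes an isolated (atomic) increment, so iterations
  // that hit the same bin cannot lose updates when the loop runs in parallel.
  void histogram(const int *src, int size, std::vector<std::atomic<int>> &t1) {
    for (int i = 0; i < size; i++) {
      t1[src[i]].fetch_add(1, std::memory_order_relaxed);
    }
  }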
• 32. Dependent Loop Iterations The 3rd loop is an interesting challenge: Each iteration depends on the previous one This loop is an example of a prefix computation If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, … Prefix computations are often known as scans Scan can be done efficiently in parallel 7/17/2009 32 for (j = 1 ; j < nvals ; j++) { t2[j] = t2[j-1] + t1[j-1]; }
• 33. Cyclic Reduction Each vertical line represents a loop iteration The associated sequence element is to its right On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k 7/17/2009 33
  Initially:     a  b   c    d     e      f       g
  After step 0:  a  ab  bc   cd    de     ef      fg
  After step 1:  a  ab  abc  abcd  bcde   cdef    defg
  After step 2:  a  ab  abc  abcd  abcde  abcdef  abcdefg
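A sketch of this scheme written as an inclusive scan over an array (shown serially, step by step; in a parallel version each step's inner loop would be split across processors, since its iterations are independent):

  #include <string>
  #include <vector>

  // Inclusive scan by cyclic reduction: on step k, element j combines in the
  // value from element j - 2^k, doubling the covered prefix each step.
  std::vector<std::string> scan(std::vector<std::string> x) {
    for (size_t stride = 1; stride < x.size(); stride *= 2) {
      std::vector<std::string> prev = x;           // values from the previous step
      for (size_t j = stride; j < x.size(); ++j)   // independent iterations
        x[j] = prev[j - stride] + x[j];            // here • is string concatenation
    }
    return x;   // {"a","b","c",...} becomes {"a","ab","abc",...}
  }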
  • 34. Applications of Scan Linear recurrences like the third loop Polynomial evaluation String comparison High-precision addition Finite automata Each xi is the next-state function given the ith input symbol and • is function composition APL compress When only the final value is needed, the computation is called a reduction instead It’s a little bit cheaper than a full scan
• 35. More Iterations n Than Processors p 7/17/2009 35 Wp = 3n + O(p log p), Tp = 3n / p + O(log p) (roughly, the n elements are traversed about three times in total, plus an O(p log p)-work scan across the processors' partial results)
• 36. OpenMP OpenMP is a widely implemented extension to C, C++, and Fortran for data† parallelism It adds directives to serial programs A few of the more important directives:
#pragma omp parallel for <modifiers> <for loop>
#pragma omp atomic <binary op=, ++, or -- statement>
#pragma omp critical <name> <structured block>
#pragma omp barrier
7/17/2009 36 †And perhaps task parallelism soon
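For instance, the first two loops of the sorting example might be annotated like this (a sketch; the atomic directive is one of the isolation options from the earlier slide):

  #include <omp.h>

  void histogram(int *src, int *t1, int size, int nvals) {
    #pragma omp parallel for              // independent iterations: runs "as is"
    for (int j = 0; j < nvals; j++) { t1[j] = 0; }

    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
      #pragma omp atomic                  // isolate the conflicting update
      t1[src[i]]++;
    }
  }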
• 37. The Sorting Example in OpenMP Only the third "scan" loop is a problem We can at least do this loop "manually": 7/17/2009 37
nt = omp_get_num_threads();
int ta[nt], tb[nt];
#pragma omp parallel for
for (myt = 0; myt < nt; myt++) {
  // Set ta[myt] = local sum of nvals/nt elements of t1[]
  #pragma omp barrier
  for (k = 1; k <= myt; k *= 2) {
    tb[myt] = ta[myt];
    ta[myt] += tb[myt - k];
    #pragma omp barrier
  }
  fix = (myt > 0) ? ta[myt - 1] : 0;
  // Set nvals/nt elements of t2[] to fix + local scan of t1[]
}
• 38. Parallel Patterns Library (PPL) PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime It supports mixed data- and task-parallelism: parallel_for, parallel_for_each, parallel_invoke agent, send, receive, choice, join, task_group Parallel loops use C++ lambda expressions: parallel_for(0, nvals, [&t1](int j) { t1[j] = 0; }); Updates can be isolated using intrinsic functions: (void)_InterlockedIncrement(&t1[src[i]]); Microsoft and Intel plan to unify PPL and TBB 7/17/2009 38
• 39. Dynamic Resource Management PPL programs are written for an arbitrary number of processors, which could be just one Load balancing is mostly done by work stealing There are two kinds of work to steal: Work that is unblocked and waiting for a processor Work that is not yet started and is potentially parallel Work of the latter kind will be done serially unless it is first stolen by another processor This makes recursive divide and conquer easy There is no concern about when to stop parallelism 7/17/2009 39
• 40. A Quicksort Example
void quicksort(vector<int>::iterator first, vector<int>::iterator last) {
  if (last - first < 2) { return; }
  int pivot = *first;
  auto mid1 = partition(first, last, [=](int e) { return e < pivot; });
  auto mid2 = partition(mid1, last, [=](int e) { return e == pivot; });
  parallel_invoke(
    [=] { quicksort(first, mid1); },
    [=] { quicksort(mid2, last); }
  );
}
7/17/2009 40
  • 41. LINQ and PLINQ LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F# A LINQ query is really just a functional monad It queries databases, XML, or any IEnumerable PLINQ is a parallel implementation of LINQ Non-isolated functions must be avoided Otherwise it is hard to tell the two apart 7/17/2009 41
• 42. A LINQ Example 7/17/2009 42
var q = from n in names
        where n.Name == queryInfo.Name &&
              n.State == queryInfo.State &&
              n.Year >= yearStart && n.Year <= yearEnd
        orderby n.Year ascending
        select n;
PLINQ: append .AsParallel() to the data source, i.e., from n in names.AsParallel()
  • 43. Message Passing Interface (MPI) MPI is a widely used message passing library for distributed memory HPC systems Some of its basic functions: A few of its “collective communication” functions: 7/17/2009 43 MPI_Init MPI_Comm_rank MPI_Comm_size MPI_Send MPI_Recv MPI_Reduce MPI_Allreduce MPI_Scan MPI_Exscan MPI_Barrier MPI_Gather MPI_Allgather MPI_Alltoall
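A minimal sketch (not from the slides) using several of the basic functions; rank 0 sends one integer to rank 1:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which node am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many nodes are there?

    if (rank == 0 && size > 1) {
      int value = 42;
      MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    // to rank 1, tag 0
    } else if (rank == 1) {
      int value;
      MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
  }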
  • 44. Sorting in MPI Roughly, it could work like this on n nodes: Run the first two loops locally Use MPI_Allreduce to build a global histogram Run the third loop (redundantly) at every node Allocate n value intervals to nodes (redundantly) Balancing the data per node as well as possible Run the fourth loop using the local histogram Use MPI_Alltoall to redistribute the data Merge the n sorted subarrays on each node Collective communication is expensive But sorting needs it (see the Memory Wall slide) 7/17/2009 44
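The global-histogram step of this plan might look like the following sketch, where t1 is the local histogram from the second loop and g receives the global one:

  #include <mpi.h>

  // Combine every node's local histogram t1[0..nvals-1] into a global
  // histogram g[0..nvals-1], available (redundantly) on all nodes.
  void global_histogram(const int *t1, int *g, int nvals) {
    MPI_Allreduce(t1, g, nvals, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  }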
• 45. Another Way to Sort in MPI The Samplesort algorithm is like Quicksort It works like this on n nodes: Sort the local data on each node independently Take s samples of the sorted data on each node Use MPI_Allgather to send all nodes all samples Compute n − 1 splitters (redundantly) on all nodes Balancing the data per node as well as possible Use MPI_Alltoall to redistribute the data Merge the n sorted subarrays on each node 7/17/2009 45
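A sketch of the sampling and splitter-selection steps (the function name and the evenly spaced sampling rule are assumptions; the local data is already sorted, as the slide requires):

  #include <mpi.h>
  #include <algorithm>
  #include <vector>

  // Each node contributes s evenly spaced samples of its sorted local data;
  // all nodes then receive everyone's samples and pick the same n-1 splitters.
  std::vector<int> choose_splitters(const std::vector<int> &sorted_local, int s,
                                    int n /* number of nodes */) {
    std::vector<int> samples(s);
    for (int i = 0; i < s; ++i)
      samples[i] = sorted_local[(i + 1) * sorted_local.size() / (s + 1)];

    std::vector<int> all(s * n);
    MPI_Allgather(samples.data(), s, MPI_INT, all.data(), s, MPI_INT, MPI_COMM_WORLD);

    std::sort(all.begin(), all.end());
    std::vector<int> splitters(n - 1);
    for (int i = 1; i < n; ++i)
      splitters[i - 1] = all[i * s];      // every s-th sample becomes a splitter
    return splitters;
  }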
  • 47. Parallel Computing Has Arrived We must rethink how we write programs And we are definitely doing that Other things will also need to change Architecture Operating systems Algorithms Theory Application software We are seeing the biggest revolution in computing since its very beginnings 7/17/2009 47