    20090720 smith Presentation Transcript

    • Parallel and High Performance Computing
      Burton Smith
      Technical Fellow
      Microsoft
    • Agenda
      Introduction
      Definitions
      Architecture and Programming
      Examples
      Conclusions
    • Introduction
    • “Parallel and High Performance”?
      “Parallel computing is a form of computation in which many calculations are carried out simultaneously” G.S. Almasi and A. Gottlieb, Highly Parallel Computing. Benjamin/Cummings, 1994
      A High Performance (Super) Computer is:
      One of the 500 fastest computers as measured by HPL: the High Performance Linpack benchmark
      A computer that costs 200,000,000 rubles or more
      Necessarily parallel, at least since the 1970’s
    • Recent Developments
      For 20 years, parallel and high performance computing have been the same subject
      Parallel computing is now mainstream
      It reaches well beyond HPC into client systems: desktops, laptops, mobile phones
      HPC software once had to stand alone
      Now, it can be based on parallel PC software
      The result: better tools and new possibilities
    • The Emergence of the Parallel Client
      Uniprocessor performance is leveling off
      Instruction-level parallelism nears a limit (ILP Wall)
      Power is getting painfully high (Power Wall)
      Caches show diminishing returns (Memory Wall)
      Logic density continues to grow (Moore’s Law)
      So uniprocessors will collapse in area and cost
      Cores per chip need to increase exponentially
      We must all learn to write parallel programs
      So new “killer apps” will enjoy more speed
    • The ILP Wall
      Instruction-level parallelism preserves the serial programming model
      While getting speed from “undercover” parallelism
      For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, …
      At best, we get a few instructions/clock
      † Y.N. Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109−116.
    • The Power Wall
      In the old days, power was kept roughly constant
      Dynamic power, equal to CV²f, dominated
      Every shrink of .7 in feature size halved transistor area
      Capacitance C and voltage V also decreased by .7
      Even with the clock frequency f increased by 1.4, power per transistor was cut in half
      Now, shrinking no longer reduces V very much
      So even at constant frequency, power density doubles
      Static (leakage) power is also getting worse
      Simpler, slower processors are more efficient
      And to conserve power, we can turn some of them off
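      A quick check of the arithmetic in the classic-scaling bullets above (a sketch using only the 0.7 shrink factor quoted there): with C → 0.7C, V → 0.7V, and f → 1.4f, dynamic power per transistor scales as
      P = C·V²·f → 0.7 × 0.7² × 1.4 ≈ 0.48 of its old value,
      i.e. roughly halved, matching the halved transistor area. If V stops scaling, the same shrink and frequency bump give 0.7 × 1 × 1.4 ≈ 1.0 of the old power in half the area, so power density roughly doubles.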
    • The Memory Wall
      We can get bigger caches from more transistors
      Does this suffice, or is there a problem scaling up?
      To speed up 2X without changing bandwidth below the cache, the miss rate must be halved
      How much bigger does the cache have to be?†
      For dense matrix multiply or dense LU, 4x bigger
      For sorting or FFTs, the square of its former size
      For sparse or dense matrix-vector multiply, impossible
      Deeper interconnects increase miss latency
      Latency tolerance needs memory access parallelism
      † H.T. Kung, “Memory requirements for balanced computer architectures,” 13th International Symposium on Computer Architecture, 1986, pp. 49−54.
    • Overcoming the Memory Wall
      Provide more memory bandwidth
      Increase DRAM I/O bandwidth per gigabyte
      Increase microprocessor off-chip bandwidth
      Use architecture to tolerate memory latency
      More latency ⇒ more threads or longer vectors
      No change in programming model is needed
      Use caches for bandwidth as well as latency
      Let compilers control locality
      Keep cache lines short
      Avoid mis-speculation
    • The End of The von Neumann Model
      “Instructions are executed one at a time…”
      We have relied on this idea for 60 years
      Now it (and things it brought) must change
      Serial programming is easier than parallel programming, at least for the moment
      But serial programs are now slow programs
      We need parallel programming paradigms that will make all programmers successful
      The stakes for our field’s vitality are high
      Computing must be reinvented
    • Definitions
    • Asymptotic Notation
      Quantities are often meaningful only within a constant factor
      Algorithm performance analyses, for example
      f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |c g(n)|
      f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |c g(n)|
      f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
    • Speedup, Time, and Work
      The speedup of a computation is how much faster it runs in parallel compared to serially
      If one processor takes T1 and p of them take Tp then the p-processor speedup is Sp = T1/Tp
      The work done is the number of operations performed, either serially or in parallel
      W1 = O(T1) is the serial work, Wp the parallel work
      We say a parallel computation is work-optimal if Wp = O(W1) = O(T1)
      We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
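      A small worked example (numbers invented for illustration): if T1 = 100 s and four processors take T4 = 30 s, the speedup is S4 = 100/30 ≈ 3.3; perfect speedup would be S4 = 4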
    • Latency, Bandwidth, & Concurrency
      In any system that moves items from input to output without creating or destroying them, latency × bandwidth = concurrency
      Queueing theory calls this result Little's law
      The slide's figure illustrates it with latency = 3 and bandwidth = 2, hence concurrency = 3 × 2 = 6
    • Architecture and Programming
    • Parallel Processor Architecture
      SIMD: Each instruction operates concurrently on multiple data items
      MIMD: Multiple instruction sequences execute concurrently
      Concurrency is expressible in space or time
      Spatial: the hardware is replicated
      Temporal: the hardware is pipelined
    • Trends in Parallel Processors
      Today’s chips are spatial MIMD at top level
      To get enough performance, even in PCs
      Temporal MIMD is also used
      SIMD is tending back toward spatial
      Intel’s Larrabee combines all three
      Temporal concurrency is easily “adjusted”
      Vector length or number of hardware contexts
      Temporal concurrency tolerates latency
      Memory latency in the SIMD case
      For MIMD, branches and synchronization also
    • Parallel Memory Architecture
      A shared memory system is one in which any processor can address any memory location
      Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth
      A distributed memory system is one in which processors can’t address most of memory
      The disjoint memory regions and their associated processors are usually called nodes
      A cluster is a distributed memory system with more than one processor per node
      Nearly all HPC systems are clusters
    • Parallel Programming Variations
      Data Parallelism and Task Parallelism
      Functional Style and Imperative Style
      Shared Memory and Message Passing
      …and more we won’t have time to look at
      A parallel application may use all of them
    • Data Parallelism and Task Parallelism
      A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
      Applying the same function to every element of a data sequence, for example
      A computation is task parallel when dissimilar independent sub-computations are done simultaneously
      Controlling the motions of a robot, for example
      It sounds like SIMD vs. MIMD, but isn’t quite
      Some kinds of data parallelism need MIMD
    • Functional and Imperative Programs
      A program is said to be written in (pure) functional style if it has no mutable state
      Computing = naming and evaluating expressions
      Programs with mutable state are usually called imperative because the state changes must be done when and where specified:
      while (z < x) { x = y; y = z; z = f(x, y);} return y;
      Often, programs can be written either way:
      let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
    • Shared Memory and Message Passing
      Shared memory programs access data in a shared address space
      When to access the data is the big issue
      Subcomputations therefore must synchronize
      Message passing programs transmit data between subcomputations
      The sender computes a value and then sends it
      The receiver receives a value and then uses it
      Synchronization can be built in to communication
      Message passing can be implemented very well on shared memory architectures
    • Barrier Synchronization
      A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived
      It is named after the barrier used to start horse races
      It guarantees everything before the barrier finishes before anything after it begins
      It is a central feature in several data-parallel languages such as OpenMP
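      A minimal OpenMP illustration of the guarantee above (my example, not from the slide): every "before" line prints before any "after" line, because no thread passes the barrier until all have arrived
      #include <cstdio>
      #include <omp.h>

      int main() {
        #pragma omp parallel
        {
          std::printf("before barrier: thread %d\n", omp_get_thread_num());
          #pragma omp barrier   // nobody proceeds until every thread arrives
          std::printf("after barrier: thread %d\n", omp_get_thread_num());
        }
        return 0;
      }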
    • Mutual Exclusion
      This type of synchronization ensures only one subcomputation can do a thing at any time
      If the thing is a code block, it is a critical section
      It classically uses a lock: a data structure with which subcomputations can stop and start
      Basic operations on a lock object L might be
      Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
      Release(L): yields L and unblocks some Acquire(L)
      A lot has been written on these subjects
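      A minimal sketch of the Acquire/Release pattern described above, with C++11's std::mutex standing in for the lock L (the slide names no particular library, so this mapping is mine):
      #include <mutex>

      std::mutex L;        // the lock
      long counter = 0;    // shared state several subcomputations update

      void increment() {
        L.lock();          // Acquire(L): blocks until no one else owns L
        ++counter;         // critical section: one subcomputation at a time
        L.unlock();        // Release(L): yields L, unblocking a waiting Acquire
      }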
    • Non-Blocking Synchronization
      The basic idea is to achieve mutual exclusion using memory read-modify-write operations
      Most commonly used is compare-and-swap:
      CAS(addr, old, new) reads memory at addr and if it contains old then old is replaced by new
      Arbitrary update operations at an addr require that {read old; compute new; CAS(addr, old, new);} be repeated until the CAS operation succeeds (sketched below)
      If there is significant updating contention at addr, the repeated computation of new may be wasteful
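      The retry idiom above, sketched with C++11's std::atomic compare-exchange playing the role of CAS (the update function, new = 2·old + 1, is invented purely for illustration):
      #include <atomic>

      std::atomic<int> addr{0};

      void update() {
        int old = addr.load();
        int desired = 2 * old + 1;                 // compute new from old
        // On failure the current value is reloaded into old and we retry,
        // repeating {read old; compute new; CAS} until the CAS succeeds.
        while (!addr.compare_exchange_weak(old, desired)) {
          desired = 2 * old + 1;
        }
      }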
    • Load Balancing
      Some processors may be busier than others
      To balance the workload, subcomputations can be scheduled on processors dynamically
      A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations
      In guided self-scheduling, the chunk sizes shrink
      Analogous imbalances can occur in memory
      Overloaded memory locations are called hot spots
      Parallel algorithms and data structures must be designed to avoid them
      Imbalanced messaging is sometimes seen
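      A sketch of self-scheduling with plain C++ threads (the fixed chunk size and worker count are illustrative; guided self-scheduling would shrink the chunks as the loop drains):
      #include <algorithm>
      #include <atomic>
      #include <functional>
      #include <thread>
      #include <vector>

      void self_scheduled_for(int n, int chunk, int nworkers,
                              const std::function<void(int)>& body) {
        std::atomic<int> next{0};                  // next unclaimed iteration
        std::vector<std::thread> workers;
        for (int w = 0; w < nworkers; ++w) {
          workers.emplace_back([&] {
            for (;;) {
              int start = next.fetch_add(chunk);   // grab a chunk of iterations
              if (start >= n) break;               // nothing left to claim
              int end = std::min(start + chunk, n);
              for (int i = start; i < end; ++i) body(i);
            }
          });
        }
        for (auto& t : workers) t.join();
      }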
    • Examples
    • A Data Parallel Example: Sorting
      void sort(int *src, int *dst, int size, int nvals) {
        int i, j, t1[nvals], t2[nvals];
        for (j = 0; j < nvals; j++) {
          t1[j] = 0;
        }
        for (i = 0; i < size; i++) {
          t1[src[i]]++;
        } // t1[] now contains a histogram of the values
        t2[0] = 0;
        for (j = 1; j < nvals; j++) {
          t2[j] = t2[j-1] + t1[j-1];
        } // t2[j] now contains the origin for value j
        for (i = 0; i < size; i++) {
          dst[t2[src[i]]++] = src[i];
        }
      }
    • When Is a Loop Parallelizable?
      The loop instances must safely interleave
      A way to do this is to only read the data
      Another way is to isolate data accesses
      Look at the first loop:
      The accesses to t1[] are isolated from each other
      This loop can run in parallel “as is”
      for (j = 0 ; j < nvals ; j++) {
      t1[j] = 0;
      }
    • Isolating Data Updates
      The second loop seems to have a problem:
      Two iterations may access the same t1[src[i]]
      If both reads precede both increments, oops!
      A few ways to isolate the iteration conflicts:
      Use an “isolated update” (lock prefix) instruction
      Use an array of locks, perhaps as big as t1[]
      Use non-blocking updates
      Use a transaction
      for (i = 0 ; i < size ; i++) {
      t1[src[i]]++;
      }
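      One of the isolation options listed above, non-blocking updates, sketched with C++ std::atomic (the slide only lists the option; the use of std::atomic here is mine):
      #include <atomic>
      #include <vector>

      // With t1 declared as atomics, each increment is a single isolated
      // read-modify-write, so concurrent iterations cannot lose updates.
      void histogram(const int* src, int size, std::vector<std::atomic<int>>& t1) {
        for (int i = 0; i < size; i++) {
          t1[src[i]].fetch_add(1, std::memory_order_relaxed);
        }
      }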
    • Dependent Loop Iterations
      The 3rd loop is an interesting challenge:
      Each iteration depends on the previous one
      This loop is an example of a prefix computation
      If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, …
      Prefix computations are often known as scans
      Scan can be done efficiently in parallel
      for (j = 1 ; j < nvals ; j++) { t2[j] = t2[j-1] + t1[j-1];
      }
    • Cyclic Reduction
      Each vertical line represents a loop iteration
      The associated sequence element is to its right
      On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k
      The figure's columns, step by step (one column per iteration):
      Initially:            a    b    c    d    e    f    g
      After offset-1 step:  a    ab   bc   cd   de   ef   fg
      After offset-2 step:  a    ab   abc  abcd bcde cdef defg
      After offset-4 step:  a    ab   abc  abcd abcde abcdef abcdefg
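      The same schedule written out in code (a serial sketch of the table above; in a parallel version the iterations of each inner loop run concurrently, one per vertical line):
      #include <vector>

      // Inclusive scan by cyclic reduction: on each step, element j combines
      // with the element 2^k positions to its left, if there is one.
      void scan(std::vector<int>& x) {
        int n = (int)x.size();
        for (int d = 1; d < n; d *= 2) {          // d = 2^k
          std::vector<int> prev = x;              // snapshot of the previous step
          for (int j = d; j < n; ++j) {
            x[j] = prev[j - d] + prev[j];         // prefix the value from j - 2^k
          }
        }
      }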
    • Applications of Scan
      Linear recurrences like the third loop
      Polynomial evaluation
      String comparison
      High-precision addition
      Finite automata
      Each xi is the next-state function given the ith input symbol and • is function composition
      APL compress
      When only the final value is needed, the computation is called a reduction instead
      It’s a little bit cheaper than a full scan
    • More Iterations n Than Processors p
      Wp = 3n + O(p log p), Tp = 3n/p + O(log p)
      Roughly: each processor makes a few passes over its n/p iterations, plus a small scan across the p per-processor partial results (the approach the OpenMP code two slides ahead follows)
    • OpenMP
      OpenMP is a widely-implemented extension to C++ and Fortran for data† parallelism
      It adds directives to serial programs
      A few of the more important directives:
      #pragma omp parallel for <modifiers>  <for loop>
      #pragma omp atomic  <binary op=, ++, or -- statement>
      #pragma omp critical <name>  <structured block>
      #pragma omp barrier
      †And perhaps task parallelism soon
    • The Sorting Example in OpenMP
      Only the third “scan” loop is a problem
      We can at least do this loop “manually”:
      nt = omp_get_num_threads();
      int ta[nt], tb[nt];
      #pragma omp parallel for
      for (myt = 0; myt < nt; myt++) {
        // Set ta[myt] = local sum of nvals/nt elements of t1[]
        #pragma omp barrier
        for (k = 1; k < nt; k *= 2) {
          tb[myt] = ta[myt];
          #pragma omp barrier
          if (myt >= k) ta[myt] += tb[myt - k];
          #pragma omp barrier
        }
        fix = (myt > 0) ? ta[myt - 1] : 0;
        // Set nvals/nt elements of t2[] to fix + local scan of t1[]
      }
    • Parallel Patterns Library (PPL)
      PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime
      It supports mixed data- and task-parallelism:
      parallel_for, parallel_for_each, parallel_invoke
      agent, send, receive, choice, join, task_group
      Parallel loops use C++ lambda expressions:
      Updates can be isolated using intrinsic functions
      Microsoft and Intel plan to unify PPL and TBB
      parallel_for(0, nvals, [&t1](int j) {
        t1[j] = 0;
      });
      (void)_InterlockedIncrement(&t1[src[i]]);
    • Dynamic Resource Management
      PPL programs are written for an arbitrary number of processors, could be just one
      Load balancing is mostly done by work stealing
      There are two kinds of work to steal:
      Work that is unblocked and waiting for a processor
      Work that is not yet started and is potentially parallel
      Work of the latter kind will be done serially unless it is first stolen by another processor
      This makes recursive divide and conquer easy
      There is no concern about when to stop parallelism
    • A Quicksort Example
      void quicksort (vector<int>::iterator first,
      vector<int>::iterator last) {
      if (last - first < 2){return;}
      int pivot = *first;
      auto mid1 = partition (first, last,
      [=](int e){return e < pivot;});
      auto mid2 = partition (mid1, last,
      [=](int e){return e == pivot;});
      parallel_invoke(
      [=] { quicksort(first, mid1); },
      [=] { quicksort(mid2, last); }
      );
      }
    • LINQ and PLINQ
      LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F#
      A LINQ query is really just a functional monad
      It queries databases, XML, or any IEnumerable
      PLINQ is a parallel implementation of LINQ
      Non-isolated functions must be avoided
      Otherwise it is hard to tell the two apart
    • A LINQ Example
      var q = from n in names
              where n.Name == queryInfo.Name &&
                    n.State == queryInfo.State &&
                    n.Year >= yearStart &&
                    n.Year <= yearEnd
              orderby n.Year ascending
              select n;
      PLINQ: the same query with .AsParallel() appended to the data source (from n in names.AsParallel())
    • Message Passing Interface (MPI)
      MPI is a widely used message passing library for distributed memory HPC systems
      Some of its basic functions:
      MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
      A few of its "collective communication" functions:
      MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan, MPI_Barrier, MPI_Gather, MPI_Allgather, MPI_Alltoall
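      A minimal sketch of how a few of these calls fit together, using the global-histogram step that the next slide's sort needs (standard MPI C API called from C++; the local histogram is assumed to come from the sort's first loops):
      #include <mpi.h>
      #include <vector>

      int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this node's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of nodes

        const int nvals = 1024;                 // illustrative value range
        std::vector<int> local(nvals, 0), global(nvals, 0);
        // ... fill local[] with a histogram of this node's share of the data ...

        // Element-wise sum of all nodes' histograms, delivered to every node.
        MPI_Allreduce(local.data(), global.data(), nvals, MPI_INT, MPI_SUM,
                      MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
      }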
    • Sorting in MPI
      Roughly, it could work like this on n nodes:
      Run the first two loops locally
      Use MPI_Allreduce to build a global histogram
      Run the third loop (redundantly) at every node
      Allocate n value intervals to nodes (redundantly)
      Balancing the data per node as well as possible
      Run the fourth loop using the local histogram
      Use MPI_Alltoall to redistribute the data
      Merge the n sorted subarrays on each node
      Collective communication is expensive
      But sorting needs it (see the Memory Wall slide)
    • Another Way to Sort in MPI
      The Samplesort algorithm is like Quicksort
      It works like this on n nodes:
      Sort the local data on each node independently
      Take s samples of the sorted data on each node
      Use MPI_Allgather to send all nodes all samples
      Compute n − 1 splitters (redundantly) on all nodes
      Balancing the data per node as well as possible
      Use MPI_Alltoall to redistribute the data
      Merge the n sorted subarrays on each node
    • Conclusions
    • Parallel Computing Has Arrived
      We must rethink how we write programs
      And we are definitely doing that
      Other things will also need to change
      Architecture
      Operating systems
      Algorithms
      Theory
      Application software
      We are seeing the biggest revolution in computing since its very beginnings