blueGeneLTahoeAug2002.ppt
  • Speaker note: This is nicely illustrated in a figure borrowed from Attila Gursoy’s thesis. Module A needs services from modules B and C. In a message-passing paradigm, these modules cannot execute concurrently, so idle time in one module cannot be filled with computation from another. This is possible in the message-driven paradigm. We therefore base our component architecture on a message-driven runtime system, called Converse, which we describe next.
  • Speaker note: The first substantiation of our approach comes from Ian Foster’s parallel composition principle, which states that a programming language should allow individual components to interleave their execution concurrently, with the order of execution determined implicitly by data availability.

blueGeneLTahoeAug2002.ppt Presentation Transcript

  • 1. Programming Models for Blue Gene/L: Charm++, AMPI and Applications Laxmikant Kale Parallel Programming Laboratory Dept. of Computer Science University of Illinois at Urbana-Champaign http://charm.cs.uiuc.edu
  • 2. Outline
    • Scaling to BG/L
      • Communication,
      • Mapping, reliability/FT,
      • Critical paths & load imbalance
    • The virtualization model
      • Basic ideas
      • Charm++ and AMPI
      • Virtualization: a magic bullet:
        • Logical decomposition,
        • Software eng.,
        • Flexible map
      • Message driven execution:
      • Principle of persistence
      • Runtime optimizations
    • BG/L Prog. Development env
      • Emulation setup
      • Simulation and Perf Prediction
        • Timestamp correction
        • Sdag and determinacy
      • Applications using BG/C,BG/L
        • NAMD on Lemieux
        • LeanMD,
        • 3D FFT
    • Ongoing research:
      • Load balancing
      • Communication optimization
      • Other models: Converse
    • Compiler support
  • 3. Technical Approach
    • Seek optimal division of labor between “system” and programmer:
      • Decomposition done by programmer, everything else automated
    [Figure: division of labor shown as a spectrum from specialization (MPI) to automation (HPF), with Charm++ in between, across decomposition, mapping, scheduling, and expression]
  • 4. Object-based Decomposition
    • Idea:
      • Divide the computation into a large number of pieces
        • Independent of number of processors
        • Typically larger than number of processors
      • Let the system map objects to processors
    • Old idea? Fox (’86?), DRMS,
    • Our approach is “virtualization++”
      • Language and runtime support for virtualization
      • Exploitation of virtualization to the hilt
  • 5. Object-based Parallelization [Figure: user view of interacting objects vs. the system implementation]. The user is concerned only with the interaction between objects; the system handles the implementation.
  • 6. Realizations: Charm++
    • Charm++:
      • Parallel C++ with Data Driven Objects (Chares)
      • Object Arrays/ Object Collections
      • Object Groups:
        • Global object with a “representative” on each PE
      • Asynchronous method invocation
        • Prioritized scheduling
      • Information sharing abstractions: readonly, tables,..
      • Mature, robust, portable ( http://charm.cs.uiuc.edu )
  • 7. Chare Arrays
    • Elements are data-driven objects
    • Elements are indexed by a user-defined data type-- [sparse] 1D, 2D, 3D, tree, ...
    • Send messages to index, receive messages at element. Reductions and broadcasts across the array
    • Dynamic insertion, deletion, migration-- and everything still has to work!
  • 8. Object Arrays
    • A collection of data-driven objects (aka chares),
      • With a single global name for the collection, and
      • Each member addressed by an index
      • Mapping of element objects to processors handled by the system
    [Figure: user’s view of array elements A[0], A[1], A[2], A[3], ...]
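    The chare-array idea above can be made concrete with a minimal Charm++-style sketch. The Hello array, its sayHi entry method, and the 10-element size are hypothetical illustrations rather than code from the talk, and the mainchare that would create the array (e.g., via CProxy_Hello::ckNew) is omitted; exact interface-file syntax may differ across Charm++ versions.
      // hello.ci (interface file, shown as comments): a hypothetical 1D chare array
      //   array [1D] Hello {
      //     entry Hello();
      //     entry void sayHi(int fromIndex);   // asynchronous entry method
      //   };

      // hello.C: element behavior; the runtime decides element-to-processor mapping
      #include "hello.decl.h"

      class Hello : public CBase_Hello {
       public:
        Hello() {}
        Hello(CkMigrateMessage*) {}              // needed so elements can migrate
        void sayHi(int fromIndex) {
          // Invoked whenever a message addressed to this index arrives,
          // wherever the runtime has placed (or migrated) this element.
          CkPrintf("Element %d greeted by %d\n", thisIndex, fromIndex);
          if (thisIndex + 1 < 10)
            thisProxy[thisIndex + 1].sayHi(thisIndex);   // send to an index, not a processor
          else
            CkExit();
        }
      };
      #include "hello.def.h"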
  • 9. Object Arrays
    • A collection of chares,
      • with a single global name for the collection, and
      • each member addressed by an index
      • Mapping of element objects to processors handled by the system
    [Figure: user’s view of elements A[0]..A[..] alongside the system view, where elements such as A[0] and A[3] are placed on processors by the runtime]
  • 11. Comparison with MPI
    • Advantage: Charm++
      • Modules/Abstractions are centered on application data structures,
        • Not processors
      • Several other…
    • Advantage: MPI
      • Highly popular, widely available, industry standard
      • “Anthropomorphic” view of the processor
        • Many developers find this intuitive
    • But mostly:
      • There is no hope of weaning people away from MPI
      • There is no need to choose between them!
  • 12. Adaptive MPI
    • A migration path for legacy MPI codes
      • Gives them the dynamic load balancing capabilities of Charm++
    • AMPI = MPI + dynamic load balancing
    • Uses Charm++ object arrays and migratable threads
    • Minimal modifications to convert existing MPI programs
      • Automated via AMPizer
        • Based on Polaris Compiler Framework
    • Bindings for
      • C, C++, and Fortran90
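    As a rough illustration of the migration path, here is an ordinary MPI kernel of the kind AMPI can run unchanged, with each rank becoming a user-level migratable thread. The array size and loop counts are arbitrary; the commented-out MPI_Migrate() call is an AMPI-specific extension mentioned here as an assumption about the interface of that era, not standard MPI.
      #include <mpi.h>
      #include <vector>

      int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Per-"process" state; under AMPI this lives in a virtual processor
        // (a migratable thread), so there may be many ranks per physical CPU.
        std::vector<double> local(1000, static_cast<double>(rank));

        for (int step = 0; step < 100; ++step) {
          double sum = 0.0, global = 0.0;
          for (double x : local) sum += x;                 // local work
          MPI_Allreduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          // if (step % 20 == 0) MPI_Migrate();  // AMPI extension (assumption):
          //                                     // let the runtime rebalance threads
        }

        MPI_Finalize();
        return 0;
      }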
  • 13. AMPI: 7 MPI processes
  • 14. AMPI: 7 MPI “processes” implemented as virtual processors (user-level migratable threads) mapped onto real processors [Figure]
  • 15. II: Consequences of Virtualization
    • Better Software Engineering
    • Message Driven Execution
    • Flexible and dynamic mapping to processors
  • 16. Modularization
    • Logical Units decoupled from “Number of processors”
      • E.g., oct-tree nodes for particle data
      • No artificial restriction on the number of processors
        • E.g., a cube of a power of 2
    • Modularity:
      • Software engineering: cohesion and coupling
      • MPI’s “are on the same processor” is a bad coupling principle
      • Objects liberate you from that:
        • E.g., solid and fluid modules in a rocket simulation
  • 17. Rocket Simulation
    • Large collaboration headed by Mike Heath
      • DOE supported center
    • Challenge:
      • Multi-component code, with modules from independent researchers
      • MPI was common base
    • AMPI: new wine in old bottle
      • Easier to convert
      • Can still run original codes on MPI, unchanged
  • 18. Rocket simulation via virtual processors [Figure: many Rocflo, Rocface, and Rocsolid virtual processors distributed across the physical processors]
  • 19. AMPI and Roc*: Communication [Figure: communication among the Rocflo, Rocface, and Rocsolid virtual processors]
  • 20. Message Driven Execution [Figure: each processor has a scheduler that picks the next message from its message queue]
  • 21. Adaptive Overlap via Data-driven Objects
    • Problem:
      • Processors wait for too long at “receive” statements
    • Routine communication optimizations in MPI
      • Move sends up and receives down
      • Sometimes use irecvs, but be careful
    • With Data-driven objects
      • Adaptive overlap of computation and communication
      • No object or thread holds up the processor
      • No need to guess which is likely to arrive first
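    To make the contrast concrete, here is a sketch of the “routine” MPI-side optimization mentioned above: post the nonblocking receives early, move the sends up, and do local work before waiting. Variable names and message layout are illustrative assumptions; with data-driven objects this manual scheduling is unnecessary, because whichever message arrives first simply triggers its handler.
      #include <mpi.h>
      #include <vector>

      // Illustrative halo exchange: overlap communication with interior work.
      void exchange_and_compute(int left, int right, std::vector<double>& halo_l,
                                std::vector<double>& halo_r,
                                const std::vector<double>& send_l,
                                const std::vector<double>& send_r) {
        MPI_Request reqs[4];
        // Post receives as early as possible, then the sends.
        MPI_Irecv(halo_l.data(), (int)halo_l.size(), MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(halo_r.data(), (int)halo_r.size(), MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(send_l.data(), (int)send_l.size(), MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(send_r.data(), (int)send_r.size(), MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

        // ... compute on interior data that does not need the halos ...

        // Only now block; no need to guess which neighbor answers first.
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        // ... compute on boundary data that needed the halos ...
      }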
  • 22. Adaptive overlap and modules [Figure: SPMD vs. message-driven modules] (From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D. thesis, April 1994.)
  • 23. Modularity and Adaptive Overlap: “Parallel Composition Principle: For effective composition of parallel components, a compositional programming language should allow concurrent interleaving of component execution, with the order of execution constrained only by availability of data.” (Ian Foster, Compositional parallel programming languages, ACM Transactions on Programming Languages and Systems, 1996)
  • 24. Handling OS Jitter via MDE
    • MDE encourages asynchrony
      • Asynchronous reductions, for example
      • Only data dependence should force synchronization
    • One benefit:
      • Consider an algorithm with N steps
        • Each step has different load balance: Tij
        • Loose dependence between steps
          • (on neighbors, for example)
      • Sum-of-max (MPI) vs max-of-sum (MDE)
    • OS Jitter:
      • Causes random processors to add delays in each step
      • Handled Automatically by MDE
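    The sum-of-max vs. max-of-sum point can be written out explicitly. With T_{ij} the time processor i spends in step j (including any jitter-induced delay), a sketch of the two completion-time expressions is:
      T_{\text{MPI}} \;=\; \sum_{j=1}^{N} \max_{i} T_{ij}
      \qquad\text{vs.}\qquad
      T_{\text{MDE}} \;\approx\; \max_{i} \sum_{j=1}^{N} T_{ij}
    Since \max_i \sum_j T_{ij} \le \sum_j \max_i T_{ij}, a random delay hitting a different processor in each step inflates every term of the MPI-style sum, while under message-driven execution it is largely absorbed, provided the loose dependence between steps permits the overlap.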
  • 25. Virtualization/MDE leads to predictability
    • Ability to predict:
      • Which data is going to be needed and
      • Which code will execute
      • Based on the ready queue of object method invocations
    • So, we can:
      • Prefetch data accurately
      • Prefetch code if needed
      • Out-of-core execution
      • Caches vs controllable SRAM
    [Figure: per-processor schedulers (S) and message queues (Q)]
  • 26. Flexible Dynamic Mapping to Processors
    • The system can migrate objects between processors
      • Vacate workstation used by a parallel program
      • Dealing with extraneous loads on shared workstations
      • Shrink and Expand the set of processors used by an app
        • Adaptive job scheduling
        • Better System utilization
      • Adapt to speed difference between processors
        • E.g., a cluster with 500 MHz and 1 GHz processors
    • Automatic checkpointing
      • Checkpointing = migrate to disk!
      • Restart on a different number of processors
  • 27. Load Balancing with AMPI/Charm++ (the Turing cluster has processors with different speeds; time per phase):
    Phase        | 16P3   | 16P2   | 8P3,8P2 w/o LB | 8P3,8P2 w. LB
    Fluid update | 75.24  | 97.50  | 96.73          | 86.89
    Solid update | 41.86  | 52.50  | 52.20          | 46.83
    Pre-Cor Iter | 117.16 | 150.08 | 149.01         | 133.76
    Time Step    | 235.19 | 301.56 | 299.85         | 267.75
  • 28. Principle of Persistence
    • Once the application is expressed in terms of interacting objects:
      • Object communication patterns and computational loads tend to persist over time
      • In spite of dynamic behavior
        • Abrupt and large, but infrequent, changes (e.g., AMR)
        • Slow and small changes (e.g., particle migration)
    • Parallel analog of principle of locality
      • A heuristic that holds for most CSE applications
      • Learning / adaptive algorithms
      • Adaptive Communication libraries
      • Measurement based load balancing
  • 29. Measurement Based Load Balancing
    • Based on Principle of persistence
    • Runtime instrumentation
      • Measures communication volume and computation time
    • Measurement based load balancers
      • Use the instrumented database periodically to make new decisions
      • Many alternative strategies can use the database
        • Centralized vs distributed
        • Greedy improvements vs complete reassignments
        • Taking communication into account
        • Taking dependences into account (More complex)
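    As a toy illustration of the “greedy reassignment” flavor of strategy, here is a sketch that takes measured per-object loads and assigns objects to processors, heaviest first, always onto the currently least-loaded processor. It ignores communication and dependences, and all names are hypothetical; real Charm++ load balancers plug into the runtime’s instrumentation database rather than a plain vector.
      #include <algorithm>
      #include <functional>
      #include <queue>
      #include <utility>
      #include <vector>

      // Greedy mapping: objects sorted by measured load (descending),
      // each placed on the processor with the smallest accumulated load.
      std::vector<int> greedyMap(const std::vector<double>& objectLoad, int numProcs) {
        std::vector<int> order(objectLoad.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return objectLoad[a] > objectLoad[b]; });

        // Min-heap of (accumulated load, processor id).
        using Entry = std::pair<double, int>;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
        for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

        std::vector<int> assignment(objectLoad.size(), -1);
        for (int obj : order) {
          auto [load, p] = procs.top(); procs.pop();
          assignment[obj] = p;
          procs.push({load + objectLoad[obj], p});
        }
        return assignment;   // assignment[obj] = processor chosen for that object
      }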
  • 30. Load balancer in action: automatic load balancing in crack propagation [Figure: 1. elements added, 2. load balancer invoked, 3. chunks migrated]
  • 31. “Overhead” of Virtualization
  • 32. Optimizing for Communication Patterns
    • The parallel-objects Runtime System can observe, instrument, and measure communication patterns
      • Communication is from/to objects, not processors
      • Load balancers use this to optimize object placement
      • Communication libraries can optimize
        • By substituting most suitable algorithm for each operation
        • Learning at runtime
      • E.g., each-to-all individualized sends
        • Performance depends on many runtime characteristics
        • Library switches between different algorithms
    V. Krishnan, MS Thesis, 1996
  • 33. Example: All to all on Lemieux
  • 34. The Other Side: Pipelining
    • A sends a large message to B, whereupon B computes
      • Problem: B is idle for a long time, while the message gets there.
      • Solution: Pipelining
        • Send the message in multiple pieces, triggering a computation on each
    • Objects makes this easy to do:
    • Example:
      • Ab initio computations using the Car-Parrinello method
      • Multiple 3D FFT kernel
    Recent collaboration with: R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
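    A minimal sketch of the pipelining idea, independent of any particular runtime: instead of one large send followed by one large compute at the receiver, the sender chunks the data and the receiver processes each chunk as it arrives. Chunk counts, sizes, and the callback names are illustrative assumptions.
      #include <algorithm>
      #include <cstddef>
      #include <functional>
      #include <vector>

      // Sender side: split a large buffer into numChunks pieces and hand each to a
      // send function (e.g., an asynchronous method invocation on the receiver).
      void sendPipelined(const std::vector<double>& data, std::size_t numChunks,
                         const std::function<void(const double*, std::size_t, std::size_t)>& sendChunk) {
        std::size_t chunkSize = (data.size() + numChunks - 1) / numChunks;
        for (std::size_t c = 0, off = 0; c < numChunks && off < data.size(); ++c, off += chunkSize) {
          std::size_t len = std::min(chunkSize, data.size() - off);
          sendChunk(data.data() + off, len, c);   // each chunk triggers work at the receiver
        }
      }

      // Receiver side: work on chunk c as soon as it arrives, instead of idling
      // until the entire message is present (e.g., start the 1D FFTs for the
      // planes contained in this chunk).
      void onChunkArrived(const double* chunk, std::size_t len, std::size_t c) {
        // ... compute on [chunk, chunk + len) ...
        (void)chunk; (void)len; (void)c;
      }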
  • 35.  
  • 36. Effect of Pipelining Multiple Concurrent 3D FFTs, on 64 Processors of Lemieux V. Ramkumar (PPL)
  • 37. Control Points: learning and tuning
    • The RTS can automatically optimize the degree of pipelining
      • If it is given a control point (knob) to tune
      • By the application
    Controlling pipelining between a pair of objects: S. Krishnan, Ph.D. thesis, 1994. Controlling degree of virtualization (Orchestration Framework): M. Bhandarkar, Ph.D. thesis, 2002.
  • 38. So, What Are We Doing About It?
    • How do you develop a programming environment for a machine that isn’t built yet?
    • Blue Gene/C emulator using Charm++
      • Completed last year
      • Implements the low-level BG/C API
        • Packet sends, extract packets from communication buffers
      • Emulation runs on machines with hundreds of “normal” processors
    • Charm++ on the Blue Gene/C emulator
  • 40. Structure of the Emulators [Figure: layered stacks; Charm++ runs on Converse natively, the BG/C low-level API is emulated on top of Charm++/Converse, and Charm++ runs again on top of the emulated BG/C API]
  • 42. Emulation on a Parallel Machine [Figure: each simulating (host) processor holds many emulated BG/C nodes, each with its hardware threads]
  • 44. Extensions to Charm++ for BG/C
    • Microtasks:
      • Objects may fire microtasks that can be executed by any thread on the same node
      • Increases parallelism
      • Overhead: sub-microsecond
    • Issue:
      • Object affinity: map to thread or node?
        • Thread, currently.
        • Microtasks alleviate load imbalance within a node
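    The microtask idea can be sketched, very loosely, as a per-node work queue from which any of the node’s threads may pull. The BG/C-specific API is not shown, and everything below (names, the use of standard threading primitives) is an assumption made only to illustrate “fire a small task that any thread on the same node may execute”.
      #include <condition_variable>
      #include <functional>
      #include <mutex>
      #include <queue>

      // One instance per emulated node; objects "fire" microtasks into it and any
      // worker thread of that node may execute them, smoothing load within the node.
      class NodeMicrotaskQueue {
       public:
        void fire(std::function<void()> task) {
          { std::lock_guard<std::mutex> g(m_); q_.push(std::move(task)); }
          cv_.notify_one();
        }
        // Called in a loop by each worker thread of the node.
        std::function<void()> next() {
          std::unique_lock<std::mutex> lk(m_);
          cv_.wait(lk, [this] { return !q_.empty(); });
          auto task = std::move(q_.front());
          q_.pop();
          return task;
        }
       private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> q_;
      };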
  • 46. Emulation efficiency
    • How much time does it take to run an emulation?
      • 8 Million processors being emulated on 100
      • In addition, lower cache performance
      • Lots of tiny messages
    • On a Linux cluster:
      • Emulation shows good speedup
  • 48. Emulation efficiency: 1000 BG/C nodes (10x10x10), each with 200 threads (200,000 user-level threads in total). Data is preliminary, based on one simulation.
  • 50. Emulator to Simulator
    • Step 1: Coarse grained simulation
      • Simulation: performance prediction capability
      • Models contention for processor/thread
      • Also models communication delay based on distance
      • Doesn’t model memory access on chip, or network
      • How to do this in spite of out-of-order message delivery?
        • Rely on determinism of Charm++ programs
        • Time stamped messages and threads
        • Parallel time-stamp correction algorithm
  • 52. Emulator to Simulator
    • Step 2: Add fine-grained processor simulation
      • Sarita Adve: RSIM based simulation of a node
        • SMP node simulation: completed
      • Also: simulation of interconnection network
      • Millions of thread units/caches to simulate in detail?
    • Step 3: Hybrid simulation
      • Instead: use detailed simulation to build model
      • Drive coarse simulation using model behavior
      • Further help from compiler and RTS
  • 54. Applications on the current system
    • Using BG Charm++
    • LeanMD:
      • Research-quality molecular dynamics
      • Version 0: only electrostatics + van der Waals
    • Simple AMR kernel
      • Adaptive tree to generate millions of objects
        • Each holding a 3D array
      • Communication with “neighbors”
        • The tree makes it harder to find neighbors, but Charm makes it easy
  • 55. Modeling layers: applications, libraries/RTS, chip architecture, network model. For each layer we need a detailed simulation and a simpler (e.g., table-driven) “model”, plus methods for combining them.
  • 57. Timestamp correction
    • Basic execution:
      • Timestamped messages
    • Correction needed when:
      • A message arrives with an earlier timestamp than other messages “processed” already
    • Cases:
      • Messages to Handlers or simple objects
      • MPI style threads, without wildcard or irecvs
      • Charm++ with dependence expressed via structured dagger
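    A very rough sketch of the bookkeeping involved, assuming the simple-object case: each simulated entity remembers the receive timestamps it has already executed, and an arriving message with an earlier timestamp flags the need for a correction pass over the logged events (and over any messages they sent). This illustrates the problem statement above, not the parallel correction algorithm from the talk.
      #include <algorithm>
      #include <vector>

      struct LoggedEvent {
        double recvTime;    // receive timestamp of the message
        double duration;    // execution time of the handler
      };

      // Per-object log of executed events, kept in execution order.
      struct ObjectTimeline {
        std::vector<LoggedEvent> log;
        double lastEndTime = 0.0;

        // Returns true if the new message is "late", i.e. its timestamp is earlier
        // than events already executed, so downstream timestamps must be corrected.
        bool execute(double msgRecvTime, double duration) {
          bool needsCorrection = !log.empty() && msgRecvTime < log.back().recvTime;
          double start = std::max(msgRecvTime, lastEndTime);   // may start only after
          lastEndTime = start + duration;                      // the previous event ends
          log.push_back({msgRecvTime, duration});
          return needsCorrection;   // caller would emit correction messages here
        }
      };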
  • 58. Timestamp Correction [Figure: messages M1 through M8 with their receive times along the execution timeline]
  • 60. Timestamp Correction [Figure: M8 arrives with a receive time earlier than messages already executed, so a correction message is generated]
  • 61. Timestamp Correction [Figure: the correction message for M4 propagates and the receive times along the execution timeline are adjusted]
  • 62. Performance of the correction algorithm
    • Without correction
      • 15 seconds to emulate an 18 ms timestep
      • 10x10x10 nodes with k threads each (200?)
    • With correction
      • Version 1: 42 minutes per step!
      • Version 2:
        • “Chase” and correct messages still in queues
        • Optimize search for messages in the log data
        • Currently at 30 secs per step
  • 63. Applications on the current system
    • Using BG Charm++
    • LeanMD:
      • Research-quality molecular dynamics
      • Version 0: only electrostatics + van der Waals
    • Simple AMR kernel
      • Adaptive tree to generate millions of objects
        • Each holding a 3D array
      • Communication with “neighbors”
        • The tree makes it harder to find neighbors, but Charm makes it easy
  • 64. Example: Molecular Dynamics in NAMD
    • Collection of [charged] atoms, with bonds
      • Newtonian mechanics
      • Thousands of atoms (1,000 - 500,000)
      • 1 femtosecond time-step, millions needed!
    • At each time-step
      • Calculate forces on each atom
        • Bonds:
        • Non-bonded: electrostatic and van der Waals
      • Calculate velocities and advance positions
      • Multiple Time Stepping: PME (3D FFT) every 4 steps
    Collaboration with K. Schulten, R. Skeel, and coworkers
  • 65. NAMD: Molecular Dynamics
    • Collection of [charged] atoms, with bonds
    • Newtonian mechanics
    • At each time-step
      • Calculate forces on each atom
        • Bonds:
        • Non-bonded: electrostatic and van der Waals
      • Calculate velocities and advance positions
    • 1 femtosecond time-step, millions needed!
    • Thousands of atoms (1,000 - 100,000)
    Collaboration with K. Schulten, R. Skeel, and coworkers
  • 66. Further MD
    • Use of cut-off radius to reduce work
      • 8 - 14 Å
      • Faraway charges ignored!
    • 80-95% of the work is in non-bonded force computation
    • Some simulations need faraway contributions
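    A brute-force sketch of the cutoff idea for the non-bonded forces: every pair within the cutoff radius contributes, everything farther away is ignored. The constants and the exact force law are placeholders (a Coulomb-like plus short-range repulsive term), not NAMD’s actual parameters, and real codes use patches or cell lists rather than this O(N^2) pair loop.
      #include <cmath>
      #include <vector>

      struct Atom { double x, y, z, charge; };
      struct Vec3 { double x = 0, y = 0, z = 0; };

      // Accumulate pairwise non-bonded forces, skipping pairs beyond the cutoff.
      void nonbondedForces(const std::vector<Atom>& atoms, double cutoff,
                           std::vector<Vec3>& force) {
        const double cutoff2 = cutoff * cutoff;
        force.assign(atoms.size(), Vec3{});
        for (std::size_t i = 0; i < atoms.size(); ++i) {
          for (std::size_t j = i + 1; j < atoms.size(); ++j) {
            double dx = atoms[i].x - atoms[j].x;
            double dy = atoms[i].y - atoms[j].y;
            double dz = atoms[i].z - atoms[j].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > cutoff2) continue;            // faraway charges ignored
            double r = std::sqrt(r2);
            // Placeholder magnitudes: Coulomb-like plus a short-range repulsion.
            double f = atoms[i].charge * atoms[j].charge / r2 + 1.0 / (r2 * r2 * r2);
            double fx = f * dx / r, fy = f * dy / r, fz = f * dz / r;
            force[i].x += fx; force[i].y += fy; force[i].z += fz;   // Newton's 3rd law:
            force[j].x -= fx; force[j].y -= fy; force[j].z -= fz;   // equal and opposite
          }
        }
      }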
  • 67. Scalability
    • The Program should scale up to use a large number of processors.
      • But what does that mean?
    • An individual simulation isn’t truly scalable
    • Better definition of scalability:
      • If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
  • 68. Isoefficiency
    • Quantify scalability
    • How much increase in problem size is needed to retain the same efficiency on a larger machine?
    • Efficiency: Sequential Time / (P · Parallel Time)
      • parallel time =
        • computation + communication + idle
  • 69. Traditional Approaches
    • Replicated Data:
      • All atom coordinates stored on each processor
      • Non-bonded Forces distributed evenly
      • Analysis: Assume N atoms, P processors
        • Computation: O( N/P )
        • Communication: O( N log P )
        • Communication/Computation ratio: P log P
        • Fraction of communication increases with number of processors, independent of problem size!
    Not Scalable
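    Filling in the step from the O(N log P) communication to “not scalable”, a sketch of the efficiency expression for the replicated-data scheme (constants absorbed into c):
      E \;=\; \frac{T_{\text{seq}}}{P \cdot T_{\text{par}}}
        \;\approx\; \frac{1}{1 + \dfrac{T_{\text{comm}}}{T_{\text{comp}}}}
        \;=\; \frac{1}{1 + c \, P \log P}
    Because the communication-to-computation ratio P log P does not depend on N, no increase in problem size can restore the efficiency lost as P grows, which is exactly the isoefficiency-based sense in which the method is not scalable.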
  • 70. Atom decomposition
    • Partition the Atoms array across processors
      • Nearby atoms may not be on the same processor
      • Communication: O(N) per processor
      • Communication/Computation: O(P)
    Not Scalable
  • 71. Force Decomposition
    • Distribute force matrix to processors
      • Matrix is sparse, non uniform
      • Each processor has one block
      • Communication: N/sqrt(P)
      • Ratio: sqrt(P)
    • Better scalability
      • (can use 100+ processors)
      • Hwang, Saltz, et al:
      • 6% on 32 PEs, 36% on 128 processors
    Not Scalable
  • 72. Spatial Decomposition
    • Allocate close-by atoms to the same processor
    • Three variations possible:
      • Partitioning into P boxes, 1 per processor
        • Good scalability, but hard to implement
      • Partitioning into fixed-size boxes, each a little larger than the cutoff distance
      • Partitioning into smaller boxes
    • Communication: O(N/P)
  • 73. Spatial Decomposition in NAMD
    • NAMD 1 used spatial decomposition
    • Good theoretical isoefficiency, but for a fixed size system, load balancing problems
    • For midsize systems, it got good speedups up to 16 processors
    • Use the symmetry of Newton’s 3rd law to facilitate load balancing
  • 74. Spatial Decomposition
  • 75. Spatial Decomposition
  • 76. Object Based Parallelization for MD
  • 77. FD + SD
    • Now, we have many more objects to load balance:
      • Each diamond can be assigned to any processor
      • Number of diamonds (3D):
      • 14·Number of Patches
  • 78. Bond Forces
    • Multiple types of forces:
      • Bonds(2), Angles(3), Dihedrals (4), ..
      • Luckily, each involves atoms in neighboring patches only
    • Straightforward implementation:
      • Send message to all neighbors,
      • receive forces from them
      • 26*2 messages per patch!
  • 79. Bonded Forces:
    • Assume one patch per processor
    [Figure: neighboring patches A, B, and C]
  • 80. Optimizations in scaling to 1000
    • Parallelization is based on parallel objects
      • Charm++ : a parallel C++
    • Series of optimizations were implemented to scale performance to 1000+ processors
    • Examples:
      • Load Balancing:
        • Grainsize distributions
  • 81. Grainsize and Amdahl’s law
    • A variant of Amdahl’s law, for objects:
      • The fastest time can be no shorter than the time for the biggest single object!
    • How did it apply to us?
      • Sequential step time was 57 seconds
      • To run on 2k processors, no object should be more than 28 msecs.
      • Analysis using our tools showed:
  • 82. Grainsize analysis [Figure: grainsize distribution illustrating the problem]. Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms.
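    A toy version of the splitting heuristic described above, with a made-up work estimate and threshold: a compute object whose pair of atom lists implies too much work is divided into smaller pieces along one of the lists. Names and the threshold value are illustrative assumptions, not NAMD’s actual code.
      #include <algorithm>
      #include <cstddef>
      #include <vector>

      struct ComputePiece {
        int patchA, patchB;            // the two interacting patches
        std::size_t beginA, endA;      // sub-range of patch A's atoms handled here
      };

      // Estimate work by the number of interacting atom pairs, and split the
      // compute along patch A until each piece falls under the threshold.
      std::vector<ComputePiece> splitCompute(int patchA, int patchB,
                                             std::size_t atomsA, std::size_t atomsB,
                                             std::size_t maxPairsPerObject = 20000) {
        std::size_t pairs = atomsA * atomsB;
        std::size_t numPieces = (pairs + maxPairsPerObject - 1) / maxPairsPerObject;
        if (numPieces == 0) numPieces = 1;

        std::vector<ComputePiece> pieces;
        std::size_t chunk = (atomsA + numPieces - 1) / numPieces;
        for (std::size_t begin = 0; begin < atomsA; begin += chunk)
          pieces.push_back({patchA, patchB, begin, std::min(begin + chunk, atomsA)});
        return pieces;   // each piece is small enough to be balanced independently
      }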
  • 83. Grainsize reduced
  • 84. NAMD performance using virtualization
    • Written in Charm++
    • Uses measurement based load balancing
    • Object level performance feedback
      • using “projections” tool for Charm++
      • Identifies problems at source level easily
      • Almost suggests fixes
    • Attained unprecedented performance
  • 85.  
  • 86.  
  • 87. PME parallelization [note in the original deck: import picture from the SC02 paper (Sindhura’s)]
  • 88.  
  • 89.  
  • 90.  
  • 91. Performance: NAMD on Lemieux ATPase: 320,000+ atoms including water
  • 92. LeanMD for BG/L
    • Need many more objects:
      • Generalize the hybrid decomposition scheme
        • 1-away to k-away
  • 93. Object Based Parallelization for MD
  • 94.  
  • 95. Role of compilers
    • New uses of compiler analysis
      • Apparently simple, but then again, data flow analysis must have seemed simple
      • Supporting Threads,
      • Shades of global variables
      • Minimizing state at migration
      • Border fusion
      • Split-phase semantics (UPC).
      • Components (separately compiled)
    • Compiler – RTS collaboration needed!
  • 96. Summary
    • Virtualization as a magic bullet
      • Logical decomposition, better software eng.
      • Flexible and dynamic mapping to processors
    • Message driven execution:
      • Adaptive overlap, modularity, predictability
    • Principle of persistence
      • Measurement based load balancing,
      • Adaptive communication libraries
    • Future:
      • Compiler support
      • Realize the potential:
        • Strategies and applications
    More info: http://charm.cs.uiuc.edu
  • 97. Component Frameworks
    • Seek optimal division of labor between “system” and programmer:
      • Decomposition done by programmer, everything else automated
      • Develop standard library of reusable parallel components
    Domain-specific frameworks [Figure: the same division-of-labor spectrum as before, from specialization (MPI) to automation (HPF) with Charm++ in between, across decomposition, mapping, scheduling, and expression]