
A Parallel Simulator For Performance Prediction Of Extremely Large Parallel Machines


    1. BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines
       Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kale
       University of Illinois at Urbana-Champaign
    2. Motivations
       • Extremely large parallel machines around the corner
         • Examples:
           • ASCI Purple (12K, 100TF)
           • BlueGene/L (64K, 360TF)
           • BlueGene/C (8M, 1PF)
         • PF machines likely to have 100k+ processors (1M?)
       • Would existing parallel applications scale?
         • Machines are not there
         • Parallel performance is hard to model without actually running the program
    3. BlueGene/L
    4. Roadmap
       • Explore suitable programming models
         • Charm++ (Message-driven)
         • MPI and its extension - AMPI (adaptive version of MPI)
       • Use a parallel emulator to run applications
       • Coarse-grained simulator for performance prediction (not hardware simulation)
    5. Charm++ - Object-based programming model
       • User is only concerned with interaction between objects
       [Figure: user view versus system implementation]
    6. Charm++ Object-based Programming Model
       • Processor virtualization
         • Divide computation into large number of pieces
           • Independent of number of processors
           • Typically larger than number of processors
         • Let system map objects to processors
           • Empowers an adaptive, intelligent runtime system
       [Figure: user view versus system implementation]
    7. Charm++ for Peta-scale Machines
       • Explicit management of resources
         • This data on that processor
         • This work on that processor
       • Objects can migrate
         • Automatic efficient resource management
       • One-sided communication
       • Asynchronous global operations (reductions, ..)
    8. AMPI - MPI + processor virtualization
       [Figure: 7 MPI “processes”, implemented as virtual processors (user-level migratable threads), on real processors]
    9. Parallel Emulator
       • Actually run a parallel program
         • Emulate full machine on existing parallel machines
       • Based on a common low level abstraction (API)
         • Many multiprocessor nodes connected via message passing
       • Emulator supports Charm++/AMPI
       Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, “A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops”, in NGS Program Workshop, IPDPS 2002
    10. Emulation on a Parallel Machine
        • Emulating 8M threads on 96 ASCI-Red processors
        [Figure labels: simulating (host) processor, simulated multi-processor nodes, simulated processor]
    11. Emulator Performance
        • Scalable
        • Emulating a real-world MD application on a 200K processor BG machine
        Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, “A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops”, in NGS Program Workshop, IPDPS 2002
    12. Emulator to Simulator
        • Predicting parallel performance
        • Modeling parallel performance accurately is challenging
          • Communication subsystem
          • Behavior of runtime system
          • Size of the machine is big
    13. Performance Prediction
        • Parallel Discrete Event Simulation (PDES)
          • Each logical processor (LP) has a virtual clock
          • Events are time-stamped
          • The state of an LP changes when an event arrives at it
        • Our emulator was extended to carry out PDES
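The event model on this slide can be illustrated with a short Python sketch. It is sequential, not parallel, and it is not BigSim code: the function name `run_events` and the `(timestamp, lp_id, payload)` tuple layout are assumptions made for the example. It only shows time-stamped events being processed in timestamp order while each LP's virtual clock advances.

```python
import heapq


def run_events(events):
    """Process (timestamp, lp_id, payload) events in timestamp order.

    Hypothetical helper for illustration. Returns each LP's final
    virtual clock and the order in which payloads were processed.
    """
    heap = list(events)
    heapq.heapify(heap)              # min-heap keyed on the timestamp
    clocks = {}                      # lp_id -> virtual clock
    order = []
    while heap:
        ts, lp, payload = heapq.heappop(heap)
        # The LP's state (here just its clock) changes when an event arrives.
        clocks[lp] = max(clocks.get(lp, 0.0), ts)
        order.append(payload)
    return clocks, order


clocks, order = run_events([(3.0, 0, "c"), (1.0, 1, "a"), (2.0, 0, "b")])
```

A parallel simulator distributes the LPs over real processors, which is where the synchronization protocols discussed on the following slides come in.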
    14. Predict Parallel Components
        • How to predict parallel components?
          • Multiple resolution levels
          • Sequential component:
            • User supplied expression
            • Performance counters
            • Instruction level simulation
          • Parallel component:
            • Simple latency-based network model
            • Contention-based network simulation
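As a rough illustration of the coarsest resolution level, the sketch below pairs a user-supplied cost expression for sequential work with a latency-based message cost. The slides do not give the formulas BigSim uses, so everything here is an assumption: the linear cost expression, the startup-plus-bandwidth latency model, and the names `predicted_compute_time`, `message_latency`, `alpha`, and `bytes_per_sec` are all hypothetical.

```python
def predicted_compute_time(n_points, seconds_per_point=1e-8):
    """User-supplied expression (assumed linear in the amount of work)."""
    return n_points * seconds_per_point


def message_latency(num_bytes, alpha=5e-6, bytes_per_sec=1e9):
    """Assumed latency model: fixed startup cost plus size over bandwidth.

    It ignores contention entirely; modeling contention is what the
    contention-based network simulation level is for.
    """
    return alpha + num_bytes / bytes_per_sec


compute = predicted_compute_time(100, seconds_per_point=0.5)
transfer = message_latency(1000, alpha=1.0, bytes_per_sec=1000.0)
```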
    15. Prior PDES Work
        • Conservative vs. optimistic protocols
          • Conservative: (example: DaSSF)
            • Ensure safety of processing events in a global fashion
            • Typically require a look-ahead; high global synchronization overhead
            • MPI-SIM
          • Optimistic: (examples: Time Warp, SPEEDS)
            • Each LP processes the earliest event on its own and undoes earlier out-of-order execution when causality errors occur
            • Exploits the parallelism of the simulation better, and is preferred
    16. Why not use existing PDES?
        • Major synchronization overheads
          • Rollback/restart overhead
          • Checkpointing overhead
        • We can do better in simulation of some parallel applications
          • Property of inherent determinacy in parallel applications
          • Most parallel programs are written to be deterministic, for example “Jacobi”
    17. Timestamp Correction
        • Messages should be executed in the order of their timestamps
        • Causality errors arise from out-of-order message delivery
        • Rollback and checkpointing are necessary in traditional methods
        • Inherent determinacy is hidden in applications
        • Need to capture event dependency
          • Run-time detection
          • Use the language “structured dagger” to express dependency
    18. Simulation of Different Applications
        • Linear-order applications
          • No wildcard MPI receives
          • Strong determinacy, no timestamp correction necessary
        • Reactive applications (atomic)
          • Message-driven objects
          • Methods execute as corresponding messages arrive
        • Multi-dependent applications
          • Irecvs with WaitAll (MPI)
          • Use of structured dagger to capture dependency (Charm++)
    19. Structured-Dagger
        entry void jacobiLifeCycle()
        {
          for (i=0; i<MAX_ITER; i++)
          {
            atomic { sendStripToLeftAndRight(); }
            overlap
            {
              when getStripFromLeft(Msg *leftMsg)
                { atomic { copyStripFromLeft(leftMsg); } }
              when getStripFromRight(Msg *rightMsg)
                { atomic { copyStripFromRight(rightMsg); } }
            }
            atomic { doWork(); /* Jacobi Relaxation */ }
          }
        }
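The dependency this slide expresses (both strips must arrive, in either order, before the relaxation step runs) can be mimicked in plain Python. This is only an analogy for the `overlap`/`when` construct, not Charm++ code; the class `JacobiChunk` and its methods are hypothetical names invented for the sketch.

```python
from collections import deque


class JacobiChunk:
    """Toy object that runs do_work only once both neighbor strips are in."""

    def __init__(self):
        # One FIFO per neighbor, so an early strip for the next iteration
        # is buffered instead of overwriting the current one.
        self.queues = {"left": deque(), "right": deque()}
        self.iterations_done = 0

    def deliver(self, side, strip):
        self.queues[side].append(strip)
        # Fire once per matched (left, right) pair, whatever the arrival order.
        while self.queues["left"] and self.queues["right"]:
            self.do_work(self.queues["left"].popleft(),
                         self.queues["right"].popleft())

    def do_work(self, left_strip, right_strip):
        self.iterations_done += 1    # stands in for the Jacobi relaxation


chunk = JacobiChunk()
chunk.deliver("right", [1.0])        # nothing runs yet: left strip missing
after_one = chunk.iterations_done
chunk.deliver("left", [2.0])         # both strips present, one iteration runs
after_two = chunk.iterations_done
```

Because the work depends only on which messages have arrived and not on their arrival order, the outcome is deterministic, which is the property the simulator relies on.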
    20. Time Stamping messages
        • LP Virtual Timer: curT
        • Message sent: RecvT(msg) = curT + Latency
        • Message scheduled: curT = max(curT, RecvT(msg))
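The two rules on this slide can be sketched as a small Python class. This is a minimal illustration, not BigSim's implementation: the class name `LogicalProcessor`, the constant `latency`, and the `exec_time` parameter are assumptions added for the example.

```python
class LogicalProcessor:
    """Toy LP that keeps a virtual timer (curT on the slide)."""

    def __init__(self, latency):
        self.cur_t = 0.0          # LP virtual timer
        self.latency = latency    # assumed constant network latency

    def send(self):
        # Message sent: RecvT(msg) = curT + Latency
        return self.cur_t + self.latency

    def schedule(self, recv_t, exec_time=0.0):
        # Message scheduled: curT = max(curT, RecvT(msg))
        self.cur_t = max(self.cur_t, recv_t)
        # Assumption: running the handler advances the timer by exec_time.
        self.cur_t += exec_time
        return self.cur_t


sender = LogicalProcessor(latency=5.0)
receiver = LogicalProcessor(latency=5.0)
sender.cur_t = 10.0
recv_t = sender.send()                      # 10.0 + 5.0
finish = receiver.schedule(recv_t, 2.0)     # max(0.0, 15.0) + 2.0
```

The `max` keeps the timer from moving backwards when a message with an earlier RecvT is scheduled late.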
    21. Timestamps Correction
        [Figure: execution timelines of messages M1 to M8 against RecvTime, with a correction message]
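One way to picture the effect of a correction is to replay the messages of a single LP in RecvTime order and recompute their start times. The sketch below does only that and makes strong assumptions: the messages are independent, each has a known duration, and the function name `corrected_timeline` is hypothetical. The simulator described in these slides works from recorded dependencies and correction messages, which this sketch does not model.

```python
def corrected_timeline(messages):
    """Replay (name, recv_time, duration) tuples in RecvTime order.

    The input order is the order in which the messages happened to be
    executed. Returns a list of (name, start, end) tuples.
    """
    cur_t = 0.0
    timeline = []
    for name, recv_t, duration in sorted(messages, key=lambda m: m[1]):
        start = max(cur_t, recv_t)   # cannot start before the message arrives
        cur_t = start + duration
        timeline.append((name, start, cur_t))
    return timeline


# M3 was executed before M2 although M2 has the earlier RecvTime.
executed = [("M1", 0.0, 2.0), ("M3", 6.0, 1.0), ("M2", 1.0, 3.0)]
timeline = corrected_timeline(executed)
```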
    22. Architecture of BigSim Simulator
        [Diagram components:]
        • Charm++ and MPI applications
        • BigSim Emulator
        • Charm++ Runtime
        • Online PDES engine
        • Instruction Sim (RSim, IBM, ..)
        • Simple Network Model
        • Performance counters
        • Load Balancing Module
        • Simulation output trace logs
        • Performance visualization (Projections)
    23. Architecture of BigSim Simulator
        [Diagram components:]
        • Charm++ and MPI applications
        • BigSim Emulator
        • Charm++ Runtime
        • Online PDES engine
        • Instruction Sim (RSim, IBM, ..)
        • Simple Network Model
        • Performance counters
        • Load Balancing Module
        • Simulation output trace logs
        • Offline PDES
        • BigNetSim (POSE) Network Simulator
        • Performance visualization (Projections)
    24. Big Network Simulation
        • Simulates network behavior: packetization, routing, contention, etc.
        • Integrated with post-mortem timestamp correction via POSE
        • Switches are connected in a torus network
        [Diagram labels: BGSIM Emulator, BG Log Files (tasks & dependencies), POSE Timestamp Correction, Timestamp-corrected Tasks, BigNetSim]
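For a torus, the distance between two switches wraps around in every dimension. The sketch below computes the minimal hop count between two coordinates. It assumes shortest-path routing, which the slide does not state, and the function name `torus_hops` is hypothetical. Hop count is only one input to a network model; packetization and contention are not captured here.

```python
def torus_hops(src, dst, dims):
    """Minimal number of hops between two nodes of a torus.

    src, dst: coordinate tuples; dims: size of the torus in each dimension.
    """
    hops = 0
    for s, d, size in zip(src, dst, dims):
        delta = abs(s - d)
        # Go whichever way around the ring is shorter.
        hops += min(delta, size - delta)
    return hops


near = torus_hops((0, 0, 0), (7, 0, 0), (8, 8, 8))   # wraps around: 1 hop
far = torus_hops((0, 0, 0), (4, 4, 4), (8, 8, 8))    # 4 hops per dimension
```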
    25. BigSim Validation on Lemieux
        • 32 real processors
    26. Jacobi on a 64K BG/L
    27. Case Study - LeanMD
        • Molecular dynamics simulation designed for large machines
        • K-away cut-off parallelization
          • Benchmark er-gre with 3-away
            • 36573 atoms
          • 1.6 million objects
          • 8 step simulation
          • 32k processor BG machine
        • Running on 400 PSC Lemieux processors
        [Figure label: performance visualization tools]
    28. Load Imbalance Histogram
    29. Performance of the BigSim
        [Figure axis label: real processors (PSC Lemieux)]
    30. Conclusions
        • Improved the simulation efficiency by taking advantage of the “inherent determinacy” of parallel applications
        • The simulation techniques explored show good parallel scalability
        • http://charm.cs.uiuc.edu
    31. Future Work
        • Improving simulation accuracy
          • Instruction level simulator
          • Network simulator
        • Developing run-time techniques (load balancing) for very large machines using the simulator
