Scalable Parallel Performance Measurement with the Scalasca Toolset
Bernd Mohr, Jülich Supercomputing Centre (JSC)
Member of the Helmholtz Association
June 2013
Parallel Architectures: State of the Art
[Figure: hierarchical system architecture — nodes N0…Nk connected by a network or switch of routers; each node couples processors P0…Pn, memory, and accelerators A0…Am via an interconnect (SMP or NUMA); each processor Pi contains multiple cores with a private L1, shared L2, and L3 cache hierarchy]
Parallel Performance Challenges
• Current and future systems (will) consist of:
 complex configurations
 a huge number of components
 very likely heterogeneous hardware
• Deep software hierarchies of large, complex software components will be required to make use of such systems
• Sophisticated integrated performance measurement, analysis, and optimization capabilities will be required to operate such systems efficiently
• Tools that provide insight, not just numbers or charts, are needed!
“A picture is worth 1000 words…”
• “Real-world” example: an MPI ring program
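The ring program itself is not shown in the slides. As a hedged sketch of the communication pattern such a program exercises (a real version would use MPI_Send/MPI_Recv or mpi4py; the rank count and token value here are made up), the pattern can be simulated serially:

```python
# Serial simulation of an MPI ring's communication pattern:
# each rank forwards a token to its right neighbor until the
# token returns to rank 0. Illustrative only, not MPI code.

def ring_trip(nranks, token):
    """Pass `token` once around the ring, logging each (src, dest) hop."""
    hops = []
    rank = 0
    for _ in range(nranks):
        dest = (rank + 1) % nranks
        hops.append((rank, dest))  # rank "sends" the token to its neighbor
        rank = dest
    return hops, token             # token arrives back at rank 0 unchanged

hops, token = ring_trip(4, token=42)
print(hops)   # [(0, 1), (1, 2), (2, 3), (3, 0)]
```

Each hop in this simulation corresponds to one send/receive event pair that a tracing tool like Scalasca would record and visualize as a timeline.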
“What about 1000s of pictures?” (with 100s of menu options)
Example Automatic Analysis: Late Sender
Scalasca: Example MPI Patterns
[Figure: time/process diagrams of four wait-state patterns, built from ENTER, EXIT, SEND, RECV, and COLLEXIT events: (a) Late Sender, (b) Late Receiver, (c) Late Sender / Wrong Order, (d) Wait at N×N]
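As a toy illustration of the Late Sender pattern (not Scalasca's actual trace-search implementation), the waiting time can be computed from per-event timestamps: the receiver blocks in its receive from the moment it enters the call until the matching send begins. The event records below are hypothetical.

```python
# Toy illustration of the "Late Sender" wait-state pattern:
# a receiver enters MPI_Recv before the matching MPI_Send starts,
# so it loses (send_enter - recv_enter) seconds blocked.
# Timestamps are synthetic, not Scalasca's trace format.

def late_sender_wait(recv_enter, send_enter):
    """Waiting time lost by the receiver (0 if the sender was on time)."""
    return max(0.0, send_enter - recv_enter)

# Matched point-to-point events: (recv ENTER time, send ENTER time)
matches = [
    (1.0, 3.5),   # sender 2.5 s late -> Late Sender
    (4.0, 4.0),   # perfectly matched -> no wait
    (7.0, 6.2),   # sender early -> no receiver wait (Late Receiver case)
]

waits = [late_sender_wait(r, s) for r, s in matches]
print(waits)       # [2.5, 0.0, 0.0]
print(sum(waits))  # total waiting time attributed to Late Sender: 2.5
```

Scalasca's trace analyzer searches the full event trace for such patterns in parallel, then categorizes and ranks the waiting times it finds.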
The Scalasca Project
• Scalable Analysis of Large-Scale Applications
• Approach:
 Instrument C, C++, and Fortran parallel applications based on MPI, OpenMP, SHMEM, or hybrid combinations
 Option 1: scalable call-path profiling
 Option 2: scalable event trace analysis — collect event traces, search them for event patterns representing inefficiencies, then categorize and rank the inefficiencies found
• Supports MPI 2.2 (P2P, collectives, RMA, I/O) and OpenMP 3.0 (excl. nesting)
• http://www.scalasca.org/
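The two measurement options above map onto Scalasca's command-line workflow. A sketch for the 1.x series current in 2013 (program name, rank count, and archive name are illustrative, and details vary by release and MPI launcher):

```shell
# Instrument the application at compile/link time
scalasca -instrument mpicc -O2 -o ring ring.c

# Option 1: run with call-path profiling (runtime summary measurement)
scalasca -analyze mpiexec -np 4 ./ring

# Option 2: run with event tracing and automatic pattern search
scalasca -analyze -t mpiexec -np 4 ./ring

# Interactively examine the analysis report in the resulting archive
scalasca -examine epik_ring_4_sum
```

The three steps are also available as the shortcuts `skin`, `scan`, and `square` in Scalasca 1.x.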
Scalasca Root-Cause Analysis
• Root-cause analysis:
 Wait states are typically caused by load or communication imbalance earlier in the program
 Waiting time can also propagate (e.g., indirect waiting time)
 Enhanced performance analysis to find the root cause of wait states
• Approach:
 Distinguish between direct and indirect waiting time
 Identify call-path/process combinations that delay other processes and cause first-order waiting time
 Identify the original delay
[Figure: timeline of processes A, B, and C exchanging Send/Recv in routines foo and bar; a DELAY on process A causes a direct wait in B's Recv, which in turn causes an indirect wait in C's Recv]
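The A/B/C scenario in the figure can be mimicked with a toy model (synthetic timestamps, a simple chain of blocking receives; this is not Scalasca's delay-analysis algorithm): a delay on A makes B wait directly, and because B can only forward once A's message arrives, B's wait reappears on C as indirect waiting time.

```python
# Toy model of wait-state propagation along a chain of blocking receives.
# Process 0 ("A") is delayed by `delay` seconds; process 1 ("B") waits
# directly on A's send, and process 2 ("C") waits on B's forwarded send,
# so B's direct wait propagates to C as indirect waiting time.

def propagate(delay, ready_times):
    """ready_times[i]: when process i would be ready to send without delays.
    Returns per-process waiting time; process 0 carries the original delay."""
    waits = [0.0] * len(ready_times)
    send_time = ready_times[0] + delay               # A's send leaves late
    for i in range(1, len(ready_times)):
        waits[i] = max(0.0, send_time - ready_times[i])
        send_time = max(send_time, ready_times[i])   # i forwards when it can
    return waits

# A delayed by 2.0 s; B and C would each be ready at t = 1.0
print(propagate(2.0, [1.0, 1.0, 1.0]))  # [0.0, 2.0, 2.0]
```

Here B's 2.0 s is direct waiting time and C's 2.0 s is indirect: removing the single delay on A (the "delay cost") eliminates both, which is the insight the root-cause analysis is designed to surface.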
Scalasca Example: CESM Sea Ice Module — Direct Wait-Time Analysis
• Direct waits are caused by ranks processing areas near the north and south ice borders
Scalasca Example: CESM Sea Ice Module — Indirect Wait-Time Analysis
• Indirect waits occur for ranks processing warmer areas
Scalasca Example: CESM Sea Ice Module — Delay Costs Analysis
• Delays are NOT caused on ranks processing ice!
NEW: Scalasca on Intel MIC
• Example:
 TACC Stampede
 NAS BT-MZ code (MPI/OpenMP)
 8×16 CPU threads (2 MPI ranks/node)
 60×16 MIC threads (15 MPI ranks/MIC)
• Supported modes: host-only, MIC-only, symmetric
• Not yet supported: offload
Acknowledgements
• Scalasca team (JSC and GRS): Michael Knobloch, Bernd Mohr, Peter Philippen, Markus Geimer, Daniel Lorenz, Christian Rössel, David Böhme, Marc-André Hermanns, Pavel Saviankou, Marc Schlütter, Ilja Zhukov, Alexandre Strube, Brian Wylie, Felix Wolf, Anke Visser, Monika Lücke, Aamer Shah, Alexandru Calotoiu, Jie Jiang, Sergei Shudler, Guoyong Mao, Philipp Gschwandtner
• Sponsors
Questions?
• Check out http://www.scalasca.org
• Or contact us: firstname.lastname@example.org