Advanced Modular Software Performance Monitoring

CPU profiling with Intel® VTune™ Amplifier XE

Published in: Technology

  1. Advanced Modular Software Performance Monitoring: CPU profiling with Intel® VTune™ Amplifier XE. Alexander Mazurov, Ferrara University, CERN
  2. I. Event Processing Software
     II. Profilers
     III. Intel® VTune™ Amplifier XE
     IV. Gaudi Framework
     V. Gaudi Intel Profiler Auditor
     VI. Profiling examples
  3. I. Event Processing Software. Physics events: the Higgs boson. Simulation * Trigger * Analysis
  4. LHCb High Level Trigger (HLT) software: Moore. Detector events: 10^6 events/sec. Events to storage: 4500 events/sec
  5. II. Profilers: collect information about how an application or system performs.
  6. CPU profiler: measures the frequency and duration of function calls and/or code instructions.
  7. Profiling techniques:
     - Hardware counters
     - Instrumenting the code
  8. Hardware counters: exploit hardware performance counters from the Performance Monitoring Unit (PMU). Counters:
     - Translation lookaside buffer (TLB) misses
     - Cache misses
     - Stall cycles
     - Memory access latency
     - ...
     Perfmon2 * Intel VTune Amplifier
  9. Instrumenting the code:
     - Statically:
       * Change code manually / automatically
       * Compiler assisted (gcc -pg)
     - Dynamically (at runtime):
       * Change code at runtime
     - Valgrind
     - Google Performance Tools
     - Intel VTune Amplifier
  10. III. VTune™ Amplifier XE: performance profiling tool
      - x86 (32 and 64-bit)
      - GUI and CLI
  11. VTune™ features: runtime instrumenting profiler
      - User-mode sampling
      - Hardware-based sampling
      - Concurrency and locks-and-waits analysis
      - Threading timeline
      - Attach to a running process
      - Source view
  12. How user-mode sampling works:
      1) Interrupts the process
      2) Collects samples of all active instruction addresses
      3) Restores the call sequence for each sample
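
The three steps above can be illustrated with a minimal sketch of a sampling profiler. It is not how VTune is implemented: instead of interrupting a process, it assumes a Python worker thread and inspects it from a second thread through `sys._current_frames()`. The names `profile` and `hot_loop` are hypothetical and exist only for this illustration.

```python
import collections
import sys
import threading
import time


def profile(target, interval=0.01):
    """Run target() in a worker thread and sample its call stack.

    Returns a Counter mapping call stacks (tuples of function names,
    outermost first) to the number of samples that observed them.
    """
    counts = collections.Counter()
    finished = threading.Event()

    def run():
        try:
            target()
        finally:
            finished.set()

    worker = threading.Thread(target=run)
    worker.start()
    # wait() returns False on timeout, so this loop wakes up once per interval.
    while not finished.wait(interval):
        # Step 2: look at the currently active frame of the worker thread.
        frame = sys._current_frames().get(worker.ident)
        # Step 3: walk the frame links back to restore the call sequence.
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            counts[tuple(reversed(stack))] += 1
    worker.join()
    return counts


def hot_loop(seconds=0.3):
    """Hypothetical CPU-bound workload used only for the demonstration."""
    end = time.perf_counter() + seconds
    total = 0
    while time.perf_counter() < end:
        total += 1
    return total


if __name__ == "__main__":
    for stack, hits in profile(hot_loop).most_common(3):
        print(hits, " > ".join(stack))
```

Functions that consume more CPU time show up in more samples, which is the basis of a hotspots report.
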
  13. User-mode analysis types:
      - Hotspots
      - Concurrency
      - Locks and Waits
  14. User-mode sampling: Hotspots analysis
  15. Group results
  16. Call Stack
  17. Filter by timeline
  18. CPU time by code line. Debug mode (-g)
  19. Sampling accuracy: user-mode sampling is a statistical method and does not provide 100% accurate results. Accuracy depends on:
      - Duration of the collection
      - Speed of the processor
      - Amount of software activity
      - Sampling interval
        * recommended value is 10 ms
        * the profiled application runs only 5% slower
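
To make the dependence on collection duration and sampling interval concrete, here is a rough estimate of the statistical error. It assumes that samples are independent, so the number of samples landing in a function follows a binomial distribution. That model is an assumption of this sketch and is not taken from the slides. The function name `relative_error` is hypothetical.

```python
import math


def relative_error(duration_s, interval_s, fraction):
    """Relative standard error of a function's measured share of CPU time.

    duration_s: length of the collection in seconds
    interval_s: sampling interval in seconds (the slides recommend 0.01)
    fraction:   true share of CPU time spent in the function (0 < fraction <= 1)
    """
    n = duration_s / interval_s  # total number of samples taken
    # Binomial model: the variance of the estimated share is p * (1 - p) / n.
    return math.sqrt((1.0 - fraction) / (n * fraction))


if __name__ == "__main__":
    # A function using 1% of CPU time, sampled every 10 ms:
    print(relative_error(10, 0.01, 0.01))   # roughly 0.31 after 10 s
    print(relative_error(100, 0.01, 0.01))  # roughly 0.10 after 100 s
```

Under this model the error shrinks with the square root of the number of samples, so small hotspots need long collections before their measured share becomes trustworthy.
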
  20. Integrating VTune™ Amplifier into an Event Processing Framework
  21. IV. Gaudi: event processing framework
      - Moore: Trigger
      - Gauss: Simulation
      - Brunel: Reconstruction
      - Online: Monitoring and commissioning
      - DaVinci: Physics analysis
  22. Gaudi Architecture: Algorithms * Services * Tools
  23. Moore Event Loop. Sequences and algorithms: Hlt1DiMuonHighMassFilterSequence, Hlt1DiMuonHighMassStreamer, FastVeloHlt, MuonRec, Velo2CandidatesDiMuonHighMass, GECLooseUnit, createITLiteClusters, createVeloLiteClusters. How to profile algorithms?
  24. V. Gaudi Intel Profiling Auditor: VTune™ User API + Gaudi Auditors API
  25. VTune™ User API:
      - Start/Pause profiling
      - Mark profiling regions
  26. Gaudi Auditors API: callback functions invoked on an algorithm's start event and end event
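
The combination of the two APIs can be sketched as follows. This is an illustration only: `RecordingProfiler`, `ProfilerAuditor`, and `run_event` are hypothetical names, and the profiler class is a stand-in that merely logs calls. It is neither the VTune User API nor the Gaudi Auditors API. The point is the pattern: the framework fires a callback before and after each algorithm, and the auditor forwards those callbacks to the profiler as region markers.

```python
class RecordingProfiler:
    """Hypothetical stand-in for a profiler user API; it only logs calls."""

    def __init__(self):
        self.log = []

    def resume(self):
        self.log.append("resume")

    def pause(self):
        self.log.append("pause")

    def region_begin(self, name):
        self.log.append("begin " + name)

    def region_end(self, name):
        self.log.append("end " + name)


class ProfilerAuditor:
    """Hypothetical auditor: callbacks fired around each algorithm."""

    def __init__(self, profiler):
        self.profiler = profiler

    def before_execute(self, alg_name):
        self.profiler.region_begin(alg_name)

    def after_execute(self, alg_name):
        self.profiler.region_end(alg_name)


def run_event(algorithms, auditor):
    """Toy event loop: run each (name, callable) pair with auditor callbacks."""
    auditor.profiler.resume()
    try:
        for name, func in algorithms:
            auditor.before_execute(name)
            try:
                func()
            finally:
                # The end callback fires even if the algorithm raises.
                auditor.after_execute(name)
    finally:
        auditor.profiler.pause()


if __name__ == "__main__":
    profiler = RecordingProfiler()
    run_event([("AlgA", lambda: None), ("AlgB", lambda: None)],
              ProfilerAuditor(profiler))
    print(profiler.log)
```

Because the regions carry algorithm names, the collected CPU time can later be grouped per algorithm without modifying the algorithms themselves.
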
  27. Algorithms profiling (I): CPU time per sequence branch
  28. Algorithms profiling (II)
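  29. Gaudi configuration:
        from Configurables import AuditorSvc, IntelProfilerAuditor
        profiler = IntelProfilerAuditor()
        # Restrict profiling to a window of events
        profiler.StartFromEventN = 5000
        profiler.StopAtEventN = 15000
        # Register the auditor with the auditor service
        AuditorSvc().Auditors += [profiler]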
  30. Run:
        $> intelprofiler -o /collected/data job.py
      Analyze (GUI):
        $> amplxe-gui /collected/data/r001hs
      Analyze (CLI):
        $> amplxe-cl -report hotspots -r /collected/data/r001hs
  31. VI. Profiling examples:
      1. Memory allocation functions
      2. Measuring profiling accuracy
      3. Custom reports
  32. 1. Memory allocation functions: operator new from the libstdc++ library compared with tc_new from the tcmalloc library. tc_new takes half the time of operator new.
  33. 2. Measuring profiling accuracy: Intel Profiling Auditor vs. Timing Auditor (which measures the absolute time of algorithm runs), 1000 events
  34. 3. Custom reports: build reports using CSV files exported from VTune Amplifier
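
A custom report of this kind could be built along the following lines. This is a minimal sketch: the column names `Function`, `Module`, and `CPU Time` and all rows of the sample data are made up for illustration, and the layout of a real VTune export may differ. The function name `cpu_time_by` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def cpu_time_by(csv_file, key="Module", value="CPU Time"):
    """Sum the `value` column per distinct `key` and sort, largest first.

    The default column names are assumptions, not a documented export format.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row[key]] += float(row[value])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data standing in for an exported CSV file.
SAMPLE = """Function,Module,CPU Time
decode,libA.so,1.5
fit,libB.so,3.0
cluster,libA.so,0.5
match,libC.so,0.25
"""

if __name__ == "__main__":
    for module, seconds in cpu_time_by(io.StringIO(SAMPLE)):
        print(f"{module:10s} {seconds:6.2f} s")
```

Passing `key="Function"` groups the same data per function instead of per module.
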
  35. Conclusions. Intel® VTune™ Amplifier XE:
      + Various analysis types and reports
      + Rich User API
      + Reasonable overhead
