Instrumenting Parsec's raytrace


(Check my blog @ )

This presentation covers the performance metrics and results obtained by running the raytrace application from the Parsec benchmark suite on UPC's Boada server.

Published in: Technology

  1. Instrumenting a benchmark application
     Tools and Measurements Techniques
     Project by Mário Almeida (EMDC)
     Barcelona, 25 April 2012
  2. Index (1/2): Tools and configuration
     ● Parsec
       ○ Overview
       ○ Benchmark programs
     ● Extrae
     ● Paraver
     ● Configuration
  3. Index (2/2): Measurements
     ● Raytrace
       ○ Overview
       ○ Code
       ○ Inputs
       ○ Traces
       ○ Load balancing
       ○ Cache misses and instructions
       ○ Execution time
       ○ Configuration comparisons
       ○ Extrae overhead
     Conclusions
  4. Tools and configuration
  5. Parsec: Overview
     ● Benchmark with the following characteristics:
       ○ Multithreaded
       ○ Emerging workloads
       ○ Diverse
       ○ Not HPC-focused
       ○ Research
  6. Parsec: Benchmark programs
     ● blackscholes
     ● bodytrack
     ● canneal
     ● dedup
     ● facesim
     ● ferret
     ● fluidanimate
     ● freqmine
     ● raytrace
     ● ...
  7. Extrae
     ● Instrumentation package for tracing programs that use the shared-memory and message-passing programming models.
  8. Paraver
     ● Detailed quantitative analysis of a program's performance.
     ● Concurrent comparative analysis of several traces.
     ● Support for mixed message passing and shared memory.
     ● Building of derived metrics.
  9. Configuration (1/4)
     Boada server:
     ● Two six-core CPUs with Hyper-Threading.
     ● Kills applications after a few minutes.
     ● 24 GB of RAM.
     ● cpulimit was used to limit CPU usage to at most four cores.
  10. Configuration (2/4)
      Installed and/or configured:
      ● Parsec 2.1 with the raytrace package only.
      ● Extrae 2.2.1.
      ● Paraver 4.3.0 (on my laptop).
      ● cpulimit.
      ● Minor changes to .bashrc.
      ● Multiple scripts to clean, build and run.
  11. Configuration (3/4)
  12. Configuration (4/4)
  13. Measurements
  14. Raytrace: Overview
      ● Physical simulation for visualization.
      ● Computer animation.
      ● Input is a complex object made of many triangles.
  15. Raytrace: Code
      For every pixel in the image
        calculate trajectory of ray striking pixel
        find closest intersection point of ray with scene geometry
        calculate contribution of all lights at intersection point
        recursively trace specularly reflected ray
      end for
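The per-pixel loop on the code slide can be sketched in a few lines of Python. This is only an illustration, not the Parsec implementation: it intersects rays with spheres rather than triangle meshes, ignores shadows, and the scene, `MAX_DEPTH`, `REFLECTIVITY` and all function names are assumptions introduced here.

```python
import math

# Hypothetical scene: (center, radius) spheres and point lights.
# Parsec's raytrace works on triangle meshes; spheres keep the sketch short.
SPHERES = [((0.0, 0.0, -3.0), 1.0), ((2.0, 0.0, -4.0), 1.0)]
LIGHTS = [(5.0, 5.0, 0.0)]
MAX_DEPTH = 2        # assumed recursion limit for reflected rays
REFLECTIVITY = 0.3   # assumed share of brightness taken from the reflection


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))


def closest_hit(origin, direction):
    """Return (distance, center) of the nearest sphere hit, or None."""
    best = None
    for center, radius in SPHERES:
        oc = sub(origin, center)
        b = 2.0 * dot(oc, direction)
        c = dot(oc, oc) - radius * radius
        disc = b * b - 4.0 * c  # direction is unit length, so a == 1
        if disc < 0.0:
            continue
        t = (-b - math.sqrt(disc)) / 2.0  # nearer root only
        if t > 1e-4 and (best is None or t < best[0]):
            best = (t, center)
    return best


def trace(origin, direction, depth=0):
    """Brightness in [0, 1] seen along one ray."""
    hit = closest_hit(origin, direction)
    if hit is None:
        return 0.0  # background
    t, center = hit
    point = add(origin, scale(direction, t))
    normal = normalize(sub(point, center))
    # Contribution of all lights at the intersection point (diffuse only).
    brightness = 0.0
    for light in LIGHTS:
        to_light = normalize(sub(light, point))
        brightness += max(0.0, dot(normal, to_light))
    # Recursively trace the specularly reflected ray.
    if depth < MAX_DEPTH:
        reflected = sub(direction, scale(normal, 2.0 * dot(direction, normal)))
        brightness = ((1.0 - REFLECTIVITY) * brightness
                      + REFLECTIVITY * trace(point, reflected, depth + 1))
    return min(1.0, brightness)


def render(width, height):
    """Trace one ray per pixel from a camera at the origin looking down -z."""
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            px = (2.0 * (x + 0.5) / width - 1.0) * (width / height)
            py = 1.0 - 2.0 * (y + 0.5) / height
            row.append(trace((0.0, 0.0, 0.0), normalize((px, py, -1.0))))
        image.append(row)
    return image
```

Calling `render(16, 9)` returns a 9-row grid of brightness values. Each pixel is computed independently of the others, which is what makes the render phase a natural candidate for multithreading.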
  16. Raytrace: Inputs
      ● simsmall: 1 million polygons (480x270)
      ● simmedium: 1 million polygons (960x540)
      ● simlarge: 1 million polygons (1920x1080)
      ● native: 10 million polygons (1920x1080)
  17. Raytrace: Trace (1/2)
      Only 10% of the execution time is parallel!
      Trace legend: Not created, Running.
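To see what a 10% parallel share implies for the whole program, one option is Amdahl's law. The slides do not name this model; it is brought in here as a standard approximation, and the function name is an assumption.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Whole-program speedup predicted by Amdahl's law.

    parallel_fraction: share of single-threaded run time that parallelizes.
    workers: number of threads or cores applied to that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Taking the 10% figure from the trace at face value:
for workers in (1, 4, 8, 64):
    print(workers, round(amdahl_speedup(0.10, workers), 3))
```

Under this model the whole-program speedup can never exceed 1 / 0.9, roughly 1.11x, however many threads are used. This is presumably why the execution-time slides report the parallel code only.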
  18. Raytrace: Trace (2/2)
      Render time is proportional to the number of frames!
      Trace phases: Init and adding object, Build Context, Render.
  19. Raytrace: Load balancing (1/2)
      Trace labels: Not created, Create Threads, Task, Barrier, Wait for all threads.
  20. Raytrace: Load balancing (2/2)
      Good load balancing between the slave threads.
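The slides judge load balance visually in Paraver. One common way to put a number on it, offered here as an assumption rather than the method used in the slides, is the ratio of the mean to the maximum per-thread busy time. The sample times below are hypothetical, not measurements from Boada.

```python
def load_balance(busy_times):
    """Mean busy time divided by the longest busy time.

    1.0 means every thread worked equally long; lower values mean some
    threads sat idle while waiting for the slowest one.
    """
    if not busy_times:
        raise ValueError("need at least one thread")
    longest = max(busy_times)
    if longest <= 0:
        raise ValueError("busy times must be positive")
    return (sum(busy_times) / len(busy_times)) / longest


# Hypothetical per-thread render times in seconds, for illustration only.
print(load_balance([1.9, 2.0, 2.0, 1.95]))
```

The per-thread times would come from the trace, for example from the duration of each thread's running state.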
  21. Raytrace: Cache and instructions
      Figure annotations: "High number of cache misses", "Very low number of cache misses".
      There were no significant differences in IPC between threads.
  22. Raytrace: Execution time (1/3)
      These are average times over multiple executions, covering the parallel code only and excluding Extrae overhead.
      There was a high average deviation of 0.3 seconds in the experiments.
      Measurements with bigger inputs were more accurate.
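A minimal sketch of how an average time and an average deviation could be computed from repeated runs. It assumes that "average deviation" means the mean absolute deviation from the mean, which the slides do not state, and `time_runs` is a hypothetical helper that times a Python callable rather than the actual benchmark scripts.

```python
import time


def time_runs(fn, repeats):
    """Run fn() `repeats` times and return each wall-clock duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def mean(values):
    if not values:
        raise ValueError("need at least one sample")
    return sum(values) / len(values)


def average_deviation(values):
    """Mean absolute deviation of the samples from their mean."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)


# Stand-in workload; the real measurements timed the raytrace parallel region.
samples = time_runs(lambda: sum(range(100_000)), repeats=5)
print(mean(samples), average_deviation(samples))
```

A fixed deviation matters less as runs get longer: 0.3 seconds is a large share of a run that lasts a couple of seconds and a small share of one that lasts minutes.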
  23. Raytrace: Execution time (2/3)
      There was a smaller average deviation of 0.03 seconds.
      With 64 threads it runs almost three times faster!
  24. Raytrace: Execution time (3/3)
      There was an even smaller average deviation of 0.02 seconds.
      With 64 threads it runs almost three times faster!
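Speedup figures like the one above are easier to compare when paired with parallel efficiency. The sketch below assumes the baseline is the single-thread run of the same parallel region, which the slides do not spell out, and the times are hypothetical values chosen only to give a 3x speedup.

```python
def speedup(baseline_seconds, parallel_seconds):
    """How many times faster the parallel run is than the baseline."""
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("times must be positive")
    return baseline_seconds / parallel_seconds


def efficiency(baseline_seconds, parallel_seconds, threads):
    """Speedup per thread; 1.0 would be perfect linear scaling."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup(baseline_seconds, parallel_seconds) / threads


# Hypothetical times, not measured values.
t1, t64 = 3.0, 1.0
print(speedup(t1, t64), efficiency(t1, t64, 64))
```

A 3x speedup on 64 threads corresponds to an efficiency of 3/64, a little under 5%, so most of the added threads contribute very little.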
  25. Raytrace: Configuration comparison
      In the limited configuration, performance doesn't seem to degrade, but the execution time seems to stabilize beyond 8 threads.
  26. Raytrace: Extrae overhead
  27. Conclusions
  28. Conclusions (1/3)
      ● The system seemed to perform worse when the number of threads was a multiple of the total number of physical cores.
      ● The program has good load balancing.
      ● Fine-grained parallelism.
  29. Conclusions (2/3)
      ● Although it wasn't possible to verify, increasing the input size should cause more cache misses, because the big working sets won't fit in memory.
      ● Memory bandwidth should be the main obstacle to good speedups.
      ● Boada killed almost all the native input executions.
  30. Conclusions (3/3)
      ● Paraver simplifies the process of analyzing an application's performance.
      ● Better knowledge of the system's architecture would be needed in order to further analyse the performance of the application.
  31. Questions