Hybrid CPU GPU MATLAB Image Processing Benchmarking

An attempt to quantify the substantial performance improvement observed on Windows 8.1 with an NVIDIA GTX 780M and Intel HD 4600 using the latest NVIDIA driver (326.01). The results may help other users, particularly those of the MATLAB Image Processing and Parallel Computing Toolboxes, to consider upgrading...

Published in: Technology

  1. Upgrading to Haswell Architecture and Windows 8.1: Benchmarking Hybrid (CPU-GPU) Parallel Processing (CUDA)-Enabled MATLAB (2013a and 2013b) Image Processing Algorithms on GTX TITAN and GTX 780M. Dimitris Vayenas, Postgraduate Student, Department of Computer Science, University of Oxford, and Software Incubator, Isis Innovation Ltd.
  2. Contents
     • Introduction
     • A “Real-Life” Hybrid (CPU-GPU) Algorithm
     • Hardware and Software of Testing
     • Performance
     • Comparison
     • Enter MATLAB 2013b (Prerelease)
     • Conclusion
     • Footnote on Feedback
     • Acknowledgements
  3. Introduction
     • Addressing the question: is it worth upgrading from an Ivy Bridge to a Haswell architecture in order to improve performance?
     • Intel claims that the new HD 4600 integrated graphics core inside the 4th-generation Intel i7 processors can increase performance over the previous architecture by up to 7 times.
     • What kind of performance improvements can we look forward to in “real-life” examples?
     • What other conditions influence performance?
  4. A “Real-Life” Hybrid Algorithm (1/2)
     • Hybrid: executes on both the CPU and the GPU.
     • Consider an image processing algorithm implemented in MATLAB that executes a sequence of functions (listed in a diagram on the original slide). The tasks shown in black are performed on the GPU and the tasks shown in red are performed on the CPU.
     • In this example the data has to travel between the CPU and the GPU 6 times.
     • Addition 4/7/2013: please refer to slides 16-17 about the GPU support in MATLAB 2013b!
  5. A “Real-Life” Hybrid Algorithm (2/2)
     • We are faced with the pronounced overhead of transferring data to and from the GPU, while the performance of the CPU also plays a significant role. This consideration is usually ignored by performance benchmarks, most of which test either the GPU or the CPU but rarely both.
     • Although it is preferable to run all tasks on the GPU, the current version of MATLAB does not yet support these functions in the Parallel Computing Toolbox. It is expected that the new version in September 2013 will support more functions.
     • As we will see, the NVIDIA drivers have a substantial impact on performance.
     • Method: we apply an array of parameters to the same image (of dimensions 1099×1599 pixels), effectively processing it 400 times with the MATLAB Profiler activated. The processing time is printed for each processed image. The runs were repeated three times for each configuration.
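The method on slide 5 (one image, a grid of parameter combinations, three repeated runs per configuration) can be sketched as a small timing harness. This is a minimal illustration in Python rather than the author's MATLAB code: `process_image` is a hypothetical stand-in for the real pipeline, and the parameter values in the grid are placeholders, since the slides do not list the full set of 400 combinations.

```python
import itertools
import time


def process_image(mag, fudge, sigma, hsize):
    """Hypothetical stand-in for the real image pipeline (not the author's code)."""
    total = 0
    # Work grows with mag**2, mimicking a larger magnified image.
    for i in range(2000 * mag * mag):
        total += i
    return total


def benchmark(param_grid, repeats=3):
    """Run the whole grid `repeats` times and collect per-combination timings."""
    timings = {params: [] for params in param_grid}
    for _ in range(repeats):
        for params in param_grid:
            start = time.perf_counter()
            process_image(*params)
            timings[params].append(time.perf_counter() - start)
    return timings


# Placeholder grid: the real test used 400 combinations of these four parameters.
grid = list(itertools.product([1, 3], [0.2, 0.6], [0.8], [1, 41]))
timings = benchmark(grid)

for (mag, fudge, sigma, hsize), runs in timings.items():
    label = f"Mag_{mag}_FF_{fudge}_S_{sigma}_HS_{hsize}"
    print(f"{label}: mean {sum(runs) / len(runs):.6f} s over {len(runs)} runs")
```

Repeating the full sweep, rather than timing each combination three times in a row, mirrors the description of three complete runs per configuration and spreads any warm-up effect across the whole grid.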
  6. Hardware and Software of Testing
     • System I: SCAN UK Workstation with NVIDIA GTX TITAN, Intel i7 3770K @ 4.5 GHz, 32 GB RAM @ 2133 MHz, SSD with over 500 MB/s read and write
       • A) OS: Windows Server 2012 Datacentre Edition; NVIDIA driver: 320.49
       • B) OS: Windows 8.1; NVIDIA driver: 326.01
     • System II: Schenker W503 with NVIDIA GTX 780M, Intel i7 4800MQ @ 3.5 GHz, 16 GB RAM @ 1600 MHz, SSD with over 500 MB/s read and write
       • A) OS: Windows Server 2012 Datacentre Edition; NVIDIA driver: 320.49
       • B) OS: Windows 8.1; NVIDIA driver: 326.01
  7. Performance (total runtimes in seconds)

     | Task (# of runs per test / executed on CPU or GPU) | System I (a): TITAN on WinSrv 2012 | System I (b): TITAN on Win 8.1 | System II (a): 780M on WinSrv 2012 | System II (b): 780M on Win 8.1 |
     |---|---|---|---|---|
     | Total Processing Time | 5296.33 | 2439.00 | 4051.67 | 3246.00 |
     | Edge (800/CPU) | 2027.02 | 880.47 | 1572.31 | 1261.87 |
     | Regionprops (400/CPU) | 1081.01 | 513.60 | 808.79 | 646.88 |
     | Imfilter (1600/GPU) | 383.01 | 191.83 | 316.36 | 263.57 |
     | Imresize (1200/CPU) | 388.95 | 167.28 | 259.31 | 199.59 |
     | Padarray (2000/CPU) | 225.76 | 112.32 | 184.10 | 149.07 |
  8. Performance (total run times). [Bar chart: Total Processing Time in seconds. TITAN on WinSrv 2012: 5296.33; TITAN on Win 8.1: 2439; 780M on WinSrv 2012: 4051.67; 780M on Win 8.1: 3246.]
  9. Performance (total run times). [Bar chart: indicative sub-task time totals in seconds (less is better) for Edge (800/CPU), Regionprops (400/CPU), Imfilter (1600/GPU), Imresize (1200/CPU) and Padarray (2000/CPU) on the four configurations; the plotted values are those listed on slide 7.] The (same) image is processed 400 times, but the same function can operate 2–5 times per image.
  10. Performance (indicative times to process the image, in seconds). Parameters: Magnification, Fudge Factor, Sigma and HSize.

      | Image parameters | Pixels | System I (a) | System I (b) | System II (a) | System II (b) |
      |---|---|---|---|---|---|
      | Mag_1_FF_0.2_S_0.2_HS_1 | 1,757,301 | 0.37 | 0.14 | 0.47 | 0.18 |
      | Mag_1_FF_0.6_S_0.6_HS_61 | 1,757,301 | 0.47 | 0.24 | 0.66 | 0.39 |
      | Mag_1_FF_1_S_0.8_HS_1 | 1,757,301 | 4.49 | 1.77 | 2.02 | 0.50 |
      | Mag_3_FF_0.4_S_0.8_HS_41 | 15,815,709 | 3.07 | 1.14 | 2.76 | 1.46 |
      | Mag_5_FF_0.4_S_0.8_HS_41 | 43,932,525 | 7.26 | 3.16 | 5.38 | 3.95 |
      | Mag_7_FF_0.4_S_0.8_HS_41 | 86,107,749 | 13.79 | 6.27 | 9.30 | 8.43 |
      | Mag_9_FF_0.4_S_0.8_HS_41 | 142,341,381 | 22.90 | 10.76 | 17.49 | 14.89 |
      | Mag_9_FF_1_S_0.8_HS_41 | 142,341,381 | 21.75 | 10.81 | 16.72 | 15.58 |
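The pixel counts on slide 10 are consistent with the magnification factor scaling both dimensions of the 1099×1599 source image. The slides do not state this explicitly, so the short Python check below treats it as an assumption and simply confirms that it reproduces the listed figures.

```python
# Source image dimensions as given on slide 5.
BASE_DIMS = (1099, 1599)


def pixel_count(magnification):
    """Pixels after scaling both dimensions by `magnification` (assumed behaviour)."""
    side_a, side_b = BASE_DIMS
    return (side_a * magnification) * (side_b * magnification)


pixel_counts = {mag: pixel_count(mag) for mag in (1, 3, 5, 7, 9)}

for mag, count in pixel_counts.items():
    print(f"Mag {mag}: {count:,} pixels")
```

Because the count grows with the square of the magnification, the Mag 9 images carry 81 times as many pixels as the original, which is why the largest cases dominate the total runtime.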
  11. Performance (indicative times to process the image). Parameters: Magnification, Fudge Factor, Sigma and HSize. [Bar chart: execution time in seconds to process the image with specific parameters, for TITAN on WinSrv 2012, TITAN on Win 8.1, 780M on WinSrv 2012 and 780M on Win 8.1; the plotted values are those listed on slide 10.]
  12. Performance Comparison (% change in runtimes)

      | Task | System II (a) vs. System I (a) | System II (b) vs. System I (b) | System I (b) vs. System I (a) | System II (b) vs. System II (a) |
      |---|---|---|---|---|
      | Total Execution Time | 23.501 | -33.087 | 53.949 | 19.885 |
      | Edge (800/CPU) | 22.433 | -43.318 | 56.564 | 19.744 |
      | Regionprops (400/CPU) | 25.182 | -25.952 | 52.489 | 20.018 |
      | Imfilter (1600/GPU) | 17.400 | -37.395 | 49.913 | 16.686 |
      | Imresize (1200/CPU) | 33.332 | -19.320 | 56.993 | 23.028 |
      | Padarray (2000/CPU) | 18.454 | -32.713 | 50.247 | 19.030 |
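The slides do not spell out how the percentages on slide 12 were computed. The figures are consistent with a reduction in runtime relative to the second-named system, (baseline − new) / baseline × 100, so a positive value means the first-named system is faster. The Python sketch below applies that inferred formula to the total runtimes from slide 7. Small differences in the last digit come from the rounded inputs.

```python
def percent_change(baseline_s, new_s):
    """Runtime reduction of `new_s` relative to `baseline_s`, in percent.

    Positive means the new configuration is faster. The formula is inferred
    from the published figures, not stated in the slides.
    """
    return (baseline_s - new_s) / baseline_s * 100.0


# Total processing times in seconds, from slide 7.
totals = {
    "System I (a)": 5296.33,
    "System I (b)": 2439.00,
    "System II (a)": 4051.67,
    "System II (b)": 3246.00,
}

# (new, baseline) pairs in the column order of slide 12.
comparisons = [
    ("System II (a)", "System I (a)"),
    ("System II (b)", "System I (b)"),
    ("System I (b)", "System I (a)"),
    ("System II (b)", "System II (a)"),
]

changes = {
    (new, base): percent_change(totals[base], totals[new])
    for new, base in comparisons
}

for (new, base), pct in changes.items():
    print(f"{new} vs. {base}: {pct:.3f}%")
```

Read this way, the negative second column says that on Windows 8.1 the 780M laptop is slower than the TITAN workstation, while the third column shows the TITAN roughly halving its runtime after the operating system and driver change.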
  13. Performance Comparison (% change in runtimes). [Bar chart: % change for Total Execution Time, Edge (800/CPU), Regionprops (400/CPU), Imfilter (1600/GPU), Imresize (1200/CPU) and Padarray (2000/CPU), with one series per comparison: System II (a) vs. System I (a), System II (b) vs. System I (b), System I (b) vs. System I (a), System II (b) vs. System II (a); the plotted values are those listed on slide 12.]
  14. % Change in performance based on the time to perform parametric processing. Parameters: Magnification, Fudge Factor, Sigma and HSize.

      | Image parameters | System II (a) vs. System I (a) | System II (b) vs. System I (b) | System I (b) vs. System I (a) | System II (b) vs. System II (a) |
      |---|---|---|---|---|
      | Mag_1_FF_0.2_S_0.2_HS_1 | -28.036 | -35.424 | 62.991 | 60.856 |
      | Mag_1_FF_0.6_S_0.6_HS_61 | -41.357 | -61.181 | 48.262 | 41.006 |
      | Mag_1_FF_1_S_0.8_HS_1 | 54.993 | 71.937 | 60.621 | 75.447 |
      | Mag_3_FF_0.4_S_0.8_HS_41 | 10.061 | -27.943 | 62.916 | 47.246 |
      | Mag_5_FF_0.4_S_0.8_HS_41 | 25.996 | -25.057 | 56.562 | 26.595 |
      | Mag_7_FF_0.4_S_0.8_HS_41 | 32.596 | -34.632 | 54.570 | 9.259 |
      | Mag_9_FF_0.4_S_0.8_HS_41 | 23.624 | -38.327 | 53.004 | 14.884 |
      | Mag_9_FF_1_S_0.8_HS_41 | 23.125 | -44.185 | 50.315 | 6.812 |
  15. % Change in performance based on the time to perform parametric processing. Parameters: Magnification, Fudge Factor, Sigma and HSize. [Bar chart: % change per parameter set on an axis from -80 to 100, with one series per comparison: System II (a) vs. System I (a), System II (b) vs. System I (b), System I (b) vs. System I (a), System II (b) vs. System II (a); the plotted values are those listed on slide 14.]
  16. Enter MATLAB 2013b. On 1 July 2013, MathWorks made the pre-release version of 2013b available to eligible users. Please note the following improvements in GPU support over the previous version:

      | GPU support for | MATLAB 2013a | MATLAB 2013b |
      |---|---|---|
      | fspecial | NO | YES |
      | rgb2gray | NO | YES |
      | edge | NO | YES |
      | Imresize | NO | YES |
      | Imfilter | NO | YES |
      | cat | NO | YES |
      | labelmatrix | NO | YES |
  17. Performance (indicative times to process the image and % change). Parameters: Magnification, Fudge Factor, Sigma and HSize, on MATLAB 2013b vs. MATLAB 2013a on the TITAN / Windows 8.1 system (*).

      | Image processing parameters | MATLAB 2013a (s) | MATLAB 2013b (s) | % improvement in 2013b |
      |---|---|---|---|
      | Mag_1_FF_0.2_S_0.2_HS_1 | 0.1364 | 0.0860 | 36.9163 |
      | Mag_1_FF_0.6_S_0.6_HS_61 | 0.2408 | 0.1871 | 22.3088 |
      | Mag_1_FF_1_S_0.8_HS_1 | 1.7667 | 0.3202 | 81.8779 |
      | Mag_3_FF_0.4_S_0.8_HS_41 | 1.1386 | 0.6810 | 40.1874 |
      | Mag_5_FF_0.4_S_0.8_HS_41 | 3.1550 | 1.7387 | 44.8923 |
      | Mag_7_FF_0.4_S_0.8_HS_41 | 6.2651 | 3.2429 | 48.2387 |

      (*) Please note that the performance improvements in the final version of MATLAB 2013b in September are expected to be even higher, for the following reasons:
      1. The gpuArray(Imresize) function was not fully utilised; it was successfully employed in just one of the three instances used in the algorithm. I am awaiting documentation from MathWorks in order to provide suitable inputs.
      2. Comparative results with the previous tests cannot be given because of a limitation I discovered in the morphmexgpu module, which prevents execution on the GPU when the array has over 2^27 (134,217,728) elements. Note that the original image was 1099×1599, so at ×9 magnification we have 142,341,381 pixels!

      Preliminary results for magnifications up to ×5 (i.e. 240 processes as opposed to 400 in the original test) show the total time for edge detection dropping from 186.45 to just 4.65 seconds (×40 improvement!) and for Imfilter from 40.905 seconds to 1.33 seconds (×30 improvement!).
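The arithmetic behind the two notes on slide 17 can be checked directly. The Python sketch below assumes that magnification scales both image dimensions (which matches the pixel counts on slide 10) and takes the reported 2^27 figure as an upper bound on element count. It says nothing about how morphmexgpu enforces that limit internally.

```python
# Element limit reported for the morphmexgpu module on slide 17.
GPU_ELEMENT_LIMIT = 2 ** 27  # 134,217,728

BASE_DIMS = (1099, 1599)  # original image size in pixels


def elements_at(magnification):
    """Element count of the magnified image (assumes both dimensions scale)."""
    return BASE_DIMS[0] * magnification * BASE_DIMS[1] * magnification


# True when the magnified image stays within the reported limit.
within_limit = {
    mag: elements_at(mag) <= GPU_ELEMENT_LIMIT for mag in (1, 3, 5, 7, 9)
}

# Speed-up factors from the preliminary totals (seconds) quoted on slide 17.
edge_speedup = 186.45 / 4.65
imfilter_speedup = 40.905 / 1.33

for mag, ok in within_limit.items():
    status = "within limit" if ok else "exceeds limit"
    print(f"Mag {mag}: {elements_at(mag):,} elements, {status}")
print(f"Edge speed-up: x{edge_speedup:.1f}")
print(f"Imfilter speed-up: x{imfilter_speedup:.1f}")
```

Under these assumptions only the ×9 magnification crosses the limit, which is consistent with the 2013b table stopping at Mag 7.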
  18. Performance (indicative times to process the image). Parameters: Magnification, Fudge Factor, Sigma and HSize, on MATLAB 2013b vs. MATLAB 2013a on the TITAN / Windows 8.1 system. [Bar chart: time in seconds (axis 0 to 7) per parameter set from Mag_1_FF_0.2_S_0.2_HS_1 to Mag_7_FF_0.4_S_0.8_HS_41, with one series for MATLAB 2013a and one for MATLAB 2013b.]
  19. A Footnote on Feedback. Admittedly, all of us were surprised by the relatively poor performance of the GTX TITAN on Windows Server 2012 with the 320.49 driver. For your reference, here are the details of the three measurements on this particular configuration (times in seconds).

      | Task | Test 1 | Test 2 | Test 3 | Average |
      |---|---|---|---|---|
      | Total Time | 6220 | 5085 | 4584 | 5296.333 |
      | Edge (800/CPU) | 2376.47 | 1984.338 | 1720.265 | 2027.024 |
      | Regionprops (400/CPU) | 1234.067 | 1052.34 | 956.622 | 1081.01 |
      | Imfilter (1600/GPU) | 446.796 | 363.177 | 339.045 | 383.006 |
      | Imresize (1200/CPU) | 463.96 | 364.319 | 338.574 | 388.951 |
      | Padarray (2000/CPU) | 264.433 | 208.12 | 204.734 | 225.7623 |

      | Image parameters | Test 1 | Test 2 | Test 3 | Average |
      |---|---|---|---|---|
      | Mag_1_FF_0.2_S_0.2_HS_1 | 0.34793 | 0.36036 | 0.39699 | 0.368427 |
      | Mag_1_FF_0.6_S_0.6_HS_61 | 0.44981 | 0.47965 | 0.46689 | 0.46545 |
      | Mag_1_FF_1_S_0.8_HS_1 | 1.0416 | 1.0138 | 11.4042 | 4.486533 |
      | Mag_3_FF_0.4_S_0.8_HS_41 | 3.3094 | 2.7042 | 3.1976 | 3.0704 |
      | Mag_5_FF_0.4_S_0.8_HS_41 | 8.5507 | 7.5295 | 5.7096 | 7.263267 |
      | Mag_7_FF_0.4_S_0.8_HS_41 | 18.5439 | 13.666 | 9.1622 | 13.7907 |
      | Mag_9_FF_0.4_S_0.8_HS_41 | 28.7419 | 25.4112 | 14.5562 | 22.9031 |
      | Mag_9_FF_1_S_0.8_HS_41 | 18.8778 | 17.5202 | 28.8458 | 21.74793 |

      Please note that in the best-performing Test 3, the time to process “Mag_1_FF_1_S_0.8_HS_1” is indeed correct (11.4042 s); given its poor correlation with the results in Tests 1 and 2, it is a testament to the unpredictable behaviour of the implementation.
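The averages reported on slide 19 are plain arithmetic means of the three tests, so a single outlier such as the 11.4042 s measurement pulls the reported figure well away from the other two runs. The Python sketch below reproduces two of the averages and prints the median alongside for comparison. The median is shown only as an illustration of the outlier's effect and is not a statistic the author used.

```python
from statistics import mean, median

# Three measurements per row, in seconds, copied from slide 19
# (GTX TITAN on Windows Server 2012 with the 320.49 driver).
runs = {
    "Total Time": [6220, 5085, 4584],
    "Mag_1_FF_1_S_0.8_HS_1": [1.0416, 1.0138, 11.4042],
}

summary = {
    name: {"mean": mean(values), "median": median(values)}
    for name, values in runs.items()
}

for name, stats in summary.items():
    print(f"{name}: mean {stats['mean']:.6f} s, median {stats['median']:.4f} s")
```

For “Mag_1_FF_1_S_0.8_HS_1” the mean is about 4.49 s while two of the three runs took close to 1 s, which is worth bearing in mind when reading the 4.49 s entry for System I (a) on slide 10.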
  20. Conclusion
      • Measuring the performance of hybrid algorithms is similar to asking “how long is a piece of string?”. However, as manufacturers fine-tune and tweak their products in order to perform better in standard benchmarking tools, it is advisable to create your own application-related benchmarks.
      • The performance improvements due to the new architecture in Intel’s fourth-generation i7 family are considerable!
      • NVIDIA appears to offer improved support of its GTX 7*** Series on Windows 8.1, where we have seen an improvement of over 75% (for a specific set of parameters) and of up to 56% overall on identical hardware running Windows 8.1 with the 326.01 driver vs. Windows Server 2012 with the 320.49 driver.
      • Update your drivers and make full use of MATLAB 2013b!
  21. Acknowledgements. I would like to thank the following individuals for their help in measuring and optimising the performance of my MATLAB code, through their extensive knowledge of MATLAB and/or CUDA:
      • Dr. Mike Giles, Professor of Scientific Computing at the University of Oxford; resident expert for NVIDIA and MATLAB
      • Dr. James Lebak, Parallel Computing Software Engineer at MathWorks Boston HQ
      • Captain (USMC) John Roberts, Senior Principal GPGPU Software Engineer at BAE Systems, Inc. (and formerly of NVIDIA); John also heads the CUDA Vision Workbench project.

      Many thanks to XMG-Schenker for supporting my research effort through their generous sponsorship of my Schenker W503. Finally, I am thankful to all the viewers who recommended this presentation to others, which led to its unprecedented popularity, with over 800 views in the first two days.
