The benefits of upgrading to Haswell Architecture
and Windows 8.1:
Benchmarking of Hybrid (CPU-GPU) Parallel
Processing (CUDA)-enabled MATLAB Image
Processing Algorithms on GTX TITAN and GTX 780M
DIMITRIS VAYENAS, POSTGRADUATE STUDENT
DEPARTMENT OF COMPUTER SCIENCE @ THE UNIVERSITY OF OXFORD &
SOFTWARE INCUBATOR AT ISIS INNOVATION LTD.
Contents
 Introduction
 A “Real-Life” Hybrid (CPU-GPU) Algorithm
 Hardware and Software of Testing
 Performance
 Comparison
 Conclusion
 Acknowledgements
Introduction
 In this laboratory we are attempting to address the following question:
Is it worth upgrading from an Ivy Bridge to a Haswell architecture in order to
improve performance?
 Intel claims that its new HD 4600 Integrated Graphics Core in the 4th
Generation Intel i7 processors can increase performance over the previous
architecture by up to 7 times.
 What kind of performance improvements can we look forward to in “real-life
examples”, and under what conditions?
A “Real-Life” Hybrid Algorithm (1/2)
 Hybrid: executes on both the CPU and the GPU
Consider an algorithm implemented in MATLAB containing the following steps:
A “Real-Life” Hybrid Algorithm (2/2)
 In the hybrid algorithm the tasks in black are performed on the GPU while
the tasks in red are performed on the CPU.
 Thus, we have the usual overhead of transferring the data to and from the
GPU, and the performance of the CPU also plays a significant role; this
consideration is ignored by most graphics performance
benchmarks, which test either the GPU or the CPU, but not both.
 Ideally we would have liked to run all tasks on the GPU; however, the
current version of MATLAB does not yet support these functions on the
Parallel Processing Unit.
 As we will see, the NVIDIA drivers have a substantial impact on performance.
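To make the point about transfer overhead and CPU time concrete, here is a minimal Python sketch of how the time of a hybrid pipeline could be attributed to its parts. It is not the author's MATLAB code: the stage names, the `transfer`/`gpu`/`cpu` labels and the toy lambda functions are all hypothetical stand-ins, chosen only so the example runs anywhere. On real GPU code the device usually has to be synchronised before the timer is read, which this sketch does not model.

```python
import time


def time_stages(stages, data):
    """Run (name, device, func) stages in order and record each wall-clock time."""
    timings = []
    for name, device, func in stages:
        start = time.perf_counter()
        data = func(data)
        timings.append((name, device, time.perf_counter() - start))
    return data, timings


def totals_by_device(timings):
    """Sum the recorded stage times per device label."""
    totals = {}
    for _name, device, seconds in timings:
        totals[device] = totals.get(device, 0.0) + seconds
    return totals


# Hypothetical stand-in stages; a real pipeline would call GPU and CPU routines.
stages = [
    ("to_gpu", "transfer", lambda img: list(img)),          # copy to the device
    ("filter", "gpu", lambda img: [2 * p for p in img]),    # stand-in GPU step
    ("to_cpu", "transfer", lambda img: list(img)),          # copy back to the host
    ("edge", "cpu", lambda img: [abs(b - a) for a, b in zip(img, img[1:])]),
]

result, timings = time_stages(stages, [1, 4, 9, 16])
for device, seconds in totals_by_device(timings).items():
    print(f"{device:9s} {seconds:.6f} s")
```

Grouping by device label shows at a glance how much of the total is transfer or CPU work, which a GPU-only benchmark would not report.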
Hardware and Software of Testing
 System I:
  SCAN Workstation with NVIDIA GTX TITAN, Intel i7 3770K @ 4.5 GHz, 32 GB RAM @ 2133
MHz, SSD with over 500 MB/s read and write
OS: Windows Server 2012 Datacentre Edition
NVIDIA Driver: 320.49
 System II:
  Schenker W503 with NVIDIA GTX 780M, Intel i7 4800 @ 3.5 GHz, 16 GB RAM @ 1600
MHz, SSD with over 500 MB/s read and write
 A) OS: Windows Server 2012 Datacentre Edition
NVIDIA Driver: 320.49
 B) OS: Windows 8.1
NVIDIA Driver: 326.01
(Important Notice: Figures for System I on Windows 8.1 will be added here by
Wednesday 3/7/2013)
Performance (total runtimes)
(Results in seconds; less is better. Each task is labelled with the number of runs per test and where it ran, CPU or GPU.)

Task                    System I                  System II (a)             System II (b)
                        (TITAN on WinSrv 2012)    (780M on WinSrv 2012)     (780M on Win 8.1)
Edge (800/CPU)          1720.265                  1661.289                  1261.870
Regionprops (400/CPU)   956.622                   899.934                   646.883
Imfilter (1600/GPU)     339.045                   339.477                   263.572
Imresize (1200/CPU)     338.574                   295.782                   199.593
Padarray (2000/CPU)     204.734                   196.303                   149.067
Imfilter (1600/GPU)     126.362                   131.112                   101.717
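Because each task was run a different number of times, the totals are easier to compare across tasks as a mean time per run. The short Python sketch below does that division, on the assumption that each total is the sum over the stated number of runs; the labels "first row" and "second row" are only there to tell the two Imfilter rows apart.

```python
# Runs per test, then total seconds for System I, System II (a), System II (b),
# copied from the total run time table.
totals = {
    "Edge": (800, 1720.265, 1661.289, 1261.870),
    "Regionprops": (400, 956.622, 899.934, 646.883),
    "Imfilter (first row)": (1600, 339.045, 339.477, 263.572),
    "Imresize": (1200, 338.574, 295.782, 199.593),
    "Padarray": (2000, 204.734, 196.303, 149.067),
    "Imfilter (second row)": (1600, 126.362, 131.112, 101.717),
}


def mean_per_run(runs, total_seconds):
    """Mean seconds per run, assuming the total is the sum over all runs."""
    return total_seconds / runs


per_run = {
    task: tuple(mean_per_run(runs, t) for t in times)
    for task, (runs, *times) in totals.items()
}

for task, (sys1, sys2a, sys2b) in per_run.items():
    print(f"{task:22s} {sys1:8.4f} {sys2a:8.4f} {sys2b:8.4f}")
```

On this reading, a single Edge run on System I takes about 2.15 s, while a single Padarray run on System II (b) takes about 0.075 s.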
Performance (total run times)
[Bar chart: task time totals in seconds (less is better) for Edge (800), Regionprops (400), Imfilter (1600), Imresize (1200) and Padarray (2000); series: System I, System II (a), System II (b).]
Performance (Indicative times to process an image)
Parameters: Magnification, Fudge Factor, Sigma and HSize
(Results in seconds.)

Image Processing            System I    System II (a)    System II (b)
Mag_1_FF_0.2_S_0.2_HS_1     0.39699     0.49159          0.18465
Mag_1_FF_0.6_S_0.6_HS_61    0.46689     0.62617          0.38815
Mag_1_FF_1_S_0.8_HS_1       11.4042     8.1427           0.49579
Mag_3_FF_0.4_S_0.8_HS_41    3.1976      2.8881           1.4568
Mag_5_FF_0.4_S_0.8_HS_41    5.7096      4.4588           3.9456
Mag_7_FF_0.4_S_0.8_HS_41    9.1622      10.6905          8.4348
Mag_9_FF_0.4_S_0.8_HS_41    14.5562     17.9971          14.8889
Mag_9_FF_1_S_0.8_HS_41      28.8458     17.0872          15.5799
Performance (Indicative times to process an image)
Parameters: Magnification, Fudge Factor, Sigma and HSize
[Bar chart: execution time in seconds to process specific images, one group per parameter set; series: System I, System II (a), System II (b).]
Performance Comparison (total run times)
(Percentage change: reduction in total run time relative to the second-named system; positive means the first-named system was faster. Each task is labelled with the number of runs per test and where it ran, CPU or GPU.)

Task                    System II (a)    System II (b)        System II (b)
                        vs. System I     vs. System II (a)    vs. System I
Edge (800/CPU)          3.4              24.0                 26.6
Regionprops (400/CPU)   5.9              28.1                 32.4
Imfilter (1600/GPU)     -0.1             22.4                 22.3
Imresize (1200/CPU)     12.6             32.5                 41.0
Padarray (2000/CPU)     4.1              24.1                 27.2
Imfilter (1600/GPU)     -3.8             22.4                 19.5
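The percentages in this table can be reproduced from the total run times as a relative reduction in time, (baseline - candidate) / baseline × 100. The Python sketch below recomputes them; the function name `percent_reduction` is an illustrative choice, and the formula is inferred from the published numbers rather than stated in the slides.

```python
def percent_reduction(baseline, candidate):
    """Percentage by which `candidate` time is lower than `baseline` time."""
    return (baseline - candidate) / baseline * 100.0


# Task, then total seconds for System I, System II (a), System II (b).
rows = [
    ("Edge (800/CPU)", 1720.265, 1661.289, 1261.870),
    ("Regionprops (400/CPU)", 956.622, 899.934, 646.883),
    ("Imfilter (1600/GPU)", 339.045, 339.477, 263.572),
    ("Imresize (1200/CPU)", 338.574, 295.782, 199.593),
    ("Padarray (2000/CPU)", 204.734, 196.303, 149.067),
    ("Imfilter (1600/GPU)", 126.362, 131.112, 101.717),
]

comparison = []
for task, sys1, sys2a, sys2b in rows:
    comparison.append((
        task,
        round(percent_reduction(sys1, sys2a), 1),   # System II (a) vs. System I
        round(percent_reduction(sys2a, sys2b), 1),  # System II (b) vs. System II (a)
        round(percent_reduction(sys1, sys2b), 1),   # System II (b) vs. System I
    ))

for task, a_vs_1, b_vs_a, b_vs_1 in comparison:
    print(f"{task:22s} {a_vs_1:6.1f} {b_vs_a:6.1f} {b_vs_1:6.1f}")
```

A negative value, as in the two Imfilter rows of the first column, means the first-named system took longer than the baseline.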
Performance Comparison (total run times)
[Bar chart: percentage change in total run time per task; series: System II (a) vs. System I, System II (b) vs. System II (a), System II (b) vs. System I.]
Performance Comparison based on the time to process image
Parameters: Magnification, Fudge Factor, Sigma and HSize
(Percentage change: reduction in processing time relative to the second-named system; positive means the first-named system was faster.)

Image Processing            System II (a)    System II (b)        System II (b)
                            vs. System I     vs. System II (a)    vs. System I
Mag_1_FF_0.2_S_0.2_HS_1     -23.8            62.4                 53.5
Mag_1_FF_0.6_S_0.6_HS_61    -34.1            38.0                 16.9
Mag_1_FF_1_S_0.8_HS_1       28.6             93.9                 95.7
Mag_3_FF_0.4_S_0.8_HS_41    9.7              49.6                 54.4
Mag_5_FF_0.4_S_0.8_HS_41    21.9             11.5                 30.9
Mag_7_FF_0.4_S_0.8_HS_41    -16.7            21.1                 7.9
Mag_9_FF_0.4_S_0.8_HS_41    -23.6            17.3                 -2.3
Mag_9_FF_1_S_0.8_HS_41      40.8             8.8                  46.0
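A percentage reduction in time is not the same thing as a speedup factor, and large reductions hide large factors. The Python sketch below converts between the two, using the measured times for Mag_1_FF_1_S_0.8_HS_1 on System II (a) and System II (b); the function names are illustrative, not taken from the slides.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def speedup_from_reduction(percent):
    """Speedup factor implied by a percentage reduction in run time."""
    return 1.0 / (1.0 - percent / 100.0)


# Mag_1_FF_1_S_0.8_HS_1: System II (a) took 8.1427 s, System II (b) 0.49579 s.
measured = speedup(8.1427, 0.49579)
implied = speedup_from_reduction(93.9)
print(f"measured speedup: {measured:.1f}x, implied by 93.9%: {implied:.1f}x")

# Mag_9_FF_1_S_0.8_HS_41: an 8.8% reduction is only a modest factor.
print(f"implied by 8.8%: {speedup_from_reduction(8.8):.2f}x")
```

So the 93.9% entry corresponds to roughly a 16-fold speedup, whereas the 8.8% entry is about 1.1-fold.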
Performance Comparison based on the time to process image
Parameters: Magnification, Fudge Factor, Sigma and HSize
[Bar chart: percentage change in image processing time per parameter set; series: System II (a) vs. System I, System II (b) vs. System II (a), System II (b) vs. System I.]
Conclusion
 The performance improvements due to the new architecture in Intel’s fourth-
generation i7 family are substantial: on the CPU tasks the i7 4800 mobile CPU at
3.5 GHz needed 3.4% to 12.6% less total run time than the overclocked i7 3770K at 4.5 GHz!
 NVIDIA also seems to offer improved support of its GTX 7*** Series on Windows
8.1, where we have seen an improvement of up to 93.9% for one set of parameters
and over 20% overall on identical hardware running Windows 8.1 with the
326.01 driver vs. Windows Server 2012 with the 320.49 driver.
 Obviously, measuring the performance of hybrid algorithms is similar to asking
“how long is a piece of string”, but given that manufacturers
fine-tune their products to perform better in standard benchmarking
tools, it is always wise to create your own benchmarks that fit your applications.
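As a starting point for such a home-grown benchmark, the following Python sketch runs a task a fixed number of times and reports the total, in the same spirit as the "number of runs per test" totals in the tables. It is only an illustration: `example_task` is a hypothetical placeholder workload, and the original measurements were made in MATLAB, not with this code.

```python
import time


def benchmark(task, runs):
    """Call `task` `runs` times and summarise the wall-clock times in seconds."""
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        per_run.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "total": sum(per_run),
        "best": min(per_run),
        "mean": sum(per_run) / runs,
    }


def example_task():
    # Hypothetical placeholder; replace with the step of your own application.
    return sum(i * i for i in range(10_000))


report = benchmark(example_task, runs=20)
print(f"{report['runs']} runs: total {report['total']:.4f} s, "
      f"best {report['best']:.6f} s, mean {report['mean']:.6f} s")
```

Reporting the best and mean alongside the total helps spot runs that were disturbed by other activity on the machine.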
Acknowledgements
I would like to thank the following individuals for their help in measuring and
optimising the performance of my MATLAB code, through their extensive
knowledge of MATLAB and/or CUDA:
 Dr. Mike Giles, Professor of Scientific Computing at the University of Oxford; resident
expert for NVIDIA and MATLAB
 Dr. James Lebak, Parallel Computing Software Engineer at MathWorks
Boston HQ.
 Captain (USMC) John Roberts, Senior Principal GPGPU Software Engineer at BAE
Systems, Inc. (formerly of NVIDIA); John also heads the CUDA Vision Workbench
project.
I would also like to thank XMG-Schenker for supporting my research effort
through their generous sponsorship of my Schenker W503.