science + computing ag
IT Service and Software Solutions for Complex Computing Environments
Tuebingen | Muenchen | Berlin | Duesseldorf
Performance Analysis and Optimizations of CAE Applications
Case Study: STAR-CCM+
Dr. Fisnik Kraja
HPC Services and Software
02 April 2014
HPC Advisory Council Switzerland Conference 2014
science + computing at a glance

• Founded in 1989
• ~300 Employees in Tübingen, München, Berlin, Düsseldorf, Ingolstadt
• Focus on technical computing (CAD, CAE, CAT)
• We count the following among our customers:
  - Automobile manufacturers
  - Suppliers of the automobile industry
  - Manufacturers of microelectronic components
  - Aerospace companies
  - Manufacturing companies
  - Chemical and pharmaceutical companies
  - Public sector
s+c core competencies:
IT Services | Consulting | Software

• Distributed Computing
• Automation / Process Optimization
• Migration / Consolidation
• IT Operation
• IT Security
• High Performance Computing
• IT Management
• Distributed Resource Management
Main Contributors
• Applications and Performance Team
  - Madeleine Richards
  - Damien Declat (TL)
• HPC Services and Software Team
  - Josef Hellauer
  - Oliver Schröder
  - Jan Wender (TL)
• CD-adapco
Overview
• Introduction
• Benchmarking Environment
• Initial Performance Optimizations
• Performance Analysis and Comparison
  - CPU Frequency Dependency
  - Memory Hierarchy Dependency
  - CPU Comparison
  - Hyperthreading Impact
  - Intel(R) Turbo Boost Analysis
  - MPI Profiling
• Conclusions
Introduction
• ISV software is in general pre-compiled
• Many optimization possibilities nevertheless exist in the HPC environment:
  - Node selection and scheduling (scheduler)
  - System- and node-level task placement and binding (runtimes)
  - Operating system optimizations
• The purpose of this study is to:
  - Analyse the behavior of an ISV application
  - Apply optimizations that improve resource utilization
• The test case:
  - STAR-CCM+ 8.02.008
  - Platform Computing MPI-08.02
  - Aerodynamics simulation (60M)
Benchmarking Environment

| Compute node        | Sid B710                     | Sid B71010c                  | Sid B71012c                  | Robin ivy27-12c-hton         | Robin ivy27-12c-E3-htoff     |
|---------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
| Processor           | Intel E5-2697 v2 (IvyBridge) | Intel E5-2680 v2 (IvyBridge) | Intel E5-2697 v2 (IvyBridge) | Intel E5-2697 v2 (IvyBridge) | Intel E5-2697 v2 (IvyBridge) |
| Frequency           | 2.70 GHz                     | 2.80 GHz                     | 2.70 GHz                     | 2.70 GHz                     | 2.70 GHz                     |
| Cores per processor | 12                           | 10                           | 12                           | 12                           | 12                           |
| Sockets per node    | 2                            | 2                            | 2                            | 2                            | 2                            |
| Cores per node      | 24                           | 20                           | 24                           | 24                           | 24                           |
| Memory              | 32 GB                        | 64 GB                        | 64 GB                        | 64 GB                        | 64 GB                        |
| Memory frequency    | 1866 MHz                     | 1866 MHz                     | 1866 MHz                     | 1866 MHz                     | 1866 MHz                     |
| IO / file system    | NFS over IB                  | NFS over IB                  | NFS over IB                  | NFS over IB                  | NFS over IB                  |
| Interconnect        | IB FDR                       | IB FDR                       | IB FDR                       | IB FDR                       | IB FDR                       |
| STAR-CCM+           | 8.02.008                     | 8.02.008                     | 8.02.008                     | 8.02.008                     | 8.02.008                     |
| Platform MPI        | 8.2.0.0                      | 8.2.0.0                      | 8.2.0.0                      | 8.2.0.0                      | 8.2.0.0                      |
| SLURM               | 2.6.0                        | 2.6.0                        | 2.6.0                        | 2.5.0                        | 2.5.0                        |
| OFED                | 1.5.4.1                      | 1.5.4.1                      | 1.5.4.1                      | 1.5.4.1                      | 1.5.4.1                      |
Initial Optimizations
1. CPU Binding (cb)
   • Bind tasks to specific physical/logical cores. This eliminates the overhead of thread
     migration and, combined with the first-touch policy, improves data locality.
2. Zone Reclaim (zr)
   • At startup, Linux measures the transfer rate between NUMA nodes and decides whether to
     enable Zone Reclaim to optimize memory performance on NUMA systems. We forced it on.
3. Transparent Huge Pages (thp)
   • Recent Linux kernels support multiple page sizes. In some cases huge pages improve
     performance because memory management is simplified. THP is an abstraction layer in
     RHEL 6 that makes huge pages easy to use.
4. Optimizations for Intel(R) CPUs
   • Turbo Boost (trb) lets the processor run above its base operating frequency by
     dynamically scaling the CPU clock rate.
   • Hyperthreading (ht) lets the operating system address two (or more) virtual or logical
     cores per physical core and share the workload between them where possible.
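The slides name these knobs but not the mechanics. As a rough sketch, zr and thp are plain procfs/sysfs writes and cb is a scheduler-affinity call; the paths below are the usual RHEL 6/upstream locations and are assumptions on my part, and in production the site scheduler (e.g. SLURM's CPU binding) would normally do the pinning:

```python
import os

def enable_zone_reclaim():
    # zr: force zone reclaim on (the kernel may leave it off by default)
    with open("/proc/sys/vm/zone_reclaim_mode", "w") as f:
        f.write("1\n")

def enable_thp():
    # thp: RHEL 6 uses a vendor-specific path, newer kernels the upstream one
    for path in ("/sys/kernel/mm/redhat_transparent_hugepage/enabled",
                 "/sys/kernel/mm/transparent_hugepage/enabled"):
        if os.path.exists(path):
            with open(path, "w") as f:
                f.write("always\n")
            return

def bind_to_cores(cores):
    # cb: pin the calling process (and its future threads) to the given cores
    os.sched_setaffinity(0, set(cores))

if __name__ == "__main__":
    enable_zone_reclaim()          # needs root
    enable_thp()                   # needs root
    bind_to_cores(range(0, 12))    # e.g. socket 0 on a 2x12-core node
```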
Obtained Improvements
Elapsed time in seconds, STAR-CCM+ on nodes with E5-2680v2 CPUs
(cb=cpu_bind, zr=zone reclaim, thp=transparent huge pages, trb=turbo, ht=hyperthreading):

|                    | 120 Tasks / 6 Nodes | 240 Tasks / 12 Nodes | 480 Tasks / 24 Nodes | 960 Tasks / 48 Nodes |
|--------------------|---------------------|----------------------|----------------------|----------------------|
| Opt: none          | 3412.6              | 1779.3               | 938.1                | 553.0                |
| Opt: cb            | 2719.0              | 1196.5               | 578.7                | 316.7                |
| Opt: cb,zr         | 2199.0              | 1103.5               | 577.3                | 325.8                |
| Opt: cb,zr,thp     | 2191.5              | 1094.4               | 575.2                | 313.6                |
| Opt: cb,zr,thp,trb | 2048.6              | 1027.7               | 537.2                | 300.9                |
| Opt: cb,zr,thp,ht  | 1953.8              | 1017.3               | 543.8                | 317.2                |
Performance Analysis and Comparison
1. What can we do?
   • Analyze the dependency on:
     - CPU frequency
     - Memory hierarchy
   • Compare the performance of different CPU types
   • Analyze the impact of Hyperthreading and Turbo Boost
   • Profile MPI communications
2. Why should we do that?
   • To better utilize the resources in a heterogeneous environment
     - by selecting the appropriate compute nodes
     - by giving each job exactly the resources needed (neither more nor less)
   • To find an informed compromise between performance and costs
     - power consumption and licenses
   • To predict the behavior of the application on upcoming systems
Dependency on the CPU Frequency
1. This test case shows an 85-88% dependency on the CPU frequency:
   • 85% on 6 nodes
   • 87% on 12 nodes
   • 88% on 24 and also on 48 nodes

Power-law fits of elapsed time (seconds) over CPU frequency (GHz):
   • 6 Nodes (120 Tasks):  y = 4993.5 * x^-0.852, R² = 0.9975
   • 12 Nodes (240 Tasks): y = 2567.1 * x^-0.871, R² = 0.9977
   • 24 Nodes (480 Tasks): y = 1358.5 * x^-0.878, R² = 0.9985
   • 48 Nodes (960 Tasks): y = 751.72 * x^-0.882, R² = 0.9982
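The fitted curves take the form t = C * f^(-a), where the exponent a is the frequency dependency quoted above. A minimal sketch of such a fit with NumPy, using hypothetical (frequency, elapsed-time) pairs since the raw measurements are not included here:

```python
import numpy as np

# Hypothetical measurements: CPU frequency in GHz -> elapsed time in seconds.
# The actual data points behind the slide are not published here.
freq = np.array([1.2, 1.6, 2.0, 2.4, 2.8, 3.1])
tset = np.array([3412.0, 2617.0, 2159.0, 1874.0, 1678.0, 1573.0])

# Fit t = C * f^(-a) by linear regression in log-log space:
#   log t = log C - a * log f
slope, log_c = np.polyfit(np.log(freq), np.log(tset), 1)
a = -slope
print(f"t = {np.exp(log_c):.1f} * f^(-{a:.3f})")  # 'a' is the frequency dependency

# Coefficient of determination of the log-log fit
pred = log_c + slope * np.log(freq)
resid = np.log(tset) - pred
r2 = 1 - resid.var() / np.log(tset).var()
print(f"R^2 = {r2:.4f}")
```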
CPU Frequency Impact on Memory Throughput
1. Memory throughput increases at almost the same ratio as the speedup:
   • The integrated memory controller and the caches get faster
   • This shows that memory is not a bottleneck
2. Almost the same behavior is observed in separate tests on 6 and 48 nodes

| Relative to 1.2 GHz                    | 1.2 GHz | 1.6 GHz | 2.0 GHz | 2.4 GHz | 2.8 GHz | 3.1 GHz (TB) |
|----------------------------------------|---------|---------|---------|---------|---------|--------------|
| Speedup (6 Nodes)                      | 1.000   | 1.304   | 1.581   | 1.821   | 2.034   | 2.170        |
| Memory throughput increase (6 Nodes)   | 1.000   | 1.301   | 1.577   | 1.815   | 2.029   | 2.168        |
| Speedup (48 Nodes)                     | 1.000   | 1.259   | 1.501   | 1.823   | 2.046   | 2.171        |
| Memory throughput increase (48 Nodes)  | 1.000   | 1.262   | 1.505   | 1.807   | 2.037   | 2.160        |
CPU Comparison
E5-2697v2 vs. E5-2680v2
• The 12-core CPU is faster by 8-9 %
• However, there are also drawbacks:
  - Increased power consumption
  - Increased license costs

E5-2697v2 (12c @ 2.7 GHz) vs. E5-2680v2 (10c @ 2.8 GHz), elapsed time in seconds:

| Nodes                 | 6      | 12     |
|-----------------------|--------|--------|
| Sid B71010c - 2.8 GHz | 2191.5 | 1094.4 |
| Sid B71012c - 2.7 GHz | 2011.9 | 991.0  |

• In a second comparison we use 10 cores @ 2.4 GHz on both CPU types
• The E5-2697v2 is still faster
• Why?

E5-2697v2 (10c @ 2.4 GHz) vs. E5-2680v2 (10c @ 2.4 GHz), elapsed time in seconds:

| Nodes (Tasks)           | 6 (120) | 12 (240) |
|-------------------------|---------|----------|
| Sid B71010c - E5-2680v2 | 2450.4  | 1236.2   |
| Sid B71012c - E5-2697v2 | 2389.2  | 1205.6   |
Memory Throughput Analysis

Average memory throughput on one node (MB/s):

|          | 120 Tasks / 6 Nodes | 240 Tasks / 12 Nodes | 480 Tasks / 24 Nodes | 960 Tasks / 48 Nodes |
|----------|---------------------|----------------------|----------------------|----------------------|
| Socket 0 | 24011.3             | 22451.9              | 21014.0              | 17394.4              |
| Socket 1 | 22192.4             | 22058.6              | 20341.6              | 18025.4              |
| Read     | 37546.3             | 36361.3              | 34003.8              | 29162.2              |
| Write    | 8657.4              | 8133.4               | 7277.4               | 6151.4               |

1. Memory is stressed a bit more when running on 6 nodes
2. The more nodes are used, the less memory bandwidth is needed:
   • More time is spent in MPI
   • More cache becomes available
3. The sockets are well balanced, considering that these measurements were taken on the
   first node, where Rank 0 is running.
4. The write-to-read ratio of memory traffic is always around 20/80 %.
Memory Hierarchy Dependency
Scatter vs. Compact Task Placement

Performance improves with scatter mode:
• Even in cases where less memory bandwidth is used (24 and 48 nodes)
• The reason is that the L3 cache on the second socket becomes available
• By doubling the L3 cache per task/core, we reduced the cache misses by at least 10 %

Scatter relative to compact (ratios; TSET below 1.00 means scatter is faster):

|                             | 120 Tasks / 12 Nodes | 240 Tasks / 24 Nodes | 480 Tasks / 48 Nodes | 960 Tasks / 96 Nodes |
|-----------------------------|----------------------|----------------------|----------------------|----------------------|
| Impact on TSET              | 0.89                 | 0.91                 | 0.93                 | 0.91                 |
| Impact on memory throughput | 1.06                 | 1.01                 | 0.93                 | 0.85                 |
| Impact on L3 misses         | 0.89                 | 0.89                 | 0.85                 | 0.89                 |

A placement sketch follows below.
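Compact placement fills socket 0 before touching socket 1; scatter alternates sockets so each task sees more L3 cache. A minimal sketch of the two core orderings, assuming a 2-socket node with contiguous core numbering per socket (real mappings should come from hwloc or sysfs; the helper names are mine):

```python
import os

CORES_PER_SOCKET = 12   # assumption: E5-2697 v2, contiguous numbering per socket
SOCKETS = 2

def compact_core(rank_on_node):
    # compact: fill socket 0 first, then socket 1
    return rank_on_node

def scatter_core(rank_on_node):
    # scatter: alternate sockets so each task gets more L3 cache
    socket = rank_on_node % SOCKETS
    slot = rank_on_node // SOCKETS
    return socket * CORES_PER_SOCKET + slot

if __name__ == "__main__":
    print("rank  compact  scatter")
    for r in range(10):     # e.g. 10 tasks on a 24-core node
        print(f"{r:4d}  {compact_core(r):7d}  {scatter_core(r):7d}")
    # A launcher would then bind each local rank r via:
    # os.sched_setaffinity(pid_of_rank_r, {scatter_core(r)})
```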
Hyperthreading Impact on Performance
HT-ON vs. HT-OFF - 24 Tasks per Node - E5-2697v2

1. Here we analyze the impact of having HT on or off (in the BIOS)
   • Even when not overpopulating the nodes
2. As shown below, the impact on performance is minimal
3. The real reasons for this behavior are not clear
   • It could be OS jitter
   • What do you think?

Elapsed time in seconds:

| Configuration       | Frequency | HT on (Robin_ivy27-12c-hton_sbatch) | HT off (Robin_ivy27-12c-E3-htoff_sbatch) | Reduction of TSET |
|---------------------|-----------|-------------------------------------|------------------------------------------|-------------------|
| 6 nodes x 24 cores  | 2.4 GHz   | 2197                                | 2189                                     | 0.33%             |
| 6 nodes x 24 cores  | 2.7 GHz   | 2026                                | 2025                                     | 0.02%             |
| 12 nodes x 24 cores | 2.4 GHz   | 1098                                | 1081                                     | 1.52%             |
| 12 nodes x 24 cores | 2.7 GHz   | 1011                                | 996                                      | 1.46%             |
| 24 nodes x 24 cores | 2.4 GHz   | 589                                 | 577                                      | 2.11%             |
| 24 nodes x 24 cores | 2.7 GHz   | 544                                 | 529                                      | 2.69%             |
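Whether HT is really on or off can be verified on the node itself. A small sketch reading the Linux sysfs topology (the path is the standard kernel location, assumed here rather than taken from the slides):

```python
import glob

def ht_sibling_groups():
    """Collect the hyper-thread sibling groups of all cores from sysfs."""
    groups = set()
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu*/topology/thread_siblings_list"):
        with open(path) as f:
            groups.add(f.read().strip())  # e.g. "0,24" with HT on, "0" with HT off
    return sorted(groups)

if __name__ == "__main__":
    groups = ht_sibling_groups()
    ht_on = any("," in g or "-" in g for g in groups)
    print("HT is", "ON" if ht_on else "OFF")
    print(groups)
```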
Turbo Boost Analysis
During these tests we always use 24 nodes and reduce the number of tasks per node by 2
at each step. Using fewer CPU cores per node has two effects:
1. Turbo Boost can clock the remaining cores (including the integrated memory
   controller) at up to 3.6 GHz
2. Each core gets a higher memory bandwidth

As expected, the elapsed time increases on fewer cores, but the Turbo Boost impact
becomes more pronounced as the number of cores is reduced: cutting the core count by a
factor of 10 increases the elapsed time only by a factor of 6.7. This could be an
interesting way to reduce license costs.

| # Tasks (on 24 nodes)       | 480    | 432   | 384   | 336   | 288   | 240   | 192  | 144  | 96   | 48   |
|-----------------------------|--------|-------|-------|-------|-------|-------|------|------|------|------|
| Tasks per node (per socket) | 20(10) | 18(9) | 16(8) | 14(7) | 12(6) | 10(5) | 8(4) | 6(3) | 4(2) | 2(1) |
| Max. frequency (GHz)        | 2.9    | 3.1   | 3.1   | 3.1   | 3.1   | 3.2   | 3.3  | 3.4  | 3.5  | 3.6  |

[Charts: elapsed time per task count, and the ratios Ratio_TSET, Ratio_Cores and
Ratio_Diff over the number of tasks (cores); the individual values are not recoverable
from this export.]
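The maximum frequencies in the table can be cross-checked on a running node. A rough sampling sketch via the cpufreq sysfs interface; note that path and availability depend on kernel and driver, and on some systems scaling_cur_freq reports the requested rather than the effective clock:

```python
import glob
import time

def max_core_freq_mhz():
    """Return the highest current core frequency on this node, in MHz."""
    freqs = []
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"):
        with open(path) as f:
            freqs.append(int(f.read()) / 1000.0)  # the file reports kHz
    return max(freqs) if freqs else None

if __name__ == "__main__":
    # Sample once per second while the benchmark runs, e.g. from a job wrapper
    for _ in range(10):
        print(f"max core freq: {max_core_freq_mhz()} MHz")
        time.sleep(1)
```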
Turbo Boost and Hyperthreading impact on Memory Throughput
Tests with 240 Tasks on 12 Fully/Over Populated Nodes
1. The Turbo Boost impact on memory throughput matches its impact on speedup
2. The Hyperthreading impact does not:
   • The reason might be the eviction of cached data, since 2 threads run on the same core

| Ratio vs. fully populated  | Fully + Turbo | HT (overpopulated) | TB + HT (overpopulated) |
|----------------------------|---------------|--------------------|-------------------------|
| Speedup in time            | 1.07          | 1.08               | 1.15                    |
| Memory throughput increase | 1.07          | 1.12               | 1.19                    |
MPI Profiling (1)
As shown below, the share of time spent in MPI communication grows almost linearly with
the number of nodes.

|           | 3 Nodes | 6 Nodes | 12 Nodes | 24 Nodes | 48 Nodes | 96 Nodes |
|-----------|---------|---------|----------|----------|----------|----------|
| MPI time  | 21.58%  | 22.51%  | 25.40%   | 32.34%   | 40.47%   | 53.71%   |
| User time | 78.42%  | 77.49%  | 74.60%   | 67.66%   | 59.53%   | 46.29%   |
MPI Profiling (2)
1. Most of the MPI time is spent in MPI_Allreduce and MPI_Waitany
   • Benchmarks on Platform MPI's selection of collective algorithms:
     a 5-10% improvement of the MPI collective time is expected
2. As the number of nodes grows, the share of time spent in MPI_Allreduce and
   MPI_Waitany gets distributed over other MPI calls such as MPI_Recv, MPI_Waitall, etc.
3. Interestingly, up to 9% of the MPI time is spent in MPI_File_read_at
   • A parallel file system might help to reduce this part

[Chart: per-call breakdown of MPI time for 3, 6, 12, 24, 48 and 96 nodes; the individual
values are not recoverable from this export.]
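The MPI/user split above comes from an MPI profiler. As a hedged illustration of the idea, the same split can be approximated by timing each MPI call explicitly; the sketch below uses mpi4py and a dummy compute loop, neither of which is part of the original setup:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
mpi_time = 0.0
t0 = MPI.Wtime()

data = np.random.rand(1_000_000)
for _ in range(50):
    # "User time": local compute standing in for the solver
    partial = float(np.dot(data, data))

    # "MPI time": wrap each MPI call in a timer
    t = MPI.Wtime()
    total = comm.allreduce(partial, op=MPI.SUM)
    mpi_time += MPI.Wtime() - t

wall = MPI.Wtime() - t0
if comm.rank == 0:
    print(f"MPI time: {100 * mpi_time / wall:5.2f}%   "
          f"User time: {100 * (wall - mpi_time) / wall:5.2f}%")
```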
Conclusions
1. CPU frequency
   • STAR-CCM+ showed an 85-88% dependency on the CPU frequency
2. Cache size
   • STAR-CCM+ depends on the cache size per core
3. Turbo Boost
   • Pays off only when not all cores are used
4. Hyperthreading
   • When used, HT has no big positive impact on performance
   • When enabled but not used, HT has a small negative impact on performance
5. Memory bandwidth and latency
   • STAR-CCM+ is more latency- than bandwidth-dependent
6. MPI profiling
   • MPI time increases linearly with the number of nodes
7. File system
   • Parallel file systems should be considered for MPI-IO
Thank you for your attention.
science + computing ag
www.science-computing.de
Fisnik Kraja
Email: f.kraja@science-computing.de
Theory behind the CPU Frequency Dependency

• Let's start with the statement:
  - To solve m problems on x resources, we need time t = T
• Then we can make the following assumptions:
  - To solve n*m problems on n*x resources, we still need t = T
  - To solve m problems on n*x resources, we need t = T/n (hopefully)
• From the last assumption: t ≈ T · n^(-1)
• We can expand this to: t ≈ T · n^(-(a+b)), where a + b = 1
  - a represents the dependency on the CPU frequency
  - b represents the dependency on other factors
• To find the value of a, we keep b = 0
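Connecting the model to the frequency sweep earlier in the deck: when only the frequency changes, n plays the role of the frequency scaling factor, so the fitted exponent estimates a directly. A worked example with the 6-node fit (this identification is my reading of the model, not stated on the slide):

```latex
\[
  t \;\approx\; T \cdot n^{-(a+b)}, \qquad a + b = 1 .
\]
% Measured 6-node fit from the frequency sweep:
\[
  t \approx 4993.5 \cdot f^{-0.852}
  \;\Longrightarrow\; a \approx 0.85, \qquad b = 1 - a \approx 0.15 .
\]
```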