1
Benchmark Analysis of Multi-Core Processor Memory Contention
Tyler Simon, Computer Sciences Corp.
James McGalliard, FEDSIM
SCMG
Richmond/Raleigh, April 2009
2
Presentation Overview
 Multi-Core Processors & Performance
 Discover System & Workload
 Benchmark Results & Discussion
• Cubed Sphere
 Wall Clock
 MPI Time
• Memory Kernel
 Concluding Remarks
3
Multi-Core Processors & Performance
 Moore’s Law worked for more than 30 years
 Problems with current leakage and heat
 Processors can’t grow much larger at current clock rates
4
Multi-Core Processors & Performance
 Industry has turned to multi-core processors to continue price/performance improvements
 Early multi-core designs were just multiple single-core processors attached to each other
5
Discover System & Workload
 The NASA Center for Computational Sciences (NCCS) is located at Goddard Space Flight Center in Greenbelt, MD
 Goddard is the world’s largest organization of Earth scientists and engineers
6
Discover System & Workload
 “Discover” is currently the largest NCCS system: a Linux Networx & IBM cluster with 4,648 CPUs – dual- and quad-core Dempsey, Woodcrest and Harpertown processors.
7
Discover System & Workload
 The predominant NCCS workload is global climate modeling. An important modeling code is the cubed sphere.
 Today’s presentation includes cubed sphere benchmark results as well as synthetic memory kernel results.
8
Cubed Sphere Benchmark Results & Discussion
 The benchmark tests were run with the finite-volume cubed sphere dynamical core running the “benchmark 1” test case, a 15-day simulation from a balanced hydrostatic baroclinic state at 100-km resolution with 26 levels.
 The test case was run on Discover using only the Intel Harpertown quad-core processors, using 3, 6 and 12 nodes and varying the active cores per node among 2, 4, 6 and 8.
9
Cores, Processors, Nodes
 Cores = central processing units, including the logic needed to execute the instruction set, registers & local cache
 Processors = one or more cores on a single chip, in a single socket, including shared cache, network and memory access connections
 Node = a board with one or more processors and local memory, network attached
10
Benchmark Results & Discussion
 Note, the largest performance problem
with the GEOS-5 workload is I/O, a
problem that is growing for everybody in
the HPC community.
 The dynamical core is about ¼ of the
processor workload at this time. The
remaining ¾ is physics, which is much
more cache and memory friendly than the
dynamics.
11
Benchmark Results & Discussion

Nodes  Cores/Node  Total Cores  Wall Time (s)  % Communication
  12       2           24          371.1            5.3
  12       4           48          212.8           10.19
  12       6           72          181.9           17.33
  12       8           96          178.5           20.74
   6       2           12          676.1            3.96
   6       4           24          411.6            7.01
   6       6           36          339.2           12.82
   6       8           48          318.3           14.16
   3       2            6         1336.9            2.77
   3       4           12          771.2            5.58
   3       6           18          658.8           11.68
   3       8           24          601.3            9.88
12
Benchmark Results & Discussion
13
Benchmark Results & Discussion
 Running the cubed sphere benchmark shows that by using fewer cores per node we can improve the runtime by 38%.
 Results show performance degradation with increasing cores per node in application runtime, MPI behavior and on-chip cache access.
14
Benchmark Results & Discussion
 Parallel efficiency is the single-core execution time divided by the processor-time (runtime * total cores).
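As a hedged illustration of this definition (the single-core time T_1 is not listed in the table above, so treat it as an assumed baseline measurement), and as a check of the 38% runtime figure on the previous slide using the table’s 24-core rows:

```latex
% Parallel efficiency on N total cores, where T_1 is the single-core
% runtime and T_N is the measured wall time on N cores:
E(N) = \frac{T_1}{N \, T_N}

% Runtime check at a fixed 24 total cores (from the table):
% 12 nodes x 2 cores/node (371.1 s) vs. 3 nodes x 8 cores/node (601.3 s)
\frac{601.3 - 371.1}{601.3} \approx 0.38
```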
15
Benchmark Results & Discussion
 This chart shows that simply reducing the number of cores per node increases parallel efficiency by an average of 53%.
16
Benchmark Results & Discussion
 The application’s use of MPI and the placement of work into ranks, and then onto cores, is important when considering overall runtime.
 The following charts show how MPI performance is affected by various core configurations for the 24-core problem size at 2, 4 and 8 cores per node (a minimal per-rank timing sketch follows below).
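For concreteness, here is a minimal sketch of how per-rank MPI time could be accumulated and reported. This is not the instrumentation behind the charts that follow (those were presumably produced with an MPI profiling tool); the step count and the MPI_Barrier standing in for the application’s halo exchanges are illustrative assumptions.

```c
/* Minimal per-rank MPI timing sketch (illustrative, not the authors' tooling).
 * Build with an MPI C compiler, e.g. mpicc. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double mpi_time = 0.0;          /* time accumulated inside MPI calls */
    double t_start = MPI_Wtime();   /* wall-clock start for this rank    */

    for (int step = 0; step < 100; step++) {
        /* ... compute phase would go here ... */

        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);   /* stand-in for halo exchange / collectives */
        mpi_time += MPI_Wtime() - t0;
    }

    double wall = MPI_Wtime() - t_start;

    /* gather per-rank MPI totals on rank 0 (the "green line" per rank) */
    double *all_mpi = NULL;
    if (rank == 0)
        all_mpi = malloc(size * sizeof(double));
    MPI_Gather(&mpi_time, 1, MPI_DOUBLE, all_mpi, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int r = 0; r < size; r++)
            printf("rank %d: MPI time %.3f s\n", r, all_mpi[r]);
        printf("rank 0 wall time: %.3f s\n", wall);
        free(all_mpi);
    }

    MPI_Finalize();
    return 0;
}
```

Run with the same 24 ranks spread over 12, 6 or 3 nodes, the per-rank totals printed by rank 0 play the role of the total-MPI-time-per-rank curves in the charts below.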
17
Benchmark Results & Discussion
 Charts 2a, 2b and 2c show, with the green line, the total amount of time spent in MPI per rank for the entire benchmark run at 2, 4 and 8 cores per node respectively.
 For 2 cores the time is around 23 seconds, for 4 cores around 26 seconds, and at 8 cores we see more fluctuation, with MPI use costing around 60 seconds. System time also slowly creeps up as we increase the per-node core count.
18
Benchmark Results & Discussion
 2 active cores
 Runtime ~370 seconds
 MPI ~20 seconds
19
Benchmark Results & Discussion
 4 active cores
 Runtime ~410 seconds
 MPI ~30 seconds
20
Benchmark Results & Discussion
 8 active cores
 Runtime ~600 seconds (almost double vs. 2 cores)
 MPI ~60 seconds (triple vs. 2 cores)
21
Benchmark Results & Discussion
 The next charts break down the previous charts’ overall MPI use.
 They show how time is spent within MPI on a per-rank basis.
 They show how each MPI rank is affected by the core configuration.
 Note the variation of the maximum value on the y-axis between charts.
22
Benchmark Results & Discussion
 Components of the green line in the earlier charts
 2 active cores
 About 20 seconds total MPI time, the same as before
23
Benchmark Results & Discussion
 4 active cores
 About 30 seconds total MPI time
24
Benchmark Results & Discussion
 8 active cores
 About 60 seconds total MPI time
 More variability
25
Benchmark Results & Discussion
 On-chip resource contention between cores is a well-documented byproduct of the “multi-core revolution” and manifests itself in significant reductions in available memory bandwidth.
 This fight for cache bandwidth between cores affects application runtime and, as expected, is exacerbated by increasing the number of cores used per node.
26
Benchmark Results & Discussion
 The following charts provide a brief overview of the costs associated with using a multi-core chip (Woodcrest vs. Harpertown) in terms of reading and writing to the cache with various block sizes, cache miss latency and cache access time.
 Note that the Woodcrest nodes run at 2.66 GHz and have a 4MB shared L2 cache, thus 2MB per core, with each core having access to a 32K L1 (16K for data and 16K for instructions).
 The Harpertown nodes have 2 quad-core chips running at 2.5 GHz. Within each quad-core chip, each set of 2 cores has access to a 6MB L2 cache and a 64K L1 cache (32K x 2).

Chip        Cores  Chips/Node  Clock     L1        L2
Woodcrest     2        2       2.66 GHz  32K each  4M shared
Harpertown    4        2       2.5 GHz   64K/pair  6M shared/pair
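The cache results on the next charts come from a memory stride kernel adapted from Hennessy & Patterson (see the Reference slide). The following is a minimal sketch of that style of read+write stride loop, not the authors’ exact code; the 64 MB array size, repetition count and clock_gettime timing are illustrative assumptions.

```c
/* Minimal read+write stride kernel sketch, in the spirit of the kernel
 * adapted from Hennessy & Patterson (not the authors' exact code).
 * Walks a large array at increasing strides and reports ns per read+write. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (64 * 1024 * 1024)   /* 64 MB: larger than the L2 caches */

int main(void)
{
    size_t n = ARRAY_BYTES / sizeof(int);
    int *a = malloc(ARRAY_BYTES);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1;   /* touch every page once */

    /* stride from 4 bytes up to tens of MB, doubling each time */
    for (size_t stride = 1; stride <= n / 2; stride *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        long accesses = 0;
        for (int rep = 0; rep < 16; rep++) {
            for (size_t i = 0; i < n; i += stride) {
                a[i] = a[i] + 1;     /* one read + one write per element */
                accesses++;
            }
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("stride %9zu bytes: %6.1f ns per read+write\n",
               stride * sizeof(int), ns / accesses);
    }

    free(a);
    return 0;
}
```

To mimic the 2-, 4- and 8-active-core cases, one copy of such a kernel would be run concurrently on each active core of a node; that concurrent pressure on the shared L2 is what drives the contention shown in the charts.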
27
Benchmark Results & Discussion
28
Benchmark Results & Discussion
29
Benchmark Results & Discussion
[Chart: R+W time (ns), 0–700, vs. stride size in Kbytes (0.001–100,000, log scale) for 2 active cores (Woodcrest), 4 active cores (Woodcrest) and 8 active cores (Harpertown)]
30
Benchmark Results & Discussion
 There is a dramatic increase in latency as
the number of active cores increases.
From 2 to 4, approximately a factor of 4
increase.
 From 2 to 8 cores (with roughly similar
performance between Woodcrest and
Harpertown), about a factor of 8 increase.
 There are plateaus and steep inclines,
particularly in the single core results,
showing that there is sensitivity to locality.
Locality is under the programmer’s
control.
31
Concluding Remarks
 This presentation examined performance
differences for the Cubed Sphere
benchmark on Harpertown nodes by
varying the active core count per node,
e.g., 38% better runtime on 2 cores per
node vs. 8
 MPI performance degrades if core count
and problem size are fixed but core
density increases, e.g., tripling MPI time
going from 2 to 8 cores per node
32
Concluding Remarks
 Cache read and write times also degrade as core density increases, e.g., about 8-fold going from 2 to 8 cores per node
 Runtime seems to be affected by the number of cores used per node due to the resource contention in the multicore environment.
33
Concluding Remarks
 Scheduling a job to run on fewer processors can improve run-time, at a cost of reduced processor utilization (which may not be a problem).
 To date, most NCCS user/scientists are more concerned with science details and portability than with performance optimization. We expect their concern to grow with fixed clock rates and memory contention.
34
Concluding Remarks
 Application-level optimization is more work for the user.
 The direction of all processor designs is towards cells and hybrids. It is unknown whether this will work.
 A hybrid approach, e.g., using OpenMP or cell processors, may improve performance by avoiding off-chip memory contention (see the sketch below).
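As a hedged illustration of the hybrid idea (assumed structure, not the NCCS codes): one MPI rank per node or socket with OpenMP threads inside it, so intra-chip parallelism shares memory and cache instead of adding MPI ranks. The loop and reduction below are placeholders.

```c
/* Minimal hybrid MPI + OpenMP sketch (illustrative placeholder work). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;

    /* threads share this rank's memory; only the master thread calls MPI */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (i + 1.0);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("threads per rank: %d, global sum: %f\n",
               omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

Built with an MPI compiler wrapper and OpenMP enabled (e.g., mpicc -fopenmp), one such rank per socket replaces several single-threaded ranks.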
35
POCs
 Tyler.simon@nasa.gov
 Jim.mcgalliard@gsa.gov
36
Reference
 Hennessy, John and Patterson, David. Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, San Mateo, California. The memory stride kernel is adapted from a code appearing on page 477.
37
Diagram of a Generic Dual Core Processor
38
39
Sandia Multi-Core Performance Prediction
40
[Chart: R+W time (ns), 0–50, vs. stride size in Kbytes (0.001–100,000, log scale) for 2 active cores (Woodcrest), 4 active cores (Woodcrest) and 8 active cores (Harpertown)]
41