3. Multi-Core Processors & Performance
Moore's Law worked for more than 30 years
Problems with current leakage and heat
Processors can't grow much larger at current clock rates
4. Multi-Core Processors & Performance
Industry has turned to multi-core processors to continue price/performance improvements
Early multi-core designs were just multiple single-core processors attached to each other
5. Discover System & Workload
The NASA Center for Computational Sciences (NCCS) in Greenbelt, MD
Goddard is the world's largest organization of Earth scientists and engineers
6. Discover System & Workload
"Discover" is currently the largest NCCS system, a Linux Networx & IBM cluster with 4,648 CPUs (dual- and quad-core Dempsey, Woodcrest, and Harpertown processors).
7. Discover System & Workload
The predominant NCCS workload is global climate modeling. An important modeling code is the cubed sphere.
Today's presentation includes cubed sphere benchmark results as well as synthetic kernels.
8. Cubed Sphere Benchmark Results & Discussion
The benchmark tests were run with the finite-volume cubed sphere dynamical core application running the "benchmark 1" test case, a 15-day simulation from a balanced hydrostatic baroclinic state at 100-km resolution with 26 levels.
The test case was run on Discover using only the Intel Harpertown quad-core processors, on 3, 6, and 12 nodes, varying the active cores per node among 2, 4, 6, and 8.
9. Cores, Processors, Nodes
Core = a central processing unit, including the logic needed to execute the instruction set, registers, and local cache
Processor = one or more cores on a single chip, in a single socket, including shared cache, network, and memory access connections
Node = a board with one or more processors and local memory, attached to the network
10. Benchmark Results & Discussion
Note: the largest performance problem with the GEOS-5 workload is I/O, a problem that is growing for everybody in the HPC community.
The dynamical core is about ¼ of the processor workload at this time. The remaining ¾ is physics, which is much more cache and memory friendly than the dynamics.
13. Benchmark Results & Discussion
Running the cubed sphere benchmark shows that using fewer cores per node can improve the runtime by 38%.
The results show that performance degrades as cores per node increase, in application runtime, MPI behavior, and on-chip cache.
14. Benchmark Results & Discussion
Parallel efficiency is the single-core execution time divided by the processor-time (runtime * total cores).
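Written out, using the definition above: E_p = T_1 / (p * T_p), where T_1 is the single-core execution time, T_p is the measured runtime on p cores, and p is the total number of cores. As an illustrative example (not a benchmark result), a run taking 100 seconds on one core and 15 seconds on 8 cores gives E_8 = 100 / (8 * 15) ≈ 0.83.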
15. Benchmark Results & Discussion
This chart shows that just by reducing the number of cores per node, efficiency is increased by an average of 53%.
16. Benchmark Results & Discussion
The application's use of MPI, and how work is placed into ranks and then onto cores, are important when considering overall runtime.
The following charts show how MPI performance is affected by various core configurations for the 24-core problem size at 2, 4, and 8 cores per node (i.e., 12, 6, and 3 nodes respectively).
17. Benchmark Results & Discussion
Charts 2a, 2b, and 2c show, with the green line, the total amount of time spent in MPI, per rank, for the entire benchmark run at 2, 4, and 8 cores per node respectively.
For 2 cores the time is around 23 seconds, for 4 cores around 26 seconds, and for 8 cores we see more fluctuation, but MPI use costs around 60 seconds. System time also slowly creeps up as the per-node core count increases.
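As a rough sketch of how the per-rank MPI time plotted in these charts can be accumulated, the fragment below wraps communication calls with MPI_Wtime(). The ring-style exchange and the iteration count are placeholders for illustration, not the cubed sphere's actual communication pattern.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double mpi_time = 0.0;                  /* time spent inside MPI, this rank */
        double send = (double)rank, recv = 0.0;

        for (int step = 0; step < 1000; step++) {
            /* ... per-step local computation would go here ... */

            double t0 = MPI_Wtime();
            /* placeholder ring exchange with neighboring ranks */
            MPI_Sendrecv(&send, 1, MPI_DOUBLE, (rank + 1) % size, 0,
                         &recv, 1, MPI_DOUBLE, (rank - 1 + size) % size, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            mpi_time += MPI_Wtime() - t0;
        }

        printf("rank %d: %.3f s in MPI\n", rank, mpi_time);
        MPI_Finalize();
        return 0;
    }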
20. Benchmark Results & Discussion
8 active cores: runtime ~600 seconds (almost double vs. 2 cores); MPI ~60 seconds (triple vs. 2 cores).
21. Benchmark Results & Discussion
The next charts break down the previous charts' overall MPI use, showing how time is spent within MPI on a per-rank basis and how each MPI rank is affected by the core configuration.
Note the variation of the maximum value on the y-axis between charts.
22. Benchmark Results & Discussion
Components of the green line in the earlier charts. 2 active cores: about 20 seconds total MPI time, the same as before.
24. Benchmark Results & Discussion
8 active cores: about 60 seconds total MPI time, with more variability.
25. Benchmark Results & Discussion
On-chip resource contention between cores is a well-documented byproduct of the "multi-core revolution" and manifests itself in significant reductions in available memory bandwidth.
This fight for cache bandwidth between cores affects application runtime and, as expected, is exacerbated by increasing the number of cores used per node.
26. Benchmark Results & Discussion
The following charts provide a brief overview of the costs associated with using a multi-core chip (Woodcrest vs. Harpertown) in terms of reading and writing to the cache with various block sizes, cache miss latency, and cache access time.
Note that the Woodcrest nodes run at 2.66 GHz and have a 4 MB shared L2 cache, thus 2 MB per core, with each core having access to a 32K L1 (16K for data and 16K for instructions).
The Harpertown nodes have 2 quad-core chips running at 2.5 GHz. Within each quad-core chip, each set of 2 cores has access to a 6 MB L2 cache and a 64K L1 cache (32K x 2).

Chip         Cores   Chips/Node   Clock      L1         L2
Woodcrest    2       2            2.66 GHz   32K each   4M shared
Harpertown   4       2            2.5 GHz    64K/pair   6M shared/pair
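The reference slide notes that the memory stride kernel behind the following chart is adapted from Hennessy & Patterson (p. 477). A minimal, independent sketch of that kind of read+write stride kernel is shown below; the 64 MB working set, the repetition count, and the POSIX clock_gettime timing are assumptions made for illustration, not the actual benchmark code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ARRAY_BYTES (64UL * 1024 * 1024)    /* 64 MB working set (assumption) */

    int main(void)
    {
        size_t n = ARRAY_BYTES / sizeof(double);
        double *a = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        /* Sweep the stride; each visited element is read and written once. */
        for (size_t stride = 1; stride <= n / 2; stride *= 2) {
            size_t accesses = 0;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int rep = 0; rep < 4; rep++)
                for (size_t i = 0; i < n; i += stride) {
                    a[i] += 1.0;                 /* one read + one write */
                    accesses++;
                }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("stride %10.3f KB   R+W time %8.1f ns\n",
                   stride * sizeof(double) / 1024.0, ns / accesses);
        }
        free(a);
        return 0;
    }

To mimic the 2-, 4-, and 8-active-core curves in the chart, one copy of such a kernel would be run simultaneously on each active core of a node.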
29. Benchmark Results & Discussion
[Chart: R+W time (ns) vs. stride size in Kbytes (0.001 to 100,000 KB), for 2 active cores (Woodcrest), 4 active cores (Woodcrest), and 8 active cores (Harpertown).]
30. Benchmark Results & Discussion
There is a dramatic increase in latency as the number of active cores increases:
from 2 to 4 cores, approximately a factor of 4 increase;
from 2 to 8 cores (with roughly similar performance between Woodcrest and Harpertown), about a factor of 8 increase.
There are plateaus and steep inclines, particularly in the single-core results, showing that there is sensitivity to locality. Locality is under the programmer's control.
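As one hedged illustration of the point that locality is under the programmer's control, the sketch below contrasts a naive matrix transpose with a blocked (tiled) version that keeps each tile within the caches described earlier. The matrix size and block size are assumptions and would need tuning to the actual cache sizes; this is not code from the benchmark suite.

    #include <stdlib.h>

    #define N     4096
    #define BLOCK 64     /* tile edge; tune to the L1/L2 sizes above (assumption) */

    /* Naive transpose: the column-wise writes to 'out' stride through memory
       and lose cache locality once the matrices exceed the cache size. */
    static void transpose_naive(const double *in, double *out)
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                out[j * N + i] = in[i * N + j];
    }

    /* Blocked transpose: work on BLOCK x BLOCK tiles so both the reads and
       the writes stay within cache while a tile is being processed. */
    static void transpose_blocked(const double *in, double *out)
    {
        for (size_t ii = 0; ii < N; ii += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK; i++)
                    for (size_t j = jj; j < jj + BLOCK; j++)
                        out[j * N + i] = in[i * N + j];
    }

    int main(void)
    {
        double *in  = malloc((size_t)N * N * sizeof(double));
        double *out = malloc((size_t)N * N * sizeof(double));
        for (size_t i = 0; i < (size_t)N * N; i++) in[i] = (double)i;

        transpose_naive(in, out);      /* time these two calls to compare */
        transpose_blocked(in, out);

        free(in);
        free(out);
        return 0;
    }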
31. Concluding Remarks
This presentation examined performance differences for the cubed sphere benchmark on Harpertown nodes by varying the active core count per node, e.g., 38% better runtime at 2 cores per node vs. 8.
MPI performance degrades if core count and problem size are fixed but core density increases, e.g., MPI time triples going from 2 to 8 cores per node.
32. Concluding Remarks
Cache read and write times also degrade as core density increases, e.g., about 8-fold going from 2 to 8 cores per node.
Runtime seems to be affected by the number of cores used per node due to resource contention in the multicore environment.
33. Concluding Remarks
Scheduling a job to run on fewer cores per node can improve run-time, at a cost of reduced processor utilization (which may not be a problem).
To date, most NCCS user/scientists are more concerned with science details and portability than with performance optimization. We expect their concern to grow with fixed clock rates and memory contention.
34. Concluding Remarks
Application-level optimization is more work for the user.
The direction of all processor designs is towards cells and hybrids. It is unknown whether this will work.
A hybrid approach, e.g., using OpenMP or cell processors, may improve performance by avoiding off-chip memory contention.
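A minimal sketch of the hybrid style mentioned above, assuming one MPI rank per chip (or node) with OpenMP threads filling that chip's cores, so communication crosses the chip boundary once per rank rather than once per core; the loop body is a placeholder, not GEOS-5 code.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Request thread support so OpenMP regions can coexist with MPI. */
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;

        /* On-chip parallelism: threads share the chip's cache and memory bus,
           so this loop generates no off-chip MPI traffic. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += (double)i * 1e-6;

        /* Off-chip communication happens once per rank, not once per core. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }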
36. Reference
Hennessy, John and Patterson, David. Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, San Mateo, California. The memory stride kernel is adapted from a code appearing on page 477.
37. Diagram of a Generic Dual Core Processor