The University of Jordan
School of Engineering
Computer Engineering Department
Thesis Presentation
EVALUATION OF POPULAR MULTICORE DESIGN
ALTERNATIVES USING
CONFIGURATION DEPENDENT ANALYSIS
Eng. Aieshah F. Almaslam
Supervisor: Prof. Gheith A. Abandah
Outline
• Introduction
• Motivation
• Research Objectives
• Research Methodology
• Multicore design alternatives
• Sniper multicore simulator
• Multithreaded benchmarks
• Raw experiments analysis
• Normalized experiments analysis
• Conclusions
• Future work
Introduction
Why multicores?
• Multicore architecture is the current step in processor evolution; it is a special kind of multiprocessor on a single chip.
• It is difficult to make single-core clock frequencies even higher.
• Many new applications are multithreaded.
• There is a general trend in computer architecture of shifting towards more parallelism: ILP and TLP.
Introduction (cont.)
Multicore processors can be classified as:
• Homogeneous processors: all cores are identical.
• Heterogeneous processors: the cores are not identical.
• Traditional multicores are common in mobile devices, desktops, and servers.
• Chip designers continue to increase the number of cores, leading to manycore architectures.
Figure: Multicore processor classifications.
Motivation
• Designers face many design options because of the many system parameters, such as:
• Number and speed of processor cores
• Core interconnection network (bus, 2D mesh, ring, etc.)
• Cache coherence protocol (MESI, MOESI, MESIF, etc.)
• Number and size of cache memory levels (L1, L2, L3, or more)
• Cache organization (private or shared) and cache associativity (direct-mapped or set-associative)
• Number of memory controllers; number and type of RAM (DDR2, DDR3, DDR4, etc.)
• Memory link bandwidth and speed
• Branch predictor type and penalty
• TLB size and associativity
• Hence, it is important to understand how various multicore processor designs perform with current parallel applications; the sketch below illustrates how quickly this design space grows.
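To get a feel for the size of this space, here is a minimal Python sketch that counts the combinations; the parameter names and option lists are illustrative assumptions, not the exact options studied in this work.

```python
from itertools import product

# Illustrative design parameters and options (assumed values, not the
# thesis configuration set).
design_space = {
    "cores":        [2, 4, 6, 8],
    "interconnect": ["bus", "2D mesh", "ring"],
    "coherence":    ["MESI", "MOESI", "MESIF"],
    "cache_levels": ["L1+L2", "L1+L2+L3"],
    "l2_sharing":   ["private", "shared"],
    "memory":       ["DDR2", "DDR3", "DDR4"],
}

# Every combination of one option per parameter is a distinct design point.
n_configs = 1
for options in design_space.values():
    n_configs *= len(options)
print(f"{n_configs} candidate configurations")  # 4*3*3*2*2*3 = 432

# A few concrete design points:
for choice in list(product(*design_space.values()))[:3]:
    print(dict(zip(design_space, choice)))
```

Even with this toy list of six parameters, hundreds of configurations emerge, which is why only a few representative design alternatives are selected for simulation.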
Objectives
• Evaluating and calibrating common multicore systems using raw and normalized microarchitectural simulations.
• Comparing the behavior of a wide range of multithreaded benchmarks, analyzing how their behavior changes across different microarchitectures and different input data sets.
• Studying the tradeoffs between various multicore performance metrics, using comprehensive and representative metrics such as execution time, IPC, utilization, and power.
• Finally, identifying the strengths and weaknesses of the multicore designs, and concluding which system design features have a significant impact on system performance.
Methodology
• Investigating various design options of recent multicore processors, in order to select a few representative multicore design alternatives.
• Investigating different multicore simulators. The chosen simulator should be able to efficiently simulate different multicore design options and evaluate the performance of these design alternatives.
• Investigating the available parallel benchmark applications, and selecting a representative set of them for further study.
• Implementing raw and normalized experiments for the multicore designs, and analyzing the results. In this way, we can determine the best multicore design and identify the strengths and weaknesses of each design.
Methodology (cont.)
• Computer architecture research is mainly driven by simulation (Ricco A., 2013).
• A good simulator should provide a simulation infrastructure that meets three important requirements (Carlson et al., 2014):
• Efficiency: both in time and space, by simulating only the relevant parts of the benchmark in detail, avoiding long warm-up times, and occupying a small disk footprint for storing workloads.
• Accuracy: simulation results should be representative of running the complete workload.
• Reproducibility: the unit of work must be fixed across architectures to allow comparisons to be made; workloads must be easily shareable while guaranteeing (mostly) identical simulation results.
Sniper multicore simulator
Its main features:
• Parallel, multithreaded applications
• Interval core model
• In-order and out-of-order cores
• Shared and private caches, and modern branch predictors
• Homogeneous and heterogeneous configurations
• McPAT integration for power modeling
• CPI stacks and advanced visualization to gain insight into lost cycles
• Validated against real hardware (Intel Core 2 and Nehalem microarchitectures)
Multithreaded benchmarks
Splash-2 (Stanford Parallel Applications for Shared Memory):
• FFT
• Radix
• Cholesky
• LU.cont

PARSEC (Princeton Application Repository for Shared-Memory Computers):
• Blackscholes
• Canneal
• Fluidanimate
• Swaptions
Multicore design alternatives
1. Dunnington microarchitecture
Intel Xeon X7460 processor
• Intel's first multicore with more than two cores
• Introduced in Sep 2008
• Six cores per socket
• Bus-based interconnect (FSB)
• Shared L2 for each pair of cores
• Shared L3 per socket
• DDR2 memory
For a fair comparison, we used eight cores in a multisocket design (four cores per socket).
Multicore design alternatives
2. Gainestown microarchitecture
Intel Xeon Processor X5550
• Introduced in Jan 2009
• Private L1 and L2 caches
• Shared L3 cache
• Bus-based interconnect
• 2 QPI links
• DDR3 memory
For a fair comparison, we used eight cores in a multisocket design (four cores per socket).
Multicore design alternatives
3. Haswell microarchitecture
Intel Xeon Processor E5-2667 v3
• Introduced in Sep 2014
• Private L1 and L2 caches
• Shared L3 cache
• Bidirectional ring NoC
• DDR4 memory
Multicore design alternatives
4. Knights Corner microarchitecture
Intel Xeon Phi Coprocessor 5110P
• Introduced in Nov 2012
• Only private L1 and L2 caches
• GDDR5 memory
• Sniper does not model a ring topology without a NUCA cache; therefore, we modified the network to a Tilera-like 2D mesh (4x2).
• Notably, Intel uses a 2D mesh network in the next-generation Xeon Phi line (KNL); a small sketch comparing the two topologies follows.
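For rough intuition about the topology change, here is a minimal Python sketch (an illustration, not one of the thesis experiments) comparing the average hop count between node pairs on an 8-node bidirectional ring and on a 4x2 mesh. Hop count is only one factor; router latency and link bandwidth matter at least as much.

```python
from itertools import combinations

# 8-node bidirectional ring: the distance between two nodes is the
# shorter way around the ring.
N = 8
ring_hops = [min(abs(a - b), N - abs(a - b))
             for a, b in combinations(range(N), 2)]

# 4x2 mesh with XY (dimension-ordered) routing: the distance is the
# Manhattan distance between the two tiles.
tiles = [(x, y) for x in range(4) for y in range(2)]
mesh_hops = [abs(ax - bx) + abs(ay - by)
             for (ax, ay), (bx, by) in combinations(tiles, 2)]

print(f"ring: mean {sum(ring_hops) / len(ring_hops):.2f} hops")  # 2.29
print(f"mesh: mean {sum(mesh_hops) / len(mesh_hops):.2f} hops")  # 2.00
```

At this small scale the mesh's average hop count is even slightly lower than the ring's, so the Xeon Phi network overhead observed later stems from routing-protocol and coherence traffic rather than raw distance.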
Raw experiments analysis
Total Execution Time (LB)

Average execution time (ms):

Design        Small data sets   Large data sets
Dunnington    39.779125         150.2875
Gainestown    23.8165           88.24125
Xeon Phi      100.52125         340.5038
Haswell       21.622625         73.18375
• As the problem is scaled up, execution time increases (quantified in the sketch below).
• Haswell exhibits the best performance due to its high-speed ring topology, large L3 NUCA cache, and faster and larger components.
• Xeon Phi exhibits the lowest performance due to the lack of an L3 cache and the smaller and slower core components.
• Gainestown performs better than Dunnington: faster core-to-core communication, because it uses QPI instead of the FSB.
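As a quick check on the scaling observation, this minimal Python sketch uses the averages from the table above to compute how much each design slows down when moving from the small to the large data sets:

```python
# Average execution times (ms) from the raw experiments above.
small = {"Dunnington": 39.779125, "Gainestown": 23.8165,
         "Xeon Phi": 100.52125, "Haswell": 21.622625}
large = {"Dunnington": 150.2875, "Gainestown": 88.24125,
         "Xeon Phi": 340.5038, "Haswell": 73.18375}

# Slowdown factor per design when the problem is scaled up.
for design in small:
    print(f"{design}: {large[design] / small[design]:.2f}x")
# Dunnington 3.78x, Gainestown 3.71x, Xeon Phi 3.39x, Haswell 3.38x
```

All four designs slow down by a factor of roughly 3.4 to 3.8, so the ranking between designs is stable across data-set sizes.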
Raw experiments analysis
L2 miss rate (%) (LB)

Design        Small data sets   Large data sets
Dunnington    28.8345           26.875
Gainestown    49.1823           46.6301
Xeon Phi      40.58488          37.1618
Haswell       45.23875          37.1979
• Dunnington has the lowest miss rates due to its larger L2 cache (3 MB).
• Xeon Phi also has a larger L2 (512 KB) than the remaining designs, yet its miss rate is still higher than Dunnington's.
• Even though Haswell and Gainestown exhibit worse L2 miss rates, they can hide these misses with their high-speed QPI links and other fast components. (The miss-rate metric is defined below.)
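For reference, the per-level miss rate reported in these charts is the standard ratio (assumed here to match how Sniper reports cache statistics):

```latex
\text{L2 miss rate} = \frac{\text{L2 misses}}{\text{L2 accesses}} \times 100\%
```

A larger cache, such as Dunnington's 3 MB shared L2, lowers this ratio by keeping more of the working set on chip.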
Normalized experiments analysis
Total execution time

Design         Small data sets (ms)   Large data sets (ms)
nDunnington    22.0816                72.883
nGainestown    21.9469                73.646
nXeon Phi      23.2281                77.25
nHaswell       20.464                 69.147
• As the problem is scaled up, execution time increases.
• nHaswell exhibits the best performance due to its high-speed ring and large NUCA cache.
• nXeon Phi exhibits the lowest performance due to the lack of an L3 cache and the extra time spent in network routing.
• nGainestown performs better than nDunnington: faster core-to-core communication.
Normalized experiments analysis
Average core IPC (HB)

Design         Small data sets   Large data sets
nDunnington    1.285             1.313
nGainestown    1.3033            1.33
nXeon Phi      1.237             1.273
nHaswell       1.435             1.391
The IPC graphs show the same behavioral differences between the design alternatives as the execution-time graphs, because all of the designs run at the same clock frequency; the relation is made explicit below.
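This follows from the usual performance equation: with a fixed instruction count I and clock frequency f, execution time is inversely proportional to IPC, so at equal frequency the two metrics rank the designs identically:

```latex
T_{\text{exec}} = \frac{\text{cycles}}{f} = \frac{I}{\text{IPC} \cdot f}
```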
Normalized experiments analysis
Canneal CPI stack, large data set (case study)
• The CPI stack quantifies where the cycles went, splitting them among memory, branch, compute, and synchronization components (sketched below).
• It gives quick insight into design bottlenecks.
• Canneal is a memory-intensive application (its memory contribution is close to 70%).
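As a minimal sketch of the idea (with illustrative cycle counts, not the thesis data): each component's lost cycles are divided by the instruction count, and the per-component CPIs sum to the total CPI, so stacking them shows where the cycles went.

```python
# Hypothetical cycle breakdown for one core; illustrative numbers only.
instructions = 1_000_000
cycles_by_component = {
    "compute": 400_000,  # base execution / issue cycles
    "memory":  900_000,  # stalls on cache misses and DRAM accesses
    "branch":   60_000,  # branch misprediction penalties
    "sync":    140_000,  # waiting on locks and barriers
}

# Each CPI-stack entry is cycles / instructions; entries sum to total CPI.
cpi_stack = {k: v / instructions for k, v in cycles_by_component.items()}
total_cpi = sum(cpi_stack.values())

for component, cpi in cpi_stack.items():
    print(f"{component:8s} CPI {cpi:.2f} ({cpi / total_cpi:.0%} of cycles)")
print(f"total    CPI {total_cpi:.2f}")
```

With these made-up numbers the memory component dominates at 60% of the cycles, mirroring the roughly 70% memory contribution observed for Canneal.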
Normalized experiments analysis
Canneal CPI stack, large data set (case study)
• nXeon Phi shows the largest memory contribution.
• The nHaswell ring NoC services synchronization overhead better than the others (0.23 up to a maximum of 0.25 CPI over time), while nXeon Phi with the 2D mesh has more synchronization overhead (0.4 to 0.43 CPI over time).
Normalized experiments analysis
Canneal CPI stack, large data set (case study)
• The bus-based systems exhibit medium performance, but nDunnington shows a larger memory contribution in the CPI stack (1.54-1.55 CPI) than nGainestown (1.51-1.52 CPI), which benefits from having more available L2 cache banks.
Normalized experiments analysis
Average core utilization (large data sets)
The utilization differences are very small, but they give some indications:
• nHaswell achieves higher utilization because of its small cycle losses.
• nXeon Phi exhibits a lower utilization percentage due to cycles lost on memory. (Utilization is defined below.)
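For reference, core utilization is the fraction of cycles a core spends doing useful work rather than stalling or idling (a standard definition, assumed here to match Sniper's reporting):

```latex
\text{Utilization} = \frac{\text{busy cycles}}{\text{total cycles}} \times 100\%
```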
Normalized experiments analysis
Average dynamic runtime power (LB)

Design         Small data sets (W)   Large data sets (W)
nDunnington    48.659                49.99
nGainestown    49.011                50.147
nXeon Phi      46.777                48.14
nHaswell       53.433                52.389
• nHaswell consumes the most power: it has the highest utilization.
• nXeon Phi consumes the least power: it has the most idle time.
• nGainestown consumes more power than nDunnington: it achieves higher utilization because of the high-speed QPI. (An illustrative energy estimate follows.)
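Power alone does not capture the tradeoff; combining it with the normalized large-data-set execution times above gives a rough energy estimate (E = P x T). This is an illustration derived from the two tables, not a result reported in the thesis:

```python
# Normalized large-data-set averages from the tables above.
power_w = {"nDunnington": 49.99, "nGainestown": 50.147,
           "nXeon Phi": 48.14, "nHaswell": 52.389}
time_ms = {"nDunnington": 72.883, "nGainestown": 73.646,
           "nXeon Phi": 77.25, "nHaswell": 69.147}

# Energy (mJ) = power (W) * time (ms); lower is better.
for design in power_w:
    print(f"{design}: {power_w[design] * time_ms[design]:.0f} mJ")
# nDunnington 3643, nGainestown 3693, nXeon Phi 3719, nHaswell 3623
```

Under this rough estimate, nHaswell's shorter runtime offsets its higher power draw.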
Normalized experiments analysis
Cache coherence protocols comparison (MESI and MESIF)

Average IPC, large data sets:

Design         MESI       MESIF      MESIF/MESI
nDunnington    1.313625   1.35       1.0277
nGainestown    1.3291     1.3645     1.0266
nXeon Phi      1.2738     1.286      1.0096
nHaswell       1.3918     1.3918     1.0000
• The MESIF protocol enhances the performance of multicore systems.
• Haswell does not benefit from implementing MESIF because of the L3 inclusion property in its one-socket design. (The speedup column above is computed as sketched below.)
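The MESIF/MESI column is simply the ratio of the two IPC values per design; a minimal sketch:

```python
# Average IPC under each coherence protocol (large data sets, table above).
mesi  = {"nDunnington": 1.313625, "nGainestown": 1.3291,
         "nXeon Phi": 1.2738, "nHaswell": 1.3918}
mesif = {"nDunnington": 1.35, "nGainestown": 1.3645,
         "nXeon Phi": 1.286, "nHaswell": 1.3918}

# Speedup of MESIF over MESI; 1.0000 means no benefit (Haswell's case).
for design in mesi:
    print(f"{design}: {mesif[design] / mesi[design]:.4f}")
```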
Conclusions
Analysis of the results of the normalized experiments reveals the following:
• Haswell performs best in terms of execution time and system throughput (IPC), although its power consumption is relatively high. Its advantages come from:
• Private L2 caches and a large shared L3 NUCA cache.
• The high-speed core-to-core communication provided by the bidirectional ring NoC.
• Bus-based microarchitectures are no longer able to meet the requirements of new HPC workloads due to:
• Their obvious weakness in handling communication overhead, which will only increase in future manycore architectures.
Conclusions (cont.)
• The Xeon Phi architecture achieves the lowest power consumption.
• Designers should do further research into developing its memory components, since the Xeon Phi suffers a relatively larger CPI loss in memory-intensive applications.
• The MESIF protocol enhances the performance of multicore systems in the case of multisocket systems and 2D mesh NoCs.
• As an application scales up, the memory contribution increases, resulting in a significant fraction of time being spent on misses.
Future work
Based on our research, we foresee the following areas that need further investigation:
• Studying scalability issues with manycore processors (more than 8 cores)
• Adding the MOESI cache coherence protocol to Sniper
• Studying heterogeneous multicore designs
• Studying simultaneous multithreading (SMT)
• Studying other multicore designs from AMD, ARM, Nvidia, etc.
Any Questions?
Eng. Aieshah Almaslam