The University of Jordan
School of Engineering
Computer Engineering Department
Thesis Presentation
EVALUATION OF POPULAR MULTICORE DESIGN
ALTERNATIVES USING
CONFIGURATION DEPENDENT ANALYSIS
Eng. Aieshah F. Almaslam
Supervisor: Prof. Gheith A. Abandah
Outline
• Introduction
• Motivation
• Research Objectives
• Research Methodology
• Multicore design alternatives
• Sniper multicore simulator
• Multithreaded benchmarks
• Raw experiments analysis
• Normalized experiments analysis
• Conclusions
• Future work
Introduction
Why multicores?
• Multicore architecture is the current step in processor evolution; it is a special kind of multiprocessor on a single chip.
• It is difficult to make single-core clock frequencies even higher.
• Many new applications are multithreaded.
• There is a general trend in computer architecture of shifting towards more parallelism: ILP and TLP.
Introduction (cont.)
Multicore processors can be classified as:
• Homogeneous processors: all cores are identical.
• Heterogeneous processors: the cores are not identical.
• Traditional multicores are common in mobile devices, desktops, and servers.
• Chip designers continue to increase the number of cores, leading to manycore architectures.
Figure: Multicore processor classifications.
Motivation
• Designers face many design options because of the many system parameters, such as:
• Number and speed of processor cores
• Core interconnection network (bus, 2D mesh, ring, etc.)
• Cache coherence protocol (MESI, MOESI, MESIF, etc.)
• Number and size of cache memory levels (L1, L2, L3, or more)
• Cache organization (private or shared) and cache associativity (direct-mapped or set-associative)
• Number of memory controllers; number and type of RAM (DDR2, DDR3, DDR4, etc.)
• Memory link bandwidth and speed
• Branch predictor type and penalty
• TLB size and associativity
• Hence, it is important to understand how various multicore processor designs perform with current parallel applications; the sketch below illustrates how quickly this design space grows.
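To get a feel for the size of this space, here is a minimal Python sketch that counts the combinations; the parameter names and option lists are illustrative assumptions, not the exact options studied in this work.

```python
from itertools import product

# Illustrative design parameters and options (assumed values, not the
# thesis configuration set).
design_space = {
    "cores":        [2, 4, 6, 8],
    "interconnect": ["bus", "2D mesh", "ring"],
    "coherence":    ["MESI", "MOESI", "MESIF"],
    "cache_levels": ["L1+L2", "L1+L2+L3"],
    "l2_sharing":   ["private", "shared"],
    "memory":       ["DDR2", "DDR3", "DDR4"],
}

# Every combination of one option per parameter is a distinct design point.
n_configs = 1
for options in design_space.values():
    n_configs *= len(options)
print(f"{n_configs} candidate configurations")  # 4*3*3*2*2*3 = 432

# A few concrete design points:
for choice in list(product(*design_space.values()))[:3]:
    print(dict(zip(design_space, choice)))
```

Even with this toy list of six parameters, hundreds of configurations emerge, which is why only a few representative design alternatives are selected for simulation.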
Objectives
• Evaluating and calibrating common multicore systems using raw and normalized microarchitectural simulations.
• Comparing the behavior of a wide range of multithreaded benchmarks, analyzing how their behavior changes across different microarchitectures and different input data sets.
• Studying the tradeoffs between various multicore performance metrics, using comprehensive and representative metrics such as execution time, IPC, utilization, and power.
• Finally, identifying the strengths and weaknesses of the multicore designs, and concluding which system design features have a significant impact on system performance.
Methodology
• Investigating various design options of recent multicore processors, in order to select a few representative multicore design alternatives.
• Investigating different multicore simulators. The chosen simulator should be able to efficiently simulate different multicore design options and evaluate the performance of these design alternatives.
• Investigating the available parallel benchmark applications, and selecting a representative set of them for further study.
• Implementing raw and normalized experiments for the multicore designs, and analyzing the results. In this way, we can determine the best multicore design and identify the strengths and weaknesses of each design.
Methodology (cont.)
• Computer architecture research is mainly driven by simulation (Ricco A., 2013).
• A good simulator should provide a simulation infrastructure that meets three important requirements (Carlson et al., 2014):
• Efficiency: both in time and space, by simulating only the relevant parts of the benchmark in detail, avoiding long warm-up times, and occupying a small disk footprint for storing workloads.
• Accuracy: simulation results should be representative of running the complete workload.
• Reproducibility: the unit of work must be fixed across architectures to allow comparisons to be made; workloads must be easily shareable while guaranteeing (mostly) identical simulation results.
Sniper multicore simulator
Its main features:
• Parallel, multithreaded applications
• Interval core model
• In-order and out-of-order cores
• Shared and private caches, and modern branch predictors
• Homogeneous and heterogeneous configurations
• McPAT integration for power modeling
• CPI stacks and advanced visualization to gain insight into lost cycles
• Validated against real hardware (Intel Core 2 and Nehalem microarchitectures)
Multithreaded benchmarks
Splash-2 (Stanford Parallel Applications for Shared Memory):
• FFT
• Radix
• Cholesky
• LU.cont

PARSEC (Princeton Application Repository for Shared-Memory Computers):
• Blackscholes
• Canneal
• Fluidanimate
• Swaptions
Multicore design alternatives
1. Dunnington microarchitecture
Intel Xeon X7460 processor
• Intel's first multicore with more than two cores
• Introduced in Sep 2008
• Six cores per socket
• Bus-based interconnect (FSB)
• Shared L2 for each pair of cores
• Shared L3 per socket
• DDR2 memory
For a fair comparison, we used eight cores in a multisocket design (four cores per socket).
Multicore design alternatives
2. Gainestown microarchitecture
Intel Xeon Processor X5550
• Introduced in Jan 2009
• Private L1 and L2 caches
• Shared L3 cache
• Bus-based interconnect
• 2 QPI links
• DDR3 memory
For a fair comparison, we used eight cores in a multisocket design (four cores per socket).
Multicore design alternatives
3. Haswell microarchitecture
Intel Xeon Processor E5-2667 v3
• Introduced in Sep 2014
• Private L1 and L2 caches
• Shared L3 cache
• Bidirectional ring NoC
• DDR4 memory
Multicore design alternatives
4. Knights Corner microarchitecture
Intel Xeon Phi Coprocessor 5110P
• Introduced in Nov 2012
• Only private L1 and L2 caches
• GDDR5 memory
• Sniper does not model a ring topology without a NUCA cache; therefore, we modified the network to a Tilera-like 2D mesh (4x2).
• Notably, Intel uses a 2D mesh network in the next-generation Xeon Phi line (KNL); a small sketch comparing the two topologies follows.
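For rough intuition about the topology change, here is a minimal Python sketch (an illustration, not one of the thesis experiments) comparing the average hop count between node pairs on an 8-node bidirectional ring and on a 4x2 mesh. Hop count is only one factor; router latency and link bandwidth matter at least as much.

```python
from itertools import combinations

# 8-node bidirectional ring: the distance between two nodes is the
# shorter way around the ring.
N = 8
ring_hops = [min(abs(a - b), N - abs(a - b))
             for a, b in combinations(range(N), 2)]

# 4x2 mesh with XY (dimension-ordered) routing: the distance is the
# Manhattan distance between the two tiles.
tiles = [(x, y) for x in range(4) for y in range(2)]
mesh_hops = [abs(ax - bx) + abs(ay - by)
             for (ax, ay), (bx, by) in combinations(tiles, 2)]

print(f"ring: mean {sum(ring_hops) / len(ring_hops):.2f} hops")  # 2.29
print(f"mesh: mean {sum(mesh_hops) / len(mesh_hops):.2f} hops")  # 2.00
```

At this small scale the mesh's average hop count is even slightly lower than the ring's, so the Xeon Phi network overhead observed later stems from routing-protocol and coherence traffic rather than raw distance.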
Raw experiments analysis
Total Execution Time (LB)

Average execution time (ms):

Design        Small data sets   Large data sets
Dunnington    39.779125         150.2875
Gainestown    23.8165           88.24125
Xeon Phi      100.52125         340.5038
Haswell       21.622625         73.18375
• As the problem is scaled up, execution time increases (quantified in the sketch below).
• Haswell exhibits the best performance due to its high-speed ring topology, large L3 NUCA cache, and faster and larger components.
• Xeon Phi exhibits the lowest performance due to the lack of an L3 cache and the smaller and slower core components.
• Gainestown performs better than Dunnington: faster core-to-core communication, because it uses QPI instead of the FSB.
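As a quick check on the scaling observation, this minimal Python sketch uses the averages from the table above to compute how much each design slows down when moving from the small to the large data sets:

```python
# Average execution times (ms) from the raw experiments above.
small = {"Dunnington": 39.779125, "Gainestown": 23.8165,
         "Xeon Phi": 100.52125, "Haswell": 21.622625}
large = {"Dunnington": 150.2875, "Gainestown": 88.24125,
         "Xeon Phi": 340.5038, "Haswell": 73.18375}

# Slowdown factor per design when the problem is scaled up.
for design in small:
    print(f"{design}: {large[design] / small[design]:.2f}x")
# Dunnington 3.78x, Gainestown 3.71x, Xeon Phi 3.39x, Haswell 3.38x
```

All four designs slow down by a factor of roughly 3.4 to 3.8, so the ranking between designs is stable across data-set sizes.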
Raw experiments analysis
L2 miss rate (%) (LB)

Design        Small data sets   Large data sets
Dunnington    28.8345           26.875
Gainestown    49.1823           46.6301
Xeon Phi      40.58488          37.1618
Haswell       45.23875          37.1979
• Dunnington has the lowest miss rates due to its larger L2 cache (3 MB).
• Xeon Phi also has a larger L2 (512 KB) than the remaining designs, yet its miss rate is still higher than Dunnington's.
• Even though Haswell and Gainestown exhibit worse L2 miss rates, they can hide these misses with their high-speed QPI links and other fast components. (The miss-rate metric is defined below.)
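For reference, the per-level miss rate reported in these charts is the standard ratio (assumed here to match how Sniper reports cache statistics):

```latex
\text{L2 miss rate} = \frac{\text{L2 misses}}{\text{L2 accesses}} \times 100\%
```

A larger cache, such as Dunnington's 3 MB shared L2, lowers this ratio by keeping more of the working set on chip.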
Normalized experiments analysis
Total execution time

Design         Small data sets (ms)   Large data sets (ms)
nDunnington    22.0816                72.883
nGainestown    21.9469                73.646
nXeon Phi      23.2281                77.25
nHaswell       20.464                 69.147
• As the problem is scaled up, execution time increases.
• nHaswell exhibits the best performance due to its high-speed ring and large NUCA cache.
• nXeon Phi exhibits the lowest performance due to the lack of an L3 cache and the extra time spent in network routing.
• nGainestown performs better than nDunnington: faster core-to-core communication.
Normalized experiments analysis
Average core IPC (HB)

Design         Small data sets   Large data sets
nDunnington    1.285             1.313
nGainestown    1.3033            1.33
nXeon Phi      1.237             1.273
nHaswell       1.435             1.391
The IPC graphs show the same behavioral differences between the design alternatives as the execution-time graphs, because all of the designs run at the same clock frequency; the relation is made explicit below.
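This follows from the usual performance equation: with a fixed instruction count I and clock frequency f, execution time is inversely proportional to IPC, so at equal frequency the two metrics rank the designs identically:

```latex
T_{\text{exec}} = \frac{\text{cycles}}{f} = \frac{I}{\text{IPC} \cdot f}
```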
Normalized experiments analysis
Canneal CPI stack, large data set (case study)
• The CPI stack quantifies where the cycles went, splitting them among memory, branch, compute, and synchronization components (sketched below).
• It gives quick insight into design bottlenecks.
• Canneal is a memory-intensive application (its memory contribution is close to 70%).
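As a minimal sketch of the idea (with illustrative cycle counts, not the thesis data): each component's lost cycles are divided by the instruction count, and the per-component CPIs sum to the total CPI, so stacking them shows where the cycles went.

```python
# Hypothetical cycle breakdown for one core; illustrative numbers only.
instructions = 1_000_000
cycles_by_component = {
    "compute": 400_000,  # base execution / issue cycles
    "memory":  900_000,  # stalls on cache misses and DRAM accesses
    "branch":   60_000,  # branch misprediction penalties
    "sync":    140_000,  # waiting on locks and barriers
}

# Each CPI-stack entry is cycles / instructions; entries sum to total CPI.
cpi_stack = {k: v / instructions for k, v in cycles_by_component.items()}
total_cpi = sum(cpi_stack.values())

for component, cpi in cpi_stack.items():
    print(f"{component:8s} CPI {cpi:.2f} ({cpi / total_cpi:.0%} of cycles)")
print(f"total    CPI {total_cpi:.2f}")
```

With these made-up numbers the memory component dominates at 60% of the cycles, mirroring the roughly 70% memory contribution observed for Canneal.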
Normalized experiments analysis
Canneal CPI stack, large data set (case study)
• nXeon Phi shows the largest memory contribution.
• The nHaswell ring NoC services synchronization overhead better than the others (0.23 up to a maximum of 0.25 CPI over time), while nXeon Phi with the 2D mesh has more synchronization overhead (0.4 to 0.43 CPI over time).
Normalized experiments analysis
Canneal CPI stack, large data set (case study)
• The bus-based systems exhibit medium performance, but nDunnington shows a larger memory contribution in the CPI stack (1.54-1.55 CPI) than nGainestown (1.51-1.52 CPI), which benefits from having more available L2 cache banks.
Normalized experiments analysis
Average core utilization (large data sets)
The utilization differences are very small, but they give some indications:
• nHaswell achieves higher utilization because of its small cycle losses.
• nXeon Phi exhibits a lower utilization percentage due to cycles lost on memory. (Utilization is defined below.)
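For reference, core utilization is the fraction of cycles a core spends doing useful work rather than stalling or idling (a standard definition, assumed here to match Sniper's reporting):

```latex
\text{Utilization} = \frac{\text{busy cycles}}{\text{total cycles}} \times 100\%
```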
Normalized experiments analysis
Average dynamic runtime power (LB)

Design         Small data sets (W)   Large data sets (W)
nDunnington    48.659                49.99
nGainestown    49.011                50.147
nXeon Phi      46.777                48.14
nHaswell       53.433                52.389
• nHaswell consumes the most power: it has the highest utilization.
• nXeon Phi consumes the least power: it has the most idle time.
• nGainestown consumes more power than nDunnington: it achieves higher utilization because of the high-speed QPI. (An illustrative energy estimate follows.)
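Power alone does not capture the tradeoff; combining it with the normalized large-data-set execution times above gives a rough energy estimate (E = P x T). This is an illustration derived from the two tables, not a result reported in the thesis:

```python
# Normalized large-data-set averages from the tables above.
power_w = {"nDunnington": 49.99, "nGainestown": 50.147,
           "nXeon Phi": 48.14, "nHaswell": 52.389}
time_ms = {"nDunnington": 72.883, "nGainestown": 73.646,
           "nXeon Phi": 77.25, "nHaswell": 69.147}

# Energy (mJ) = power (W) * time (ms); lower is better.
for design in power_w:
    print(f"{design}: {power_w[design] * time_ms[design]:.0f} mJ")
# nDunnington 3643, nGainestown 3693, nXeon Phi 3719, nHaswell 3623
```

Under this rough estimate, nHaswell's shorter runtime offsets its higher power draw.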
Normalized experiments analysis
Cache coherence protocols comparison (MESI and MESIF)

Average IPC, large data sets:

Design         MESI       MESIF      MESIF/MESI
nDunnington    1.313625   1.35       1.0277
nGainestown    1.3291     1.3645     1.0266
nXeon Phi      1.2738     1.286      1.0096
nHaswell       1.3918     1.3918     1.0000
• The MESIF protocol enhances the performance of multicore systems.
• Haswell does not benefit from implementing MESIF because of the L3 inclusion property in its one-socket design. (The speedup column above is computed as sketched below.)
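The MESIF/MESI column is simply the ratio of the two IPC values per design; a minimal sketch:

```python
# Average IPC under each coherence protocol (large data sets, table above).
mesi  = {"nDunnington": 1.313625, "nGainestown": 1.3291,
         "nXeon Phi": 1.2738, "nHaswell": 1.3918}
mesif = {"nDunnington": 1.35, "nGainestown": 1.3645,
         "nXeon Phi": 1.286, "nHaswell": 1.3918}

# Speedup of MESIF over MESI; 1.0000 means no benefit (Haswell's case).
for design in mesi:
    print(f"{design}: {mesif[design] / mesi[design]:.4f}")
```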
Conclusions
Analysis of the results of the normalized experiments reveals the following:
• Haswell performs best in terms of execution time and system throughput (IPC), although its power consumption is relatively high. Its advantages come from:
• Private L2 caches and a large shared L3 NUCA cache.
• The high-speed core-to-core communication provided by the bidirectional ring NoC.
• Bus-based microarchitectures are no longer able to meet the requirements of new HPC workloads due to:
• Their obvious weakness in handling communication overhead, which will only increase in future manycore architectures.
Conclusions (cont.)
• The Xeon Phi architecture achieves the lowest power consumption.
• Designers should do further research into developing its memory components, since the Xeon Phi suffers a relatively larger CPI loss in memory-intensive applications.
• The MESIF protocol enhances the performance of multicore systems in the case of multisocket systems and 2D mesh NoCs.
• As an application scales up, the memory contribution increases, resulting in a significant fraction of time being spent on misses.
Future work
Based on our research, we foresee the following areas that need further investigation:
• Studying scalability issues with manycore processors (more than 8 cores)
• Adding the MOESI cache coherence protocol to Sniper
• Studying heterogeneous multicore designs
• Studying simultaneous multithreading (SMT)
• Studying other multicore designs from AMD, ARM, Nvidia, etc.
Any Questions?
Eng. Aieshah Almaslam