3. Motivation
• Modern processors are diverse in:
– Optimization objectives: performance, energy
– Workloads: multimedia, encryption, network …
– Scale: embedded systems to data centers
• A single monolithic core cannot fulfill all requirements
• This has led to two broad classes of processors:
– Narrow in-order (InO) cores, e.g. Intel Xeon Phi
– Wide out-of-order (OoO) cores, e.g. Intel Sandy Bridge and IBM POWER7 (8 cores)
4. Motivation
• Next step: use different types of core in the same processor => asymmetric multicore processor (AMP)
• AMPs can
– Provide better energy efficiency than SMPs and per-core DVFS
– Optimize for thread-level or instruction-level parallelism
– Allow turning off unused cores to save energy
5. Classification of AMPs
• Static AMP: the configuration of cores is fixed statically
• Reconfigurable AMP: the microarchitecture can be reconfigured dynamically to provide cores with different resources
6. Examples of Static AMPs
[Figure: an asymmetric multicore with cores C1–C5 of different sizes vs. a symmetric multicore with a grid of identical cores C]
7. Examples of Static AMPs
[Figure: nine power-equivalent multicores (B = big core, m = medium core, s = small core)]
Generally, two core types are sufficient for providing most benefits of heterogeneity
Eyerman et al. ASPLOS'14
9. Terminology
Different terminologies for an AMP:
asymmetric multicore (AMC), asymmetric multicore systems (ASYMS), asymmetric multiprocessor systems (ASMP), asymmetric chip multiprocessors (ACMP), heterogeneous microarchitectures (HM), heterogeneous multicore processor (HMP), heterogeneous CMP (HCMP), asymmetric cluster CMP (ACCMP), big.LITTLE system
Different terminologies for cores of an AMP:
big/little (or big/small), fast/slow, complex/simple, aggressive/lightweight, strong/weak cores, application/low-power processor (AP/LP), central/peripheral processor
Different terminologies for reconfigurable AMPs and/or techniques for architecting them:
reconfigurable, configurable, adaptive, scalable, composable, composite, coalition, conjoined, federated, polymorphous, morphable, core morphing, core fusion, flexible, dynamic and united processors
10. Types of Heterogeneity in AMPs
(uArch = microarchitecture, freq. = frequency, diff. = different)
• Koufaty et al. [2010] (basis: nature of asymmetry):
– Performance asymmetry (same ISA; diff. uArch, cache size, freq.)
– Functional asymmetry (diff. ISA and uArch)
• Srinivasan et al. [2011] (basis: nature of asymmetry):
– Virtual asymmetric (same uArch & ISA; diff. freq. or cache size)
– Physical asymmetric (same ISA; diff. uArch, e.g. InO vs OoO, and freq.)
– Hybrid cores (diff. ISA and uArch)
• Khan and Kundu [2010] (basis: how asymmetry is introduced):
– Extemporaneous heterogeneity (performance of a core altered by DVFS or hardware reconfiguration)
– Deliberate heterogeneity (diff. uArch, ISA and specialization, e.g. CPU and GPU)
11. Classification based on performance ordering
[Figure: configuration of Alpha (EV5, EV6) cores and an x86 core]
• Performance of EV6 > EV5 for all apps => AMP with monotonic cores
• Neither Alpha nor x86 is optimal for all apps => AMP with non-monotonic cores
12. Architectural configuration of four ARM processors and performance on an XML parsing benchmark
• Cortex A15 and A7: same ISA but different architecture
• Cortex A57 and A53: same ISA but different architecture
• All four processors can have 1 to 4 cores per cluster
14. Benefits of AMPs
• AMPs are a natural choice for systems with diverse applications and usage scenarios
• Big core => better performance
• Small core => better energy efficiency
• However, no winner on EDP metric!
• Big core => better EDP for compute-intensive apps with
high data reuse
• Small core => better EDP for memory-intensive apps
with little data reuse and many atomic operations
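The EDP claims above can be made concrete with illustrative, made-up numbers (not from the slides): for a compute-intensive app the big core finishes much faster, amortizing its higher power, while for a memory-intensive app the small core is nearly as fast at a fraction of the power.

```python
def edp(power_w, time_s):
    """Energy-delay product: (power * time) * time = energy * delay."""
    return power_w * time_s * time_s

# Illustrative numbers: big core draws 4 W, small core draws 1 W.
# Compute-intensive app: big core is 2.5x faster -> big core wins on EDP.
print(edp(4.0, 1.0), edp(1.0, 2.5))   # big: 4.0, small: 6.25
# Memory-intensive app: big core barely faster -> small core wins on EDP.
print(edp(4.0, 2.0), edp(1.0, 2.2))   # big: 16.0, small: ~4.84
```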
15. Challenges of AMPs
• Conventional software is designed for SMPs; many changes are required to support AMPs
• AMP cores should cover a wide and evenly spread
range of performance/complexity design space
• Scheduling complexity in AMP increases exponentially
with rising number of core types and applications
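The exponential growth in scheduling complexity can be sketched with a quick calculation: with k core types and n threads, each thread can independently be assigned to any core type, giving k^n distinct mappings (an illustrative count, ignoring per-core capacity constraints).

```python
# Number of distinct ways to assign each of n threads to one of k core types.
# Illustrates why exhaustive search over mappings quickly becomes infeasible.

def mapping_count(core_types: int, threads: int) -> int:
    """Each thread independently picks a core type: core_types ** threads."""
    return core_types ** threads

for k in (2, 3, 4):
    for n in (4, 8, 16):
        print(f"{k} core types, {n} threads -> {mapping_count(k, n):,} mappings")
```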
16. Challenges of AMPs
• In some AMPs, the ISA, OS and programming model of different cores are also different => they present even more challenges
• AMPs are not widely available
• Some works use DVFS (or clock throttling) to
emulate asymmetric cores, however,
– it over-simplifies challenges of a real AMP =>
inaccurate conclusions
– cannot model non-monotonic cores
17. Thread migration overheads
• In static AMPs, thread migration may take millions of cycles, e.g., in an AMP with Cortex A15 and A7:
– migration latency from A15 to A7: 3.75 ms
– from A7 to A15: 2.10 ms
• Flushing and warming of caches etc. => additional overheads
• Hence, migration can be performed only once every few million instructions
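A back-of-the-envelope check of this amortization argument, assuming a 1 GHz clock (the slides do not state a frequency): a 3.75 ms migration costs millions of cycles, so keeping migration overhead under 1% requires hundreds of millions of cycles between migrations.

```python
def cycles_lost(latency_s: float, freq_hz: float) -> float:
    """Cycles consumed by one migration at the given clock frequency."""
    return latency_s * freq_hz

def min_interval_cycles(latency_s: float, freq_hz: float, max_overhead: float) -> float:
    """Minimum cycles between migrations so the cost stays under max_overhead."""
    return cycles_lost(latency_s, freq_hz) / max_overhead

FREQ = 1e9  # assumed 1 GHz clock (illustrative)
print(f"cycles lost per A15->A7 migration: {cycles_lost(3.75e-3, FREQ):.0f}")
print(f"interval for <1% overhead: {min_interval_cycles(3.75e-3, FREQ, 0.01):.0f} cycles")
```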
18. Challenge of maintaining fairness
• Fairness: important for meeting QoS guarantees
• In an AMP, some threads may be unfairly slowed down => starvation & unpredictable per-task performance
• In a multithreaded app, performance advantage of big
core may be completely negated if thread running on it
stalls waiting for other threads
[Figure: big core C0 and small cores C1–C3 at a synchronization barrier; thread 0 on the big core stalls waiting for the other threads]
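The barrier effect can be sketched numerically (illustrative numbers, not from the slides): even if the big core runs its thread twice as fast, the phase ends only when the slowest small-core thread reaches the barrier, so the big core's advantage is completely negated.

```python
def barrier_phase_time(work_per_thread, core_speeds):
    """Time for all threads to reach the barrier = slowest thread's time."""
    return max(w / s for w, s in zip(work_per_thread, core_speeds))

work = [100.0] * 4            # equal work per thread (illustrative)
amp  = [2.0, 1.0, 1.0, 1.0]   # one big core (2x speed) + three small cores
smp  = [1.0, 1.0, 1.0, 1.0]   # four small cores

print(barrier_phase_time(work, amp))  # 100.0 - big core waits at the barrier
print(barrier_phase_time(work, smp))  # 100.0 - same phase time without a big core
```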
19. Challenges of AMPs
• Some AMP designs use non-standard ISAs or compiler
support => may not find wide adoption
• Unpredictability: An asymmetry-unaware scheduler
may schedule different threads to fast or slow cores in
different runs => variable performance.
21. App/thread mapping strategies
• The most important challenge in AMPs: finding the
right core for running a thread
• The right choice depends on:
– Optimization target
– Application property
– Core property
• We will discuss some mapping (scheduling)
strategies
22. Estimating performance for scheduling
To make scheduling decisions, thread performance on different core types must be known.
• Option 1: Estimate performance of a thread on a core type without actually running the thread on that core type, e.g., using mathematical models
– Hardware-specific, error-prone
• Option 2: Actually run threads on each core type to sample performance
– High profiling overhead
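Option 2 can be sketched as follows: briefly run each thread on each core type, record the observed IPC, then use the samples to pick a core per thread. The `measure_ipc` callback is a hypothetical stand-in for a real profiler reading hardware performance counters.

```python
# Sketch of sampling-based performance estimation (Option 2), assuming a
# hypothetical measure_ipc(thread, core_type) callback that runs the thread
# briefly on the given core type and returns the observed IPC.

def profile(threads, core_types, measure_ipc):
    """Return {(thread, core): ipc} by sampling every combination."""
    return {(t, c): measure_ipc(t, c) for t in threads for c in core_types}

def best_core(thread, table, core_types):
    """Core type on which the thread achieved the highest sampled IPC."""
    return max(core_types, key=lambda c: table[(thread, c)])

# Illustrative stand-in for real counter measurements:
fake_ipc = {("t0", "big"): 2.0, ("t0", "small"): 0.7,
            ("t1", "big"): 1.1, ("t1", "small"): 1.0}
table = profile(["t0", "t1"], ["big", "small"], lambda t, c: fake_ipc[(t, c)])
print(best_core("t0", table, ["big", "small"]))  # t0 gains most from the big core
```

Note the cost this sketch hides: every (thread, core type) pair must actually execute, which is exactly the "high profiling overhead" the slide mentions.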
23. App/thread mapping strategies
CPI breakdown for representative cases:
(a) CPI dominated by external stalls => suitable for small core
(b) CPI dominated by execution cycles => suitable for big core
(c) CPI dominated by internal stalls
Koufaty et al. EuroSys’10
24. App/thread mapping strategies
• Load on different threads is imbalanced
– Map the slowest thread to the big core
• Different VMs running on a host have different
resource requirements
– VM with higher number of `virtual CPUs' gets big core
• App with high ILP => map to a wide-issue
superscalar processor which can issue several
instructions every cycle
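The first strategy (give big cores to the slowest threads under load imbalance) can be sketched like this, using each thread's progress (e.g., instructions retired) as the imbalance signal; the numbers are illustrative.

```python
def assign_cores(progress, n_big):
    """Give the n_big big cores to the threads with least progress so far."""
    order = sorted(progress, key=progress.get)  # least progress (slowest) first
    return {t: ("big" if i < n_big else "small") for i, t in enumerate(order)}

progress = {"t0": 800, "t1": 450, "t2": 900}  # instructions retired (illustrative)
print(assign_cores(progress, n_big=1))  # t1 is slowest -> gets the big core
```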
25. App/thread mapping strategies
Big core:
• Sequential phases
• Compute-intensive apps
• Apps with low miss-rate
• Threads whose benefit from running on big core is large
• Thread with largest remaining execution time
• Application code
Small core:
• Highly-parallel phases
• I/O-intensive apps
• Apps with high miss-rate
• Threads whose benefit from running on big core is small
• Thread with small remaining execution time
• OS kernel code, virtualization helper code & device interrupts
26. App/thread mapping strategies
Big core:
• High priority apps
• Multimedia-intensive apps
Small core:
• Low priority apps
• Service daemons and background processes, sensor sampling and buffering tasks
27. Example of fairness-oriented scheduling schemes
• ‘Equal-time’: run each thread on each core type for an equal amount of time
• ‘Equal-progress’: aims to get equal work done in all threads
– Idea 1: Schedule thread with currently largest
slowdown on big core.
– Idea 2: Whenever difference in progress of different
threads becomes too high, swap them
Van Craeynest et al. PACT’13
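Idea 1 of the equal-progress policy can be sketched as: each quantum, estimate every thread's slowdown (progress it would have made alone divided by its actual progress) and hand the big core to the most slowed-down thread. The progress numbers below are made up for illustration.

```python
def pick_for_big_core(actual, isolated):
    """Return the thread with the largest slowdown = isolated / actual progress."""
    slowdown = {t: isolated[t] / actual[t] for t in actual}
    return max(slowdown, key=slowdown.get)

actual   = {"t0": 50, "t1": 80, "t2": 30}     # work done so far (illustrative)
isolated = {"t0": 100, "t1": 100, "t2": 100}  # work each would have done alone
print(pick_for_big_core(actual, isolated))    # t2 is most slowed down
```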
28. Use of DVFS along with thread scheduling
• Provides further opportunities to exercise the performance/energy tradeoff
• Estimate throughput/Watt of program phase at
different voltage/frequency (V/F) levels on all core
types.
• Based on this, best thread-to-core mapping and V/F
values are selected
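The selection step can be sketched as an exhaustive search over (thread-to-core mapping, V/F level) pairs that maximizes estimated throughput per watt. The `tput` and `power` tables are hypothetical offline estimates, and a single chip-wide V/F level is assumed for simplicity.

```python
from itertools import permutations

def best_config(threads, cores, vf_levels, tput, power):
    """Pick (mapping, V/F) maximizing total throughput / total power.
    tput[(thread, core, vf)] and power[(core, vf)] are assumed estimates."""
    best, best_score = None, -1.0
    for order in permutations(cores, len(threads)):  # one distinct core per thread
        for vf in vf_levels:                         # chip-wide V/F (simplification)
            t = sum(tput[(th, c, vf)] for th, c in zip(threads, order))
            p = sum(power[(c, vf)] for c in order)
            if t / p > best_score:
                best_score, best = t / p, (dict(zip(threads, order)), vf)
    return best, best_score

# Illustrative (made-up) per-phase estimates:
tput = {("t0", "big", "hi"): 3.0, ("t0", "big", "lo"): 2.0,
        ("t0", "small", "hi"): 1.2, ("t0", "small", "lo"): 1.0,
        ("t1", "big", "hi"): 1.5, ("t1", "big", "lo"): 1.2,
        ("t1", "small", "hi"): 1.1, ("t1", "small", "lo"): 1.0}
power = {("big", "hi"): 4.0, ("big", "lo"): 2.0,
         ("small", "hi"): 1.0, ("small", "lo"): 0.5}

(mapping, vf), score = best_config(["t0", "t1"], ["big", "small"],
                                   ["lo", "hi"], tput, power)
print(mapping, vf, round(score, 2))
```

With these numbers the search prefers the low V/F level: the throughput gain at high V/F does not pay for its power cost.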
29. Challenges of different thread scheduling policies
Static scheduling:
• Works by collecting data via offline analysis
• Cannot account for different input sets and application phases
• Becomes infeasible with increasing number of co-running applications
Dynamic scheduling:
• Works by collecting data at runtime
• Incurs thread migration overhead
• Ineffective for short-lived threads since the profiling phase itself may form a large majority of their lifetime
31. Motivation: Need of fine-grained switching
[Figure: variance of IPC in gcc over 300K instructions]
32. Need of fine-grained switching
[Figure: coarse-grained vs. fine-grained heterogeneity]
Fallin et al. ICCD’14
33. Reconfigurable AMPs
• Benefits: No thread migration overheads
• Challenges: Reconfiguration incurs latency and energy
overheads, e.g., I/D-cache flushes and data migration
• Avoiding this may require: a complex compiler, custom
ISA, 3D stacking, changes to OS and application binary.
• Tradeoffs:
– Centralized resources: saves area, but presents scalability
bottleneck
– High adaptation granularity: allows exploiting different
levels of ILP and TLP but precludes specialization for
accelerating specific applications
34. Benefits of reconfigurable AMPs
• Allow flexibly scaling up to exploit MLP and ILP in
single-threaded apps
• Allow scaling down to exploit TLP in multithreaded
apps
• Provide better HW utilization and resilience to errors, since one hard error need not disable the entire processor
• They may achieve better performance and energy
proportionality than static AMPs.
35. Types of reconfigurable AMPs
1. Those that dynamically fuse or partition the cores and thus change the core-count
2. Those which share/trade resources between cores
3. Those which transform the core architecture
In the following slides, we show examples of each of these through figures. See the survey for more details.
36. 1. Changing core-count
An 8-core CMP with two independent cores, 2-core fused
group, and 4-core fused group
Ipek et al. ISCA’07
37. [Figure: a static AMP with big and little cores vs. a reconfigurable AMP with many little cores, of which a few can be fused into a wide-issue processor]
Salverda et al. HPCA'08
41. 1. Changing core-count
[Figure: different granularities of parallel processing elements, from wide-issue processors with many ALUs each (exploiting fine-grain parallelism more effectively) to many narrower elements (running more applications effectively); PIM = processor in memory]
Sankaralingam et al. ISCA'03
42. 1. Changing core-count
A reconfigurable AMP where multiple scalar cores can
be united to create a larger superscalar processor
Chiu et al. ICPP’10
43. 2. Trading resources between cores
[Figure: a reconfigurable AMP built from asymmetric building blocks, some possibly faulty]
Gupta et al. MICRO’10
44. 2. Trading resources between cores
A 3D reconfigurable AMP: poolable resources (registers,
instruction queue, reorder buffer, cache space, load and store
queues, etc.) in another layer
Homayoun et al. HPCA'11
45. 2. Trading resources between cores
Dynamic core morphing (1/2)
Baseline configuration for two heterogeneous cores
Rodrigues et al. PACT’11
46. 2. Trading resources between cores
Dynamic core morphing (2/2)
Morphed configuration for two heterogeneous cores
RED: connectivity for strong morphed core; BLACK: connectivity for weak core
47. 2. Trading resources between cores
Pipeline level view of the resource sharing
Rodrigues et al. VLSID’14