big.LITTLE-style asymmetric multicore processors
Sparsh Mittal
IIT Hyderabad, India
Acronyms
• AMP = asymmetric multicore processors
• SMP = symmetric multicore processors
• OOO = out-of-order, InO = in-order
• ISA = instruction set architecture
• EDP = energy delay product
• ILP/TLP/MLP = instruction/thread/memory-level parallelism
• VM = virtual machine
2
Motivation
• Modern processors are diverse in:
– Optimization objectives: perf, energy
– Workloads: multimedia, encryption, network …
– Scale: embedded system to data center
• A single monolithic core cannot fulfill all requirements
• This has led to two broad ranges of processors:
– Narrow in-order (InO) cores, e.g. Xeon Phi
– Wide out-of-order (OoO) cores, e.g. Sandy Bridge and POWER7
[Figure: IBM POWER7 (8 cores) and Intel Xeon Phi]
3
Motivation
• Next step: use different types of cores in the same processor => AMP
• AMPs can
– Provide better energy efficiency than SMPs and per-core DVFS
– Optimize for thread-level or instruction-level parallelism
– Allow turning off unused cores to save energy
4
Classification of AMPs
• Static AMP: configuration of cores is fixed statically
• Reconfigurable AMP: microarchitecture can be reconfigured dynamically to provide cores with different resources
5
Examples of Static AMPs
[Figure: a symmetric multicore built from identical cores (C) beside an asymmetric multicore that mixes core types C1 to C5]
6
Examples of Static AMPs
9 power-equivalent multi-cores
(B=big core, m=medium core, s=small core)
Generally, two core types are sufficient for providing most benefits of heterogeneity
Eyerman et al. ASPLOS'14
7
Example of Reconfigurable AMP
[Figure: a reconfigurable AMP; labels: asymmetric building blocks, faulty]
Gupta et al. MICRO’10
8
Terminology
Different terminologies for an AMP:
Asymmetric multicore (AMC), asymmetric multicore systems (ASYMS), asymmetric multiprocessor systems (ASMP), asymmetric chip multiprocessors (ACMP), heterogeneous microarchitectures (HM), heterogeneous multicore processor (HMP), heterogeneous CMP (HCMP), asymmetric cluster CMP (ACCMP), big.LITTLE system
Different terminologies for cores of an AMP:
Big/little (or big/small), fast/slow, complex/simple, aggressive/lightweight, strong/weak cores, application/low-power processor (AP/LP), central/peripheral processor
Different terminologies for reconfigurable AMPs and/or techniques for architecting them:
Reconfigurable, configurable, adaptive, scalable, composable, composite, coalition, conjoined, federated, polymorphous, morphable, core morphing, core fusion, flexible, dynamic and united processors
9
Types of Heterogeneity in AMPs
uArch = microarchitecture, freq. = frequency, diff. = different
Types of heterogeneity (basis: nature of asymmetry), two classifications, from Srinivasan et al. [2011] and Koufaty et al. [2010]:
• Functional asymmetry (diff. ISA and uArch) vs. performance asymmetry (same ISA, diff. uArch, cache size, freq.)
• Virtual asymmetric (same uArch & ISA, diff. freq. or cache size), physical asymmetric (same ISA, diff. uArch, e.g. InO vs OOO, and freq.) and hybrid cores (diff. ISA and uArch)
Types of heterogeneity (basis: how asymmetry is introduced), from Khan and Kundu [2010]:
• Extemporaneous heterogeneity (performance of a core altered by DVFS or hardware reconfiguration)
• Deliberate heterogeneity (diff. uArch, ISA and specialization, e.g. CPU and GPU)
10
Classification based on performance ordering
• Performance of EV6 > EV5 for all apps => AMP with monotonic cores
• Neither Alpha nor x86 is optimal for all apps => AMP with non-monotonic cores
[Figure: Alpha EV5 and Alpha EV6 cores; an Alpha core and an x86 core]
[Table: configuration of Alpha processors]
11
Architectural configuration of four ARM processors
performance on XML parsing benchmark
• Cortex A15 and A7: Same ISA but different architecture
• Cortex A57 and A53: Same ISA but different architecture
All four processors can have 1 to 4 cores per cluster
12
Configuration of Intel’s QuickIA research prototype
Chitlur et al. HPCA'12
13
Benefits of AMPs
• AMPs are a natural choice for systems with diverse applications and usage scenarios
• Big core => better performance
• Small core => better energy efficiency
• However, there is no clear winner on the EDP metric!
• Big core => better EDP for compute-intensive apps with high data reuse
• Small core => better EDP for memory-intensive apps with little data reuse and many atomic operations
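The EDP comparison above can be made concrete with a small sketch. EDP is energy multiplied by delay, and energy is power multiplied by runtime. The power figures and runtimes below are hypothetical values chosen only to illustrate the trend; they are not measurements from any of the cited works.

```python
def edp(power_watts: float, runtime_s: float) -> float:
    """Energy-delay product: (power * time) * time, in joule-seconds."""
    energy_j = power_watts * runtime_s
    return energy_j * runtime_s


# Hypothetical average power of each core type (assumed, not measured).
BIG_POWER_W = 4.0
SMALL_POWER_W = 1.0


def better_core(runtime_big_s: float, runtime_small_s: float) -> str:
    """Return the core type with the lower EDP for the given runtimes."""
    big = edp(BIG_POWER_W, runtime_big_s)
    small = edp(SMALL_POWER_W, runtime_small_s)
    return "big" if big < small else "small"


if __name__ == "__main__":
    # Compute-intensive app: assume the big core is 3x faster.
    print(better_core(1.0, 3.0))
    # Memory-intensive app: assume the big core is only 1.2x faster.
    print(better_core(1.0, 1.2))
```

Because delay enters the product twice, a large speedup can justify the big core's higher power, while a small speedup cannot.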
14
Challenges of AMPs
• Conventional software is designed for SMPs; many changes are required to support AMPs
• AMP cores should cover a wide and evenly spread range of the performance/complexity design space
• Scheduling complexity in an AMP increases exponentially with rising number of core types and applications
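A rough way to see the growth in scheduling complexity is to count candidate mappings. The sketch below assumes each thread may be placed on any core type and ignores how many cores of each type exist; that simplification is ours, not from the slides. Under it, k core types and n threads give k**n mappings.

```python
from itertools import product


def count_mappings(num_core_types: int, num_threads: int) -> int:
    """Count assignments of one core type to each thread (no capacity limits)."""
    return num_core_types ** num_threads


def enumerate_mappings(core_types, num_threads):
    """Lazily yield every assignment as a tuple, one core type per thread."""
    return product(core_types, repeat=num_threads)


if __name__ == "__main__":
    # Columns: thread count, mappings with 2 core types, with 3 core types.
    for threads in (2, 4, 8, 16):
        print(threads, count_mappings(2, threads), count_mappings(3, threads))
```

Exhaustive search is therefore practical only for very small systems, which is one reason schedulers rely on heuristics.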
15
Challenges of AMPs
• In some AMPs, the ISA, OS and programming model of different cores are also different => they present even more challenges
• AMPs are not widely available
• Some works use DVFS (or clock throttling) to emulate asymmetric cores; however,
– it over-simplifies the challenges of a real AMP => inaccurate conclusions
– it cannot model non-monotonic cores
16
Thread migration overheads
• In static AMPs, thread migration may take millions of cycles, e.g. in an AMP with Cortex A15 and A7:
• migration latency from A15 to A7: 3.75 ms
• migration latency from A7 to A15: 2.10 ms
• Flushing and warming of caches etc. => additional overheads
• Hence, migration can be performed only once every several million instructions
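The arithmetic behind "millions of cycles" can be sketched as follows. The 1000 MHz clock and the 1% overhead budget are assumptions made for illustration; only the two latencies come from the slide. Integer units (microseconds and MHz) are used so the results are exact.

```python
def migration_cycles(latency_us: int, clock_mhz: int) -> int:
    """Cycles consumed by one migration: microseconds * cycles per microsecond."""
    return latency_us * clock_mhz


def min_interval_cycles(latency_us: int, clock_mhz: int, max_overhead_pct: int) -> int:
    """Smallest gap between migrations (in cycles) that keeps the
    migration cost at or below max_overhead_pct percent of that gap."""
    return migration_cycles(latency_us, clock_mhz) * 100 // max_overhead_pct


if __name__ == "__main__":
    CLOCK_MHZ = 1000  # assumed clock frequency, not stated in the slides
    print(migration_cycles(3750, CLOCK_MHZ))        # A15 to A7, 3.75 ms
    print(migration_cycles(2100, CLOCK_MHZ))        # A7 to A15, 2.10 ms
    print(min_interval_cycles(3750, CLOCK_MHZ, 1))  # gap for a 1% budget
```

Cache flushing and warm-up are not included in these numbers, so the real cost would be higher.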
17
Challenge of maintaining fairness
• Fairness: important for meeting QoS guarantees
• In AMP, some threads may be unfairly slowed-down =>
starvation & unpredictable per-task performance
• In a multithreaded app, performance advantage of big
core may be completely negated if thread running on it
stalls waiting for other threads
[Figure: one big core (C0) and three small cores (C1 to C3); thread 0 stalls at the synchronization barrier waiting for the other threads]
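The barrier scenario above can be sketched with a simple model in which the barrier is reached only when the slowest thread arrives. The 2x big-core speed, the placement of the big core at C0, and the work units are all hypothetical.

```python
def time_to_barrier(work, speed):
    """Time until the slowest thread arrives; every other thread waits for it."""
    return max(w / s for w, s in zip(work, speed))


# Hypothetical setup: C0 is a big core assumed 2x faster, C1-C3 are small cores.
BIG_AND_SMALL = [2.0, 1.0, 1.0, 1.0]
ALL_SMALL = [1.0, 1.0, 1.0, 1.0]
EQUAL_WORK = [100, 100, 100, 100]  # arbitrary work units per thread

if __name__ == "__main__":
    # Thread 0 finishes early on the big core, then idles at the barrier,
    # so both configurations reach the barrier at the same time.
    print(time_to_barrier(EQUAL_WORK, BIG_AND_SMALL))
    print(time_to_barrier(EQUAL_WORK, ALL_SMALL))
```

With equally loaded threads the big core buys nothing here, which is why asymmetry-aware schedulers look at which thread is actually on the critical path.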
18
Challenges of AMPs
• Some AMP designs use non-standard ISAs or compiler
support => may not find wide adoption
• Unpredictability: An asymmetry-unaware scheduler
may schedule different threads to fast or slow cores in
different runs => variable performance.
19
Techniques for Managing AMPs
20
App/thread mapping strategies
• The most important challenge in AMPs: finding the
right core for running a thread
• The right choice depends on:
– Optimization target
– Application property
– Core property
• We will discuss some mapping (scheduling)
strategies
21
Estimating performance for scheduling
To make scheduling decisions, thread performance on different core types must be known
Option 1: Estimate perf. of a thread on a core type without actually running the thread on that core type, e.g., using math models
• HW-specific, error-prone
Option 2: Actually run threads on each core type to sample performance
• High profiling overhead
22
App/thread mapping strategies
[Figure: CPI breakdown for representative cases, in three panels: CPI dominated by external stalls, CPI dominated by execution cycles, and CPI dominated by internal stalls; cases are marked as suitable for the big core or for the small core]
Koufaty et al. EuroSys’10
23
App/thread mapping strategies
• Load on different threads is imbalanced
– Map slowest thread to big core
• Different VMs running on a host have different
resource requirements
– VM with higher number of `virtual CPUs' gets big core
• App with high ILP => map to a wide-issue
superscalar processor which can issue several
instructions every cycle
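The first strategy above (map the slowest thread to the big core) can be sketched as follows. The sketch assumes one big core, a fixed and workload-independent speedup of 2x, and known remaining work per thread; all three are simplifications introduced here for illustration.

```python
def critical_thread(remaining_work):
    """Index of the thread with the most remaining work (the slowest one)."""
    return max(range(len(remaining_work)), key=lambda i: remaining_work[i])


def finish_time(remaining_work, big_idx, big_speedup=2.0):
    """Completion time of the whole app when thread big_idx runs on the big core.

    big_speedup is an assumed factor; small cores run at speed 1.0.
    """
    return max(
        w / (big_speedup if i == big_idx else 1.0)
        for i, w in enumerate(remaining_work)
    )


if __name__ == "__main__":
    work = [80, 100, 160, 90]  # hypothetical, imbalanced per-thread work
    best = critical_thread(work)
    print(best, finish_time(work, best))
    # Giving the big core to a non-critical thread leaves the bottleneck alone.
    print(0, finish_time(work, 0))
```

In this example, accelerating the most loaded thread shortens the run, while accelerating any other thread leaves the finish time unchanged.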
24
App/thread mapping strategies
Big core:
• Sequential phases
• Compute-intensive apps
• App with low miss-rate
• Benefit from running on big core is large
• Thread with largest remaining execution time
• Application code
Small core:
• Highly-parallel phases
• I/O-intensive apps
• App with high miss-rate
• Benefit from running on big core is small
• Thread with small remaining execution time
• OS kernel code, virtualization helper code & device interrupts
25
App/thread mapping strategies
Big core:
• High priority app
• Multimedia-intensive apps
Small core:
• Low priority app
• Service daemons and background processes, sensor sampling and buffering tasks
26
Example of fairness-oriented scheduling schemes
• ‘Equal-time’: run each thread on each core type for
equal amount of time
• ‘Equal-progress’: It aims to get equal work done in all
threads.
– Idea 1: Schedule thread with currently largest
slowdown on big core.
– Idea 2: Whenever difference in progress of different
threads becomes too high, swap them
Van Craeynest et al. PACT’13
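Idea 1 of the equal-progress scheme can be sketched as a loop over scheduling quanta. This is a simplified illustration, not the implementation from the cited paper: it assumes a single big core, measures progress in big-core-quantum units, and takes each thread's relative small-core speed as a known input (`small_core_rate` is a hypothetical parameter).

```python
def equal_progress_schedule(small_core_rate, quanta):
    """Each quantum, give the big core to the thread with the largest slowdown.

    small_core_rate[i] is thread i's speed on a small core relative to the
    big core (1.0 would mean no slowdown). Returns the per-quantum choice
    of big-core thread and the final progress of every thread.
    """
    n = len(small_core_rate)
    progress = [0.0] * n  # work done, in big-core-quantum units
    history = []
    for _ in range(quanta):
        # Slowdown = elapsed time / progress. Elapsed time is identical for
        # all threads, so the largest slowdown is the smallest progress.
        on_big = min(range(n), key=lambda i: progress[i])
        for i in range(n):
            progress[i] += 1.0 if i == on_big else small_core_rate[i]
        history.append(on_big)
    return history, progress


if __name__ == "__main__":
    history, progress = equal_progress_schedule([0.5, 0.5], quanta=4)
    print(history)
    print(progress)
```

With two identical threads the big core alternates between them and both end with the same progress, which is the behavior the scheme aims for.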
27
Use of DVFS along with thread scheduling
• Provides further opportunities to exercise performance/energy tradeoff
• Estimate throughput/Watt of program phase at
different voltage/frequency (V/F) levels on all core
types.
• Based on this, best thread-to-core mapping and V/F
values are selected
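The selection step described above can be sketched as an exhaustive search over thread-to-core mappings and V/F levels, scoring each candidate by total throughput per Watt. The `estimates` table, the core and level names, and all numbers are hypothetical; in practice the estimates would come from a model or from sampling.

```python
from itertools import permutations, product


def best_config(estimates, threads, cores, vf_levels):
    """Pick the mapping and V/F levels with the highest throughput per Watt.

    estimates[(thread, core, vf)] -> (throughput, watts); one thread per core.
    Returns (score, {thread: (core, vf)}).
    """
    best = None
    for assignment in permutations(cores, len(threads)):
        for vfs in product(vf_levels, repeat=len(threads)):
            perf = watts = 0.0
            for t, c, vf in zip(threads, assignment, vfs):
                p, w = estimates[(t, c, vf)]
                perf += p
                watts += w
            score = perf / watts
            if best is None or score > best[0]:
                best = (score, dict(zip(threads, zip(assignment, vfs))))
    return best


if __name__ == "__main__":
    # Hypothetical estimates for a single program phase "A".
    est = {
        ("A", "big", "low"): (2.0, 2.0),
        ("A", "big", "high"): (3.0, 6.0),
        ("A", "small", "low"): (1.0, 0.5),
        ("A", "small", "high"): (1.5, 1.5),
    }
    print(best_config(est, ["A"], ["big", "small"], ["low", "high"]))
```

The search space grows quickly with threads, cores and V/F levels, so a practical scheduler would likely prune it or use a heuristic instead of full enumeration.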
28
Challenges of different thread scheduling policies
Static scheduling:
• Works by collecting data by offline analysis
• Cannot account for different input sets and application phases
• Becomes infeasible with increasing number of co-running applications
Dynamic scheduling:
• Works by collecting data at runtime
• Incurs thread migration overhead
• Ineffective for short-lived threads since the profiling phase itself may form a large majority of their lifetime
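The point about short-lived threads can be illustrated with simple arithmetic: if a thread must be sampled once on every core type, the sampling time is a fixed cost that dominates short lifetimes. The sample length and lifetimes below are hypothetical.

```python
def profiling_fraction(sample_ms, num_core_types, lifetime_ms):
    """Share of a thread's lifetime spent sampling it once on every core type."""
    return min(1.0, sample_ms * num_core_types / lifetime_ms)


if __name__ == "__main__":
    # Assumed 10 ms sample per core type, two core types.
    print(profiling_fraction(10, 2, 25))      # short-lived thread
    print(profiling_fraction(10, 2, 10_000))  # long-running thread
```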
29
Reconfigurable AMPs
30
Motivation: Need of fine-grained switching
Variance of IPC in gcc over 300K instructions
31
Need of fine-grained switching
Coarse-grained vs. fine-grained heterogeneity
Fallin et al. ICCD’14
32
Reconfigurable AMPs
• Benefits: No thread migration overheads
• Challenges: Reconfiguration incurs latency and energy
overheads, e.g., I/D-cache flushes and data migration
• Avoiding this may require: a complex compiler, custom
ISA, 3D stacking, changes to OS and application binary.
• Tradeoffs:
– Centralized resources: saves area, but presents scalability
bottleneck
– High adaptation granularity: allows exploiting different
levels of ILP and TLP but precludes specialization for
accelerating specific applications
33
Benefits of reconfigurable AMPs
• Allow flexibly scaling up to exploit MLP and ILP in
single-threaded apps
• Allow scaling down to exploit TLP in multithreaded
apps
• Provide better HW utilization and resilience to errors
since one hard error may not disable entire processor
• They may achieve better performance and energy
proportionality than static AMPs.
34
Types of reconfigurable AMPs
1. Those that dynamically fuse or partition the cores and thus change the core-count
2. Those which share/trade resources between cores
3. Those which transform the core architecture
In the following slides, we show examples of each of these through figures. See the survey for more details
35
1. Changing core-count
An 8-core CMP with two independent cores, 2-core fused
group, and 4-core fused group
Ipek et al. ISCA’07
36
Static AMP with big and little cores
Reconfigurable AMP with many little cores, of which a few can be fused into a wide-issue processor
Salverda et al. HPCA'08
37
1. Changing core-count
Idealized processor
Fusing in-order cores
Salverda et al. HPCA'08
38
1. Changing core-count
[Figure panels: 32 2-wide config.; 8 processor config.; one 64-wide config.]
Kim et al. MICRO’07
39
1. Changing core-count
A reconfigurable AMP
Pricopi et al. TACO'11
40
1. Changing core-count
[Figure: different granularities of parallel processing elements. Labels: exploits fine-grain parallelism more effectively; runs more applications effectively; wide-issue processors with many ALUs each. PIM = processor in memory]
Sankaralingam et al. ISCA'03
41
1. Changing core-count
A reconfigurable AMP where multiple scalar cores can
be united to create a larger superscalar processor
Chiu et al. ICPP’10
42
2. Trading resources between cores
[Figure: a reconfigurable AMP; labels: asymmetric building blocks, faulty]
Gupta et al. MICRO’10
43
2. Trading resources between cores
A 3D reconfigurable AMP: poolable resources (registers,
instruction queue, reorder buffer, cache space, load and store
queues, etc.) in another layer
Homayoun et al. HPCA'11
44
2. Trading resources between cores
Dynamic core morphing (1/2)
Baseline configuration for two heterogeneous cores
Rodrigues et al. PACT’11
45
2. Trading resources between cores
Dynamic core morphing (2/2)
Morphed configuration for two
heterogeneous cores.
RED: Connectivity for strong morphed core
BLACK: Connectivity for weak core
46
2. Trading resources between cores
Pipeline level view of the resource sharing
Rodrigues et al. VLSID’14
47
3. Morphing core architecture
Baseline 4-way OOO core
Baseline core morphed into an InO core
Srinivasan et al. ISVLSI’13
48
3. Morphing core architecture
Composite core architecture
Lukefahr et al. MICRO’12
49
References
• S. Mittal, “A Survey of Techniques for Architecting and Managing Asymmetric Multicore Processors”, ACM Computing Surveys, 2016
50

PPT_for_big_LITTLE_style_Asymmetric_Mult.pptx

  • 2. Acronyms • AMP = asymmetric multicore processors • SMP = symmetric multicore processors • OOO =out-of-order, InO = in-order • ISA = instruction set architecture • EDP = energy delay product • ILP/TLP/MLP = instruction/thread/memory-level parallelism • VM = virtual machine 2
  • 3. Motivation • Modern processors are diverse in: – Optimization objectives: perf, energy – Workloads: multimedia, encryption, network … – Scale: embedded system to data center • A single monolithic core cannot fulfill all requirements • This has led to two broad ranges of processors: Narrow in-order (InO) cores e.g. Xeon Phi Wide out-of-order (OoO) cores e.g. Sandybridge and Power7 IBM POWER7: 8 cores Intel Xeon Phi 3
  • 4. Motivation • Next step: use different types of core in same processor => AMP • AMPs can – Provide better energy efficiency than SMPs and per-core DVFS – Can optimize for thread-level or instruction-level parallelism – Allow turning-off unused core for saving energy 4
  • 5. Classification of AMPs • Static AMP: statically configuration of cores is fixed • Reconfigurable AMP: microarchitecture can be reconfigured dynamically to provide cores of different resources 5
  • 6. 6 Examples of Static AMPs Asymmetric Symmetric 6 C1 C2 C3 C4 C4 C4 C4 C5 C5 C5 C5 C C C C C C C C C C C C C C C C
  • 7. Examples of Static AMPs 9 power-equivalent multi-cores (B=big core, m=medium core, s=small core) Generally, two core types are sufficient for providing most benefits of heterogeneity Eyerman et al. ASPLOS'14 7
  • 8. Example of Reconfigurable AMP Asymmetric building blocks Faulty Gupta et al. MICRO’10 8
  • 9. Terminology Asymmetric multicore (AMC), asymmetric multicore systems (ASYMS), asymmetric multiprocessor systems (ASMP), asymmetric chip multiprocessors (ACMP), heterogeneous microarchitectures (HM), heterogeneous multicore processor (HMP), heterogeneous CMP (HCMP), asymmetric cluster CMP (ACCMP), big.LITTLE system Big/little (or big/small), fast/slow, complex/simple, aggressive/lightweight, strong/weak cores, application/low-power processor (AP/LP), central/peripheral processor Reconfigurable, configurable, adaptive, scalable, composable, composite, coalition, conjoined, federated, polymorphous, morphable, core morphing, core fusion, flexible, dynamic and united processors 9 Different terminologies for reconfigurable AMPs and/or techniques for architecting them Different terminologies for cores of an AMP Different terminologies for an AMP
  • 10. Types of Heterogeneity in AMPs Types of heterogeneity (basis: nature of asymmetry) Srinivasan et al. [2011] Koufaty et al. [2010] Types of heterogeneity (basis: nature of asymmetry) Types of heterogeneity (basis: how asymmetry is introduced) Khan and Kundu [2010] uArch = microarchitecture, freq. = frequency, diff. = different 10 Extemporaneous heterogeneity (performance of a core altered by DVFS or hardware reconfiguration) Deliberate heterogeneity (diff. uArch, ISA and specialization, e.g. CPU and GPU) Functional asymmetry (diff. ISA and uArch) Performance asymmetry (same ISA, diff. uArch, cache size, freq) Virtual asymmetric (same uArch & ISA, diff. freq or cache size) Physical asymmetric (same ISA, diff. uArch e.g. InO vs OOO, and freq.) Hybrid Cores (diff. ISA and uArch)
  • 11. Classification based on performance ordering core core core X86 Performance of EV6 > EV5 for Neither Alpha nor x86 is optimal for all all apps => AMP with monotonic cores apps => AMP with non-monotonic cores Configuration of Alpha processors 11 Alpha core Alpha EV6 Alpha EV5
  • 12. Architectural configuration of four ARM processors performance on XML parsing benchmark • Cortex A15 and A7: Same ISA but different architecture • Cortex A57 and A53: Same ISA but different architecture All the four processors can have 1 to 4 cores per cluster 12
  • 13. Configuration of Intel’s QuickIA research prototype Chitlur et al. HPCA'12 13
  • 14. Benefits of AMPs • AMPs are natural choice for systems with diverse applications and usage scenarios • Big core => better performance • Small core => better energy efficiency • However, no winner on EDP metric! • Big core => better EDP for compute-intensive apps with high data reuse • Small core => better EDP for memory-intensive apps with little data reuse and many atomic operations 14
  • 15. Challenges of AMPs • Conventional software is designed for SMPs; many changes are required to support AMPs • AMP cores should cover a wide and evenly spread range of the performance/complexity design space • Scheduling complexity in an AMP increases exponentially with the number of core types and applications
  • 16. Challenges of AMPs • In some AMPs, the ISA, OS and programming model of different cores are also different => they present even more challenges • AMPs are not widely available • Some works use DVFS (or clock throttling) to emulate asymmetric cores; however, this – over-simplifies the challenges of a real AMP => inaccurate conclusions – cannot model non-monotonic cores
  • 17. Thread migration overheads • In static AMPs, thread migration may take millions of cycles, e.g. in an AMP with Cortex A15 and A7: • migration latency from A15 to A7: 3.75 ms • migration latency from A7 to A15: 2.10 ms • Flushing and warming of caches etc. => additional overheads • Hence, migration can be performed only once every several million instructions
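The arithmetic behind "millions of cycles" can be sketched as follows. The latencies are the ones quoted on the slide; the 1 GHz clock, the IPC of 1 and the 5% overhead budget are assumptions chosen only to illustrate the calculation.

```python
def ms_to_cycles(latency_ms, freq_ghz):
    """Convert a latency in milliseconds to clock cycles at freq_ghz."""
    return latency_ms * freq_ghz * 1e6

def min_interval_instructions(latency_ms, freq_ghz, ipc, max_overhead):
    """Instructions to execute between migrations so that migration cycles
    stay at or below max_overhead (a fraction) of the useful cycles."""
    migration_cycles = ms_to_cycles(latency_ms, freq_ghz)
    useful_cycles = migration_cycles / max_overhead
    return useful_cycles * ipc

# Assumed operating point (not from the slides): 1 GHz, IPC = 1, 5% budget.
for label, latency_ms in [("A15 to A7", 3.75), ("A7 to A15", 2.10)]:
    cycles = ms_to_cycles(latency_ms, freq_ghz=1.0)
    interval = min_interval_instructions(latency_ms, 1.0, ipc=1.0, max_overhead=0.05)
    print(f"{label}: {cycles / 1e6:.2f}M cycles, "
          f"migrate at most every {interval / 1e6:.0f}M instructions")
```

Under these assumptions a single A15-to-A7 migration costs about 3.75 million cycles, so keeping the overhead within the 5% budget means migrating no more often than roughly every 75 million instructions.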
  • 18. Challenge of maintaining fairness • Fairness: important for meeting QoS guarantees • In an AMP, some threads may be unfairly slowed down => starvation & unpredictable per-task performance • In a multithreaded app, the performance advantage of the big core may be completely negated if the thread running on it stalls waiting for other threads [Figure: one big core and small cores (C0 to C3); thread 0 stalls at a synchronization barrier waiting for the other threads]
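A small sketch of the barrier effect: the time to pass a synchronization barrier is set by the slowest thread, so a faster core helps only if it runs the thread that would otherwise arrive last. The work amounts and the 2x big-core speed below are hypothetical.

```python
def barrier_time(work_per_thread, core_speeds):
    """Time until every thread reaches the barrier (slowest thread decides)."""
    return max(work / speed for work, speed in zip(work_per_thread, core_speeds))

all_small = [1.0, 1.0, 1.0, 1.0]   # four small cores
one_big = [2.0, 1.0, 1.0, 1.0]     # assumed: big core is 2x faster

balanced = [100, 100, 100, 100]    # equal work per thread
skewed = [160, 80, 80, 80]         # thread 0 carries more work

print(barrier_time(balanced, all_small))  # baseline
print(barrier_time(balanced, one_big))    # big core finishes early, then stalls
print(barrier_time(skewed, one_big))      # heaviest thread placed on the big core
```

With balanced work, the big core reaches the barrier early and waits, so the barrier time equals the all-small baseline. When the heaviest thread runs on the big core, the barrier time drops.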
  • 19. Challenges of AMPs • Some AMP designs rely on non-standard ISAs or compiler support => may not find wide adoption • Unpredictability: an asymmetry-unaware scheduler may place the same thread on a fast core in one run and a slow core in another => variable performance
  • 21. App/thread mapping strategies • The most important challenge in AMPs: finding the right core for running a thread • The right choice depends on: – Optimization target – Application property – Core property • We will discuss some mapping (scheduling) strategies
  • 22. Estimating performance for scheduling. To make scheduling decisions, thread performance on different core types must be known. Option 1: Estimate perf. of a thread on a core type without actually running the thread on that core type, e.g., using math models • HW-specific, error-prone. Option 2: Actually run threads on each core type to sample performance • High profiling overhead
  • 23. App/thread mapping strategies. [Figure: CPI breakdown for representative cases: (a) CPI dominated by external stalls, (b) CPI dominated by execution cycles, (c) CPI dominated by internal stalls; cases annotated as suitable for big core or suitable for small core] (Koufaty et al. EuroSys’10)
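One way to turn a CPI breakdown into a mapping hint is sketched below. It assumes that threads whose CPI is dominated by execution cycles favor the big core and threads dominated by internal or external stalls favor the small core. The function name, the 0.5 threshold and the sample numbers are hypothetical; this is not the algorithm of Koufaty et al.

```python
def core_bias(exec_cycles, internal_stalls, external_stalls, threshold=0.5):
    """Suggest a core type from a per-instruction CPI breakdown.

    Assumption (illustrative rule): if execution cycles make up at least
    `threshold` of total CPI, the thread is likely to benefit from a big
    core; otherwise stalls dominate and a small core is suggested.
    """
    total = exec_cycles + internal_stalls + external_stalls
    if total <= 0:
        raise ValueError("CPI components must sum to a positive value")
    return "big" if exec_cycles / total >= threshold else "small"

# Hypothetical CPI breakdowns: (execution, internal stalls, external stalls)
samples = {
    "exec_dominated": (0.8, 0.1, 0.1),
    "external_stall_dominated": (0.4, 0.2, 1.4),
    "internal_stall_dominated": (0.4, 1.0, 0.2),
}
for name, cpi in samples.items():
    print(name, "->", core_bias(*cpi))
```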
  • 24. App/thread mapping strategies • Load across threads is imbalanced – Map the slowest thread to the big core • Different VMs running on a host have different resource requirements – VM with the higher number of `virtual CPUs' gets the big core • App with high ILP => map to a wide-issue superscalar processor, which can issue several instructions every cycle
  • 25. App/thread mapping strategies. Big core: • Sequential phases • Compute-intensive apps • App with low miss-rate • Benefit from running on big core is large • Thread with largest remaining execution time • Application code. Small core: • Highly-parallel phases • I/O-intensive apps • App with high miss-rate • Benefit from running on big core is small • Thread with small remaining execution time • OS kernel code, virtualization helper code & device interrupts
  • 26. App/thread mapping strategies. Big core: • High priority app • Multimedia-intensive apps. Small core: • Low priority app • Service daemons and background processes, sensor sampling and buffering tasks
  • 27. Examples of fairness-oriented scheduling schemes • ‘Equal-time’: run each thread on each core type for an equal amount of time • ‘Equal-progress’: aims to get equal work done in all threads – Idea 1: Schedule the thread with the currently largest slowdown on the big core – Idea 2: Whenever the difference in progress of different threads becomes too high, swap them (Van Craeynest et al. PACT’13)
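Idea 1 of equal-progress scheduling can be sketched as a toy simulation with one big core. Progress is measured in big-core-equivalent quanta; the per-thread small-core rates are hypothetical inputs, and the tie-breaking rule (lowest thread index) is an assumption of this sketch, not part of the cited scheme.

```python
def equal_progress_schedule(small_core_rates, quanta):
    """Toy equal-progress scheduler for N threads and one big core.

    small_core_rates[i] is thread i's progress per quantum on a small core,
    relative to 1.0 per quantum on the big core (hypothetical values).
    Each quantum, the thread with the largest slowdown so far gets the big core.
    Returns (big_core_history, progress).
    """
    n = len(small_core_rates)
    progress = [0.0] * n
    history = []
    for quantum in range(quanta):
        elapsed = float(quantum)  # quanta completed before this one

        def slowdown(i):
            # Slowdown = elapsed time / big-core-equivalent progress.
            return elapsed / progress[i] if progress[i] > 0 else float("inf")

        on_big = max(range(n), key=slowdown)  # ties go to the lowest index
        history.append(on_big)
        for i in range(n):
            progress[i] += 1.0 if i == on_big else small_core_rates[i]
    return history, progress

history, progress = equal_progress_schedule([0.5, 0.5], quanta=4)
print(history, progress)
```

For two identical threads the big core simply alternates between them, which matches the intuition that equal-progress reduces to equal-time when threads benefit equally from the big core.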
  • 28. Use of DVFS along with thread scheduling • Provides further opportunities to exercise the performance/energy tradeoff • Estimate throughput/Watt of each program phase at different voltage/frequency (V/F) levels on all core types • Based on this, the best thread-to-core mapping and V/F values are selected
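The selection step can be sketched as an exhaustive search over thread-to-core mappings, picking the best V/F level per thread on its assigned core. The efficiency table, the V/F labels and the objective (sum of per-thread throughput/Watt) are all assumptions made for illustration; a real scheme would obtain the estimates from a model or from sampling.

```python
from itertools import permutations

def best_mapping(efficiency, threads, cores):
    """Pick the thread-to-core mapping and V/F levels maximizing the sum of
    estimated throughput/Watt.

    efficiency[(thread, core)] maps each V/F level to an estimate.
    Returns (total, {thread: (core, vf_level)}).
    """
    best_total, best_assignment = float("-inf"), None
    for core_order in permutations(cores, len(threads)):
        assignment, total = {}, 0.0
        for thread, core in zip(threads, core_order):
            levels = efficiency[(thread, core)]
            vf = max(levels, key=levels.get)  # best V/F level on this core
            assignment[thread] = (core, vf)
            total += levels[vf]
        if total > best_total:
            best_total, best_assignment = total, assignment
    return best_total, best_assignment

# Hypothetical throughput/Watt estimates per (thread, core) and V/F level.
efficiency = {
    ("t0", "big"): {"0.9V/1.0GHz": 3.0, "1.1V/2.0GHz": 2.5},
    ("t0", "small"): {"0.9V/0.6GHz": 2.0, "1.0V/1.0GHz": 1.8},
    ("t1", "big"): {"0.9V/1.0GHz": 1.5, "1.1V/2.0GHz": 1.2},
    ("t1", "small"): {"0.9V/0.6GHz": 2.2, "1.0V/1.0GHz": 1.9},
}
total, mapping = best_mapping(efficiency, ["t0", "t1"], ["big", "small"])
print(total, mapping)
```

Exhaustive search is only practical for a handful of threads and cores, which echoes the earlier point that scheduling complexity grows quickly with the number of core types and applications.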
  • 29. Challenges of different thread scheduling policies. Static scheduling: • Works by collecting data through offline analysis • Cannot account for different input sets and application phases • Becomes infeasible with an increasing number of co-running applications. Dynamic scheduling: • Works by collecting data at runtime • Incurs thread migration overhead • Ineffective for short-lived threads, since the profiling phase itself may form a large majority of their lifetime
  • 31. Motivation: Need for fine-grained switching [Figure: variance of IPC in gcc over 300K instructions]
  • 32. Need for fine-grained switching [Figure: coarse-grained vs. fine-grained heterogeneity] (Fallin et al. ICCD’14)
  • 33. Reconfigurable AMPs • Benefits: no thread migration overheads • Challenges: reconfiguration incurs latency and energy overheads, e.g., I/D-cache flushes and data migration • Avoiding these overheads may require a complex compiler, custom ISA, 3D stacking, or changes to the OS and application binary • Tradeoffs: – Centralized resources: save area, but present a scalability bottleneck – High adaptation granularity: allows exploiting different levels of ILP and TLP, but precludes specialization for accelerating specific applications
  • 34. Benefits of reconfigurable AMPs • Allow flexibly scaling up to exploit MLP and ILP in single-threaded apps • Allow scaling down to exploit TLP in multithreaded apps • Provide better HW utilization and resilience to errors, since one hard error may not disable the entire processor • May achieve better performance and energy proportionality than static AMPs
  • 35. Types of reconfigurable AMPs 1. Those that dynamically fuse or partition the cores and thus change the core-count 2. Those which share/trade resources between cores 3. Those which transform the core architecture. In the following slides, we show examples of each of these through figures. See the survey for more details
  • 36. 1. Changing core-count [Figure: an 8-core CMP with two independent cores, a 2-core fused group, and a 4-core fused group] (Ipek et al. ISCA’07)
  • 37. [Figure: static AMP with big and little cores; reconfigurable AMP with many little cores, of which a few can be fused into a wide-issue processor] (Salverda et al. HPCA'08)
  • 38. 1. Changing core-count [Figure: idealized processor; fusing in-order cores] (Salverda et al. HPCA'08)
  • 39. 1. Changing core-count [Figure: 32 2-wide config.; 8 processor config.; one 64-wide config.] (Kim et al. MICRO’07)
  • 40. 1. Changing core-count [Figure: a reconfigurable AMP] (Pricopi et al. TACO'11)
  • 41. 1. Changing core-count [Figure: different granularities of parallel processing elements, from wide-issue processors with many ALUs each to PIM; one end exploits fine-grain parallelism more effectively, the other runs more applications effectively] PIM = processor in memory (Sankaralingam et al. ISCA'03)
  • 42. 1. Changing core-count [Figure: a reconfigurable AMP where multiple scalar cores can be united to create a larger superscalar processor] (Chiu et al. ICPP’10)
  • 43. 2. Trading resources between cores [Figure: a reconfigurable AMP made of asymmetric building blocks, some of them faulty] (Gupta et al. MICRO’10)
  • 44. 2. Trading resources between cores [Figure: a 3D reconfigurable AMP with poolable resources (registers, instruction queue, reorder buffer, cache space, load and store queues, etc.) in another layer] (Homayoun et al. HPCA'11)
  • 45. 2. Trading resources between cores. Dynamic core morphing (1/2) [Figure: baseline configuration for two heterogeneous cores] (Rodrigues et al. PACT’11)
  • 46. 2. Trading resources between cores. Dynamic core morphing (2/2) [Figure: morphed configuration for two heterogeneous cores. Red: connectivity for strong morphed core. Black: connectivity for weak core]
  • 47. 2. Trading resources between cores [Figure: pipeline-level view of the resource sharing] (Rodrigues et al. VLSID’14)
  • 48. 3. Morphing core architecture [Figure: baseline 4-way OOO core; baseline core morphed into an InO core] (Srinivasan et al. ISVLSI’13)
  • 49. 3. Morphing core architecture [Figure: composite core architecture] (Lukefahr et al. MICRO’12)
  • 50. References • S. Mittal, “A Survey of Techniques for Architecting and Managing Asymmetric Multicore Processors”, ACM Computing Surveys, 2016