Multithreading: Exploiting Thread-Level
Parallelism to Improve Uniprocessor Throughput
Seminar on:
Advanced Computer Architecture
Outline
 Multithreading
 Multithreading approaches
 How Resources are Shared?
 Effectiveness of Fine MT on the Sun T1
 Effectiveness of SMT on Superscalar Processors
 References
Multithreading
 Multithreading
 is a primary technique for exposing more parallelism to the hardware. In a
strict sense, multithreading exploits thread-level parallelism, but its role
both in improving pipeline utilization and in GPUs motivates introducing the
concept here. Although increasing performance by using ILP is largely
transparent to the programmer, ILP can be limited or hard to exploit in some
applications, which motivates also exploiting TLP.
 Allows multiple threads to share the functional units of a single
processor in an overlapping fashion. In contrast, a more general
method to exploit thread-level parallelism (TLP) is with a multiprocessor
that has multiple independent threads operating at once and in
parallel.
 Does not duplicate the entire processor as a multiprocessor does.
Instead, multithreading shares most of the processor core among a
set of threads, duplicating only private state, such as the registers and
program counter.
Multithreading approaches
 Fine-grained multithreading
 Switches between threads on each clock, causing the execution of
instructions from multiple threads to be interleaved. This interleaving is
often done in a round-robin fashion, skipping any threads that are stalled
at that time (a minimal sketch of this policy follows this list).
 The advantage of this approach is that it can hide the throughput losses
(latency) that arise from both short and long stalls.
 The primary disadvantage of this approach is that it slows down the
execution of an individual thread, since a thread that is ready to execute
without stalls is delayed by instructions from other threads.
 Processors that use this approach include:
 The Sun Niagara.
 NVIDIA GPUs.
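A minimal sketch of the round-robin selection described above, assuming a simple per-thread stalled flag (an illustration, not the selection logic of any real processor):

```python
# Toy model: fine-grained multithreading picks a different thread every clock,
# round-robin, skipping threads that are stalled.

def fine_grained_select(threads, last):
    """Return the index of the next ready thread after `last`, or None if all are stalled."""
    n = len(threads)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if not threads[candidate]["stalled"]:
            return candidate
    return None  # every thread is stalled: the core idles this cycle

# Example: thread 1 is stalled (say, on a cache miss), so it is skipped.
threads = [{"stalled": False}, {"stalled": True}, {"stalled": False}, {"stalled": False}]
last = 0
for cycle in range(4):
    chosen = fine_grained_select(threads, last)
    print(f"cycle {cycle}: issue from thread {chosen}")
    if chosen is not None:
        last = chosen
```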
Multithreading approaches
 Coarse-grained multithreading
 Was invented as an alternative to fine-grained multithreading.
 Coarse-grained multithreading switches threads only on costly stalls, such
as level-two or level-three cache misses (see the sketch after this list).
 This relieves the need for thread switching to be essentially free, because
a switch happens only when a long stall is already under way.
 Less likely to slow down the execution of any one thread, since
instructions from other threads will only be issued when a thread
encounters a costly stall.
 Coarse-grained multithreading suffers from a major drawback: it is limited
in its ability to overcome throughput losses, especially from shorter stalls.
When a stall occurs, the pipeline must be refilled before the new thread's
instructions can complete, so switching only pays off when the stall time is
much longer than the refill time.
 No major processors use this technique.
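A minimal sketch of the switch-on-long-stall policy. The stall trace, the thread count, and the refill penalty REFILL_CYCLES are illustrative assumptions, not parameters of any real design:

```python
# Toy model of coarse-grained multithreading: keep issuing from one thread and
# switch only when it hits a costly stall (e.g., an L2/L3 miss). The switch is
# not free: the pipeline must refill, so a few cycles are lost.

REFILL_CYCLES = 3  # assumed pipeline refill penalty after a switch (illustrative)

def run_coarse_grained(long_stalls, n_threads, cycles):
    """long_stalls maps cycle -> thread id hitting a long stall that cycle (toy trace)."""
    current, refill_left, useful = 0, 0, 0
    for cycle in range(cycles):
        if refill_left > 0:
            refill_left -= 1                         # bubble while the pipeline refills
        elif long_stalls.get(cycle) == current:
            current = (current + 1) % n_threads      # switch only on a costly stall
            refill_left = REFILL_CYCLES
        else:
            useful += 1                              # the current thread issues normally
    return useful

# Thread 0 misses in the L2 at cycle 10: the core switches to thread 1, pays
# the refill bubble, then keeps running thread 1 for the remaining cycles.
print(run_coarse_grained({10: 0}, n_threads=4, cycles=100))   # -> 96 useful cycles
```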
Multithreading approaches
 Simultaneous multithreading (SMT)
 The most common implementation of multithreading; it is a variation on
fine-grained multithreading.
 It arises naturally when fine-grained multithreading is implemented on top
of a multiple-issue, dynamically scheduled processor.
 Exploits thread-level parallelism at the same time it exploits ILP; SMT
uses TLP to hide long-latency events in a processor.
 The key insight in SMT is that register renaming and dynamic scheduling
allow multiple instructions from independent threads to be executed without
regard to the dependences among them (a toy issue model follows this list).
 The resolution of the dependences can be handled by the dynamic
scheduling capability.
 The Intel Core i7 and IBM POWER7 use SMT.
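A minimal sketch of the issue stage implied above: each cycle the issue slots are filled greedily with ready instructions from any thread. The issue width and the per-thread ready counts are illustrative assumptions:

```python
# Toy model of SMT issue: up to ISSUE_WIDTH ready instructions are chosen
# across all threads each cycle, so one thread's stall or low ILP does not
# leave slots empty as long as another thread has ready work.

ISSUE_WIDTH = 4  # assumed issue width (illustrative)

def smt_issue(ready_per_thread):
    """ready_per_thread[t] = ready instructions in thread t this cycle.
    Returns (thread, count) pairs that fill the issue slots."""
    issued, slots_left = [], ISSUE_WIDTH
    for t, ready in enumerate(ready_per_thread):
        take = min(ready, slots_left)
        if take:
            issued.append((t, take))
            slots_left -= take
        if slots_left == 0:
            break
    return issued

# Thread 0 has limited ILP (1 ready), thread 2 is stalled (0 ready);
# threads 1 and 3 fill the remaining slots.
print(smt_issue([1, 2, 0, 3]))   # -> [(0, 1), (1, 2), (3, 1)]
```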
Multithreading approaches
How Resources are Shared?
 The following figure shows the differences in a processor's ability to
exploit the resources of a superscalar for the following configurations:
 A superscalar with no multithreading support
 A superscalar with coarse-grained multithreading
 A superscalar with fine-grained multithreading
 A superscalar with simultaneous multithreading
 In the superscalar without multithreading support, the use of issue slots
is limited by a lack of ILP, including the ILP needed to hide memory latency.
Because of the length of L2 and L3 cache misses, much of the processor can be
left idle (a back-of-envelope estimate follows).
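A back-of-envelope estimate of that idleness. Every number below is an illustrative assumption, not a measurement of any particular processor:

```python
# Rough estimate: how many issue slots a 4-wide superscalar with NO
# multithreading leaves unused when long cache misses stall the whole core.

ISSUE_WIDTH  = 4      # issue slots per clock (assumed)
AVG_ISSUE    = 2.0    # instructions issued per non-stalled cycle, i.e., the ILP limit (assumed)
MISS_EVERY   = 100    # one long miss per this many instructions (assumed)
MISS_PENALTY = 60     # stall cycles per miss, roughly an L2/L3 miss (assumed)

instructions = 10_000
busy_cycles  = instructions / AVG_ISSUE
stall_cycles = (instructions / MISS_EVERY) * MISS_PENALTY
utilization  = instructions / ((busy_cycles + stall_cycles) * ISSUE_WIDTH)
print(f"issue-slot utilization ~ {utilization:.0%}")   # ~ 23% with these assumptions
```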
How Resources are Shared?
Figure 1 How four different approaches use the functional unit execution slots of a
superscalar processor.
 The horizontal dimension represents the instruction execution capability in each clock.
 The vertical dimension represents a sequence of clock cycles.
 An empty (white) box indicates that the corresponding execution slot is unused.
 The shades of gray and black correspond to four different threads in the multithreaded
processors.
How Resources are Shared?
 In the coarse-grained multithreaded superscalar, the long stalls are
partially hidden by switching to another thread that uses the resources of
the processor. This switching reduces the number of completely idle clock
cycles. Because thread switching only occurs when there is a stall, however,
and the new thread has a start-up period, some fully idle cycles are likely
to remain.
 Fine-grained multithreading can only issue instructions from a single
thread in a given cycle, so it cannot fill every issue slot every cycle, but
long stalls such as cache misses can be tolerated.
 Simultaneous multithreading can issue instructions from any thread in every
cycle, so it has the highest probability of finding work for every issue
slot.
 Sun T1 Processor Overview
 The T1 is a fine-grained multithreaded, multicore microprocessor introduced
by Sun in 2005.
 It is totally focused on exploiting thread-level parallelism (TLP) rather
than instruction-level parallelism (ILP).
 It returned to a simple pipeline strategy and used multiple cores and
multithreading to produce throughput.
 Eight processor cores, each supporting four threads; each core consists of
a single-issue, six-stage pipeline (a standard five-stage RISC pipeline with
one stage added for thread switching). See the back-of-envelope sketch after
this list.
 The Sun T1 had the best performance on integer applications with extensive
TLP and demanding memory performance, such as SPECJBB and transaction-processing
workloads.
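A back-of-envelope view of the organization just listed, using only the core and thread counts stated above:

```python
# The T1 organization described above: 8 single-issue cores, 4 threads each.
CORES, THREADS_PER_CORE, ISSUE_WIDTH = 8, 4, 1

hardware_threads = CORES * THREADS_PER_CORE           # 32 threads keep the chip busy
peak_chip_ipc    = CORES * ISSUE_WIDTH                # at most 8 instructions per clock
ideal_core_cpi   = 1 / ISSUE_WIDTH                    # 1.0: one instruction per clock per core
ideal_thread_cpi = ideal_core_cpi * THREADS_PER_CORE  # 4.0: each thread gets every 4th slot

print(hardware_threads, peak_chip_ipc, ideal_core_cpi, ideal_thread_cpi)
```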
Effectiveness of Fine MT on the Sun T1
Figure 2 A summary of the T1 processor
Effectiveness of Fine MT on the Sun T1
 T1 Multithreading Unicore Performance
 To examine the performance of the T1 we use three server-oriented
benchmarks:
 TPC-C
 SPECJBB
 SPECWeb99
 Since multiple threads increase the memory demand from a single processor,
they could overload the memory system, leading to reductions in the potential
gain from multithreading.
 The next figures show the effectiveness of fine-grained MT on the Sun T1.
Effectiveness of Fine MT on the Sun T1
Figure 3 The relative change in the miss rates and miss latencies when executing
with one thread per core versus four threads per core on the TPC-C benchmark.
Effectiveness of Fine MT on the Sun T1
Figure 4 Breakdown of the status on an average thread.
 Remember that not ready does not imply that the core with that thread is
stalled; it is only when all four threads are not ready that the core will
stall.
 A thread can be not ready due to cache misses or pipeline delays (a toy
probability model follows).
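A toy probability model of that observation. It assumes the four threads' readiness is independent, which real workloads are not, so it only illustrates the shape of the effect:

```python
# A core stalls only when ALL four of its threads are not ready, so even a
# fairly high per-thread "not ready" fraction leaves the core issuing on most
# cycles (independence across threads is an assumption made for illustration).

def core_stall_fraction(p_not_ready, threads=4):
    return p_not_ready ** threads

for p in (0.40, 0.60, 0.75):
    print(f"thread not ready {p:.0%} of cycles -> core stalled ~{core_stall_fraction(p):.1%}")
```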
Effectiveness of Fine MT on the Sun T1
Figure 5 The breakdown of causes for a thread being not ready.
 A thread can be not ready due to cache misses or pipeline delays.
 The figure above shows the relative frequency of these causes.
Effectiveness of Fine MT on the Sun T1
Figure 6 The per-thread CPI, the per-core CPI, the effective eight-core CPI, and
the effective IPC (inverse of CPI) for the eight-core T1 processor.
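The arithmetic relating the quantities in Figure 6, shown with a hypothetical per-thread CPI (the value below is an assumed example, not the measured T1 number):

```python
# Four threads share one single-issue pipeline, and eight such cores issue in
# parallel, so per-core and chip-wide CPI follow from the per-thread CPI.

THREADS_PER_CORE, CORES = 4, 8

per_thread_cpi = 7.2                                 # hypothetical example value
per_core_cpi   = per_thread_cpi / THREADS_PER_CORE   # threads interleave on one pipeline
chip_cpi       = per_core_cpi / CORES                # eight cores run concurrently
chip_ipc       = 1 / chip_cpi                        # effective instructions per clock

print(per_core_cpi, chip_cpi, round(chip_ipc, 1))    # 1.8 0.225 4.4
```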
Effectiveness of SMT on Superscalar Processors
 Early simulation studies suggested large gains from SMT, but those results
were optimistic.
 In practice, existing implementations show that the gain from SMT is more
modest.
 The Intel Core i7 supports SMT with two threads. The following figure shows
the performance ratio and the energy efficiency ratio.
Figure 7 The speedup from using multithreading on one core on an i7 processor
averages 1.28 for the Java benchmarks and 1.31 for the PARSEC benchmarks.
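A sketch of how the two ratios plotted in Figure 7 are typically formed. The measurements below are made-up illustrative values; only the average speedups quoted in the caption come from the source:

```python
# Speedup compares run time with and without SMT; the energy efficiency ratio
# compares the energy of the two runs (values above 1 favor SMT in both cases).

def smt_ratios(time_1thread, time_smt, energy_1thread, energy_smt):
    speedup           = time_1thread / time_smt      # > 1: SMT is faster
    energy_efficiency = energy_1thread / energy_smt  # > 1: SMT uses less energy
    return speedup, energy_efficiency

# Hypothetical benchmark where SMT cuts run time ~22% and energy ~5%.
print(smt_ratios(time_1thread=100.0, time_smt=78.0, energy_1thread=50.0, energy_smt=47.5))
```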
Effectiveness of SMT on Superscalar Processors
 In the PARSEC benchmarks, SMT reduces energy by 7%. These results clearly
show that SMT in an aggressive speculative processor with extensive support
for SMT can improve performance in an energy-efficient fashion, which the
more aggressive ILP approaches have failed to do.
 Indeed, Esmaeilzadeh et al. [2011] show that the energy improvements from
SMT are even larger on the Intel i5 (a processor similar to the i7, but with
smaller caches and a lower clock rate) and the Intel Atom (an 80×86 processor
designed for the netbook market).
Effectiveness of SMT on Superscalar Processors
References
 David A. Patterson and John L. Hennessy, Computer Architecture: A
Quantitative Approach, Morgan Kaufmann (an imprint of Elsevier), Waltham, MA,
2012, pp. 223–232.
Thank You…