Chip Multithreading Systems
Need a New Operating System
Scheduler
Purpose of Paper
• Scheduling on single-processor MT systems
• Our proposal involves building a scheduler that would model relative
resource contention resulting from different potential schedules, and use
this as a basis for its decisions.
• In situations where a good prediction is difficult to achieve, we may resort to
using our model of resource contention to reduce the size of the problem
space, and then use the algorithms proposed in earlier work to make final
scheduling decisions.
Abstract
• The unpredictable nature of modern workloads, with frequent branches and
control transfers, can result in processor pipeline utilization as low as
19%.
• Chip multithreading (CMT), a processor architecture combining chip
multiprocessing and hardware multithreading, is designed to address this
issue.
• A CMT-savvy operating system scheduler could improve application
performance by a factor of two.
INTRODUCTION
• Applications consist of multiple threads of control, and executing short
stretches of integer operations with frequent dynamic branches can lead to
poor CPU utilization.
• This can:
1. Hurt cache locality
2. Degrade branch prediction
3. Cause frequent processor stalls
INTRODUCTION
• Two techniques were designed to improve processor utilization for
transaction-processing-like workloads
1. Chip multiprocessing (CMP)
2. hardware multithreading (MT)
• Both do so by offering better support for thread-level parallelism (TLP).
Chip Multiprocessing (CMP)
• A CMP processor includes multiple processor cores on a single chip
• It allows more than one thread to be active at a time and improves
utilization of chip resources.
Hardware Multithreading (MT)
• An MT processor has multiple sets of registers and other thread state, and
interleaves execution of instructions from different threads.
• This is done by
1. Switching between threads (as often as every cycle), or
2. Executing instructions from multiple threads simultaneously (if the
instructions use different functional units).
• As a result, if one thread is blocked on a memory access or some other long-
latency operation, other threads can make forward progress.
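This latency-hiding behavior can be sketched with a toy simulation. All numbers here are illustrative assumptions (the 20-cycle block latency and the round-robin policy are not from the paper): every instruction blocks its thread, and the core switches to the first ready context each cycle.

```python
# Toy model of fine-grained hardware multithreading (illustrative only).
# Every issued instruction blocks its thread for MISS_LATENCY cycles;
# the core round-robins over hardware contexts, skipping blocked threads.

MISS_LATENCY = 20  # assumed cycles a thread stays blocked after issuing

def utilization(num_contexts: int, cycles: int = 10_000) -> float:
    ready_at = [0] * num_contexts  # cycle at which each context unblocks
    issued = 0
    for cycle in range(cycles):
        # issue from the first ready context in round-robin order
        for i in range(num_contexts):
            ctx = (cycle + i) % num_contexts
            if ready_at[ctx] <= cycle:
                issued += 1
                ready_at[ctx] = cycle + MISS_LATENCY
                break
    return issued / cycles

# A single context idles during the long-latency operation; four contexts
# overlap their delays and keep the pipeline busier.
print(utilization(1), utilization(4))
```

With one context the core issues once per 20 cycles; with four contexts, four threads overlap their stalls, quadrupling utilization under these assumed latencies.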
Problem
• A major factor affecting performance on systems with multiple hardware
contexts is operating system scheduling.
• For example,
One OS scheduler that is tailored for MT processors produces an
average performance improvement of 17%. Moreover, it has been pointed out
that a naïve scheduler can hurt performance, making a multithreaded
processor perform worse than a single-threaded processor.
• Experiments have shown that the potential for performance gain from a
specialized scheduler on CMT can be as large as a factor of two.
Scheduler Design Problems
• The authors believe that scheduler designs proposed for single-processor
multithreaded systems do not scale up to the dozens of hardware threads
expected on proposed CMT processors.
• Such systems will require a fundamentally new design for the operating
system scheduler.
• An ideal scheduler will assign threads to processors in a way that minimizes
resource contention and maximizes system throughput, and would do so in a
scalable fashion.
Scheduler Design Problems
• The scheduler must understand how its scheduling decisions will affect
resource contention, because resource contention ultimately determines
performance.
EXPERIMENTAL PLATFORM
• A CMT system simulator, built as a set of extensions to the Simics
simulation toolkit, was developed to study the performance of the system.
• This toolkit models systems with multiple multithreaded CPU cores.
• The model of a multithreaded processor core is based on the concept of fine-
grained multithreading
• An MT core has multiple hardware contexts (usually one, two, four, or eight).
• Each context consists of a set of registers and other thread state.
• Such a processor interleaves execution of instructions from the threads,
switching between contexts on each cycle.
SIMULATED CPU SPECS
• It has a simple RISC pipeline, with one set of functional units (arithmetic
unit, load/store unit, etc.).
• The processor that is modeled is typically configured with four hardware
contexts per CPU core.
• Each core has a single shared TLB and L1 data and instruction caches.
• The (integrated) L2 cache is shared by all of the CPU cores on the chip.
• One alternative to the architecture that we have chosen is to have multiple
sets of functional units on each CPU core. This is termed a simultaneous
multithreaded (SMT) system.
SMT SYSTEMS
• SMT systems are more complex and require more chip real estate.
• We have instead taken the approach of leveraging a simple, classical RISC
core in order to allow space for more cores on each chip.
• This allows for a higher degree of multithreading in the system, resulting in
higher throughput for multithreaded and multiprogrammed workloads.
Multithreaded workloads
• A thread may become blocked when it encounters a long latency operation,
such as servicing a cache miss.
• When one or more threads are unavailable, the system continues to switch
among the remaining available threads.
• For multithreaded workloads, this improves processor utilization and hides
the latency of long operations.
• This latency-hiding property is at the heart of hardware multithreading.
STUDYING RESOURCE CONTENTION
• On a CMT system, each hardware context appears as a logical processor to
the operating system
• A software thread is assigned to a hardware context for the duration of the
scheduling time slice.
• Threads that share a processor compete for resources.
Resource contention is a conflict over access to a shared resource such as random access memory, disk storage, cache memory
STUDYING RESOURCE CONTENTION
• There are many different categories of resource contention.
• The authors focus on the processor pipeline, because that category of
contention characterizes the difference between single-threaded and
multithreaded processors.
STUDYING RESOURCE CONTENTION
• When assigning threads to hardware contexts, the scheduler has to decide
which threads should be run on the same processor, and which threads
should be run separately.
• The optimal thread assignment should result in high utilization of the
processor.
• If we are to design a scheduler that can find good thread assignments, we
must understand the causes and effects of contention among the threads
that share a processor.
• It is observed that the instruction mix executed by a workload is an
important factor in determining the level of contention for the processor
pipeline.
• The key to understanding why this is the case is the concept of instruction
delay latency.
INSTRUCTION DELAY LATENCY
• A processor keeps a copy of architectural state for each active hardware
context.
• When a thread performs a long-latency operation, it is blocked
• Subsequent instructions to be issued by that thread are delayed until the
operation completes.
• Duration of this delay is termed as the instruction delay latency.
• ALU instructions have zero delay latency (the authors model a pipeline that
implements result bypassing).
Processor Pipeline Contention
• A load that hits in the L1 cache has a latency of four cycles.
• A branch delays the subsequent instruction by two cycles.
• Processor pipeline contention depends on the latencies of the instructions
that the workload executes.
• If a thread is running a workload dominated by instructions with long delay
latencies, such as memory loads, it will often let functional units go unused.
• This leaves ample opportunity for other threads to use them.
• Resource contention in this case is low.
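The link between instruction mix and pipeline contention can be made concrete with a small calculation using the latencies quoted above (ALU: 0 cycles, L1 load hit: 4, branch: 2). The mix fractions below are illustrative, not measurements from the paper.

```python
# Average instruction delay latency for a workload's instruction mix,
# using the per-instruction delay latencies quoted in the text.

DELAY = {"alu": 0, "load": 4, "branch": 2}  # cycles

def avg_delay(mix: dict) -> float:
    """mix maps instruction class -> fraction of dynamic instructions."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(frac * DELAY[kind] for kind, frac in mix.items())

cpu_bound = {"alu": 1.0, "load": 0.0, "branch": 0.0}
mem_bound = {"alu": 0.0, "load": 1.0, "branch": 0.0}
mixed     = {"alu": 0.6, "load": 0.3, "branch": 0.1}  # hypothetical mix

print(avg_delay(cpu_bound))  # 0 cycles -> high pipeline contention
print(avg_delay(mem_bound))  # 4 cycles -> low pipeline contention
print(avg_delay(mixed))
```

A low average delay means the thread wants the pipeline nearly every cycle; a high average delay means it frequently leaves the functional units free.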
Processor Pipeline Contention
• Alternatively, if a thread is running an instruction mix consisting strictly of
ALU operations, it can keep the pipeline busy at all times.
• The performance of other threads co-scheduled with this thread will suffer
accordingly.
Processor Pipeline Contention
• To assess the severity of pipeline contention for a workload, it is useful to
compare how it would perform on an MT core with how it would perform on
a conventional single-threaded core and on a traditional multiprocessor.
• Performance on the conventional processor provides the theoretical lower
bound
• Performance on the multiprocessor provides the upper bound.
• When contention is low, performance on an MT core should approach that of
an MP system, where each thread has all functional units to itself.
• When contention is high, performance of an MT core will be no better than
that of a single-threaded processor, where a single thread monopolizes all
resources.
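These bounds can be illustrated with a back-of-the-envelope model. This model is our own simplification, not the paper's: a lone thread with average delay latency D issues one instruction every 1 + D cycles; an n-context MT core overlaps delays but still has a single pipeline (at most one issue per cycle); an n-way MP has n independent pipelines.

```python
# Simplified analytic model (an assumption for illustration, not from the
# paper): throughput in instructions per cycle as a function of average
# instruction delay latency d and degree of parallelism n.

def single(d: float) -> float:
    return 1 / (1 + d)            # one thread, one pipeline

def mp(d: float, n: int) -> float:
    return n / (1 + d)            # n threads, n pipelines

def mt(d: float, n: int) -> float:
    return min(1.0, n / (1 + d))  # n threads share one pipeline

for d in (0, 4):  # ALU-only (d = 0) vs. load-only (d = 4) workloads
    print(d, single(d), mt(d, 4), mp(d, 4))
```

The model reproduces the claim above: at d = 0 (high contention) the MT core is no better than the single-threaded processor, while at d = 4 (low contention) it matches the four-way multiprocessor.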
CPU BOUND VS MEMORY BOUND
• Three different types of systems are used in this experiment:
A. A single-threaded processor
B. A multithreaded processor with four hardware contexts
C. A four-way multiprocessor
CPU BOUND VS MEMORY BOUND
• When running the CPU-bound workload, A and B will perform comparably.
• C will have a throughput four times greater than the other two systems,
because it has four times as many functional units.
CPU BOUND VS MEMORY BOUND
• When running the memory-bound workload, System B and System C will perform
comparably, outperforming System A by a factor of four.
CPU BOUND VS MEMORY BOUND
• The instruction mixes executed by the threads have been hand-crafted to
consist strictly of ALU instructions in the CPU-bound case, and strictly of
load instructions in the memory-bound case.
EXPERIMENTAL RESULTS
• This simple experiment demonstrates that the instruction mix, and, more
precisely, the average instruction delay latency, can be used as a heuristic
for approximating the processor pipeline requirements for a workload.
• The scheduler can use the information on the workload's instruction mix for
its scheduling decisions.
• A thread with an instruction mix dominated by long-latency instructions can
leave functional units underutilized.
• Therefore, it is logical to co-schedule it with a thread that is running a lot of
short-latency instructions and has high demand for functional units.
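The co-scheduling rule above can be sketched as a simple pairing heuristic: sort threads by their average instruction delay latency and pair the hungriest (shortest latency) with the most-stalled (longest latency). The thread names and latency values below are hypothetical.

```python
# Sketch of the co-scheduling heuristic: pair a thread that mostly stalls
# (long average delay latency) with a thread that has high demand for the
# functional units (short average delay latency). Values are illustrative.

def pair_threads(latencies: dict) -> list:
    """Return (low-latency, high-latency) pairs for co-scheduling."""
    order = sorted(latencies.items(), key=lambda kv: kv[1])
    pairs = []
    while len(order) >= 2:
        low = order.pop(0)    # hungriest for the pipeline
        high = order.pop(-1)  # mostly stalled
        pairs.append((low[0], high[0]))
    return pairs

lat = {"t0": 0.0, "t1": 4.0, "t2": 0.5, "t3": 3.0}  # hypothetical threads
print(pair_threads(lat))
```

Each pair then shares a hardware context group: the stalled thread's idle cycles become issue slots for its pipeline-hungry partner.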
Limitations of this research paper
• We stress that this is a heuristic; since it does not model contention for other
resources, it does not always predict performance accurately.
• For example,
This technique does not take into account the effects of cache contention
that surface when threads with large working sets are running on the same
processor. Consider the memory-bound workload, modified to vary the
working-set size of a thread from eight bytes to 2MB.
CPU BOUND VS MEMORY BOUND
• This figure shows the results of comparing the three systems.
• According to the simple heuristic, B and C should perform comparably.
• However, in the middle part of the graph the two curves diverge.
• The cause is the reduced L1 data cache hit rate on B, where four threads
share the cache on the processor.
SCHEDULING
• Cycles-per-instruction (CPI) can be easily determined using hardware
counters.
• This gives a useful window into the dynamic instruction mix of a workload.
• CPI nicely captures average instruction delay, which can serve as a first
approximation for making scheduling decisions.
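Computing CPI from counters amounts to dividing the deltas of two counts sampled over a time slice. The counter values below are hypothetical, purely to show the arithmetic.

```python
# CPI from hardware counter samples: cycle and retired-instruction counts
# read at the start and end of a scheduling time slice (values invented
# for illustration).

def cpi(cycles_start: int, insns_start: int,
        cycles_end: int, insns_end: int) -> float:
    """Cycles per instruction over the sampled interval."""
    return (cycles_end - cycles_start) / (insns_end - insns_start)

# e.g. 8M cycles elapsed while 2M instructions retired -> CPI of 4
print(cpi(1_000_000, 0, 9_000_000, 2_000_000))
```

A CPI near 1 marks a pipeline-hungry (CPU-bound) thread; a large CPI marks a thread that spends most of its time stalled.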
SCHEDULING (Experiment)
• In the following experiment, each thread's single-threaded CPI (the CPI that
would be observed if the thread were running on a dedicated processor) is
used.
• Threads with high CPIs usually have low pipeline resource requirements,
because they spend much of their time blocked on memory or executing
long-latency instructions.
• This results in functional units being left unused.
• Threads with low CPIs have high resource requirements, as they spend little
time stalled.
SCHEDULING (Experiment)
• In this experiment, the simulated processor is configured with four cores and
four hardware contexts per core.
• The workload consists of 16 threads: four each with CPIs of one (CPU-bound),
six, 11, and 16.
• Four different schedules are run, assigning threads to cores.
• According to the figure, schedules (a) and (b) perform twice as well as (d)
and 1.5 times better than (c).
SCHEDULING (Experiment)
• Schedules (a) and (b) perform better than schedules (c) and (d) because
they schedule threads with low resource requirements (high CPI) together
with threads that have high resource requirements (low CPI).
• In schedules (c) and (d), there are two processor cores running multiple
threads with a single-threaded CPI of 1.
• Those cores would be fully utilized, while other cores would be under-
utilized.
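A schedule in the spirit of (a) and (b) can be sketched as a CPI-balancing assignment: sort threads by single-threaded CPI and deal them out in snake order so each core mixes low- and high-CPI threads. The snake-order policy is our illustrative choice, not the paper's algorithm; the thread CPIs are the ones from the experiment.

```python
# Sketch of a CPI-balancing assignment: spread the 16 threads from the
# experiment (single-threaded CPIs 1, 6, 11, 16; four of each) across
# four cores so no core is monopolized by low-CPI (CPU-bound) threads.

def balanced_schedule(cpis: list, num_cores: int) -> list:
    order = sorted(cpis, reverse=True)
    cores = [[] for _ in range(num_cores)]
    for i, c in enumerate(order):
        # snake (boustrophedon) order keeps per-core CPI sums nearly equal
        row, col = divmod(i, num_cores)
        idx = col if row % 2 == 0 else num_cores - 1 - col
        cores[idx].append(c)
    return cores

threads = [1] * 4 + [6] * 4 + [11] * 4 + [16] * 4
for core in balanced_schedule(threads, 4):
    print(core, sum(core))
```

Under this assignment every core receives one thread of each CPI class, so the CPU-bound threads never compete with each other for a pipeline, matching the property that makes schedules (a) and (b) outperform (c) and (d).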
QUESTIONS!!

Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 

Chip Multiprocessing (CMP)
• A CMP processor includes multiple processor cores on a single chip.
• It allows more than one thread to be active at a time and improves utilization of chip resources.
• An MT processor has multiple sets of registers and other thread state, and interleaves execution of instructions from different threads.
• This is done by
1. Switching between threads (as often as on each cycle), or
2. Executing instructions from multiple threads simultaneously (if the instructions use different functional units).
• As a result, if one thread is blocked on a memory access or some other long-latency operation, other threads can make forward progress.
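The interleaving described above can be sketched as a toy model. This is an illustrative sketch, not the paper's simulator: a core that round-robins over its hardware contexts, switching on every cycle and skipping any context that is blocked on a long-latency operation.

```python
# Toy model of a fine-grained multithreaded (MT) core. Each "thread" is a
# hardware context; `blocked_until` marks the cycle at which a context's
# pending long-latency operation completes. The core issues one instruction
# per cycle from the next ready context, round-robin.

def run_mt_core(threads, cycles):
    """threads: list of dicts with a 'blocked_until' cycle number.
    Returns instructions retired per context over `cycles` cycles."""
    retired = [0] * len(threads)
    ctx = 0
    for cycle in range(cycles):
        # Round-robin: pick the next context that is ready this cycle.
        for i in range(len(threads)):
            cand = (ctx + i) % len(threads)
            if threads[cand]["blocked_until"] <= cycle:
                retired[cand] += 1          # issue one instruction
                ctx = (cand + 1) % len(threads)
                break
        # If every context is blocked, the pipeline stalls this cycle.
    return retired
```

With one context blocked for the first 50 cycles, the other context absorbs every cycle until it wakes up, so the pipeline never idles as long as some context is ready — which is exactly the forward-progress property the slide describes.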
Problem
• A major factor affecting performance on systems with multiple hardware contexts is operating system scheduling.
• For example, one OS scheduler that is tailored for MT processors produces an average performance improvement of 17%. Moreover, it has been pointed out that a naïve scheduler can hurt performance, making a multithreaded processor perform worse than a single-threaded processor.
• Experiments have shown that the potential performance gain from a specialized scheduler on CMT can be as large as a factor of two.
Scheduler Design Problems
• The authors believe that scheduler designs proposed for single-processor multithreaded systems do not scale up to the dozens of hardware threads that we expect to find on proposed CMT processors.
• Such systems will require a fundamentally new design for the operating system scheduler.
• An ideal scheduler will assign threads to processors in a way that minimizes resource contention and maximizes system throughput, and will do so in a scalable fashion.
Scheduler Design Problems
• The scheduler must understand how its scheduling decisions will affect resource contention.
• This is because resource contention ultimately determines performance.
EXPERIMENTAL PLATFORM
• A CMT system simulator toolkit, built as a set of extensions to the Simics simulation toolkit, was developed to study the performance of the system.
• This toolkit models systems with multiple multithreaded CPU cores.
• The model of a multithreaded processor core is based on the concept of fine-grained multithreading.
• An MT core has multiple hardware contexts (usually one, two, four, or eight).
• Each context consists of a set of registers and other thread state.
• Such a processor interleaves execution of instructions from the threads, switching between contexts on each cycle.
SIMULATED CPU SPECS
• It has a simple RISC pipeline, with one set of functional units (arithmetic unit, load/store unit, etc.).
• The processor that is modeled is typically configured with four hardware contexts per CPU core.
• Each core has a single shared TLB and L1 data and instruction caches.
• The (integrated) L2 cache is shared by all of the CPU cores on the chip.
• One alternative to the architecture that we have chosen is to have multiple sets of functional units on each CPU core. This is termed a simultaneous multithreaded (SMT) system.
SMT SYSTEMS
• SMT systems are more complex and require more chip real estate.
• We have instead taken the approach of leveraging a simple, classical RISC core in order to allow space for more cores on each chip.
• This allows for a higher degree of multithreading in the system, resulting in higher throughput for multithreaded and multiprogrammed workloads.
Multithreaded workloads
• A thread may become blocked when it encounters a long-latency operation, such as servicing a cache miss.
• When one or more threads are unavailable, the system continues to switch among the remaining available threads.
• For multithreaded workloads, this improves processor utilization and hides the latency of long operations.
• This latency-hiding property is at the heart of hardware multithreading.
STUDYING RESOURCE CONTENTION
• On a CMT system, each hardware context appears as a logical processor to the operating system.
• A software thread is assigned to a hardware context for the duration of the scheduling time slice.
• Threads that share a processor compete for resources. Resource contention is a conflict over access to a shared resource such as random access memory, disk storage, or cache memory.
STUDYING RESOURCE CONTENTION
• There are many different categories of resource contention.
• The authors focus on contention for the processor pipeline.
• This is because that category of contention characterizes the difference between single-threaded and multithreaded processors.
STUDYING RESOURCE CONTENTION
• When assigning threads to hardware contexts, the scheduler has to decide which threads should be run on the same processor, and which threads should be run separately.
• The optimal thread assignment should result in high utilization of the processor.
• If we are to design a scheduler that can find good thread assignments, we must understand the causes and effects of contention among the threads that share a processor.
• It is observed that the instruction mix executed by a workload is an important factor in determining the level of contention for the processor pipeline.
• The key to understanding why this is the case is the concept of instruction delay latency.
INSTRUCTION DELAY LATENCY
• A processor keeps a copy of architectural state for each active hardware context.
• When a thread performs a long-latency operation, it is blocked.
• Subsequent instructions to be issued by that thread are delayed until the operation completes.
• The duration of this delay is termed the instruction delay latency.
• ALU instructions have a delay latency of zero (the authors model a pipeline that implements result bypassing).
Processor Pipeline Contention
• A load that hits in the L1 cache has a latency of four cycles.
• A branch delays the subsequent instruction by two cycles.
• Processor pipeline contention depends on the latencies of the instructions that the workload executes.
• If a thread is running a workload dominated by instructions with long delay latencies, such as memory loads, it will often let functional units go unused.
• This leaves ample opportunities for other threads to use them.
• Resource contention in this case is low.
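As a rough illustration of how instruction mix maps to pipeline demand, one can weight each instruction class by the delay latencies quoted above (ALU: 0 cycles, L1 load hit: 4 cycles, branch: 2 cycles). The mix fractions below are made-up examples, not measurements from the paper:

```python
# Delay latencies per instruction class, taken from the values the
# slides quote (ALU: 0, L1 load hit: 4, branch: 2).
DELAY = {"alu": 0, "load": 4, "branch": 2}

def avg_delay_latency(mix):
    """mix: dict mapping instruction class -> fraction of instructions.
    Returns the mix's average instruction delay latency in cycles."""
    return sum(frac * DELAY[kind] for kind, frac in mix.items())

cpu_bound = avg_delay_latency({"alu": 1.0})    # 0.0 — high pipeline demand
mem_bound = avg_delay_latency({"load": 1.0})   # 4.0 — low pipeline demand
mixed = avg_delay_latency({"alu": 0.6, "load": 0.3, "branch": 0.1})
```

A low average (near zero) signals a thread that will keep the pipeline busy; a high average signals a thread that leaves issue slots free for co-scheduled threads.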
Processor Pipeline Contention
• Alternatively, if a thread is running an instruction mix consisting strictly of ALU operations, it can keep the pipeline busy at all times.
• The performance of other threads co-scheduled with this thread will suffer accordingly.
Processor Pipeline Contention
• To assess the severity of pipeline contention for a workload, it is useful to compare how it would perform on an MT core with how it would perform on a conventional single-threaded core and on a traditional multiprocessor.
• Performance on the conventional processor provides the theoretical lower bound.
• Performance on the multiprocessor provides the upper bound.
• When contention is low, performance on an MT core should approach that of an MP system, where each thread has all functional units to itself.
• When contention is high, performance of an MT core will be no better than that of a single-threaded processor, where a single thread monopolizes all resources.
CPU BOUND VS MEMORY BOUND
• Three different types of systems are used for this experiment:
A. A single-threaded processor
B. A multithreaded processor with four hardware contexts
C. A four-way multiprocessor
CPU BOUND VS MEMORY BOUND
• When running the CPU-bound workload, A and B will perform comparably.
• C will have a throughput four times greater than the other two systems.
• This is because it has four times as many functional units.
CPU BOUND VS MEMORY BOUND
• When running the memory-bound workload, System B and System C will perform comparably, outperforming System A by a factor of four.
CPU BOUND VS MEMORY BOUND
• The instruction mixes executed by the threads have been hand-crafted to consist strictly of ALU instructions in the CPU-bound case, and strictly of load instructions in the memory-bound case.
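The factor-of-four result can be reproduced with a toy throughput model (again a sketch, not the paper's simulator). Each context issues one instruction and then waits `delay` cycles before it can issue again; `delay=4` stands in for the 4-cycle L1 load hit quoted earlier, and `delay=1` models back-to-back ALU issue (an assumption of this sketch):

```python
def throughput(n_contexts, delay, cycles):
    """Instructions retired per cycle on a core that interleaves
    n_contexts threads, each of which must wait `delay` cycles
    between consecutive instruction issues."""
    next_ready = [0] * n_contexts   # cycle at which each context may issue
    retired = 0
    ctx = 0
    for cycle in range(cycles):
        for i in range(n_contexts):
            c = (ctx + i) % n_contexts
            if next_ready[c] <= cycle:
                retired += 1
                next_ready[c] = cycle + delay
                ctx = (c + 1) % n_contexts
                break
    return retired / cycles
```

For the memory-bound mix (`delay=4`), a single-threaded core sustains 0.25 instructions per cycle while a four-context MT core sustains 1.0 — the four threads exactly hide the load latency, matching the four-way MP. For the CPU-bound mix (`delay=1`), one context already saturates the pipeline, so extra contexts do not help.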
EXPERIMENTAL RESULTS
• This simple experiment demonstrates that the instruction mix, and, more precisely, the average instruction delay latency, can be used as a heuristic for approximating the processor pipeline requirements of a workload.
• The scheduler can use information on the workload's instruction mix in its scheduling decisions.
• A thread with an instruction mix dominated by long-latency instructions can leave functional units underutilized.
• Therefore, it is logical to co-schedule it with a thread that is running a lot of short-latency instructions and has high demand for functional units.
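The co-scheduling rule above can be sketched as a greedy pairing: sort threads by average delay latency and pair the most pipeline-hungry thread (shortest latency) with the least hungry (longest latency). The thread names and latency values here are hypothetical:

```python
def pair_threads(latencies):
    """latencies: {thread_id: average instruction delay latency}.
    Returns (low-latency, high-latency) pairs so each core mixes a
    pipeline-hungry thread with an undemanding one."""
    order = sorted(latencies, key=latencies.get)   # ascending latency
    pairs = []
    while order:
        low = order.pop(0)                   # high pipeline demand
        high = order.pop() if order else None  # low pipeline demand
        pairs.append((low, high))
    return pairs
```

For example, `pair_threads({"t0": 0.0, "t1": 4.0, "t2": 0.5, "t3": 3.0})` puts the pure-ALU thread `t0` with the load-heavy thread `t1`, and `t2` with `t3`.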
Limitations of this research paper
• We stress that this is a heuristic; since it does not model contention for other resources, it does not always predict performance.
• For example, this technique does not take into account the effects of cache contention that surface when threads with large working sets are running on the same processor.
• Consider, for example, the memory-bound workload, modified to vary the working set size of a thread from eight bytes to 2MB.
CPU BOUND VS MEMORY BOUND
• This figure shows the results of comparing the three systems.
• According to the simple heuristic, B and C should perform comparably.
• However, in the middle part of the graph the two curves diverge.
• The cause is the reduced L1 data cache hit rate on B, where four threads share the cache on the processor.
SCHEDULING
• Cycles-per-instruction (CPI) can be easily determined using hardware counters.
• This gives a useful window into the dynamic instruction mix of a workload.
• CPI nicely captures average instruction delay, which can serve as a first approximation for making scheduling decisions.
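Deriving CPI is a single division over two counter readings taken for the same interval. The counter values below are illustrative; real code would read cycle and retired-instruction counters from the hardware (for example, via the OS's performance-counter interface):

```python
def cpi(cycles, instructions):
    """Cycles-per-instruction over one sampling interval, from two
    hardware counter readings taken over the same interval."""
    if instructions == 0:
        raise ValueError("no instructions retired in the interval")
    return cycles / instructions

# A CPU-bound thread retires close to one instruction per cycle,
# while a memory-bound thread spends most cycles stalled:
cpu_bound_cpi = cpi(1_000_000, 1_000_000)   # 1.0
mem_bound_cpi = cpi(4_000_000, 1_000_000)   # 4.0
```

High CPI thus flags a thread with low pipeline demand, low CPI one with high demand — exactly the signal the scheduler needs.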
SCHEDULING (Experiment)
• In the following experiment, a thread's single-threaded CPI (the CPI that would be observed if the thread were running on a dedicated processor) is used.
• Threads with high CPIs usually have low pipeline resource requirements, because they spend much of their time blocked on memory or executing long-latency instructions.
• This results in functional units being left unused.
• Threads with low CPIs have high resource requirements, as they spend little time stalled.
SCHEDULING (Experiment)
• In this experiment, the simulated processor is configured with four cores and four threads on each core.
• The workload consists of 16 threads: four each with a CPI of one (CPU-bound), six, 11, and 16.
• Four different schedules are run, assigning threads to cores.
• According to the figure, schedules (a) and (b) perform twice as well as (d) and 1.5 times better than (c).
SCHEDULING (Experiment)
• Schedules (a) and (b) perform better than schedules (c) and (d) because they schedule threads with low resource requirements (high CPI) together with threads that have high resource requirements (low CPI).
• In schedules (c) and (d), there are two processor cores running multiple threads with a single-threaded CPI of 1.
• Those cores would be fully utilized, while other cores would be underutilized.
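A schedule with the balance of (a) and (b) can be built mechanically: sort threads by single-threaded CPI and deal them to cores in a "snake" order (reversing direction each round), so no core collects more than one CPU-bound thread. This is an illustrative construction, not the algorithm from the paper:

```python
def snake_schedule(cpis, n_cores):
    """cpis: list of single-threaded CPIs, one per thread.
    Returns n_cores lists of thread ids, dealt in snake order of CPI
    so each core mixes low-CPI and high-CPI threads."""
    order = sorted(range(len(cpis)), key=lambda t: cpis[t])
    cores = [[] for _ in range(n_cores)]
    for rank, tid in enumerate(order):
        rnd, pos = divmod(rank, n_cores)
        # Deal left-to-right on even rounds, right-to-left on odd rounds.
        core = pos if rnd % 2 == 0 else n_cores - 1 - pos
        cores[core].append(tid)
    return cores

# The experiment's workload: four threads each with CPI 1, 6, 11, and 16.
cpis = [1] * 4 + [6] * 4 + [11] * 4 + [16] * 4
schedule = snake_schedule(cpis, 4)
# Every core ends up with one thread from each CPI class.
```

The resulting assignment gives each core one thread of CPI 1, 6, 11, and 16, avoiding the overloaded low-CPI cores that make schedules (c) and (d) slow.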