The document discusses multithreading and how it can be used to exploit thread-level parallelism (TLP) on processors designed for instruction-level parallelism (ILP). There are two main approaches to multithreading: fine-grained, which switches threads on every instruction, and coarse-grained, which switches only on long stalls. Simultaneous multithreading (SMT) allows a processor to issue instructions from multiple threads in the same cycle by treating instructions from different threads as independent. This converts TLP into additional ILP and makes better use of the resources of superscalar and multicore processors.
1. Multithreading
Mr. A. B. Shinde
Assistant Professor,
Electronics Engineering,
P.V.P.I.T., Budhgaon
2. Contents…
Using ILP support to exploit thread-level parallelism
Performance and efficiency in advanced multiple-issue processors
3. Threads
A thread is a basic unit of CPU utilization.
A thread is a separate process with its own instructions and data. A thread may represent a process that is part of a parallel program consisting of multiple processes, or it may represent an independent program.
4. Threads
A thread comprises a thread ID, a program counter, a register set, and a stack.
It shares its code section, data section, and other operating-system resources, such as open files and signals, with the other threads belonging to the same process.
A traditional process has a single thread of control. If a process has multiple threads of control, it can perform more than one task at a time.
5. Threads
Many software packages that run on modern desktop PCs are multithreaded.
For example, a word processor may have:
a thread for displaying graphics,
another thread for responding to keystrokes from the user, and
a third thread for performing spelling and grammar checking in the background.
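The word-processor example above can be sketched with ordinary OS threads. This is a minimal illustration, not a real editor: the three task functions are hypothetical stand-ins for the real work, and each simply records that it ran.

```python
import threading

# Three threads doing separate jobs inside one process, as in the
# word-processor example. The tasks are hypothetical placeholders.
results = {}

def render_display():
    results["display"] = "graphics drawn"

def read_keystrokes():
    results["input"] = "keystrokes handled"

def check_spelling():
    results["spellcheck"] = "document checked"

threads = [threading.Thread(target=f)
           for f in (render_display, read_keystrokes, check_spelling)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three tasks ran concurrently within the same process.
print(sorted(results))  # ['display', 'input', 'spellcheck']
```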
6. Threads
Threads also play a vital role in remote procedure call (RPC) systems. RPC allows interprocess communication by providing a communication mechanism similar to ordinary function or procedure calls.
Many operating system kernels are multithreaded; several threads operate in the kernel, and each thread performs a specific task, such as managing devices or handling interrupts.
7. Multithreading
Benefits:
1. Responsiveness: Multithreading an interactive application may allow a program to continue running even if part of it is blocked, thereby increasing responsiveness to the user. For example, a multithreaded web browser could still allow user interaction in one thread while an image is being loaded in another thread.
2. Resource sharing: By default, threads share the memory and the resources of the process to which they belong. The benefit of sharing code and data is that it allows an application to have several different threads of activity within the same address space.
8. Multithreading
Benefits:
3. Economy: Allocating memory and resources for process creation is costly. Since threads share the resources of the process to which they belong, creating and switching threads is a cost-effective alternative.
4. Utilization of multiprocessor architectures: In a multiprocessor architecture, threads may run in parallel on different processors. A single-threaded process can run on only one CPU, no matter how many are available. Multithreading on a multi-CPU machine increases concurrency.
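The resource-sharing benefit can be seen directly in code: because threads live in the same address space, they all update the same variable. A small sketch; the lock is needed precisely because the counter is shared.

```python
import threading

# Threads share the process's memory: all workers increment one counter.
counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # guard the shared data
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: every thread updated the same address space
```

Separate processes would each have seen a private copy of `counter`; sharing it would have required explicit shared memory or message passing.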
9. Multithreading Models
Support for threads may be provided either at the user level or at the kernel level.
User threads are supported above the kernel and are managed without kernel support, whereas kernel threads are supported and managed directly by the operating system.
10. Multithreading Models
Many-to-One Model:
The many-to-one model maps many user-level threads to one kernel thread.
Thread management is done by the thread library in user space, so it is efficient.
However, only one thread can access the kernel at a time, so multiple threads are unable to run in parallel on multiprocessors.
11. Multithreading Models
One-to-One Model:
The one-to-one model maps each user thread to a kernel thread.
It provides more concurrency than the many-to-one model and allows multiple threads to run in parallel on multiprocessors.
The only drawback of this model is that creating a user thread requires creating the corresponding kernel thread. The overhead of creating kernel threads can burden the performance of an application.
12. Multithreading Models
Many-to-Many Model:
The many-to-many model multiplexes many user-level threads to a smaller or equal number of kernel threads.
The number of kernel threads may be specific to either a particular application or a particular machine.
Developers can create as many user threads as necessary, and the corresponding kernel threads can run in parallel on a multiprocessor.
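As a loose software analogy to the many-to-many model, a thread pool multiplexes many submitted tasks onto a small, fixed set of worker threads. A sketch using Python's standard library; the pool size of 3 and the 20 tasks are arbitrary choices for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Many tasks ("user-level work") multiplexed onto 3 worker threads.
def task(i):
    return (i * i, threading.current_thread().name)

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(task, range(20)))  # map preserves task order

squares = [r[0] for r in results]
workers = {r[1] for r in results}
print(len(squares), len(workers) <= 3)  # 20 True: 20 tasks, at most 3 threads
```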
14. Multithreading: ILP Support to Exploit Thread-Level Parallelism
Although ILP increases the performance of a system, it can be quite limited or hard to exploit in some applications. Furthermore, there may be parallelism occurring naturally at a higher level in the application.
For example, an online transaction-processing system has parallelism among its multiple queries and updates. These queries and updates can be processed mostly in parallel, since they are largely independent of one another.
15. Multithreading: ILP Support to Exploit Thread-Level Parallelism
This higher-level parallelism is called thread-level parallelism (TLP) because it is logically structured as separate threads of execution.
ILP is parallelism among operations within a loop or straight-line code. TLP is represented by the use of multiple threads of execution that run in parallel.
16. Multithreading: ILP Support to Exploit Thread-Level Parallelism
Thread-level parallelism is an important alternative to instruction-level parallelism.
In many applications, thread-level parallelism occurs naturally (as in many server applications).
If software is written from scratch, expressing the parallelism is much easier. But for established applications written without parallelism in mind, there can be significant challenges, and it can be extremely costly to rewrite them to exploit thread-level parallelism.
17. Multithreading: ILP Support to Exploit Thread-Level Parallelism
TLP and ILP exploit two different kinds of parallel structure.
The crucial question is: can we exploit TLP on a processor designed for ILP? The answer is yes.
A datapath designed to exploit ILP often finds that many functional units are idle because of stalls or dependences in the code. Threads can serve as a source of independent instructions that keep the processor busy, thereby exploiting TLP.
18. Multithreading: ILP Support to Exploit Thread-Level Parallelism
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
To permit this sharing, the processor must duplicate the independent state of each thread. For example, a separate copy of the register file, a separate PC, and a separate page table are required for each thread.
In addition, the hardware must support the ability to change to a different thread relatively quickly.
19. Multithreading: ILP Support to Exploit Thread-Level Parallelism
There are two main approaches to multithreading:
Fine-grained multithreading, and
Coarse-grained multithreading.
20. Multithreading: ILP Support to Exploit Thread-Level Parallelism
Fine-grained multithreading:
It switches between threads on each instruction, causing the execution of multiple threads to be interleaved. This interleaving is often done in a round-robin fashion.
To make fine-grained multithreading practical, the CPU must be able to switch threads on every clock cycle.
Advantage: It can hide the throughput losses that arise from both short and long stalls.
Disadvantage: It slows down the execution of the individual threads.
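The round-robin interleaving described above can be modeled with a toy single-issue simulator. This is an illustrative sketch, not a real pipeline model: each thread is encoded as a list of integers giving the stall cycles that must elapse before each of its instructions can issue.

```python
def fine_grained(threads):
    """Toy single-issue, fine-grained multithreading model.
    threads[i] is a list of ints: stall cycles before each instruction
    of thread i is ready (0 = back-to-back). Each cycle the core picks,
    round-robin, a thread whose next instruction is ready; if none is
    ready, the issue slot is idle. Returns (total_cycles, idle_slots)."""
    nxt = [0] * len(threads)                      # next instruction index
    ready = [t[0] if t else 0 for t in threads]   # cycle it becomes ready
    cycle = idle = rr = 0
    while any(nxt[i] < len(threads[i]) for i in range(len(threads))):
        issued = False
        for k in range(len(threads)):
            i = (rr + k) % len(threads)
            if nxt[i] < len(threads[i]) and ready[i] <= cycle:
                nxt[i] += 1
                if nxt[i] < len(threads[i]):
                    # following instruction ready after its stall elapses
                    ready[i] = cycle + 1 + threads[i][nxt[i]]
                rr = i + 1                        # resume round-robin after i
                issued = True
                break
        if not issued:
            idle += 1                             # no thread ready: wasted slot
        cycle += 1
    return cycle, idle

# One thread with a 3-cycle stall wastes 3 slots; adding a second
# identical thread lets each thread's work hide the other's stalls.
print(fine_grained([[0, 3, 0]]))             # (6, 3)
print(fine_grained([[0, 3, 0], [0, 3, 0]]))  # (8, 2)
```

Run back-to-back, the two threads alone would take 12 cycles with 6 idle slots; interleaved, they take 8 cycles with only 2, which is the throughput advantage described above. The individual threads still finish later than they would running alone, which is the disadvantage.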
21. Multithreading: ILP Support to Exploit Thread-Level Parallelism
Coarse-grained multithreading:
It was invented as an alternative to fine-grained multithreading. Coarse-grained multithreading switches threads only on costly (long) stalls.
Advantage: It relieves the need for very fast thread switching.
Disadvantage: It is likely to slow the processor down, since instructions from other threads are issued only when a thread encounters a costly (long) stall.
22. Multithreading: ILP Support to Exploit Thread-Level Parallelism
A CPU with coarse-grained multithreading issues instructions from a single thread. When a stall occurs, the pipeline must be emptied or frozen, and the new thread that executes after the stall must fill the pipeline.
Because of this start-up overhead, coarse-grained multithreading is most useful for reducing the penalty of high-cost stalls, where the pipeline refill time is negligible compared to the stall time.
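The start-up overhead can be illustrated with a matching toy single-issue model. As before, this is only a sketch: each thread is a list of integers giving the stall cycles before each instruction, and the 2-cycle pipeline-refill penalty charged on every switch is an arbitrary assumption.

```python
def coarse_grained(threads, refill=2):
    """Toy single-issue, coarse-grained multithreading model.
    The core runs one thread until it hits a stall, then switches to
    another thread with work, paying `refill` idle cycles to restart
    the pipeline. Returns (total_cycles, idle_slots)."""
    nxt = [0] * len(threads)                      # next instruction index
    ready = [t[0] if t else 0 for t in threads]   # cycle it becomes ready
    cycle = idle = 0
    cur = 0
    while any(nxt[i] < len(threads[i]) for i in range(len(threads))):
        if nxt[cur] >= len(threads[cur]):         # current thread finished
            cur = (cur + 1) % len(threads)
            continue
        if ready[cur] <= cycle:
            nxt[cur] += 1
            if nxt[cur] < len(threads[cur]):
                ready[cur] = cycle + 1 + threads[cur][nxt[cur]]
            cycle += 1
        else:
            # Stall: switch to another runnable thread, paying the refill.
            others = [i for i in range(len(threads))
                      if i != cur and nxt[i] < len(threads[i])]
            if others:
                cur = others[0]
                idle += refill
                cycle += refill
            else:
                idle += 1                         # nothing else to run
                cycle += 1
    return cycle, idle

# With only short (3-cycle) stalls, the refill penalty eats most of the
# benefit of switching; a long stall would amortize it much better.
print(coarse_grained([[0, 3, 0]]))             # (6, 3)
print(coarse_grained([[0, 3, 0], [0, 3, 0]]))  # (10, 4)
```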
23. Converting Thread-Level Parallelism into Instruction-Level Parallelism
Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP.
Multiple-issue processors often have more functional-unit parallelism available than a single thread can use, which motivates the use of SMT.
With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them.
24. Converting Thread-Level Parallelism into Instruction-Level Parallelism
The figure illustrates the differences in a processor's ability to exploit the resources of a superscalar for the following configurations:
A superscalar with no multithreading support,
A superscalar with coarse-grained multithreading,
A superscalar with fine-grained multithreading, and
A superscalar with simultaneous multithreading.
25. Converting Thread-Level Parallelism into Instruction-Level Parallelism
In the superscalar without multithreading support, the use of issue slots is limited by a lack of ILP. In addition, a major stall, such as an instruction cache miss, can leave the entire processor idle.
(Figure legend: an empty (white) box indicates that the corresponding issue slot is unused in that clock cycle; black indicates an occupied issue slot.)
26. Converting Thread-Level Parallelism into Instruction-Level Parallelism
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread that uses the resources of the processor.
This reduces the number of completely idle clock cycles; within each clock cycle, however, the ILP limitations still lead to idle issue slots. And because thread switching occurs only on a stall, some fully idle cycles remain.
(Figure legend: the shades of grey and black correspond to different threads in the multithreaded processors.)
27. Converting Thread-Level Parallelism into Instruction-Level Parallelism
In the fine-grained multithreaded superscalar, the interleaving of threads eliminates fully empty cycles. Because only one thread issues instructions in a given clock cycle, ILP limitations still lead to a significant number of idle slots within individual clock cycles.
(Figure legend: an empty (white) box indicates that the corresponding issue slot is unused in that clock cycle; the shades of grey and black correspond to four different threads.)
28. Converting Thread-Level Parallelism into Instruction-Level Parallelism
In SMT, TLP and ILP are exploited simultaneously. Ideally, issue slot usage is limited only by imbalances between the resource needs and the resource availability over multiple threads.
In practice, other factors can also restrict how many slots are used:
- how many active threads are considered,
- finite limitations on buffers,
- the ability to fetch enough instructions from multiple threads, and
- practical limitations on what instruction combinations can issue from one thread and from multiple threads.
29. Converting Thread-Level Parallelism into Instruction-Level Parallelism
Design Challenges in SMT:
Because a dynamically scheduled superscalar processor has a deep pipeline, coarse-grained multithreading gains little in performance. Since SMT makes sense only in a fine-grained implementation, we must consider the impact of fine-grained scheduling on single-thread performance.
This effect can be minimized by having a preferred thread, which still permits multithreading to preserve some of its performance advantage with a smaller compromise in single-thread performance.
30. Converting Thread-Level Parallelism into Instruction-Level Parallelism
Design Challenges in SMT:
Other design challenges for an SMT processor include:
Dealing with the larger register file needed to hold multiple contexts.
Not affecting the clock cycle, particularly in instruction issue, where more instructions need to be considered, and in instruction commit, where choosing which instructions to commit may be challenging.
Ensuring that the cache and TLB conflicts generated by the simultaneous execution of multiple threads do not cause significant performance degradation.
31. Converting Thread-Level Parallelism into Instruction-Level Parallelism
Design Challenges in SMT:
In many cases, the potential performance overhead due to multithreading is small.
The efficiency of current superscalars is low enough that there is scope for significant improvement, even at the cost of some overhead.
33. Performance and Efficiency in Advanced Multiple-Issue Processors
The question of efficiency in terms of silicon area and power is equally critical, and power is the major constraint on modern processors.
The Itanium 2 is the most inefficient processor for both floating-point and integer code. The Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency, while the IBM Power5 is the most effective user of energy.
In fact, none of the processors offers a great advantage in efficiency.
34. Performance and Efficiency in Advanced Multiple-Issue Processors
What Limits Multiple-Issue Processors?
Power is a function of both static power (proportional to the transistor count, whether or not the transistors are switching) and dynamic power (proportional to the product of the number of switching transistors and the switching rate).
Static power is certainly a design concern, but dynamic power is usually the dominant energy consumer.
A microprocessor trying to achieve both a low CPI and a high clock rate must switch more transistors and switch them faster.
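The dynamic-power relation above can be turned into a back-of-the-envelope check. The ratios below are hypothetical, purely for illustration.

```python
# Dynamic power scales roughly with (number of switching transistors)
# times (switching rate), holding supply voltage fixed:
#     P_dyn ~ N_switch * f
def relative_dynamic_power(transistor_ratio, clock_ratio):
    return transistor_ratio * clock_ratio

# Hypothetical wider-issue design: 2x the switching transistors at
# 1.2x the clock rate -> 2.4x the dynamic power.
power_ratio = relative_dynamic_power(2.0, 1.2)
perf_ratio = 1.6   # assumed speedup, for illustration only
print(power_ratio)                     # 2.4
print(perf_ratio / power_ratio < 1.0)  # True: performance per watt drops
```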
35. Performance and Efficiency in Advanced Multiple-Issue Processors
What Limits Multiple-Issue Processors?
Most techniques used for increasing performance (multiple cores and multithreading) increase power consumption as well.
The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
36. Performance and Efficiency in Advanced Multiple-Issue Processors
What Limits Multiple-Issue Processors?
This inefficiency arises from two primary characteristics.
First, issuing multiple instructions incurs overhead in logic that grows faster than the issue rate. This logic is responsible for instruction issue analysis, including dependence checking, register renaming, and similar functions.
The combined result is that lower CPIs are likely to lead to lower ratios of performance per watt, simply because of this overhead.
37. Performance and Efficiency in Advanced Multiple-Issue Processors
What Limits Multiple-Issue Processors?
Second, there is a growing gap between peak issue rates and sustained performance. The number of transistors switching is proportional to the peak issue rate, while performance is proportional to the sustained rate.
For example, to sustain four instructions per clock, we must fetch more, issue more, and initiate execution on more than four instructions. Power is thus proportional to the peak rate, but performance is at the sustained rate.
38. Performance and Efficiency in Advanced Multiple-Issue Processors
What Limits Multiple-Issue Processors?
An important technique for increasing the exploitation of ILP, speculation, is inherently inefficient because it can never be perfect.
If speculation were perfect, it could save power, since it would reduce the execution time and save static power. When speculation is not perfect, it rapidly becomes energy inefficient, since the mis-speculated work requires additional dynamic power.
39. Performance and Efficiency in Advanced Multiple-Issue Processors
What Limits Multiple-Issue Processors?
Focusing on improving clock rate: increasing the clock rate increases transistor switching frequency and directly increases power consumption.
To achieve a faster clock rate, we would also need to increase pipeline depth, and deeper pipelines incur additional overhead penalties as well as causing higher switching rates.