Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput

Seminar on:
Advanced Computer Architecture

Outline
 Multithreading
 Multithreading approaches
 How Resources are Shared?
 Effectiveness of Fine MT on The Sun T1
 Effectiveness of SMT on Superscalar Processors
 References

Multithreading
 Multithreading
 is a primary technique for exposing more parallelism to the hardware.
In a strict sense, multithreading uses thread-level parallelism, but its
role in both improving pipeline utilization and in GPUs motivates us to
introduce the concept here. Increasing performance by using ILP alone
can be limited or difficult in some applications.
 Allows multiple threads to share the functional units of a single
processor in an overlapping fashion. In contrast, a more general
method to exploit thread-level parallelism (TLP) is with a multiprocessor
that has multiple independent threads operating at once and in
parallel.
 Does not duplicate the entire processor as a multiprocessor does.
Instead, multithreading shares most of the processor core among a
set of threads, duplicating only private state, such as the registers and
program counter.

Multithreading approaches
 Fine-grained multithreading
 Switches between threads on each clock, causing the execution of
instructions from multiple threads to be interleaved. This
interleaving is often done in a round-robin fashion, skipping any
threads that are stalled at that time.
 The main advantage of this approach is that it can hide the
throughput losses that arise from both short and long stalls.
 The primary disadvantage of this approach is that it slows down
the execution of an individual thread, since a thread that is ready
to execute without stalls is delayed by instructions from other threads.
 Processors that use this approach:
 Sun Niagara
 Nvidia GPUs
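
The round-robin interleaving used by fine-grained multithreading can be sketched in a few lines of Python. This is only a toy model under stated assumptions: one instruction issues per clock, and the per-cycle readiness table and the names next_thread and interleave are hypothetical, chosen for illustration rather than taken from any real processor.

```python
from typing import List, Optional

def next_thread(last: int, ready: List[bool]) -> Optional[int]:
    """Return the next ready thread after `last` in round-robin order, or None."""
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None  # every thread is stalled this cycle

def interleave(ready_by_cycle: List[List[bool]]) -> List[Optional[int]]:
    """Pick one thread per clock; None marks a cycle in which nothing can issue."""
    schedule: List[Optional[int]] = []
    last = -1  # so that thread 0 is considered first
    for ready in ready_by_cycle:
        chosen = next_thread(last, ready)
        schedule.append(chosen)
        if chosen is not None:
            last = chosen
    return schedule

# Hypothetical readiness of four threads over six clocks (True = ready).
ready_by_cycle = [
    [True, True, True, True],
    [True, True, True, True],
    [True, True, False, True],    # thread 2 is stalled and gets skipped
    [True, True, False, True],
    [False, False, False, False], # all threads stalled: the core idles
    [True, True, True, True],
]
print(interleave(ready_by_cycle))  # [0, 1, 3, 0, None, 1]
```

Note how a single thread issues at most once every few clocks even when it never stalls, which is the slowdown of an individual thread described above.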

 Coarse-grained multithreading
 Was invented as an alternative to fine-grained multithreading.
 Coarse-grained multithreading switches threads only on costly
stalls, such as level two or three cache misses.
 This change relieves the need to have thread switching be
essentially free.
 Less likely to slow down the execution of any one thread, since
instructions from other threads will only be issued when a thread
encounters a costly stall.
 Coarse-grained multithreading suffers from a major drawback: it is
limited in its ability to overcome throughput losses, especially
from shorter stalls, because the pipeline must be refilled each time
a new thread starts after a switch.
 No major processors use this technique.
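
The trade-off of coarse-grained multithreading (long stalls are mostly hidden, short stalls are not) can be illustrated with a small cost model. It is a sketch under assumptions that are not from the slides: a switch happens only when a stall is at least switch_threshold cycles long, every switch costs refill_penalty idle cycles of pipeline start-up, and another thread is always ready to run. The stall lengths are made up.

```python
from typing import List

def idle_cycles_for_stall(stall_len: int, switch_threshold: int, refill_penalty: int) -> int:
    """Idle cycles the pipeline sees for one stall under coarse-grained switching."""
    if stall_len < switch_threshold:
        # Short stall: not costly enough to trigger a switch, so the pipeline waits.
        return stall_len
    # Costly stall: switch threads and lose only the start-up bubble
    # (assumes the other thread is ready and runs for the rest of the stall).
    return min(stall_len, refill_penalty)

def total_idle(stalls: List[int], switch_threshold: int, refill_penalty: int) -> int:
    """Sum the idle cycles over a sequence of stalls."""
    return sum(idle_cycles_for_stall(s, switch_threshold, refill_penalty) for s in stalls)

stalls = [2, 3, 50, 4, 200]        # hypothetical stall lengths in cycles
print(sum(stalls))                 # no multithreading: 259 idle cycles
print(total_idle(stalls, 10, 6))   # coarse-grained switching: 21 idle cycles
```

In this model the two long stalls shrink to the start-up bubble, while the three short stalls are paid in full.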

 Simultaneous multithreading (SMT)
 The most common implementation of multithreading; it is a
variation on fine-grained multithreading.
 It arises naturally when fine-grained multithreading is
implemented on top of a multiple-issue, dynamically scheduled
processor.
 Exploits thread-level parallelism at the same time it exploits ILP:
SMT uses TLP to hide long-latency events in a processor.
 The key insight in SMT is that register renaming and dynamic
scheduling allow multiple instructions from independent threads to
be executed without regard to the dependences among them.
 The resolution of the dependences can be handled by the dynamic
scheduling capability.
 Intel Core i7 and IBM Power7 use SMT.
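
To make the idea of issuing from several threads in the same clock concrete, here is a minimal sketch of one SMT issue cycle. It assumes each thread simply reports how many independent instructions it has ready, and it hands out slots one at a time in round-robin order. Real SMT processors use their own fetch and issue policies, so the function smt_issue and its policy are illustrative assumptions only.

```python
from typing import List

def smt_issue(ready_counts: List[int], issue_width: int) -> List[int]:
    """Fill up to `issue_width` slots in one cycle, taking instructions from any thread."""
    issued = [0] * len(ready_counts)
    slots_left = issue_width
    while slots_left > 0:
        progress = False
        # One pass gives each thread with work left at most one more slot.
        for t, available in enumerate(ready_counts):
            if slots_left == 0:
                break
            if issued[t] < available:
                issued[t] += 1
                slots_left -= 1
                progress = True
        if not progress:
            break  # no thread has anything left to issue
    return issued

# Hypothetical cycle: four threads with 1, 0, 2 and 1 ready instructions, 4-wide issue.
print(smt_issue([1, 0, 2, 1], 4))   # [1, 0, 2, 1] -> all four slots used
print(smt_issue([1, 0, 0, 0], 4))   # [1, 0, 0, 0] -> three slots stay empty
```

In the first call no single thread could fill more than two slots on its own, yet together the threads fill all four.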

How Resources are Shared?
 The following figure shows the differences in a processor’s ability to
exploit the resources of a superscalar for the following
configurations:
 A superscalar with no multithreading support
 A superscalar with coarse-grained multithreading
 A superscalar with fine-grained multithreading
 A superscalar with simultaneous multithreading
 In the superscalar without multithreading support, the use of issue
slots is limited by a lack of ILP, including ILP to hide memory latency.
Because of the length of L2 and L3 cache misses, much of the
processor can be left idle.

Figure 1 How four different approaches use the functional unit execution slots of
a superscalar processor.
 The horizontal dimension represents the instruction execution capability in each clock.
 The vertical dimension represents a sequence of clock cycles.
 An empty (white) box indicates that the corresponding execution slot is unused.
 The shades of gray and black correspond to four different threads in the multithreading
processors.

 In the coarse-grained multithreaded superscalar, long stalls are
partially hidden by switching to another thread. This switching
reduces the number of completely idle clock cycles. However, thread
switching only occurs when there is a stall, and the new thread has a
start-up period, so there are likely to be some fully idle cycles
remaining.
 Fine-grained multithreading can only issue instructions from a
single thread in a cycle, so it cannot find the maximum work every
cycle, but cache misses can be tolerated.
 Simultaneous multithreading can issue instructions from any
thread in every cycle, so it has the highest probability of finding
work for every issue slot.
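
A slot diagram like the one in Figure 1 can be summarized by two numbers: the fraction of issue slots that are used and the count of completely idle clock cycles. The sketch below computes both for a grid in which each row is a clock cycle and each entry is a thread id or None for an unused slot. The two grids are invented for illustration and are not the data behind Figure 1.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[int]]]  # grid[cycle][slot] = thread id, or None if unused

def slot_usage(grid: Grid) -> Tuple[float, int]:
    """Return (fraction of issue slots used, number of completely idle cycles)."""
    total = sum(len(row) for row in grid)
    used = sum(1 for row in grid for slot in row if slot is not None)
    idle_cycles = sum(1 for row in grid if all(slot is None for slot in row))
    return used / total, idle_cycles

# Hypothetical 4-wide machine observed for four clocks.
coarse = [
    [0, 0, None, None],
    [0, None, None, None],
    [None, None, None, None],   # start-up bubble after switching threads
    [1, 1, 1, None],
]
smt = [
    [0, 0, 1, 2],
    [0, 1, 1, None],
    [2, 3, None, None],
    [1, 1, 1, 3],
]
print(slot_usage(coarse))  # (0.375, 1)
print(slot_usage(smt))     # (0.8125, 0)
```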

Effectiveness of Fine MT on the Sun T1
 Sun T1 Processor Overview
 The T1 is a fine-grained multithreaded, multicore microprocessor
introduced by Sun in 2005.
 Totally focused on exploiting thread-level parallelism (TLP), rather
than instruction-level parallelism (ILP).
 Returned to a simple pipeline strategy and focused on exploiting
TLP, using multiple cores and multithreading to produce
throughput.
 8 processor cores, each supporting 4 threads. Each core consists of
a 6-stage, single-issue pipeline (a standard five-stage RISC
pipeline, with one stage added for thread switching).
 The Sun T1 processor had the best performance on integer
applications with extensive TLP and demanding memory
performance, such as SPECJBB and transaction processing
workloads.

Figure 2 A summary of the T1 processor

 T1 Multithreading Unicore Performance
 To examine the performance of the T1 we use three server-oriented
benchmarks:
 TPC-C
 SPECJBB
 SPECWeb99
 Since multiple threads increase the memory demand from a
single processor, they could overload the memory system, leading
to reductions in the potential gain from multithreading.
 The next figures show the effectiveness of fine-grained MT on the Sun T1.

Figure 3 The relative change in the miss rates and miss latencies when executing
with one thread per core versus four threads per core on the TPC-C benchmark.

Figure 4 Breakdown of the status on an average thread.
 Remember that not ready does not imply that the core with that thread
is stalled; it is only when all four threads are not ready that the core
will stall.
 A thread can be not ready because of cache misses or pipeline delays.
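
The point that a core stalls only when all four threads are not ready can be turned into a rough estimate. Assuming, purely for illustration, that each thread is not ready with the same probability and independently of the others (real threads share caches and memory, so this is optimistic), the core is idle with that probability raised to the number of threads. The function name and the sample probability are hypothetical.

```python
def core_stall_fraction(p_not_ready: float, threads: int = 4) -> float:
    """Fraction of cycles in which every thread is not ready.

    Assumes threads are not ready independently with equal probability.
    """
    if not 0.0 <= p_not_ready <= 1.0:
        raise ValueError("p_not_ready must be a probability")
    return p_not_ready ** threads

# Hypothetical input: each thread is not ready half of the time.
print(core_stall_fraction(0.5, threads=1))  # 0.5    (single-threaded core)
print(core_stall_fraction(0.5, threads=4))  # 0.0625 (four threads per core)
```

Under these assumptions a thread that is not ready half of the time still leaves the core idle in only about 6% of cycles.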

Figure 5 The breakdown of causes for a thread being not ready
 The figure above shows how often each cause, such as cache misses and
pipeline delays, leaves a thread not ready.

Figure 6 The per-thread CPI, the per-core CPI, the effective eight-core CPI, and
the effective IPC (inverse of CPI) for the eight-core T1 processor.
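
The four quantities named in the Figure 6 caption are related by simple arithmetic, sketched below. The sketch assumes that all threads progress at the same average rate, so four threads per core divide the per-thread CPI by four and eight cores divide the per-core CPI by eight. The per-thread CPI used in the example is an illustrative input, not a value read from the figure.

```python
from typing import Tuple

def t1_cpi_summary(per_thread_cpi: float,
                   threads_per_core: int = 4,
                   cores: int = 8) -> Tuple[float, float, float]:
    """Return (per-core CPI, effective chip CPI, effective chip IPC).

    Assumes every thread on every core runs at the same average CPI.
    """
    per_core_cpi = per_thread_cpi / threads_per_core
    chip_cpi = per_core_cpi / cores
    chip_ipc = 1.0 / chip_cpi   # IPC is the inverse of CPI
    return per_core_cpi, chip_cpi, chip_ipc

# Illustrative per-thread CPI (an assumed value).
per_core, chip, ipc = t1_cpi_summary(7.2)
print(round(per_core, 3), round(chip, 3), round(ipc, 2))  # 1.8 0.225 4.44
```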

Effectiveness of SMT on Superscalar processors

 Early simulation studies assumed much wider superscalar processors than
have actually been built, so the gains they reported are unrealistic.
 In practice, existing implementations of SMT are less aggressive, so
the gain from SMT is more modest.
 The Intel Core i7 supports SMT with two threads per core. The following
figure shows the performance ratio and the energy efficiency ratio.
Figure 7 The speedup from using multithreading on one core of an i7 processor
averages 1.28 for the Java benchmarks and 1.31 for the PARSEC benchmarks.

 In the PARSEC benchmarks, SMT reduces energy by 7%. These results
clearly show that SMT in an aggressive speculative processor with
extensive support for SMT can improve performance in an
energy-efficient fashion, which the more aggressive ILP approaches
have failed to do.
 Indeed, Esmaeilzadeh et al. [2011] show that the energy
improvements from SMT are even larger on the Intel i5 (a processor
similar to the i7, but with smaller caches and a lower clock rate) and
the Intel Atom (an 80x86 processor designed for the netbook market).
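
One way to read the PARSEC numbers together is through the identity energy = power x time. If SMT gives a speedup s and changes total energy by a factor e, the implied change in average power is e x s. The sketch below applies this as a back-of-the-envelope estimate. Combining the average speedup (1.31) with the average energy reduction (7%) is an approximation, since both are averages over benchmarks, and the function name is hypothetical.

```python
def implied_power_ratio(speedup: float, energy_ratio: float) -> float:
    """Average power with SMT divided by average power without it.

    Uses energy = power * time, so energy_ratio = power_ratio / speedup.
    """
    if speedup <= 0 or energy_ratio <= 0:
        raise ValueError("speedup and energy_ratio must be positive")
    return energy_ratio * speedup

# PARSEC averages quoted above: speedup 1.31, energy reduced by 7% (ratio 0.93).
print(round(implied_power_ratio(1.31, 0.93), 2))  # 1.22
```

Under this rough estimate, power rises by about 22% while run time falls by more than that, which is why total energy still drops.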

References
 John L. Hennessy and David A. Patterson, “Computer Architecture:
A Quantitative Approach”, Morgan Kaufmann (an imprint of Elsevier),
Waltham, MA, USA, 2012, pp. 223-232.
