What is Simultaneous Multithreading?
Generally speaking, there are two types of parallelism that can be exploited by modern computing
machinery to achieve higher performance. The Instruction Level Parallelism (ILP) approach
attempts to reduce program runtime by overlapping the execution time of as many instructions as
possible, to as great a degree as possible. The EV8 will have higher performance than earlier Alpha
designs through the enhanced exploitation of ILP made possible by its eight-instruction issue width.
But gains from higher ILP come at a high and ever increasing price. Building wider machines runs
into the problem of geometrically increasing complexity in control logic while data and control
dependencies within the program code limit performance increases. John Hennessy of Stanford
University has likened the difficulty of increasing exploitation of ILP for greater performance to the
task of pushing a boulder up a mountain whose slopes grow ever steeper the further processor
architects progress [1].
Figure 1. Multithreaded Execution with Increasing Levels of TLP Hardware Support
The second form of parallelism is called Thread Level Parallelism or TLP. This simply means the
ability to execute independent programs, or independent parts of a single program, simultaneously
using different flows of execution, called threads. The illusion of multiple thread execution is often
achieved on a single conventional processor through the use of multitasking. Multitasking relies on
the ability of an operating system (OS) to overlap the execution of multiple threads or programs on
a single processor by running each thread successively for short intervals. This is shown in Figure
1A. The diagram represents consecutive clock cycles as rectangles repeated in the horizontal
direction, while the squares stacked vertically within each rectangle represent the per-cycle
utilization of instruction issue slots in a four-way superscalar processor (unused slots are
left as white squares).
Each thread runs for a short interval that ends when the program experiences an exception like a
page fault, calls an operating system function, or is interrupted by an interval timer. When a thread
is interrupted, a short segment of OS code (shown in Figure 1A as gray instructions in issue slots) is
run which performs a context switch and switches execution to a new thread. Multitasking provides
the illusion of simultaneous execution of multiple threads but does nothing to enhance the overall
computational capability of the processor. In fact, excessive context switching causes processor
cycles, which could have been used running user code, to be wasted in the OS.
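The cost of OS-based multitasking can be put into rough numbers. The sketch below is a purely illustrative model; the quantum and context switch lengths are made-up values, not measurements from any real system:

```python
def user_utilization(quantum_cycles, switch_cycles):
    """Fraction of cycles spent running user code for each
    quantum-plus-context-switch period."""
    return quantum_cycles / (quantum_cycles + switch_cycles)

# Hypothetical figures: a 1,000,000-cycle time slice with a
# 10,000-cycle OS context switch leaves ~99% of cycles for user code.
print(round(user_utilization(1_000_000, 10_000), 4))  # -> 0.9901
```

Shrinking the quantum (i.e. switching more often) drives the utilization down, which is exactly the waste excessive context switching causes.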
The most basic type of TLP exploitation that can be incorporated into processor hardware is coarse
grained multithreading (CMT), shown in Figure 1B. The processor incorporates two or more thread
contexts (general purpose registers, program counter PC, process status word PSW etc.) in
hardware. One thread context is active at a time and runs until an exception occurs or, more likely,
a high latency operation such as a cache miss on a load instruction is encountered. When this occurs, the
processor hardware automatically flushes and changes the thread context, and switches execution to
a new thread.
For contemporary MPUs, a memory operation initiated in response to a cache miss can take over a
hundred clock cycles, which represents the potential execution of hundreds of instructions. A
conventional in-order processor will simply stall, forever losing those hundreds of potential
instruction slots while waiting for memory to respond with the needed data. A conventional out-of-order
execution processor has the potential to continue to execute other instructions that weren’t
dependent on the missed load data. However, independent instructions tend to be quickly exhausted
in most programs and the processor simply takes longer to stall.
But a coarse grained multithreaded processor has the opportunity to quickly switch to another
thread after a cache miss and perform useful work while the first thread awaits its data from
memory. Many programs spend considerable time waiting for memory operations and a coarse
grained multithreaded processor has the opportunity to increase overall system throughput,
compared to a conventional processor performing OS-based multitasking. The IBM PowerPC
RS64, also known as Northstar, is rumored to incorporate two way coarse grained multithreading
capability, although it is not utilized in some product lines.
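The benefit of switching on a miss can be sketched with a toy throughput model. This is an illustration of the argument above, not data from any real design; the miss rate, miss latency, and switch cost are all assumed values:

```python
def ipc_stall(miss_rate, miss_latency=100):
    """IPC of a single-issue in-order core that stalls for the full
    miss latency on every cache miss."""
    return 1.0 / (1.0 + miss_rate * miss_latency)

def ipc_cmt(miss_rate, switch_cost=3):
    """IPC when a miss costs only a hardware thread switch, assuming
    another thread always has work to hide the remaining latency."""
    return 1.0 / (1.0 + miss_rate * switch_cost)

# With 2% of instructions missing in cache:
print(round(ipc_stall(0.02), 2))  # -> 0.33
print(round(ipc_cmt(0.02), 2))    # -> 0.94
```

Under these assumptions the coarse-grained multithreaded core sustains nearly three times the throughput, which is the whole point of hiding miss latency behind another thread's work.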
A more comprehensive way to exploit TLP in hardware is the fine grained multithreaded (FMT)
processor. The operation of one variant of this class of machine is shown in Figure 1C. In this type
of design there are N thread contexts in the processor and instructions from each thread are
allocated every Nth processor clock cycle to advance through the processor’s execution pipeline by
one stage. Figure 1C shows the operation of a four-way fine grained multithreaded processor, i.e. N
= 4. At first glance it seems that each thread has only 1/Nth the performance potential of a
conventional processor. In practice it is considerably better than that, because the execution pipeline
can be made much shorter from the logical viewpoint of a single thread. This reduces instruction
latencies, simplifies compiler code scheduling, and increases the instructions per clock (IPC)
component of performance.
For example, a four-way fine grained multithreaded processor might provide single cycle latency
floating point (FP) addition while conventional processors typically require three or four cycles of
latency. That is possible because the FP adder has four physical processor clock cycles to advance a
thread’s FP add instruction through what is one logical execution pipeline stage from the thread’s
viewpoint. In a similar fashion, memory latency appears to be 1/Nth the number of processor clock
cycles from the viewpoint of individual threads. The hardware cost of fine grained multithreading is
relatively modest: N thread contexts, and control logic and multiplexors to cyclically commutate
instructions and data from N different threads into and out of the execution units. The drawback of
this approach is that its performance running any single thread is still appreciably less than for a
conventional processor although the system throughput is increased. An example of a fine-grained
multithreaded processor is the five-threaded MicroUnity MediaProcessor [2].
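The latency-hiding argument reduces to a one-line calculation. The sketch below assumes, as in the text, that a thread occupies only every Nth processor clock cycle:

```python
import math

def logical_latency(physical_cycles, n_threads):
    """Operation latency as seen by one thread when that thread only
    advances on every Nth processor clock cycle."""
    return math.ceil(physical_cycles / n_threads)

# With N = 4: a 4-cycle FP add looks single-cycle to each thread,
# and a 100-cycle memory access looks like 25 logical cycles:
print(logical_latency(4, 4))    # -> 1
print(logical_latency(100, 4))  # -> 25
```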
The EV8 uses a more powerful mechanism than either coarse or fine grained multithreading to
exploit TLP. Called Simultaneous Multithreading (SMT), it allows the instructions from two or
more threads to be issued to execution units each cycle. This process is illustrated conceptually in
Figure 1D. The advantage of SMT is that it permits TLP to be exploited all the way down to the
most fundamental level of hardware operation - instruction issue slots in a given clock period. This
allows instructions from alternate threads to take advantage of individual instruction execution
opportunities presented by the normal ILP inefficiencies of single thread program execution. SMT
can be thought of as equivalent to the airline practice of using standby passengers to fill seats that
would have otherwise flown empty.
Consider a single thread executing on a superscalar processor. Conventional superscalar processors
such as the Alpha EV6 fall well short of utilizing all the available instruction issue slots. This is
caused by execution inefficiencies, including data dependency stalls, the cycle-by-cycle shortfall
between the ILP a thread offers and the resources the processor provides (given its limited
re-ordering capability), and memory accesses that miss in cache. The big advantage of SMT over other approaches is its inherent
flexibility in providing good performance over a wide spectrum of workloads. Programs that have a
lot of extractable ILP can get nearly all the benefit of the wide issue capability of the processor.
And programs with poor ILP can share with other threads instruction issue slots and execution
resources that otherwise would have gone unused.
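The standby-passenger analogy can be reduced to a toy per-cycle issue model. This is purely illustrative; the issue width and the per-thread ready-instruction counts are assumed values:

```python
ISSUE_WIDTH = 4  # issue slots per cycle (assumed width, as in Figure 1)

def slots_issued(ready_per_thread):
    """Instructions issued in one cycle when threads share the slots."""
    return min(ISSUE_WIDTH, sum(ready_per_thread))

# A lone thread with poor ILP fills 2 of 4 slots; a second thread with
# 3 ready instructions fills the slots that would have flown empty:
print(slots_issued([2]))     # -> 2
print(slots_issued([2, 3]))  # -> 4
```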
Hardware Requirements for SMT
Compared to a conventional out-of-order execution superscalar processor like the EV6, the
following hardware changes are necessary to support SMT operation:
1. Multiple program counters (PCs), and the capacity to select one or more of them to direct
instruction fetch each clock cycle.
2. Association of a thread identifier with each instruction fetched to distinguish different
threads for the purpose of branch prediction, branch target buffering, and register renaming.
3. A per-thread capacity to retire, flush, and trap instructions.
4. A per-thread stack for prediction of subroutine return addresses.
One of the most remarkable aspects of SMT is that it takes relatively little extra logic to add the
capability to the execution portion of an out-of-order execution superscalar processor that employs
register renaming and issue queues. Register renaming is a scheme in which the logical registers in
an instruction set architecture (ISA) are mapped to a subset of a larger pool of physical hardware
registers. Each time an instruction is decoded the logical register specified to be overwritten with
the instruction result (i.e. the destination register) is assigned a mapping to a new physical register,
i.e. it is renamed. When the instruction completes execution and retires, its physical destination
register becomes officially bound to the logical destination register within the processor state, i.e.
the result is committed. Register renaming permits out-of-order execution of instructions to proceed
even in the presence of false dependencies as shown in Figure 2.
Figure 2. Data Dependencies and Register Renaming
Register renaming is also done to permit speculative execution beyond conditional branches since it
allows the results of speculated instructions to be discarded and earlier processor state restored if
the branch turns out to be mispredicted. In this case it is only necessary to restore an older mapping
of logical to physical registers.
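The renaming and checkpoint/restore mechanism just described can be sketched in a few lines of Python. This is an illustration of the general scheme, not the EV6's actual CAM-based circuit; the register counts follow the EV6's published integer figures:

```python
class Renamer:
    """Toy rename table: 32 logical, 80 physical integer registers."""
    def __init__(self, n_logical=32, n_physical=80):
        self.map = list(range(n_logical))               # logical -> physical
        self.free = list(range(n_logical, n_physical))  # unused physicals
        # (a real renamer also returns physical registers to the
        #  free list as instructions retire)

    def rename_dest(self, logical_reg):
        """Give the destination register a fresh physical register."""
        phys = self.free.pop(0)
        self.map[logical_reg] = phys
        return phys

    def checkpoint(self):
        return list(self.map)   # saved before a conditional branch

    def restore(self, saved):
        self.map = saved        # roll back on a misprediction

r = Renamer()
p1 = r.rename_dest(1)  # first write to r1
p2 = r.rename_dest(1)  # second write to r1: new physical, no WAW stall
print(p1, p2)          # -> 32 33
```

Because each write gets a fresh physical register, the false WAW dependence between the two writes to r1 disappears, and a mispredicted branch is undone simply by restoring the older logical-to-physical mapping.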
The beauty of register renaming is that it allows an SMT processor to contain multiple thread
contexts without the need for multiple physical register sets or additional complicated tracking logic
to ensure execution results from instructions from different threads are written to the appropriate
thread context. For example, the Alpha EV6 has 80 physical integer registers (there are actually 160
integer registers in the EV6 device but these are really two duplicate sets of 80 for reasons I won’t
go into) and 72 physical FP registers. At any given time, 31 of the 80 physical integer registers
contain the contents of the 31 logical general purpose registers that appear to the programmer in the
Alpha ISA (there are actually 32 logical integer registers but one of them always reads as zero, as is
customary for RISC architectures). The remaining physical registers are available for renaming. The
EV6 uses two separate twelve-port register mappers for integer and FP register renaming, and each
can rename up to four instructions per clock [3]. Content addressable memory (CAM)-based tables
are used to hold the register mapping state. The map tables are also buffered so that an older state
can be saved and later restored, if necessary to recover from branch mispredictions and exceptions.
At first glance, implementing a four-way SMT like the EV8 would seem to require four separate
and independent register mapping tables, one for each thread. This could be physically realized with
a single map table if the size of logical register specifiers used by the mapper is expanded to 7 bits
by appending a two-bit thread identifier associated with a fetched instruction to the 5 bit logical
register specifiers extracted from the instruction itself. So thread context 0 would use mapper
logical registers 0 through 31, thread 1 would use mapper logical registers 32 through 63 and so on.
In this scheme each quadrant of the mapper CAM would have the capability to be independently
backed up in buffers and restored as needed to maintain the illusion of serial, in-order execution of
each thread.
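The quadrant arrangement amounts to a one-line index computation (a sketch of the scheme described above):

```python
def mapper_index(thread_id, logical_reg):
    """7-bit mapper index: 2-bit thread ID appended to the 5-bit
    logical register specifier, giving each thread a 32-entry quadrant."""
    assert 0 <= thread_id < 4 and 0 <= logical_reg < 32
    return (thread_id << 5) | logical_reg

print(mapper_index(0, 31))  # -> 31   thread 0 owns entries 0..31
print(mapper_index(1, 0))   # -> 32   thread 1 owns entries 32..63
print(mapper_index(3, 31))  # -> 127  the highest of the 128 entries
```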
Early research into 8-issue wide superscalar out-of-order processors suggests that with a 64 entry
dispatch queue at least 96, and preferably 128, physical registers are needed to limit the fraction of
time the processor is out of free registers to 15% and 10% respectively [4]. It is known that the EV8
supports four thread contexts in hardware [5]. This suggests that the EV8 needs an additional 96
integer physical registers above and beyond a conventional 8-issue wide superscalar. That places
the number of integer physical renaming registers in the EV8 in the range of 192 to 224 for optimal
performance. It should be noted that this exceeds even the 128 logical/physical integer registers
required in implementations of Intel/HP's IA-64 instruction set architecture. Such a large, highly
ported register file has the potential to seriously limit EV8's clock rate even with the use of an
advanced 0.13 um process. The best solution to this problem is to spread register read and write
access across two pipe stages instead of one. This has the effect of lengthening the basic execution
pipeline from EV6's seven stages to nine stages as shown in Figure 3. One study suggests the extra
two pipeline stages in the hypothetical EV8 will degrade single thread performance by less than 2%
[6].
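The arithmetic behind the register-count estimate above is straightforward, restating the cited figures: three additional thread contexts each add 32 architected integer registers to the 96 to 128 register baseline:

```python
EXTRA_CONTEXTS = 3     # four SMT threads = one baseline + three extra
REGS_PER_CONTEXT = 32  # architected integer registers per Alpha thread

for base_regs in (96, 128):
    print(base_regs + EXTRA_CONTEXTS * REGS_PER_CONTEXT)  # -> 192 then 224
```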
Figure 3. Comparison of EV6 and Hypothetical EV8 Execution Pipeline
Instruction Selection Strategies For SMT
I have described how the execution engine portion of an out-of-order superscalar processor
implementing register renaming can be modified to support SMT operation. The big design issue
with SMT is the algorithm that chooses between threads for the fetch and issue of instructions to
that execution engine. A number of different schemes associated with 8 issue wide SMT RISC
processor designs have been investigated and reported in the literature [7]. Some of these schemes
are listed in Table 1.
Table 1. SMT Instruction Fetch Schemes

Scheme          Max. Active     Max. Instr. Fetched   Description
                Threads/Cycle   per Thread/Cycle
RR.1.8          1               8                     Round-robin, 1 active thread, 1 x 8 fetch
RR.2.4          2               4                     Round-robin, 2 active threads, 2 x 4 fetch
RR.2.8          2               8                     Round-robin, 2 active threads, 2 x 8 fetch
BRCOUNT.1.8     1               8                     Choose thread with fewest unresolved branches, 1 active thread, 1 x 8 fetch
BRCOUNT.2.8     2               8                     Choose thread with fewest unresolved branches, 2 active threads, 2 x 8 fetch
MISSCOUNT.1.8   1               8                     Choose thread with fewest outstanding Dcache misses, 1 active thread, 1 x 8 fetch
MISSCOUNT.2.8   2               8                     Choose thread with fewest outstanding Dcache misses, 2 active threads, 2 x 8 fetch
ICOUNT.1.8      1               8                     Choose thread with fewest instructions in DEC/REN/QUE pipe stages, 1 active thread, 1 x 8 fetch
ICOUNT.2.8      2               8                     Choose thread with fewest instructions in DEC/REN/QUE pipe stages, 2 active threads, 2 x 8 fetch
The simplest scheme is termed RR.1.8, or round-robin, one active thread, up to 8 instructions
fetched. Each clock, the processor selects one thread from those not currently experiencing an
instruction cache (Icache) miss on a round robin basis and uses its PC value to fetch up to 8
instructions per cycle for decoding, renaming, and entry into the integer and/or FP instruction issue
queues. The Icache design is essentially unchanged from that of a conventional single-threaded 8-
issue wide superscalar processor. Variants include RR.2.4, and RR.2.8, which require a dual ported
Icache to permit simultaneous access using two different thread PC values. In the latter case the
Icache also needs to support 16 instructions/cycle bandwidth, or twice that of a single-threaded
processor. The RR.2.8 scheme takes as many instructions as possible from the first thread and fills
any remaining slots with instructions fetched from the second thread. The RR.1.8 scheme provides 12% better
single thread performance than RR.2.4 but RR.2.4 outperforms RR.1.8 with four active threads.
Unsurprisingly, the expensive RR.2.8 scheme outperforms both RR.1.8 and RR.2.4 for both single
thread and four thread operation.
More sophisticated schemes have been devised to help increase the throughput of the processor.
The BRCOUNT scheme attempts to give priority to threads that are least likely to be wasting
instruction slots performing speculative execution. It does this by counting branch instructions in
the decode (DEC) pipe stage, rename (REN) pipe stage, and instruction queues (QUE). Priority is
given to the thread(s) with the smallest branch count. In practice BRCOUNT.x.8 offers little
performance advantage over RR.x.8. The MISSCOUNT scheme gives priority to the thread(s) with
the fewest number of outstanding data cache (Dcache) misses. Like BRCOUNT, MISSCOUNT.x.8
offers little advantage over RR.x.8.
The ICOUNT scheme takes a more general approach to prevent the 'clogging' of the instruction
execution queues. Priority is given to the thread(s) with the fewest instructions in the DEC, REN,
and QUE pipe stages. ICOUNT has the effect of keeping one thread from filling the instruction
queue and favors threads that are moving instructions through the issue queues most efficiently. It
turns out the ICOUNT scheme is also highly effective at improving processor throughput. It
outperforms the best round-robin scheme by 23% and increases throughput to as much as 5.3 IPC
compared to 2.5 for a non-SMT superscalar with similar resources (in this study: 32 KB direct
mapped Icache and Dcache, 256 KB 4-way L2 cache, 2 MB direct mapped off-chip cache). In fact,
ICOUNT.1.8 consistently outperforms RR.2.8.
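The ICOUNT selection rule itself is simple to express. The sketch below is illustrative Python, not the EV8's actual fetch logic; the instruction counts and Icache miss flags are made-up values:

```python
def icount_pick(front_end_counts, icache_miss):
    """Pick the thread to fetch for this cycle: the one with the fewest
    instructions in the DEC/REN/QUE stages, skipping any thread
    currently stalled on an Icache miss."""
    candidates = [t for t in range(len(front_end_counts))
                  if not icache_miss[t]]
    return min(candidates, key=lambda t: front_end_counts[t])

# Thread 2 has clogged the queues and thread 1 is waiting on the Icache,
# so fetch goes to thread 3, which has only 5 instructions in flight:
print(icount_pick([9, 2, 30, 5], [False, True, False, False]))  # -> 3
```

A thread that moves instructions through the issue queues quickly keeps its count low and therefore keeps winning fetch cycles, which is exactly the anti-clogging bias described above.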
The performance difference between ICOUNT.1.8 and ICOUNT.2.8 doesn’t appear to be
significant. Given the choice between them, the EV8 designers would likely choose ICOUNT.1.8 to
halve Icache fetch bandwidth requirements and reduce associated power consumption. Interestingly,
in a more recent paper [6], Alpha architect Joel Emer and his collaborators seem to favor an
ICOUNT.2.4 scheme (2 active threads, up to 4 instructions fetched per thread per cycle). At first
glance this choice, to the extent that it foretells the actual EV8 fetch heuristic, seems contrary to
previous claims by Compaq that the SMT capabilities of EV8 would not hurt its single thread
performance compared to a single-threaded processor. One possible explanation for this apparent
contradiction may be that the ICOUNT.2.4 scheme as hypothetically implemented in EV8 is
capable of using a single thread PC value to access both Icache ports to permit 8 instruction wide
fetch capability for a single thread when appropriate. The processor organization of this
hypothetical ICOUNT.2.4 based EV8 design is shown in Figure 4.
Figure 4. Hypothetical EV8 CPU Organization
Compaq claims the overall impact of adding SMT capability will be to increase the die area of the
processor portion of the EV8 device by less than 10% [8]. It is harder to gauge the extra burden
SMT imposes on the already considerable design and verification effort for an eight issue wide
superscalar processor, even one implementing a streamlined and prescient RISC architecture like
the Alpha ISA. The potential for EV6-like schedule slips in the EV8 project seems ominously
tangible if Compaq’s Alpha managers and engineers haven’t taken to heart the lessons of that
unfortunate period.
Software Implications of SMT
An obvious question is how an SMT processor offers up its multithreading capabilities to
software. In the case of the EV8, it is through an abstraction called the thread processing unit or TPU. A
TPU is essentially a single-threaded virtual processor that is presented to the lowest level of the
operating system hardware abstraction layer (HAL). The EV8’s four way SMT capabilities are
represented with four separate TPUs as shown in Figure 5.
Figure 5. Software View of the EV8
Essentially the EV8 appears to software as consisting of four separate processors that share a single
set of translation lookaside buffers (TLBs) and caches. The advantages of SMT over a real four-way
chip level multiprocessor (CMP) are that only a single physical processor occupies die area, and that
cache coherency among the four TPUs comes without extra logic or overhead because they share the same caches.
Can the EV8 execute threads from different processes (i.e. threads with different address spaces)
simultaneously? That hasn’t been disclosed, but it would probably be easy to permit, although it might
not be desirable in practice because competing address spaces could thrash the TLBs. The enabling
mechanism is called an address space number (ASN) or address space identifier (ASID). In
conventional processors an ASN is a small hardware register (typically 6 to 8 bits in size)
containing a unique value that is appended to virtual addresses prior to translation. The purpose of
doing this is to speed up context switches in a multitasking operating system by avoiding flushing
and reloading the TLB state, and flushing and/or invalidating the caches. By simply changing the
value in the ASN register during a context switch, the OS can prevent a virtual address from one
process from accidentally matching the same virtual address from a previous process in the TLB
and/or cache. In the case of an SMT it would seem natural that a separate ASN register be provided
within each thread hardware context.
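The role of the ASN can be illustrated with a toy TLB keyed on an (ASN, virtual page number) pair. This is an illustrative sketch; the class and field names are my own, and a real TLB is a set-associative hardware structure, not a dictionary:

```python
class Tlb:
    """Toy TLB: entries are tagged with (ASN, virtual page number)."""
    def __init__(self):
        self.entries = {}                     # (asn, vpn) -> page frame

    def insert(self, asn, vpn, pfn):
        self.entries[(asn, vpn)] = pfn

    def lookup(self, asn, vpn):
        return self.entries.get((asn, vpn))   # None means TLB miss

tlb = Tlb()
tlb.insert(asn=1, vpn=0x400, pfn=0x9a)
print(hex(tlb.lookup(1, 0x400)))  # -> 0x9a : hit for the process with ASN 1
print(tlb.lookup(2, 0x400))       # -> None : same VA, different ASN, no false hit
```

Because the ASN participates in the match, a context switch only has to load a new ASN value; no TLB flush is required, and on an SMT each thread context can carry its own ASN.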
Another important issue is software’s ability to synchronize threads. The Alpha uses a
synchronization mechanism based on the load-locked/store-conditional model [9]. This scheme,
commonly used by RISC architectures, uses a software based spin loop to set or wait on a
semaphore. In a conventional single or multiprocessor system this works well. But on an SMT a
spin loop is horrendously wasteful of processing resources. To solve this problem Compaq invented
a spin loop quiescing feature that allows the TPU associated with a thread executing a spin loop to
be put to sleep until the associated semaphore memory location is modified. While asleep, the
associated thread does not consume any processor resources. This feature adds relatively little extra
logic to EV8 because it piggybacks on existing cache coherency mechanisms.
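The load-locked/store-conditional spin loop, and the point where quiescing would apply, can be sketched as follows. This is a toy single-threaded model: real hardware sets a per-processor lock flag on the load-locked and clears it on any intervening write to the locked line, which this sketch only approximates:

```python
class LockWord:
    """Toy model of a memory word with a single lock flag."""
    def __init__(self):
        self.value = 0
        self.lock_flag = False

    def load_locked(self):
        self.lock_flag = True          # remember the locked access
        return self.value

    def store_conditional(self, new_value):
        if self.lock_flag:             # no intervening write occurred
            self.value = new_value
            self.lock_flag = False
            return True
        return False                   # lost the reservation: retry

def acquire(word):
    """Classic spin-acquire of a semaphore via LL/SC."""
    attempts = 0
    while True:
        attempts += 1
        if word.load_locked() == 0 and word.store_conditional(1):
            return attempts
        # On an SMT, the retry path would quiesce the TPU here instead
        # of burning issue slots re-executing the loop.

print(acquire(LockWord()))  # -> 1 (semaphore free: first attempt wins)
```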
Summary
Simultaneous Multithreading technology seems to be a natural complement to the modern out-of-
order execution superscalar RISC processor. The difficult task of tracking computational results for
instructions from separate threads issuing and executing simultaneously is a natural fit with register
renaming schemes currently used to work around false register based data dependencies between
instructions and support recovery from speculated instruction execution. The problem of selecting
instructions from a group of active hardware threads for SMT issue and execution has a relatively
simple heuristic solution that provides robust performance over a wide range of workloads with
varying degrees of ILP and TLP.
Research to date suggests SMT can approximately double the throughput performance of an 8
instruction-issue wide processor like EV8 for a cost in extra processor complexity equivalent to less
than 10% increased die area for the processor core. The multithreading capabilities of an SMT
processor can be accessed by software through a virtual CMP model that uses abstracted TPUs in
place of multiple physical CPUs. Existing thread synchronization mechanisms can be retained with
little impact on SMT processor performance if appropriate measures are taken to ensure threads
waiting for a semaphore do not consume a share of execution resources.
In the third and final part of this article I will examine how the performance characteristics of SMT
potentially impact EV8’s competitive posture relative to alternative design approaches like EPIC
and CMP and the implications for the future of MPU design.
Footnotes
[1] Hennessy, J., 'Processor Design and Other Challenges in the Post-PC Era', Proceedings of
Microprocessor Forum 1999, October 5, 1999, Cahners MicroDesign Resources.
[2] Slater, M., 'MicroUnity Lifts Veil on MediaProcessor', Microprocessor Report, Vol. 9, No. 14,
October 23, 1995, p. 11.
[3] Gieseke, B., 'A 600 MHz Superscalar RISC Microprocessor with Out-Of-Order Execution',
Digest of Technical Papers, ISSCC 1997, February 7, 1997, p. 176.
[4] Farkas, K. et al, 'Register File Design Considerations in Dynamically Scheduled Processors',
DECWRL Report, November 1995.
[5] Emer, J., 'Simultaneous Multithreading: Multiplying Alpha Performance', Proceedings of
Microprocessor Forum 1999, October 5, 1999, Cahners MicroDesign Resources.
[6] Lo, J. et al, 'Converting Thread-Level Parallelism to Instruction Level Parallelism via
Simultaneous Multithreading', ACM Transactions on Computer Systems, Vol. 15, No. 3, August
1997, p. 322.
[7] Tullsen, D. et al, 'Exploiting Choice: Instruction Fetch and Issue on an Implementable
Simultaneous Multithreading Processor', Proceedings of the 23rd Annual International Symposium
on Computer Architecture, May 1996.
[8] Diefendorff, K., 'Compaq Chooses SMT for Alpha', Microprocessor Report, Vol. 13, No. 16,
December 6, 1999, p. 1.
[9] Sites, R., 'Alpha Architecture Reference Manual', Digital Press, 1992.
Fundamentals of Multithreading: http://www.slcentral.com/articles/01/6/multithreading/

Wiki 2
 

More from Fraboni Ec

Hardware multithreading
Hardware multithreadingHardware multithreading
Hardware multithreadingFraboni Ec
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceFraboni Ec
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data miningFraboni Ec
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data miningFraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksFraboni Ec
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cacheFraboni Ec
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsFraboni Ec
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonFraboni Ec
 
Abstract data types
Abstract data typesAbstract data types
Abstract data typesFraboni Ec
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsFraboni Ec
 
Abstraction file
Abstraction fileAbstraction file
Abstraction fileFraboni Ec
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisFraboni Ec
 
Abstract class
Abstract classAbstract class
Abstract classFraboni Ec
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaFraboni Ec
 

More from Fraboni Ec (20)

Hardware multithreading
Hardware multithreadingHardware multithreading
Hardware multithreading
 
Lisp
LispLisp
Lisp
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Cache recap
Cache recapCache recap
Cache recap
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Object model
Object modelObject model
Object model
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Abstract class
Abstract classAbstract class
Abstract class
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Inheritance
InheritanceInheritance
Inheritance
 
Api crash
Api crashApi crash
Api crash
 

Recently uploaded

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

What is simultaneous multithreading

left as white squares). Each thread runs for a short interval that ends when the program experiences an exception like a page fault, calls an operating system function, or is interrupted by an interval timer. When a thread is interrupted, a short segment of OS code (shown in Figure 1A as gray instructions in issue slots) is run which performs a context switch and switches execution to a new thread. Multitasking provides the illusion of simultaneous execution of multiple threads but does nothing to enhance the overall computational capability of the processor. In fact, excessive context switching causes processor cycles, which could have been used running user code, to be wasted in the OS.

The most basic type of TLP exploitation that can be incorporated into processor hardware is coarse grained multithreading (CMT), shown in Figure 1B. The processor incorporates two or more thread contexts (general purpose registers, program counter PC, process status word PSW, etc.) in hardware. One thread context is active at a time and runs until an exception occurs, or, more likely, a high latency operation such as a cache miss during a load instruction. When this occurs, the processor hardware automatically flushes the pipeline, changes the thread context, and switches execution to a new thread.
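The switch-on-miss behavior described above can be modeled in a few lines. This is an illustrative sketch, not a description of any real core: each thread is a stream of instruction events, and the core runs one thread until it raises a cache miss, then switches to the next hardware context.

```python
import itertools

# Toy model of coarse-grained multithreading (CMT): run one thread until
# it hits a long-latency event ("miss"), then switch to the next context.
# The flush/refill cost of the switch is ignored for simplicity.
def run_cmt(threads, cycles):
    """threads: list of iterators yielding 'op' or 'miss' events.
    Returns which thread owned each cycle."""
    trace = []
    active = 0
    for _ in range(cycles):
        event = next(threads[active])
        trace.append(active)
        if event == "miss":          # long-latency load: switch contexts
            active = (active + 1) % len(threads)
    return trace

# Thread 0 misses every 4th instruction; thread 1 every 6th (invented).
t0 = itertools.cycle(["op", "op", "op", "miss"])
t1 = itertools.cycle(["op", "op", "op", "op", "op", "miss"])
print(run_cmt([t0, t1], 12))   # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
```

The trace shows the core staying busy across thread 0's miss by handing the machine to thread 1, which is the whole point of CMT.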
For contemporary MPUs, a memory operation initiated in response to a cache miss can take over a hundred clock cycles, which represents the potential execution of hundreds of instructions. A conventional in-order processor will simply stall and forever lose those hundreds of potential instruction slots waiting for memory to respond with needed data. A conventional out-of-order execution processor has the potential to continue executing other instructions that aren't dependent on the missed load data. However, independent instructions tend to be quickly exhausted in most programs and the processor simply takes longer to stall. A coarse grained multithreaded processor, in contrast, can quickly switch to another thread after a cache miss and perform useful work while the first thread awaits its data from memory. Many programs spend considerable time waiting for memory operations, so a coarse grained multithreaded processor has the opportunity to increase overall system throughput compared to a conventional processor performing OS-based multitasking. The IBM PowerPC RS64, also known as Northstar, is rumored to incorporate two-way coarse grained multithreading capability, although it is not utilized in some product lines.

A more comprehensive way to exploit TLP in hardware is the fine grained multithreaded (FMT) processor. The operation of one variant of this class of machine is shown in Figure 1C. In this type of design there are N thread contexts in the processor, and instructions from each thread are allocated every Nth processor clock cycle to advance through the processor's execution pipeline by one stage. Figure 1C shows the operation of a four-way fine grained multithreaded processor, i.e. N = 4. At first glance it seems like each thread has only 1/Nth the performance potential of a conventional processor. In practice it fares much better than this, simply because the execution pipeline can be made much shorter from the logical viewpoint of a single thread. This reduces instruction latencies, simplifies compiler code scheduling, and increases the instructions per clock (IPC) component of performance.
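The cycle-slicing arithmetic behind this claim is simple enough to sketch. Assuming an idealized N-way FMT core that allots cycles strictly round-robin, a latency of L physical cycles appears as ceil(L/N) logical cycles to any single thread:

```python
# Idealized fine-grained multithreading (FMT) timing model.

def fmt_owner(cycle, n_threads):
    """Which thread context issues on a given physical clock cycle
    under strict round-robin allocation."""
    return cycle % n_threads

def apparent_latency(physical_cycles, n_threads):
    """Latency as seen from a single thread's logical pipeline."""
    return -(-physical_cycles // n_threads)   # ceiling division

# Four-way FMT (N = 4):
print([fmt_owner(c, 4) for c in range(8)])    # [0, 1, 2, 3, 0, 1, 2, 3]
print(apparent_latency(4, 4))    # a 4-cycle FP add looks single-cycle: 1
print(apparent_latency(100, 4))  # a 100-cycle miss looks like 25 cycles
```

This is exactly why the text's hypothetical four-way FMT core can offer single-cycle FP addition from each thread's point of view.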
For example, a four-way fine grained multithreaded processor might provide single cycle latency floating point (FP) addition while conventional processors typically require three or four cycles of latency. That is possible because the FP adder has four physical processor clock cycles to advance a thread's FP add instruction through what is one logical execution pipeline stage from the thread's viewpoint. In a similar fashion, memory latency appears to be 1/Nth the number of processor clock cycles from the viewpoint of individual threads. The hardware cost of fine grained multithreading is relatively modest: N thread contexts, plus control logic and multiplexors to cyclically commutate instructions and data from N different threads into and out of the execution units. The drawback of this approach is that its performance running any single thread is still appreciably less than that of a conventional processor, although system throughput is increased. An example of a fine grained multithreaded processor is the five-threaded MicroUnity MediaProcessor [2].

The EV8 uses a more powerful mechanism than either coarse or fine grained multithreading to exploit TLP. Called Simultaneous Multithreading (SMT), it allows instructions from two or more threads to be issued to execution units each cycle. This process is illustrated conceptually in Figure 1D. The advantage of SMT is that it permits TLP to be exploited all the way down to the most fundamental level of hardware operation - instruction issue slots in a given clock period. This allows instructions from alternate threads to take advantage of individual instruction execution opportunities presented by the normal ILP inefficiencies of single thread program execution. SMT can be thought of as equivalent to the airline practice of using standby passengers to fill seats that would have otherwise flown empty. Consider a single thread executing on a superscalar processor.
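The standby-passenger analogy can be made concrete with a toy issue model. The per-cycle ILP figures below are invented for illustration; the point is only that a second thread's ready instructions soak up slots the first thread cannot fill:

```python
# Toy model of SMT issue-slot filling on a 4-wide core: the primary
# thread issues what its ILP allows, and a second thread's ready
# instructions fill the leftover slots in the same cycle.

def fill_slots(width, primary_ready, secondary_ready):
    """Return (primary_issued, secondary_issued) for one cycle."""
    p = min(width, primary_ready)
    s = min(width - p, secondary_ready)   # standby fills what's left
    return p, s

# Invented per-cycle ILP available from each thread over six cycles:
primary   = [2, 1, 4, 0, 3, 1]
secondary = [3, 3, 2, 4, 1, 2]
issued = [fill_slots(4, p, s) for p, s in zip(primary, secondary)]
print(issued)                          # [(2, 2), (1, 3), (4, 0), ...]
print(sum(p + s for p, s in issued))   # 23 of 24 slots used
```

Alone, the primary thread would have used only 11 of 24 slots; with a co-running thread, nearly every slot carries an instruction.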
Conventional superscalar processors such as the Alpha EV6 fall well short of utilizing all the available instruction issue slots. This is caused by execution inefficiencies including data dependency stalls, the cycle-by-cycle shortfall between thread ILP and processor resources given limited re-ordering capability, and memory accesses that miss in cache. The big advantage of SMT over other approaches is its inherent flexibility in providing good performance over a wide spectrum of workloads. Programs that have a lot of extractable ILP can get nearly all the benefit of the wide issue capability of the processor. And programs with poor ILP can share with other threads instruction issue slots and execution resources that would otherwise have gone unused.

Hardware Requirements for SMT

Compared to a conventional out-of-order execution superscalar processor like the EV6, the following hardware changes are necessary to support SMT operation:

1. Multiple program counters (PCs), and the capacity to select one or more of them to direct instruction fetch each clock cycle.
2. Association of a thread identifier with each instruction fetched to distinguish different threads for the purpose of branch prediction, branch target buffering, and register renaming.
3. A per-thread capacity to retire, flush, and trap instructions.
4. A per-thread stack for prediction of subroutine return addresses.

One of the most remarkable aspects of SMT is that it takes relatively little extra logic to add the capability to the execution portion of an out-of-order execution superscalar processor that employs register renaming and issue queues. Register renaming is a scheme in which the logical registers in an instruction set architecture (ISA) are mapped to a subset of a larger pool of physical hardware registers. Each time an instruction is decoded, the logical register specified to be overwritten with the instruction result (i.e. the destination register) is assigned a mapping to a new physical register, i.e. it is renamed. When the instruction completes execution and retires, its physical destination register becomes officially bound to the logical destination register within the processor state, i.e. the result is committed. Register renaming permits out-of-order execution of instructions to proceed even in the presence of false dependencies, as shown in Figure 2.

Figure 2. Data Dependencies and Register Renaming

Register renaming is also done to permit speculative execution beyond conditional branches, since it allows the results of speculated instructions to be discarded and earlier processor state restored if the branch turns out to be mispredicted. In this case it is only necessary to restore an older mapping of logical to physical registers. The beauty of register renaming is that it allows an SMT processor to contain multiple thread contexts without the need for multiple physical register sets or additional complicated tracking logic to ensure execution results from instructions from different threads are written to the appropriate thread context. For example, the Alpha EV6 has 80 physical integer registers (there are actually 160 integer registers in the EV6 device, but these are really two duplicate sets of 80 for reasons I won't go into) and 72 physical FP registers. At any given time, 31 of the 80 physical integer registers contain the contents of the 31 logical general purpose registers that appear to the programmer in the Alpha ISA (there are actually 32 logical integer registers but one of them always reads as zero, as is customary for RISC architectures). The remaining physical registers are available for renaming. The EV6 uses two separate twelve-port register mappers for integer and FP register renaming, and each can rename up to four instructions per clock [3]. Content addressable memory (CAM)-based tables
are used to hold the register mapping state. The map tables are also buffered so that an older state can be saved and later restored, if necessary, to recover from branch mispredictions and exceptions.

At first glance, implementing a four-way SMT like the EV8 would seem to require four separate and independent register mapping tables, one for each thread. This could in fact be physically realized with a single map table if the size of the logical register specifiers used by the mapper is expanded to 7 bits by appending the two-bit thread identifier associated with a fetched instruction to the 5-bit logical register specifiers extracted from the instruction itself. Thread context 0 would then use mapper logical registers 0 through 31, thread 1 would use mapper logical registers 32 through 63, and so on. In this scheme each quadrant of the mapper CAM would have the capability to be independently backed up in buffers and restored as needed to maintain the illusion of serial, in-order execution of each thread.

Early research into 8-issue wide superscalar out-of-order processors suggests that with a 64 entry dispatch queue at least 96, and preferably 128, physical registers are needed to limit the fraction of time the processor is out of free registers to 15% and 10% respectively [4]. It is known that the EV8 supports four thread contexts in hardware [5]. This suggests that the EV8 needs an additional 96 integer physical registers above and beyond a conventional 8-issue wide superscalar, which places the number of integer physical renaming registers in the EV8 in the range of 192 to 224 for optimal performance. It should be noted that this exceeds even the 128 logical/physical integer registers required in implementations of Intel/HP's IA-64 instruction set architecture. Such a large, highly ported register file has the potential to seriously limit EV8's clock rate even with the use of an advanced 0.13 um process.
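The widened-specifier scheme can be sketched directly. Everything below is hypothetical, with the register count taken from the estimate above: the mapper index is simply the 2-bit thread id concatenated with the 5-bit logical register number, so four contexts share one table without collisions.

```python
# Sketch of per-thread register renaming via widened specifiers:
# mapper index = (thread_id << 5) | logical_register. Sizes are the
# speculative EV8-like figures from the text, not real hardware.

class RenameMap:
    def __init__(self, n_physical=224, n_threads=4, n_logical=32):
        self.table = {}   # widened specifier -> physical register
        self.free = list(range(n_threads * n_logical, n_physical))
        for tid in range(n_threads):        # initial 1:1 mapping per thread
            for reg in range(n_logical):
                self.table[(tid << 5) | reg] = tid * n_logical + reg

    def specifier(self, tid, reg):
        return (tid << 5) | reg             # 7-bit mapper index

    def rename_dest(self, tid, reg):
        """Allocate a fresh physical register for a destination write."""
        phys = self.free.pop()
        self.table[self.specifier(tid, reg)] = phys
        return phys

    def lookup(self, tid, reg):
        return self.table[self.specifier(tid, reg)]

m = RenameMap()
old = m.lookup(1, 5)
new = m.rename_dest(1, 5)                   # thread 1 writes r5
assert m.lookup(1, 5) == new != old         # r5 now renamed
assert m.lookup(0, 5) != m.lookup(1, 5)     # contexts never collide
```

Checkpointing a quadrant of the table (one thread's 32 entries) before a branch, and restoring it on a mispredict, would give the per-thread recovery the text describes.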
The best solution to this problem is to spread register read and write access across two pipe stages instead of one. This has the effect of lengthening the basic execution pipeline from EV6's seven stages to nine stages, as shown in Figure 3. One study suggests the extra two pipeline stages in the hypothetical EV8 will degrade single thread performance by less than 2% [6].

Figure 3. Comparison of EV6 and Hypothetical EV8 Execution Pipeline

Instruction Selection Strategies For SMT

I have described how the execution engine portion of an out-of-order superscalar processor implementing register renaming can be modified to support SMT operation. The big design issue with SMT is the algorithm that chooses between threads for the fetch and issue of instructions to that execution engine. A number of different schemes associated with 8-issue wide SMT RISC processor designs have been investigated and reported in the literature [7]. Some of these schemes are listed in Table 1.
Table 1. Instruction Fetch Schemes for an 8-Issue SMT Processor

Scheme          Max. Active      Max. Instr. Fetched   Selection Heuristic
                Threads/Cycle    per Thread/Cycle
RR.1.8          1                8                     Round-robin
RR.2.4          2                4                     Round-robin
RR.2.8          2                8                     Round-robin
BRCOUNT.1.8     1                8                     Fewest unresolved branches
BRCOUNT.2.8     2                8                     Fewest unresolved branches
MISSCOUNT.1.8   1                8                     Fewest outstanding Dcache misses
MISSCOUNT.2.8   2                8                     Fewest outstanding Dcache misses
ICOUNT.1.8      1                8                     Fewest instructions in DEC/REN/QUE pipe stages
ICOUNT.2.8      2                8                     Fewest instructions in DEC/REN/QUE pipe stages

The simplest scheme is termed RR.1.8: round-robin, one active thread, up to 8 instructions fetched. Each clock, the processor selects one thread from those not currently experiencing an instruction cache (Icache) miss on a round-robin basis and uses its PC value to fetch up to 8 instructions per cycle for decoding, renaming, and entry into the integer and/or FP instruction issue queues. The Icache design is essentially unchanged from that of a conventional single-threaded 8-issue wide superscalar processor. Variants include RR.2.4 and RR.2.8, which require a dual-ported Icache to permit simultaneous access using two different thread PC values. In the latter case the Icache also needs to support 16 instructions/cycle of fetch bandwidth, or twice that of a single-threaded processor. This scheme takes as many instructions as possible from the first thread and fills in any gaps with instructions fetched from the second thread. The RR.1.8 scheme provides 12% better single thread performance than RR.2.4, but RR.2.4 outperforms RR.1.8 with four active threads.
Unsurprisingly, the expensive RR.2.8 scheme outperforms both RR.1.8 and RR.2.4 for both single thread and four thread operation.

More sophisticated schemes have been devised to help increase the throughput of the processor. The BRCOUNT scheme attempts to give priority to threads that are least likely to be wasting instruction slots performing speculative execution. It does this by counting branch instructions in the decode (DEC) pipe stage, rename (REN) pipe stage, and instruction queues (QUE). Priority is given to the thread(s) with the smallest branch count. In practice, BRCOUNT.x.8 offers little performance advantage over RR.x.8. The MISSCOUNT scheme gives priority to the thread(s) with
the fewest outstanding data cache (Dcache) misses. Like BRCOUNT, MISSCOUNT.x.8 offers little advantage over RR.x.8.

The ICOUNT scheme takes a more general approach to preventing the 'clogging' of the instruction execution queues. Priority is given to the thread(s) with the fewest instructions in the DEC, REN, and QUE pipe stages. ICOUNT has the effect of keeping one thread from filling the instruction queue and favors threads that are moving instructions through the issue queues most efficiently. It turns out the ICOUNT scheme is also highly effective at improving processor throughput. It outperforms the best round-robin scheme by 23% and increases throughput to as much as 5.3 IPC, compared to 2.5 for a non-SMT superscalar with similar resources (in this study: 32 KB direct mapped Icache and Dcache, 256 KB 4-way L2 cache, and a 2 MB direct mapped off-chip cache). In fact, ICOUNT.1.8 consistently outperforms RR.2.8. The performance difference between ICOUNT.1.8 and ICOUNT.2.8 doesn't appear to be significant. Given the choice between them, the EV8 designers would likely choose ICOUNT.1.8 to halve Icache fetch bandwidth requirements and reduce associated power consumption.

Interestingly, in a more recent paper [6], Alpha architect Joel Emer and his collaborators seem to favor an ICOUNT.2.4 scheme (2 active threads, up to 4 instructions fetched per thread per cycle). At first glance this choice, to the extent that it foretells the actual EV8 fetch heuristic, seems contrary to previous claims by Compaq that the SMT capabilities of EV8 would not hurt its single thread performance compared to a single-threaded processor. One possible explanation for this apparent contradiction is that the ICOUNT.2.4 scheme as hypothetically implemented in EV8 may be capable of using a single thread's PC value to access both Icache ports, permitting 8-instruction wide fetch for a single thread when appropriate.
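The ICOUNT heuristic itself is just an argmin over per-thread in-flight counts. A minimal sketch, with invented counts:

```python
# Sketch of ICOUNT fetch selection: each cycle, pick the thread(s) with
# the fewest instructions sitting in the decode (DEC), rename (REN),
# and issue-queue (QUE) stages, favoring threads that drain the queues.

def icount_pick(in_flight, k=1):
    """in_flight: per-thread count of instructions in DEC/REN/QUE.
    Returns the k thread ids with the lowest counts (fetch priority)."""
    order = sorted(range(len(in_flight)), key=lambda t: in_flight[t])
    return order[:k]

# Thread 2 has nearly emptied its queues, so it gets fetch priority:
print(icount_pick([12, 7, 2, 9], k=1))   # [2]
print(icount_pick([12, 7, 2, 9], k=2))   # [2, 1]
```

A thread that stalls and clogs the queues sees its count stay high, so it automatically loses fetch priority until its instructions drain, which is the self-balancing property the text credits for ICOUNT's throughput advantage.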
The processor organization of this hypothetical ICOUNT.2.4-based EV8 design is shown in Figure 4.
Figure 4. Hypothetical EV8 CPU Organization

Compaq claims the overall impact of adding SMT capability will be to increase the die area of the processor portion of the EV8 device by less than 10% [8]. It is harder to gauge the extra burden SMT imposes on the already considerable design and verification effort for an eight-issue wide superscalar processor, even one implementing a streamlined and prescient RISC architecture like the Alpha ISA. The potential for EV6-like schedule slips in the EV8 project seems ominously tangible if Compaq's Alpha managers and engineers haven't taken to heart the lessons of that unfortunate period.

Software Implications of SMT

An obvious question is how an SMT processor offers its multithreading capabilities to software. In the case of the EV8, it does so with an abstraction called a thread processing unit, or TPU. A TPU is essentially a single-threaded virtual processor that is presented to the lowest level of the operating system hardware abstraction layer (HAL). The EV8's four-way SMT capabilities are represented as four separate TPUs, as shown in Figure 5.
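The TPU abstraction can be pictured as four complete architectural register contexts in front of one shared core. The struct below is a speculative sketch: the field names and layout are my own guesses based on the Alpha architecture's register set, not a disclosed EV8 data structure.

```c
#include <stdint.h>

/* Speculative sketch of the software-visible state of one TPU.
 * The Alpha architecture defines 31 writable integer and 31
 * floating-point registers (R31/F31 read as zero) plus a PC;
 * everything else on the chip is shared among the TPUs. */
struct tpu_context {
    uint64_t pc;        /* per-thread program counter */
    uint64_t gpr[31];   /* per-thread integer registers */
    uint64_t fpr[31];   /* per-thread floating-point registers */
};

struct ev8_smt_view {
    struct tpu_context tpu[4];  /* four virtual processors (TPUs) */
    /* shared by all TPUs: TLBs, Icache, Dcache, L2, functional units */
};
```

The point of the sketch is the ratio: the replicated per-thread state is small compared with the shared caches, queues, and execution resources, which is why SMT is cheap in die area relative to a true CMP.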
Figure 5. Software View of the EV8

Essentially, the EV8 appears to software as four separate processors that share a single set of translation lookaside buffers (TLBs) and caches. The advantages of SMT over a real four-way chip-level multiprocessor (CMP) are that only one physical processor occupies die area and that cache coherency comes without extra logic or overhead.

Can the EV8 execute threads from different processes simultaneously, i.e. threads with different address spaces? That hasn't been disclosed, but the simple answer is that it would probably be easy to permit, although it wouldn't be desirable in practice because it could thrash the TLBs. It is easy to permit with a mechanism called an address space number (ASN) or address space identifier (ASID). In conventional processors an ASN is a small hardware register (typically 6 to 8 bits in size) containing a unique value that is appended to virtual addresses prior to translation. The purpose of doing this is to speed up context switches in a multitasking operating system by avoiding flushing and reloading the TLB state and flushing and/or invalidating the caches. By simply changing the value in the ASN register during a context switch, the OS can prevent a virtual address from one process from accidentally matching the same virtual address from a previous process in the TLB and/or cache. In the case of an SMT it would seem natural for a separate ASN register to be provided within each thread hardware context.

Another important issue is software's ability to synchronize threads. The Alpha uses a synchronization mechanism based on the load-locked/store-conditional model [9]. This scheme, commonly used by RISC architectures, relies on a software spin loop to set or wait on a semaphore. In a conventional single-processor or multiprocessor system this works well, but on an SMT a spin loop is horrendously wasteful of processing resources.
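For concreteness, the kind of spin loop in question can be sketched in portable C11 atomics, with a compare-exchange standing in for Alpha's load-locked/store-conditional instruction pair (the real Alpha version would be a short assembly loop, and these function names are illustrative):

```c
#include <stdatomic.h>

/* Illustrative spin lock built on the load-locked/store-conditional
 * idea: keep retrying until we observe the semaphore free (0) and
 * atomically claim it (1). On an SMT without special support, this
 * loop burns issue slots that other threads could have used. */
void spin_acquire(atomic_int *sem)
{
    int expected;
    do {
        expected = 0;   /* analogue of the "load-locked" observation */
        /* the compare-exchange plays the "store-conditional" role:
         * it fails if another thread modified *sem in between */
    } while (!atomic_compare_exchange_weak(sem, &expected, 1));
}

void spin_release(atomic_int *sem)
{
    atomic_store(sem, 0);   /* mark the semaphore free again */
}
```

While `spin_acquire` is looping, every iteration occupies fetch and issue bandwidth, which on an SMT is bandwidth stolen from the other hardware threads.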
To solve this problem Compaq invented a spin loop quiescing feature that allows the TPU associated with a thread executing a spin loop to be put to sleep until the associated semaphore memory location is modified. While asleep, the thread does not consume any processor resources. This feature adds relatively little extra logic to the EV8 because it piggybacks on existing cache coherency mechanisms.

Summary

Simultaneous Multithreading technology seems to be an excellent complement to the modern out-of-order execution superscalar RISC processor. The difficult task of tracking computational results for instructions from separate threads issuing and executing simultaneously is a natural fit with the register renaming schemes currently used to work around false register-based data dependencies between instructions and to support recovery from speculative instruction execution. The problem of selecting instructions from a group of active hardware threads for SMT issue and execution has a relatively simple heuristic solution that provides robust performance over a wide range of workloads with varying degrees of ILP and TLP.
Research to date suggests SMT can approximately double the throughput of an 8-instruction-issue wide processor like the EV8, at a cost in extra processor complexity equivalent to less than a 10% increase in die area for the processor core. The multithreading capabilities of an SMT processor can be accessed by software through a virtual CMP model that presents abstracted TPUs in place of multiple physical CPUs. Existing thread synchronization mechanisms can be retained with little impact on SMT processor performance if appropriate measures are taken to ensure that threads waiting on a semaphore do not consume a share of execution resources.

In the third and final part of this article I will examine how the performance characteristics of SMT potentially impact EV8's competitive posture relative to alternative design approaches like EPIC and CMP, and the implications for the future of MPU design.

Footnotes

[1] Hennessy, J., 'Processor Design and Other Challenges in the Post-PC Era', Proceedings of Microprocessor Forum 1999, October 5, 1999, Cahners MicroDesign Resources.
[2] Slater, M., 'MicroUnity Lifts Veil on MediaProcessor', Microprocessor Report, Vol. 9, No. 14, October 23, 1995, p. 11.
[3] Gieseke, B., 'A 600 MHz Superscalar RISC Microprocessor with Out-Of-Order Execution', Digest of Technical Papers, ISSCC 1997, February 7, 1997, p. 176.
[4] Farkas, K. et al, 'Register File Design Considerations in Dynamically Scheduled Processors', DECWRL Report, November 1995.
[5] Emer, J., 'Simultaneous Multithreading: Multiplying Alpha Performance', Proceedings of Microprocessor Forum 1999, October 5, 1999, Cahners MicroDesign Resources.
[6] Lo, J. et al, 'Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading', ACM Transactions on Computer Systems, Vol. 15, No. 3, August 1997, p. 322.
[7] Tullsen, D. et al, 'Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor', Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
[8] Diefendorff, K., 'Compaq Chooses SMT for Alpha', Microprocessor Report, Vol. 13, No. 16, December 6, 1999, p. 1.
[9] Sites, R., 'Alpha Architecture Reference Manual', Digital Press, 1992.

Fundamentals of Multithreading: http://www.slcentral.com/articles/01/6/multithreading/