Multithreaded
Processors
Paper presentation
Goal
Utilization of coarser-grained parallelism by CMPs and multithreaded processors
Focus is on processors designed to simultaneously execute threads of the
same or of different processes (explicit multithreaded processors).
Explicit multithreaded processors aim to increase the performance (lower
execution time) of a multiprogramming workload, whereas single-threaded,
implicitly multithreaded, and superscalar processors target the performance
of a single program.
CMP – Chip Multiprocessor (two or more processors on a single chip)
Multithreaded processor – interleaves the execution of different threads of
control in the same pipeline.
What is it?
●Notion of thread
● Different from a software application thread
● Coarse-grained thread-level parallelism
● Implies separate logical address space
●Implicit Multithreading
● Finds multiple threads of execution within a single sequential program.
●Explicit multithreading
● Multiple PCs, register contexts
● Different from RISC processors
Why do we need it?
• ILP is limited
• Memory latency problem: long-latency cycles can be covered with useful work
• Divide and branch interlocks: idle CPU cycles can be filled
• Latency sources: primary and secondary cache misses
• Several ready instructions from different threads become candidates for
execution
• Thread switching on a conventional single-threaded processor is costly!
• Otherwise idle hardware is utilized
Multithreaded Processors –
Principal Approaches
●Techniques
● Fast context switch (how?)
●Interleaved multithreading technique
● An instruction from a different thread every cycle
●Blocked multithreading technique
● A thread continues until an event forces a switch
●Simultaneous multithreading
● Simultaneously issue multiple instructions from multiple
threads (superscalar base)
Taken from [2]. Survey of processors with explicit multithreading.
Interleaved multithreading (fine-grained)
• Processor switches to a different thread after each instruction fetch (IF)
• Context switch after every clock cycle (see the scheduling sketch below)
• Eliminates data and control hazards
• Improves overall performance (total execution time across threads)
• Requires at least as many threads as pipeline stages
• Single-thread performance degrades
• Two techniques to overcome this
• Dependence lookahead technique (CRAY MTA)
• Interleaving technique
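To make the cycle-by-cycle switching concrete, here is a minimal Python sketch of round-robin thread selection at the fetch stage. It is an illustration only, not any real processor's control logic; the thread count, stage count, and fetch stub are assumed values.

```python
# Minimal sketch of fine-grained (interleaved) multithreading:
# every cycle the fetch stage takes an instruction from a different
# hardware thread, round-robin, so no thread has two instructions in
# the pipeline at once when threads >= pipeline stages.

NUM_THREADS = 6          # hardware thread contexts (assumed value)
PIPELINE_STAGES = 6      # must be <= NUM_THREADS for full latency hiding

pcs = [0] * NUM_THREADS  # one program counter per thread context

def fetch(thread_id, pc):
    """Stand-in for the IF stage: returns a fake instruction."""
    return f"T{thread_id}:insn@{pc}"

def run(cycles):
    for cycle in range(cycles):
        tid = cycle % NUM_THREADS          # switch thread after every fetch
        insn = fetch(tid, pcs[tid])
        pcs[tid] += 1
        print(f"cycle {cycle}: issued {insn}")

if __name__ == "__main__":
    run(12)
```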
CRAY MTA
• Interleaved multithreaded VLIW processor
• Uses an explicit dependence lookahead technique
• 3 bits per instruction encode the lookahead (number of following
independent instructions)
• Supports 128 distinct threads
• Hides memory latency
• VLIW
• 64-bit instructions consist of 3 operations
• <M-op, A-op, C-op> – priority from high to low (see the decode sketch below)
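A rough sketch of how such a 64-bit instruction might be decoded is given below. Only the idea follows the slide (three operations plus a 3-bit explicit lookahead field); the exact bit positions and field widths are invented for the example and are not the MTA's real encoding.

```python
# Illustrative decode of an MTA-style 64-bit instruction word.
# Field positions/widths below are assumptions for the example; only the
# concept (three operations plus a 3-bit lookahead field saying how many
# following instructions are independent) follows the slide.

def decode(word64):
    lookahead = word64 & 0x7            # 3-bit lookahead field (assumed position)
    c_op = (word64 >> 3) & 0xFFFFF      # control operation     (assumed 20 bits)
    a_op = (word64 >> 23) & 0xFFFFF     # arithmetic operation  (assumed 20 bits)
    m_op = (word64 >> 43) & 0x1FFFFF    # memory operation      (assumed 21 bits)
    return m_op, a_op, c_op, lookahead

word = 0x0123_4567_89AB_CDEF
m_op, a_op, c_op, la = decode(word)
print(f"M-op={m_op:#x} A-op={a_op:#x} C-op={c_op:#x} lookahead={la}")
# The issue logic may start up to `la` subsequent instructions of the
# same thread before this instruction's result must be available.
```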
Blocked multithreading (coarse-grained)
• A thread continues execution until a context switch is forced (see the sketch below)
• Single thread can proceed at full speed
• Fewer threads are needed than with interleaved multithreading
• Context switch events
• Switch-on-load
• Switch-on-store
• Switch-on-branch
• Switch-on-cache-miss
• Switch-on-signal (interrupts)
• Conditional switch
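Below is a minimal sketch of the switch-on-cache-miss policy, assuming a made-up 10% miss rate and trivial thread programs; it only illustrates that one thread keeps the pipeline at full speed until a long-latency event occurs.

```python
# Minimal sketch of blocked (coarse-grained) multithreading with a
# switch-on-cache-miss policy: a thread keeps the pipeline until one of
# its accesses misses, then the processor switches to the next thread.

import random

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.pc = 0

def is_cache_miss():
    """Stand-in for the cache: ~10% of accesses miss (assumed rate)."""
    return random.random() < 0.10

def run(threads, cycles):
    current = 0
    for cycle in range(cycles):
        t = threads[current]
        t.pc += 1                          # execute one instruction at full speed
        if is_cache_miss():                # context-switch event
            print(f"cycle {cycle}: T{t.tid} missed, switching")
            current = (current + 1) % len(threads)

if __name__ == "__main__":
    random.seed(0)
    run([Thread(i) for i in range(4)], 50)
```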
MIT Sparcle
• Context switch only during a remote cache miss
• Small latencies are hidden by the compiler
• Implements fast context switching (a generic sketch follows below)
• Also uses multiple register contexts and PCs
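The sketch below shows, in generic terms, why multiple on-chip register contexts make switching cheap: the switch is a pointer change rather than a save/restore to memory. This is not Sparcle's actual datapath; the context and register counts are assumed for illustration.

```python
# Why multiple hardware register contexts make a context switch fast:
# switching means repointing to another register bank and PC, with no
# register traffic to memory. Generic illustration, not Sparcle's design.

NUM_CONTEXTS = 4
REGS_PER_CONTEXT = 32

register_file = [[0] * REGS_PER_CONTEXT for _ in range(NUM_CONTEXTS)]
pcs = [0] * NUM_CONTEXTS
active = 0                     # hardware pointer to the running context

def read_reg(r):
    """Register reads are indexed through the active-context pointer."""
    return register_file[active][r]

def switch_on_remote_miss(next_context):
    """The 'switch' is just repointing; nothing is saved to memory."""
    global active
    active = next_context

switch_on_remote_miss(2)       # e.g. triggered by a remote cache miss
print("now running context", active, "at pc", pcs[active], "r5 =", read_reg(5))
```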
Simultaneous multithreading (SMT)
• Combines the superscalar and multithreading techniques
• All hardware contexts are active at once, competing for shared resources
• Issues multiple instructions from multiple threads each cycle
• Both TLP and ILP come into play
• Issue slots within a cycle, as well as whole cycles, are filled with
instructions from different threads (see the issue sketch below)
• Resource organization
• Resource sharing
• Resource replication
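Here is a small sketch of SMT issue-slot filling, assuming an 8-wide issue stage and invented per-thread ready queues; it shows how slots left empty by one thread are taken by another within the same cycle.

```python
# Sketch of simultaneous multithreading at the issue stage: in one cycle,
# up to ISSUE_WIDTH ready instructions are taken from *any* active thread,
# so slots one thread cannot fill are filled by another.

ISSUE_WIDTH = 8

# per-thread queues of ready (dependency-free) instructions this cycle (invented)
ready = {
    0: ["t0.a", "t0.b"],
    1: ["t1.a"],
    2: ["t2.a", "t2.b", "t2.c", "t2.d"],
    3: [],
}

def issue_one_cycle(ready_queues, width):
    slots = []
    tids = list(ready_queues)
    # round-robin over threads until the slots are full or nothing is ready
    while len(slots) < width and any(ready_queues.values()):
        for tid in tids:
            if ready_queues[tid] and len(slots) < width:
                slots.append(ready_queues[tid].pop(0))
    return slots

print("issued this cycle:", issue_one_cycle(ready, ISSUE_WIDTH))
# -> instructions from threads 0, 1 and 2 share the 8 issue slots
```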
SMT Alpha 21164 processor
• Simulations conducted on an 8-threaded, 8-issue superscalar model
• 3 floating-point units and 6 integer units are assumed
• Fetch policy: which thread(s) to fetch from each cycle (an example
policy is sketched below)
• Throughput
• 6.64 IPC on SPEC92 benchmark
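The slide does not name the fetch policy. One well-known policy from the SMT literature is ICOUNT, which favors the threads with the fewest instructions in the front end; the sketch below illustrates that selection rule with invented counts and is not taken from the cited simulation setup.

```python
# ICOUNT-style fetch selection: each cycle, fetch from the threads that
# currently have the fewest instructions in decode/queue, keeping the
# instruction mix balanced. Counts below are invented for the example.

def icount_pick(in_flight, num_to_fetch=2):
    """Return the thread ids with the fewest instructions in flight."""
    order = sorted(in_flight, key=in_flight.get)
    return order[:num_to_fetch]

in_flight = {0: 12, 1: 3, 2: 7, 3: 5}    # instructions in flight per thread
print("fetch this cycle from threads:", icount_pick(in_flight))
# -> [1, 3]: the two least-represented threads get the fetch slots
```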
Taken from [2]. Survey of processors with explicit multithreading.
Comparison
Chip Multiprocessors
1. Multiple processors on a single chip
2. Every unit is duplicated and works independently
3. The latency problem remains within each processor's issue cycles
4. Every part of the processor is simply duplicated, so it is easier to
implement
Multithreaded Processors
1. A single pipeline is multithreaded
2. Multiple threads are in execution, so multiple PCs and register sets are
needed
3. Latencies arising in one thread are filled with work from another thread,
unlike conventional RISC architectures
4. Hardware is either shared or replicated, so the design is more complex
References
1. Theo Ungerer, Borut Robic and Jurij Silc. (2002) Multithreaded Processors.
The Computer Journal, Vol. 45, No. 3.
2. Theo Ungerer, Borut Robic and Jurij Silc. (2003) A Survey of Processors
with Explicit Multithreading. ACM Computing Surveys, Vol. 35, No. 1, March
2003, pp. 29-63.
