COMPUTER ARCHITECTURE
BATCH 2012
Assignment title
“Summary of Paper”
BY
FARWA ABDUL HANNAN
(12-CS-13)
&
ZAINAB KHALID
(12-CS-33)
Date of Submission: Wednesday, 11 May, 2016
NFC – INSTITUTE OF ENGINEERING AND FERTILIZER
RESEARCH, FSD
Simultaneous Multithreading: Maximizing On-Chip Parallelism
Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
______________________________________________________________________________
1. Introduction
The paper examines simultaneous
multithreading, a technique that permits
several independent threads to issue
instructions to multiple functional units in
each cycle. The objective of simultaneous
multithreading is to increase processor
utilization in the face of both long memory
latencies and limited available parallelism
per thread.
This study evaluates the potential
improvement, relative to wide superscalar
architectures and conventional multithreaded
architectures, of various simultaneous
multithreading models.
The results demonstrate the limits of
superscalar execution and traditional
multithreading in increasing instruction
throughput in future processors.
2. Methodology
The main goal is to evaluate several
architectural alternatives for simultaneous
multithreading. For this purpose, a
simulation environment was developed that
defines an implementation of a simultaneous
multithreaded architecture; that architecture
is an extension of next-generation wide
superscalar processors.
2.1 Simulation Environment
The simulator uses emulation-based,
instruction-level simulation that caches
partially decoded instructions for fast
emulated execution. It models the pipeline,
the memory hierarchy, and the branch
prediction logic of wide superscalar
processors. The simulator is based on the
Alpha 21164; unlike the Alpha, the model
supports increased single-stream parallelism.
The simulated configuration consists of 10
functional units of four types (four integer,
two floating-point, three load/store, and one
branch), and the maximum issue rate is 8
instructions per cycle. All functional units
are assumed to be completely pipelined. The
first- and second-level on-chip caches are
assumed to be considerably larger than on
the Alpha, for two reasons. First,
multithreading puts a larger strain on the
cache subsystem, and second, larger on-chip
caches are expected to be common in the
same time frame in which simultaneous
multithreading becomes viable. Simulations
with caches closer to those of current
processors were also run; they are discussed
in the experiments where appropriate, but
their results are not shown.
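As a rough illustration of this configuration, the following Python sketch encodes the functional unit mix and the 8-wide issue limit and checks a list of ready instructions against them. The names and structure are hypothetical illustrations, not taken from the authors' simulator.

```python
# Illustrative sketch of the simulated machine configuration described above.
# Names and structure are hypothetical; they are not from the authors' simulator.

FUNCTIONAL_UNITS = {
    "integer": 4,      # four integer units
    "float": 2,        # two floating-point units
    "load_store": 3,   # three load/store units
    "branch": 1,       # one branch unit
}

MAX_ISSUE_PER_CYCLE = 8  # at most 8 instructions issue per cycle


def issue(candidates):
    """Select up to MAX_ISSUE_PER_CYCLE ready instructions, limited by free units.

    `candidates` is a list of (thread_id, unit_type) pairs ready this cycle;
    returns the subset that actually issues. Units are assumed fully pipelined,
    so each unit can accept one new instruction per cycle.
    """
    free = dict(FUNCTIONAL_UNITS)
    issued = []
    for thread_id, unit_type in candidates:
        if len(issued) == MAX_ISSUE_PER_CYCLE:
            break
        if free.get(unit_type, 0) > 0:
            free[unit_type] -= 1
            issued.append((thread_id, unit_type))
    return issued


if __name__ == "__main__":
    ready = [(0, "integer")] * 5 + [(1, "load_store")] * 4
    print(issue(ready))  # 4 integer slots and 3 load/store slots are filled
```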
An instruction cache access occurs whenever
the program counter crosses a 32-byte
boundary; otherwise, the instruction is
fetched from the prefetch buffer.
Dependence-free instructions are issued in
order to an eight-instruction, per-thread
scheduling window. From there, instructions
can be scheduled onto functional units,
depending on functional unit availability.
Instructions that are not scheduled because
no suitable functional unit is available have
priority in the next cycle. This
straightforward issue model is complemented
with state-of-the-art static scheduling, using
the Multiflow trace scheduling compiler.
This reduces the benefits that might be
gained by full dynamic execution and
eliminates a great deal of complexity (e.g.,
there is no need for register renaming unless
precise exceptions are required, and a simple
1-bit-per-register scoreboarding scheme can
be used) in the replicated register sets and
fetch/decode pipes.
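The per-thread scheduling window and its priority rule can be sketched as follows. This is a minimal illustration under the assumptions stated in the comments (each instruction is reduced to the functional unit type it needs), not the simulator's actual issue logic.

```python
from collections import deque

WINDOW_SIZE = 8  # eight-instruction scheduling window per thread


class Thread:
    """Hypothetical per-thread state: dependence-free instructions enter the
    window in order; each instruction is just the functional unit type it needs."""

    def __init__(self, tid, instructions):
        self.tid = tid
        self.pending = deque(instructions)
        self.window = deque()

    def refill(self):
        while len(self.window) < WINDOW_SIZE and self.pending:
            self.window.append(self.pending.popleft())


def issue_cycle(threads, free_units, max_issue=8):
    """Schedule instructions from each thread's window onto free functional units.

    Instructions left behind keep their position in the window, so they retain
    priority in the next cycle, as described above.
    """
    issued = 0
    for t in threads:
        t.refill()
        kept = deque()
        while t.window:
            unit = t.window.popleft()
            if issued < max_issue and free_units.get(unit, 0) > 0:
                free_units[unit] -= 1
                issued += 1
            else:
                kept.append(unit)  # not scheduled this cycle: priority next cycle
        t.window = kept
    return issued


if __name__ == "__main__":
    threads = [Thread(0, ["integer"] * 6), Thread(1, ["load_store"] * 4)]
    units = {"integer": 4, "float": 2, "load_store": 3, "branch": 1}
    print(issue_cycle(threads, units))  # 7: four integer ops plus three load/store ops
```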
2.2 Workload
The workload consists of the SPEC92
benchmark suite, which comprises twenty
public-domain, non-trivial programs that are
widely used to measure the performance of
computer systems, particularly those in the
UNIX workstation market. These
benchmarks were expressly chosen to
represent real-world applications and were
intended to be large enough to stress the
computational and memory system resources
of current-generation machines.
To gauge the raw instruction throughput
achievable by multithreaded superscalar
processors, uniprocessor applications are
used, with a distinct program assigned to
each thread. This models a parallel workload
achieved by multiprogramming rather than
by parallel processing, so the throughput
results are not affected by synchronization
delays, inefficient parallelization, and similar
effects.
Each program is compiled with the Multiflow
trace scheduling compiler, which was
modified to produce Alpha code scheduled
for the target machine. The applications were
each compiled with several different compiler
options.
3. Superscalar Bottlenecks:
Where Have All the Cycles
Gone?
This section provides motivation for
simultaneous multithreading (SM). Using the
base single-hardware-context machine, the
issue utilization is measured, i.e., the
percentage of issue slots that are filled in
each cycle, for most of the SPEC
benchmarks, and the cause of each empty
issue slot is recorded. The results
demonstrate that the functional units of the
proposed wide superscalar processor are
highly underutilized, and that there is no
single dominant source of wasted issue
bandwidth. Simultaneous multithreading has
the potential to recover issue slots lost to
both horizontal and vertical waste. The next
section provides details on how effectively it
does so.
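To make the notions of horizontal and vertical waste concrete, the following hypothetical sketch classifies empty issue slots from a per-cycle issue trace: a completely empty cycle contributes vertical waste, while a partially filled cycle contributes horizontal waste. The trace values are made up for illustration.

```python
# Hypothetical classification of issue-slot waste from a per-cycle issue trace.
# A cycle with zero issues loses all slots to vertical waste; a partially
# filled cycle loses its remaining slots to horizontal waste.

ISSUE_WIDTH = 8


def classify_waste(issues_per_cycle):
    horizontal = vertical = 0
    for n in issues_per_cycle:
        empty = ISSUE_WIDTH - n
        if n == 0:
            vertical += empty      # whole cycle lost
        else:
            horizontal += empty    # cycle only partially filled
    total_slots = ISSUE_WIDTH * len(issues_per_cycle)
    return horizontal / total_slots, vertical / total_slots


if __name__ == "__main__":
    trace = [3, 0, 1, 8, 0, 2]          # instructions issued in each cycle (illustrative)
    h, v = classify_waste(trace)
    print(f"horizontal waste: {h:.0%}, vertical waste: {v:.0%}")
```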
4. Simultaneous
Multithreading
This section discusses the performance results
for simultaneous multithreaded processors.
Several machine models for simultaneous
multithreading are defined, and it is shown
that simultaneous multithreading provides a
significant performance improvement over
both single-threaded superscalar and
fine-grain multithreaded processors.
4.1 The Machine Models
The Fine-Grain Multithreading, SM:Full
Simultaneous Issue, SM:Single Issue,
SM:Dual Issue, and SM:Four Issue,
SM:Limited Connection models reflects
several possible design choices for a
combined multithreaded and superscalars
processors.
• Fine-Grain Multithreading
• SM:Full Simultaneous Issue
• SM:Single Issue
• SM:Dual Issue
• SM:Four Issue
• SM:Limited Connection
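A minimal sketch of the per-thread issue limits implied by these model names is given below. The encoding is a reading of the model names, not code from the paper; SM:Limited Connection's restriction on which functional units each context may use is only noted in a comment.

```python
# Illustrative encoding of the per-thread, per-cycle issue constraints that
# distinguish the machine models listed above.

MODEL_LIMITS = {
    "fine_grain":      {"threads_per_cycle": 1, "issue_per_thread": 8},  # one thread owns each cycle
    "sm_full":         {"threads_per_cycle": 8, "issue_per_thread": 8},  # any thread may fill any slot
    "sm_single_issue": {"threads_per_cycle": 8, "issue_per_thread": 1},
    "sm_dual_issue":   {"threads_per_cycle": 8, "issue_per_thread": 2},
    "sm_four_issue":   {"threads_per_cycle": 8, "issue_per_thread": 4},
    # SM:Limited Connection additionally restricts which functional units each
    # thread's instructions may be routed to; that is not modeled here.
}


def max_issue(model, active_threads, issue_width=8):
    """Upper bound on instructions issued per cycle under a given model."""
    lim = MODEL_LIMITS[model]
    threads = min(active_threads, lim["threads_per_cycle"])
    return min(issue_width, threads * lim["issue_per_thread"])


print(max_issue("sm_four_issue", active_threads=4))  # 8: four threads x 4 issues, capped at the width
```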
4.2 The Performance of Simultaneous
Multithreading
The performance of the simultaneous
multithreading models is also presented. The
fine-grain multithreaded architecture offers
only a limited maximum speedup over
single-thread execution, whereas the
simultaneous multithreading models achieve
considerably larger maximum speedups; the
largest speedups are obtained with full
simultaneous issue.
With simultaneous multithreading, it is not
necessary for any single thread to be able to
use all of the processor's resources in order
to reach maximum performance. Even the
four-issue model approaches full
simultaneous issue as the ratio of threads to
issue slots increases.
These experiments also show the possibility
of trading the number of hardware contexts
off against complexity in other areas. The
increases in processor utilization are a direct
result of threads sharing processor resources
that would otherwise remain idle much of the
time; however, sharing those resources also
has negative effects, and the resources that
are shared rather than replicated play an
important role in performance.
Single-thread performance suffers somewhat
from this sharing, but the gain from running
multiple threads more than compensates for
it. The largest effect is the sharing of the
caches; increasing the amount of shared data
is found to bring the wasted cycles down to
1%.
Large caches are not essential to obtain these
speedups. Experiments with smaller caches
show that cache size affects the 1-thread and
8-thread results similarly, so the overall
speedups remain roughly constant across a
wide range of cache sizes.
The results show that simultaneous
multithreading surpasses the performance
achievable through either single-thread
execution or fine-grain multithreading when
run on a wide superscalar. It is also observed
that basic implementations of SM with
limited per-thread capabilities can still
achieve high instruction throughput, with no
change to the architecture required.
5. Cache Design for a
Simultaneous Multithreaded
Processor
This section examines the cache design
problem. The focus is on the organization of
the first-level (L1) caches, comparing the use
of private per-thread caches with shared
caches for both instructions and data.
The experiments use the four-issue model
with up to 8 threads. When fewer than eight
threads are running, not all of the private
caches are used.
Among the trade-offs for multithreaded
caches, shared caches adapt well to a small
number of threads, while private caches
perform well with a large number of threads.
The instruction and data caches, however,
give opposite results, because the trade-offs
are not the same for data as for instructions.
A shared cache outperforms a private data
cache across the full range of thread counts,
whereas the instruction caches benefit from
being private at 8 threads. The reason is the
difference in access patterns between data
and instructions.
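The capacity side of this trade-off can be illustrated with a small sketch: a shared L1 is divided among only the running threads, while private partitions tied to idle hardware contexts go unused. The sizes and names here are hypothetical, not the paper's parameters.

```python
# Hypothetical sketch of the private-vs-shared L1 capacity trade-off described above.

TOTAL_L1_KB = 64  # fixed cache budget to divide among threads (illustrative)


def per_thread_capacity(shared, hw_contexts=8, running_threads=4):
    """Capacity effectively available to each running thread.

    A shared cache is divided among only the *running* threads, so it adapts
    when few threads run; private caches are statically split across all
    hardware contexts, so part of the budget sits idle with few threads.
    """
    if shared:
        return TOTAL_L1_KB / running_threads
    return TOTAL_L1_KB / hw_contexts  # idle contexts' partitions go unused


for n in (1, 2, 4, 8):
    print(f"{n} threads: shared {per_thread_capacity(True, running_threads=n):.0f} KB/thread, "
          f"private {per_thread_capacity(False, running_threads=n):.0f} KB/thread")
```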
6. Simultaneous
Multithreading versus
Single-Chip Multiprocessing
This section compares the performance of
simultaneous multithreading with small-scale,
single-chip multiprocessing (MP). The two
approaches are alike in that both have
multiple register sets, multiple functional
units, and high issue bandwidth on a single
chip. The essential difference lies in how
these resources are partitioned and organized.
Obviously, scheduling is more complex for an
SM processor.
The functional unit configuration is often
optimized for the multiprocessor and
represents an inefficient configuration for
simultaneous multithreading. The MP
configurations are evaluated with 1, 2, and 4
issues per cycle on each processor, and the
SM processors with 4 and 8 issues per cycle;
the four-issue model is used for all SM
results, which reduces the differences between
the SM and MP architectures.
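The following sketch contrasts, in purely illustrative terms, how the same on-chip resources are partitioned in the two approaches: MP splits register sets and issue slots across independent processors, while SM pools the issue bandwidth behind a single scheduler. The counts are illustrative groupings, not the paper's exact experimental configurations.

```python
# Illustrative comparison of resource partitioning in MP and SM organizations.

def mp_config(num_cpus, issue_per_cpu):
    """Small-scale single-chip multiprocessor: resources split per CPU."""
    return {
        "register_sets": num_cpus,           # one hardware context per CPU
        "peak_issue": num_cpus * issue_per_cpu,
        "scheduling_scope": issue_per_cpu,   # each scheduler sees only its own CPU
    }


def sm_config(num_contexts, issue_width):
    """Simultaneous multithreading: one wide core shared by all contexts."""
    return {
        "register_sets": num_contexts,
        "peak_issue": issue_width,
        "scheduling_scope": issue_width,     # one scheduler fills all slots from any thread
    }


print(mp_config(num_cpus=4, issue_per_cpu=2))    # e.g. four two-issue CPUs
print(sm_config(num_contexts=8, issue_width=8))  # e.g. one 8-issue SM core
```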
The experiments indicate that the SM results
may be optimistic in two respects: the
amount of time required to schedule
instructions onto functional units, and the
shared cache access time.
The distance between the load/store units
and the data cache can have a large influence
on cache access time. The multiprocessor,
with private caches and private load/store
units, can minimize those distances, but the
SM processor cannot do so, even with private
caches, because its load/store units are
shared.
However, different structures could remove
this difference.
There are further advantages of SM over MP
that are not captured by these experiments.
The first is performance with fewer threads:
the results shown reflect only performance at
maximum utilization, and the advantage of
SM over MP grows as some of the MP
processors become unutilized. The second is
granularity and flexibility of design: the
configuration options are richer with SM,
whereas with a multiprocessor, computing
capacity must be added in units of an entire
processor. These evaluations did not take
advantage of this flexibility.
The performance and complexity results
indicate that, as component densities allow
multiple hardware contexts and wide issue
bandwidth to be placed on a single chip,
simultaneous multithreading represents the
most efficient organization of those
resources.