Study of various factors affecting performance of multi core processors

Super-Scalar Processors
 A superscalar processor is a CPU that implements a form of parallelism called
instruction-level parallelism within a single processor.
Simple superscalar pipeline. By fetching and dispatching two instructions at a
time, a maximum of two instructions per cycle can be completed.
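The two-instructions-per-cycle ceiling can be illustrated with a minimal sketch. It assumes an idealized pipeline with no hazards or stalls; the five-stage depth and the function names are illustrative assumptions, not values from the slides.

```python
import math

def ideal_cycles(n_instructions, issue_width, pipeline_stages=5):
    """Cycles to complete n independent instructions on an ideal pipeline.

    Assumes no hazards, stalls, or cache misses (hypothetical best case).
    """
    if n_instructions == 0:
        return 0
    # Instructions enter the pipeline in groups of `issue_width` per cycle.
    issue_groups = math.ceil(n_instructions / issue_width)
    # The first group needs `pipeline_stages` cycles; each later group
    # completes one cycle after the previous one.
    return pipeline_stages + issue_groups - 1

def ideal_ipc(n_instructions, issue_width, pipeline_stages=5):
    """Instructions per cycle achieved in the ideal case."""
    cycles = ideal_cycles(n_instructions, issue_width, pipeline_stages)
    return n_instructions / cycles if cycles else 0.0

scalar_cycles = ideal_cycles(100, issue_width=1)
dual_issue_cycles = ideal_cycles(100, issue_width=2)
```

Under these assumptions the dual-issue pipeline approaches, but never exceeds, two instructions per cycle, because the pipeline fill time is paid once at the start.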

Super-Scalar Processors
 Allows higher CPU throughput by executing more than one instruction during a
clock cycle, simultaneously dispatching multiple instructions to different
execution units on the processor.
 An execution unit is an execution resource within a single CPU, such as an
arithmetic logic unit, a bit shifter, or a multiplier.

Super-Scalar Processor
 Superscalar design means the processor can issue multiple instructions in a
single clock cycle, with redundant facilities to execute them. This applies
within a single core; multicore processing is different.
 Pipelining divides an instruction into steps, and since each step is
executed in a different part of the processor, multiple instructions can
be in different "phases" in each clock cycle.

What is a Multi-Core Processor?
 A multi-core processor is a single computing component with two or more
independent actual processing units.
A core is the processing unit which receives instructions and performs
calculations, or actions, based on those instructions.

Multicore VS Super-Scalar
 In a superscalar processor, there is only one instruction counter.
 Even though a superscalar processor keeps track of multiple instructions in
flight, all the instructions are from a single program because this is still
just one processor.
 In a multi-core processor, multiple instruction streams execute simultaneously.
The important part is that each core (executing with its own instruction
counter) can also be superscalar in order to execute each single process more
quickly.

What is a Chip Multiprocessor?
 A chip multiprocessor (CMP) integrates the cores onto a single integrated
circuit.
Memory controller: a digital circuit that manages the flow of data going to and
from the computer's main memory.
PCIe (Peripheral Component Interconnect Express): a serial expansion bus
standard for connecting a computer to one or more peripheral devices. PCIe
provides lower latency and higher data transfer rates than parallel buses such
as PCI.
MISC (Minimal Instruction Set Computer): a processor architecture with a very
small number of basic operations and corresponding opcodes.

Why CMP?
 Enable sharing of computation resources.

Single Chip Multiprocessor
 Integration of resources on a single chip.
 Why?
 Commercial: commercial workloads depend on multi-threaded, throughput-oriented
programs.
 Long off-chip delays: Traditional symmetric multiprocessors suffer from a
performance penalty caused by memory stalls due to cache misses and
cache-to-cache transfers.

 Benefits
 Reduced cost of processing power.
 Low per-unit cost.
 Increased reliability, as there are far fewer electrical connections to fail.
 Increased throughput, as required by multi-threaded applications.
 Reduced overhead from sharing misses in traditional shared-memory
multiprocessors.

The core cache is divided into two parts: one for data and one for instructions.

Multiple-Chip Multiprocessor
 Known as M-CMP
 M-CMP is a combination of multiple CMPs.
[Figure: M-CMP diagram; labels: Processor, Data and Inst. cache]

Multiple-Chip Multiprocessor
 All of the systems use shared memory to preserve operating
system and application investment.
 A key challenge for M-CMP systems is implementing correct and
high-performance cache coherence protocols.
 These protocols keep caches transparent to software, usually by maintaining
the coherence invariant that each block may have either one writer or
multiple readers.
 M-CMPs present a greater challenge because they must maintain both
intra-CMP coherence and inter-CMP coherence.
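The "one writer or multiple readers" invariant can be sketched as a small permission tracker. This is an illustrative model only, not a coherence protocol from the slides; the class and method names are hypothetical, and real protocols add transient states, messages, and the intra-CMP/inter-CMP hierarchy.

```python
class CoherenceTracker:
    """Tracks read/write permissions per block (hypothetical sketch)."""

    def __init__(self):
        self.readers = {}  # block -> set of cache ids holding read permission
        self.writer = {}   # block -> cache id holding write permission

    def read(self, cache_id, block):
        owner = self.writer.get(block)
        if owner == cache_id:
            return  # the current writer may already read the block
        if owner is not None:
            # Downgrade the writer to a reader before sharing the block.
            del self.writer[block]
            self.readers.setdefault(block, set()).add(owner)
        self.readers.setdefault(block, set()).add(cache_id)

    def write(self, cache_id, block):
        # Invalidate every reader and any previous writer, then grant
        # exclusive write permission to the requester.
        self.readers[block] = set()
        self.writer[block] = cache_id

    def invariant_holds(self, block):
        """True if the block has one writer or any number of readers, not both."""
        has_writer = block in self.writer
        has_readers = len(self.readers.get(block, set())) > 0
        return not (has_writer and has_readers)

tracker = CoherenceTracker()
tracker.read(0, "A")
tracker.read(1, "A")
tracker.write(2, "A")  # caches 0 and 1 lose their copies
```

In an M-CMP, the same invariant would have to hold across caches on different chips as well as within one chip, which is why two levels of coherence are needed.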

Simulation
 Goal: evaluating the performance of novel CMP or M-CMP microarchitectures
requires a way of simulating the environment in which we would expect these
architectures to be used in real systems.
 Software:
 gem5: a modular platform for computer-system architecture research.
 Ruby: a memory simulator that implements a detailed simulation model for the
memory subsystem.
 Execution time: measured in Ruby cycles.
 L1 cache miss rate: calculated by dividing the number of missed requests by
the total number of requests (instruction + data).
 L2 misses/miss rate: calculated from the number of requests issued to the L2
and the misses of all L2 banks.
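The two miss-rate metrics can be written out as a short sketch. The function and parameter names are hypothetical and do not correspond to gem5 or Ruby statistic names; the counter values in the usage lines are made up for illustration.

```python
def l1_miss_rate(missed_requests, instruction_requests, data_requests):
    """L1 miss rate: missed requests over all requests (instruction + data)."""
    total_requests = instruction_requests + data_requests
    if total_requests == 0:
        return 0.0  # no requests issued, so no misses to report
    return missed_requests / total_requests

def l2_miss_rate(l2_requests, misses_per_bank):
    """L2 miss rate: misses summed over all L2 banks, over requests to the L2."""
    if l2_requests == 0:
        return 0.0
    return sum(misses_per_bank) / l2_requests

# Illustrative counters only, not measured results.
example_l1 = l1_miss_rate(missed_requests=50,
                          instruction_requests=600,
                          data_requests=400)
example_l2 = l2_miss_rate(l2_requests=200, misses_per_bank=[10, 20, 30, 40])
```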

Simulation
 L2/Dir replacements: number of replacements of L2/directory entries, caused
by capacity misses and conflict misses.
 Average miss latency: average L1 miss latency in Ruby cycles, measured from
the moment a memory request is issued to the moment the data is retrieved.
 Memory requests: Number of reads and writes issued to main memory.
 Cache size: 32KB L1I + 32KB L1D and 512KB L2 cache per core

Results
[Figure: axes labeled Directory size and Processors]
With more cores, the directory must be enlarged to keep cache misses low.

Results
L2 misses increase as the number of cores in the system increases.
The miss rate decreases to 40%-80% because of the larger total L2 cache on chip.

Results
Observed: the L2 miss rate decreases as the directory size increases.
L2 misses are mainly determined by the L2 cache size and, more importantly, the
application working set.

Results
Replacement policy: Least Recently Used (LRU)
An L2 replacement occurs when the L2 cache is full and another allocation is
required.
The number of replacements depends on the directory and L2 sizes and on the
application.
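LRU replacement can be sketched with a small fully associative cache model. This is a simplified illustration, not the simulated L2; the class name and the capacity used in the usage lines are assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with LRU replacement (illustrative sketch)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # ordered from least to most recently used
        self.misses = 0
        self.replacements = 0

    def access(self, block_addr):
        """Return True on a hit, False on a miss."""
        if block_addr in self.blocks:
            self.blocks.move_to_end(block_addr)  # mark as most recently used
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            # Cache is full: evict the least recently used block.
            self.blocks.popitem(last=False)
            self.replacements += 1
        self.blocks[block_addr] = True
        return False

cache = LRUCache(capacity_blocks=2)
for addr in [1, 2, 1, 3, 2]:
    cache.access(addr)
```

In this trace, block 2 is evicted when block 3 arrives because block 1 was touched more recently, and block 1 is evicted when block 2 returns.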

Results
Observation: a larger directory does not reduce directory replacements. The
reason is that many data blocks are mapped to the same location, resulting in
many conflicts. Increasing the set associativity of the directory can avoid
these conflicts.
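The effect of conflicts can be reproduced with a toy set-associative structure. This is a sketch under stated assumptions: modulo set indexing, LRU within each set, and a synthetic strided access pattern chosen so that every block maps to the same set. None of these values come from the simulated system.

```python
from collections import OrderedDict

def count_replacements(addresses, num_entries, associativity):
    """Count replacements in a set-associative structure with LRU per set."""
    num_sets = num_entries // associativity
    sets = [OrderedDict() for _ in range(num_sets)]
    replacements = 0
    for addr in addresses:
        entries = sets[addr % num_sets]  # assumed modulo index function
        if addr in entries:
            entries.move_to_end(addr)  # hit: refresh LRU position
            continue
        if len(entries) >= associativity:
            entries.popitem(last=False)  # set is full: evict its LRU entry
            replacements += 1
        entries[addr] = True
    return replacements

# Four blocks with a stride of 1024, accessed round-robin ten times.
pattern = [0, 1024, 2048, 3072] * 10

small_direct_mapped = count_replacements(pattern, num_entries=256, associativity=1)
large_direct_mapped = count_replacements(pattern, num_entries=512, associativity=1)
small_four_way = count_replacements(pattern, num_entries=256, associativity=4)
```

With this pattern, doubling the number of entries leaves the replacement count unchanged, while raising the associativity at the original size removes the replacements entirely.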

Results
Memory read requests caused by L2 misses are the dominant fraction of total
memory requests.

Results
Miss latency increases 50% from 4-core to 16-core and 150%-230% from 16-core to
64-core.
On an L1 miss, up to 3 nodes are involved in fulfilling the miss: the local
node, the home node and the remote node.
Home node: the output of the address mapping function.
Remote node: the cache line is requested by one of the cores.
Local node: the cache line is found in the local private or shared partition.
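A minimal sketch of an address mapping function and of the number of nodes touched by a miss follows. The block-interleaved modulo mapping, the 64-byte block size, and the reading of "remote node" as a third node that supplies the line are all assumptions for illustration; the slides do not specify the mapping used.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed, not from the slides)

def home_node(address, num_nodes, block_size=BLOCK_SIZE):
    """Map a physical address to its home node by block interleaving."""
    block_number = address // block_size
    return block_number % num_nodes

def nodes_involved(local, home, remote=None):
    """Number of distinct nodes that take part in servicing an L1 miss."""
    nodes = {local, home}
    if remote is not None:
        nodes.add(remote)  # a third node participates in the transaction
    return len(nodes)

home = home_node(address=64 * 17, num_nodes=16)
worst_case = nodes_involved(local=0, home=home, remote=2)
best_case = nodes_involved(local=0, home=0)
```

As the core count grows, fewer misses have a home node that coincides with the local node, which is one plausible reason the average miss latency rises.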

Results
For all structures, the latency looks almost the same; it depends on network
topology and on-chip link latency.
