L1.pdf

Almaaqal University - Collage of Engineering-Department of Control and Computer Engineering
Advanced Computer Architectures, CC408, 4th
Year
1
Prepared By: Assist. Prof. Dr. Mohammed Al-Ibadi
Advanced Computer Architectures
1. Overview
Computer architecture is a fundamental concept in the field of computer science and
engineering. It encompasses the design and structure of computer systems, focusing on
how hardware components are organized and interconnected to execute instructions and
process data efficiently.
1.1 SISD, MISD, SIMD, MIMD Architectures
Michael Flynn has introduced taxonomy for various computer architectures based on
notions of Instruction Streams (IS) and Data Streams (DS). According to this taxonomy,
the Computer Architectures could be classified into four categories; Single Instruction
Single Data (SISD), Multiple Instruction Single Data (MISD), Single Instruction
Multiple Data (SIMD), and Multiple Instruction Multiple Data (MIMD).
Figure 1.1: Flynn Taxonomy of Computer Architecture
Let's explore each of them:

Year
2
1. SISD (Single Instruction, Single Data):
SISD architecture is the most basic and traditional form of computing. In this model, a
single central processing unit (CPU) executes one instruction at a time on a single piece
of data. It's a sequential, linear approach where each instruction is processed one after
the other. SISD is commonly found in older, uniprocessor systems.
2. SIMD (Single Instruction, Multiple Data):
In a SIMD architecture, a single instruction is applied simultaneously to multiple data
elements. This is achieved through the use of multiple processing units or cores, and
each unit processes a different data element simultaneously. SIMD is prevalent in
graphics processing units (GPUs) and is well-suited for tasks requiring parallel
processing, like image and video processing.

Year
3
3. MISD (Multiple Instruction, Single Data):
MISD architecture is the least common among the four. It involves multiple processing
units, each executing its own unique instruction on the same piece of data. This type of
architecture has limited practical applications and is often used for experimental or
specialized purposes, such as fault-tolerant systems.
4. MIMD (Multiple Instruction, Multiple Data):
MIMD architecture is the most versatile and widely used in modern computing. In
MIMD systems, multiple processors or cores independently execute different
instructions on separate data. This allows for true parallelism, making it suitable for a
wide range of applications, including multi-threaded software, scientific simulations,
and distributed computing.

Year
4
These different computer architectures are adapted to specific requirements and
workloads, and the choice of architecture depends on the nature of the tasks a computer
system needs to perform.
1.2 CISC and RISC
There are two major approaches to processor architecture: Complex Instruction Set
Computer (CISC) processors and Reduced Instruction Set Computer (RISC)
processors. Classic CISC processors are the Intel x86, Motorola 68xxx, and National
Semiconductor 32xxx processors and, to a lesser degree, the Intel Pentium. Common
RISC architectures are the Motorola/IBM PowerPC, the MIPS architecture, Sun’s
SPARC, the ARM, the ATMEL AVR, and the Microchip PIC.
CISC architecture is to complete a task in as few lines of assembly as possible. This is
achieved by building processor hardware that is capable of understanding and executing
a series of operations. RISC architecture only use simple instructions that can be
executed within one clock cycle. In the following we will discuss a case study for
illustrating the difference between CISC and RISC approaches.

Year
5
Case study: Multiplying Two Numbers in Memory
Figure 1.6 illustrates the storage scheme for a generic computer. The main memory is
divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4. The
execution unit is responsible for carrying out all computations. However, the execution
unit can only operate on data that has been loaded into one of the six registers (A, B, C,
D, E, or F). Let's say we want to find the product of two numbers - one stored in location
2:3 and another stored in location 5:2 - and then store the product back in the location
2:3.
Figure 1.6: The storage scheme for a generic computer
1. The CISC Approach
The primary goal of CISC architecture is to complete a task in as few lines of assembly
as possible. This is achieved by building processor hardware that is capable of
understanding and executing a series of operations. For this particular task, a CISC
processor would come prepared with a specific instruction (we'll call it "MUL"). When

Year
6
executed, this instruction loads the two values into separate registers, multiplies the
operands in the execution unit, and then stores the product in the appropriate register.
Thus, the entire task of multiplying two numbers can be completed with one instruction:
MUL 2:3, 5:2
MUL is what is known as a "complex instruction." It operates directly on the computer's
memory banks and does not require the programmer to explicitly call any loading or
storing functions. It closely resembles a command in a higher level language. For
instance, if we let "a" represent the value of 2:3 and "b" represent the value of 5:2, then
this command is identical to the C statement "a = a * b."
One of the primary advantages of this system is that the compiler has to do very little
work to translate a high-level language statement into assembly. Because the length of
the code is relatively short, very little RAM is required to store instructions. The
emphasis is put on building complex instructions directly into the hardware.
2. The RISC Approach
RISC processors only use simple instructions that can be executed within one clock
cycle. Thus, the "MUL" command described above could be divided into three separate
commands: "LOAD," which moves data from the memory bank to a register, "PROD,"
which finds the product of two operands located within the registers, and "STORE,"
which moves data from a register to the memory banks. In order to perform the exact
series of steps described in the CISC approach, a programmer would need to code four
lines of assembly:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
At first, this may seem like a much less efficient way of completing the operation.
Because there are more lines of code, more RAM is needed to store the assembly level

Year
7
instructions. The compiler must also perform more work to convert a high-level
language statement into code of this form.
However, the RISC strategy also brings some very important advantages. Because each
instruction requires only one clock cycle to execute, the entire program will execute in
approximately the same amount of time as the multi-cycle "MUL" command. These
RISC "reduced instructions" require less transistors of hardware space than the complex
instructions, leaving more room for general purpose registers. Because all of the
instructions execute in a uniform amount of time (i.e. one clock), pipelining is possible.
Separating the "LOAD" and "STORE" instructions actually reduces the amount of work
that the computer must perform. After a CISC-style "MUL" command is executed, the
processor automatically erases the registers. If one of the operands needs to be used for
another computation, the processor must re-load the data from the memory bank into a
register. In RISC, the operand will remain in the register until another value is loaded
in its place.
Due to their computing power and low power consumption, RISC processors are
becoming widely used, particularly in embedded computer systems, and many RISC
attributes are appearing in what are traditionally CISC architectures (such as with the
Intel Pentium). Ironically, many RISC architectures are adding some CISC-like
features, and so the distinction between RISC and CISC is blurring.
So, which is better for embedded and industrial applications, RISC or CISC?
If power consumption needs to be low, then RISC is probably the better architecture to
use. However, if the available space for program storage is small, then a CISC processor
may be a better alternative.
The following equation can show the execution time of the program by both RISC and
CISC:
𝑇𝑖𝑚𝑒
𝑃𝑟𝑜𝑔𝑟𝑎𝑚
=
𝑇𝑖𝑚𝑒
𝐶𝑦𝑐𝑙𝑒
∗
𝐶𝑦𝑐𝑙𝑒𝑠
𝐼𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛
∗
𝐼𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛𝑠
𝑃𝑟𝑜𝑔𝑟𝑎𝑚
1.1

Year
8
The CISC approach attempts to minimize the number of instructions per program,
sacrificing the number of cycles per instruction. RISC does the opposite, reducing the
cycles per instruction at the cost of the number of instructions per program.
2. System Buses
A bus connects various components in a computer system. We use the term system bus
to represent any bus within a processor system. These buses are also referred to as
internal buses.
A high-level view of a bus consists of an address bus for addressing information, a
data bus to carry data, and a control bus for various control signals. External buses,
on the other hand, are used to interface with the devices outside a typical processor
system. By our classification, serial and parallel interfaces, and Universal Serial Bus
(USB) belong to the external category. These buses are typically used to connect I/O
devices. In this section, we focus on internal buses.
Figure 2.1: A simplified block diagram showing components of the system bus
Figure 2.1 shows dedicated buses connecting the major components of a computer
system. The system bus consists of address, data, and control buses. One problem with

Year
9
the dedicated bus design is that it requires a large number of wires. We can reduce this
count by using multiplexed buses. For example, a single bus may be used for both
address and data. In addition, the address and data bus widths play an important role in
determining the address space and data transfer rate, respectively.
The control bus carries transaction-specific control information. Some typical control
signals are given below:
• Memory Read and Memory Write: These two control signals are used to indicate
that the transaction is a memory read or write operation.
• I/O Read and I/O Write: These signals indicate that the transaction involves an I/O
operation. The I/O read is used to read data from an I/O device. The I/O write, on the
other hand, is used for writing to an I/O device.
• Ready: A target device that requires more time to perform an operation typically uses
this signal to relate this fact to the requestor. For example, in a memory read operation,
if the memory cannot supply data within the CPU-specified time, it can let the CPU
know that it needs more time to complete the read operation. The CPU responds by
inserting wait states to extend the read cycle.
• Bus Request and Bus Grant: Since several devices share the bus, a device should
first request the bus before using it. The bus request signal is used to request the bus.
This signal is connected to a bus arbiter that arbitrates among the competing requests
to use the bus. The bus arbiter conveys its decision to allocate the bus to a device by
sending the bus grant signal to that device.
• Interrupt and Interrupt Acknowledgment: These two signals are used to facilitate
interrupt processing. A device requesting an interrupt service will raise the interrupt
signal. For example, if you depress a key, the keyboard generates an interrupt request
asking the processor to read the key. When the processor is ready to service the
interrupt, it sends the interrupt acknowledgment signal to the interrupting device. A
computer system has several devices that require interrupt processing. Therefore, as

Year
10
with the bus arbitration, we need a mechanism to arbitrate among the different interrupt
requests. This arbitration is usually done by assigning priorities to interrupts.
• DMA Request and DMA Acknowledgment: These two signals are used to transfer
data between memory and an I/O device in direct memory access mode. The normal
mode of data transfer between memory and an I/O device is via the processor. As an
example, consider transferring a block of data from a buffer in memory to an I/O device.
To perform this transfer, the processor reads a data word from the memory and then
writes to the I/O device. It repeats this process until all data are written to the I/O device.
This is called programmed I/O. The DMA mode relieves the processor of this chore.
The processor issues a command to the DMA controller (see Figure 2.1) by giving
appropriate parameters such as the data buffer pointer, buffer size, and the I/O device
id. The DMA controller performs the data transfer without any help from the processor.
Thus, the processor is free to work on other tasks. Note that when the DMA transfer is
taking place, the DMA controller acts as the bus master.
• Clock: This signal is used to synchronize operations of the bus and also provides
timing information for the operations.
• Reset: This signal initializes the system.
Buses can be designed as either synchronous or asynchronous buses. In synchronous
buses, a bus clock provides synchronization of all bus operations. Asynchronous buses
do not use a common bus clock signal; instead, these buses use handshaking to complete
an operation by using additional synchronization signals.
2.1 Bus Design Issues
Bus designers need to consider several issues to get the desired cost-performance trade-
off. Here is a list of the bus design issues:

Year
11
• Bus Width: Bus width refers to the data and address bus widths. System performance
improves with a wider data bus as we can move more bytes in parallel. We increase the
addressing capacity of the system by adding more address lines.
• Bus Type: As discussed above, there are two basic types of buses: dedicated and
multiplexed.
• Bus Operations: Bus systems support several types of operations to transfer data.
These include the read, write, block transfer, read-modify-write, and interrupt.
2.1.1 Bus Width
Data bus width determines how the data are transferred between two communicating
entities (e.g., CPU and memory). Although the instruction set architecture may have a
specific size, the data bus need not correspond to this value. For example, the Pentium
is a 32-bit processor. This simply means that the instructions can work on operands that
are up to 32 bits wide. The wider the data bus, the higher the bandwidth. Bandwidth
refers to the rate of data transfer. For example, we could measure bandwidth in number
of bits transferred per second. To improve performance, processors tend to use a wider
data bus. For example, even though the Pentium is a 32-bit processor, its data bus is 64
bits wide. Similarly, Intel’s 64-bit Itanium processor uses a 128-bit wide data bus.
It is, however, not cheap to build systems with wider buses. They need more space on
the motherboard or backplane, wider connectors, and more pins on the chip. For
economical reasons, cheaper processor versions use smaller bus widths. For example,
the IBM PC started with the 8088 CPU, which is a 16-bit processor just like the 8086.
Although the 8086 CPU uses a 16-bit data bus, its cheaper cousin the 8088 uses only 8-
bit data lines. Obviously, the 8088 needs two cycles to move a 16-bit value, 8-bits in
each cycle.
We should mention that improvement could also be obtained by increasing the clock
frequency. A 1 GHz Pentium moves data at a much faster rate than a 566 MHz Pentium.

Year
12
The address bus determines the system memory addressing capacity. A system with n
address lines can directly address 2n
memory words. In byte-addressable memories, that
means 2n
bytes. With each new generation of processors, we see a substantial increase
in memory addressing capacity.
You would think that the 4 GB address space of the Pentium is very large. But there are
applications (e.g., servers) that need more address space. For these and other reasons,
Intel’s 64-bit Itanium processor uses a 64-bit address bus. That’s a lot of address space
and guarantees that we will not run into address space problems for quite some time.
2.1.2 Bus Type
We have noted that we would like to have wider buses but they increase system cost.
For example, a 64-bit processor with 64 data and address lines requires 128 pins just
for these two buses. If you want to move 128 bits of data like the Itanium, we need 192
pins! Such designs are called dedicated bus designs because we have separate buses
dedicated to carry data and address information. The obvious advantage of these designs
is the performance we can get out of them. To reduce the cost of such systems we might
use multiplexed bus designs. In these systems, buses are not dedicated to a function.

Year
13
Instead, both data and address information are time multiplexed on a shared bus. We
refer to such a shared bus as an address-data (AD) bus.
To illustrate how multiplexed buses can be used, let us look at memory read and write
operations. In a memory read, the CPU places the address on the AD bus. The memory
unit reads the address and starts accessing the addressed memory location. In the
meantime, the CPU removes the address so that the same lines can be used by memory
to place the data. The memory write cycle operates similarly except that the CPU would
have to remove the address and then place the data to be written on the AD lines.
Obviously, multiplexed bus designs reduce the cost but they also reduce the system
performance.
2.1.3 Bus Operations
The bus operations are mainly done between processor and memory, or processor and
input/output, and between memory and input/output in case of DMA operation. The bus
operations consist of Memory Read, Memory Write, I/O Read, and I/O Write.
Processors support a variety of other operations. Processors provide block transfer
operations (like in cache memory) that read or write several contiguous locations of a
memory block. Such block transfers are more efficient than transferring each individual
word.
Read-modify-write operations are useful in multiprocessor systems. In these systems, a
shared data structure or a section of code, called the critical section, must be accessed
on a mutually exclusive basis (i.e., one at a time).
Interrupts are used to draw the attention of the processor for a service required by an
I/O device. Since the type of service required depends on the interrupting device, the
processor enters an interrupt cycle to get the identification of the interrupting device.
From this information, the processor executes an appropriate service needed by the
device. For example, when a key is depressed, the processor is interrupted, and the
interrupt cycle is used to find that the keyboard wants service (i.e., reading a key stroke).

L1.pdf

More Related Content

Similar to L1.pdf

Recently uploaded

L1.pdf