MODULE – 2
Processor Design
1
Contents
■ Custom Single purpose Processor
– RT level Combinational Components
– RT level Sequential Components
– Custom Single Purpose Processor Design
– Optimizing custom single processors
– Optimizing original program, FSMD, datapath,
FSM
■ General Purpose Processors
– Basic Architecture
– Datapath
– Control unit
– Memory
– Pipelining
2
Contents (cont..)
■ Superscalar and VLIW Architectures
■ Application Specific Instruction Set Processors
(ASIPs)
– Microcontrollers
– DSP
– Less general ASIP environments
■ Selecting a Microprocessor/General purpose
processor
3
Introduction
■ Processor – Digital circuit to perform computation tasks
– Datapath
– Controller
■ General purpose processor
– Wide variety of computation tasks
■ Single purpose processor
– To carry out a particular computation task
– Common tasks
■ Custom single purpose processors
– Non-standard task
4
Introduction (cont..)
■ Why custom single purpose processor?
– Faster performance
■ Fewer clock cycles from customized datapath
■ Shorter clock cycles from simple functional units
– Smaller size
■ Simpler datapath
■ No program memory
– Less power consumption
■ More efficient computation
■ Drawbacks
– High NRE costs
– Time to market longer
– Flexibility reduced
5
Combinational Logic
■ Transistor – Basic electrical component in digital systems
■ Transistors → Logic Gates → Digital Systems
■ MOS transistor on silicon
– Acts as an on/off switch
– Voltage at “gate” controls whether current flows from source to
drain
[Figure: MOS transistor on silicon, showing the IC package, the IC, and the transistor's source, drain, gate, oxide and channel on a silicon substrate; the transistor conducts if gate=1]
CMOS Transistor Implementations
■ Complementary Metal Oxide Semiconductor
■ We refer to logic levels
– Typically 0 is 0V, 1 is 5V
■ nMOS conducts if gate=1
■ pMOS conducts if gate=0
■ Basic gates
[Figure: CMOS implementations of the basic gates: inverter (F = x'), NAND gate (F = (xy)') and NOR gate (F = (x+y)'), built from nMOS transistors (conduct if gate=1) and pMOS transistors (conduct if gate=0)]
Basic Logic Gates
Gate      Equation        F for xy = 00  01  10  11
AND       F = x y                    0   0   0   1
OR        F = x + y                  0   1   1   1
XOR       F = x ⊕ y                  0   1   1   0
XNOR      F = (x ⊕ y)'               1   0   0   1
NAND      F = (x y)'                 1   1   1   0
NOR       F = (x+y)'                 1   0   0   0

Gate      Equation        F for x = 0  1
Driver    F = x                     0  1
Inverter  F = x'                    1  0
Combinational Logic Design
■ Combinational circuit
– Digital Circuit whose output is a function of
current inputs
– No memory of past inputs
■ Steps in designing a Combinational Logic Circuit
1. Problem Definition
2. Truth Table
3. Output Equations
4. Minimized Expressions
5. Logic Circuit
9
Combinational Logic Design
1. Problem Description
y is 1 if a is equal to 1, or b and c are 1.
z is 1 if b or c is equal to 1, but not both, or if all
are 1.
10
Combinational Logic Design
(cont..)
2. Truth Table
11
a b c y z
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 1 0
1 0 1 1 1
1 1 0 1 1
1 1 1 1 1
Combinational Logic Design
(cont..)
3. Output Equations
y = a’bc + ab’c’ + ab’c + abc’ + abc
z = a’b’c + a’bc’ + ab’c + abc’ + abc
4. Minimized Expressions
y = a + bc
z = ab + b’c + bc’
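The minimized expressions can be checked against the truth table by enumerating all eight input combinations. The sketch below is only an illustration; the table arrays and function names are hypothetical, with the table columns copied from step 2.

```c
#include <assert.h>

/* Truth-table columns for y and z, indexed by the 3-bit value abc (a is the MSB). */
static const int Y_TABLE[8] = {0, 0, 0, 1, 1, 1, 1, 1};
static const int Z_TABLE[8] = {0, 1, 1, 0, 0, 1, 1, 1};

/* y = a + bc */
int y_min(int a, int b, int c) { return a | (b & c); }

/* z = ab + b'c + bc' */
int z_min(int a, int b, int c) { return (a & b) | (!b & c) | (b & !c); }

/* Aborts if either minimized expression disagrees with the truth table. */
void check_expressions(void) {
    for (int row = 0; row < 8; row++) {
        int a = (row >> 2) & 1, b = (row >> 1) & 1, c = row & 1;
        assert(y_min(a, b, c) == Y_TABLE[row]);
        assert(z_min(a, b, c) == Z_TABLE[row]);
    }
}
```

Calling check_expressions() returns silently when the minimization is correct for every row.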
12
Combinational Logic Design (cont..)
5. Logic Circuit
[Figure: gate-level circuit with inputs a, b, c and outputs y, z]
Combinational Logic Design
(cont..)
■ Large circuits are complex to design using logic gates
■ Eg- 16 inputs
– 2^16 = 64K rows in truth table
■ Reduce complexity by using components that are more abstract
than logic gates
14
Combinational Components
15
Sequential Logic Design
■ Sequential Circuit
– Output is a function of current as well as previous
input values
– Has memory
■ Basic sequential circuit – FLIP FLOP
– Stores a single bit
16
17
State Tables
Excitation Tables
Sequential Components
18
Sequential Logic Design (cont..)
■ Control Inputs
– Synchronous
– Asynchronous
■ Clear control lines are asynchronous
19
Sequential Logic Design
A) Problem Description
You want to construct a clock divider. Output a 1 for every
four clock cycles
20
B) State Diagram
21
c) Implementation Model
22
d) State Table
23
e) Minimized Expressions
24
f) Combinational Logic
25
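The clock divider can be modelled in software as a 2-bit state register plus next-state and output logic. This is a minimal sketch under stated assumptions: the slides' state diagram is not reproduced here, so the choice of asserting the output in the last of four states, and the names ClockDivider and clock_divider_tick, are illustrative only.

```c
#include <assert.h>

/* 2-bit state register: the divider cycles through states 0, 1, 2, 3. */
typedef struct { unsigned state; } ClockDivider;

/* One clock edge: advance the state and return the output.
   Assumption: the output is 1 only in the last of the four states. */
int clock_divider_tick(ClockDivider *d) {
    assert(d->state < 4);
    d->state = (d->state + 1) & 3u;   /* next-state logic: count modulo 4 */
    return d->state == 3;             /* output logic */
}
```

Starting from state 0, the function returns 1 on every fourth call and 0 otherwise.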
Custom Single-purpose Processor
Basic Model
[Figure: controller and datapath. The controller has external control inputs and outputs, the datapath has external data inputs and outputs, and the two are connected by the datapath control inputs and datapath control outputs]
[Figure: a view inside the controller and datapath. Controller: state register, next-state and control logic. Datapath: registers, functional units]
State Diagram Templates
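Assignment statement
a = b
next statement
– State diagram: a state performing a = b, followed by the state for the next statement
Loop statement
while (cond) {
loop-body-statements
}
next statement
– State diagram: condition state C: goes to the loop-body-statements on cond and to the next statement on !cond; the loop body ends in join state J:, which returns to C:
Branch statement
if (c1)
c1 stmts
else if c2
c2 stmts
else
other stmts
next statement
– State diagram: condition state C: goes to c1 stmts on c1, to c2 stmts on !c1*c2, and to the other stmts on !c1*!c2; all branches meet at join state J:, which leads to the next statement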
Example: Greatest Common Divisor
■ First create algorithm
■ Convert algorithm to “complex” state machine
– Known as FSMD: finite-state machine with datapath
– Can use templates to perform such conversion
28
(a) Black-Box View
[Figure: GCD block with inputs go_i, x_i, y_i and output d_o]
(b) Desired Functionality
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
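(c) State Diagram
1: on 1 go to 2; on !1 leave the loop
2: on !go_i go to 2-J; on !(!go_i) go to 3
2-J: go to 2
3: x = x_i; go to 4
4: y = y_i; go to 5
5: on x!=y go to 6; on !(x!=y) go to 9
6: on x<y go to 7; on !(x<y) go to 8
7: y = y - x; go to 6-J
8: x = x - y; go to 6-J
6-J: go to 5-J
5-J: go to 5
9: d_o = x; go to 1-J
1-J: go to 1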
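The FSMD can be exercised in software by stepping one state per clock cycle. The sketch below is a hypothetical model, not the hardware itself: the transitions follow the loop and branch templates applied to the GCD program, go_i is assumed to be asserted, and the function stops once state 9 has written the output instead of looping forever through state 1-J.

```c
#include <assert.h>

typedef enum { S1, S2, S2J, S3, S4, S5, S6, S7, S8, S6J, S5J, S9 } State;

/* Runs the FSMD until state 9 writes d_o; returns d_o and, if cycles is
   non-null, stores the number of states visited (one per clock cycle). */
unsigned gcd_fsmd(unsigned x_i, unsigned y_i, unsigned *cycles) {
    assert(x_i > 0 && y_i > 0);   /* the subtraction loop never ends on a zero input */
    unsigned x = 0, y = 0, n = 0;
    const int go_i = 1;           /* assumption: go_i is already asserted */
    State s = S1;
    for (;;) {
        n++;
        switch (s) {
        case S1:  s = S2; break;                      /* while (1) */
        case S2:  s = go_i ? S3 : S2J; break;         /* while (!go_i); */
        case S2J: s = S2; break;
        case S3:  x = x_i; s = S4; break;
        case S4:  y = y_i; s = S5; break;
        case S5:  s = (x != y) ? S6 : S9; break;
        case S6:  s = (x < y) ? S7 : S8; break;
        case S7:  y = y - x; s = S6J; break;
        case S8:  x = x - y; s = S6J; break;
        case S6J: s = S5J; break;
        case S5J: s = S5; break;
        case S9:  if (cycles) *cycles = n; return x;  /* d_o = x */
        }
    }
}
```

The cycle count makes the cost of the join states visible: every pass through the inner loop spends two cycles in 6-J and 5-J without doing any computation.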
31
Creating the Datapath
■ Create a register for any
declared variable
■ Create a functional unit
for each arithmetic
operation
■ Connect the ports,
registers and functional
units
– Based on reads and
writes
– Use multiplexors for
multiple sources
■ Create unique identifier
– for each control input
and output of datapath
components
32
Creating the Controller
■ State 3 → x_sel=0; x_ld=1;
– For loading ‘x’
■ State 4 → y_sel=0; y_ld=1;
– For loading ‘y’
■ State 7 → y_sel=1; y_ld=1;
– For loading the subtracted result y-x
■ State 8 → x_sel=1; x_ld=1;
– For loading the subtracted result x-y
■ State 9 → d_ld=1;
– For loading the output register
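The split between controller outputs and datapath behaviour can be sketched as two functions: one maps a state number to its control word, the other applies that control word to the registers on a clock edge. The struct and function names are hypothetical; the select encoding (0 picks the external input, 1 picks the subtractor result) is taken from the list above.

```c
#include <assert.h>

/* Control word driven by the controller into the datapath. */
typedef struct { int x_sel, x_ld, y_sel, y_ld, d_ld; } Control;

/* Datapath registers. */
typedef struct { unsigned x, y, d; } Datapath;

/* Controller output logic; states not listed leave every load signal at 0. */
Control control_for_state(int state) {
    Control c = {0, 0, 0, 0, 0};
    switch (state) {
    case 3: c.x_sel = 0; c.x_ld = 1; break;
    case 4: c.y_sel = 0; c.y_ld = 1; break;
    case 7: c.y_sel = 1; c.y_ld = 1; break;
    case 8: c.x_sel = 1; c.x_ld = 1; break;
    case 9: c.d_ld = 1; break;
    default: break;
    }
    return c;
}

/* One clock edge: the multiplexors choose each register's source and the
   load signals decide whether the register is updated. */
void datapath_clock(Datapath *dp, Control c, unsigned x_i, unsigned y_i) {
    assert(dp);
    unsigned x_next = c.x_sel ? dp->x - dp->y : x_i;
    unsigned y_next = c.y_sel ? dp->y - dp->x : y_i;
    if (c.d_ld) dp->d = dp->x;   /* d samples x before x is updated */
    if (c.x_ld) dp->x = x_next;
    if (c.y_ld) dp->y = y_next;
}
```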
33
Controller Implementation Model
■ Inputs
– go_i: enable
– Q3-Q0: output from the state register
– x_neq_y
– x_lt_y
■ Outputs
– x_sel, y_sel
– x_ld, y_ld
– d_ld
– I3-I0: next-state inputs to the state register
34
35
Completing the GCD Custom Single-Purpose Processor Design
[Figure: a view inside the controller and datapath. Controller: state register, next-state and control logic. Datapath: registers, functional units]
■ We finished the datapath
■ We have a state table for the
next state and control logic
■ Truth table for the
combinational logic
■ This is not an optimized
design.
Optimizing Single-Purpose
Processors
■ Optimization is the task of making design metric values the
best possible
■ GCD example: if the numbers are large, the algorithm takes more steps
– Speed decreases
■ Optimization opportunities
– Original Program
– FSMD
– Datapath
– FSM
37
Optimizing the Original Program
■ Analyze program attributes and look for areas of possible
improvement
– Number of computations
– Size of variable
– Time and space complexity
– Operations used
■ Multiplication and division very expensive
38
Optimizing the Original Program
(Cont..)
Original Program
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
GCD(42, 8)
• 9 iterations to complete the loop
• x and y values evaluated as follows: (42, 8),
(34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6),
(2, 4), (2, 2)
Optimized Program
0: int x, y, r;
1: while (1) {
2:   while (!go_i);
     // x must be the larger number
3:   if (x_i >= y_i) {
4:     x = x_i;
5:     y = y_i;
     }
6:   else {
7:     x = y_i;
8:     y = x_i;
     }
9:   while (y != 0) {
10:    r = x % y;
11:    x = y;
12:    y = r;
     }
13:  d_o = x;
   }
Replace the subtraction operation(s) with a modulo operation in
order to speed up the program
GCD(42, 8)
• 3 iterations to complete the loop
• x and y values evaluated as follows: (42, 8),
(8, 2), (2, 0)
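The two iteration counts can be reproduced with a short sketch. Both functions count every evaluation of the loop condition, including the final failing one, which matches the nine and three (x, y) pairs listed for GCD(42, 8). The function names and the evals parameter are illustrative, not part of the original programs.

```c
#include <assert.h>

/* Original program: repeated subtraction. */
unsigned gcd_subtract(unsigned x, unsigned y, unsigned *evals) {
    assert(x > 0 && y > 0);        /* a zero input would loop forever */
    unsigned n = 1;                /* the final, failing check is counted too */
    while (x != y) {
        if (x < y) y = y - x; else x = x - y;
        n++;
    }
    if (evals) *evals = n;
    return x;
}

/* Optimized program: modulo, with x arranged to be the larger number. */
unsigned gcd_modulo(unsigned x_i, unsigned y_i, unsigned *evals) {
    unsigned x = (x_i >= y_i) ? x_i : y_i;
    unsigned y = (x_i >= y_i) ? y_i : x_i;
    unsigned n = 1;
    while (y != 0) {
        unsigned r = x % y;
        x = y;
        y = r;
        n++;
    }
    if (evals) *evals = n;
    return x;
}
```

Fewer iterations do not automatically mean a faster or smaller circuit: the modulo unit is itself more expensive than a subtractor, which is the kind of trade-off the optimization step has to weigh.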
Optimizing the FSMD
■ Areas of possible improvements
– Merge states
■ States with constants on transitions can be eliminated, transition
taken is already known
■ States with independent operations can be merged
– Separate states
■ States which require complex operations (a*b*c*d) can be broken
into smaller states to reduce hardware size
– Scheduling
■ Task of assigning operations from the original program to
states in an FSMD
40
Optimizing the FSMD
41
[Figure: original FSMD and optimized FSMD]
• Eliminate state 1 – transitions have constant
values
• Merge state 2 and state 2-J – no loop operation in
between them
• Merge state 3 and state 4 – assignment
operations are independent of one another
• Merge state 5 and state 6 – transitions from state
6 can be done in state 5
• Eliminate states 5-J and 6-J – transitions from each
state can be done from state 7 and state 8,
respectively
• Eliminate state 1-J – transition from state 1-J can
be done directly from state 9
Optimizing the FSMD (cont..)
■ Consider a = b * c * d * e
■ Generating a single state for the operation requires 3 multipliers
in the datapath.
■ Multipliers are expensive
■ Break the operation down into smaller operations
– t1 = b * c
– t2 = d * e
– a = t1 * t2
■ Each smaller operation has its own state
■ Only 1 multiplier is required in the datapath
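The schedule above can be mimicked in software with one shared multiply routine that is invoked exactly once per state. This is only an illustration of the idea; the counter multiplier_uses and the function names are hypothetical.

```c
#include <assert.h>

static unsigned multiplier_uses = 0;

/* Stands in for the single multiplier in the datapath. */
static long multiply(long p, long q) { multiplier_uses++; return p * q; }

/* a = b * c * d * e scheduled over three states, one multiplication each. */
long product_in_three_states(long b, long c, long d, long e) {
    long t1 = 0, t2 = 0, a = 0;
    for (int state = 0; state < 3; state++) {
        unsigned before = multiplier_uses;
        switch (state) {
        case 0: t1 = multiply(b, c);   break;
        case 1: t2 = multiply(d, e);   break;
        case 2: a  = multiply(t1, t2); break;
        }
        assert(multiplier_uses - before == 1);  /* one multiplier use per state */
    }
    return a;
}
```

The price of the smaller datapath is time: the result takes three states instead of one, plus two extra registers for t1 and t2.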
42
Optimizing the FSMD (cont..)
■ Timing of output operations could be changed while the FSMD
is optimized
■ Reduced FSMD will generate GCD output in fewer clock cycles
■ Changing the timing would not be acceptable in all cases.
Eg- Clock divider
■ Thus, when optimizing FSMD, a designer must be aware of
whether output timing may or may not be modified.
43
Optimizing the Datapath
■ Sharing of functional units
– One-to-one mapping, as done previously, is not necessary
– If same operation occurs in different states, they can share
a single functional unit
■ Multi-functional units
– An ALU supports a variety of operations, so it can be shared
among operations occurring in different states
44
Optimizing the FSM
■ State Encoding
– Task of assigning a unique bit pattern to each state in an
FSM
– Size of the state register and of the combinational logic varies with the encoding
– Eg- FSM with n states – n! possible encodings
– Can be treated as an ordering problem
– More encodings are possible – can use more than log2(n)
bits to encode n states
– CAD tools – great aid in searching for the best encoding
■ State Minimization
– Task of merging equivalent states into a single state
■ Two states are equivalent if, for all possible input combinations, they
generate the same outputs and transition to the same next
state
45
RT-level Custom
Single-Purpose Processor Design
■ Converting a sequential program into a custom single-purpose
processor
– Convert the program into an FSMD
– Split the FSMD into a simple FSM controlling a datapath
– Perform sequential logic design on the FSM
■ In many cases, we prefer not to start with a program, but
instead directly with an FSMD
– Cycle-by-cycle timing of a system is central to the design
– Programming languages typically do not support cycle-by-cycle
description
RT-level Custom
Single-Purpose Processor Design (cont..)
■ Example
– A device (the sender) sends an 8-bit number to another device (the
receiver)
– Receiver can receive all 8 bits at once
– Sender sends 4 bits at a time – first the lower-order 4 bits and
then the higher-order 4 bits
■ A bridge must be designed to enable the 2 devices to
communicate
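The bridge's job can be sketched as a two-step state machine that stores the first nibble and emits a byte when the second arrives. This is a behavioural sketch only: the Bridge struct, bridge_receive, and the way readiness is signalled through the return value are assumptions, since the handshake signals of the actual design are not given in the text.

```c
#include <assert.h>

typedef struct {
    unsigned char low;   /* lower-order 4 bits, received first */
    int have_low;        /* 1 once the first nibble has arrived */
} Bridge;

/* Accepts one 4-bit transfer from the sender. Returns 1 and writes the
   assembled 8-bit value to *out when the higher-order nibble arrives. */
int bridge_receive(Bridge *br, unsigned char nibble, unsigned char *out) {
    assert(nibble <= 0x0F);
    if (!br->have_low) {
        br->low = nibble;        /* first transfer: remember the low nibble */
        br->have_low = 1;
        return 0;
    }
    *out = (unsigned char)((nibble << 4) | br->low);
    br->have_low = 0;            /* ready for the next 8-bit number */
    return 1;
}
```

Sending 0xB and then 0xA through the bridge yields the byte 0xAB for the receiver.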
General Purpose Processors -
Software
50
Introduction
■ General-Purpose Processor
– Processor designed for a variety of computation tasks
– Low unit cost, in part because manufacturer spreads NRE
over large numbers of units
■ Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
– Carefully designed since higher NRE is acceptable
■ Can yield good performance, size and power
– Low NRE cost, short time-to-market/prototype, high
flexibility
■ User just writes software; no processor design
– Also known as “microprocessor” – “micro” used when they
were implemented on one or a few chips rather than entire
rooms
51
Basic Architecture
52
■ Control unit and
datapath
– Note similarity to
single-purpose
processor
■ Key differences
– Datapath is general
– Control unit doesn’t
store the algorithm
– the algorithm is
“programmed” into
the memory
Datapath Operations
53
■ Load
– Read memory
location into
register
■ ALU operation
– Input certain
registers through
ALU, store back in
register
■ Store
– Write register to
memory location
Control Unit
■ Control unit: configures the
datapath operations
– Sequence of desired
operations
(“instructions”) stored in
memory – “program”
■ Instruction cycle – broken
into several sub-operations,
each one clock cycle, e.g.:
– Fetch: Get next
instruction into IR
– Decode: Determine what
the instruction means
– Fetch operands: Move
data from memory to
datapath register
– Execute: Move data
through the ALU
– Store results: Write data
from register to memory
Control Unit Sub-Operations
■ Fetch
– Get next
instruction
into IR
– PC: program
counter,
always points
to next
instruction
– IR: holds the
fetched
instruction
55
Control Unit Sub-Operations
■ Decode
– Determine
what the
instruction
means
56
Control Unit Sub-Operations
■ Fetch operands
– Move data
from memory
to datapath
register
57
Control Unit Sub-Operations
■ Execute
– Move data
through the
ALU
– This particular
instruction
does nothing
during this
sub-operation
58
Control Unit Sub-Operations
■ Store results
– Write data
from register
to memory
– This particular
instruction
does nothing
during this
sub-operation
59
Instruction Cycles
60
Architectural Considerations
■ N-bit processor
– N-bit ALU,
registers, buses,
memory data
interface
– Embedded: 8-bit,
16-bit, 32-bit
common
– Desktop/servers:
32-bit, even 64
■ PC size determines
address space
63
Architectural Considerations
■ Clock frequency
– Inverse of clock
period
– Clock period must be
longer than the longest
register-to-register
delay in the entire
processor
– Memory access is
often the longest
64
Pipelining: Increasing Instruction Throughput
65
66
Two Memory Architectures
■ Princeton
– Fewer memory
wires
■ Harvard
– Simultaneous
program and
data memory
access
[Figure: Harvard architecture (processor with separate program memory and data memory) and Princeton architecture (processor with a single memory for program and data)]
Cache Memory
■ Memory access may be slow
■ Cache is small but fast memory
close to processor
– Holds copy of part of
memory
– Hits and misses
[Figure: processor, cache, and memory. The cache uses fast/expensive technology, usually on the same chip; the memory uses slower/cheaper technology, usually on a different chip]
Superscalar and VLIW
Architectures
■ Performance can be improved by:
– Faster clock (but there’s a limit)
– Pipelining: slice up instruction into stages, overlap stages
– Multiple ALUs to support more than one instruction
stream
■ Superscalar
– Scalar: non-vector operations
– Fetches instructions in batches, executes as many as possible
■ May require extensive hardware to detect independent
instructions
– VLIW: each word in memory has multiple independent
instructions
■ Relies on the compiler to detect and schedule instructions
■ Currently growing in popularity
68
Programmer’s View
■ Programmer doesn’t need detailed understanding of
architecture
– Instead, needs to know what instructions can be executed
■ Two levels of instructions:
– Assembly level
– Structured languages (C, C++, Java, etc.)
■ Most development today done using structured languages
– But, some assembly level programming may still be necessary
– Drivers: portion of program that communicates with and/or controls
(drives) another device
■ Often have detailed timing considerations, extensive bit manipulation
■ Assembly level may be best for these
69
Assembly-Level Instructions
■ Instruction Set
– Defines the legal set of instructions for that processor
■ Data transfer: memory/register, register/register, I/O, etc.
■ Arithmetic/logical: move register through ALU and back
■ Branches: determine next PC value when not just PC+1
Instruction 1: opcode operand1 operand2
Instruction 2: opcode operand1 operand2
Instruction 3: opcode operand1 operand2
Instruction 4: opcode operand1 operand2
...
A Simple (Trivial) Instruction Set
71
Addressing Modes
72
Sample Programs
C program
int total = 0;
for (int i=10; i!=0; i--)
  total += i;
// next instructions...
Equivalent assembly program
      MOV R0, #0;   // total = 0
      MOV R1, #10;  // i = 10
      MOV R2, #1;   // constant 1
      MOV R3, #0;   // constant 0
Loop: JZ R1, Next;  // Done if i=0
      ADD R0, R1;   // total += i
      SUB R1, R2;   // i--
      JZ R3, Loop;  // Jump always
Next: // next instructions...
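To see the fetch, decode and execute steps at work on the sample program, the sketch below interprets the four instruction forms it uses. Everything about the encoding is an assumption made for illustration: the Instr struct, the opcode names, the HALT pseudo-instruction standing in for "next instructions", and the placement of Loop at index 4 and Next at index 8 of the array.

```c
#include <assert.h>

typedef enum { MOV_IMM, ADD, SUB, JZ, HALT } Opcode;

/* a: register number; b: register number, immediate value, or jump target. */
typedef struct { Opcode op; int a; int b; } Instr;

/* Fetch-decode-execute loop over a program held in an array. */
void run(const Instr *prog, int len, int reg[4]) {
    int pc = 0;
    while (pc < len) {
        Instr ir = prog[pc++];              /* fetch; PC now points to the next instruction */
        assert(ir.a >= 0 && ir.a < 4);
        switch (ir.op) {                    /* decode and execute */
        case MOV_IMM: reg[ir.a] = ir.b; break;
        case ADD:     reg[ir.a] += reg[ir.b]; break;
        case SUB:     reg[ir.a] -= reg[ir.b]; break;
        case JZ:      if (reg[ir.a] == 0) pc = ir.b; break;
        case HALT:    return;
        }
    }
}

static const Instr SUM_PROGRAM[] = {
    {MOV_IMM, 0, 0},   /* MOV R0, #0    total = 0 */
    {MOV_IMM, 1, 10},  /* MOV R1, #10   i = 10 */
    {MOV_IMM, 2, 1},   /* MOV R2, #1    constant 1 */
    {MOV_IMM, 3, 0},   /* MOV R3, #0    constant 0 */
    {JZ, 1, 8},        /* Loop: JZ R1, Next */
    {ADD, 0, 1},       /* ADD R0, R1    total += i */
    {SUB, 1, 2},       /* SUB R1, R2    i-- */
    {JZ, 3, 4},        /* JZ R3, Loop   jump always */
    {HALT, 0, 0},      /* Next: */
};

/* Runs the sample program and returns the final value of R0 (total). */
int sum_program_total(void) {
    int reg[4] = {0, 0, 0, 0};
    run(SUM_PROGRAM, (int)(sizeof SUM_PROGRAM / sizeof SUM_PROGRAM[0]), reg);
    return reg[0];
}
```

The "jump always" idiom works because R3 is held at 0, so JZ R3 is always taken; the instruction set has no unconditional jump of its own.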
Programmer Considerations
74
■ Program and data memory space
– Embedded processors often very limited
■ e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
■ Registers: How many are there?
– Only a direct concern for assembly-level programmers
■ I/O
– How does the program communicate with external signals?
■ Interrupts
Operating System
75
■ Optional software layer providing low-level services to a
program (application).
– File management, disk access
– Keyboard/display interfacing
– Scheduling multiple programs for execution
■ Or even just multiple threads from one program
– Program makes system calls to the OS
Development Environment
76
■ Development processor
– The processor on which we write and debug our programs
■ Usually a PC
■ Target processor
– The processor that the program will run on in our
embedded system
■ Often different from the development processor
Software Development Process
77
■ Compilers
– Cross compiler
■ Runs on one
processor, but
generates code
for another
■ Assemblers
■ Linkers
■ Debuggers
■ Profilers
Running a Program
■ If development processor is different than target,
how can we run our compiled code? Two options:
– Download to target processor
– Simulate
■ Simulation
– One method: Hardware description language
■ But slow, not always available
– Another method: Instruction set simulator (ISS)
■ Runs on development processor, but executes
instructions of target processor
78
Testing and Debugging
79
■ ISS
– Gives us control over
time – set breakpoints,
look at register values, set
values, step-by-step
execution, ...
– But, doesn’t interact with
real environment
■ Download to board
– Use device programmer
– Runs in real environment,
but not controllable
■ Compromise: Emulator
– Runs in real environment
– Supports some
controllability from the
PC
Application-Specific
Instruction-Set Processors (ASIPs)
80
■ General-Purpose Processors
– Sometimes too general to be effective in demanding
application
■ e.g., video processing – requires huge video buffers
and operations on large arrays of data, inefficient on a
GPP
– But single-purpose processor has high NRE, not
programmable
■ ASIP’s – targeted to a particular domain
– Contain architectural features specific to that domain
■ e.g., embedded control, digital signal processing, video
processing, network processing, telecommunications,
etc.
– Still programmable
A Common ASIP: Microcontroller
81
■ For embedded control applications
– Reading sensors, setting actuators
– Mostly dealing with events (bits): data is present, but not in huge
amounts
– e.g., VCR, disk drive, digital camera (assuming SPP for image
compression), washing machine, microwave oven
■ Microcontroller features
– On-chip peripherals
■ Timers, analog-digital converters, serial communication, etc.
■ Tightly integrated for programmer, typically part of register space
– On-chip program and data memory
– Direct programmer access to many of the chip’s pins
– Specialized instructions for bit-manipulation and other low-level
operations
■ Incorporating peripherals and memory onto the same IC reduces the number
of required ICs → compact and low-power implementations
Another Common ASIP: Digital Signal Processors
(DSP)
■ For signal processing applications
– Large amounts of digitized data, often streaming
■ Source – photo captured by a digital camera, a voice packet through
a network router
– Data transformations must be applied fast
– e.g., cell-phone voice filter, digital TV, music synthesizer
■ DSP features
– Several instruction execution units – filtering, transforming
vectors or matrices of data
– Multiply-accumulate single-cycle instruction, other instructions
– Efficient vector operations – e.g., add two arrays
■ Vector ALUs, loop buffers, etc.
– Contain a number of ADCs, DACs, PWMs, timers, counters, etc.
– Commonly used DSPs are well supported in terms of
compiler and other development tools → easy and cheap to
integrate into most embedded systems
Less General ASIP Environments
■ ASIPs that are less general in nature
■ Designed to perform very domain-specific processing while
allowing some degree of programmability
■ ASIPs designed for networking hardware may be designed to
be programmable with different network routing algorithms,
checksum, and packet processing protocols
83
Trend: Even More Customized
ASIPs
84
■ In the past, microprocessors were acquired as chips
■ Today, we increasingly acquire a processor as Intellectual
Property (IP)
– e.g., synthesizable VHDL model
■ Opportunity to add a custom datapath hardware and a few
custom instructions, or delete a few instructions
– Can have significant performance, power and size impacts
– Problem: need compiler/debugger for customized ASIP
■ Remember, most development uses structured languages
■ One solution: Automatic compiler/debugger generation
– e.g., www.tensillica.com
■ Another solution: Re-targetable compilers
– e.g., www.improvsys.com (customized VLIW architectures)
Selecting a Microprocessor
85
■ Issues
– Technical: speed, power, size, cost
– Other: development environment, prior expertise, licensing,
etc.
■ Speed: how to evaluate a processor’s speed?
– Clock speed – but instructions per cycle may differ
– Instructions per second – but work per instruction may differ
– Dhrystone: Synthetic benchmark, developed in 1984.
Dhrystones/sec.
■ MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780).
A.k.a. Dhrystone MIPS. Commonly used today.
– So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second
– SPEC: set of more realistic benchmarks, but oriented to desktops
– EEMBC – EDN Embedded Benchmark Consortium, www.eembc.org
■ Suites of benchmarks: automotive, consumer electronics,
networking, office automation, telecommunications
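The Dhrystone MIPS conversion above is a single multiplication, shown here as a small sketch. The constant comes from the slide; the function names are hypothetical.

```c
#include <assert.h>

/* 1 Dhrystone MIPS = 1757 Dhrystones per second (VAX 11/780 baseline). */
#define DHRYSTONES_PER_MIPS 1757L

/* Converts a Dhrystone MIPS rating to Dhrystones per second. */
long mips_to_dhrystones_per_sec(long mips) {
    assert(mips >= 0);
    return mips * DHRYSTONES_PER_MIPS;
}

/* Converts a measured Dhrystones-per-second score to Dhrystone MIPS. */
double dhrystones_per_sec_to_mips(double dps) {
    return dps / (double)DHRYSTONES_PER_MIPS;
}
```

For the 750 MIPS example, mips_to_dhrystones_per_sec(750) gives 1,317,750.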
General Purpose Processors
Processor | Clock speed | Periph. | Bus Width | MIPS | Power | Trans. | Price
General Purpose Processors
Intel PIII | 1 GHz | 2x16 K L1, 256K L2, MMX | 32 | ~900 | 97W | ~7M | $900
IBM PowerPC 750X | 550 MHz | 2x32 K L1, 256K L2 | 32/64 | ~1300 | 5W | ~7M | $900
MIPS R5000 | 250 MHz | 2x32 K, 2 way set assoc. | 32/64 | NA | NA | 3.6M | NA
StrongARM SA-110 | 233 MHz | None | 32 | 268 | 1W | 2.1M | NA
Microcontroller
Intel 8051 | 12 MHz | 4K ROM, 128 RAM, 32 I/O, Timer, UART | 8 | ~1 | ~0.2W | ~10K | $7
Motorola 68HC811 | 3 MHz | 4K ROM, 192 RAM, 32 I/O, Timer, WDT, SPI | 8 | ~.5 | ~0.1W | ~10K | $5
Digital Signal Processors
TI C5416 | 160 MHz | 128K, SRAM, 3 T1 Ports, DMA, 13 ADC, 9 DAC | 16/32 | ~600 | NA | NA | $34
Lucent DSP32C | 80 MHz | 16K Inst., 2K Data, Serial Ports, DMA | 32 | 40 | NA | NA | $75
Chapter Summary
87
■ General-purpose processors
– Good performance, low NRE, flexible
■ Controller, datapath, and memory
■ Structured languages prevail
– But some assembly level programming still necessary
■ Many tools available
– Including instruction-set simulators, and in-circuit emulators
■ ASIPs
– Microcontrollers, DSPs, network processors, more customized ASIPs
■ Choosing among processors is an important step
■ Designing a general-purpose processor is conceptually the same
as designing a single-purpose processor
Problems
88
1. An algorithm for matrix multiplication, assuming that we have one adder and
one multiplier, follows:
a. Convert the matrix multiplication algorithm into a state diagram.
b. Rewrite the matrix multiplication algorithm given the assumption that we have
3 adders and 6 multipliers.
c. If each multiplication takes 2 cycles to compute and each addition takes one
cycle to compute, how many cycles does it take to complete the matrix
multiplication given one adder and one multiplier? Three adders and six
multipliers?
d. If each adder requires 10 transistors to implement and each multiplier requires
100 transistors to implement, what is the total number of transistors to
implement the matrix multiplication circuit using 1 adder and 1 multiplier? Three
adders and six multipliers?
89
int main(void)
{
    int A[3][2] = {{1, 2}, {3, 4}, {5, 6}};
    int B[2][3] = {{7, 8, 9}, {10, 11, 12}};
    int C[3][3], i, j, k;
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {
            C[i][j] = 0;   /* clear the accumulator for this element */
            for (k = 0; k < 2; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return 0;
}
90
91
■ Cycles to complete matrix multiplication
– 1 adder + 1 multiplier = 54 cycles
– 3 adders + 6 multipliers = 9 cycles
■ Number of transistors
– 1 adder + 1 multiplier = 110 transistors
– 3 adders + 6 multipliers = 630 transistors
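The figures above can be reproduced with a little arithmetic. The sketch below is one way to arrive at them, under stated assumptions: with one adder and one multiplier every multiply and every accumulate runs strictly in sequence, and with three adders and six multipliers the schedule handles one row of C at a time (six products in parallel, then three sums in parallel). The schedule and all names are assumptions, not taken from the solution slides.

```c
#include <assert.h>

enum { ROWS = 3, COLS = 3, INNER = 2 };           /* C is 3x3, inner dimension 2 */
enum { MUL_CYCLES = 2, ADD_CYCLES = 1 };
enum { ADDER_TRANSISTORS = 10, MULTIPLIER_TRANSISTORS = 100 };

/* One adder, one multiplier: each of the 18 multiply-accumulate steps
   performs a multiplication and then an addition, strictly in sequence. */
int cycles_one_adder_one_multiplier(void) {
    int steps = ROWS * COLS * INNER;
    return steps * (MUL_CYCLES + ADD_CYCLES);
}

/* Three adders, six multipliers: assumed schedule computes one row of C per
   step, its 6 products in parallel followed by its 3 sums in parallel. */
int cycles_three_adders_six_multipliers(void) {
    return ROWS * (MUL_CYCLES + ADD_CYCLES);
}

/* Total transistors for the given number of functional units. */
int transistor_count(int adders, int multipliers) {
    assert(adders >= 0 && multipliers >= 0);
    return adders * ADDER_TRANSISTORS + multipliers * MULTIPLIER_TRANSISTORS;
}
```

The comparison shows the usual trade: six times fewer cycles for roughly six times the transistor count.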
92
2. Design a single-purpose processor that outputs Fibonacci
numbers up to n places. Start with a function computing the
desired result, translate it into a state diagram, and sketch a
probable datapath.
[Figure: state diagram and datapath sketch for the Fibonacci processor, with signals labeled c_ld, c_sel, x2_ld, x2_sel, count_lt_n and count_ne_0]

Esd module2

  • 1.
  • 2.
    Contents ■ Custom Singlepurpose Processor – RT level Combinational Components – RT level Sequential Components – Custom Single Purpose Processor Design – Optimizing custom single processors – Optimizing original program, FSMD, datapath, FSM ■ General Purpose Processors – Basic Architecture – Datapath – Control unit – Memory – Pipelining 2
  • 3.
    Contents (cont..) ■ Superscalarand VLIW Architectures ■ Application Specific Instruction Set Processors (ASIPs) – Microcontrollers – DSP – Less general ASIP environments ■ Selecting a Microprocessor/General purpose processor 3
  • 4.
    Introduction ■ Processor –Digital circuit to perform computation tasks – Datapath – Controller ■ General purpose processor – Wide variety of computation tasks ■ Single purpose processor – To carry out a particular computation task – Common tasks ■ Custom single purpose processors – Non-standard task 4
  • 5.
    Introduction (cont..) ■ Whycustom single purpose processor? – Faster performance ■ Fewer clock cycles from customized datapath ■ Shorter clock cycles from simple functional units – Smaller size ■ Simpler datapath ■ No program memory – Less power consumption ■ More efficient computation ■ Drawbacks – High NRE costs – Time to market longer – Flexibility reduced 5
  • 6.
    Combinational Logic ■ Transistor– Basic electrical component in digital systems ■ Transistors  Logic Gates  Digital Systems ■ MOS transistor on silicon – Acts as an on/off switch – Voltage at “gate” controls whether current flows from source to drain 6 source drain oxide gate IC package IC channel Silicon substrate gate source drain Conducts if gate=1
  • 7.
    CMOS Transistor Implementations■ ComplementaryMetal Oxide Semiconductor ■ We refer to logic levels – Typically 0 is 0V, 1 is 5V ■ nMOS conducts if gate=1 ■ pMOS conducts if gate=0 ■ Basic gates 7 x F = x' 1 Inverter 0 F = (xy)' x 1 x y y NAND gate 0 1 F = (x+y)' x y x y NOR gate 0 gate source drain nMOS Conducts if gate=1 gate source drain pMOS Conducts if gate=0
  • 8.
    Basic Logic Gates 8 F= x y AND F = (x y)’ NAND F = x  y XOR F = x Driver F = x’ Inverte r x F F = x + y OR F = (x+y)’ NOR x F x y F F x y x y F x y F x y F F =x y XNOR Fy x x 0 y 0 F 0 0 1 0 1 0 0 1 1 1 x 0 y 0 F 0 0 1 1 1 0 1 1 1 1 x 0 y 0 F 0 0 1 1 1 0 1 1 1 0 x 0 y 0 F 1 0 1 0 1 0 0 1 1 1 x 0 y 0 F 1 0 1 1 1 0 1 1 1 0 x 0 y 0 F 1 0 1 0 1 0 0 1 1 0 x F 0 0 1 1 x F 0 1 1 0
  • 9.
    Combinational Logic Design ■Combinational circuit – Digital Circuit whose output is a function of current inputs – No memory of past inputs ■ Steps in designing a Combinational Logic Circuit 1. Problem Definition 2. Truth Table 3. Output Equations 4. Minimized Expressions 5. Logic Circuit 9
  • 10.
    Combinational Logic Design 1.Problem Description y is 1 if a is equal to 1, or b and c are 1. z is 1 if b or c is equal to 1, but not both, or if all are 1. 10
  • 11.
    Combinational Logic Design (cont..) 2.Truth Table 11 a b c y z 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1
  • 12.
    Combinational Logic Design (cont..)3.Output Equations y= a’bc + abc’ + ab’c + abc’ + abc z= a’b’c + a’bc’ + ab’c + abc’ + abc 4. Minimized Expressions y= a + bc z= ab + b’c +bc’ 12
  • 13.
    Combinational Logic Design(cont..) 13 a b c y z
  • 14.
    Combinational Logic Design (cont..)■Large circuits complex to design using logic gates ■ Eg- 16 inputs – 216=64K rows in truth table ■ Reduce complexity by components that are abstract than logic gates 14
  • 15.
  • 16.
    Sequential Logic Design ■Sequential Circuit – Output is a function of current as well as previous input values – Has memory ■ Basic sequential circuit – FLIP FLOP – Stores a single bit 16
  • 17.
  • 18.
  • 19.
    Sequential Logic Design(cont..) ■ Control Inputs – Synchronous – Asynchronous ■ Clear control lines are asynchronous 19
  • 20.
    Sequential Logic Design A)Problem Description You want to construct a clock divider. Output a 1 for every four clock cycles 20
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
    Custom Single-purpose Processor BasicModel 26 controller and datapath controller datapath … … external control inputs external control outputs … external data inputs … external data outputs datapath control inputs datapath control outputs … … a view inside the controller and datapath controller datapath … … state register next-state and control logic registers functional units
  • 27.
    State Diagram Templates 27 Assignmentstatement a = b next statement a = b next statement Loop statement while (cond) { loop-body- statements } next statement loop-body- statements cond next statement !cond J: C: Branch statement if (c1) c1 stmts else if c2 c2 stmts else other stmts next statement c1 c2 stmts !c1*c2 !c1*!c2 next statement othersc1 stmts J: C:
  • 28.
    Example: Greatest Common Divisor■First create algorithm ■ Convert algorithm to “complex” state machine – Known as FSMD: finite-state machine with datapath – Can use templates to perform such conversion 28 GCD (a) Black-Box View x_i y_i d_o go_i
  • 29.
    b) Desired Functionality 29 0:int x, y; 1: while (1) { 2: while (!go_i); 3: x = x_i; 4: y = y_i; 5: while (x != y) { 6: if (x < y) 7: y = y - x; else 8: x = x - y; } 9: d_o = x; }
  • 30.
    c) State Diagram 30 y= y -x7: x = x - y8: 6-J: x!=y 5: !(x!=y) x<y !(x<y) 6: 5-J: 1: 1 !1 x = x_i3: y = y_i4: 2: 2-J: !go_i !(!go_i) d_o = x 1-J: 9:
  • 31.
  • 32.
    Creating the Datapath ■Create a register for any declared variable ■ Create a functional unit for each arithmetic operation ■ Connect the ports, registers and functional units – Based on reads and writes – Use multiplexors for multiple sources ■ Create unique identifier – for each control input and output of datapath components 32
  • 33.
    Creating the Controller ■Stage 3 x_sel=0; x_ld=1; – for loading ‘x’ ■ Stage 4  y_sel=0; y_ld=1; – For loading ‘y’ ■ Stage 7  y_sel=1; y_ld=1; – For loading the subtracted result y-x ■ Stage 8  x_sel=1; x_ld=1; – For loading the subtracted result x-y ■ Stage 9 d_ld=1 – Load the output register 33
  • 34.
    Controller Implementation Model ■Inputs – go_i Enable – Q3-Q0 Output from state register – x_neq_y – X_lt_y ■ Outputs – x_sel, y_sel – x_ld, y_ld – d_ld – I3 - I0 34
  • 35.
  • 36.
    Completing the GCDCustom Single- Purpose Processor Design 36 … … a view inside the controller and datapath controller datapath … … state register next-state and control logic registers functional units ■ We finished the datapath ■ We have a state table for the next state and control logic ■ Truth table for the combinational logic ■ This is not an optimized design.
  • 37.
    Optimizing Single-Purpose Processors■ Optimizationis the task of making design metric values the best possible ■ GCD eg- If numbers are large, it will take more steps – Speed decreases ■ Optimization opportunities – Original Program – FSMD – Datapath – FSM 37
  • 38.
    Optimizing the OriginalProgram ■ Analyze program attributes and look for areas of possible improvement – Number of computations – Size of variable – Time and space complexity – Operations used ■ Multiplication and division very expensive 38
  • 39.
    Optimizing the Original Program (cont.)
    Original Program
    0: int x, y;
    1: while (1) {
    2:   while (!go_i);
    3:   x = x_i;
    4:   y = y_i;
    5:   while (x != y) {
    6:     if (x < y)
    7:       y = y - x;
           else
    8:       x = x - y;
         }
    9:   d_o = x;
       }
    GCD(42, 8) with the original program
    • 9 iterations to complete the loop
    • x and y values evaluated as follows: (42, 8), (34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6), (2, 4), (2, 2)
    Optimized Program
    0: int x, y, r;
    1: while (1) {
    2:   while (!go_i);
         // x must be the larger number
    3:   if (x_i >= y_i) {
    4:     x = x_i;
    5:     y = y_i;
         }
    6:   else {
    7:     x = y_i;
    8:     y = x_i;
         }
    9:   while (y != 0) {
    10:    r = x % y;
    11:    x = y;
    12:    y = r;
         }
    13:  d_o = x;
       }
    Replace the subtraction operation(s) with a modulo operation in order to speed up the program
    GCD(42, 8) with the optimized program
    • 3 iterations to complete the loop
    • x and y values evaluated as follows: (42, 8), (8, 2), (2, 0)
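The iteration counts quoted for GCD(42, 8) can be checked with a short sketch. It counts every (x, y) pair that reaches the loop condition, which is how the slide's lists of 9 and 3 pairs are formed; the function names and the checks parameter are hypothetical, and the inputs are assumed to be positive.

```c
#include <assert.h>

/* Original program: repeated subtraction.
 * *checks counts how many (x, y) pairs reach the loop condition. */
unsigned gcd_subtract(unsigned x, unsigned y, unsigned *checks) {
    assert(x > 0 && y > 0);      /* with a zero input the loop would never end */
    *checks = 0;
    for (;;) {
        ++*checks;
        if (x == y) break;
        if (x < y) y = y - x;
        else       x = x - y;
    }
    return x;
}

/* Optimized program: modulo replaces the chain of subtractions. */
unsigned gcd_modulo(unsigned x_i, unsigned y_i, unsigned *checks) {
    assert(x_i > 0 && y_i > 0);
    /* x must be the larger number */
    unsigned x = (x_i >= y_i) ? x_i : y_i;
    unsigned y = (x_i >= y_i) ? y_i : x_i;
    *checks = 0;
    for (;;) {
        ++*checks;
        if (y == 0) break;
        unsigned r = x % y;
        x = y;
        y = r;
    }
    return x;
}
```

For the inputs 42 and 8 both functions return 2, with 9 condition checks for the subtraction version and 3 for the modulo version. The trade-off is that the modulo version needs a divider-like functional unit in the datapath, which is larger than a subtractor.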
  • 40.
    Optimizing the FSMD
    ■ Areas of possible improvement
      – Merge states
        ■ States with constants on transitions can be eliminated, since the transition taken is already known
        ■ States with independent operations can be merged
      – Separate states
        ■ States which require complex operations (a*b*c*d) can be broken into smaller states to reduce hardware size
      – Scheduling
        ■ Task of assigning operations from the original program to states in an FSMD
  • 41.
    Optimizing the FSMD (cont.)
    [Figure: original FSMD and optimized FSMD]
    • Eliminate state 1 – transitions have constant values
    • Merge state 2 and state 2-J – no loop operation in between them
    • Merge state 3 and state 4 – assignment operations are independent of one another
    • Merge state 5 and state 6 – transitions from state 6 can be done in state 5
    • Eliminate states 5-J and 6-J – transitions from each state can be done from state 7 and state 8, respectively
    • Eliminate state 1-J – transition from state 1-J can be done directly from state 9
  • 42.
    Optimizing the FSMD (cont.)
    ■ Consider a = b * c * d * e
    ■ Generating a single state for the operation requires 3 multipliers in the datapath
    ■ Multipliers are expensive
    ■ Break the operation down into smaller operations
      – t1 = b * c
      – t2 = d * e
      – a = t1 * t2
    ■ Each smaller operation has its own state
    ■ Only 1 multiplier is required in the datapath
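As a rough illustration of separating states, the sketch below models the three-state version in software: each call to product_step is one state, and every state uses the same multiplier once. The type and function names are hypothetical, and 16-bit inputs are assumed so that the 64-bit result cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ST_T1, ST_T2, ST_A, ST_DONE } ProductState;

typedef struct {
    ProductState state;
    uint64_t t1, t2, a;        /* registers holding the intermediate products */
    unsigned multiplier_uses;  /* how many times the single multiplier fired */
} ProductFsmd;

/* The one multiplier in the datapath; every state routes its operands here. */
static uint64_t shared_multiplier(ProductFsmd *f, uint64_t p, uint64_t q) {
    f->multiplier_uses++;
    return p * q;
}

/* One state (one clock cycle) of the FSMD computing a = b * c * d * e. */
void product_step(ProductFsmd *f, uint16_t b, uint16_t c, uint16_t d, uint16_t e) {
    switch (f->state) {
    case ST_T1: f->t1 = shared_multiplier(f, b, c);         f->state = ST_T2;   break;
    case ST_T2: f->t2 = shared_multiplier(f, d, e);         f->state = ST_A;    break;
    case ST_A:  f->a  = shared_multiplier(f, f->t1, f->t2); f->state = ST_DONE; break;
    case ST_DONE: break;
    }
}

/* Steps the FSMD until it finishes and returns a; *cycles reports the states used. */
uint64_t product_run(uint16_t b, uint16_t c, uint16_t d, uint16_t e, unsigned *cycles) {
    ProductFsmd f = { ST_T1, 0, 0, 0, 0 };
    *cycles = 0;
    while (f.state != ST_DONE) {
        product_step(&f, b, c, d, e);
        (*cycles)++;
    }
    assert(f.multiplier_uses == 3);  /* three uses of one multiplier, not three multipliers */
    return f.a;
}
```

The result takes three cycles instead of one, which is the size-versus-speed trade the slide describes.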
  • 43.
    Optimizing the FSMD (cont.)
    ■ The timing of output operations may change when the FSMD is optimized
    ■ The reduced FSMD generates the GCD output in fewer clock cycles
    ■ Changing the timing is not acceptable in all cases, e.g., a clock divider
    ■ Thus, when optimizing an FSMD, the designer must know whether the output timing may be modified
  • 44.
    Optimizing the Datapath
    ■ Sharing of functional units
      – One-to-one mapping, as done previously, is not necessary
      – If the same operation occurs in different states, those states can share a single functional unit
    ■ Multi-functional units
      – An ALU supports a variety of operations, so it can be shared among operations occurring in different states
  • 45.
    Optimizing the FSM
    ■ State encoding
      – Task of assigning a unique bit pattern to each state in an FSM
      – The size of the state register and of the combinational logic varies with the encoding
      – E.g., an FSM with n states encoded in log2 n bits (n a power of two) has n! possible encodings
      – Can be treated as an ordering problem
      – More encodings are possible when more than log2 n bits are used to encode n states
      – CAD tools are a great aid in searching for the best encoding
    ■ State minimization
      – Task of merging equivalent states into a single state
        ■ Two states are equivalent if, for all possible input combinations, they generate the same outputs and transition to the same next state
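To make the size of the encoding search space concrete, the sketch below computes the minimum state-register width and the number of distinct encodings for a given width. It assumes an encoding is simply an assignment of distinct bit patterns to states, so with b bits there are 2^b * (2^b - 1) * ... * (2^b - n + 1) choices; the function names are hypothetical and the width is capped at 4 bits so the count fits in 64 bits.

```c
#include <assert.h>

/* Smallest number of state-register bits that can tell n_states apart. */
unsigned min_state_bits(unsigned n_states) {
    assert(n_states >= 1 && n_states <= 65536u);
    unsigned bits = 0;
    while ((1u << bits) < n_states) bits++;
    return bits;
}

/* Number of ways to give n_states distinct patterns out of 2^bits patterns. */
unsigned long long encoding_count(unsigned n_states, unsigned bits) {
    assert(bits <= 4);                    /* keeps the product within 64 bits */
    assert(n_states <= (1u << bits));     /* need at least one pattern per state */
    unsigned patterns = 1u << bits;
    unsigned long long ways = 1;
    for (unsigned i = 0; i < n_states; i++)
        ways *= patterns - i;
    return ways;
}
```

For four states, two bits give 4! = 24 encodings, while three bits already give 1680, which is why the search is usually left to CAD tools.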
  • 46.
    RT-level Custom Single-Purpose Processor Design
    ■ Converting a sequential program into a custom single-purpose processor
      – Convert the program into an FSMD
      – Split the FSMD into a simple FSM controlling a datapath
      – Perform sequential logic design on the FSM
    ■ In many cases, we prefer not to start with a program but instead directly with an FSMD
      – Cycle-by-cycle timing of a system is central to the design
      – Programming languages don’t typically support cycle-by-cycle description
  • 47.
    RT-level Custom Single-Purpose Processor Design (cont.)
    ■ Example
      – A device (the sender) sends an 8-bit number to another device (the receiver)
      – The receiver can receive all 8 bits at once
      – The sender sends 4 bits at a time
      – First the lower-order 4 bits and then the higher-order 4 bits
    ■ A bridge should be designed that enables the 2 devices to communicate
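The bridge's behaviour can be sketched as a two-state machine. This is only an illustration under stated assumptions: the slide does not give the handshake, so the sketch assumes a strobe rdy_in that is true on exactly the calls where a nibble is valid, and a strobe rdy_out that is true for one step when a full byte is available. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { WAIT_LOW, WAIT_HIGH } BridgeState;

typedef struct {
    BridgeState state;
    uint8_t low;        /* register holding the lower-order nibble */
    uint8_t data_out;   /* assembled 8-bit value for the receiver */
    bool rdy_out;       /* true for one step when data_out is valid */
} Bridge;

/* One step of the bridge FSMD. nibble_in carries 4 bits from the sender. */
void bridge_step(Bridge *br, bool rdy_in, uint8_t nibble_in) {
    assert(nibble_in <= 0x0F);
    br->rdy_out = false;
    if (!rdy_in) return;                 /* nothing sent this step */
    if (br->state == WAIT_LOW) {
        br->low = nibble_in;             /* lower-order 4 bits arrive first */
        br->state = WAIT_HIGH;
    } else {
        br->data_out = (uint8_t)((nibble_in << 4) | br->low);
        br->rdy_out = true;              /* receiver can take all 8 bits at once */
        br->state = WAIT_LOW;
    }
}
```

Sending 0xA and then 0x3 makes the bridge present 0x3A to the receiver on the second step. A real bridge would also have to wait for the sender's strobe to drop between nibbles, which this sketch leaves out.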
  • 48.
  • 49.
    RT-level Custom Single-Purpose Processor Design (cont.)
  • 50.
  • 51.
    Introduction
    ■ General-purpose processor
      – Processor designed for a variety of computation tasks
      – Low unit cost, in part because the manufacturer spreads NRE over large numbers of units
        ■ Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
      – Carefully designed since higher NRE is acceptable
        ■ Can yield good performance, size and power
      – Low NRE cost for the embedded system designer, short time-to-market/prototype, high flexibility
        ■ User just writes software; no processor design
      – Also known as “microprocessor”
        ■ “micro” was used when they were implemented on one or a few chips rather than entire rooms
  • 52.
    Basic Architecture
    ■ Control unit and datapath
      – Note the similarity to a single-purpose processor
    ■ Key differences
      – Datapath is general
      – Control unit doesn’t store the algorithm; the algorithm is “programmed” into the memory
  • 53.
    Datapath Operations
    • Load
      – Read memory location into register
    • ALU operation
      – Input certain registers through ALU, store back in register
    • Store
      – Write register to memory location
  • 54.
    Control Unit
    ■ Control unit: configures the datapath operations
      – Sequence of desired operations (“instructions”) stored in memory: the “program”
    ■ Instruction cycle: broken into several sub-operations, each one clock cycle, e.g.:
      – Fetch: get next instruction into IR
      – Decode: determine what the instruction means
      – Fetch operands: move data from memory to datapath register
      – Execute: move data through the ALU
      – Store results: write data from register to memory
  • 55.
    Control Unit Sub-Operations
    ■ Fetch
      – Get next instruction into IR
      – PC: program counter, always points to next instruction
      – IR: holds the fetched instruction
  • 56.
    Control Unit Sub-Operations (cont.)
    ■ Decode
      – Determine what the instruction means
  • 57.
    Control Unit Sub-Operations (cont.)
    ■ Fetch operands
      – Move data from memory to datapath register
  • 58.
    Control Unit Sub-Operations (cont.)
    ■ Execute
      – Move data through the ALU
      – The particular instruction shown on this slide does nothing during this sub-operation
  • 59.
    Control Unit Sub-Operations (cont.)
    ■ Store results
      – Write data from register to memory
      – The particular instruction shown on this slide does nothing during this sub-operation
  • 60.
  • 61.
  • 62.
  • 63.
    Architectural Considerations
    ■ N-bit processor
      – N-bit ALU, registers, buses, memory data interface
      – Embedded: 8-bit, 16-bit, 32-bit common
      – Desktop/servers: 32-bit, even 64-bit
    ■ PC size determines address space
  • 64.
    Architectural Considerations (cont.)
    ■ Clock frequency
      – Inverse of clock period
      – The clock period must be longer than the longest register-to-register delay in the entire processor
      – Memory access is often the longest
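The relation between the longest register-to-register delay and the clock frequency can be written out as a small helper. This is a sketch with made-up delay values; the function name is hypothetical, and it treats the longest delay itself as the shortest usable period, ignoring setup time and margin.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the clock frequency in Hz, given every register-to-register
 * path delay in nanoseconds. The period can be no shorter than the longest one. */
double max_clock_hz(const double *delays_ns, size_t count) {
    assert(delays_ns != NULL && count > 0);
    double longest_ns = delays_ns[0];
    for (size_t i = 1; i < count; i++)
        if (delays_ns[i] > longest_ns)
            longest_ns = delays_ns[i];
    assert(longest_ns > 0.0);
    return 1e9 / longest_ns;    /* frequency = 1 / period */
}
```

With illustrative path delays of 2 ns, 10 ns (say, a memory access) and 4 ns, the bound is 100 MHz: the single slowest path sets the clock for the whole processor.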
  • 65.
  • 66.
    Two Memory Architectures
    ■ Princeton
      – Fewer memory wires
    ■ Harvard
      – Simultaneous program and data memory access
    [Figure: Harvard: processor connected to separate program memory and data memory. Princeton: processor connected to one memory holding program and data.]
  • 67.
    Cache Memory
    ■ Memory access may be slow
    ■ Cache is a small but fast memory close to the processor
      – Holds a copy of part of memory
      – Hits and misses
    [Figure: processor and cache in fast/expensive technology, usually on the same chip; memory in slower/cheaper technology, usually on a different chip.]
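Hits and misses can be illustrated with a toy cache model. The slide does not say how the cache is organized, so the sketch assumes a direct-mapped cache with 4 lines of 16 bytes each; these sizes and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINES 4u    /* assumed number of lines */
#define LINE_BYTES  16u   /* assumed bytes per line */

typedef struct {
    bool valid[CACHE_LINES];
    uint32_t tag[CACHE_LINES];
    unsigned hits, misses;
} Cache;

/* Looks up one address. Returns true on a hit; on a miss the line is
 * (conceptually) fetched from memory and replaces whatever was there. */
bool cache_access(Cache *c, uint32_t addr) {
    assert(c != NULL);
    uint32_t block = addr / LINE_BYTES;     /* which memory block */
    uint32_t index = block % CACHE_LINES;   /* the only line it may occupy */
    uint32_t tag   = block / CACHE_LINES;   /* identifies the block in that line */
    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return true;
    }
    c->valid[index] = true;
    c->tag[index] = tag;
    c->misses++;
    return false;
}
```

Accessing addresses 0 and 4 gives a miss followed by a hit, since both lie in the same line. Address 64 maps to the same line as address 0 and evicts it, so a later access to 0 misses again.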
  • 68.
    Superscalar and VLIW Architectures
    ■ Performance can be improved by:
      – Faster clock (but there’s a limit)
      – Pipelining: slice up instruction into stages, overlap stages
      – Multiple ALUs to support more than one instruction stream
    ■ Superscalar
      – Scalar: non-vector operations
      – Fetches instructions in batches, executes as many as possible
        ■ May require extensive hardware to detect independent instructions
      – VLIW: each word in memory has multiple independent instructions
        ■ Relies on the compiler to detect and schedule instructions
        ■ Currently growing in popularity
  • 69.
    Programmer’s View
    ■ Programmer doesn’t need detailed understanding of architecture
      – Instead, needs to know what instructions can be executed
    ■ Two levels of instructions:
      – Assembly level
      – Structured languages (C, C++, Java, etc.)
    ■ Most development today is done using structured languages
      – But some assembly-level programming may still be necessary
      – Drivers: portion of program that communicates with and/or controls (drives) another device
        ■ Often have detailed timing considerations, extensive bit manipulation
        ■ Assembly level may be best for these
  • 70.
    Assembly-Level Instructions
    [Figure: a program as a sequence of instructions (Instruction 1, Instruction 2, Instruction 3, Instruction 4, ...), each with the fields opcode, operand1, operand2]
    ■ Instruction set
      – Defines the legal set of instructions for that processor
        ■ Data transfer: memory/register, register/register, I/O, etc.
        ■ Arithmetic/logical: move register through ALU and back
        ■ Branches: determine next PC value when not just PC+1
  • 71.
    A Simple (Trivial) Instruction Set
  • 72.
  • 73.
    Sample Programs
    C program
    int total = 0;
    for (int i=10; i!=0; i--)
      total += i;
    // next instructions...
    Equivalent assembly program
    0       MOV R0, #0;     // total = 0
    1       MOV R1, #10;    // i = 10
    2       MOV R2, #1;     // constant 1
    3       MOV R3, #0;     // constant 0
    Loop:   JZ R1, Next;    // Done if i=0
    5       ADD R0, R1;     // total += i
    6       SUB R1, R2;     // i--
    7       JZ R3, Loop;    // Jump always
    Next:   // next instructions...
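The assembly program can be traced with a very small interpreter. This is a sketch under assumptions, not the deck's instruction set: the instruction encoding (a struct with an opcode and two operand fields), the absolute jump targets, and the OP_HALT opcode used to stop at Next are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 4

typedef enum { OP_MOV_IMM, OP_ADD, OP_SUB, OP_JZ, OP_HALT } Opcode;

/* a: destination/tested register; b: immediate, source register, or jump target */
typedef struct { Opcode op; uint8_t a; int16_t b; } Instr;

/* The sample program, with Loop at address 4 and Next at address 8. */
const Instr sum_program[] = {
    { OP_MOV_IMM, 0, 0  },   /* 0: MOV R0, #0    total = 0     */
    { OP_MOV_IMM, 1, 10 },   /* 1: MOV R1, #10   i = 10        */
    { OP_MOV_IMM, 2, 1  },   /* 2: MOV R2, #1    constant 1    */
    { OP_MOV_IMM, 3, 0  },   /* 3: MOV R3, #0    constant 0    */
    { OP_JZ,      1, 8  },   /* 4: JZ R1, Next   done if i = 0 */
    { OP_ADD,     0, 1  },   /* 5: ADD R0, R1    total += i    */
    { OP_SUB,     1, 2  },   /* 6: SUB R1, R2    i--           */
    { OP_JZ,      3, 4  },   /* 7: JZ R3, Loop   jump always   */
    { OP_HALT,    0, 0  },   /* 8: Next (hypothetical stop)    */
};

/* Fetch, decode and execute until OP_HALT or the end of the program. */
void iss_run(const Instr *program, size_t length, int16_t reg[NUM_REGS]) {
    size_t pc = 0;
    while (pc < length) {
        Instr ir = program[pc++];            /* fetch; PC now points to the next one */
        assert(ir.a < NUM_REGS);
        switch (ir.op) {                     /* decode and execute */
        case OP_MOV_IMM: reg[ir.a] = ir.b; break;
        case OP_ADD: assert(ir.b >= 0 && ir.b < NUM_REGS);
                     reg[ir.a] = (int16_t)(reg[ir.a] + reg[ir.b]); break;
        case OP_SUB: assert(ir.b >= 0 && ir.b < NUM_REGS);
                     reg[ir.a] = (int16_t)(reg[ir.a] - reg[ir.b]); break;
        case OP_JZ:  if (reg[ir.a] == 0) pc = (size_t)ir.b; break;
        case OP_HALT: return;
        }
    }
}
```

Running sum_program leaves 55 in R0 (10 + 9 + ... + 1) and 0 in R1. R3 is held at zero only so that JZ R3, Loop acts as an unconditional jump, since this instruction set has no plain jump.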
  • 74.
    Programmer Considerations
    ■ Program and data memory space
      – Embedded processors often very limited
        ■ e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
    ■ Registers: how many are there?
      – Only a direct concern for assembly-level programmers
    ■ I/O
      – How to communicate with external signals?
    ■ Interrupts
  • 75.
    Operating System
    ■ Optional software layer providing low-level services to a program (application)
      – File management, disk access
      – Keyboard/display interfacing
      – Scheduling multiple programs for execution
        ■ Or even just multiple threads from one program
      – Program makes system calls to the OS
  • 76.
    Development Environment
    ■ Development processor
      – The processor on which we write and debug our programs
        ■ Usually a PC
    ■ Target processor
      – The processor that the program will run on in our embedded system
        ■ Often different from the development processor
  • 77.
    Software Development Process
    ■ Compilers
      – Cross compiler
        ■ Runs on one processor, but generates code for another
    ■ Assemblers
    ■ Linkers
    ■ Debuggers
    ■ Profilers
  • 78.
    Running a Program
    ■ If the development processor is different from the target, how can we run our compiled code? Two options:
      – Download to target processor
      – Simulate
    ■ Simulation
      – One method: hardware description language
        ■ But slow, not always available
      – Another method: instruction set simulator (ISS)
        ■ Runs on development processor, but executes instructions of target processor
  • 79.
    Testing and Debugging
    ■ ISS
      – Gives us control over time: set breakpoints, look at register values, set values, step-by-step execution, ...
      – But doesn’t interact with the real environment
    ■ Download to board
      – Use device programmer
      – Runs in real environment, but not controllable
    ■ Compromise: emulator
      – Runs in real environment
      – Supports some controllability from the PC
  • 80.
    Application-Specific Instruction-Set Processors (ASIPs)
    ■ General-purpose processors
      – Sometimes too general to be effective in a demanding application
        ■ e.g., video processing: requires huge video buffers and operations on large arrays of data, inefficient on a GPP
      – But a single-purpose processor has high NRE and is not programmable
    ■ ASIPs: targeted to a particular domain
      – Contain architectural features specific to that domain
        ■ e.g., embedded control, digital signal processing, video processing, network processing, telecommunications, etc.
      – Still programmable
  • 81.
    A Common ASIP: Microcontroller
    ■ For embedded control applications
      – Reading sensors, setting actuators
      – Mostly dealing with events (bits): data is present, but not in huge amounts
      – e.g., VCR, disk drive, digital camera (assuming SPP for image compression), washing machine, microwave oven
    ■ Microcontroller features
      – On-chip peripherals
        ■ Timers, analog-digital converters, serial communication, etc.
        ■ Tightly integrated for programmer, typically part of register space
      – On-chip program and data memory
      – Direct programmer access to many of the chip’s pins
      – Specialized instructions for bit manipulation and other low-level operations
    ■ Incorporating peripherals and memory onto the same IC reduces the number of required ICs, giving compact and low-power implementations
  • 82.
    Another Common ASIP: Digital Signal Processors (DSP)
    ■ For signal processing applications
      – Large amounts of digitized data, often streaming
        ■ Source: e.g., a photo captured by a digital camera, a voice packet through a network router
      – Data transformations must be applied fast
      – e.g., cell-phone voice filter, digital TV, music synthesizer
    ■ DSP features
      – Several instruction execution units
      – Filtering, transforming vectors or matrices of data
      – Single-cycle multiply-accumulate instruction, among other instructions
      – Efficient vector operations, e.g., add two arrays
        ■ Vector ALUs, loop buffers, etc.
      – Contains a number of ADCs, DACs, PWMs, timers, counters, etc.
      – Commonly used DSPs are well supported in terms of compilers and other development tools, so they are easy and cheap to integrate into most embedded systems
  • 83.
    Less General ASIP Environments
    ■ ASIPs that are less general in nature
    ■ Designed to perform very domain-specific processing while allowing some degree of programmability
    ■ ASIPs designed for networking hardware may be designed to be programmable with different network routing algorithms, checksum, and packet processing protocols
  • 84.
    Trend: Even More Customized ASIPs
    ■ In the past, microprocessors were acquired as chips
    ■ Today, we increasingly acquire a processor as Intellectual Property (IP)
      – e.g., synthesizable VHDL model
    ■ Opportunity to add custom datapath hardware and a few custom instructions, or delete a few instructions
      – Can have significant performance, power and size impacts
      – Problem: need compiler/debugger for customized ASIP
        ■ Remember, most development uses structured languages
        ■ One solution: automatic compiler/debugger generation
          – e.g., www.tensillica.com
        ■ Another solution: re-targetable compilers
          – e.g., www.improvsys.com (customized VLIW architectures)
  • 85.
    Selecting a Microprocessor
    ■ Issues
      – Technical: speed, power, size, cost
      – Other: development environment, prior expertise, licensing, etc.
    ■ Speed: how to evaluate a processor’s speed?
      – Clock speed, but instructions per cycle may differ
      – Instructions per second, but work per instruction may differ
      – Dhrystone: synthetic benchmark, developed in 1984. Measured in Dhrystones/sec.
        ■ MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780). A.k.a. Dhrystone MIPS. Commonly used today.
          – So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second
      – SPEC: set of more realistic benchmarks, but oriented to desktops
      – EEMBC: EDN Embedded Benchmark Consortium, www.eembc.org
        ■ Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
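The Dhrystone MIPS conversion is a single multiplication or division by 1757. The helpers below are a small sketch of that arithmetic; the function and macro names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Dhrystones per second delivered by the 1-MIPS reference machine (VAX 11/780). */
#define DHRYSTONES_PER_SEC_PER_MIPS 1757UL

/* Dhrystone MIPS rating -> Dhrystones per second. */
unsigned long dhrystones_per_second(unsigned long dmips) {
    assert(dmips <= ULONG_MAX / DHRYSTONES_PER_SEC_PER_MIPS);  /* no overflow */
    return dmips * DHRYSTONES_PER_SEC_PER_MIPS;
}

/* Measured Dhrystones per second -> Dhrystone MIPS rating. */
double dhrystone_mips(double dhrystones_per_sec) {
    assert(dhrystones_per_sec >= 0.0);
    return dhrystones_per_sec / (double)DHRYSTONES_PER_SEC_PER_MIPS;
}
```

For example, dhrystones_per_second(750) gives 1,317,750, matching the figure on the slide.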
  • 86.
    General Purpose Processors
    Processor | Clock speed | Periph. | Bus Width | MIPS | Power | Trans. | Price
    General Purpose Processors
    Intel PIII | 1GHz | 2x16 K L1, 256K L2, MMX | 32 | ~900 | 97W | ~7M | $900
    IBM PowerPC 750X | 550 MHz | 2x32 K L1, 256K L2 | 32/64 | ~1300 | 5W | ~7M | $900
    MIPS R5000 | 250 MHz | 2x32 K 2 way set assoc. | 32/64 | NA | NA | 3.6M | NA
    StrongARM SA-110 | 233 MHz | None | 32 | 268 | 1W | 2.1M | NA
    Microcontroller
    Intel 8051 | 12 MHz | 4K ROM, 128 RAM, 32 I/O, Timer, UART | 8 | ~1 | ~0.2W | ~10K | $7
    Motorola 68HC811 | 3 MHz | 4K ROM, 192 RAM, 32 I/O, Timer, WDT, SPI | 8 | ~.5 | ~0.1W | ~10K | $5
    Digital Signal Processors
    TI C5416 | 160 MHz | 128K, SRAM, 3 T1 Ports, DMA, 13 ADC, 9 DAC | 16/32 | ~600 | NA | NA | $34
    Lucent DSP32C | 80 MHz | 16K Inst., 2K Data, Serial Ports, DMA | 32 | 40 | NA | NA | $75
  • 87.
    Chapter Summary
    ■ General-purpose processors
      – Good performance, low NRE, flexible
    ■ Controller, datapath, and memory
    ■ Structured languages prevail
      – But some assembly-level programming is still necessary
    ■ Many tools available
      – Including instruction-set simulators and in-circuit emulators
    ■ ASIPs
      – Microcontrollers, DSPs, network processors, more customized ASIPs
    ■ Choosing among processors is an important step
    ■ Designing a general-purpose processor is conceptually the same as designing a single-purpose processor
  • 88.
  • 89.
    1. An algorithm for matrix multiplication, assuming that we have one adder and one multiplier, follows:
    a. Convert the matrix multiplication algorithm into a state diagram.
    b. Rewrite the matrix multiplication algorithm given the assumption that we have 3 adders and 6 multipliers.
    c. If each multiplication takes 2 cycles to compute and each addition takes one cycle to compute, how many cycles does it take to complete the matrix multiplication given one adder and one multiplier? Three adders and six multipliers?
    d. If each adder requires 10 transistors to implement and each multiplier requires 100 transistors to implement, what is the total number of transistors to implement the matrix multiplication circuit using 1 adder and 1 multiplier? Three adders and six multipliers?
  • 90.
    int main(void) {
      int A[3][2] = { {1, 2}, {3, 4}, {5, 6} };
      int B[2][3] = { {7, 8, 9}, {10, 11, 12} };
      int C[3][3], i, j, k;
      for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {
          C[i][j] = 0;
          for (k = 0; k < 2; k++) {
            C[i][j] += A[i][k] * B[k][j];
          }
        }
      }
      return 0;
    }
  • 91.
  • 92.
    ■ Cycles to complete the matrix multiplication
      – 1 adder + 1 multiplier = 54 cycles
      – 3 adders + 6 multipliers = 9 cycles
    ■ Number of transistors
      – 1 adder + 1 multiplier = 110 transistors
      – 3 adders + 6 multipliers = 630 transistors
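One way to arrive at these figures is sketched below. The schedules are an assumption about how the answer was derived: the serial version executes the given loop as written (two multiplies and two accumulating adds per element of C), while the parallel version computes one row of C at a time, with six multiplies side by side followed by three adds side by side. The function names are hypothetical.

```c
#include <assert.h>

enum {
    ROWS = 3, COLS = 3, INNER = 2,        /* C is 3x3, each element sums 2 products */
    MUL_CYCLES = 2, ADD_CYCLES = 1,       /* latencies given in the exercise */
    ADDER_TRANSISTORS = 10, MULTIPLIER_TRANSISTORS = 100
};

/* 1 adder + 1 multiplier: every multiply and every add runs one after another. */
unsigned serial_cycles(void) {
    return ROWS * COLS * INNER * (MUL_CYCLES + ADD_CYCLES);
}

/* 3 adders + 6 multipliers: per row, all 6 products are formed together,
 * then the 3 sums are formed together (assumed schedule). */
unsigned parallel_cycles(unsigned adders, unsigned multipliers) {
    assert(adders == COLS && multipliers == COLS * INNER);
    return ROWS * (MUL_CYCLES + ADD_CYCLES);
}

/* Transistors spent on functional units only. */
unsigned transistor_count(unsigned adders, unsigned multipliers) {
    return adders * ADDER_TRANSISTORS + multipliers * MULTIPLIER_TRANSISTORS;
}
```

Under these assumptions the parallel design is six times faster (9 cycles against 54) for roughly six times the functional-unit transistors (630 against 110).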
  • 93.
    2. Design a single-purpose processor that outputs Fibonacci numbers up to n places. Start with a function computing the desired result, translate it into a state diagram, and sketch a probable datapath.
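A possible starting function for this exercise is sketched below. It assumes the sequence starts 1, 1, 2, 3, ... and that "up to n places" means the first n numbers; the names fibonacci, x1, x2 and count are illustrative and not taken from the worked solution.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the first n Fibonacci numbers (1, 1, 2, 3, 5, ...) into out[0..n-1]. */
void fibonacci(unsigned n, unsigned long long *out) {
    assert(n <= 90);                 /* keeps every intermediate sum within 64 bits */
    assert(out != NULL || n == 0);
    unsigned long long x1 = 1, x2 = 1;   /* two registers in the eventual datapath */
    for (unsigned count = 0; count < n; count++) {
        out[count] = x1;                 /* output the current number */
        unsigned long long next = x1 + x2;   /* needs a single adder */
        x1 = x2;
        x2 = next;
    }
}
```

Written this way, the loop body maps onto a datapath with two value registers, a counter, one adder and a comparator for count < n, which is a reasonable basis for the state diagram the exercise asks for.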
  • 94.
  • 95.
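    [Figure: datapath for the Fibonacci processor. Signal labels: c_ld, c_sel, x2_ld, x2_sel, count_lt_n, count_ne_0; multiplexer inputs labeled 1 and 0.]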