DSP architecture

DSP Architectures

Rensselaer at Hartford
ECSE 6620 - Fall 2001
Lecture 16
Jason M. Stripinis
jasonstripinis@engineer.com

Basic Processor Structure

• Here we see a very simple processor structure - such as
might be found in a small 8-bit microprocessor.
12 DEC 01 ECSE 6620 - Jason Stripinis

Basic Processor Functions
• ALU
– Arithmetic Logic Unit - this circuit takes two operands on the
inputs (labeled A and B) and produces a result on the output
(labeled Y).
– The operations will usually include, as a minimum:
• add, subtract
• and, or, not
• shift right, shift left
• ALUs in more complex processors will execute many more
instructions.


• Register File
– A set of storage locations (registers) for storing temporary results.
Early machines had just one register (accumulator). Modern RISC
processors will have at least 32 registers.
• Instruction Register
– The instruction currently being executed by the processor is stored
here.
• Control Unit
– The control unit decodes the instruction in the instruction register
and sets signals which control the operation of most other units of
the processor. For example, the operation code (opcode) in the
instruction determines the settings of the ALU control signals,
which select the operation it performs (add, subtract, AND, OR,
NOT, shift, etc.).

• Clock
– The vast majority of processors are synchronous, that is, they use a
clock signal to determine when to capture the next data word and
perform an operation on it. In a globally synchronous processor, a
common clock needs to be routed (connected) to every unit in the
processor.
• Program counter
– The program counter holds the memory address of the next
instruction to be executed. It is updated every instruction cycle to
point to the next instruction in the program. Branch instructions
load it with a target address instead of simply incrementing it.


• Memory Address Register
– This register is loaded with the address of the next data word to be
fetched from or stored into main memory.
• Address Bus
– Transfers addresses to memory and memory-mapped peripherals.
It is driven by the processor acting as a bus master.
• Data Bus
– Carries data to and from the processor, memory and peripherals. It
will be driven by the data source, i.e. processor, memory, etc.
• Multiplexed Bus
– To limit device pin counts and bus complexity, some processors
MUX address and data onto the same bus, with an adverse effect
on performance.

DSP Implementations
• DSP Algorithm
– Series of mathematical operations that are applied to process a
sequence of digital signals sampled from the real (analog) world
• Application examples
– Filtering
– FFT
– Noise cancellation
– Spectral Processing


Why is a special architecture good for
digital signal processing?
• DSPs are tailored to run DSP algorithms efficiently.
• Special functions to handle DSP algorithm demands:
– Unique data access patterns
• Streams of data requiring high bandwidth
• Low data repetition but high code repetition
– Math operation focus (“number cruncher”)
– Real-time constraints
– Power and size constraints
– Cost requirement
– Attention to numeric effects (limited fixed point error)


DSP Functional Characteristics
• Typically require a few specific operations
• Consider an FIR filter:
y[n] = h[0] x[n] + h[1] x[n-1] + ... + h[N-1] x[n-N+1]

This requires:
– additions & multiplications
– delays
– array handling
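
The three requirements can be seen together in a direct-form FIR filter. The following is a minimal Python sketch, not taken from the lecture; the names `fir_filter`, `taps`, and `delay_line` are illustrative assumptions.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k]."""
    # Delay line holds the most recent len(taps) inputs, newest first.
    delay_line = [0.0] * len(taps)
    output = []
    for x in samples:
        # Delay: shift older samples one place down, insert the newest.
        delay_line = [x] + delay_line[:-1]
        # Array handling plus multiply and add: one MAC per tap.
        acc = 0.0
        for h, d in zip(taps, delay_line):
            acc += h * d
        output.append(acc)
    return output


# Two-tap moving average of a short ramp.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```

The inner loop is what a DSP tries to execute at one tap per instruction cycle.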


DSP Typical Operations
• Additions & Multiplications
– fetch two operands
– perform the addition or multiplication (or both)
– store the result

• Delays
– store the result for later use

• Array Handling
– fetch values from consecutive memory locations
– copy data from register to register


DSP Typical Operations
• To perform these basic operations most DSPs:
– have a parallel multiply and add
– have multiple memory accesses (to fetch two operands and store the
result)
– have sufficient registers to hold data temporarily
– have efficient address generation for array handling
– have special features such as delays or circular addressing
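
Circular addressing lets a delay line be updated without copying data from location to location. A minimal Python sketch under assumed names (`CircularDelayLine`, `push`, `tap`); a real DSP would do the wrap-around in its address generator rather than in software.

```python
class CircularDelayLine:
    """Fixed-length delay line addressed modulo its length."""

    def __init__(self, length):
        self.buf = [0.0] * length
        self.head = 0  # index of the most recent sample

    def push(self, x):
        # Advance the pointer with wrap-around, then overwrite the oldest sample.
        self.head = (self.head + 1) % len(self.buf)
        self.buf[self.head] = x

    def tap(self, k):
        # Sample delayed by k steps (k = 0 is the newest).
        return self.buf[(self.head - k) % len(self.buf)]


line = CircularDelayLine(4)
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    line.push(x)
print([line.tap(k) for k in range(4)])  # [5.0, 4.0, 3.0, 2.0]
```

Only the pointer moves on each new sample; the stored data stays where it is.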


DSP Arithmetic Logic Unit
• Most DSP operations require additions and multiplications
together. So DSP processors usually have parallel
hardware adders and multipliers which can be used within a
single instruction (a multiply-accumulate, or MAC).
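
A multiply-accumulate step on a 16-bit fixed-point DSP can be modelled as follows. This is a sketch under assumptions not stated in the lecture: Q15 operands (16-bit signed fractions), a 32-bit saturating accumulator with no guard bits, and the hypothetical helper names `to_q15` and `mac_q15`.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def to_q15(value):
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, clamping at the ends."""
    return max(-32768, min(32767, int(round(value * 32768))))


def mac_q15(acc, a, b):
    """One MAC step: acc + a * b, saturated to a 32-bit accumulator.

    a and b are Q15 integers; the product and the accumulator are Q30.
    """
    total = acc + a * b
    return max(INT32_MIN, min(INT32_MAX, total))


# Dot product of two short Q15 vectors, one MAC per element.
xs = [to_q15(v) for v in (0.5, -0.25, 0.125)]
hs = [to_q15(v) for v in (0.5, 0.5, 0.5)]
acc = 0
for x, h in zip(xs, hs):
    acc = mac_q15(acc, x, h)
print(acc / 2**30)  # back to a real number: 0.1875
```

Keeping the full 32-bit product in the accumulator, rather than rounding each product back to 16 bits, is what limits the fixed-point error.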


Register Structure
• Delays require that intermediate values be held for later
use.
• For example, when keeping a running total, the total can
be kept within the processor to avoid wasting cycles on repeated
reads from and writes to memory.
• For this reason DSP processors have lots of registers which
can be used to hold intermediate values.
• Registers may be fixed-point or floating-point.


Memory Addressing
• Array handling requires that data can be fetched efficiently
from consecutive memory locations.
• For this reason DSP processors have address registers
which are used to hold addresses and can be used to
generate the next needed address efficiently.
• Usually, the next needed address can be generated during
the data fetch or store operation, and with no overhead.


Memory Addressing
• Example DSP address generation operations:

Instruction | Name | Description
*rP | register indirect | read the data pointed to by the address in register rP
*rP++ | postincrement | having read the data, postincrement the address pointer to point to the next value in the array
*rP-- | postdecrement | having read the data, postdecrement the address pointer to point to the previous value in the array
*rP++rI | register postincrement | having read the data, postincrement the address pointer by the amount held in register rI to point to rI values further down the array
*rP++rIr | bit reversed | having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit reversed order
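
The bit-reversed mode is the least obvious of these. The following minimal Python sketch shows the address sequence it produces for an 8-point FFT buffer; `bit_reverse` and the 3-bit width are illustrative assumptions, and a real DSP generates these addresses in hardware.

```python
def bit_reverse(index, bits):
    """Return index with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# Order in which an 8-point FFT buffer is visited with bit-reversed addressing.
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```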


Memory Architectures for DSP
• For arithmetic the DSP needs to fetch two operands in a
single instruction cycle.
• Since we also need to store the result and to read the
instruction itself, more than two memory accesses per
instruction cycle are needed.
• Even the simplest DSP operation - an addition involving
two operands and a store of the result to memory - requires
four memory accesses (three to fetch the two operands and
the instruction, plus a fourth to write the result).


• DSP processors usually support multiple memory accesses
in the same instruction cycle.
• It is not possible to access two different memory addresses
simultaneously over a single memory bus.
• There are two common methods to achieve multiple
memory accesses per instruction cycle:
• Harvard architecture
• modified von Neumann architecture


(Harvard Architecture)
• The Harvard architecture has two separate physical
memory buses, allowing two simultaneous memory
accesses.
• The true Harvard architecture dedicates one bus for
fetching instructions, with the other available to fetch
operands.
• This is inadequate for DSP operations, which usually
involve at least two operands. So DSP Harvard
architectures usually permit the 'program' bus to be used
also for access of operands.


• Note that it is often necessary to fetch three things - the
instruction plus two operands - and the Harvard
architecture is inadequate to support this.
• So DSP Harvard architectures often also include a cache
memory which can be used to store instructions which will
be reused, leaving both Harvard buses free for fetching
operands.
• The Harvard architecture plus cache is sometimes called
an extended Harvard architecture or Super Harvard
ARChitecture (SHARC).


• The Harvard architecture requires two memory buses. This
makes it expensive to bring off the chip - for example a
DSP using 32 bit words and with a 32 bit address space
requires at least 64 pins for each memory bus - a total of
128 pins if the Harvard architecture is brought off the chip.
This results in very large chips, which are difficult to
design into a circuit.


(von Neumann Architecture)
• The von Neumann architecture uses only a single memory
bus. This is relatively cheap, requiring fewer pins than the
Harvard architecture, and simple to use because the
programmer can place instructions or data anywhere
throughout the available memory.
• But it does not permit multiple simultaneous memory accesses.
• The modified von Neumann architecture allows multiple
memory accesses per instruction cycle by running the
memory clock faster than the instruction cycle.


(von Neumann Architecture)
• Each instruction cycle is divided into multiple 'machine
states' and a memory access can be made in each machine
state, permitting multiple memory accesses per
instruction cycle.
• The modified von Neumann architecture permits all the
memory accesses needed to support addition or
multiplication: fetch of the instruction; fetch of the two
operands; and storage of the result.


Why use a special architecture for
digital signal processing?
The Answers
Requirement | DSP answer
Unique data access patterns | Bit reversed addressing (FFT)
Streams of data requiring high bandwidth | Multiple access memory architecture
Low data repetition but high code repetition | Eliminate data cache (save $$)
Math operation focus | MAC instruction; vector processing unit
Real-time constraints | Zero-overhead loops
Power and size constraints | Limited additional functional units (unlike GPP)
Cost requirement | On-board peripherals (SOC)
Attention to numeric effects (limited fixed point error) | ALU with 16-bit operands and 32-bit result


DSP Generations
• 1st Generation (1979-1982)
– Transition from experimental signal processors
• 2nd Generation (1985-1986)
– Move from co-processor to stand-alone processor
• 3rd Generation (1987-1989)
– Major hardware improvements to speed
• 4th Generation (1990-1996)
– More on-chip integration (ADC, DAC, memory, multi-processor)
• 5th Generation (1997-)


DSP Generations
1st Generation (1979-1982)
• Primarily targeted at digital filtering
• Specialized co-processor for signal processing
• NMOS (n-channel Metal Oxide Semiconductor) fabrication

• 16-bit fixed point
• fast multiplier (and adder)
• Harvard architecture
• Specialized Instruction set


DSP Generations
• Example = Texas Instruments TMS32010
– 16-bit fixed point
– Harvard architecture
– two Address registers
– one A register (adder)
– one P register (multiplier)
– one T register (data shift on delay line)
– No zero-overhead loop
– Specialized Instruction set
– MAC Time 400 ns (<100 ns today)
– 50 ms per 1024-FFT



DSP Generations
2nd Generation (1985-1986)
• Move from co-processor to stand-alone processor
• CMOS (Complementary Metal Oxide Semiconductor) fabrication
• Double the speed of first generation

• Advances in memory architecture (more internal RAM)
• better pipelining of functional units
• address generators (bit-reversing)
• Zero-overhead loop HW
• Limited floating point in SW


DSP Generations
2nd Generation (1985-1986)
• Example = Texas Instruments TMS32020 (1985)
– Harvard architecture
– Improved TMS32010
– RPTS allows a pipelined instruction to be performed in a single cycle
– Specialized Instruction set
– MAC Time 200 ns
– 10 ms per 1024-FFT


DSP Generations
3rd Generation (1987-1989)
• Increased floating point support
– 32-bit floating point hardware DSPs released
– Floating point emulation on fixed point processors
– IEEE754 support
• Hardware enhancements (large speed increase)
– dense CMOS fabrication
– on chip DMA
– instruction caches
– increased clock rates (first cores above 10 MHz)
• Increased complexity of SW


DSP Generations
3rd Generation (1987-1989)
• Example = Motorola DSP56001 (1988)
– 24-bit data, instructions
– 3 memory spaces (P, X, Y)
– parallel moves
– circular addressing
– MAC Time 75 ns (21 ns today)
– ~3 ms per 1024-FFT
• Other Examples:
– AT&T DSP16A
– Analog Devices ADSP-2100
– TI TMS320C50


DSP Generations
4th Generation (1990-1996)
• Hardware integration
– ADC
– DAC
– more memory
– multiple DSPs on one chip
• Decreasing power consumption
– 5.0 VDC → 3.3 VDC → 3.0 VDC → 2.7 VDC
• GPPs start to get DSP functions
– SIMD
– Leads to Intel introducing MMX (MultiMedia eXtensions) for x86


DSP Generations
4th Generation (1990-1996)
• Example = TI TMS320C541 (1995)
– Enhanced architecture
– Low voltage (3.3 VDC)
– More on-chip memory
– Application specific functional units
– MAC Time 20 ns (10 ns today)
– ~1 ms per 1024-FFT

• Example = TI TMS320C80
– multiple processors per chip


The GPP Option
• High-performance general-purpose processors for PCs and
workstations are increasingly suitable for some DSP
applications.
• E.g., Intel MMX Pentium, Motorola/IBM PowerPC 604e
• These processors achieve excellent to outstanding floating
and/or fixed-point DSP performance via:
– Very high clock rates (200-500 MHz)
– Superscalar architectures
– Single-cycle multiplication and arithmetic operations
– Good memory bandwidth
– Branch prediction
– In some cases, single-instruction, multiple-data (SIMD) ops


DSP Generations
5th Generation (1997-)
• Not the classic DSP architectures
– SIMD (Single Instruction Multiple Data stream) instructions
– VLIW (Very Long Instruction Word) allows RISC processing
• High parallelism
• Increased clock speeds
• No longer application specific functional units (no MAC FU)
• Low voltage (2.5 VDC or less, even 1.2 VDC cores)
• MAC Time 3 ns (but can be power hungry)
• GPPs start to get DSP functions
– Intel introduces MMX (MultiMedia eXtensions) for x86 in 1997
• Increased integration
– MCU and DSP cores on same chip
– MCU functions/ports added to DSPs

DSP Generations
• SIMD (Single Instruction Multiple Data) instructions
– Enhance throughput by allowing parallelism
– Requires multiple functional units and wider buses
– May support multiple data widths (different functional groups)
– Example = DSP16000
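
The idea of one instruction acting on several data items can be modelled in software by packing two 16-bit lanes into one 32-bit word. This is a sketch only; `pack16x2`, `unpack16x2`, and `simd_add16x2` are hypothetical names, and the wrap-around (non-saturating) lanes are an assumption rather than the behaviour of any particular processor.

```python
MASK16 = 0xFFFF


def pack16x2(hi, lo):
    """Pack two unsigned 16-bit lanes into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16x2(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def simd_add16x2(a, b):
    """Add both lanes in one call; a carry out of one lane never reaches the other."""
    a_hi, a_lo = unpack16x2(a)
    b_hi, b_lo = unpack16x2(b)
    return pack16x2(a_hi + b_hi, a_lo + b_lo)


a = pack16x2(1000, 65535)
b = pack16x2(2000, 1)
print(unpack16x2(simd_add16x2(a, b)))  # (3000, 0): the low lane wraps on its own
```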


DSP Generations
• VLIW (Very Long Instruction Word)
– Instruction Level Parallelism (ILP) can be a major performance gain
• Superscalar implementation requires larger die and more power to
dynamically pipeline instructions
– VLIW can be used to statically pipeline instructions at compile time
(or even by hand!)
– VLIW instruction words have fixed "slots" for instructions that map to
the functional units available.
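
Static scheduling can be pictured as a compiler packing independent operations into the fixed slots of each instruction word. The following is a toy sketch, assuming a hypothetical machine with one slot per unit type ("L", "S", "M", "D") and operations given as (name, unit, dependencies) tuples; it is not the algorithm of any real compiler.

```python
def pack_vliw(ops, units=("L", "S", "M", "D")):
    """Greedy packing of operations into VLIW words.

    ops: list of (name, unit, deps), where deps names earlier operations
    whose results are needed. An operation may issue only in a word after
    the words holding all of its dependencies.
    """
    words = []      # each word maps unit -> operation name
    issued_in = {}  # operation name -> index of the word it was placed in
    for name, unit, deps in ops:
        if unit not in units:
            raise ValueError(f"no slot for unit {unit!r}")
        # Earliest word allowed by data dependencies.
        w = max((issued_in[d] + 1 for d in deps), default=0)
        # Skip words whose slot for this unit is already taken.
        while w < len(words) and unit in words[w]:
            w += 1
        while len(words) <= w:
            words.append({})
        words[w][unit] = name
        issued_in[name] = w
    return words


program = [
    ("load_x", "D", []),
    ("load_h", "D", []),
    ("mul", "M", ["load_x", "load_h"]),
    ("add", "L", ["mul"]),
    ("shift", "S", []),
]
for i, word in enumerate(pack_vliw(program)):
    # Empty slots are shown as "nop".
    print(i, [word.get(u, "nop") for u in ("L", "S", "M", "D")])
```

In this example most slots end up as "nop", which shows how unfilled instruction words inflate code size.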


DSP Generations
• VLIW Advantages
– huge theoretical payoff
• Less than 1 ns per MAC!
• Less than 75 ns per 1024-FFT

• VLIW Drawbacks
– Can be very difficult to program and debug
– High power consumption if VLIW is not filled
– Code size dramatically increases requiring more program memory


DSP Generations
• VLIW Example = TI TMS320C6201

32-bit Functional Units
– Lx = ALU
– Sx = Branching and shifting
– Mx = Multiplier
– Dx = Data Store


DSP Generational Development
• DSP processor performance has increased by a factor of
about 400 over the past 20 years
[Chart: MAC time (ns) by DSP generation, 1st Gen through 5th Gen; vertical axis 0 to 400 ns]

• DSP architectures will be increasingly specialized for
applications, especially communications applications
• General-purpose processors will become viable for many
DSP applications
