SECA1506- Digital Signal Processing
Semester -V
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
UNIT V
REALTIME DIGITAL SIGNAL
PROCESSING
ACOE343 - Embedded Real-Time Processor Systems
- Frederick University 2
What is a DSP?
• A specialized microprocessor for real-time
DSP applications
– Digital filtering (FIR and IIR)
– FFT
– Convolution, Matrix Multiplication etc
[Figure: signal path. Analog input → ADC → DSP → DAC → analog output; the DSP also has a digital input and a digital output.]
Common DSP features
• Harvard architecture
• Dedicated single-cycle Multiply-Accumulate (MAC)
instruction (hardware MAC units)
• Single-Instruction Multiple Data (SIMD) and Very Long
Instruction Word (VLIW) architectures
• Pipelining
• Saturation arithmetic
• Zero overhead looping
• Hardware circular addressing
• Cache
• DMA
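Single-Cycle MAC unit
[Figure: a multiplier forms the product a_i x_i; an adder sums it with the running total held in a register, and the result is written back to the register.]
$\sum_{i=0}^{n} a_i x_i$
• One product is accumulated per cycle, so a sum of n
products can be computed in n cycles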
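The multiply-accumulate loop can be sketched in software as follows. This is only an illustration, assuming one MAC per loop iteration; the function name `mac_dot` and the cycle counter are hypothetical and stand in for what the hardware does in a single instruction.

```python
def mac_dot(coeffs, samples):
    """Sum of products using one multiply-accumulate step per iteration."""
    acc = 0      # plays the role of the accumulator register
    cycles = 0   # hypothetical counter: one MAC is assumed to take one cycle
    for a, x in zip(coeffs, samples):
        acc += a * x   # multiply, then add into the accumulator
        cycles += 1
    return acc, cycles


result, cycles = mac_dot([1, 2, 3, 4], [5, 6, 7, 8])
print(result, cycles)  # 70 4
```

On a processor without a hardware MAC, the multiply and the add would typically be separate instructions, so the same loop would need more cycles per product.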
Single Instruction - Multiple Data
(SIMD)
• A technique for data-level parallelism by
employing a number of processing elements
working in parallel
Very Long Instruction Word (VLIW)
• A technique for instruction-
level parallelism by
executing instructions
without dependencies
(known at compile-time) in
parallel
• Example of a single VLIW
instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
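[Figure: the VLIW instruction word carries four operations (F=a+b, c=e/g, d=x&y, w=z*h); each is issued to its own processing unit (PU), which takes its two operands and produces its result.]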
PIPELINING
• It is a technique which allows two or more operations to
overlap during execution.
• DSP algorithms are repetitive making them suitable for
pipelining
• It ensures a steady flow of instructions to the CPU and
increases system performance.
• In a three-stage pipeline each instruction still takes three
clock cycles, but in each cycle the processor is working on up
to three different instructions.
• Pipelining affects the memory system: the number of
memory accesses per cycle increases with the number of stages.
• In Harvard architecture the separation of data and
instruction memory promotes pipelining.
• It also allows better utilisation of arithmetic unit.
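The benefit described above can be put into numbers with a small sketch. It assumes an ideal pipeline with no stalls, where the first instruction needs one cycle per stage and every later instruction completes one cycle after its predecessor; the function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """Ideal pipeline: fill once, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, k = 100, 3
speedup = cycles_unpipelined(n, k) / cycles_pipelined(n, k)
print(cycles_unpipelined(n, k), cycles_pipelined(n, k), round(speedup, 2))
```

For long instruction streams the speedup approaches the number of stages, which is why repetitive DSP loops profit so much from pipelining.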
Pipelining
• DSPs commonly feature deep pipelines
• TMS320C6x processors have 3 pipeline stages, each with a
number of phases (cycles):
– Fetch
• Program Address Generate (PG)
• Program Address Send (PS)
• Program ready wait (PW)
• Program receive (PR)
– Decode
• Dispatch (DP)
• Decode (DC)
– Execute
• 6 to 10 phases
Saturation Arithmetic
• fixed range for operations like addition and
multiplication
• normal overflow and underflow produce the
maximum and minimum allowed value, respectively
• Associativity and distributivity no longer apply
• Examples with 1-byte signed saturation arithmetic (range -128 to 127):
• 64 + 69 = 127
• -127 – 5 = -128
• (64 + 70) – 25 = 102 ≠ 64 + (70 – 25) = 109
(64 + 70 saturates to 127 before 25 is subtracted)
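A minimal sketch of saturating 8-bit arithmetic makes the loss of associativity easy to check. The helper names `sat_add` and `sat_sub` are hypothetical; real DSPs do this clamping in hardware.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate(value):
    """Clamp a result to the signed 8-bit range instead of wrapping around."""
    return max(INT8_MIN, min(INT8_MAX, value))


def sat_add(a, b):
    return saturate(a + b)


def sat_sub(a, b):
    return saturate(a - b)


print(sat_add(64, 69))               # 127 (133 clamps to the maximum)
print(sat_sub(-127, 5))              # -128 (-132 clamps to the minimum)
print(sat_sub(sat_add(64, 70), 25))  # 102: the inner sum saturates first
print(sat_add(64, sat_sub(70, 25)))  # 109: no intermediate saturation
```

The last two lines evaluate the same mathematical expression with different grouping and give different answers, which is why the order of operations matters in saturating code.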
Zero Overhead Looping
• Hardware support for loops with a constant
number of iterations using hardware loop
counters and loop buffers
• No branching
• No loop overhead
• No pipeline stalls or branch prediction
• No need for loop unrolling
Hardware Circular Addressing
• A data structure
implementing a fixed length
queue of fixed size objects
where objects are added to
the head of the queue while
items are removed from the
tail of the queue.
• Requires at least 2 pointers
(head and tail)
• Extensively used in digital
filtering
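$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$
[Figure: a four-entry circular buffer holding x[n], x[n-1], x[n-2], x[n-3], shown in two consecutive cycles (Cycle 1 and Cycle 2); the head and tail pointers advance by one position between cycles.]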
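The following sketch shows how a circular buffer serves the FIR equation above. It is an illustration under simplifying assumptions: the class name `CircularFIR` is hypothetical, the modulo operation stands in for the hardware address wrap-around, and a single index is used because the buffer is always full (the tail is implicitly the slot that gets overwritten next).

```python
class CircularFIR:
    """FIR filter whose delay line is a fixed-length circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)           # a_0 ... a_k
        self.buf = [0.0] * len(self.coeffs)  # x[n] ... x[n-k], initially zero
        self.head = 0                        # slot that receives the next sample

    def step(self, x):
        n = len(self.buf)
        self.buf[self.head] = x              # overwrite the oldest sample
        acc = 0.0
        idx = self.head
        for a in self.coeffs:
            acc += a * self.buf[idx]         # a_i * x[n-i]
            idx = (idx - 1) % n              # step back in time, wrapping around
        self.head = (self.head + 1) % n      # advance for the next sample
        return acc


fir = CircularFIR([1.0, 2.0, 3.0])
print([fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]])  # impulse response
```

No samples are ever moved in memory; only the index changes, which is the saving that hardware circular addressing provides for free.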
Direct Memory Access (DMA)
• The feature that allows peripherals to access main
memory without the intervention of the CPU
• Typically, the CPU initiates DMA transfer, does other
operations while the transfer is in progress, and
receives an interrupt from the DMA controller once
the operation is complete.
• Can create cache coherency problems (the data in
the cache may be different from the data in the
external memory after DMA)
• Requires a DMA controller
Cache memory
• Separate instruction and data L1 caches
(Harvard architecture)
• Cache coherence protocols required, since
most systems use DMA
DSP vs. Microcontroller
• DSP
– Harvard Architecture
– VLIW/SIMD (parallel
execution units)
– No bit level operations
– Hardware MACs
– DSP applications
• Microcontroller
– Mostly von Neumann
Architecture
– Single execution unit
– Flexible bit-level operations
– No hardware MACs
– Control applications
DIGITAL SIGNAL PROCESSORS
Leading Manufacturers:
1.Texas Instruments (TI)
2.Analog Devices
3.Motorola
Programmable DSP (PDSP):
General purpose microprocessors designed specifically for DSP
applications
Special architecture and instruction set to compute DSP
algorithms more efficiently.
TYPES OF PROGRAMMABLE DSP (PDSP)
1. General Purpose DSP
Basically high-speed microprocessors (µP) with architectures and
instruction sets optimized for DSP operations.
Fixed point: TMS320C5x, C54x, DSP563x
Floating point: TMS320C4x, C67xx, ADSP21xxx
2. Special Purpose DSP
:: Hardware designed for,
1. Specific DSP algorithms such as FFT
2. Specific applications – PCM & Filtering.
TEXAS DIGITAL SIGNAL PROCESSORS
Fixed point processors (16-bit):
TMS320C1x, C2x, C2xx
TMS320C50, C51, C53
Floating point processors (32-bit):
TMS320C3X
TMS320C4X
SELECTION OF DSP PROCESSORS
Architectural Features :
On-chip memory,Special instruction set,I/O capability & Large
memory.
Execution Speed :
MIPS & MFLOPS
Type of arithmetic :
Fixed point (cell phones & computer disk drives)
Floating point (wide dynamic range of values)
Word length :
Fixed point
16-bit :: TMS320C54x (telecommunications applications)
24-bit :: DSP56300 (high-quality audio applications)
Floating point
32-bit :: TMS320C3x, C4x (single-precision arithmetic)
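To make the fixed-point option more concrete, here is a small sketch of 16-bit fractional arithmetic. It assumes the Q15 format (1 sign bit, 15 fraction bits), a common convention on 16-bit fixed-point DSPs that the slides themselves do not name; all function names are hypothetical.

```python
Q15_ONE = 1 << 15                  # 32768 represents 1.0
Q15_MIN, Q15_MAX = -32768, 32767   # 16-bit signed range


def to_q15(x):
    """Convert a real number in [-1, 1) to a saturated Q15 integer."""
    return max(Q15_MIN, min(Q15_MAX, int(round(x * Q15_ONE))))


def from_q15(v):
    """Convert a Q15 integer back to a float."""
    return v / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 values: 32-bit product, shift back, saturate."""
    return max(Q15_MIN, min(Q15_MAX, (a * b) >> 15))


half, quarter = to_q15(0.5), to_q15(0.25)
print(half, quarter, from_q15(q15_mul(half, quarter)))  # 16384 8192 0.125
```

The narrow range (values must stay below 1.0) is the price of the cheaper hardware; floating-point devices avoid this scaling effort at higher cost and power.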
TYPICAL APPLICATIONS OF TEXAS FAMILY
C1x, C2x, C2xx, C5x, C54x : Toys, HDD, modems, cellular phones & active car suspensions.
C3x : Filters, analysers, hi-fi systems, voice mail, imaging, bar-code readers, motor control, 3D graphics.
C4x : Parallel-processing systems, image recognition, telecom routing.
C6x : Wireless base stations, pooled modems, multi-channel telephone systems.
C8x : Video telephony, 3D computer graphics, virtual reality & multimedia applications.
OTHER DSP APPLICATIONS
(cost increases from low end to high end)
High-end
● Wireless base station - TMS320C6000
● Cable modem
● Gateways
Mid-end
● Cellular phone - TMS320C540
● Fax/voice server
Low end
● Storage products - TMS320C27
● Digital camera - TMS320C5000
● Portable phones
● Wireless headsets
● Consumer audio
● Automobiles, toasters, thermostats, ...
FUNCTIONAL MODES OF DSP PROCESSORS
• Address generation unit (AGU)
• Arithmetic logic unit (ALU)
• Barrel shifter
• Floating-point unit (FPU)
• Back-side bus
• Multiplexer, de-multiplexer
• Registers
• Memory management unit (MMU)
• Translation lookaside buffer (TLB)
DSP ARCHITECTURE ENABLING TECHNOLOGIES
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970's | Discrete logic | Non-real-time processing, simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970's | Building block | Military radars, digital comm. | Single-chip bipolar multiplier; flash A/D
Early 1980's | Single-chip DSP µP | Telecom, control | µP architectures; NMOS/CMOS
Late 1980's | Function/application-specific chips | Computers, communication | Vector processing; parallel processing
Early 1990's | Multiprocessing | Video/image processing | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990's | Single-chip multiprocessing | Wireless telephony, Internet related | Low-power single-chip DSP; multiprocessing
COMMERCIAL DIGITAL SIGNAL PROCESSING DEVICES:
• Commercial DSP devices have been used in numerous applications, such as communication, control,
computers, instrumentation, and consumer electronics.
• The architectural features and the processing power of these devices have been
constantly upgraded based on advances in technology and on application
needs.
• Most of them have a Harvard architecture, a single-cycle hardware multiplier, an
address generation unit with dedicated address registers, special addressing
modes, and on-chip peripheral interfaces.
• The three most popular families are those from Texas Instruments, Motorola, and Analog
Devices.
• Texas Instruments was one of the first to come out with a commercial programmable DSP, with the
introduction of its TMS32010 in 1982.
FIRST GENERATION DSP µP
TMS32010 (TEXAS INSTRUMENTS) - 1982
FEATURES
• 200 ns instruction cycle (5 MIPS)
• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• Single-cycle 32-bit ALU/accumulator
• Single-cycle 16 x 16-bit multiply in 200 ns
• Two-cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels
[Figure: TMS320C203/LC203 block diagram (DSP core approach, 1995)]
THIRD GENERATION DSP µP CASE STUDY
TMS320C30 - 1988
FEATURES
• 60 ns single-cycle instruction execution time
– 33.3 MFLOPS (million floating-point operations per second)
– 16.7 MIPS (million instructions per second)
• One 4K x 32-bit single-cycle dual-access on-chip ROM block
• Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
• 64 x 32-bit instruction cache
• 32-bit instruction and data words, 24-bit addresses
• 40/32-bit floating-point/integer multiplier and ALU
• 32-bit barrel shifter
• Eight extended precision registers (accumulators)
• Two address generators with eight auxiliary registers and two
auxiliary register arithmetic units
• On-chip direct memory access (DMA) controller for concurrent
I/O and CPU operation
• Parallel ALU and multiplier instructions
• Block repeat capability
• Interlocked instructions for multiprocessing support
• Two serial ports to support 8/16/32-bit transfers
• Two 32-bit timers
• 1 µ CDMOS process
[Figure: TMS320C30 block diagram]
C54X ARCHITECTURE
#1: CPU designed for efficient DSP processing
• MAC unit, 2 accumulators, additional adder, barrel shifter
#2: Multiple busses for efficient data and program flow
• Four busses and large on-chip memory that result in sustained performance near peak
#3: Highly tuned instruction set for powerful DSP computing
• Sophisticated instructions that execute in fewer cycles, with less code and low power
demands
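KEY #1: DSP ENGINE
$y = \sum_{n=1}^{40} a_n x_n$
[Figure: inputs x and a feed the multiplier (MPY); its product feeds the adder (ADD), which accumulates the result y.]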
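KEY #1: MAC UNIT
MAC *AR2+, *AR3+, A
[Figure: MAC unit. A multiplier (MPY) with signed/unsigned (S/U) inputs selected from data, accumulator A, the temporary register, coefficient and program sources; a fractional mode bit; an adder (ADD) taking accumulator A, accumulator B or zero; results go to acc A or acc B.]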
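KEY #1: ACCUMULATORS + ADDER
General-purpose math example: t = s+e-r
    LD  @s, A
    ADD @e, A
    SUB @r, A
    STL A, @t
[Figure: the ALU takes its inputs through multiplexers from accumulators A and B, the shifter, the MAC unit, the T register and the C and D buses; its result goes to acc A or acc B.]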
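KEY #1: BARREL SHIFTER
Shift range: -16 to +31
    LD  @X, 16, A
    STH @B, Y
[Figure: the barrel shifter takes input from accumulators A and B and the C and D buses, and drives the ALU (S bus) and the E bus.]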
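KEY #1: TEMPORARY REGISTER
For example: A = xa
    LD  @x, T
    MPY @a, A
[Figure: the temporary register (T) is loaded from the D and X sources and the EXP encoder (fed by accumulators A and B); it drives the ALU and the MAC unit over the T bus.]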
KEY #2: EFFICIENT DATA/PROGRAM FLOW
KEY #2: MULTIPLE BUSSES
MAC *AR2+, *AR3+, A
[Figure: the P, C, D and E busses connect external memory, through multiplexers, to the central arithmetic logic unit (ALU, shifter, MAC, accumulators A and B, T register).]
KEY #2: PIPELINE
• P (Prefetch): Calculate address of instruction
• F (Fetch): Collect instruction
• D (Decode): Interpret instruction
• A (Access): Collect address of operand
• R (Read): Collect operand
• E (Execute): Perform operation
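KEY #2: BUS USAGE
[Figure: bus usage. The PC and control logic (CNTL) use the P bus, the auxiliary registers (ARs) address operands on the C, D and E busses, and multiplexers connect external memory to the central arithmetic logic unit (ALU, shifter, MAC, accumulators A and B, T register).]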
KEY #2: PIPELINE PERFORMANCE
CYCLES →
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 F4 D4 A4 R4 X4
            P5 F5 D5 A5 R5 X5
               P6 F6 D6 A6 R6 X6
Fully loaded pipeline: in the sixth cycle all six stages are busy (X = execute).
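The staggered chart can be generated programmatically, which also shows in which cycle the pipeline becomes fully loaded. This is a sketch assuming an ideal six-stage pipeline without stalls; the function name `pipeline_schedule` and the stage letters (taken from the chart) are used only for illustration.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]  # prefetch ... execute


def pipeline_schedule(n_instructions, stages=STAGES):
    """Map each cycle (1-based) to the stage/instruction pairs active in it."""
    schedule = {}
    for i in range(n_instructions):
        for s, name in enumerate(stages):
            cycle = i + s + 1  # instruction i+1 starts one cycle after instruction i
            schedule.setdefault(cycle, []).append(f"{name}{i + 1}")
    return schedule


sched = pipeline_schedule(6)
for cycle in sorted(sched):
    print(cycle, " ".join(sched[cycle]))
```

With six instructions, cycle 6 is the first in which every stage is occupied, and the last instruction leaves the pipeline in cycle 11.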
KEY #3: POWERFUL INSTRUCTIONS
KEY #3: ADVANCED APPLICATIONS
Application : Instruction(s)
Symmetric FIR filter : FIRS
Adaptive filtering : LMS
Polynomial evaluation : POLY
Code book search : STRCD, SACCD, SRCCD
Viterbi : DADST
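As an illustration of why a dedicated symmetric-FIR instruction pays off, the sketch below exploits coefficient symmetry (a_i = a_{N-1-i}) by adding each pair of samples before multiplying, which halves the number of multiplications. It is a software analogy, not a model of the FIRS instruction itself; the function names are hypothetical and an even number of taps is assumed.

```python
def fir_direct(coeffs, window):
    """Plain sum of products: one multiply per tap."""
    return sum(a * x for a, x in zip(coeffs, window))


def fir_symmetric(half_coeffs, window):
    """Symmetric FIR: pre-add mirrored samples, one multiply per coefficient pair.

    Assumes len(window) == 2 * len(half_coeffs) and coefficients that mirror
    around the centre of the filter.
    """
    n = len(window)
    acc = 0
    for i, a in enumerate(half_coeffs):
        acc += a * (window[i] + window[n - 1 - i])
    return acc


coeffs = [1, 2, 3, 3, 2, 1]   # symmetric taps
window = [4, 5, 6, 7, 8, 9]   # current delay-line contents
print(fir_direct(coeffs, window), fir_symmetric(coeffs[:3], window))  # 78 78
```

Both functions give the same output, but the symmetric version needs three multiplications instead of six for this six-tap example.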
C62X ARCHITECTURE - TMS320C6201 REVISION 2
[Block diagram: C6201 CPU megamodule]
• CPU: program fetch, instruction dispatch, instruction decode; control registers, control logic, emulation, test, interrupts
• Data path 1: A register file with units L1, S1, M1, D1
• Data path 2: B register file with units D2, M2, S2, L2
• Program cache / program memory: 32-bit address, 256-bit data, 512K bits RAM
• Data memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
• Peripherals: external memory interface, 4-channel DMA, host port interface, 2 timers,
2 multi-channel buffered serial ports (T1/E1), power-down logic
C6201 INTERNAL MEMORY ARCHITECTURE
• Separate internal program and data spaces
• Program
– 16K 32-bit instructions (2K fetch packets)
– 256-bit fetch width
– Configurable as either direct-mapped cache or memory-mapped program memory
• Data
– 32K x 16
– Single-ported, accessible by both CPU data buses
– 4 x 8K 16-bit banks
– 2 possible simultaneous memory accesses (4 banks)
– 4-way interleave; banks and interleave minimize access conflicts
C6000 PIPELINE OPERATION BENEFITS
• Cycle time
– Allows 6 ns cycle time on C67x
– Allows 5 ns cycle time & single-cycle execution on C62x
• Parallelism
– 8 new instructions can always be dispatched every cycle
• High-performance internal memory access
– Pipelined program and data accesses
– Two 32-bit data accesses/cycle (C62x)
– Two 64-bit data accesses/cycle (C67x)
– 256-bit program access/cycle
• Good compiler target
– Visible: no variable-length pipeline flow
– Deterministic: order and time of execution
– Orthogonal: independent instructions
C67X ARCHITECTURE - TMS320C6701 DSP BLOCK DIAGRAM
[Block diagram: ’C67x floating-point CPU core]
• CPU: program fetch, instruction dispatch, instruction decode; control registers, control logic, emulation, test, interrupts
• Data path 1: A register file with units L1, S1, M1, D1
• Data path 2: B register file with units D2, M2, S2, L2
• Program cache / program memory: 32-bit address, 256-bit data, 512K bits RAM
• Data memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
• Peripherals: external memory interface, 4-channel DMA, host port interface, 2 timers,
2 multi-channel buffered serial ports (T1/E1), power-down logic
TMS320C6701
ADVANCED VLIW CPU (VELOCITI™)
• 1 GFLOPS @ 167 MHz
– 6-ns cycle time
– 6 x 32-bit floating-point instructions/cycle
• Load-store architecture
• 3.3-V I/Os, 1.8-V internal
• Single- and double-precision IEEE floating-point
• Dual data paths
– 6 floating-point units / 8 x 32-bit instructions
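The headline figures above follow from simple arithmetic, sketched here. The function names are hypothetical, and the calculation assumes the peak case in which all six floating-point units issue an operation in every cycle.

```python
def peak_ops_per_second(ops_per_cycle, clock_hz):
    """Peak throughput when every unit is busy in every cycle."""
    return ops_per_cycle * clock_hz


def cycle_time_ns(clock_hz):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


clock = 167e6  # 167 MHz
print(peak_ops_per_second(6, clock) / 1e9)  # about 1.0 GFLOPS
print(cycle_time_ns(clock))                 # about 6 ns
```

Sustained performance is normally lower than this peak, because real code rarely keeps all units busy in every cycle.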
TMS320C6701 - MEMORY / PERIPHERALS
• Same as ’C6201
• External interface supports
– SDRAM, SRAM, SBSRAM
• 4-channel boot-loading DMA
• 16-bit host port interface
• 1 Mbit on-chip SRAM
• 2 multichannel buffered serial ports (T1/E1)
• Pin compatible with ’C6201
’C67X AND ’C62X COMMONALITY
• Driving commonality between ’C67x & ’C62x shortens ’C67x design time.
• Maintaining symmetry between data paths shortens the ’C67x design time.
[Figure: ’C62x CPU. Program fetch & dispatch, decode, control registers, emulation, and two register files; each data path has an L-unit (arithmetic logic unit), an S-unit (auxiliary logic unit), an M-unit (multiplier unit) and a D-unit (data load/store).]
[Figure: ’C67x CPU. Same structure as the ’C62x CPU, with floating point added to the L-, S- and M-units; the D-units (data load/store) are unchanged.]
TEXAS DIGITAL SIGNAL PROCESSORS
C6x Devices :
An advanced VLIW architecture that can execute 1600 MIPS.
C8x Devices :
On a single piece of silicon, a number of advanced DSPs (ADSP) and a
RISC master processor.
DIGITAL SIGNAL PROCESSORS
Leading Manufacturers:
1.Texas Instruments (TI)
2.Analog Devices
3.Motorola
Programmable DSP (PDSP):
General purpose microprocessors designed specifically for DSP
applications
Special architecture and instruction set to compute DSP algorithms more
efficiently.
TYPES OF PROGRAMMABLE DSP
(PDSP)
1. General Purpose DSP
Basically high speed MP with architecture and instruction sets
optimized for DSP operations.
g. : Fixed Point : TMS320C5x, C54x, DSP563x
Floating Point: TMS320C4x, C67xx, ADSP21xxx
2. Special Purpose DSP
:: Hardware designed for,
1. Specific DSP algorithms such as FFT
2. Specific applications – PCM & Filtering.
Fixed point processors:( 16 – bit )
TMS320C1x, C2x, C2xx
TMS320C50, C51, C53
TEXAS DIGITAL SIGNAL
PROCESSORS
Floating point processors:( 32 – bit )
TMS320C3X
TMS320C4X
Yg hvuihbijbh itf ygcinbjbiojbfhuujh.ppt
Yg hvuihbijbh itf ygcinbjbiojbfhuujh.ppt
Yg hvuihbijbh itf ygcinbjbiojbfhuujh.ppt

Yg hvuihbijbh itf ygcinbjbiojbfhuujh.ppt

  • 1.
    SECA1506- Digital SignalProcessing Semester -V DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING UNIT V REALTIME DIGITAL SIGNAL PROCESSING
  • 2.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 2 What is a DSP? • A specialized microprocessor for real-time DSP applications – Digital filtering (FIR and IIR) – FFT – Convolution, Matrix Multiplication etc ADC DAC DSP ANALOG INPUT ANALOG OUTPUT DIGITAL INPUT DIGITAL OUTPUT
  • 11.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 11 Common DSP features • Harvard architecture • Dedicated single-cycle Multiply-Accumulate (MAC) instruction (hardware MAC units) • Single-Instruction Multiple Data (SIMD) Very Large Instruction Word (VLIW) architecture • Pipelining • Saturation arithmetic • Zero overhead looping • Hardware circular addressing • Cache • DMA
  • 13.
    13 Single-Cycle MAC unit Multiplier Adder Register ax i i a x i i a x i-1 i-1 a x i i a x i-1 i-1 + Σ(a x ) i i i=0 n Can compute a sum of n- products in n cycles
  • 14.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 14 Single Instruction - Multiple Data (SIMD) • A technique for data-level parallelism by employing a number of processing elements working in parallel
  • 15.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 15 Very Long Instruction Word (VLIW) • A technique for instruction- level parallelism by executing instructions without dependencies (known at compile-time) in parallel • Example of a single VLIW instruction: F=a+b; c=e/g; d=x&y; w=z*h; VLIW instruction F=a+b c=e/g d=x&y w=z*h PU PU PU PU a b F c d w e g x y z h
  • 16.
    PIPELINING • It isa technique which allows two or more operations to overlap during execution. • DSP algorithms are repetitive making them suitable for pipelining • It ensures a steady flow of instructions to the CPU and increases system performance. • In pipelining each instruction still takes three clock cycles but at each cycle the processor is executing up to three different instructions. • It has an impact upon the system memory . The no.of memory accesses increases by the no.of stages. • In Harvard architecture the separation of data and instruction memory promotes pipelining. • It also allows better utilisation of arithmetic unit.
  • 17.
    17 Pipelining •DSPs commonly featuredeep pipelines •TMS320C6x processors have 3 pipeline stages with a number of phases (cycles): – Fetch • Program Address Generate (PG) • Program Address Send (PS) • Program ready wait (PW) • Program receive (PR) – Decode • Dispatch (DP) • Decode (DC) – Execute • 6 to 10 phases
  • 18.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 18 Saturation Arithmetic • fixed range for operations like addition and multiplication • normal overflow and underflow produce the maximum and minimum allowed value, respectively • Associativity and distributivity no longer apply • 1 signed byte saturation arithmetic examples: • 64 + 69 = 127 • -127 – 5 = -128 • (64 + 70) – 25 = 122 ≠ 64 + (70 -25) = 109
  • 19.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 19 Zero Overhead Looping • Hardware support for loops with a constant number of iterations using hardware loop counters and loop buffers • No branching • No loop overhead • No pipeline stalls or branch prediction • No need for loop unrolling
  • 20.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 20 Hardware Circular Addressing • A data structure implementing a fixed length queue of fixed size objects where objects are added to the head of the queue while items are removed from the tail of the queue. • Requires at least 2 pointers (head and tail) • Extensively used in digital filtering y[n] = a0x[n]+a1x[n-1]+…+akx[n-k] X[n] X[n-1] X[n-2] X[n-3] X[n] X[n-1] X[n-2] X[n-3] Head Tail Cycle1 Cycle2
  • 21.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 21 Direct Memory Access (DMA) • The feature that allows peripherals to access main memory without the intervention of the CPU • Typically, the CPU initiates DMA transfer, does other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation is complete. • Can create cache coherency problems (the data in the cache may be different from the data in the external memory after DMA) • Requires a DMA controller
  • 22.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 22 Cache memory • Separate instruction and data L1 caches (Harvard architecture) • Cache coherence protocols required, since most systems use DMA
  • 23.
    ACOE343 - EmbeddedReal-Time Processor Systems - Frederick University 23 DSP vs. Microcontroller • DSP – Harvard Architecture – VLIW/SIMD (parallel execution units) – No bit level operations – Hardware MACs – DSP applications • Microcontroller – Mostly von Neumann Architecture – Single execution unit – Flexible bit-level operations – No hardware MACs – Control applications
  • 24.
    DIGITAL SIGNAL PROCESSORS LeadingManufacturers: 1.Texas Instruments (TI) 2.Analog Devices 3.Motorola Programmable DSP (PDSP): General purpose microprocessors designed specifically for DSP applications Special architecture and instruction set to compute DSP algorithms more efficiently.
  • 25.
    TYPES OF PROGRAMMABLEDSP (PDSP) 1. General Purpose DSP Basically high speed MP with architecture and instruction sets optimized for DSP operations. Fixed Point : TMS320C5x, C54x, DSP563x Floating Point: TMS320C4x, C67xx, ADSP21xxx 2. Special Purpose DSP :: Hardware designed for, 1. Specific DSP algorithms such as FFT 2. Specific applications – PCM & Filtering.
  • 26.
    Fixed point processors:(16 – bit ) TMS320C1x, C2x, C2xx TMS320C50, C51, C53 TEXAS DIGITAL SIGNAL PROCESSORS Floating point processors:( 32 – bit ) TMS320C3X TMS320C4X
  • 27.
    SELECTION OF DSPPROCESSORS Architectural Features : On-chip memory,Special instruction set,I/O capability & Large memory. Execution Speed : MIPS & MFLOPS Type of arithmetic : Fixed point (Cell phone & Computer disk drives) Floating point (Wide & dynamic range of values)
  • 28.
    SELECTION OF DSPPROCESSORS Word length : Fixed point 16 - bit :: 24 - bit :: TMS320C54x Telecommunications applications DSP56300 High quality audio applications Floating point 32 - bit :: TMS320C3x, C4x Single - precision arithmetic
  • 29.
    TYPICALAPPLICATIONS OF TEXASFAMILY C1x,C2x,C3xx,C5x,C54x : Toys, HDD, Modems, Cellular phones & Active car suspensions. C3x : Bar- Filters, Analysers, Hi - Fi Systems, Voice mail, imaging, code readers, motor control, 3D Graphics. C4x : Parallel-processing Systems, Image recognition telecom routing.
  • 30.
    TYPICALAPPLICATIONS OF TEXASFAMILY C6x : Wireless base stations,pooled modems, Multi channel telephone systems. C8x : Video telephony, 3D computer graphics, Virtual reality & multimedia applications.
  • 31.
    High-end ●Wireless Base Station- TMS320C6000 ●Cable modem ●gateways Mid-end ● Cellular phone - TMS320C540 ● Fax/ voice server Low end ● Storage products - TMS320C27 ● Digital camera - TMS320C5000 ● Portable phones ● Wireless headsets ● Consumer audio ● Automobiles, toasters, thermostats, ... OTHER DSPAPPLICATIONS Increasing Cost
  • 32.
            Address generation unit(AGU) Arithmetic logic unit (ALU) Barrel shifter. Floating-point unit (FPU) Back-side bus. Multiplexer, De-multiplexer. Registers. Memory management unit (MMU) Translation look aside buffer (TLB) FUNCTIONAL MODES OF DSP PROCESSORS
  • 33.
    DSPARCHITECTURE ENABLING TECHNOLOGIES Time FrameApproach Primary Application Enabling Technologies Early 1970’s Discrete logic  Bipolar SSI, MSI  FFT algorithm Late 1970’s Building block  Non-real time procesing  Simulation  Military radars  Digital Comm. Early 1980’s Single Chip DSP P  Telecom  Control  Single chip bipolar multiplier  Flash A/D  P architectures  NMOS/CMOS Late 1980’s Function/Application specific chips  Computers  Communication  Vector processing  Parallel processing Early 1990’s Multiprocessing  Video/Image Processing  Advanced multiprocessing  VLIW, MIMD, etc. Late 1990’s Single-chip multiprocessing  Wireless telephony  Internet related  Low power single-chip DSP  Multiprocessing
  • 34.
     They havebeen used in numerous applications, such as communication, control, computers, Instrumentation, and consumer electronics.  The architectural features and the processing power of these devices have been constantly upgraded based on the advances in technology and the application needs.  Most of them have Harvard architecture, a single-cycle hardware multiplier, an address generation unit with dedicated address registers, special addressing modes, on-chip peripherals interfaces.   Three most popular ones are those from Texas Instruments, Motorola, and Analog Devices. Texas Instruments was one of the first to come out with a commercial programmable DSP with the introduction of its TMS32010 in 1982. COMMERCIAL DIGITAL SIGNAL PROCESSING DEVICES:
  • 35.
    FIRST GENERATION DSPP TMS32010 (TEXAS INSTRUMENTS) - 1982 FEATURES o o o o o o o 200 ns instruction cycle (5 MIPS) 144 words (16 bit) on-chip data RAM 1.5K words (16 bit) on-chip program ROM - TMS32010 External program memory expansion to a total of 4K words at full speed 16-bit instruction/data word single cycle 32-bit ALU/accumulator Single cycle 16 x 16-bit multiply in 200 ns Two cycle MAC (5 MOPS) Zero to 15-bit barrel shifter Eight input and eight output channels
  • 36.
  • 37.
    THIRD GENERATION DSPP CASE STUDY TMS320C30 - 1988 FEATURES o 60 ns single-cycle instruction execution time o o 33.3 MFLOPS (million floating-point operations per second) 16.7 MIPS (million instructions per second) o o o o o One 4K x 32-bit single-cycle dual-access on-chip ROM block Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks 64 x 32-bit instruction cache 32-bit instruction and data words, 24-bit addresses 40/32-bit floating-point/integer multiplier and ALU 32-bit barrel shifter
  • 38.
    o o o o o o o Eight extended precisionregisters (accumulators) Two address generators with eight auxiliary registers and two auxiliary register arithmetic units On-chip direct memory Access (DMA) controller for concurrent I/O and CPU operation Parallel ALU and multiplier instructions Block repeat capability Interlocked instructions for multiprocessing support Two serial ports to support 8/16/32-bit transfers Two 32-bit timers 1  CDMOS Process THIRD GENERATION DSP P CASE STUDY TMS320C30 - 1988
  • 39.
  • 40.
    C54X ARCHITECTURE #1: CPUdesigned for efficient DSP processing MAC unit, 2 Accumulators, Additional Adder, Barrel Shifter #2: Multiple busses for efficient data and program flow Four busses and large on-chip memory that result in sustained performance near peak #3: Highly tuned instruction set for powerful DSP computing Sophisticated instructions that execute in fewer cycles, with less code and low power demands
  • 41.
    KEY #1: DSPENGINE MPY Y 40 =  n = 1 x a an * xn ADD y
  • 42.
    KEY #1: MACUNIT MPY ADD MAC *AR2+, *AR3+, A acc A acc B Fractional Mode Bit A B O Data Acc A Temp Coeff Prgm Data Acc A S/U S/U
  • 43.
    KEY #1: ACCUMULATORS+ ADDER LD @s, A ADD @e, A SUB @r, A STL A, @t General-Purpose Math example: t = s+e-r A BusB BusA B C T D Shifter MUX A B MAC acc A acc B U Bus ALU
  • 44.
    KEY #1: BARRELSHIFTER Barrel Shifter (-16-+31) S Bus ALU E Bus LD STH @X, 16, A @B, Y A B C D
  • 45.
    KEY #1: TEMPORARYREGISTER Temporary Register ALU MAC T Bus EXP Encoder A B D X For example: A = xa LD MPY @x, T @a, A
  • 46.
    KEY #2: EFFICIENTDATA/PROGRAM FLOW #1: CPU designed for efficient DSP processing  MAC unit, 2 Accumulators, Additional Adder, Barrel Shifter #2: Multiple busses for efficient data and program flow  Four busses and large on-chip memory that result in sustained performance near peak #3: Highly tuned instruction set for powerful DSP computing  Sophisticated instructions that execute in fewer cycles, with less code and low power demands
  • 47.
    KEY #2: MULTIPLEBUSSES MAC *AR2+, *AR3+, A Central Arithmetic Logic Unit EXTERNAL MEMORY M U X M U X E S P D E C C D M ALU SHIFTER B T MAC A
  • 48.
    o 27 P Prefetch F Fetch D AR Decode Access Read E Execute KEY #2: PIPELINE o o o o o Prefetch: Calculate address of instruction Fetch: Collect instruction Decode: Interpret instruction Access: Collect address of operand Read: Collect operand Execute: Perform operation
  • 49.
    KEY #2: BUSUSAGE Central Arithmetic Logic Unit EXTERNAL MEMORY M U X M U X E S P ALU SHIFTER B T MAC A PC CNTL E C D ARs
  • 50.
    KEY #2: PIPELINEPERFORMANCE P3 F3 CYCLES P1 F1 D1 A1 R1 X1 A2 R2 X2 D3 A3 R3 P4 F4 D4 A4 P5 F5 D5 P6 F6 P2 F2 D2 X3 D6 A6 R4 X4 A5 R5 X5 R6 X6 Fully loaded pipeline
  • 51.
    KEY #3: POWERFULINSTRUCTIONS #1: CPU designed for efficient DSP processing  MAC Unit, 2 Accumulators, Additional Adder,Barrel Shifter #2: Multiple busses for efficient data and program flow  Four busses and large on-chip memory that result in sustained performance near peak #3: Highly tuned instruction set for powerful DSP computing  Sophisticated instructions that execute in fewer cycles, with less code and low power demands
  • 52.
    KEY #3: ADVANCEDAPPLICATIONS Symmetric FIR filter Adaptive filtering Polynomial evaluation Code book search Viterbi FIRS LMS POLY STRCD SACC D SRCC D DADST
  • 53.
    C62X ARCHITECTURE-TMS320 C6201 REVISION2 C6201 CPU Megamodule L1 S1 M1 D1 A Register File D2 M2 S2 L2 B Register File Instruction Dispatch Program Fetch Interrupts Control Registers Control Logic Emulation Test Ext. Memory Interface 4-DMA Program Cache / Program Memory 32-bit address, 256-Bit data512K Bits RAM Host Port Interface 2 Timers 2 Multi-channel buffered serial ports (T1/E1) Data Memory 32-Bit address, 8-, 16-, 32-Bit data 512K Bits RAM Pwr Dwn Instruction Decode Data Path 1 Data Path 2
  • 54.
    C6201 INTERNAL MEMORYARCHITECTURE o o Separate Internal Program and Data Spaces Program o o o 16K 32-bit instructions (2K Fetch Packets) 256-bit Fetch Width Configurable as either o Direct Mapped Cache, Memory Mapped Program Memory o Dat ao o o 32K x 16 Single Ported Accessible by Both CPU Data Buses 4 x 8K 16-bit Banks o o 2 Possible Simultaneous Memory Accesses (4 Banks) 4-Way Interleave, Banks and Interleave Minimize Access Conflicts
  • 55.
    C6000 PIPELINE OPERATION BENEFITS
    o Cycle time
        o Allows 6 ns cycle time on C67x
        o Allows 5 ns cycle time and single-cycle execution on C62x
    o Parallelism
        o 8 new instructions can always be dispatched every cycle
    o High-performance internal memory access
        o Pipelined program and data accesses
        o Two 32-bit data accesses/cycle (C62x)
        o Two 64-bit data accesses/cycle (C67x)
        o 256-bit program access/cycle
    o Good compiler target
        o Visible: no variable-length pipeline flow
        o Deterministic: order and time of execution
        o Orthogonal: independent instructions
  • 56.
    C67X ARCHITECTURE - TMS320C6701 DSP BLOCK DIAGRAM
    [Figure: block diagram of the C6701. Labelled blocks:]
    o ’C67x Floating-Point CPU Core
        o Program Fetch, Instruction Dispatch, Instruction Decode
        o Data Path 1: A Register File; units L1, S1, M1, D1
        o Data Path 2: B Register File; units L2, S2, M2, D2
        o Control Registers, Control Logic, Emulation, Test, Interrupts
    o Program Cache / Program Memory: 32-bit address, 256-bit data, 512K bits RAM
    o Data Memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
    o External Memory Interface
    o 4-channel DMA
    o Host Port Interface
    o 2 Timers
    o 2 multichannel buffered serial ports (T1/E1)
    o Power Down
  • 57.
    TMS320C6701 ADVANCED VLIW CPU (VELOCITI™)
    o 1 GFLOPS @ 167 MHz
        o 6-ns cycle time
        o 6 x 32-bit floating-point instructions/cycle
    o Load-store architecture
    o 3.3-V I/Os, 1.8-V internal
    o Single- and double-precision IEEE floating-point
    o Dual data paths
        o 6 floating-point units / 8 x 32-bit instructions
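The headline figures on this slide follow from the cycle time and the number of operations issued per cycle. The sketch below redoes that arithmetic; these are peak numbers that assume every issue slot is filled on every cycle, and the helper names are made up for illustration.

```python
def clock_mhz(cycle_time_ns):
    """Clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_time_ns


def peak_rate_millions(cycle_time_ns, ops_per_cycle):
    """Peak operations per second, in millions (MIPS or MFLOPS)."""
    return clock_mhz(cycle_time_ns) * ops_per_cycle


if __name__ == "__main__":
    # C6701: 6 ns cycle, 6 floating-point instructions per cycle.
    print(f"C6701 clock: {clock_mhz(6):.0f} MHz")
    print(f"C6701 peak:  {peak_rate_millions(6, 6):.0f} MFLOPS")
    # C62x: 5 ns cycle and 8 instructions per cycle, as quoted elsewhere in these slides.
    print(f"C62x peak:   {peak_rate_millions(5, 8):.0f} MIPS")
```

A 6 ns cycle corresponds to about 167 MHz, and six floating-point instructions per cycle give roughly 1000 MFLOPS, which matches the quoted 1 GFLOPS. The same calculation with a 5 ns cycle and eight instructions per cycle gives 1600 MIPS.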
  • 58.
    TMS320C6701 MEMORY / PERIPHERALS
    o Same as ’C6201
    o External interface supports
        o SDRAM, SRAM, SBSRAM
    o 4-channel boot-loading DMA
    o 16-bit host port interface
    o 1 Mbit on-chip SRAM
    o 2 multichannel buffered serial ports (T1/E1)
    o Pin compatible with ’C6201
  • 59.
    ’C67X AND ’C62X COMMONALITY
    o Driving commonality between ’C67x & ’C62x shortens ’C67x design time.
    o Maintaining symmetry between data paths shortens the ’C67x design time.
    [Figure: side-by-side block diagrams of the ’C62x CPU and the ’C67x CPU. Both contain Program Fetch & Dispatch, Decode, Control Registers, Emulation, and two register files. Functional units:]
    Unit            ’C62x CPU                ’C67x CPU
    L-Unit 1, 2     Arithmetic Logic Unit    Arithmetic Logic Unit with Floating Point
    S-Unit 1, 2     Auxiliary Logic Unit     Auxiliary Logic Unit with Floating Point
    M-Unit 1, 2     Multiplier Unit          Multiplier Unit with Floating Point
    D-Unit 1, 2     Data Load/Store          Data Load/Store
  • 60.
    TEXAS DIGITAL SIGNAL PROCESSORS
    C6x Devices: An advanced VLIW architecture that can execute 1600 MIPS.
    C8x Devices: On a single piece of silicon, a number of advanced DSPs (ADSP) and a RISC master processor.
  • 61.
    DIGITAL SIGNAL PROCESSORS
    Leading manufacturers:
    1. Texas Instruments (TI)
    2. Analog Devices
    3. Motorola
    Programmable DSP (PDSP): General-purpose microprocessors designed specifically for DSP applications, with a special architecture and instruction set to compute DSP algorithms more efficiently.
  • 62.
    TYPES OF PROGRAMMABLE DSP (PDSP)
    1. General Purpose DSP
       Basically high-speed microprocessors with architecture and instruction sets optimized for DSP operations.
       E.g.: Fixed point: TMS320C5x, C54x, DSP563x
             Floating point: TMS320C4x, C67xx, ADSP21xxx
    2. Special Purpose DSP
       Hardware designed for:
       1. Specific DSP algorithms such as FFT
       2. Specific applications: PCM & filtering
  • 63.
    TEXAS DIGITAL SIGNAL PROCESSORS
    Fixed-point processors (16-bit):
        TMS320C1x, C2x, C2xx
        TMS320C50, C51, C53
    Floating-point processors (32-bit):
        TMS320C3x
        TMS320C4x