Yg hvuihbijbh itf ygcinbjbiojbfhuujh.ppt

SECA1506- Digital Signal Processing
Semester -V
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
UNIT V
REAL-TIME DIGITAL SIGNAL PROCESSING

ACOE343 - Embedded Real-Time Processor Systems - Frederick University
What is a DSP?
• A specialized microprocessor for real-time DSP applications
– Digital filtering (FIR and IIR)
– FFT
– Convolution, matrix multiplication, etc.
[Figure: block diagram. Analog input → ADC → DSP → DAC → analog output; the DSP also takes digital input and produces digital output directly.]

Common DSP features
• Harvard architecture
• Dedicated single-cycle Multiply-Accumulate (MAC)
instruction (hardware MAC units)
• Single-Instruction Multiple-Data (SIMD) and Very Long
Instruction Word (VLIW) architectures
• Pipelining
• Saturation arithmetic
• Zero overhead looping
• Hardware circular addressing
• Cache
• DMA

Single-Cycle MAC unit
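[Figure: a multiplier feeds an adder, and a register holds the running sum. In each cycle the new product a_i x_i is added to the sum of the earlier products a_{i-1} x_{i-1}, ...]
$\sum_{i=0}^{n-1} a_i x_i$
Can compute a sum of n products in n cycles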
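The accumulate step can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming an idealized unit that finishes one multiply-accumulate per cycle; the function name `mac_dot` and the cycle counter are hypothetical and do not model any particular device.

```python
def mac_dot(coeffs, samples):
    """Sum of products, one multiply-accumulate per loop iteration."""
    if len(coeffs) != len(samples):
        raise ValueError("coeffs and samples must have the same length")
    acc = 0      # plays the role of the accumulator register
    cycles = 0   # idealized: one MAC operation per cycle
    for a, x in zip(coeffs, samples):
        acc += a * x   # multiply and add in the same step
        cycles += 1
    return acc, cycles


acc, cycles = mac_dot([1, 2, 3, 4], [5, 6, 7, 8])
print(acc, cycles)  # 70 4
```

Four products are accumulated in four loop iterations, matching the "n products in n cycles" claim.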

Single Instruction - Multiple Data
(SIMD)
• A technique for data-level parallelism by
employing a number of processing elements
working in parallel

Very Long Instruction Word (VLIW)
• A technique for instruction-
level parallelism by
executing instructions
without dependencies
(known at compile-time) in
parallel
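• Example of a single VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
[Figure: the four operations of the VLIW instruction (F=a+b, c=e/g, d=x&y, w=z*h) are issued to four processing units (PU), which take the inputs a, b, e, g, x, y, z, h and produce F, c, d, w in parallel.]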

PIPELINING
• It is a technique which allows two or more operations to
overlap during execution.
• DSP algorithms are repetitive making them suitable for
pipelining
• It ensures a steady flow of instructions to the CPU and
increases system performance.
• In a three-stage pipeline each instruction still takes three clock
cycles, but in each cycle the processor is working on up to three
different instructions.
• Pipelining puts more load on the system memory: the number of
memory accesses increases with the number of stages.
• In Harvard architecture the separation of data and
instruction memory promotes pipelining.
• It also allows better utilisation of arithmetic unit.
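The gain from overlapping instructions can be estimated with a small calculation. The sketch below assumes an ideal pipeline with no stalls, in which one instruction enters per cycle; the function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs all of its stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction needs n_stages cycles; each later one
    completes one cycle after its predecessor (no stalls assumed)."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


n, k = 100, 3
print(unpipelined_cycles(n, k))  # 300
print(pipelined_cycles(n, k))    # 102
```

For long instruction streams the speed-up approaches the number of stages, which is why the repetitive loops of DSP algorithms benefit so much.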

Pipelining
• DSPs commonly feature deep pipelines
• TMS320C6x processors have 3 pipeline stages, each with a
number of phases (cycles):
– Fetch
• Program Address Generate (PG)
• Program Address Send (PS)
• Program ready wait (PW)
• Program receive (PR)
– Decode
• Dispatch (DP)
• Decode (DC)
– Execute
• 6 to 10 phases

Saturation Arithmetic
• Results of operations like addition and multiplication are
confined to a fixed range
• On overflow and underflow the result is clamped to the
maximum and minimum allowed value, respectively
• Associativity and distributivity no longer apply
• Signed 8-bit (1 byte) saturation arithmetic examples:
• 64 + 69 = 127
• -127 – 5 = -128
• (64 + 70) – 25 = 102 ≠ 64 + (70 – 25) = 109
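A minimal sketch of signed 8-bit saturation in Python follows. The helper names `sat8`, `sat_add` and `sat_sub` are hypothetical; real DSPs do the clamping in hardware, so this only mirrors the arithmetic.

```python
INT8_MIN, INT8_MAX = -128, 127


def sat8(value):
    """Clamp an integer to the signed 8-bit range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def sat_add(a, b):
    return sat8(a + b)


def sat_sub(a, b):
    return sat8(a - b)


print(sat_add(64, 69))    # 127, since 133 is clamped to the maximum
print(sat_sub(-127, 5))   # -128, since -132 is clamped to the minimum

# Grouping matters once an intermediate result saturates:
print(sat_sub(sat_add(100, 100), 100))  # 27
print(sat_add(100, sat_sub(100, 100)))  # 100
```

The last two lines evaluate the same expression with different grouping and disagree, which is what the loss of associativity means in practice.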

Zero Overhead Looping
• Hardware support for loops with a constant
number of iterations using hardware loop
counters and loop buffers
• No branching
• No loop overhead
• No pipeline stalls or branch prediction
• No need for loop unrolling

Hardware Circular Addressing
• A circular buffer is a data structure implementing a fixed-length
queue of fixed-size objects, where objects are added at the head
of the queue and removed from the tail.
• Requires at least 2 pointers (head and tail)
• Extensively used in digital filtering
$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$
[Figure: a four-entry circular buffer holding x[n], x[n-1], x[n-2], x[n-3], shown at Cycle 1 and Cycle 2 with the head and tail pointers marked.]
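The role of the circular buffer in an FIR filter can be sketched in software. This is an illustrative model, not how any specific DSP implements it: the class name `CircularFIR` is hypothetical, the wrap-around is done with the modulo operator where a DSP would use its address generation hardware, and the tail is implicit (the slot just after the head holds the oldest sample).

```python
class CircularFIR:
    """FIR filter whose delay line is a fixed-size circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buf = [0] * len(self.coeffs)  # storage for x[n] ... x[n-k]
        self.head = 0                      # index of the newest sample

    def step(self, sample):
        n = len(self.buf)
        # Advance the head with wrap-around; this overwrites the oldest sample.
        self.head = (self.head + 1) % n
        self.buf[self.head] = sample
        # y[n] = a0*x[n] + a1*x[n-1] + ... ; x[n-k] sits k slots behind head.
        return sum(a * self.buf[(self.head - k) % n]
                   for k, a in enumerate(self.coeffs))


fir = CircularFIR([1, 2, 3])
print([fir.step(x) for x in [1, 0, 0, 0]])  # [1, 2, 3, 0]
```

Feeding a unit impulse returns the coefficients in order, a quick check that the indexing is right. No samples are ever shifted in memory; only the head index moves.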

Direct Memory Access (DMA)
• The feature that allows peripherals to access main
memory without the intervention of the CPU
• Typically, the CPU initiates DMA transfer, does other
operations while the transfer is in progress, and
receives an interrupt from the DMA controller once
the operation is complete.
• Can create cache coherency problems (the data in
the cache may be different from the data in the
external memory after DMA)
• Requires a DMA controller

Cache memory
• Separate instruction and data L1 caches
(Harvard architecture)
• Cache coherence protocols required, since
most systems use DMA

DSP vs. Microcontroller
• DSP
– Harvard Architecture
– VLIW/SIMD (parallel
execution units)
– No bit level operations
– Hardware MACs
– DSP applications
• Microcontroller
– Mostly von Neumann
Architecture
– Single execution unit
– Flexible bit-level operations
– No hardware MACs
– Control applications

DIGITAL SIGNAL PROCESSORS
Leading Manufacturers:
1.Texas Instruments (TI)
2.Analog Devices
3.Motorola
Programmable DSP (PDSP):
General purpose microprocessors designed specifically for DSP
applications
Special architecture and instruction set to compute DSP
algorithms more efficiently.

TYPES OF PROGRAMMABLE DSP (PDSP)
1. General Purpose DSP
Basically high-speed microprocessors with architecture and instruction
sets optimized for DSP operations.
Fixed point: TMS320C5x, C54x, DSP563x
Floating point: TMS320C4x, C67xx, ADSP21xxx
2. Special Purpose DSP
Hardware designed for:
1. Specific DSP algorithms such as FFT
2. Specific applications such as PCM and filtering

TEXAS DIGITAL SIGNAL PROCESSORS
Fixed point processors (16-bit):
TMS320C1x, C2x, C2xx
TMS320C50, C51, C53
Floating point processors (32-bit):
TMS320C3x
TMS320C4x

SELECTION OF DSP PROCESSORS
Architectural features:
On-chip memory, special instruction set, I/O capability and large
memory.
Execution speed:
MIPS and MFLOPS
Type of arithmetic:
Fixed point (cell phones and computer disk drives)
Floating point (applications that need a wide dynamic range of
values)

SELECTION OF DSP PROCESSORS
Word length:
Fixed point
16-bit :: TMS320C54x (telecommunications applications)
24-bit :: DSP56300 (high-quality audio applications)
Floating point
32-bit :: TMS320C3x, C4x (single-precision arithmetic)

TYPICAL APPLICATIONS OF TEXAS FAMILY
C1x, C2x, C2xx, C5x, C54x :
Toys, HDD, modems, cellular phones and active car suspensions.
C3x :
Filters, analysers, hi-fi systems, voice mail, imaging, bar-code
readers, motor control, 3D graphics.
C4x : Parallel-processing systems, image recognition, telecom
routing.
C6x : Wireless base stations, pooled modems, multi-channel
telephone systems.
C8x : Video telephony, 3D computer graphics, virtual reality and
multimedia applications.

OTHER DSP APPLICATIONS
[Figure annotation: cost increases from low-end to high-end]
High-end
● Wireless base station - TMS320C6000
● Cable modem
● Gateways
Mid-end
● Cellular phone - TMS320C540
● Fax/voice server
Low-end
● Storage products - TMS320C27
● Digital camera - TMS320C5000
● Portable phones
● Wireless headsets
● Consumer audio
● Automobiles, toasters, thermostats, ...

FUNCTIONAL MODES OF DSP PROCESSORS
Address generation unit (AGU)
Arithmetic logic unit (ALU)
Barrel shifter
Floating-point unit (FPU)
Back-side bus
Multiplexer, demultiplexer, registers
Memory management unit (MMU)
Translation lookaside buffer (TLB)

DSP ARCHITECTURE ENABLING TECHNOLOGIES
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970’s | Discrete logic | Non-real-time processing, simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970’s | Building block | Military radars, digital comm. | Single-chip bipolar multiplier; flash A/D
Early 1980’s | Single-chip DSP µP | Telecom, control | µP architectures; NMOS/CMOS
Late 1980’s | Function/application-specific chips | Computers, communication | Vector processing; parallel processing
Early 1990’s | Multiprocessing | Video/image processing | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990’s | Single-chip multiprocessing | Wireless telephony, Internet related | Low-power single-chip DSP; multiprocessing

COMMERCIAL DIGITAL SIGNAL PROCESSING DEVICES:
• Programmable DSPs have been used in numerous applications, such as communication,
control, computers, instrumentation, and consumer electronics.
• The architectural features and the processing power of these devices have been
constantly upgraded based on the advances in technology and the application
needs.
• Most of them have a Harvard architecture, a single-cycle hardware multiplier, an
address generation unit with dedicated address registers, special addressing
modes, and on-chip peripheral interfaces.
• The three most popular families are those from Texas Instruments, Motorola, and
Analog Devices.
• Texas Instruments was one of the first to come out with a commercial
programmable DSP, with the introduction of its TMS32010 in 1982.

FIRST GENERATION DSP µP
TMS32010 (TEXAS INSTRUMENTS) - 1982
FEATURES
o 200 ns instruction cycle (5 MIPS)
o 144 words (16 bit) on-chip data RAM
o 1.5K words (16 bit) on-chip program ROM - TMS32010
o External program memory expansion to a total of 4K words at full speed
o 16-bit instruction/data word
o Single-cycle 32-bit ALU/accumulator
o Single-cycle 16 x 16-bit multiply in 200 ns
o Two-cycle MAC (5 MOPS)
o Zero to 15-bit barrel shifter
o Eight input and eight output channels
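The throughput figures quoted for the TMS32010 follow from the cycle time. The sketch below redoes that arithmetic; the function name is hypothetical, and reading "5 MOPS" as counting the multiply and the add of each MAC as two separate operations is an assumption, not something the slide states.

```python
def mips_from_cycle_ns(cycle_ns, cycles_per_instruction=1):
    """Millions of instructions per second for a given cycle time in ns."""
    return 1000.0 / (cycle_ns * cycles_per_instruction)


# 200 ns instruction cycle, single-cycle instructions
print(mips_from_cycle_ns(200))  # 5.0

# A MAC takes two cycles, so 2.5 million MACs per second ...
mac_rate = mips_from_cycle_ns(200, cycles_per_instruction=2)
# ... which gives 5 MOPS if multiply and add count as two operations (assumption).
print(2 * mac_rate)  # 5.0
```

The same function applied to a 60 ns cycle gives about 16.7 MIPS, the figure quoted for the TMS320C30.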

TMS320C203/LC203 BLOCK DIAGRAM
DSP Core Approach - 1995

THIRD GENERATION DSP µP CASE STUDY
TMS320C30 - 1988
FEATURES
o 60 ns single-cycle instruction execution time
o 33.3 MFLOPS (million floating-point operations per second)
o 16.7 MIPS (million instructions per second)
o One 4K x 32-bit single-cycle dual-access on-chip ROM block
o Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
o 64 x 32-bit instruction cache
o 32-bit instruction and data words, 24-bit addresses
o 40/32-bit floating-point/integer multiplier and ALU
o 32-bit barrel shifter
o Eight extended-precision registers (accumulators)
o Two address generators with eight auxiliary registers and two
auxiliary register arithmetic units
o On-chip direct memory access (DMA) controller for concurrent
I/O and CPU operation
o Parallel ALU and multiplier instructions
o Block repeat capability
o Interlocked instructions for multiprocessing support
o Two serial ports to support 8/16/32-bit transfers
o Two 32-bit timers
o 1 µm CDMOS process

C54X ARCHITECTURE
#1: CPU designed for efficient DSP processing
MAC unit, 2 accumulators, additional adder, barrel shifter
#2: Multiple busses for efficient data and program flow
Four busses and large on-chip memory that result in sustained
performance near peak
#3: Highly tuned instruction set for powerful DSP computing
Sophisticated instructions that execute in fewer cycles, with less code
and low power demands

KEY #1: DSP ENGINE
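$y = \sum_{n=1}^{40} a_n x_n$
[Figure: the operands x and a feed the multiplier (MPY); the adder (ADD) accumulates the products a_n * x_n into y.]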

KEY #1: MAC UNIT
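MAC *AR2+, *AR3+, A
[Figure: MAC unit datapath. The multiplier (MPY) takes its operands from the sources labelled Data, Acc A, Temp, Coeff and Prgm, with signed/unsigned (S/U) control on each input and a fractional mode bit; the adder (ADD) accumulates into acc A or acc B.]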

KEY #1: ACCUMULATORS + ADDER
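General-purpose math example: t = s+e-r
LD @s, A    ; A = s
ADD @e, A   ; A = A + e
SUB @r, A   ; A = A - r
STL A, @t   ; store the low word of A in t
[Figure: ALU datapath. The ALU inputs are selected through a MUX from the buses and registers labelled A Bus, B Bus, A, B, C, T, D, the shifter and the MAC; results go to acc A and acc B; a U Bus is also shown.]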
KEY #1: BARREL SHIFTER
Barrel shifter (-16 to +31)
LD @X, 16, A
STH @B, Y
[Figure: the barrel shifter takes its input from A, B, C, D and the S Bus and drives the ALU and the E Bus.]

KEY #1: TEMPORARY REGISTER
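[Figure: the temporary register (T), loaded from D or X, feeds the ALU, the MAC and the EXP encoder over the T Bus; accumulators A and B are also shown.]
For example: A = xa
LD @x, T
MPY @a, A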

KEY #2: EFFICIENT DATA/PROGRAM FLOW

KEY #2: MULTIPLE BUSSES
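MAC *AR2+, *AR3+, A
[Figure: C54x bus structure. External memory connects through MUXes to the internal buses labelled P, C, D and E, which feed the central arithmetic logic unit (ALU, shifter, MAC, T register, accumulators A and B).]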

KEY #2: PIPELINE
P (Prefetch) | F (Fetch) | D (Decode) | A (Access) | R (Read) | E (Execute)
o Prefetch: Calculate address of instruction
o Fetch: Collect instruction
o Decode: Interpret instruction
o Access: Collect address of operand
o Read: Collect operand
o Execute: Perform operation

KEY #2: BUS USAGE
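[Figure: C54x bus usage. External memory connects through MUXes to the internal buses labelled P, C, D and E; the program counter (PC), control logic (CNTL) and auxiliary registers (ARs) drive the addresses, and the central arithmetic logic unit (ALU, shifter, MAC, T register, accumulators A and B) consumes the data.]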

KEY #2: PIPELINE PERFORMANCE
CYCLES
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 F4 D4 A4 R4 X4
            P5 F5 D5 A5 R5 X5
               P6 F6 D6 A6 R6 X6
Fully loaded pipeline

KEY #3: POWERFUL INSTRUCTIONS

KEY #3: ADVANCED APPLICATIONS
Application : Instruction(s)
Symmetric FIR filter : FIRS
Adaptive filtering : LMS
Polynomial evaluation : POLY
Code book search : STRCD, SACCD, SRCCD
Viterbi : DADST
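To see why a symmetric FIR filter earns a dedicated instruction, note that mirrored taps share a coefficient, so the two samples can be added first and multiplied once. The sketch below shows only this arithmetic idea; it does not reproduce the operand semantics of the FIRS instruction, and the function names are hypothetical.

```python
def fir_direct(h, window):
    """window[k] holds x[n-k]; uses len(h) multiplications."""
    return sum(hk * xk for hk, xk in zip(h, window))


def fir_symmetric(h, window):
    """Same result for symmetric h (h[k] == h[N-1-k]) with about N/2
    multiplications: mirrored samples are added before the multiply."""
    n = len(h)
    if any(h[k] != h[n - 1 - k] for k in range(n // 2)):
        raise ValueError("coefficients are not symmetric")
    acc = 0
    for k in range(n // 2):
        acc += h[k] * (window[k] + window[n - 1 - k])
    if n % 2:  # odd length: the centre tap has no mirror partner
        acc += h[n // 2] * window[n // 2]
    return acc


h = [1, 2, 3, 2, 1]
window = [5, 4, 3, 2, 1]
print(fir_direct(h, window), fir_symmetric(h, window))  # 27 27
```

For a 5-tap filter the symmetric form needs 3 multiplications instead of 5.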

C62X ARCHITECTURE - TMS320C6201 REVISION 2
[Figure: TMS320C6201 block diagram, with these labelled blocks]
C6201 CPU Megamodule
o Program Fetch, Instruction Dispatch, Instruction Decode
o Data Path 1: A Register File with functional units L1, S1, M1, D1
o Data Path 2: B Register File with functional units D2, M2, S2, L2
o Control Registers, Control Logic, Interrupts, Emulation, Test
Program Cache / Program Memory
o 32-bit address, 256-bit data
o 512K bits RAM
Data Memory
o 32-bit address; 8-, 16-, 32-bit data
o 512K bits RAM
Peripherals
o External Memory Interface
o 4-channel DMA
o Host Port Interface
o 2 Timers
o 2 multi-channel buffered serial ports (T1/E1)
o Power Down

C6201 INTERNAL MEMORY ARCHITECTURE
o Separate internal program and data spaces
o Program
  o 16K 32-bit instructions (2K fetch packets)
  o 256-bit fetch width
  o Configurable as either direct mapped cache or memory mapped program memory
o Data
  o 32K x 16
  o Single ported, accessible by both CPU data buses
  o 4 x 8K 16-bit banks
    o 2 possible simultaneous memory accesses (4 banks)
    o 4-way interleave; banks and interleave minimize access conflicts
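One way to picture the 4-way interleave is sketched here. It assumes, which the slide does not spell out, that consecutive 16-bit halfword addresses rotate through the four banks, and that two accesses issued in the same cycle collide only when they land in the same single-ported bank. The function names are hypothetical, and wider accesses that span more than one bank are ignored.

```python
NUM_BANKS = 4


def bank_of(halfword_address):
    """Bank selected by a 16-bit halfword address (low-order interleave assumed)."""
    return halfword_address % NUM_BANKS


def conflicts(addr_a, addr_b):
    """True if two same-cycle accesses hit the same single-ported bank."""
    return bank_of(addr_a) == bank_of(addr_b)


print([bank_of(a) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
print(conflicts(0, 1))  # False: neighbouring halfwords sit in different banks
print(conflicts(0, 4))  # True: both addresses map to bank 0
```

Under this model, the two data paths can read neighbouring array elements in the same cycle without stalling, which is the common case in filter loops.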

C6000 PIPELINE OPERATION BENEFITS
o Cycle time
  o Allows 6 ns cycle time on C67x
  o Allows 5 ns cycle time and single-cycle execution on C62x
o Parallelism
  o 8 new instructions can always be dispatched every cycle
o High-performance internal memory access
  o Pipelined program and data accesses
  o Two 32-bit data accesses/cycle (C62x)
  o Two 64-bit data accesses/cycle (C67x)
  o 256-bit program access/cycle
o Good compiler target
  o Visible: no variable-length pipeline flow
  o Deterministic: order and time of execution
  o Orthogonal: independent instructions

C67X ARCHITECTURE - TMS320C6701 DSP BLOCK DIAGRAM
[Figure: TMS320C6701 block diagram, with these labelled blocks]
’C67x Floating-Point CPU Core
o Program Fetch, Instruction Dispatch, Instruction Decode
o Data Path 1: A Register File with functional units L1, S1, M1, D1
o Data Path 2: B Register File with functional units D2, M2, S2, L2
o Control Registers, Control Logic, Interrupts, Emulation, Test
Program Cache / Program Memory
o 32-bit address, 256-bit data
o 512K bits RAM
Data Memory
o 32-bit address; 8-, 16-, 32-bit data
o 512K bits RAM
Peripherals
o External Memory Interface
o 4-channel DMA
o Host Port Interface
o 2 Timers
o 2 multi-channel buffered serial ports (T1/E1)
o Power Down

TMS320C6701
ADVANCED VLIW CPU (VELOCITI™)
o 1 GFLOPS @ 167 MHz
  o 6-ns cycle time
  o 6 x 32-bit floating-point instructions/cycle
o Load store architecture
o 3.3-V I/Os, 1.8-V internal
o Single- and double-precision IEEE floating-point
o Dual data paths
  o 6 floating-point units / 8 x 32-bit instructions
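The headline figure can be checked from the other two numbers on the slide. Treating each of the six floating-point instructions per cycle as one floating-point operation (an assumption about how the figure is counted):

$$
6\ \tfrac{\text{FLOP}}{\text{cycle}} \times 167 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} = 1.002 \times 10^{9}\ \tfrac{\text{FLOP}}{\text{s}} \approx 1\ \text{GFLOPS}
$$

$$
T_{\text{cycle}} = \frac{1}{167\ \text{MHz}} \approx 5.99\ \text{ns} \approx 6\ \text{ns}
$$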

TMS320C6701-MEMORY /PERIPHERALS
o Same as ’C6201
o External interface supports
  o SDRAM, SRAM, SBSRAM
o 4-channel boot loading DMA
o 16-bit host port interface
o 1 Mbit on-chip SRAM
o 2 multichannel buffered serial ports (T1/E1)
o Pin compatible with ’C6201

’C67X AND ’C62X COMMONALITY
o Driving commonality between ’C67x & ’C62x shortens ’C67x design time.
o Maintaining symmetry between data paths shortens the ’C67x design time.
[Figure: side-by-side CPU block diagrams.
’C62x CPU: Program Fetch & Dispatch, Decode, two register files, Control Registers, Emulation, and the units L-Unit 1/2 (arithmetic logic unit), S-Unit 1/2 (auxiliary logic unit), M-Unit 1/2 (multiplier unit), D-Unit 1/2 (data load/store).
’C67x CPU: the same structure, with the L, S and M units extended with floating point.]

TEXAS DIGITAL SIGNAL PROCESSORS
C6x devices :
An advanced VLIW architecture that can execute 1600 MIPS.
C8x devices :
A number of advanced DSPs (ADSP) and a RISC master processor on a
single piece of silicon.

