DSP architecture

Brief summary of Digital Signal Processing architectures and development by generation.

Published in: Technology

DSP architecture

  1. DSP Architectures
     Rensselaer at Hartford
     ECSE 6620 - Fall 2001, Lecture 16
     Jason M. Stripinis, jasonstripinis@engineer.com
  2. Basic Processor Structure
     • Here we see a very simple processor structure, such as might be found in a small 8-bit microprocessor.
     12 DEC 01 ECSE 6620 - Jason Stripinis (jasonstripinis@eng
  3. Basic Processor Functions
     • ALU
       – Arithmetic Logic Unit: this circuit takes two operands on the inputs (labeled A and B) and produces a result on the output (labeled Y).
       – The operations will usually include, as a minimum:
         • add, subtract
         • and, or, not
         • shift right, shift left
         • ALUs in more complex processors will execute many more instructions.
  4. Basic Processor Functions
     • Register File
       – A set of storage locations (registers) for storing temporary results. Early machines had just one register (the accumulator). Modern RISC processors typically have 32 or more registers.
     • Instruction Register
       – The instruction currently being executed by the processor is stored here.
     • Control Unit
       – The control unit decodes the instruction in the instruction register and sets signals which control the operation of most other units of the processor. For example, the operation code (opcode) in the instruction determines the settings of the ALU control signals, which select the operation (+, -, ^, v, ~, shift, etc.) it performs.
  5. Basic Processor Functions
     • Clock
       – The vast majority of processors are synchronous, that is, they use a clock signal to determine when to capture the next data word and perform an operation on it. In a globally synchronous processor, a common clock needs to be routed (connected) to every unit in the processor.
     • Program counter
       – The program counter holds the memory address of the next instruction to be executed. It is updated every instruction cycle to point to the next instruction in the program. Branch instructions change the program counter by something other than a simple increment.
  6. Basic Processor Functions
     • Memory Address Register
       – This register is loaded with the address of the next data word to be fetched from or stored into main memory.
     • Address Bus
       – Transfers addresses to memory and memory-mapped peripherals. It is driven by the processor acting as a bus master.
     • Data Bus
       – Carries data to and from the processor, memory and peripherals. It is driven by the data source, i.e. processor, memory, etc.
     • Multiplexed Bus
       – To limit device pin counts and bus complexity, some processors multiplex (MUX) address and data onto the same bus, with an adverse effect on performance.
  7. DSP Implementations
     • DSP Algorithm
       – Series of mathematical operations that are applied to process a sequence of digital signals sampled from the real (analog) world
     • Application examples
       – Filtering
       – FFT
       – Noise cancellation
       – Spectral Processing
  8. Why is a special architecture good for digital signal processing?
     • DSPs are tailored to run DSP algorithms efficiently.
     • Special functions to handle DSP algorithm demands:
       – Unique data access patterns
         • Streams of data requiring high bandwidth
         • Low data repetition but high code repetition
       – Math operation focus ("number cruncher")
       – Real-time constraints
       – Power and size constraints
       – Cost requirement
       – Attention to numeric effects (limited fixed point error)
  9. DSP Functional Characteristics
     • Typically require a few specific operations
     • Consider a FIR filter. This requires:
       – additions & multiplications
       – delays
       – array handling
  10. DSP Typical Operations
     • Additions & Multiplications
       – fetch two operands
       – perform the addition or multiplication (or both)
       – store the result
     • Delays
       – store the result for later use
     • Array Handling
       – fetch values from consecutive memory locations
       – copy data from register to register
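The three kinds of operation listed above can be seen together in a direct-form FIR filter. The following is a minimal sketch in Python, not code from the lecture; the names `fir_step` and `fir_filter` are made up for illustration, and the delay line is shifted by copying, which is the simplest (not the fastest) way to model the delays.

```python
def fir_step(coeffs, delay_line, sample):
    """Process one input sample through a direct-form FIR filter (illustrative sketch)."""
    # Delay: shift the stored samples by one position; the oldest sample falls off.
    delay_line.insert(0, sample)
    delay_line.pop()
    acc = 0
    # Array handling: walk the coefficients and the delayed samples in step.
    for c, x in zip(coeffs, delay_line):
        # Addition and multiplication: fetch two operands, multiply, accumulate.
        acc += c * x
    return acc


def fir_filter(coeffs, samples):
    """Filter a whole sequence, starting from an all-zero delay line."""
    delay_line = [0] * len(coeffs)
    return [fir_step(coeffs, delay_line, s) for s in samples]
```

Feeding a unit impulse into the filter returns the coefficients themselves, which is a quick sanity check on the delay handling.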
  11. DSP Typical Operations
     • To perform these basic operations most DSPs have:
       – a parallel multiply and add
       – multiple memory accesses (to fetch two operands and store the result)
       – sufficient registers to hold data temporarily
       – efficient address generation for array handling
       – special features such as delays or circular addressing
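Circular addressing lets a delay line be updated without copying every sample. The sketch below models it in Python with a modulo on the index; on a real DSP the address generator would do the wrap in hardware. The class and function names are hypothetical.

```python
class CircularDelayLine:
    """Fixed-length delay line addressed modulo its length (illustrative sketch)."""

    def __init__(self, length):
        self.buf = [0] * length
        self.head = 0  # index of the newest sample

    def push(self, sample):
        # Step the pointer back by one and wrap, then overwrite the oldest sample.
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = sample

    def tap(self, k):
        # Return the sample that was pushed k steps ago.
        return self.buf[(self.head + k) % len(self.buf)]


def fir_circular(coeffs, samples):
    """FIR filter whose delay line is a circular buffer: no data is moved per sample."""
    line = CircularDelayLine(len(coeffs))
    out = []
    for s in samples:
        line.push(s)
        acc = 0
        for k, c in enumerate(coeffs):
            acc += c * line.tap(k)  # multiply-accumulate over the taps
        out.append(acc)
    return out
```

Only one write and one pointer update happen per input sample, regardless of the filter length.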
  12. DSP Arithmetic Logic Unit
     • Most DSP operations require additions and multiplications together. So DSP processors usually have parallel hardware adders and multipliers which can be used with a single instruction.
  13. Register Structure
     • Delays require that intermediate values be held for later use.
     • For example, when keeping a running total, the total can be kept within the processor to avoid wasting repeated reads from and writes to memory.
     • For this reason DSP processors have lots of registers which can be used to hold intermediate values.
     • Registers may be fixed-point or floating-point.
  14. Memory Addressing
     • Array handling requires that data can be fetched efficiently from consecutive memory locations.
     • For this reason DSP processors have address registers which are used to hold addresses and can be used to generate the next needed address efficiently.
     • Usually, the next needed address can be generated during the data fetch or store operation, with no overhead.
  15. Memory Addressing
     • Example DSP address generation operations:
       Instruction   Name                     Description
       *rP           register indirect        read the data pointed to by the address in register rP
       *rP++         postincrement            having read the data, postincrement the address pointer to point to the next value in the array
       *rP--         postdecrement            having read the data, postdecrement the address pointer to point to the previous value in the array
       *rP++rI       register postincrement   having read the data, postincrement the address pointer by the amount held in register rI to point to rI values further down the array
       *rP++rIr      bit reversed             having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit reversed order
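The addressing modes in the table can be modelled in a few lines. This is a sketch in Python under the assumption that memory is a simple list and a pointer is an integer index; `read_postincrement`, `bit_reverse` and `bit_reversed_order` are illustrative names, not instructions of any particular DSP.

```python
def read_postincrement(memory, pointer, step=1):
    """Model *rP++ (step=1), *rP-- (step=-1) or *rP++rI (step=rI).

    Returns the value read and the updated pointer.
    """
    return memory[pointer], pointer + step


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Order in which an n-point FFT (n assumed to be a power of two) visits its data."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

For an 8-point FFT, `bit_reversed_order(8)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`, which is the sequence the bit reversed mode generates without any extra instructions.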
  16. Memory Architectures for DSP
     • For arithmetic the DSP needs to fetch two operands in a single instruction cycle.
     • Since we also need to store the result and to read the instruction itself, more than two memory accesses per instruction cycle are needed.
     • Even the simplest DSP operation, an addition involving two operands and a store of the result to memory, requires four memory accesses (three to fetch the two operands and the instruction, plus a fourth to write the result).
  17. Memory Architectures for DSP
     • DSP processors usually support multiple memory accesses in the same instruction cycle.
     • It is not possible to access two different memory addresses simultaneously over a single memory bus.
     • There are two common methods to achieve multiple memory accesses per instruction cycle:
       • Harvard architecture
       • modified von Neumann architecture
  18. Memory Architectures for DSP (Harvard Architecture)
     • The Harvard architecture has two separate physical memory buses, allowing two simultaneous memory accesses.
     • The true Harvard architecture dedicates one bus to fetching instructions, with the other available to fetch operands.
     • This is inadequate for DSP operations, which usually involve at least two operands. So DSP Harvard architectures usually permit the program bus to be used for operand access as well.
  19. Memory Architectures for DSP (Harvard Architecture)
     • Note that it is often necessary to fetch three things (the instruction plus two operands) and the Harvard architecture alone is inadequate to support this.
     • So DSP Harvard architectures often also include a cache memory which can be used to store instructions which will be reused, leaving both Harvard buses free for fetching operands.
     • The Harvard architecture plus cache is sometimes called an extended Harvard architecture or Super Harvard ARChitecture (SHARC).
  20. Memory Architectures for DSP (Harvard Architecture)
     • The Harvard architecture requires two memory buses. This makes it expensive to bring off the chip. For example, a DSP using 32 bit words and with a 32 bit address space requires at least 64 pins for each memory bus, a total of 128 pins if the Harvard architecture is brought off the chip. This results in very large chips, which are difficult to design into a circuit.
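The pin-count figure above is simple arithmetic, sketched here in Python. The helper names are hypothetical, and the count covers only address and data lines; control, power and ground pins are ignored, which is why the slide says "at least".

```python
def bus_pins(data_bits, address_bits):
    """Pins needed for one non-multiplexed memory bus (address and data lines only)."""
    return data_bits + address_bits


def external_pins(data_bits, address_bits, buses):
    """Total address and data pins when `buses` separate buses are brought off chip."""
    return buses * bus_pins(data_bits, address_bits)
```

With 32 bit words and a 32 bit address space, `external_pins(32, 32, 1)` is 64 and `external_pins(32, 32, 2)` is 128, matching the numbers quoted for a single bus and for an off-chip Harvard architecture.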
  21. Memory Architectures for DSP (von Neumann Architecture)
     • The von Neumann architecture uses only a single memory bus. This is relatively cheap, requiring fewer pins than the Harvard architecture, and simple to use because the programmer can place instructions or data anywhere throughout the available memory.
     • But it does not permit multiple simultaneous memory accesses.
     • The modified von Neumann architecture allows multiple memory accesses per instruction cycle by running the memory clock faster than the instruction cycle.
  22. Memory Architectures for DSP (von Neumann Architecture)
     • Each instruction cycle is divided into multiple machine states and a memory access can be made in each machine state, permitting multiple memory accesses per instruction cycle.
     • The modified von Neumann architecture permits all the memory accesses needed to support addition or multiplication: fetch of the instruction; fetch of the two operands; and storage of the result.
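The trade-offs between these memory architectures can be compared with a rough counting model. The Python sketch below is an assumption-laden simplification, not a description of any real device: every memory access costs one bus slot, an instruction held in cache costs nothing, and `instruction_cycles` is a made-up helper name.

```python
from math import ceil

# Memory accesses for an add that stores its result, and for a MAC that keeps
# its result in an on-chip accumulator (both lists are illustrative).
ADD_WITH_STORE = ["instruction", "operand A", "operand B", "result"]
MAC_TO_ACCUMULATOR = ["instruction", "operand A", "operand B"]


def instruction_cycles(accesses, buses=1, states_per_cycle=1, cached_instruction=False):
    """Instruction cycles needed to complete all memory accesses of one instruction.

    buses: number of independent memory buses (2 for Harvard).
    states_per_cycle: machine states per instruction cycle (modified von Neumann).
    cached_instruction: True if the instruction comes from an on-chip cache.
    """
    needed = len(accesses)
    if cached_instruction and "instruction" in accesses:
        needed -= 1
    return ceil(needed / (buses * states_per_cycle))
```

Under this model a plain von Neumann machine needs 3 cycles for the MAC, a Harvard machine needs 2, a Harvard machine with an instruction cache needs 1, and a modified von Neumann machine with four machine states per cycle completes even the add with store in 1.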
  23. Why use a special architecture for digital signal processing? The Answers
       Demand                                                     Answer
       Unique data access patterns                                Bit reversed addressing (FFT)
       Streams of data requiring high bandwidth                   Multiple access memory architecture
       Low data repetition but high code repetition               Eliminate data cache (save $$)
       Math operation focus                                       MAC instruction; vector processing unit
       Real-time constraints                                      Zero-overhead loops
       Power and size constraints                                 Limited additional functional units (unlike GPP)
       Cost requirement                                           On-board peripherals (SOC)
       Attention to numeric effects (limited fixed point error)   ALU with 16-bit operands and 32-bit result
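The last row, 16-bit operands with a 32-bit result, is what keeps fixed point error limited: the full product is kept and rounding happens only once, at the end. The Python sketch below assumes the common Q15 fractional format (one sign bit, 15 fraction bits), which the slides do not name explicitly; all function names are illustrative.

```python
Q15_ONE = 1 << 15  # scale factor of the assumed Q15 format


def to_q15(x):
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, saturating at the limits."""
    value = int(round(x * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, value))


def mac_q15(acc, a, b):
    """Multiply two Q15 operands and add the full-width (Q30) product to the accumulator."""
    return acc + a * b


def q30_to_q15(acc):
    """Round a Q30 accumulator back to Q15 once, at the end, saturating to 16 bits."""
    rounded = (acc + (1 << 14)) >> 15
    return max(-Q15_ONE, min(Q15_ONE - 1, rounded))
```

For example, 0.5 times 0.5 becomes 16384 times 16384, a Q30 product of 2 to the power 28, which rounds back to 8192, that is 0.25 in Q15. Even the largest product, (-32768) squared, still fits in a signed 32-bit result.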
  24. DSP Generations
     • 1st Generation (1979-1982)
       – Transition from experimental signal processors
     • 2nd Generation (1985-1986)
       – Move from co-processor to stand-alone processor
     • 3rd Generation (1987-1989)
       – Major hardware improvements to speed
     • 4th Generation (1990-1996)
       – More on-chip integration (ADC, DAC, memory, multi-processor)
     • 5th Generation (1997-)
  25. DSP Generations: 1st Generation (1979-1982)
     • Primarily targeted at digital filtering
     • Specialized co-processor for signal processing
     • NMOS (n-Channel Metal Oxide Semi) fabrication
     • 16-bit fixed point
     • Fast multiplier (and adder)
     • Harvard architecture
     • Specialized instruction set
  26. DSP Generations: 1st Generation (1979-1982)
     • Example = Texas Instruments TMS32010
       – 16-bit fixed point
       – Harvard architecture
       – two Address registers
       – one A register (adder)
       – one P register (multiplier)
       – one T register (data shift on delay line)
       – No zero-overhead loop
       – Specialized instruction set
       – MAC Time 400 ns (<100 ns today)
       – 50 ms per 1024-FFT
  27. DSP Generations: 1st Generation (1979-1982)
     • Example = Texas Instruments TMS32010
  28. DSP Generations: 2nd Generation (1985-1986)
     • Move from co-processor to stand-alone processor
     • CMOS (Complementary Metal Oxide Semi) fabrication
     • Double the speed of first generation
     • Advances in memory architecture (more internal RAM)
     • Better pipelining of functional units
     • Address generators (bit-reversing)
     • Zero-overhead loop HW
     • Limited floating point in SW
  29. DSP Generations: 2nd Generation (1985-1986)
     • Example = Texas Instruments TMS32020 (1985)
       – 16-bit fixed point
       – Harvard architecture
       – Improved TMS32010
       – RPTS allows pipelined instruction performed in single cycle
       – Specialized instruction set
       – MAC Time 200 ns
       – 10 ms per 1024-FFT
  30. DSP Generations: 3rd Generation (1987-1989)
     • Increased floating point support
       – 32-bit floating point hardware DSPs released
       – Floating point emulation on fixed point processors
       – IEEE754 support
     • Hardware enhancements (large speed increase)
       – dense CMOS fabrication
       – on chip DMA
       – instruction caches
       – increased clock rates (first cores above 10 MHz)
     • Increased complexity of SW
  31. DSP Generations: 3rd Generation (1987-1989)
     • Example = Motorola DSP56001 (1988)
       – 24-bit data, instructions
       – 24-bit fixed point
       – 3 memory spaces (P, X, Y)
       – parallel moves
       – circular addressing
       – MAC Time 75 ns (21 ns today)
       – ~3 ms per 1024-FFT
     • Other Examples:
       – AT&T DSP16A
       – Analog Devices ADSP-2100
       – TI TMS320C50
  32. DSP Generations: 4th Generation (1990-1996)
     • Hardware integration
       – ADC
       – DAC
       – more memory
       – multiple DSPs on one chip
     • Decreasing power consumption
       – 5.0 VDC → 3.3 VDC → 3.0 VDC → 2.7 VDC
     • GPPs start to get DSP functions
       – SIMD
       – Leads to Intel introducing MMX (MultiMedia eXtensions) for x86
  33. DSP Generations: 4th Generation (1990-1996)
     • Example = TI TMS320C541 (1995)
       – Enhanced architecture
       – Low voltage (3.3 VDC)
       – More on-chip memory
       – Application specific functional units
       – MAC Time 20 ns (10 ns today)
       – ~1 ms per 1024-FFT
     • Example = TI TMS320C80
       – multiple processors per chip
  34. The GPP Option
     • High-performance general-purpose processors for PCs and workstations are increasingly suitable for some DSP applications.
     • E.g., Intel MMX Pentium, Motorola/IBM PowerPC 604e
     • These processors achieve excellent to outstanding floating and/or fixed-point DSP performance via:
       – Very high clock rates (200-500 MHz)
       – Superscalar architectures
       – Single-cycle multiplication and arithmetic operations
       – Good memory bandwidth
       – Branch prediction
       – In some cases, single-instruction, multiple-data (SIMD) ops
  35. DSP Generations: 5th Generation (1997-)
     • Not the classic DSP architectures
       – SIMD (Single Instruction Multiple Data stream) instructions
       – VLIW (Very Long Instruction Words) allows RISC processing
         • High parallelism
         • Increased clock speeds
         • No longer application specific functional units (no MAC FU)
     • Low voltage (2.5 VDC or less, even 1.2 VDC cores)
     • MAC Time 3 ns (but can be power hungry)
     • GPPs start to get DSP functions
       – Intel introduces MMX (MultiMedia eXtensions) for x86 in 1997
     • Increased integration
       – MCU and DSP cores on same chip
       – MCU functions/ports added to DSPs
  36. DSP Generations: 5th Generation (1997-)
     • SIMD (Single Instruction Multiple Data) instructions
       – Enhance throughput by allowing parallelism
       – Requires multiple functional units and wider buses
       – May support multiple data widths (different functional groups)
       – Example = DSP16000 WAS SIMD
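The idea of one instruction operating on several data items can be illustrated by packing two 16-bit values into one 32-bit word and adding both lanes with a single pass through the arithmetic. The Python sketch below emulates this in software; it is not the instruction set of the DSP16000, MMX or any other real device, and the names `pack`, `unpack` and `packed_add16` are hypothetical.

```python
LANE_HIGH_BITS = 0x80008000  # top bit of each 16-bit lane
LANE_LOW_BITS = 0x7FFF7FFF   # remaining 15 bits of each lane


def pack(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack(word):
    """Split a 32-bit word into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def packed_add16(a, b):
    """Add both 16-bit lanes at once; each lane wraps and no carry crosses lanes."""
    # Add the low 15 bits of every lane; any carry lands in that lane's top bit.
    low_sum = (a & LANE_LOW_BITS) + (b & LANE_LOW_BITS)
    # Fold in the top bits with XOR so the carry out of a lane is discarded.
    return (low_sum ^ ((a ^ b) & LANE_HIGH_BITS)) & 0xFFFFFFFF
```

Adding the lanes (1, 0xFFFF) and (2, 1) gives (3, 0): the low lane wraps around without disturbing the high lane, which is the behaviour a hardware SIMD adder gets by cutting the carry chain between lanes.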
  37. DSP Generations: 5th Generation (1997-)
     • VLIW (Very Long Instruction Words)
       – Instruction Level Parallelism (ILP) can be a major performance gain
         • Superscalar implementation requires larger die and more power to dynamically pipeline instructions
       – VLIW can be used to statically pipeline instructions at compile time (or even by hand!)
       – VLIW instruction words have fixed "slots" for instructions that map to the functional units available.
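Static scheduling into fixed slots can be sketched as a small greedy packer. The Python below is an illustration only, under loud assumptions: every operation takes one cycle, each register is written once (so only read-after-write dependencies matter), and the unit names and the `schedule` function are made up rather than taken from a real compiler or processor.

```python
def schedule(ops, slots):
    """Greedily pack operations into VLIW instruction words.

    ops: list of (name, unit, dest, sources) tuples in program order.
    slots: dict mapping a functional unit name to its slots per instruction word.
    Returns a list of instruction words, each a list of operation names.
    """
    bundles = []   # each bundle is a list of (name, unit) pairs
    ready_at = {}  # register -> index of the first word that may read it
    for name, unit, dest, sources in ops:
        # Earliest word in which all source registers are available.
        i = max((ready_at.get(src, 0) for src in sources), default=0)
        while True:
            while len(bundles) <= i:
                bundles.append([])
            in_use = sum(1 for _, u in bundles[i] if u == unit)
            if in_use < slots[unit]:
                break  # a free slot for this unit exists in word i
            i += 1
        bundles[i].append((name, unit))
        ready_at[dest] = i + 1  # result is usable from the next word on
    return [[name for name, _ in bundle] for bundle in bundles]
```

With two data slots, one multiplier slot and one ALU slot, two loads, a multiply, a third load and an add pack into three instruction words instead of five. Slots left empty in a word would be filled with no-ops, which is one reason VLIW code size grows.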
  38. DSP Generations: 5th Generation (1997-)
     • VLIW Advantages
       – huge theoretical pay off
         • Less than 1 ns per MAC!
         • Less than 75 ns per 1024-FFT
     • VLIW Drawbacks
       – Can be very difficult to program and debug
       – High power consumption if VLIW is not filled
       – Code size dramatically increases, requiring more program memory
  39. DSP Generations: 5th Generation (1997-)
     • VLIW Example = TI TMS320C6201
     • 32-bit Functional Units
       – Lx = ALU
       – Sx = Branching and shifting
       – Mx = Multiplier
       – Dx = Data Store
  40. DSP Generational Development
     • DSP processor performance has increased by a factor of about 400x over the past 20 years
       [Chart: MAC time (ns), axis 0 to 400, for the 1st through 5th generations]
     • DSP architectures will be increasingly specialized for applications, especially communications applications
     • General-purpose processors will become viable for many DSP applications
