SlideShare a Scribd company logo
1 of 12
Download to read offline
21.5-1© IEEE 1998
SA 21.5: A 33GB/s 13.4Mb Integrated Graphics
Accelerator and Frame Buffer
R. Torrance, I. Mes, B. Hold, D. Jones, J. Crepeau,
P. DeMone, D. MacDonald, C. O’Connell, P. Gillingham,
R. White1
, S. Duggins2
, D. Fielder2
MOSAID Technologies Inc., Kanata, Ontario, Canada
1
Accelerix Incorporated, Ottawa, Ontario, Canada
2
Symbionics Ltd., Cambridge, England, UK
With the increasing capacity of commodity DRAM, a dichotomy
has appeared between memory requirements of PC graphics
systems, and standard DRAM for PC main memory. PC graphics
subsystems require wide, relatively shallow memory for high
data bandwidth while commodity DRAM is too narrow and too
deep. Integrating DRAM frame buffer and graphics accelerator
logic on the same chip solves this problem (Figure 1).
Reported integrated DRAM and logic devices separate DRAM
and ASIC logic portions with a traditional memory interface [1].
Although this approach has a number of benefits over the
discrete solution, much greater improvements are available by
more tightly integrating the DRAM and logic, something not
possible at the board level. This device integrates parts of the
graphics processor within the DRAM to increase performance.
Figure 2 shows the architecture of one bank of the frame buffer,
where the pixel processing unit (PPU) and the serial output
registers (SORs) are integrated into the DRAM architecture [2].
This allows the bus width between the DRAM frame buffer and
the processor to be 4096b.
Figure 3a shows a traditional DRAM, where a relatively narrow
databus runs the length of the sense amp straps. To allow for
massively-parallel DRAM access, the frame buffer uses
databuses orthogonal to the standard direction (Figure 3b).
These databuses can be routed in metal2 (M2) over an array
with capacitor-over-bitline processing. This allows up to 1
databus for each column of DRAM, depending on the M2 pitch of
the process. The device employs 8 banks, each of 512b databus
width.
The PPU is 512b wide per bank, requiring one bit of the PPU to
pitch-match to 4 DRAM columns. With this architecture, the
choice of circuits included in this massively parallel processor
must be judicious. It is limited to the most basic pixel operations:
the raster operations. Figure 4 shows a single bit of the PPU, and
how it is pitch-matched to the DRAM to accelerate raster
operations. The PPU registers are built using dual port 6T SRAM
cells as shown in Figure 5a, while the rastop function unit is
implemented with a pMOS based 8-1 multiplexor as shown in
Figure 5b. This allows the PPU to perform any of the 256 rastops
possible with 3 input variables. All 512b are identical, and since
there is no dataflow directly between bits, redundant PPU bits
can be easily added (pitch-matched to the DRAM redundant
columns). This allows more aggressive core rules for the PPU
to reduce area.
This frame buffer architecture allows the graphics controller to
accelerate some of the most common graphic operations. For
instance, a block move can require a source pixel read, a
destination pixel read, a rastop, and a destination pixel write. In
a typical system with a 64b memory interface, this is done 64b at
a time. In this chip, all reads can be done at up to 4096b wide,
as can the actual raster operation, and the write back. For
operationsthatrequirerealignmentofsourcepixelstodestination
pixels, a 32b funnel shifter is built under the FB data-bus in each
bank, allowing realignment of 256b at a time. Also, a 32b word can
be written to all 16 PPU words simultaneously. If all 8 banks are
enabled, this allows for a 4096b write to DRAM in a single cycle,
useful for screen clearing.
The gate array logic incorporates the graphic accelerator, PCI
interface,VGAcoreandvideoinputblocktoprovidethenecessary
system level functionality for a PC graphics device. Within the
accelerator logic the highly-parallel nature of the DRAM and its
embedded logic gives a much larger control space than a normal
external D/VRAM interface. This requires a novel approach to
BITBlt operation since data alignment, manipulation(ROP) and
masking must be performed in the most optimal way to achieve
the best performance. The availability of 5 512B registers within
the PPUs allows data to be cached for some operations to increase
performance further. The custom-designed pixel output path
(POP) logic includes a 64b hardware cursor, a video scaler with a
4-tap FIR filter for horizontal scaling, and 2-tap for vertical
scaling, color space conversion circuits, and a 135MHz LUTDAC.
This device uses dual fully integrated PLLs, one for the 66MHz
main clock (MCLK), and one for the 135MHz pixel clock (PCLK).
Noise is a concern on a mixed DRAM/logic chip. Noise from the
ASIC logic can cause a degradation of the refresh time for DRAM
cells. Refresh time is programmable, to allow the use of the
maximum refresh time possible after processing. The device
employs separate power pins and on-chip power rails for DRAM,
gate array, PLLs, DACs, and POP logic. Critical circuitry is
guarded with substrate and well guard rings.
Testing methodologies have yet to be standardized for mixed
DRAM/logicchips.AlthoughBISTisusefulforembeddedSRAMs,
redundancy and complicated cell coupling tests make it difficult
to use for large embedded DRAMs. BIST also complicates the
debug of an embedded DRAM, since the controllability and
observability of that DRAM is severely limited. In this device all
DRAM data and control is multiplexed onto the device pins in test
mode, using an on-chip test interface. This circuit makes the
entire frame buffer appear similar to a standard asynchronous
DRAM in memory test , allowing use of standard DRAM test
software. All bits of the SORs and PPUs are mapped into unused
row addresses for test. Even the rastop processor is mapped to
unused row addresses to allow the entire frame buffer to be
easily tested on a standard DRAM tester. The digital logic blocks
are tested using full scan methodology and functional test
vectors, and the test interface allows access for these paths. For
device burn-in only, the logic incorporates a BIST block for the
DRAM array. This ensures that the DRAM is cycled during post-
package burn-in testing with only the requirement for power,
ground, and a few control pins connected to a tester.
Thedeviceisimplementedina0.35µm/0.5µmblendedDRAMand
logic process using triple-level-metal. Table 1 shows the fea-
tures of the chip. Figure 6 is a chip micrograph.
Acknowledgments:
The authors acknowledge the support of M. Liang and C. Yoo of
TSMC in developing this chip.
References:
[1] Miyano, S., et al., “A 1.6GB/s Data Transfer Rate 8Mb Embedded
DRAM,” IEEE Journal of Solid-State Circuits, vol. 30, no. 11, pp. 1281-
1285, Nov., 1995.
[2] Elliott, D., et al., “Computational RAM: A Memory SIMD Hybrid and
ItsApplicationtoDSP,”CustomIntegratedCircuitsConference,May,1992.
21.5-2© IEEE 1998
Figure 1: Chip architecture.
Figure 2: Frame buffer bank.Figure 3a: Traditional
DRAM.
Figure 3b: Wide databus scheme.Figure 3a: Traditional DRAM.
Figure 5a: PPU register bit.
Figure 5b: RASTOP processor.
21.5-3© IEEE 1998
SA 21.5: A 33Gb/s 13.4Mb Integrated Graphics Accelerator and Frame Buffer
Maximum Screen Size 1280x1024x8b/pixel
1024x768x16b/pixel
800x600x24b/pixel
Technology 0.35µm/0.5µm BlendIC, 3M
DRAM 13.4Mb
Pitch-matched logic 160k gates
SRAM 38kb
Maximum System Clock Rate 66MHz
Maximum Pixel Clock Rate 135MHz
Supply Voltage 3.3V
Package 208-pin QFP
Table 1: Chip overview.
Figure 4: Pixel processing unit.
Figure 6: Chip micrograph.
21.5-4© IEEE 1998
Figure 1: Chip architecture.
21.5-5© IEEE 1998
Figure 2: Frame buffer bank.Figure 3a: Traditional
DRAM.
21.5-6© IEEE 1998
Figure 3a: Traditional DRAM.
21.5-7© IEEE 1998
Figure 3b: Wide databus scheme.
21.5-8© IEEE 1998
Figure 4: Pixel processing unit.
21.5-9© IEEE 1998
Figure 5a: PPU register bit.
21.5-10© IEEE 1998
Figure 5b: RASTOP processor.
21.5-11© IEEE 1998
Figure 6: Chip micrograph.
21.5-12© IEEE 1998
Maximum Screen Size 1280x1024x8b/pixel
1024x768x16b/pixel
800x600x24b/pixel
Technology 0.35µm/0.5µm BlendIC, 3M
DRAM 13.4Mb
Pitch-matched logic 160k gates
SRAM 38kb
Maximum System Clock Rate 66MHz
Maximum Pixel Clock Rate 135MHz
Supply Voltage 3.3V
Package 208-pin QFP
Table 1: Chip overview.

More Related Content

What's hot

Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...VLSICS Design
 
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESPERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESBUKYABALAJI
 
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Dhiraj Chaudhary
 
Effective Sparse Matrix Representation for the GPU Architectures
 Effective Sparse Matrix Representation for the GPU Architectures Effective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU ArchitecturesIJCSEA Journal
 
Kim hybrid memristor_nl2012
Kim hybrid memristor_nl2012Kim hybrid memristor_nl2012
Kim hybrid memristor_nl2012Damoum Atta
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009Léia de Sousa
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Advantages of 64 Bit 5T SRAM
Advantages of 64 Bit 5T SRAMAdvantages of 64 Bit 5T SRAM
Advantages of 64 Bit 5T SRAMIJSRED
 

What's hot (14)

Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Speed...
 
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESPERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
 
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
 
SRAM
SRAMSRAM
SRAM
 
Sram technology
Sram technologySram technology
Sram technology
 
Low power sram
Low power sramLow power sram
Low power sram
 
Effective Sparse Matrix Representation for the GPU Architectures
 Effective Sparse Matrix Representation for the GPU Architectures Effective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU Architectures
 
Memory base
Memory baseMemory base
Memory base
 
Kim hybrid memristor_nl2012
Kim hybrid memristor_nl2012Kim hybrid memristor_nl2012
Kim hybrid memristor_nl2012
 
1904 1908
1904 19081904 1908
1904 1908
 
Pci
PciPci
Pci
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Advantages of 64 Bit 5T SRAM
Advantages of 64 Bit 5T SRAMAdvantages of 64 Bit 5T SRAM
Advantages of 64 Bit 5T SRAM
 

Viewers also liked

Clase #3 de access corta
Clase #3 de access cortaClase #3 de access corta
Clase #3 de access cortaEsmeralda2227
 
Reverse Engineering Modeling 1
Reverse Engineering Modeling 1Reverse Engineering Modeling 1
Reverse Engineering Modeling 1IvanMcKinney
 
the world of P.E.I
the world of P.E.Ithe world of P.E.I
the world of P.E.IAlan Stange
 
Metal Irrigation Enclosures
Metal Irrigation EnclosuresMetal Irrigation Enclosures
Metal Irrigation EnclosuresIvanMcKinney
 
Reverse Engineering Modeling 2
Reverse Engineering Modeling 2Reverse Engineering Modeling 2
Reverse Engineering Modeling 2IvanMcKinney
 
Decorative Plumbing
Decorative PlumbingDecorative Plumbing
Decorative PlumbingIvanMcKinney
 
Photo Storyboard Presentation!
Photo Storyboard Presentation!Photo Storyboard Presentation!
Photo Storyboard Presentation!guest10be464
 

Viewers also liked (9)

Clase #3 de access corta
Clase #3 de access cortaClase #3 de access corta
Clase #3 de access corta
 
Reverse Engineering Modeling 1
Reverse Engineering Modeling 1Reverse Engineering Modeling 1
Reverse Engineering Modeling 1
 
the world of P.E.I
the world of P.E.Ithe world of P.E.I
the world of P.E.I
 
Metal Irrigation Enclosures
Metal Irrigation EnclosuresMetal Irrigation Enclosures
Metal Irrigation Enclosures
 
Reverse Engineering Modeling 2
Reverse Engineering Modeling 2Reverse Engineering Modeling 2
Reverse Engineering Modeling 2
 
Decorative Plumbing
Decorative PlumbingDecorative Plumbing
Decorative Plumbing
 
Photo Storyboard Presentation!
Photo Storyboard Presentation!Photo Storyboard Presentation!
Photo Storyboard Presentation!
 
Mexico
MexicoMexico
Mexico
 
Dolphin Mold
Dolphin MoldDolphin Mold
Dolphin Mold
 

Similar to Accelerix ISSCC 1998 Paper

Memory module
Memory moduleMemory module
Memory modulelemar12
 
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET Journal
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldPraveen Kumar
 
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...Léia de Sousa
 
A Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory OptimizationA Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory Optimizationijsrd.com
 
Abstract The Prospect of 3D-IC
Abstract The Prospect of 3D-ICAbstract The Prospect of 3D-IC
Abstract The Prospect of 3D-ICvishnu murthy
 
Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Joshua Mora
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...
Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...
Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...IRJET Journal
 
Term paper
Term paperTerm paper
Term paperBilal201
 
301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilogSrinivas Naidu
 
memeoryorganization PPT for organization of memories
memeoryorganization PPT for organization of memoriesmemeoryorganization PPT for organization of memories
memeoryorganization PPT for organization of memoriesGauravDaware2
 
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee...
 Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee... Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee...
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee...VLSICS Design
 

Similar to Accelerix ISSCC 1998 Paper (20)

Memory module
Memory moduleMemory module
Memory module
 
Ultrasparc3 (mpr)
Ultrasparc3 (mpr)Ultrasparc3 (mpr)
Ultrasparc3 (mpr)
 
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworld
 
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
 
A Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory OptimizationA Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory Optimization
 
Abstract The Prospect of 3D-IC
Abstract The Prospect of 3D-ICAbstract The Prospect of 3D-IC
Abstract The Prospect of 3D-IC
 
Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Improving DRAM performance
Improving DRAM performanceImproving DRAM performance
Improving DRAM performance
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...
Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...
Design and Performance Evaluation of a 64-bit SRAM Memory Array Utilizing Mod...
 
DDR DIMM Design
DDR DIMM DesignDDR DIMM Design
DDR DIMM Design
 
Term paper
Term paperTerm paper
Term paper
 
301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog
 
Slides of talk
Slides of talkSlides of talk
Slides of talk
 
memeoryorganization PPT for organization of memories
memeoryorganization PPT for organization of memoriesmemeoryorganization PPT for organization of memories
memeoryorganization PPT for organization of memories
 
00364438
0036443800364438
00364438
 
Intelligent ram
Intelligent ramIntelligent ram
Intelligent ram
 
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee...
 Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee... Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee...
Using CMOS Sub-Micron Technology VLSI Implementation of Low Power, High Spee...
 

Accelerix ISSCC 1998 Paper

  • 1. 21.5-1© IEEE 1998 SA 21.5: A 33GB/s 13.4Mb Integrated Graphics Accelerator and Frame Buffer R. Torrance, I. Mes, B. Hold, D. Jones, J. Crepeau, P. DeMone, D. MacDonald, C. O’Connell, P. Gillingham, R. White1 , S. Duggins2 , D. Fielder2 MOSAID Technologies Inc., Kanata, Ontario, Canada 1 Accelerix Incorporated, Ottawa, Ontario, Canada 2 Symbionics Ltd., Cambridge, England, UK With the increasing capacity of commodity DRAM, a dichotomy has appeared between memory requirements of PC graphics systems, and standard DRAM for PC main memory. PC graphics subsystems require wide, relatively shallow memory for high data bandwidth while commodity DRAM is too narrow and too deep. Integrating DRAM frame buffer and graphics accelerator logic on the same chip solves this problem (Figure 1). Reported integrated DRAM and logic devices separate DRAM and ASIC logic portions with a traditional memory interface [1]. Although this approach has a number of benefits over the discrete solution, much greater improvements are available by more tightly integrating the DRAM and logic, something not possible at the board level. This device integrates parts of the graphics processor within the DRAM to increase performance. Figure 2 shows the architecture of one bank of the frame buffer, where the pixel processing unit (PPU) and the serial output registers (SORs) are integrated into the DRAM architecture [2]. This allows the bus width between the DRAM frame buffer and the processor to be 4096b. Figure 3a shows a traditional DRAM, where a relatively narrow databus runs the length of the sense amp straps. To allow for massively-parallel DRAM access, the frame buffer uses databuses orthogonal to the standard direction (Figure 3b). These databuses can be routed in metal2 (M2) over an array with capacitor-over-bitline processing. This allows up to 1 databus for each column of DRAM, depending on the M2 pitch of the process. The device employs 8 banks, each of 512b databus width. The PPU is 512b wide per bank, requiring one bit of the PPU to pitch-match to 4 DRAM columns. With this architecture, the choice of circuits included in this massively parallel processor must be judicious. It is limited to the most basic pixel operations: the raster operations. Figure 4 shows a single bit of the PPU, and how it is pitch-matched to the DRAM to accelerate raster operations. The PPU registers are built using dual port 6T SRAM cells as shown in Figure 5a, while the rastop function unit is implemented with a pMOS based 8-1 multiplexor as shown in Figure 5b. This allows the PPU to perform any of the 256 rastops possible with 3 input variables. All 512b are identical, and since there is no dataflow directly between bits, redundant PPU bits can be easily added (pitch-matched to the DRAM redundant columns). This allows more aggressive core rules for the PPU to reduce area. This frame buffer architecture allows the graphics controller to accelerate some of the most common graphic operations. For instance, a block move can require a source pixel read, a destination pixel read, a rastop, and a destination pixel write. In a typical system with a 64b memory interface, this is done 64b at a time. In this chip, all reads can be done at up to 4096b wide, as can the actual raster operation, and the write back. For operationsthatrequirerealignmentofsourcepixelstodestination pixels, a 32b funnel shifter is built under the FB data-bus in each bank, allowing realignment of 256b at a time. Also, a 32b word can be written to all 16 PPU words simultaneously. If all 8 banks are enabled, this allows for a 4096b write to DRAM in a single cycle, useful for screen clearing. The gate array logic incorporates the graphic accelerator, PCI interface,VGAcoreandvideoinputblocktoprovidethenecessary system level functionality for a PC graphics device. Within the accelerator logic the highly-parallel nature of the DRAM and its embedded logic gives a much larger control space than a normal external D/VRAM interface. This requires a novel approach to BITBlt operation since data alignment, manipulation(ROP) and masking must be performed in the most optimal way to achieve the best performance. The availability of 5 512B registers within the PPUs allows data to be cached for some operations to increase performance further. The custom-designed pixel output path (POP) logic includes a 64b hardware cursor, a video scaler with a 4-tap FIR filter for horizontal scaling, and 2-tap for vertical scaling, color space conversion circuits, and a 135MHz LUTDAC. This device uses dual fully integrated PLLs, one for the 66MHz main clock (MCLK), and one for the 135MHz pixel clock (PCLK). Noise is a concern on a mixed DRAM/logic chip. Noise from the ASIC logic can cause a degradation of the refresh time for DRAM cells. Refresh time is programmable, to allow the use of the maximum refresh time possible after processing. The device employs separate power pins and on-chip power rails for DRAM, gate array, PLLs, DACs, and POP logic. Critical circuitry is guarded with substrate and well guard rings. Testing methodologies have yet to be standardized for mixed DRAM/logicchips.AlthoughBISTisusefulforembeddedSRAMs, redundancy and complicated cell coupling tests make it difficult to use for large embedded DRAMs. BIST also complicates the debug of an embedded DRAM, since the controllability and observability of that DRAM is severely limited. In this device all DRAM data and control is multiplexed onto the device pins in test mode, using an on-chip test interface. This circuit makes the entire frame buffer appear similar to a standard asynchronous DRAM in memory test , allowing use of standard DRAM test software. All bits of the SORs and PPUs are mapped into unused row addresses for test. Even the rastop processor is mapped to unused row addresses to allow the entire frame buffer to be easily tested on a standard DRAM tester. The digital logic blocks are tested using full scan methodology and functional test vectors, and the test interface allows access for these paths. For device burn-in only, the logic incorporates a BIST block for the DRAM array. This ensures that the DRAM is cycled during post- package burn-in testing with only the requirement for power, ground, and a few control pins connected to a tester. Thedeviceisimplementedina0.35µm/0.5µmblendedDRAMand logic process using triple-level-metal. Table 1 shows the fea- tures of the chip. Figure 6 is a chip micrograph. Acknowledgments: The authors acknowledge the support of M. Liang and C. Yoo of TSMC in developing this chip. References: [1] Miyano, S., et al., “A 1.6GB/s Data Transfer Rate 8Mb Embedded DRAM,” IEEE Journal of Solid-State Circuits, vol. 30, no. 11, pp. 1281- 1285, Nov., 1995. [2] Elliott, D., et al., “Computational RAM: A Memory SIMD Hybrid and ItsApplicationtoDSP,”CustomIntegratedCircuitsConference,May,1992.
  • 2. 21.5-2© IEEE 1998 Figure 1: Chip architecture. Figure 2: Frame buffer bank.Figure 3a: Traditional DRAM. Figure 3b: Wide databus scheme.Figure 3a: Traditional DRAM. Figure 5a: PPU register bit. Figure 5b: RASTOP processor.
  • 3. 21.5-3© IEEE 1998 SA 21.5: A 33Gb/s 13.4Mb Integrated Graphics Accelerator and Frame Buffer Maximum Screen Size 1280x1024x8b/pixel 1024x768x16b/pixel 800x600x24b/pixel Technology 0.35µm/0.5µm BlendIC, 3M DRAM 13.4Mb Pitch-matched logic 160k gates SRAM 38kb Maximum System Clock Rate 66MHz Maximum Pixel Clock Rate 135MHz Supply Voltage 3.3V Package 208-pin QFP Table 1: Chip overview. Figure 4: Pixel processing unit. Figure 6: Chip micrograph.
  • 4. 21.5-4© IEEE 1998 Figure 1: Chip architecture.
  • 5. 21.5-5© IEEE 1998 Figure 2: Frame buffer bank.Figure 3a: Traditional DRAM.
  • 6. 21.5-6© IEEE 1998 Figure 3a: Traditional DRAM.
  • 7. 21.5-7© IEEE 1998 Figure 3b: Wide databus scheme.
  • 8. 21.5-8© IEEE 1998 Figure 4: Pixel processing unit.
  • 9. 21.5-9© IEEE 1998 Figure 5a: PPU register bit.
  • 10. 21.5-10© IEEE 1998 Figure 5b: RASTOP processor.
  • 11. 21.5-11© IEEE 1998 Figure 6: Chip micrograph.
  • 12. 21.5-12© IEEE 1998 Maximum Screen Size 1280x1024x8b/pixel 1024x768x16b/pixel 800x600x24b/pixel Technology 0.35µm/0.5µm BlendIC, 3M DRAM 13.4Mb Pitch-matched logic 160k gates SRAM 38kb Maximum System Clock Rate 66MHz Maximum Pixel Clock Rate 135MHz Supply Voltage 3.3V Package 208-pin QFP Table 1: Chip overview.