Improving DRAM performance
Prithvi Kambhampati
Master of Science, Electrical and Computer Engineering
Michigan Technological University
Houghton, Michigan
pkambham@mtu.edu
Abstract—In order to reduce the growing gap between the clock speed of processors and that of memory, more research than ever is being devoted to improving memory performance. Dynamic Random Access Memory (DRAM) is now also used in the cache to make memory accesses faster by reducing miss rate and latency, which makes DRAM performance improvement an important aspect of today's computation. DRAM cells are refreshed periodically at the rank level to keep data loss to a minimum, and during a refresh the entire rank cannot accept memory requests; this is one of the major challenges the DRAM technology faces. Improvements to DRAM can be made at four different levels: the chip level, the bank level, the subarray level, and the row level. One method is to reorganize the structure of the banks and the row buffer to improve DRAM hit rates. Another is to use light to transmit data between the processor and the memory system, reducing power consumption and increasing bandwidth. We also examine the set mapping policies with which data is accessed from DRAM rows and discuss which best improves the hit rate and reduces latency. This paper shows that the methods implemented to improve DRAM performance are significantly effective. In addition, we discuss the errors that occur in DRAMs and describe error-resilient schemes, such as single-subarray memory systems with chipkill support, that can overcome bit failures.
Index terms—Dynamic Random Access Memory,
chip level, bank level, subarray level, row level.
I. INTRODUCTION
In the past, the clock rates of microprocessors increased exponentially due to process improvements, longer pipelines, and circuit design techniques, but main memory speed has not grown as fast as that of the processors. In addition, the number of cores on a single chip has been increasing and is expected to grow further, which raises the aggregate demand for off-chip memory and makes main memory accesses even more of a bottleneck. To address this problem, we need a memory system that is fast, big, and cheap. Static Random Access Memory (SRAM) is used in caches for its speed but is not used at large scale because of its cost and low capacity, whereas DRAM is used in main memory for its large capacity and low cost. Improving the efficiency of DRAM has therefore become a priority in recent years, and many methods have been proposed to reduce data loss and improve throughput and power efficiency. One solution is to add a DRAM cache to the memory hierarchy. DRAM has recently been employed in the memory hierarchy because it increases cache capacity through its higher density compared to SRAM cells, and it offers higher bandwidth and lower latency than off-chip memory. DRAM caching therefore appears to be a good way to bring down the memory wall (the gap between processor speed and memory speed). The increased use of DRAM in this role has led to more and more research by both industry and academic institutions, whose main aim is to improve the performance of DRAM in today's computation, and many methods have been proposed for a given limited off-chip memory bandwidth. Like many devices, a DRAM chip has a structure (discussed below) and can be subdivided into many parts, which means there is an opportunity to improve the characteristics of each of these parts.
A DRAM chip is made of capacitor-based cells that
represent the data in the form of electric charge. To
store data in a cell, charge is injected, whereas to
retrieve data, the charge is extracted [2]. As shown in
figure 1, a typical DRAM chip has a hierarchy which
consists of multiple banks, a shared internal bus for
reading/writing data, and a chip I/O through which
memory is transferred between the DRAM chip and other memory units. Each bank is sub-divided into subarrays and a bank I/O [10]. Furthermore, the subarrays are arranged as 2D arrays of DRAM cells, along with a common row buffer that consists of SRAM cells and buffers one row of the DRAM bank.
Data can only be accessed after it is fetched to the row
buffer. Any attempt to read the data from the same
row will result in directly reading from the row buffer.
Accessing data (in the form of a cache line) from a subarray involves multiple steps. First, because data can only be read through the row buffer, the target row must be activated so that its contents are transferred from the DRAM cells to the row buffer. Second, once the row is activated, the cache line is read from or written to the row buffer, with the data transferred to or from the corresponding cells over the internal bus of the DRAM chip. Finally, the row buffer has to be precharged (cleared) to prepare for subsequent requests.
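To make this three-step access sequence concrete, the following minimal Python sketch models a single bank with one row buffer; the class name and the latency constants are illustrative assumptions, not values from any particular DRAM part.

```python
# A minimal sketch (assumed timings, not from any datasheet) of the
# activate -> read/write -> precharge sequence described above.
T_ACTIVATE, T_COLUMN, T_PRECHARGE = 15, 15, 15   # assumed latencies in ns


class Bank:
    def __init__(self):
        self.open_row = None          # row currently held in the row buffer

    def access(self, row):
        """Return the latency (ns) to read a cache line from `row`."""
        latency = 0
        if self.open_row != row:      # row-buffer miss
            if self.open_row is not None:
                latency += T_PRECHARGE   # write the old row back, clear the buffer
            latency += T_ACTIVATE        # fetch the requested row into the row buffer
            self.open_row = row
        latency += T_COLUMN              # column read/write from the row buffer
        return latency


bank = Bank()
print(bank.access(row=3))   # miss: activate + column access
print(bank.access(row=3))   # hit: column access only (row buffer reused)
print(bank.access(row=7))   # conflict: precharge + activate + column access
```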
Figure 1. Organization of a DRAM chip [10]
(Taken without permission)
In this paper, we are going to discuss the various
levels at which the DRAM performance can be
improved and the methods to do so. We observe four different levels at which modifications can be made, each with multiple proposals. The first level is the chip level. At this level, there
is a memory channel with a memory controller, which
manages the set of DRAM banks present on the chip.
The memory channel has a three bus system which
includes a command bus, a read bus, and a write bus.
Each of these buses have I/O pins as well. These buses
and I/O access points can be partially/completely
replaced by the Photonically Interconnected DRAM
(PIDRAM) [4] technology, which provides energy
efficient communication. The photonic technology
uses light to transmit data between the processor and
the memory. To transmit data/commands, external
light (typically from a laser) is passed through
resonators which give that light a unique wavelength.
This modulated light is received by a photodetector
and is converted to electricity and the data/commands
are transferred. The advantage with this technology is
that multiple wavelengths can be transmitted at once, allowing us to transmit more data than usual at low power. The second level is the bank level. At this level, PIDRAM technology can be used to reorganize the banks [4] to save energy. Another
method to improve the performance of DRAM at this
level is to process DRAM requests in batches [9]. The third level is the subarray level. One
idea is to have a hierarchical multi-bank DRAM [3] in
which the subarrays are converted to semi-independent sub-banks, taking advantage of the fact that most DRAM accesses occur locally
within the subarrays. This allows the subarrays to act
independently for such accesses and makes the
process faster. The last level that can be modified is
the row level. In a DRAM cache, to make memory access straightforward, memory blocks are mapped to a particular set of a particular row of a particular bank.
These set-mapping policies [1] concentrate either on improving the hit rate or on decreasing the latency. Another change that can be made at this level is dividing the
row buffer into multiple smaller row buffers [7].
Figure 2. DRAM Memory System – Each inset shows detail for a different level of current electrical DRAM
memory systems. [4]
(Taken without permission)
II. CHIP LEVEL
A DRAM chip consists of a shared internal bus,
multiple banks, a chip I/O and a memory channel
controlled by a memory controller. This section
describes different ways in which we can modify the
above parts of the chip to improve performance. One
such way is to use light to transmit data among the
parts of the DRAM chip. The following is an introduction to silicon photonic technology, which can partially replace the conventional electrical circuitry.
PHOTONICALLY INTERCONNECTED DRAM
The off-chip memory bandwidths are not likely to
match up to the performance of the processor. This
has been reducing the maximum achievable system
performance since 2008. The number of pins on the board is limited by the area and power overheads of high-speed transceivers and package interconnect.
The number of packets transferred per pin can be
increased but only at the expense of using up more
energy. As described in the introduction, a DRAM
memory channel uses a memory controller to manage
a set of DRAM banks that are distributed across one
or more DRAM chips. We can overcome these
challenges by redesigning the DRAM memory using
Photonically Interconnected DRAM (PIDRAM) [4],
which uses a monolithically integrated silicon-
photonic technology. This technology uses light to
transfer data instead of electrical signalling. First, light from a laser is passed through a series of resonators. These resonators modulate the light, which is then transmitted from the processor to the PIDRAM chip. At the PIDRAM chip, this light is received, demodulated using filters, and converted to an electrical signal using a photodetector. The advantages of this technology are that very little power is required to transmit data, larger off-chip bandwidths are supported at minimal power consumption, and data can be transmitted on multiple wavelengths at once, allowing multiple data packets to be transferred simultaneously. This is called dense wavelength division multiplexing (DWDM) [4] and allows multiple links (wavelengths) to share the same medium (fibre or waveguide). The electrical I/O in DRAM chips can be replaced by these energy-efficient photonic links. By redesigning DRAM banks to provide greater bandwidth from an individual array core, we can meet the bandwidth demands; this also reduces the energy required to activate the banks. We should keep in mind that not all electrical circuits can be replaced by this technology, as it needs more area than a simple electrical circuit.
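As a rough illustration of why DWDM raises the bandwidth of a single fibre or waveguide, the short Python calculation below multiplies an assumed number of wavelengths by an assumed per-wavelength bit rate; the figures are examples, not numbers reported in [4].

```python
# Back-of-envelope illustration of why DWDM raises per-fibre bandwidth:
# the numbers below are assumptions for the example, not figures from [4].
wavelengths_per_waveguide = 64      # independent DWDM channels sharing one fibre
bitrate_per_wavelength_gbps = 10    # each wavelength carries its own data stream

aggregate_gbps = wavelengths_per_waveguide * bitrate_per_wavelength_gbps
print(f"Aggregate bandwidth per waveguide: {aggregate_gbps} Gb/s "
      f"({aggregate_gbps / 8:.0f} GB/s)")
```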
A. PIDRAM memory channel organization
A memory controller manages a set of DRAM
banks that are distributed across many DRAM chips.
This memory system has three logical buses: a command bus, a write data bus, and a read data bus. We can implement these buses using the photonic components
Shared Photonic Bus:
All three logical buses can be implemented using a shared photonic bus, which works like a standard electrical bus. In this implementation, the memory controller first issues a command to all the banks, and each bank determines whether it is the target. For a write command, the target bank tunes its photonic receiver in on the write-data bus, the memory controller places the data on that bus, and the target bank receives the data and performs the write operation. For a read command, the target bank performs its read operation and sends the data back over the read data bus.
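The following Python sketch captures the shared-bus handshake just described: the command is broadcast to every bank, only the target bank reacts, and data then moves on the write-data or read-data bus. The class names and method structure are assumptions made for illustration.

```python
# A minimal sketch of the shared photonic bus protocol described above.
class PIDRAMBank:
    def __init__(self, bank_id):
        self.bank_id = bank_id
        self.rows = {}

    def handle(self, cmd, bank_id, row, data=None):
        if bank_id != self.bank_id:          # not the target: ignore the broadcast
            return None
        if cmd == "WRITE":                   # tune in on the write-data bus
            self.rows[row] = data
            return "ACK"
        if cmd == "READ":                    # drive the read-data bus
            return self.rows.get(row)


class MemoryController:
    def __init__(self, banks):
        self.banks = banks

    def issue(self, cmd, bank_id, row, data=None):
        # The command is seen by every bank; only the target responds.
        for bank in self.banks:
            result = bank.handle(cmd, bank_id, row, data)
            if result is not None:
                return result


mc = MemoryController([PIDRAMBank(i) for i in range(4)])
mc.issue("WRITE", bank_id=2, row=5, data=b"cacheline")
print(mc.issue("READ", bank_id=2, row=5))   # b'cacheline'
```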
Figure 3. Shared Photonic Buses [4]
(Taken without permission)
Split Photonic Bus:
In this implementation, the long shared bus is
divided into multiple branches. The laser power is
sent to all the receivers of the command and write bus,
and the modulators of the read bus. However, the total
laser power is roughly a linear function of the number
of banks. This reduces the required optical laser power compared to the shared photonic bus, but it also reduces the effective bandwidth density of the photonic devices.
Figure 4. Monolithically integrated silicon-photonic technology - Two DWDM links in opposite directions
between a memory controller in a processor chip and a bank in a PIDRAM chip. λ1 is used for the request and λ2 is
used for the response in the opposite direction on the same waveguides and fibre. [4]
(Taken without permission)
Figure 5.a. Split photonic buses [4]
(Taken without permission)
Guided Photonic Bus:
The optical power can be reduced further with this implementation. A guided photonic bus uses optical power guiding, in the form of demultiplexers, to actively direct power to just the target bank. This keeps the total power roughly constant and independent of the number of banks.
Figure 5.b. Guided photonic buses [4]
(Taken without permission)
B. PIDRAM Chip Organization
We have discussed above different ways in which
the buses can be implemented photonically. In practice, only a portion of the buses may be implemented photonically and the rest electrically; the design choice is made based on trade-offs in power and area. The photonics can be
gradually extended deeper into the PIDRAM chip.
Figure 6. PIDRAM chip floorplan [4]
(Taken without permission)
The vertical electrical data bus can be partitioned
into ‘n’ partitions and all the photonic circuits should
be replicated at each data access point for each bus
partition. Partitioning the data bus allows the DRAM
chip to use an energy-efficient photonic interconnect. However, it also increases the fixed link power and incurs higher optical losses.
III. BANK LEVEL
Each bank consists of multiple subarrays and a
bank I/O. Data is accessed in the form of cache lines
from each subarray. This requires activation of the
row containing the cache line, reading/writing the
cache line, and precharging the subarray to prepare for
subsequent requests. This section deals with a novel way to organize the banks and a request-scheduling algorithm, which help increase the number of instructions executed.
A. PIDRAM Bank Organization
Most of the energy consumed in a DRAM chip is
by the banks themselves. Every array block in a bank
access activates an array core, which activates an
entire array core row. From this array core row, only a few bits of data are used; most of the bank energy is spent waking up these unnecessary bits. This wastage of energy can be reduced either by decreasing the array core row size, which reduces the number of unnecessary bits being activated, or by increasing the number of I/Os per array core and using fewer array cores in parallel. Decreasing the array core row size leads to a greater area penalty, so the access efficiency has to be improved by increasing the number of I/Os per array core. In a purely electrical design there is little motivation to make this change, because the energy consumed by the banks is small compared to that of the electrical inter-chip and intra-chip interconnect, and the number of pins per chip is limited.
The increased bandwidth allows more banks per chip. The high bandwidth also allows energy savings and does not affect the area of PIDRAM significantly, which suggests that photonic technology will play an important role in future multiprocessor performance. Upcoming PIDRAMs should not
only concentrate on high performance, low cost, and
energy efficiency at the chip level, but also support a
large range of multi-chip configurations with different
capacities and bandwidths.
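A back-of-envelope Python calculation makes the overfetch problem concrete; the row size, cache-line size, and per-bit activation energy below are assumed values for illustration only.

```python
# Rough illustration (assumed numbers) of activation overfetch: most of the
# energy of an activation goes to bits that are never transferred.
row_size_bits = 8 * 1024 * 8          # assumed bits activated per array core row
bits_used = 64 * 8                    # one 64-byte cache line actually read out
energy_per_bit_pj = 1.0               # assumed activation energy per bit (pJ)

activation_energy_pj = row_size_bits * energy_per_bit_pj
wasted_fraction = 1 - bits_used / row_size_bits
print(f"Activation energy: {activation_energy_pj / 1000:.1f} nJ, "
      f"of which {wasted_fraction:.1%} is spent on unused bits")
```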
B. Parallelism-aware batch scheduling
In a chip multiprocessor (CMP) system, the
DRAM is a frequently used resource. Inter-thread
interference can destroy the bank-level access
parallelism of individual threads. Bank level
parallelism [8] [9] is a method in which the requests
made by the threads are serviced in parallel in
different banks. Parallelism-aware batch scheduling builds on bank-level parallelism by taking these requests, dividing them into batches, and servicing them batch by batch. This method can be divided into two steps:
i. Request Batching
A number of DRAM requests are grouped into a
batch. These batches are completed one after the other, and this step ensures that all the requests in one batch are completed before the next batch is formed. The serviced batch is removed from the memory request buffer, and only then is the new batch formed. When forming a new batch, the batching component decides how many requests issued by a thread for a certain bank can be part of the batch. Batching not only ensures that all requests are eventually serviced, but also provides a uniform granularity that improves performance.
A fixed number of DRAM requests are grouped
into a batch. This is done based on the arrival time of
these requests. Even though there is interference from
other threads, the bank level access parallelism of
each thread is preserved. This guarantees that the oldest batch is served first, by prioritizing the oldest requests, and also prevents any thread from being starved in the DRAM system due to interference from other, potentially aggressive threads. Batching reduces the serialization of each thread's requests by servicing them in parallel across banks rather than one at a time in the memory system.
ii. Parallelism-Aware Within-Batch Scheduling
In this step, the requests of each thread in a batch are serviced in parallel across the DRAM banks. This hides latency inside the batch and also increases processor throughput, as many requests are serviced in parallel. Parallelism-aware within-batch scheduling tries to maximize the:
Row-buffer locality:
Bank accesses will have lower latencies if a high
row-hit rate is present within a batch.
Intra-thread bank parallelism:
Scheduling multiple requests from a thread to
various banks in parallel reduces the thread’s stall
time.
This scheduling uses thread prioritization to exploit both row-buffer locality and bank parallelism. Thread ranking is done with a max rule, in which the scheduler finds, for each thread, the maximum number of marked requests it has outstanding to any single bank, and a tie-breaker total rule, in which the scheduler tracks each thread's total number of marked requests (its total load) and assigns the higher rank to the thread with the lower total load [9].
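The two steps can be summarized in the simplified Python sketch below. The marking cap, the request representation, and the exact tie-breaking are assumptions for illustration and are much simpler than the hardware scheduler described in [9].

```python
from collections import defaultdict

MARKING_CAP = 5        # assumed: max marked requests per (thread, bank) in a batch


def form_batch(request_queue):
    """Request batching: mark the oldest requests, at most MARKING_CAP per (thread, bank)."""
    marked_per_thread_bank = defaultdict(int)
    batch = []
    for req in request_queue:                      # queue is ordered by arrival time
        key = (req["thread"], req["bank"])
        if marked_per_thread_bank[key] < MARKING_CAP:
            marked_per_thread_bank[key] += 1
            batch.append(req)
    return batch


def rank_threads(batch):
    """Max rule, then total-load tie-breaker: lighter threads get higher rank."""
    per_bank = defaultdict(int)
    total_load = defaultdict(int)
    for req in batch:
        per_bank[(req["thread"], req["bank"])] += 1
        total_load[req["thread"]] += 1
    max_load = defaultdict(int)
    for (thread, _bank), load in per_bank.items():
        max_load[thread] = max(max_load[thread], load)
    return sorted(total_load, key=lambda t: (max_load[t], total_load[t]))


def schedule_within_batch(batch, open_rows):
    """Within the batch, prefer row-buffer hits, then requests of higher-ranked threads."""
    rank = {t: i for i, t in enumerate(rank_threads(batch))}
    return sorted(batch, key=lambda r: (open_rows.get(r["bank"]) != r["row"],
                                        rank[r["thread"]]))


queue = [{"thread": 0, "bank": 0, "row": 1}, {"thread": 1, "bank": 0, "row": 2},
         {"thread": 1, "bank": 1, "row": 3}, {"thread": 0, "bank": 1, "row": 1}]
batch = form_batch(queue)
for req in schedule_within_batch(batch, open_rows={0: 2}):
    print(req)
```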
IV. SUBARRAY LEVEL
Each subarray consists of a two-dimensional array of DRAM cells. The data stored in these cells is accessed in terms of rows: a requested row is first brought into a row buffer that is common to all the rows of the subarray, and the data is then accessed from that row buffer. The following two sub-sections explain how accesses can be made faster by modifying the subarray.
A. Hierarchical multi-bank DRAM
Embedded DRAM or eDRAM is a dynamic
random-access memory integrated on the same die or
multi-chip module of an ASIC or microprocessor.
eDRAM allows for larger buses and higher operation
speeds, due to higher density of DRAM. eDRAM
cannot handle the number of memory accesses generated by a high-performance processor, which creates a bottleneck: successive accesses that need the same bank must queue up and serialize. One solution is parallelism-aware batch scheduling (discussed in III.B). Another is to simply increase the number of independent DRAM banks in order to lower the probability of a conflict, but increasing the number of independent banks requires a larger area. The number of independent banks can instead be increased without affecting the area much by allowing the subarrays themselves to act as banks whenever the DRAM chip receives a request to a particular subarray. This allows the subarrays to act as semi-independent banks [3].
After dividing the DRAM banks into subarrays,
for the subarrays to act as semi-independent sub-
banks, some additions and modifications have to be
made to each subarray. The banks in the DRAM chip
use registers and control logic to allow data to be accessed. This means that a few pipeline registers and controls, a set-reset flip-flop to hold the subarray output, and buffers to hold the addresses for the access should be added to each subarray. The access queues of the DRAM should also be modified to detect accesses that do not conflict and to start those non-conflicting accesses in parallel.
Figure 7.a. Modifications made to each subarray [3]
(Taken without permission)
Figure 7.b. Modifications made to the access queues [3]
(Taken without permission)
This is a useful approach, since a large part of
each DRAM access actually occurs only locally
within individual DRAM subarrays. Individual
subarrays within independent banks are controlled as
semi-independent subbanks that share the main
bank's I/O circuitry and decoders. The sub-banks perform substantially better while incurring only a small area penalty.
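A minimal Python sketch of the conflict check added to the access queues is shown below; the request representation and the in-order issue policy are assumptions for illustration, not the exact logic of [3].

```python
# Requests to different subarrays of the same bank may start in parallel,
# while requests to the same subarray must serialize.
def issuable_in_parallel(pending):
    """Return the subset of pending requests that can start this cycle."""
    busy_subarrays = set()
    issued = []
    for req in pending:                       # pending is in arrival order
        key = (req["bank"], req["subarray"])
        if key not in busy_subarrays:         # no conflict with an earlier issue
            busy_subarrays.add(key)
            issued.append(req)
    return issued


pending = [{"bank": 0, "subarray": 0, "row": 1},
           {"bank": 0, "subarray": 1, "row": 9},   # different subarray: parallel
           {"bank": 0, "subarray": 0, "row": 4}]   # same subarray: must wait
print(issuable_in_parallel(pending))
```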
B. Fault tolerance in DRAMs
Errors often occur in DRAMs, and they lead to significant downtime in datacentres, so the DRAM architecture has to be designed to provide a high standard of reliability. Error-resilient schemes, called chipkill [5], can be built to tolerate such bit failures. Isolating an entire cache line to a single small subarray on a single DRAM chip allows us to read an entire cache line out of a single DRAM array, so the potential for correlated errors is increased. In order to provide chipkill-level reliability in concert with the single small subarray, checksums [5] stored alongside each cache line in the DRAM are introduced, similar to those used in hard drives. Using the checksum we can provide robust error detection, and we can provide chipkill-level reliability through a Redundant Array of Inexpensive DRAMs [6]. In a Redundant Array of Inexpensive DRAMs, one device serves as a parity check for several others, just as a parity disk does in RAID. On an access, only one device out of every 'n' is read, and the checksum associated with the read block lets the Redundant Array of Inexpensive DRAMs controller know whether the read is correct. This approach is more effective in terms of area and energy than prior chipkill approaches, and its only cost is a performance penalty compared to a single-subarray memory system without chipkill.
Figure 8. Chipkill support in Single Sub
Array memory system (64KB) [5]
(Taken without permission)
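The sketch below combines the two ingredients named above, a per-cache-line checksum for detection and a RAID-style XOR parity device for reconstruction, in plain Python. The use of CRC32, the group size, and the data layout are illustrative assumptions rather than the exact organization of [5].

```python
import zlib


def xor_lines(*lines):
    """Bytewise XOR of equal-length cache lines."""
    out = bytearray(len(lines[0]))
    for line in lines:
        for i, b in enumerate(line):
            out[i] ^= b
    return bytes(out)


class ChipkillGroup:
    """N data chips plus one parity chip; one cache line per chip per address."""

    def __init__(self, n_data_chips, line_size=64):
        self.data = [dict() for _ in range(n_data_chips)]
        self.parity = {}
        self.line_size = line_size

    def write(self, chip, addr, line):
        self.data[chip][addr] = (line, zlib.crc32(line))   # checksum stored with the line
        blocks = [c.get(addr, (bytes(self.line_size), 0))[0] for c in self.data]
        self.parity[addr] = xor_lines(*blocks)              # parity across the chips

    def read(self, chip, addr):
        line, checksum = self.data[chip][addr]
        if zlib.crc32(line) == checksum:
            return line                                      # common case: checksum matches
        # Detected corruption: rebuild this chip's line from the others plus parity.
        others = [c.get(addr, (bytes(self.line_size), 0))[0]
                  for i, c in enumerate(self.data) if i != chip]
        return xor_lines(self.parity[addr], *others)


group = ChipkillGroup(n_data_chips=4)
group.write(chip=1, addr=0x40, line=b"A" * 64)
group.data[1][0x40] = (b"B" * 64, group.data[1][0x40][1])   # inject a corrupted read
print(group.read(chip=1, addr=0x40) == b"A" * 64)           # True: reconstructed
```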
V. ROW LEVEL
As the number of cores in a processor increases, the demand for off-chip memory increases, which exacerbates the main memory access bottleneck. Many solutions have been proposed for this problem. One of them is to use on-chip DRAM as the last level of cache to improve performance for a given off-chip memory bandwidth. This is called the on-chip DRAM cache; it increases cache capacity through the high density of DRAM and improves on-chip communication through a high-bandwidth, low-latency interconnect.
In a cache, the storage is mapped to the memory
addresses it serves. There are different ways this
mapping can be done. The choice of mapping is so critical to the design that caches are often named after it, as in an N-way set-associative cache. The same applies to DRAM. Each row in a subarray of a bank of a DRAM chip holds a series of DRAM cells, and all the rows in a subarray share a common row buffer that is used to access data. Using DRAM as a cache requires a mapping between its rows and the main memory address space. The following two sub-sections explain the ways to do so: the first explains how set mapping works, and later in the section we see how the row buffer can be modified to make data accesses faster.
A. Set mapping policy
As explained in the introduction, the DRAM cache
is a multi-bank system, with each bank having a
number of rows. The DRAM cache uses a set mapping policy [1], in which memory blocks are mapped to a particular set of a particular row of a particular bank. The set mapping policy directly affects the throughput of the system by affecting the DRAM cache miss rate and the DRAM cache hit latency, which makes it an important aspect of the cache design.
Figure 9. DRAM cache hierarchy (Intel)
(Taken from the website)
New DRAM set mapping policies are proposed
regularly to reduce the DRAM cache miss rate.
Through higher associativity we can achieve reduced
DRAM latency via improved row buffer hit rate. A
typical DRAM cache organization has multiple banks,
each with subarrays and each subarray containing an
array of rows and columns of DRAM cells. Each DRAM bank provides a row buffer, which buffers one row of that bank. Data in a DRAM bank is
accessed after it is fetched through the row buffer.
Figure 10. 29-way associativity for a 4KB row [1]
(Taken without permission)
Associativity involves a trade-off between hit ratio and search speed. A direct-mapped cache has a fast search but a poorer hit ratio, whereas a fully associative cache has a better hit ratio but a slower search. This implies that as the associativity increases, the hit ratio improves and the search speed decreases, so we need to settle on a reasonable associativity. As said before, a higher associativity decreases the cache miss rate significantly. The DRAM cache row is divided into tag blocks and cache lines, and each bank of the DRAM cache is associated with a row buffer that holds the last accessed row of that bank. If the associativity of the DRAM row organization is increased, a cache access first reads the tag block instead of the whole cache block, which reduces access latency. A higher-associativity cache might slightly increase the tag latency compared to a lower-associativity one, but the design benefits from the higher associativity because it reduces conflict misses. It also provides a higher row buffer hit rate compared to a simple cache, because more consecutive memory blocks are mapped to the same set.
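As an illustration of how such a mapping can be laid out, the Python sketch below decomposes a memory block address into a bank, a row within the bank, and a set within the row; the field widths are assumptions chosen so that consecutive blocks fall into the same row, which is the property that raises the row buffer hit rate.

```python
# A minimal sketch of a set-mapping policy. The sizes are assumed, not taken
# from [1]; the point is only the order of the address fields.
SETS_PER_ROW = 2 ** 5      # assumed: 32 sets per DRAM cache row
ROWS_PER_BANK = 2 ** 12    # assumed
NUM_BANKS = 2 ** 3         # assumed


def map_block(block_addr):
    """Map a memory block address to (bank, row, set) in the DRAM cache."""
    set_idx = block_addr % SETS_PER_ROW
    row_idx = (block_addr // SETS_PER_ROW) % ROWS_PER_BANK
    bank_idx = (block_addr // (SETS_PER_ROW * ROWS_PER_BANK)) % NUM_BANKS
    return bank_idx, row_idx, set_idx


# Consecutive memory blocks land in the same bank and row, so a streaming
# access pattern keeps hitting in the same row buffer.
for block in range(4):
    print(block, map_block(block))
```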
B. Modifying the row buffer
Present DRAM cache banks have a single row buffer. Having multiple smaller row buffers instead of the existing single large row buffer helps improve row hit rates and also reduces the energy required for row activation [7]. As explained earlier, after a read request the data has to be brought into the row buffer, and the row data is then read from the buffer. The width of each row buffer equals the width of the entire row, and it holds a few KB of data. The precharge writes back the
row buffer to the appropriate row after a column
read/write of the selected words from/to the row
buffer. The precharge operation involves charging
and discharging of a large number of capacitors.
In a multi-core processor memory, the memory
addresses are spread evenly across memory banks to
compensate for the relatively slow speed of DRAM.
This decreases the row buffer hit rate. We can
improve this reduced row buffer hit rate by dividing
the row buffer into multiple smaller row buffers. This new organization now requires sub-row activation in addition to row activation and row-buffer selection. It requires the controller to supply additional address bits to the DRAM cache to select which sub-row to activate and which row buffer to bring it into. The memory controller allocates and manages the row buffers, providing additional flexibility to implement many other buffer allocation policies.
Figure 11. Reorganized DRAM bank structure to support sub-rows and buffer selection [7]
(Taken without permission)
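The reorganization can be sketched in Python as a bank that keeps several small sub-row buffers and lets the controller allocate them; the buffer count, sub-row width, and the LRU allocation policy are assumptions for illustration, not the specific policy of [7].

```python
from collections import OrderedDict


class SubRowBufferedBank:
    """A bank with several small sub-row buffers instead of one full-row buffer."""

    def __init__(self, num_buffers=4, columns_per_subrow=16):
        self.buffers = OrderedDict()       # (row, sub-row index) -> buffered data, LRU order
        self.num_buffers = num_buffers
        self.columns_per_subrow = columns_per_subrow

    def access(self, row, column):
        subrow = column // self.columns_per_subrow      # extra address bits select the sub-row
        key = (row, subrow)
        if key in self.buffers:                          # sub-row hit: no activation needed
            self.buffers.move_to_end(key)
            return "hit"
        if len(self.buffers) == self.num_buffers:        # all buffers in use: evict the LRU one
            self.buffers.popitem(last=False)
        self.buffers[key] = f"data({row},{subrow})"      # activate only the requested sub-row
        return "miss"


bank = SubRowBufferedBank()
print([bank.access(row=r, column=0) for r in (1, 2, 3, 1)])   # ['miss', 'miss', 'miss', 'hit']
```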
VI. CONCLUSION
From the problems we discussed above, it is clear
that improving the memory system is the top priority
to achieve greater speeds. DRAM plays an important role in the memory system, and therefore more techniques should be used to improve it. DRAM is a hierarchical system with four levels, and the components at each of these levels can be improved by replacing, modifying, or reorganizing them. All of the performance techniques discussed above improve DRAM efficiency significantly, and they do so in different ways: some reduce power, some increase throughput, and some hide latency. The final taxonomy we obtained by analysing the various techniques to improve DRAM performance is shown in Figure 12. We can also conclude that photonic technology will play a very crucial role in the future of processors and memory systems.
ACKNOWLEDGEMENT
I thank Dr. Soner Onder for his valuable comments on the earlier drafts and for being patient throughout the process.
Figure 12. Resulting taxonomy of our analysis: DRAM performance improvement at the chip level (PIDRAM memory channel organization; PIDRAM chip organization), the bank level (PIDRAM bank organization; parallelism-aware batch scheduling), the subarray level (hierarchical multi-bank DRAM; fault tolerance), and the row level (set mapping policy; row buffer modification).
REFERENCES
[1] Hameed, F., Bauer, L., Henkel, J., "Architecting
On-Chip DRAM Cache for Simultaneous Miss Rate
and Latency Reduction," in Computer-Aided Design
of Integrated Circuits and Systems, IEEE
Transactions on , vol.PP, no.99, pp.1-1, Oct. 2015
[2] Donghyuk Lee, Yoongu Kim, Pekhimenko, G.,
Khan, S., Seshadri, V., Chang, K., Mutlu, O.,
"Adaptive-latency DRAM: Optimizing DRAM
timing for the common-case," in High Performance
Computer Architecture (HPCA), 2015 IEEE 21st
International Symposium on, pp.489-501, 7-11 Feb.
2015
[3] T. Yamauchi, L. Hammond and K. Olukotun, "The
hierarchical multi-bank DRAM: a high-performance
architecture for memory integrated with
processors," Advanced Research in VLSI, 1997.
Proceedings, Seventeenth Conference on, Ann Arbor,
MI, 1997, pp. 303-319.
[4] Scott Beamer, Chen Sun, Yong-Jin Kwon, Ajay
Joshi, Christopher Batten, Vladimir Stojanović, and
Krste Asanović, “Re-architecting DRAM memory
systems with monolithically integrated silicon
photonics,” in Proceedings of the 37th annual
international symposium on Computer
architecture (ISCA '10). ACM, New York, NY, USA,
pp. 129-140, 2010
[5] Aniruddha N. Udipi, Naveen Muralimanohar,
Niladrish Chatterjee, Rajeev Balasubramonian, Al
Davis, and Norman P. Jouppi, “Rethinking DRAM
design and organization for energy-constrained multi-
cores,” in Proceedings of the 37th annual
international symposium on Computer
architecture (ISCA '10). ACM, New York, NY, USA,
pp. 175-186, 2010
[6] J. L. Hennessy and D. A. Patterson. Computer
Architecture: A Quantitative Approach. Elsevier, 4th
edition, 2007.
[7] Gulur N., Manikantan R., Govindarajan R.,
Mehendale M., "Row-Buffer Reorganization:
Simultaneously Improving Performance and
Reducing Energy in DRAMs," in Parallel
Architectures and Compilation Techniques (PACT),
2011 International Conference on, pp.189-190, 10-14
Oct. 2011
[8] Chang K.K.-W., Donghyuk Lee, Chishti Z.,
Alameldeen A.R., Wilkerson C., Yoongu Kim, Mutlu
O., "Improving DRAM performance by parallelizing
refreshes with accesses," in High Performance
Computer Architecture (HPCA), 2014 IEEE 20th
International Symposium on , pp. 356-367, 15-19 Feb.
2014
[9] Mutlu, O., Moscibroda, T., "Parallelism-Aware
Batch Scheduling: Enhancing both Performance and
Fairness of Shared DRAM Systems," in Computer
Architecture, 2008. ISCA '08. 35th International
Symposium on, pp. 63-74, 21-25 June 2008
[10] Vivek Seshadri, Yoongu Kim, Chris Fallin,
Donghyuk Lee, Rachata Ausavarungnirun, Gennady
Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B.
Gibbons, Michael A. Kozuch, and Todd C. Mowry,
“RowClone: fast and energy-efficient in-DRAM bulk
data copy and initialization,” in Proceedings of the
46th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO-46). ACM, New York,
NY, USA, pp. 185-197, 2013