Improving DRAM performance
Prithvi Kambhampati
Master of Science, Electrical and Computer Engineering
Michigan Technological University
Houghton, Michigan
pkambham@mtu.edu
Abstract—To narrow the growing gap between the clock speed of processors and that of memory, more research than ever is being devoted to improving memory performance. Dynamic Random Access Memory (DRAM) is being used in the cache to make memory accesses faster by reducing miss rate and latency, which makes DRAM performance improvement an important aspect of today's computation. DRAM cells are refreshed periodically at the rank level to keep data loss to a minimum, which prevents a complete rank from accepting memory requests while it refreshes; this is one of the major challenges facing DRAM technology. Improvements to DRAM can be made at four different levels, namely the chip level, bank level, subarray level, and row level. One method is to reorganize the structure of the banks and the row buffer to improve DRAM hit rates. Another method is to use light to transmit data between the processor and the memory system to reduce power consumption and increase bandwidth. We also look into different set mapping policies with which data is accessed from the DRAM rows and discuss the best solution for improving the hit rate and reducing latency. This paper shows that the methods implemented to improve DRAM performance are significantly effective. In addition, we discuss the errors that occur in DRAMs and describe error-resilient schemes, such as single-subarray memory systems with chipkill, that can overcome bit failures.
Index Terms—Dynamic Random Access Memory, chip level, bank level, subarray level, row level.
I. INTRODUCTION
In the past, the clock rates of microprocessors increased exponentially due to process improvements, longer pipelines, and circuit design techniques, but main memory speed did not grow as fast. At the same time, the number of cores on a single chip has been increasing and is expected to increase further, which raises the aggregate demand for off-chip memory and makes main memory access an even greater bottleneck. To address this problem, we need to design a memory system that is fast, big, and cheap. Static Random Access Memory (SRAM) is used in caches for its speed but not at large scale, due to its cost and low capacity, whereas DRAM is used in main memory for its large capacity and low cost. Therefore, improving the efficiency of DRAM has become a priority in recent years. Many methods have been proposed to reduce the loss of data and to improve throughput and power efficiency. One solution is to add a DRAM cache to the memory hierarchy: DRAM increases cache capacity through its higher density compared to SRAM cells.
cells. DRAM also has a higher bandwidth and lower
latency compared to the off-chip memory. DRAM
memory seems like a good solution to bring down the
memory wall (the gap between processor and memory speed). The increased use of DRAM caches has led to more and more research by both industry and academia, whose main aim is to improve the performance of DRAM memory in today's computation. To this end, many methods have been proposed for a given limited off-chip memory bandwidth. A DRAM chip has a well-defined structure (discussed below) and can be subdivided into many parts, which means there is an opportunity to improve the characteristics of each of these parts.
A DRAM chip is made of capacitor-based cells that represent data in the form of electric charge: to store data in a cell, charge is injected, and to retrieve data, the charge is extracted [2]. As shown in figure 1, a typical DRAM chip has a hierarchy consisting of multiple banks, a shared internal bus for reading/writing data, and a chip I/O through which data is transferred between the DRAM chip and other memory units. Each bank is sub-divided into subarrays and a bank I/O [10]. Furthermore, each subarray is a 2D array of DRAM cells with a common row buffer, built from SRAM cells, that buffers one row of the DRAM bank. Data can only be accessed after it is fetched into the row buffer; any subsequent read of the same row is served directly from the row buffer.
Accessing data (in the form of a cache line) from a
subarray involves multiple steps. First, the data can be read only through a row buffer, which means the row must be activated so that its data is transferred from the DRAM cells to the row buffer. Second, after activating the row, the cache line is read from or written to the row buffer, and the data moves to/from the corresponding cells through the chip's internal bus. Finally, the subarray is precharged, writing the row buffer back and preparing it for subsequent requests.
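To make this three-step sequence concrete, the following Python sketch models a subarray with a single row buffer: hits are served from the buffer, a conflicting access writes the old row back (precharge) and activates the new one. It is purely illustrative; the class, method names, and byte-level representation are assumptions, not part of any DRAM specification.

class Subarray:
    def __init__(self, num_rows, row_size):
        self.rows = [bytearray(row_size) for _ in range(num_rows)]
        self.open_row = None        # index of the currently activated row
        self.row_buffer = None      # contents of the activated row

    def access(self, row, col):
        if self.open_row == row:                  # row-buffer hit: no activation
            return self.row_buffer[col]
        if self.open_row is not None:             # precharge: write back old row
            self.rows[self.open_row][:] = self.row_buffer
        self.row_buffer = bytearray(self.rows[row])   # activate: row -> buffer
        self.open_row = row
        return self.row_buffer[col]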
Figure 1. Organization of a DRAM chip [10]
(Taken without permission)
In this paper, we are going to discuss the various
levels at which the DRAM performance can be
improved and the methods to do so. We observe four
different levels at which the modifications can be
done, with each level having multiple proposals to do
so. The first level is the chip level. At this level, there
is a memory channel with a memory controller, which
manages the set of DRAM banks present on the chip.
The memory channel has a three-bus system: a command bus, a read bus, and a write bus. Each of these buses has I/O pins as well. These buses
and I/O access points can be partially/completely
replaced by the Photonically Interconnected DRAM
(PIDRAM) [4] technology, which provides energy
efficient communication. The photonic technology
uses light to transmit data between the processor and
the memory. To transmit data/commands, external
light (typically from a laser) is passed through
resonators which give that light a unique wavelength.
This modulated light is received by a photodetector
and is converted to electricity and the data/commands
are transferred. The advantage of this technology is that multiple wavelengths can be transmitted at once, allowing more data than usual to be transmitted at low power. The second level is the bank level. At this level, PIDRAM technology can be used to reorganize the banks [4] to save energy. Another
method to improve DRAM performance at this level is to process DRAM requests in batches [9]. The third level is the subarray level. One
idea is to have a hierarchical multi-bank DRAM [3] in
which the subarrays are converted into semi-independent sub-banks to take advantage of the fact that most DRAM accesses occur locally within the subarrays. This allows the subarrays to act independently for such accesses and makes the process faster. The last level that can be modified is
the row level. In DRAM cache, to access memory
easily, memory blocks in the banks are mapped to a
particular set of a particular row of a particular bank.
These set-mapping policies [1] concentrate either on improving the hit rate or on decreasing the latency. Another
change that can be made to this level is dividing the
row buffer into multiple smaller row buffers [7].
Figure 2. DRAM Memory System – Each inset shows detail for a different level of current electrical DRAM
memory systems. [4]
(Taken without permission)
II. CHIP LEVEL
A DRAM chip consists of a shared internal bus,
multiple banks, a chip I/O and a memory channel
controlled by a memory controller. This section
describes different ways in which we can modify the
above parts of the chip to improve performance. One
such way is to use light to transmit data among the
parts of the DRAM chip. The following subsection introduces silicon photonic technology, which can partially replace the conventional electrical circuitry.
PHOTONICALLY INTERCONNECTED DRAM
Off-chip memory bandwidth is unlikely to keep up with processor performance, and this has been limiting the maximum achievable system performance since 2008. The number of pins on the board is limited by the area and power overheads of high-speed transceivers and package interconnect. The number of packets transferred per pin can be increased, but only at the expense of more energy. As described in the introduction, a DRAM
memory channel uses a memory controller to manage
a set of DRAM banks that are distributed across one
or more DRAM chips. We can overcome these
challenges by redesigning the DRAM memory using
Photonically Interconnected DRAM (PIDRAM) [4],
which uses a monolithically integrated silicon-
photonic technology. This technology uses light to
transfer data instead of electrical signals over wires. First, laser light is passed through a series of resonators, which modulate it onto distinct wavelengths as it travels from the processor to the PIDRAM chip. At the PIDRAM chip, the light is demodulated using filters and converted to an electrical signal by a photodetector. The advantages of this technology are that very little power is required to transmit data, that large off-chip bandwidths are supported at minimal power consumption, and that data can be transmitted on multiple wavelengths at once, allowing multiple data packets to be transferred simultaneously. This is called dense wavelength division multiplexing (DWDM) [4] and
allows multiple links (wavelengths) to share the same
media (fibre or waveguide). The electrical I/O in
DRAM chips can be replaced by these energy
efficient photonic links. By redesigning DRAM banks to provide greater bandwidth from an individual array core, we can meet the bandwidth demand; this also reduces the energy required to activate the banks. We should keep in mind that not all electrical circuits can be replaced by this technology, as photonics needs more area than a simple electrical circuit.
A. PIDRAM memory channel organization
A memory controller manages a set of DRAM banks that are distributed across many DRAM chips. This memory system has three logical buses: a command bus, a write data bus, and a read data bus. We can implement these buses using photonic components in three ways:
• Shared Photonic Bus:
All the three logical buses can be implemented
using a shared photonic bus, which works like a
standard electrical bus. In this implementation, the
memory controller first issues a command to all the banks, and each bank determines whether it is the target. For a write command, the target bank tunes in its photonic receiver on the write-data bus; the memory controller places the data on that bus, and the target bank receives it and performs the write. For a read command, the target bank performs its read operation and sends the data back over the read data bus.
Figure 3. Shared Photonic Buses [4]
(Taken without permission)
• Split Photonic Bus:
In this implementation, the long shared bus is
divided into multiple branches. The laser power is
sent to all the receivers of the command and write buses, and to the modulators of the read bus, so the total laser power is roughly a linear function of the number of banks. Compared to the shared photonic bus, this lowers the required optical laser power, at the cost of a reduced effective bandwidth density of the photonic devices.
Figure 4. Monolithically integrated silicon-photonic technology - Two DWDM links in opposite directions
between a memory controller in a processor chip and a bank in a PIDRAM chip. λ1 is used for the request and λ2 is
used for the response in the opposite direction on the same waveguides and fibre. [4]
(Taken without permission)
Figure 5.a. Split photonic buses [4]
(Taken without permission)
• Guided Photonic Bus:
The optical power can be further reduced by this
implementation. Guided photonic bus uses optical
power guiding in the form of demultiplexers to
actively direct power to just the target bank. This
allows the total power to be constant throughout, and
also independent of the number of banks.
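The scaling behaviours described for the three organizations can be summarized numerically. The sketch below is a rough illustrative model only: the loss and power constants are invented, and the exponential through-loss model for the shared bus is an assumption in the spirit of [4], not data from it.

# Rough comparison of laser-power scaling for the three photonic bus
# organizations (illustrative model; all constants are made up, not from [4]).
def shared_bus_power(num_banks, through_loss_db=0.1, base_mw=1.0):
    # Every drop point on the shared bus adds optical through loss, so the
    # required laser power grows exponentially with the number of banks.
    return base_mw * 10 ** (through_loss_db * num_banks / 10)

def split_bus_power(num_banks, per_branch_mw=0.2, base_mw=1.0):
    # Power is delivered to every branch, so it grows roughly linearly.
    return base_mw + per_branch_mw * num_banks

def guided_bus_power(num_banks, base_mw=1.2):
    # Demultiplexers steer power only to the target bank: roughly constant.
    return base_mw

for n in (8, 16, 64):
    print(n, shared_bus_power(n), split_bus_power(n), guided_bus_power(n))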
Figure 5.b. Guided photonic buses [4]
(Taken without permission)
B. PIDRAM Chip Organization
We have discussed above different ways in which
the buses can be implemented photonically. The
trade-off with this is that only a portion of the buses
can be implemented photonically and the rest,
electrically. The design choice is made based on the trade-offs in power and area. Photonics can be gradually extended deeper into the PIDRAM chip.
Figure 6. PIDRAM chip floorplan [4]
(Taken without permission)
The vertical electrical data bus can be partitioned into 'n' partitions, with the photonic circuits replicated at each data access point for each bus partition. Partitioning the data bus allows the DRAM chip to use an energy-efficient photonic interconnect, but it increases the fixed link power and incurs higher optical losses.
III. BANK LEVEL
Each bank consists of multiple subarrays and a bank I/O. Data is accessed in the form of cache lines from each subarray, which requires activating the row containing the cache line, reading/writing the cache line, and precharging the subarray to prepare for subsequent requests. This section deals with a novel way to organize the banks and with a request scheduling algorithm, both of which help increase the number of instructions executed.
A. PIDRAM Bank Organization
Most of the energy consumed in a DRAM chip is consumed by the banks themselves. Every array block in a bank access activates an array core, which activates an entire array core row, yet only a few bits of that row are actually used; most of the bank energy goes into waking up these unnecessary bits. This wasted energy can be reduced either by decreasing the array core row size, which reduces the number of unnecessary bits being activated, or by increasing the number of I/Os per array core and using fewer array cores in parallel. Since decreasing the array core row size leads to a large area penalty, the access efficiency should instead be improved by increasing the number of I/Os per array core. The motivation for this change is limited, because the energy consumed by the banks is small compared to that of the electrical inter-chip and intra-chip interconnect; also, the number of pins we can have on a chip is limited.
The increased bandwidth allows more banks per chip, enables energy savings, and does not significantly affect the area of the PIDRAM. This suggests that photonic technology will play an important role in future multiprocessor performance. Upcoming PIDRAMs should not only aim for high performance, low cost, and energy efficiency at the chip level, but also support a large range of multi-chip configurations with different capacities and bandwidths.
B. Parallelism-aware batch scheduling
In a chip multiprocessor (CMP) system, the DRAM is a frequently shared resource, and inter-thread interference can destroy the bank-level access parallelism of individual threads. Bank-level parallelism [8][9] is a method in which the requests made by threads are serviced in parallel in different banks. Parallelism-aware batch scheduling extends bank-level parallelism by dividing outstanding requests into batches and servicing one batch at a time. This method consists of two steps:
i. Request Batching
A number of DRAM requests are grouped into a
batch. These batches of requests are completed one
after the other and this step ensures that all the
requests in one batch are completed before the arrival
of the next batch. Once serviced, a batch's requests are removed from the memory request buffer, and only then is a new batch formed. When forming a
new batch, the batching component decides how
many requests issued by a thread for a certain bank
can be a part of a batch. Batching not only ensures that
all the requests are taken care of, but also provides a
uniform granularity due to which the performance
improves.
A fixed number of DRAM requests are grouped into a batch based on their arrival time. Even with interference from other threads, the bank-level access parallelism of each thread is preserved. Prioritizing the oldest requests guarantees that the oldest batch is served first and prevents any thread from being starved in the DRAM system by other, potentially aggressive threads. Batching reduces the serialization of thread requests by executing them in parallel rather than allowing them to run alone in the memory system.
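A minimal sketch of the batching step follows, assuming a per-thread, per-bank cap on how many requests may be marked into a batch (the marking-cap idea of [9]); the Request record and its field names are hypothetical.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:              # illustrative request record (hypothetical fields)
    thread_id: int
    bank: int
    arrival_time: int
    marked: bool = False

def form_batch(request_buffer, marking_cap=5):
    # Mark up to `marking_cap` of the oldest requests per (thread, bank);
    # the marked requests form the batch serviced before any unmarked one.
    counts, batch = defaultdict(int), []
    for req in sorted(request_buffer, key=lambda r: r.arrival_time):
        key = (req.thread_id, req.bank)
        if counts[key] < marking_cap:
            counts[key] += 1
            req.marked = True
            batch.append(req)
    return batch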
ii. Parallelism-Aware Within-Batch Scheduling
In this step, the requests of every thread in a batch are serviced in parallel in the DRAM banks. This hides latency inside the batch and increases processor throughput, since many requests are serviced at once. Parallelism-aware within-batch scheduling tries to maximize:
• Row-buffer locality:
Bank accesses will have lower latencies if a high
row-hit rate is present within a batch.
• Intra-thread bank parallelism:
Scheduling multiple requests from a thread to
various banks in parallel reduces the thread’s stall
time.
This scheduling uses thread prioritization to exploit both row-buffer locality and bank parallelism. Thread ranking is done by the max rule, in which the scheduler finds, for each thread, the maximum number of marked requests it has outstanding for any single bank (its max-load), and by a tie-breaking total rule, in which the scheduler tracks each thread's total number of marked requests (its total-load); threads with lower loads are assigned higher rank [9].
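The max rule and total-load tie-breaker can be expressed compactly. The sketch below, reusing the hypothetical Request records from the previous example, ranks threads so that lower loads come first; it illustrates the rule as summarized above and is not code from [9].

from collections import defaultdict

def rank_threads(batch):
    # Rank threads by the max rule (lowest per-bank max-load first), with
    # ties broken by the total rule (lowest total-load first), as in [9].
    per_bank = defaultdict(lambda: defaultdict(int))
    total = defaultdict(int)
    for req in batch:
        per_bank[req.thread_id][req.bank] += 1
        total[req.thread_id] += 1
    return sorted(total, key=lambda t: (max(per_bank[t].values()), total[t]))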
IV. SUBARRAY LEVEL
Each subarray consists of a two-dimensional array of DRAM cells, and the data stored in these cells is accessed in terms of rows. A requested row is first transferred to a row buffer that is common to all the rows of the subarray, and the data is then accessed from that buffer. The following two sub-sections explain how accesses can be made faster by modifying the subarray.
A. Hierarchical multi-bank DRAM
Embedded DRAM or eDRAM is a dynamic
random-access memory integrated on the same die or
multi-chip module of an ASIC or microprocessor.
eDRAM allows for larger buses and higher operating speeds due to the higher density of DRAM, but it cannot handle the number of memory accesses generated by a high-performance processor, which creates a bottleneck: successive accesses that need the same bank must queue up and serialize. One solution is parallelism-aware batch scheduling (discussed in III.B). Another is simply to increase the number of independent DRAM banks to lower the probability of a conflict, but doing so requires a larger area. The number of independent banks can instead be increased without affecting the area much by allowing a subarray to act as a bank whenever the DRAM chip receives a request to that particular subarray. This allows the subarrays to act as semi-independent banks [3].
After dividing the DRAM banks into subarrays,
for the subarrays to act as semi-independent sub-
banks, some additions and modifications have to be
made to each subarray. The banks in a DRAM chip use registers and control logic to support data access, so a few pipeline registers and controls, a set-reset flip-flop to hold the subarray output, and buffers to hold the addresses for the access must be added to each subarray. The DRAM's access queues must also be modified to detect accesses that do not cause conflicts and to start those conflict-free accesses in parallel.
Figure 7.a. Modifications made to each subarray [3]
(Taken without permission)
Figure 7.b. Modifications made to the access queues [3]
(Taken without permission)
This is a useful approach, since a large part of each DRAM access actually occurs locally within individual DRAM subarrays. Individual subarrays within independent banks are controlled as semi-independent sub-banks that share the main bank's I/O circuitry and decoders. The sub-banks perform far better while incurring only a small area penalty.
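The modified access-queue behaviour can be sketched as a simple conflict check: requests targeting idle subarrays may start in parallel, while requests to a busy subarray wait. The queue entries and the `subarray` field are illustrative assumptions, not structures from [3].

# Sketch of conflict-free issue from a bank's access queue (in the spirit
# of [3]): requests to distinct, idle subarrays start in parallel; requests
# to a busy subarray stay queued until it becomes free.
def issue_parallel(queue, busy_subarrays):
    issued = []
    for req in list(queue):
        if req.subarray not in busy_subarrays:    # no conflict: start now
            busy_subarrays.add(req.subarray)
            queue.remove(req)
            issued.append(req)
    return issued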
B. Fault tolerance in DRAMs
Errors occur often in DRAMs, leading to significant downtime in datacentres, so the DRAM architecture has to be developed to provide a high standard of reliability. Error-resilient schemes, called chipkill [5], can be built to tolerate such bit failures. Isolating an entire cache line to a single small subarray on a single DRAM chip allows an entire cache line to be read out of a single DRAM array, but it increases the potential for correlated errors. To provide chipkill-level reliability in concert with single small subarrays, checksums [5] stored alongside each cache line in the DRAM are introduced, similar to those used in hard drives.
The checksum provides robust error detection, and chipkill-level reliability is provided through a Redundant Array of Inexpensive DRAMs [6]. In a Redundant Array of Inexpensive DRAMs, analogous to RAID for disks, a single chip serves as a parity check for several other chips; on an access, only one chip out of every 'n' is read, and the checksum associated with the read block lets the controller know whether the read is correct. This approach is more effective in terms of area and energy than prior chipkill approaches, and incurs a performance penalty only relative to a single-subarray memory system without chipkill.
Figure 8. Chipkill support in a single-subarray memory system (64KB) [5]
(Taken without permission)
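The checksum-plus-parity idea can be sketched as follows. The toy 8-bit checksum and the data layout (a list of per-chip blocks plus one parity block that is the XOR of all of them) are assumptions for illustration; [5] and [6] describe the actual schemes.

from functools import reduce

def checksum(line: bytes) -> int:
    return sum(line) & 0xFF            # toy 8-bit checksum (illustrative)

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Read one chip's block; if its stored checksum mismatches, rebuild the
# block from the remaining chips and the parity block (XOR of all blocks).
def read_line(chips, parity, idx, stored_sums):
    line = chips[idx]
    if checksum(line) == stored_sums[idx]:
        return line                    # common case: a single chip is read
    others = [c for i, c in enumerate(chips) if i != idx]
    return reduce(xor_blocks, others + [parity])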
V. ROW LEVEL
As the number of cores in a processor increases, the demand for off-chip memory increases as well, exacerbating the main memory access bottleneck. Many solutions have been proposed for this problem. One of them is to use on-chip DRAM as the last level of cache to improve performance for a given off-chip memory bandwidth. This on-chip DRAM cache increases cache capacity through the high density of DRAM, and it improves on-chip communication through a high-bandwidth, low-latency interconnect.
In a cache, the storage is mapped to the memory addresses it serves, and there are different ways this mapping can be done. The choice of mapping is so critical to the design that a cache is often named after its mapping, as in an N-way set-associative cache. The same goes for DRAM. Each row in a subarray of a bank of a DRAM chip holds a series of DRAM cells, and all the rows in a subarray share a common row buffer used to access data. Using DRAM as a cache requires a mapping between the DRAM rows and the main memory system. The following two sub-sections explain how: the first describes how set mapping works, and later in the section we see how the row buffer can be modified to make data accesses faster.
A. Set mapping policy
As explained in the introduction, the DRAM cache
is a multi-bank system, with each bank having a
number of rows. The DRAM cache uses a set mapping policy [1], in which memory blocks are mapped to a particular set of a particular row of a particular bank. The set mapping policy directly affects system throughput by affecting the DRAM cache miss rate and the DRAM cache hit latency, which makes it an important aspect of the cache.
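A set-mapping policy of this kind can be sketched as a decomposition of a memory block address into set, row, and bank fields. The field order below (set in the lowest bits, then row, then bank) makes consecutive blocks fall into the same row, favouring row-buffer hits; the geometry constants are illustrative, except that 29 sets per row echoes the 4KB-row example of [1] (Figure 10).

def map_block(block_addr, sets_per_row=29, num_rows=4096, num_banks=8):
    # Set index in the lowest bits: consecutive blocks land in the same
    # row, favouring row-buffer hits. Constants are illustrative only.
    set_idx = block_addr % sets_per_row
    row = (block_addr // sets_per_row) % num_rows
    bank = (block_addr // (sets_per_row * num_rows)) % num_banks
    return bank, row, set_idx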
Figure 9. DRAM cache hierarchy (Intel)
(Taken from the website)
New DRAM set mapping policies are proposed regularly to reduce the DRAM cache miss rate. Through higher associativity we can reduce DRAM latency via an improved row buffer hit rate. A typical DRAM cache organization has multiple banks, each with subarrays, and each subarray contains an array of rows and columns of DRAM cells. Each DRAM bank provides a row buffer, which buffers one row of that bank; data in a DRAM bank is accessed after it is fetched through the row buffer.
Figure 10. 29-way associativity for a 4KB row [1]
(Taken without permission)
Associativity trades off hit ratio against search speed. A direct-mapped cache has a fast search but a poorer hit ratio, while a fully associative cache has a better hit ratio but a slower search. In other words, as associativity increases, the hit ratio improves and the search speed decreases, so we need to choose a reasonable associativity. As said before, a
higher associativity decreases the cache miss rate
significantly. The DRAM cache row is divided into
tag blocks and cache lines. Each bank of the DRAM
cache is associated with a row buffer that holds the
last accessed row of that bank. If the associativity of the DRAM row organization is increased, a cache access first hits the tag block instead of the whole cache block, which reduces access latency. A higher-associativity cache may slightly increase tag latency compared to a lower-associativity one, but it benefits from the reduced conflict misses. It also provides a higher row buffer hit rate than a simple cache, because more consecutive memory blocks are mapped to the same set.
B. Modifying the row buffer
Present DRAM cache banks have a single row buffer. Having multiple smaller row buffers instead of the existing single large one helps improve row hit rates and also reduces the energy required for row activation [7]. As explained earlier, on a read request the data must first be brought into the row buffer, and the row data is then read from the buffer. Each row buffer is as wide as an entire row and holds a few KB of data. The precharge writes back the
row buffer to the appropriate row after a column
read/write of the selected words from/to the row
buffer. The precharge operation involves charging
and discharging of a large number of capacitors.
In a multi-core processor's memory, addresses are spread evenly across memory banks to compensate for the relatively slow speed of DRAM, which decreases the row buffer hit rate. We can recover this lost row buffer hit rate by dividing the row buffer into multiple smaller row buffers. The new organization requires sub-row activation in addition to row activation and row-buffer selection, so the controller has to send additional address bits to the DRAM cache to select the sub-row to activate and the row buffer to bring it into. The memory controller allocates and manages the row buffers, giving the DRAM logic the additional flexibility to implement many other buffer allocation policies.
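A sketch of the reorganized access path follows: a column address selects a sub-row within the row, and a simple policy maps it to one of several small buffers. The geometry and the allocation policy are illustrative assumptions; [7] leaves buffer allocation to the memory controller.

class SubRowBank:
    def __init__(self, num_buffers=4, subrows_per_row=4, row_size=4096):
        self.num_buffers = num_buffers
        self.subrow_size = row_size // subrows_per_row
        self.tags = [None] * num_buffers    # each holds a (row, subrow) tag

    def access(self, row, col):
        subrow = col // self.subrow_size           # sub-row within the row
        buf = (row + subrow) % self.num_buffers    # toy allocation policy
        if self.tags[buf] != (row, subrow):
            self.tags[buf] = (row, subrow)         # activate only this sub-row
            return buf, "activate"
        return buf, "hit"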
Figure 11. Reorganized DRAM bank structure to support sub-rows and buffer selection [7]
(Taken without permission)
VI. CONCLUSION
From the problems we discussed above, it is clear
that improving the memory system is the top priority
to achieve greater speeds. DRAM plays an important role in the memory system, and therefore more techniques should be developed to improve it. DRAM is a hierarchical system with four levels, and the components at each level can be improved by replacing, modifying, or reorganizing them. All of the performance techniques discussed above improve DRAM efficiency significantly, and they do so in different ways: some decrease power, some increase throughput, and some hide latency. The final taxonomy we obtained by analysing various techniques to improve DRAM performance is shown in figure 12. We can also conclude that photonic technology will play a crucial role in the future of processors and memory systems.
ACKNOWLEDGEMENT
I thank Dr. Soner Onder for his valuable comments on earlier drafts and for his patience throughout the process.
Figure 12. Resulting taxonomy of our analysis: at the chip level, PIDRAM memory channel organization and PIDRAM chip organization; at the bank level, PIDRAM bank organization and parallelism-aware batch scheduling; at the subarray level, hierarchical multi-bank DRAM and fault tolerance; at the row level, set mapping policy and row buffer modification.
REFERENCES
[1] Hameed, F., Bauer, L., Henkel, J., "Architecting On-Chip DRAM Cache for Simultaneous Miss Rate and Latency Reduction," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. PP, no. 99, pp. 1-1, Oct. 2015
[2] Donghyuk Lee, Yoongu Kim, Pekhimenko, G.,
Khan, S., Seshadri, V., Chang, K., Mutlu, O.,
"Adaptive-latency DRAM: Optimizing DRAM
timing for the common-case," in High Performance
Computer Architecture (HPCA), 2015 IEEE 21st
International Symposium on, pp.489-501, 7-11 Feb.
2015
[3] T. Yamauchi, L. Hammond and K. Olukotun, "The
hierarchical multi-bank DRAM: a high-performance
architecture for memory integrated with
processors," Advanced Research in VLSI, 1997.
Proceedings, Seventeenth Conference on, Ann Arbor,
MI, 1997, pp. 303-319.
[4] Scott Beamer, Chen Sun, Yong-Jin Kwon, Ajay
Joshi, Christopher Batten, Vladimir Stojanović, and
Krste Asanović, “Re-architecting DRAM memory
systems with monolithically integrated silicon
photonics,” in Proceedings of the 37th annual
international symposium on Computer
architecture (ISCA '10). ACM, New York, NY, USA,
pp. 129-140, 2010
[5] Aniruddha N. Udipi, Naveen Muralimanohar,
Niladrish Chatterjee, Rajeev Balasubramonian, Al
Davis, and Norman P. Jouppi, “Rethinking DRAM
design and organization for energy-constrained multi-
cores,” in Proceedings of the 37th annual
international symposium on Computer
architecture (ISCA '10). ACM, New York, NY, USA,
pp. 175-186, 2010
[6] J. L. Hennessy and D. A. Patterson. Computer
Architecture: A Quantitative Approach. Elsevier, 4th
edition, 2007.
[7] Gulur N., Manikantan R., Govindarajan R.,
Mehendale M., "Row-Buffer Reorganization:
Simultaneously Improving Performance and
Reducing Energy in DRAMs," in Parallel
Architectures and Compilation Techniques (PACT),
2011 International Conference on, pp.189-190, 10-14
Oct. 2011
[8] Chang, K.K.-W., Donghyuk Lee, Chishti, Z., Alameldeen, A.R., Wilkerson, C., Yoongu Kim, Mutlu, O., "Improving DRAM performance by parallelizing refreshes with accesses," in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pp. 356-367, 15-19 Feb. 2014
[9] Mutlu, O., Moscibroda, T., "Parallelism-Aware
Batch Scheduling: Enhancing both Performance and
Fairness of Shared DRAM Systems," in Computer
Architecture, 2008. ISCA '08. 35th International
Symposium on, pp. 63-74, 21-25 June 2008
[10] Vivek Seshadri, Yoongu Kim, Chris Fallin,
Donghyuk Lee, Rachata Ausavarungnirun, Gennady
Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B.
Gibbons, Michael A. Kozuch, and Todd C. Mowry,
“RowClone: fast and energy-efficient in-DRAM bulk
data copy and initialization,” in Proceedings of the
46th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO-46). ACM, New York,
NY, USA, pp. 185-197, 2013

More Related Content

What's hot

Bai07 bo nho
Bai07   bo nhoBai07   bo nho
Bai07 bo nhoVũ Sang
 
Chapter 2 - Computer Evolution and Performance
Chapter 2 - Computer Evolution and PerformanceChapter 2 - Computer Evolution and Performance
Chapter 2 - Computer Evolution and Performance
César de Souza
 
Overview of-dram
Overview of-dramOverview of-dram
Overview of-dram
Thiên Nguyễn
 
Phân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAM
Phân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAMPhân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAM
Phân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAM
nataliej4
 
Memory hierarchy
Memory hierarchyMemory hierarchy
Memory hierarchy
Anurag Verma
 
Slide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdf
Slide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdfSlide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdf
Slide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdf
NhmL7
 
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii FrameworkXây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
GMO-Z.com Vietnam Lab Center
 
ICMP-Học viện Kỹ thuật Mật mã
ICMP-Học viện Kỹ thuật Mật mãICMP-Học viện Kỹ thuật Mật mã
ICMP-Học viện Kỹ thuật Mật mã
Dũng Trần
 
Memory organization
Memory organizationMemory organization
Memory organization
ishapadhy
 
Luận văn: Phát hiện xâm nhập theo thời gian thực trong internet
Luận văn: Phát hiện xâm nhập theo thời gian thực trong internetLuận văn: Phát hiện xâm nhập theo thời gian thực trong internet
Luận văn: Phát hiện xâm nhập theo thời gian thực trong internet
Dịch vụ viết bài trọn gói ZALO 0917193864
 
Đề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải Phòng
Đề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải PhòngĐề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải Phòng
Đề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải Phòng
Dịch vụ viết bài trọn gói ZALO: 0909232620
 
bài giảng ký thuật vi xử lý PTIT
bài giảng ký thuật vi xử lý PTITbài giảng ký thuật vi xử lý PTIT
bài giảng ký thuật vi xử lý PTIT
NguynMinh294
 
Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...
Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...
Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...
Viết thuê báo cáo thực tập giá rẻ
 
Phân tích thiết kế hệ thống của hàng bán điện thoại di động
Phân tích thiết kế hệ thống của hàng bán điện thoại di độngPhân tích thiết kế hệ thống của hàng bán điện thoại di động
Phân tích thiết kế hệ thống của hàng bán điện thoại di động
Nguyễn Danh Thanh
 
02 Computer Evolution And Performance
02  Computer  Evolution And  Performance02  Computer  Evolution And  Performance
02 Computer Evolution And Performance
Jeanie Delos Arcos
 
ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2
ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2
ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2
nataliej4
 
Luận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAY
Luận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAYLuận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAY
Luận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAY
Dịch vụ viết bài trọn gói ZALO 0917193864
 
Đề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên web
Đề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên webĐề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên web
Đề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên web
Dịch vụ viết bài trọn gói ZALO: 0909232620
 
Tiếng anh chuyên ngành cntt
Tiếng anh chuyên ngành cnttTiếng anh chuyên ngành cntt
Tiếng anh chuyên ngành cntt
thientinh199
 
ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...
ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...
ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...
nataliej4
 

What's hot (20)

Bai07 bo nho
Bai07   bo nhoBai07   bo nho
Bai07 bo nho
 
Chapter 2 - Computer Evolution and Performance
Chapter 2 - Computer Evolution and PerformanceChapter 2 - Computer Evolution and Performance
Chapter 2 - Computer Evolution and Performance
 
Overview of-dram
Overview of-dramOverview of-dram
Overview of-dram
 
Phân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAM
Phân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAMPhân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAM
Phân tích kiến trúc và nguyên lý làm việc của bộ nhớ RAM chuẩn DDRAM
 
Memory hierarchy
Memory hierarchyMemory hierarchy
Memory hierarchy
 
Slide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdf
Slide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdfSlide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdf
Slide hệ điều hành học viện công nghệ Bưu Chính viễn thông.pdf
 
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii FrameworkXây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
 
ICMP-Học viện Kỹ thuật Mật mã
ICMP-Học viện Kỹ thuật Mật mãICMP-Học viện Kỹ thuật Mật mã
ICMP-Học viện Kỹ thuật Mật mã
 
Memory organization
Memory organizationMemory organization
Memory organization
 
Luận văn: Phát hiện xâm nhập theo thời gian thực trong internet
Luận văn: Phát hiện xâm nhập theo thời gian thực trong internetLuận văn: Phát hiện xâm nhập theo thời gian thực trong internet
Luận văn: Phát hiện xâm nhập theo thời gian thực trong internet
 
Đề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải Phòng
Đề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải PhòngĐề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải Phòng
Đề tài: Xây dựng phần mềm quản lý thông tin nhân sự ĐH Hải Phòng
 
bài giảng ký thuật vi xử lý PTIT
bài giảng ký thuật vi xử lý PTITbài giảng ký thuật vi xử lý PTIT
bài giảng ký thuật vi xử lý PTIT
 
Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...
Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...
Đồ án tốt nghiệp điện tử Điều khiển và giám sát thiết bị điện gia đình - sdt/...
 
Phân tích thiết kế hệ thống của hàng bán điện thoại di động
Phân tích thiết kế hệ thống của hàng bán điện thoại di độngPhân tích thiết kế hệ thống của hàng bán điện thoại di động
Phân tích thiết kế hệ thống của hàng bán điện thoại di động
 
02 Computer Evolution And Performance
02  Computer  Evolution And  Performance02  Computer  Evolution And  Performance
02 Computer Evolution And Performance
 
ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2
ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2
ỨNG DỤNG DEEP LEARNING ĐỂ ĐẾM SỐ LƯỢNG XE ÔTÔ TRONG NỘI THÀNH ĐÀ NẴNG 51920ed2
 
Luận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAY
Luận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAYLuận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAY
Luận văn: Xây dựng website quản lý trả chứng chỉ ICDL, HAY
 
Đề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên web
Đề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên webĐề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên web
Đề tài: Hệ thống hỗ trợ đăng ký đề tài nghiên cứu khoa học trên web
 
Tiếng anh chuyên ngành cntt
Tiếng anh chuyên ngành cnttTiếng anh chuyên ngành cntt
Tiếng anh chuyên ngành cntt
 
ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...
ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...
ĐỀ TÀI : ĐIỂM DANH BẰNG NHẬN DIỆN KHUÔN MẶT. Giảng viên : PGS.TS. HUỲNH CÔNG ...
 

Viewers also liked

My Personal Leadership Model 2011
My Personal Leadership Model 2011My Personal Leadership Model 2011
My Personal Leadership Model 2011
Alishap
 
Componentes para el uso de la TIC en educaciòn
Componentes para el uso de la TIC en educaciònComponentes para el uso de la TIC en educaciòn
Componentes para el uso de la TIC en educaciòn
Angel Rivas
 
Xp
XpXp
What are they doing
What are they doingWhat are they doing
What are they doing
Fernando IBM
 
RESUME Final_updated on 15.02.2016
RESUME Final_updated on 15.02.2016RESUME Final_updated on 15.02.2016
RESUME Final_updated on 15.02.2016
CS Rakesh Kasar
 
John Gouthro 2016 Resume
John Gouthro 2016 ResumeJohn Gouthro 2016 Resume
John Gouthro 2016 Resume
John Gouthro
 
грипп 2016 и орит
грипп 2016 и оритгрипп 2016 и орит
грипп 2016 и орит
Ксения Емануилова
 
Can Stress Cause Heartburn?
Can Stress Cause Heartburn?Can Stress Cause Heartburn?
Can Stress Cause Heartburn?
elise Rivas
 
NCC_Protocol_WEB_2016-07-12
NCC_Protocol_WEB_2016-07-12NCC_Protocol_WEB_2016-07-12
NCC_Protocol_WEB_2016-07-12
Baptiste Cassan-Barnel
 
Risk dg 19 may 2016 presentation slides
Risk dg 19 may 2016 presentation slidesRisk dg 19 may 2016 presentation slides
Risk dg 19 may 2016 presentation slides
Nabila Gimadi
 
Swell Bottle Pitch2
Swell Bottle Pitch2Swell Bottle Pitch2
Swell Bottle Pitch2
Katie O'Hara
 

Viewers also liked (11)

My Personal Leadership Model 2011
My Personal Leadership Model 2011My Personal Leadership Model 2011
My Personal Leadership Model 2011
 
Componentes para el uso de la TIC en educaciòn
Componentes para el uso de la TIC en educaciònComponentes para el uso de la TIC en educaciòn
Componentes para el uso de la TIC en educaciòn
 
Xp
XpXp
Xp
 
What are they doing
What are they doingWhat are they doing
What are they doing
 
RESUME Final_updated on 15.02.2016
RESUME Final_updated on 15.02.2016RESUME Final_updated on 15.02.2016
RESUME Final_updated on 15.02.2016
 
John Gouthro 2016 Resume
John Gouthro 2016 ResumeJohn Gouthro 2016 Resume
John Gouthro 2016 Resume
 
грипп 2016 и орит
грипп 2016 и оритгрипп 2016 и орит
грипп 2016 и орит
 
Can Stress Cause Heartburn?
Can Stress Cause Heartburn?Can Stress Cause Heartburn?
Can Stress Cause Heartburn?
 
NCC_Protocol_WEB_2016-07-12
NCC_Protocol_WEB_2016-07-12NCC_Protocol_WEB_2016-07-12
NCC_Protocol_WEB_2016-07-12
 
Risk dg 19 may 2016 presentation slides
Risk dg 19 may 2016 presentation slidesRisk dg 19 may 2016 presentation slides
Risk dg 19 may 2016 presentation slides
 
Swell Bottle Pitch2
Swell Bottle Pitch2Swell Bottle Pitch2
Swell Bottle Pitch2
 

Similar to Improving DRAM performance

301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog
Srinivas Naidu
 
Time and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data RetentionTime and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
IJMTST Journal
 
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET Journal
 
Sram pdf
Sram pdfSram pdf
Intelligent ram
Intelligent ramIntelligent ram
Intelligent ram
Nitin Goyal
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworld
Praveen Kumar
 
Embedded dram
Embedded dramEmbedded dram
Embedded dram
Shrikrishna Parab
 
DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...
DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...
DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...
VLSICS Design
 
Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...
Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...
Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...
VLSICS Design
 
UNIT 3.docx
UNIT 3.docxUNIT 3.docx
UNIT 3.docx
Nagendrababu Vasa
 
unit4 and unit5.pptx
unit4 and unit5.pptxunit4 and unit5.pptx
unit4 and unit5.pptx
bobbyk11
 
Internal memory
Internal memoryInternal memory
Internal memory
Riya Choudhary
 
L010236974
L010236974L010236974
L010236974
IOSR Journals
 
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORYMAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
caijjournal
 
CA assignment group.pptx
CA assignment group.pptxCA assignment group.pptx
CA assignment group.pptx
HAIDERALICH3
 
Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentation
Karthik Iyr
 
Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...
IAEME Publication
 
Accelerix ISSCC 1998 Paper
Accelerix ISSCC 1998 PaperAccelerix ISSCC 1998 Paper
Accelerix ISSCC 1998 Paper
Imagination Technologies
 
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
Barbara Aichinger
 
Bg4103362367
Bg4103362367Bg4103362367
Bg4103362367
IJERA Editor
 

Similar to Improving DRAM performance (20)

301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog
 
Time and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data RetentionTime and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
 
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
 
Sram pdf
Sram pdfSram pdf
Sram pdf
 
Intelligent ram
Intelligent ramIntelligent ram
Intelligent ram
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworld
 
Embedded dram
Embedded dramEmbedded dram
Embedded dram
 
DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...
DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...
DESIGN AND IMPLEMENTATION OF 4T, 3T AND 3T1D DRAM CELL DESIGN ON 32 NM TECHNO...
 
Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...
Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...
Design and implementation of 4 t, 3t and 3t1d dram cell design on 32 nm techn...
 
UNIT 3.docx
UNIT 3.docxUNIT 3.docx
UNIT 3.docx
 
unit4 and unit5.pptx
unit4 and unit5.pptxunit4 and unit5.pptx
unit4 and unit5.pptx
 
Internal memory
Internal memoryInternal memory
Internal memory
 
L010236974
L010236974L010236974
L010236974
 
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORYMAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
 
CA assignment group.pptx
CA assignment group.pptxCA assignment group.pptx
CA assignment group.pptx
 
Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentation
 
Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...
 
Accelerix ISSCC 1998 Paper
Accelerix ISSCC 1998 PaperAccelerix ISSCC 1998 Paper
Accelerix ISSCC 1998 Paper
 
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
 
Bg4103362367
Bg4103362367Bg4103362367
Bg4103362367
 

Improving DRAM performance

  • 1. Improving DRAM performance Prithvi Kambhampati Master of Science, Electrical and Computer Engineering Michigan Technological University Houghton, Michigan pkambham@mtu.edu Abstract—In order to reduce the growing gap between the clock speed of the processors and that of memory, more research is being done to improve the performance of memory than ever. Dynamic Random Access Memory (DRAM) is being used in the cache to make the memory accesses faster by reducing miss rate and latency. This makes the DRAM performance improvement an important aspect in today’s computation. DRAM cells are refreshed at the rank-level, periodically, in order to keep the data loss to a minimum, prevent a complete rank from accepting memory requests. This is one of the major challenges the DRAM technology is facing. The improvement to the DRAM can be made at four different levels, namely, chip level, bank level, subarray level, and row level. One of the methods to do so is by reorganizing the structure of the banks and the row buffer to improve the hit rates of DRAM. Another method is to use light to transmit data between the processor and the memory system to reduce power consumption and increase bandwidth. We also look into different set mapping policies with which data is accessed from the DRAM rows and discuss about the best solution to improve the hit rate and reduce latency. This paper shows that the methods implemented to improve the performance of DRAM are significantly affective. In addition, we also discuss about the errors that occur in DRAMs and describe the error-resilient schemes such as single subarray memory systems with chipkills that can overcome bit failures. Index terms—Dynamic Random access memory, chip level, bank level, subarray level, row level. I. INTRODUCTION In the past, the clock rates of microprocessors have increased exponentially due to process improvements, longer pipelines, and circuit design techniques. But the main memory speed did not grow as fast as the processors. Along with this, the number of cores on a single chip has been increasing and is expected to further increase in the future, and this increases the aggregate demand for off-chip memory which makes it worse to access the main memory. To address this problem, we need to design a memory system that is fast, big, and cheap. Static Random Access Memory (SRAM) is being used in cache for its speed but is not used in a large scale due its cost and low capacity. Whereas, DRAM is being used in the main memory for its large capacity and low cost. Therefore, improving the efficiency of DRAM has become a priority in the recent years. Many methods have been proposed to reduce the loss of data and improve the throughput and power efficiency. One solution is to have a DRAM memory in the memory hierarchy. In the recent past, DRAM has been employed in the memory hierarchy as it increases the capacity of cache memory via its higher density compared to the SRAM cells. DRAM also has a higher bandwidth and lower latency compared to the off-chip memory. DRAM memory seems like a good solution to bring down the memory wall (gap between the processor speed and memory speed). The increased implementation of DRAM memory has led to more and more research by both industry and the academic institutions. Their main aim is to improve the performance of DRAM memory in today’s computation. For this purpose, there have been many methods proposed for a given limited off-chip memory bandwidth. 
Like many things, a DRAM chip also has a structure (discussed below), and can be subdivided into many parts. This means that there is a possibility to improve the characteristics of each and every of these parts. A DRAM chip is made of capacitor based cells that represent the data in the form of electric charge. To store data in a cell, charge is injected, whereas to retrieve data, the charge is extracted [2]. As shown in figure 1, a typical DRAM chip has a hierarchy which consists of multiple banks, a shared internal bus for reading/writing data, and a chip I/O through which memory is transferred between DRAM chip and other memory units. Each bank is sub-divided into subarrays and a bank I/O [10]. Furthermore the subarrays are arranged into 2D arrays of DRAM cells
  • 2. along with a common row buffer that consists of SRAM cells and buffers one row of the DRAM bank. Data can only be accessed after it is fetched to the row buffer. Any attempt to read the data from the same row will result in directly reading from the row buffer. Accessing data (in the form of a cache line) from a subarray involves multiple steps. First, the data can be read only through a row buffer. This means that the row must first be activated so that the data from the rows of the DRAM cells can be transferred to the row buffer. Secondly, after activating the row, the cache line has to be read from/written to. This allows the data to be transferred from/to the corresponding cells through the internal bus that is present in the DRAM chip. Finally, the row buffer has to be cleared for the subsequent instructions. Figure 1. Organization of a DRAM chip [10] (Taken without permission) In this paper, we are going to discuss the various levels at which the DRAM performance can be improved and the methods to do so. We observe four different levels at which the modifications can be done, with each level having multiple proposals to do so. The first level is the chip level. At this level, there is a memory channel with a memory controller, which manages the set of DRAM banks present on the chip. The memory channel has a three bus system which includes a command bus, a read bus, and a write bus. Each of these buses have I/O pins as well. These buses and I/O access points can be partially/completely replaced by the Photonically Interconnected DRAM (PIDRAM) [4] technology, which provides energy efficient communication. The photonic technology uses light to transmit data between the processor and the memory. To transmit data/commands, external light (typically from a laser) is passed through resonators which give that light a unique wavelength. This modulated light is received by a photodetector and is converted to electricity and the data/commands are transferred. The advantage with this technology is that multiple wavelengths can be transmitted at once, allowing us to transmit more data that usual at low power usage. The second level is the bank level. At this level PIDRAM technology can be used to reorganizing the banks [4] to save energy. Another method to improve the performance of DRAM at this level is by processing DRAM requests in batches of requests [9]. The third level is the subarray level. One idea is to have a hierarchical multi-bank DRAM [3] in which the subarrays are converted to semi- independent sub-banks, to take an advantage of the fact that most of the DRAM accesses occur locally within the subarrays. This allows the subarrays to act independently for such accesses and makes the process faster. The last level that can be modified is the row level. In DRAM cache, to access memory easily, memory blocks in the banks are mapped to a particular set of a particular row of a particular bank. These set-mapping policies [1] either concentrate on improving the hit rate or decrease the latency. Another change that can be made to this level is dividing the row buffer into multiple smaller row buffers [7]. Figure 2. DRAM Memory System – Each inset shows detail for a different level of current electrical DRAM memory systems. [4] (Taken without permission)
  • 3. II. CHIP LEVEL A DRAM chip consists of a shared internal bus, multiple banks, a chip I/O and a memory channel controlled by a memory controller. This section describes different ways in which we can modify the above parts of the chip to improve performance. One such way is to use light to transmit data among the parts of the DRAM chip. The following introduction to the silicon photonic technology, which can replace the conventional electrical circuit partially. PHOTONICALLY INTERCONNECTED DRAM The off-chip memory bandwidths are not likely to match up to the performance of the processor. This has been reducing the maximum achievable system performance since 2008. The number of pins on the board is limited by the area and power over heads of high speed transceivers and package interconnect. The number of packets transferred per pin can be increased but only at the expense of using up more energy. As described in the introduction, a DRAM memory channel uses a memory controller to manage a set of DRAM banks that are distributed across one or more DRAM chips. We can overcome these challenges by redesigning the DRAM memory using Photonically Interconnected DRAM (PIDRAM) [4], which uses a monolithically integrated silicon- photonic technology. This technology uses light to transfer data instead of electrical circuits. Firstly, the light which is in the form of LASER is passed through a series of resonators. These resonators modulate the wavelength of the light which is transmitted from the processor to the PIDRAM chip. At the PIDRAM chip, this light is received and demodulated using filters and is converted to electrical signal using a photo detector. The advantages of this technology are: very less power is required to transmit data, larger off-chip bandwidths are supported at a minimum power consumption, and transmission of the data at multiple wavelengths at once, allowing multiple data packets to be transferred at once. This is called as dense wavelength division multiplexing (DWDM) [4] and allows multiple links (wavelengths) to share the same media (fibre or waveguide). The electrical I/O in DRAM chips can be replaced by these energy efficient photonic links. By redesigning DRAM banks to provide greater bandwidth from an individual array core, we can supply the bandwidth demands. This also reduces the energy required to activate the banks. We should keep in mind that all the electrical circuits cannot be replaced by this technology as it needs more area than a simple electrical circuit. A. PIDRAM memory channel organization A memory controller manages a set of DRAM banks that are distributed across many DRAM chips. This memory system has 3 logical buses: a command bus, a write data bus, and a read data bus. We can implement these buses using the photonic components in 3 ways:  Shared Photonic Bus: All the three logical buses can be implemented using a shared photonic bus, which works like a standard electrical bus. In this implementation, the memory controller first issues a command to all the banks, and these banks determine if they are the target bank. Once the target bank knows that it is the target, for a write command, it will tune-in its photonic receiver on the write-data bus. The memory controller places the data on that bus, and the target bank receives the data and performs the write operation, and for a read command, the target bank will perform its read operation and sends the data through the read data bus. Figure 3. 
• Split Photonic Bus: In this implementation, the long shared bus is divided into multiple branches. Laser power is delivered to the receivers of the command and write buses and to the modulators of the read bus on each branch, so the total laser power remains roughly a linear function of the number of banks. Splitting reduces the required optical laser power relative to the shared photonic bus, but replicating the photonic components across branches reduces the effective bandwidth density of the photonic devices.
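As a back-of-the-envelope illustration of how the bus organizations trade laser power against bank count (including the guided variant introduced next), the following sketch encodes only the scaling behaviours stated here and in [4]. The constants, and in particular the compounding tap-loss model for the shared bus, are our placeholder assumptions, not measured values.

# First-order laser-power scaling for the three bus organizations.
# Only the shapes follow the text: shared worst, split roughly linear
# in bank count, guided roughly constant. Constants are made up.
P0 = 1.0                       # placeholder per-receiver optical power unit

def shared_bus_power(n_banks, tap_loss=1.2):
    # Every receiver taps the same waveguide, so losses are modelled
    # here as compounding per drop (an assumption on our part).
    return P0 * tap_loss ** n_banks

def split_bus_power(n_banks):
    # Each branch is powered separately: roughly linear in bank count.
    return P0 * n_banks

def guided_bus_power(n_banks):
    # Demultiplexers steer power to just the target bank: ~constant.
    return P0

for n in (4, 16, 64):
    print(n, shared_bus_power(n), split_bus_power(n), guided_bus_power(n))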
Figure 4. Monolithically integrated silicon-photonic technology - Two DWDM links in opposite directions between a memory controller in a processor chip and a bank in a PIDRAM chip. λ1 is used for the request and λ2 is used for the response in the opposite direction on the same waveguides and fibre. [4] (Taken without permission)

Figure 5.a. Split photonic buses [4] (Taken without permission)

• Guided Photonic Bus: The optical power can be reduced further with this implementation. A guided photonic bus uses optical power guiding, in the form of demultiplexers, to actively direct power to just the target bank. This keeps the total power roughly constant and independent of the number of banks.

Figure 5.b. Guided photonic buses [4] (Taken without permission)

B. PIDRAM Chip Organization

We have discussed above the different ways in which the buses can be implemented photonically. The trade-off is that only a portion of the buses can be implemented photonically, with the rest remaining electrical; the design choice rests on the trade-offs in power and area, and the photonics can be gradually extended deeper into the PIDRAM chip.

Figure 6. PIDRAM chip floorplan [4] (Taken without permission)

The vertical electrical data bus can be partitioned into n partitions, with the photonic circuits replicated at each data access point for each bus partition. Partitioning the data bus allows the DRAM chip to use an energy-efficient photonic interconnect, although it increases the fixed link power and incurs higher optical losses.

III. BANK LEVEL

Each bank consists of multiple subarrays and a bank I/O. Data is accessed in the form of cache lines from each subarray, which requires activating the row containing the cache line, reading or writing the cache line, and precharging the subarray to prepare for subsequent requests. This section deals with a novel way to organize the banks and with a request-scheduling algorithm, both of which help increase the number of instructions executed.
A. PIDRAM Bank Organization

Most of the energy consumed in a DRAM chip is consumed by the banks themselves. Every array block access in a bank activates an array core, which in turn activates an entire array core row; of that row, only a few bits of data are actually used. Most of the bank energy therefore goes into waking up unnecessary bits. This waste can be reduced either by decreasing the array core row size, which reduces the number of unnecessary bits activated, or by increasing the number of I/Os per array core and using fewer array cores in parallel. Decreasing the array core row size leads to a greater area penalty, so access efficiency is better improved by increasing the number of I/Os per array core. On its own, the motivation for this change is modest, because the energy consumed by the banks is small compared to that of the electrical inter-chip and intra-chip interconnect, and the number of pins we can place on a chip is limited. The increased bandwidth, however, allows more banks per chip, enables energy savings, and does not significantly affect the area of the PIDRAM. This suggests that photonic technology will play an important role in future multiprocessor performance. Upcoming PIDRAMs should concentrate not only on high performance, low cost, and energy efficiency at the chip level, but should also support a large range of multi-chip configurations with different capacities and bandwidths.
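To see why waking an entire array core row is wasteful, here is a tiny worked calculation in Python. The row width and fetch width are illustrative assumptions, not figures from [4].

# Illustrative over-fetch arithmetic for one array core access.
# Assumed numbers: an 8192-bit array core row is activated to deliver
# a 64-bit slice of a cache line.
row_bits    = 8192           # bits activated by one array core row
useful_bits = 64             # bits actually read out per access

print(f"access efficiency: {useful_bits / row_bits:.2%}")   # ~0.78%

# Doubling the I/Os per array core doubles the useful bits per activation,
# which is the direction the PIDRAM bank reorganization takes.
print(f"with 2x I/O: {2 * useful_bits / row_bits:.2%}")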
B. Parallelism-aware batch scheduling

In a chip multiprocessor (CMP) system, the DRAM is a frequently used and contended resource, and inter-thread interference can destroy the bank-level access parallelism of individual threads. Bank-level parallelism [8][9] means that the requests made by a thread are serviced in parallel in different banks. Parallelism-aware batch scheduling builds on this by grouping outstanding requests into batches and servicing one batch at a time. The method has two parts:

i. Request Batching

A number of DRAM requests are grouped into a batch, based on their arrival times, and the batches are completed one after the other: all requests in the current batch are serviced before a new batch is formed, and the serviced batch is removed from the memory request buffer. When forming a new batch, the batching component limits how many requests issued by one thread for a given bank can be part of the batch. Batching ensures that all requests are eventually served, and it provides a uniform granularity that improves performance. Even in the presence of interference from other threads, the bank-level access parallelism of each thread is preserved. Prioritizing the oldest requests guarantees that the oldest batch is served first and prevents any thread from being starved in the DRAM system by other, potentially aggressive threads. Batching also reduces the serialization of a thread's requests, letting them execute in parallel rather than as if the thread were running alone in the memory system.

ii. Parallelism-Aware Within-Batch Scheduling

In this step, the requests of each thread in a batch are serviced in parallel across the DRAM banks. This hides latency inside the batch and increases processor throughput, since many requests are serviced in parallel. Within-batch scheduling tries to maximize:

• Row-buffer locality: bank accesses have lower latency when the row-hit rate within a batch is high.

• Intra-thread bank parallelism: scheduling multiple requests from one thread to different banks in parallel reduces that thread's stall time.

The scheduler uses thread prioritization to exploit both row-buffer locality and bank parallelism. Thread ranking follows the max rule, in which the scheduler ranks threads by their maximum number of marked requests to any single bank, with ties broken by the total rule, in which the scheduler tracks each thread's total number of marked requests, called its total load, and assigns the higher rank to the thread with the lower total load [9]. A compact sketch of the whole scheme follows.
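The following Python sketch condenses the batching and ranking rules above into runnable form. It is our own simplification of PAR-BS [9]; the request format, the per-thread, per-bank marking cap, and the row-hit tiebreak ordering are stated assumptions.

from collections import defaultdict

MARK_CAP = 2   # assumed per-thread, per-bank batching cap

def form_batch(requests):
    """Mark up to MARK_CAP oldest requests per (thread, bank).
    Unmarked requests would wait for the next batch in the real scheme."""
    marked, taken = [], defaultdict(int)
    for r in sorted(requests, key=lambda r: r["arrival"]):
        key = (r["thread"], r["bank"])
        if taken[key] < MARK_CAP:
            taken[key] += 1
            marked.append(r)
    return marked

def rank_threads(batch):
    """Max rule, then total rule: 'shorter jobs' get higher priority."""
    per_bank = defaultdict(lambda: defaultdict(int))
    total = defaultdict(int)
    for r in batch:
        per_bank[r["thread"]][r["bank"]] += 1
        total[r["thread"]] += 1
    threads = sorted(total, key=lambda t: (max(per_bank[t].values()), total[t]))
    return {t: rank for rank, t in enumerate(threads)}

def schedule(batch, open_rows):
    """Order one batch: row hits first, then thread rank, then age."""
    rank = rank_threads(batch)
    order, pending = [], list(batch)
    while pending:
        pending.sort(key=lambda r: (open_rows.get(r["bank"]) != r["row"],
                                    rank[r["thread"]], r["arrival"]))
        nxt = pending.pop(0)
        open_rows[nxt["bank"]] = nxt["row"]   # the row stays open after access
        order.append(nxt)
    return order

reqs = [{"thread": t, "bank": b, "row": w, "arrival": a}
        for a, (t, b, w) in enumerate([(0, 0, 1), (1, 0, 1), (0, 1, 2),
                                       (1, 1, 3), (0, 0, 1), (1, 0, 4)])]
for r in schedule(form_batch(reqs), {}):
    print(r)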
IV. SUBARRAY LEVEL

Each subarray consists of a two-dimensional array of DRAM cells whose data is accessed in terms of rows. A request is served through a row buffer that is common to all the rows of the subarray: the selected row's data is first transferred to the row buffer, and the data is then accessed from the row buffer. The following two sub-sections explain how accesses can be made faster by modifying the subarray.

A. Hierarchical multi-bank DRAM

Embedded DRAM (eDRAM) is dynamic random-access memory integrated on the same die or multi-chip module as an ASIC or microprocessor. eDRAM allows for wider buses and higher operation speeds, owing to the higher density of DRAM. However, eDRAM cannot handle the number of memory accesses generated by a high-performance processor, which creates a bottleneck: successive accesses that need the same bank must queue up and serialize. One solution is parallelism-aware batch scheduling (discussed in III.B). Another is simply to increase the number of independent DRAM banks in order to lower the probability of a conflict, but this requires a larger area. The number of independent banks can instead be increased without affecting the area much by letting a subarray act as a bank whenever the DRAM chip receives a request targeting that particular subarray. This allows the subarrays to act as semi-independent sub-banks [3].

After dividing the DRAM banks into subarrays, some additions and modifications have to be made to each subarray for it to act as a semi-independent sub-bank. The banks in a DRAM chip use registers and control logic to manage data accesses, so each subarray must gain a few pipeline registers and controls, a set-reset flip-flop to hold the subarray output, and buffers to hold the addresses for the access. The access queues of the DRAM must also be modified to detect accesses that do not conflict and to start those non-conflicting accesses in parallel.

Figure 7.a. Modifications made to each subarray [3] (Taken without permission)

Figure 7.b. Modifications made to the access queues [3] (Taken without permission)

This is a useful approach, since a large part of each DRAM access occurs only locally within an individual DRAM subarray. Individual subarrays within independent banks are controlled as semi-independent sub-banks that share the main bank's I/O circuitry and decoders. The sub-banks perform considerably better while creating only a small area penalty.

B. Fault tolerance in DRAMs

Errors occur often in DRAMs, leading to significant downtime in datacentres, so the DRAM architecture has to provide a high standard of reliability. Error-resilient schemes, called chipkills [5], can be built for such bit failures. Isolating an entire cache line to a single small subarray on a single DRAM chip allows us to read an entire cache line out of a single DRAM array, but it increases the potential for correlated errors. To provide chipkill-level reliability together with single-subarray access, checksums [5] stored for each cache line in the DRAM are introduced, similar to those used in hard drives. The checksum provides robust error detection, while chipkill-level correction is provided through a Redundant Array of Inexpensive DRAMs [6]: one chip serves as a parity check for several other chips, in the manner of a parity disk in a RAID. On an access, only one data chip out of every n is read, and the checksum associated with the read block tells the controller whether the read is correct. This approach is more effective in terms of area and energy than prior chipkill approaches, and its only cost is a performance penalty relative to a single-subarray memory system without chipkill.
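As a toy illustration of detection-plus-reconstruction along these lines, the following sketch pairs a per-line checksum with XOR parity across chips. The chip count, the checksum choice (CRC32), and the data layout are our assumptions, not the exact scheme of [5] or [6]; a single failed chip is assumed.

import zlib

N_DATA_CHIPS = 4                      # assumed: 4 data chips + 1 parity chip

def parity(lines):
    """XOR of the given cache lines (what the 'parity chip' stores)."""
    out = bytearray(len(lines[0]))
    for line in lines:
        for i, b in enumerate(line):
            out[i] ^= b
    return bytes(out)

def store(lines):
    """Store each line with its checksum, plus one parity line."""
    chips = [(line, zlib.crc32(line)) for line in lines]
    return chips, parity(lines)

def read(chips, par, chip_id):
    """Read one chip's line; on checksum mismatch, rebuild it from the
    surviving chips plus parity (assumes only this one chip failed)."""
    line, chk = chips[chip_id]
    if zlib.crc32(line) == chk:
        return line                              # checksum ok: normal read
    others = [l for i, (l, _) in enumerate(chips) if i != chip_id]
    return parity(others + [par])                # XOR of survivors + parity

lines = [bytes([i] * 8) for i in range(N_DATA_CHIPS)]
chips, par = store(lines)
chips[2] = (b"\xff" * 8, chips[2][1])            # corrupt chip 2's data
print(read(chips, par, 2) == lines[2])           # True: reconstructed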
Figure 8. Chipkill support in a Single Subarray memory system (64KB) [5] (Taken without permission)

V. ROW LEVEL

As the number of cores in a processor grows, so does the demand for off-chip memory, which exacerbates the main-memory access bottleneck. Many solutions have been proposed for this problem. One of them is to place an on-chip DRAM as the last level of cache to improve performance for a given off-chip memory bandwidth. This on-chip DRAM cache increases cache capacity through its high density, and it improves on-chip communication through a high-bandwidth, low-latency interconnect.

In a cache, the storage is mapped to the memory addresses it serves, and this mapping can be done in different ways. The choice of mapping is so critical to the design that the cache is often named after it, as in an N-way set-associative cache. The same goes for the DRAM. Each row in a subarray of a bank of a DRAM chip holds a series of DRAM cells, and all the rows in a subarray share a common row buffer that is used to access data. Implementing the DRAM in the cache requires a mapping between the DRAM rows and the main memory system. The first method below explains how set mapping works; later in the section we see how the row buffer can be modified to make data accesses faster.

A. Set mapping policy

As explained in the introduction, the DRAM cache is a multi-bank system, with each bank holding a number of rows. The DRAM cache uses a set mapping policy [1], in which memory blocks are mapped to a particular set of a particular row of a particular bank. The set mapping policy directly affects the throughput of the system through the DRAM cache miss rate and the DRAM cache hit latency, which makes it an important aspect of the cache. A small sketch of such a mapping follows Figure 9.

Figure 9. DRAM cache hierarchy (Intel) (Taken from the website)
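To make the block-to-(bank, row, set) mapping concrete, here is a minimal Python sketch of one simple policy. The geometry (banks, rows, sets per row) and the bit-slicing order are illustrative assumptions, not the policy proposed in [1].

# One simple set-mapping policy: slice the block address into
# set, bank, and row fields. Geometry is assumed for illustration.
N_BANKS, N_ROWS, SETS_PER_ROW = 8, 1024, 4

def map_block(block_addr):
    """Map a memory block address to (bank, row, set)."""
    set_idx = block_addr % SETS_PER_ROW          # low bits pick the set
    bank    = (block_addr // SETS_PER_ROW) % N_BANKS
    row     = (block_addr // (SETS_PER_ROW * N_BANKS)) % N_ROWS
    return bank, row, set_idx

# Consecutive blocks spread across sets and then banks, trading
# row-buffer locality for bank-level parallelism; a policy aiming at
# row-buffer hits would instead keep consecutive blocks in one row.
for addr in range(6):
    print(addr, map_block(addr))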
New DRAM set mapping policies are regularly proposed to reduce the DRAM cache miss rate, and higher associativity can also reduce DRAM latency via an improved row-buffer hit rate. A typical DRAM cache organization has multiple banks, each containing subarrays, and each subarray containing an array of rows and columns of DRAM cells. Each DRAM bank provides a row buffer, which buffers one row of the bank; data in a DRAM bank is accessed after it is fetched into the row buffer.

Figure 10. 29-way associativity for a 4KB row [1] (Taken without permission)

Associativity trades off hit ratio against search speed. A direct-mapped cache has a fast search but a poorer hit ratio, while a fully associative cache has a better hit ratio but a slower search. As associativity increases, the hit ratio improves and the search speed decreases, so a reasonable associativity must be chosen. As noted before, higher associativity decreases the cache miss rate significantly. The DRAM cache row is divided into tag blocks and cache lines, and each bank of the DRAM cache is associated with a row buffer that holds the last accessed row of that bank. If the associativity of the DRAM row organization is increased, a cache access first reads the tag blocks and then fetches only the matching cache line instead of the whole row, which reduces access latency. A higher-associativity cache may slightly increase tag latency compared to a lower-associativity one, but it benefits from fewer conflict misses. It also provides a higher row-buffer hit rate than a simple cache, because a larger number of consecutive memory blocks are mapped to the same set.

B. Modifying the row buffer

Present DRAM cache banks have a single row buffer. Replacing the single large row buffer with multiple smaller row buffers improves row hit rates and also reduces the energy required for row activation [7]. As explained earlier, on a read request the row's data is brought into the row buffer and read from there. Each row buffer is as wide as an entire row and holds a few KB of data, and the precharge operation, which writes the row buffer back to the appropriate row after a column read/write of the selected words, involves charging and discharging a large number of capacitors. In a multi-core processor's memory system, addresses are spread evenly across the memory banks to compensate for the relatively slow speed of DRAM, and this interleaving decreases the row-buffer hit rate. The reduced hit rate can be recovered by dividing the row buffer into multiple smaller row buffers. The new organization requires sub-row activation in addition to row-buffer selection, so the controller must supply additional address bits to the DRAM cache to select which sub-row to activate and which row buffer to load it into. The memory controller allocates and manages the row buffers, which gives the DRAM logic additional flexibility to implement many other buffer-allocation policies. A small simulation contrasting one large buffer with several sub-row buffers follows Figure 11.

Figure 11. Reorganized DRAM bank structure to support sub-rows and buffer selection [7] (Taken without permission)
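The following sketch compares the hit rate of one whole-row buffer against several smaller sub-row buffers of the same total capacity, on an interleaved access stream. The geometry, the LRU allocation policy, and the synthetic trace are our assumptions, not the exact design of [7].

from collections import OrderedDict
import random

ROW_SIZE, N_SUBROWS = 4096, 4          # assumed: 4KB row, four 1KB sub-rows

def hit_rate(trace, n_buffers, sub_size):
    """LRU over n_buffers buffers; each caches one sub-row (or full row)."""
    resident, hits = OrderedDict(), 0
    for row, offset in trace:
        key = (row, offset // sub_size)
        if key in resident:
            hits += 1
            resident.move_to_end(key)          # refresh LRU position
        else:
            resident[key] = True
            if len(resident) > n_buffers:
                resident.popitem(last=False)   # evict least recently used
    return hits / len(trace)

# Two threads with locality in different rows, interleaved by the controller.
random.seed(0)
trace = []
for _ in range(10000):
    if random.random() < 0.5:
        trace.append((3, random.randrange(1024)))         # hot 1KB in row 3
    else:
        trace.append((7, 2048 + random.randrange(1024)))  # hot 1KB in row 7

cfgs = {"one full-row buffer":      (1, ROW_SIZE),
        "four 1KB sub-row buffers": (4, ROW_SIZE // N_SUBROWS)}
for name, (n, size) in cfgs.items():
    print(name, f"{hit_rate(trace, n, size):.1%}")  # sub-rows hit far more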
VI. CONCLUSION

From the problems discussed above, it is clear that improving the memory system is a top priority for achieving greater speeds. DRAM plays an important role in the memory system, and therefore more techniques should be applied to improving it. DRAM is a hierarchical system with four levels, and the components at each level can be replaced, modified, or reorganized. All of the techniques discussed above improve DRAM efficiency significantly, and they do so in different ways: some decrease power, some increase throughput, and some hide latency. The final taxonomy we obtained by analysing the various techniques is shown in Figure 12. We can also conclude that photonic technology will play a crucial role in the future of processors and memory systems.

ACKNOWLEDGEMENT

I thank Dr. Soner Onder for his valuable comments on the earlier drafts and for being patient throughout the process.

Figure 12. Resulting taxonomy of our analysis:
• Chip level: PIDRAM memory channel organization; PIDRAM chip organization
• Bank level: PIDRAM bank organization; parallelism-aware batch scheduling
• Subarray level: hierarchical multi-bank DRAM; fault tolerance
• Row level: set mapping policy; row buffer modification

REFERENCES

[1] Hameed, F., Bauer, L., Henkel, J., "Architecting On-Chip DRAM Cache for Simultaneous Miss Rate and Latency Reduction," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. PP, no. 99, pp. 1-1, Oct. 2015.

[2] Donghyuk Lee, Yoongu Kim, Pekhimenko, G., Khan, S., Seshadri, V., Chang, K., Mutlu, O., "Adaptive-latency DRAM: Optimizing DRAM timing for the common-case," in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pp. 489-501, 7-11 Feb. 2015.

[3] T. Yamauchi, L. Hammond and K. Olukotun, "The hierarchical multi-bank DRAM: a high-performance architecture for memory integrated with processors," in Advanced Research in VLSI, 1997, Proceedings, Seventeenth Conference on, Ann Arbor, MI, 1997, pp. 303-319.
[4] Scott Beamer, Chen Sun, Yong-Jin Kwon, Ajay Joshi, Christopher Batten, Vladimir Stojanović, and Krste Asanović, "Re-architecting DRAM memory systems with monolithically integrated silicon photonics," in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), ACM, New York, NY, USA, pp. 129-140, 2010.

[5] Aniruddha N. Udipi, Naveen Muralimanohar, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, and Norman P. Jouppi, "Rethinking DRAM design and organization for energy-constrained multi-cores," in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), ACM, New York, NY, USA, pp. 175-186, 2010.

[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Elsevier, 2007.

[7] Gulur N., Manikantan R., Govindarajan R., Mehendale M., "Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs," in Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pp. 189-190, 10-14 Oct. 2011.

[8] Chang K. K.-W., Donghyuk Lee, Chishti Z., Alameldeen A. R., Wilkerson C., Yoongu Kim, Mutlu O., "Improving DRAM performance by parallelizing refreshes with accesses," in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pp. 356-367, 15-19 Feb. 2014.

[9] Mutlu, O., Moscibroda, T., "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems," in Computer Architecture, 2008, ISCA '08, 35th International Symposium on, pp. 63-74, 21-25 June 2008.

[10] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry, "RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), ACM, New York, NY, USA, pp. 185-197, 2013.