List of Tables
Table 1: Sequence of issued requests
Table 2: Memory system configuration
Table 3: Comparison of key metrics on baseline and implemented schedulers
List of Figures
Figure 1: Internal organization of modern DRAMs
Figure 2: Total Execution Time-RF-Set1
Figure 3: Total Execution Time-RF-Set2
Figure 4: Total Execution Time-BF-Set1
Figure 5: Total Execution Time-BF-Set2
Figure 6: EDP-RF Set1
Figure 7: EDP-RF Set2
Figure 8: EDP-BF Set1
Figure 9: EDP-BF Set2
Figure 10: Max Slowdown-RF-Set1
Figure 11: Max Slowdown-RF-Set2
Figure 12: Max Slowdown-BF-Set1
Figure 13: Max Slowdown-BF-Set2
CONTENTS
1 Introduction
    DRAM Configuration
    Memory Access Scheduling
2 Implementation
    Bank First Memory Scheduling
        Scheduler Implementation
        Algorithm
    Row First Memory Scheduling
        Scheduler Implementation
        Algorithm
3 Simulator Description
    DRAM Commands
4 Results
    Execution Time
    Energy Delay Product
    Maximum Slowdown
5 Conclusions
6 References
INTRODUCTION
DRAM Configuration:
DRAM (Dynamic Random-Access Memory), the most commonly used technology for building main
memory in modern computer systems, has been a major performance bottleneck for decades. Across
many generations of DRAM, from DDR1 to DDR3, the internal memory architecture and performance-
related characteristics of DRAM have changed little, and most modern DRAM systems use dual in-line
memory modules (DIMMs). A basic DRAM system consists of one or more channels, and each channel
has one or more memory modules. A modern DDR3 channel typically supports 1-2 DIMMs; each DIMM
typically consists of 1-4 ranks; and each rank can be partitioned into multiple (4-16) banks. Each bank
operates independently of the other banks and contains an array of memory cells that are accessed an
entire row at a time. When a row of this memory array is accessed (row activation), the entire row is
transferred into the bank's row buffer. The row buffer serves as a cache to reduce the latency of
subsequent accesses to that row. While a row is active in the row buffer, any number of reads or writes
(column accesses) may be performed, typically with a throughput of one per cycle. After the available
column accesses complete, the cached row must be written back to the memory array by an explicit
operation (bank pre-charge), which prepares the bank for a subsequent row activation.
Figure 1: Internal organization of modern DRAMs.
A bank cannot be accessed during the pre-charge/activate latency, a single cycle is required on the data
pins when switching between read and write column accesses, and a single set of address lines is shared
by all DRAM operations (bank pre-charge, row activation, and column access). The amount of bank
parallelism that is exploited and the number of column accesses made per row access dictate the
sustainable memory bandwidth of such a DRAM. A memory access scheduler must generate a schedule
that conforms to the timing and resource constraints of these modern DRAMs. Each DRAM operation
makes different demands on the three DRAM resources: the internal banks, a single set of address lines,
and a single set of data lines. The scheduler must ensure that the required resources are available for each
DRAM operation it schedules.
Each DRAM bank has two stable states: IDLE and ACTIVE. In the IDLE state, the bank is pre-charged
and ready for a row access. It remains in this state until a row activate operation is issued to it.
To issue a row activation, the address lines must be used to select the bank and the row being activated.
Row activation takes, say, 3 cycles, during which no other operations may be issued to that bank; during
that time, however, operations may be issued to other banks of the DRAM. Once the DRAM's row
activation latency has passed, the bank enters the ACTIVE state, during which the contents of the selected
row are held in the bank's row buffer. Any number of pipelined column accesses may be performed while
the bank is in the ACTIVE state. To issue either a read or a write column access, the address lines are
required to indicate the bank and the column of the active row in that bank. A write column access requires
the data to be transferred to the DRAM at the time of issue, whereas a read column access returns the
requested data three cycles later.
The bank will remain in the ACTIVE state until a pre-charge operation is issued to return it to the IDLE
state. The pre-charge operation requires the use of the address lines to indicate the bank which is to be
pre-charged. Like row activation, the pre-charge operation utilizes the bank resource for 3 cycles, during
which no new operations may be issued to that bank. Again, operations may be issued to other banks
during this time. After the DRAM’s pre-charge latency, the bank is returned to the IDLE state and is ready
for a new row activation operation. DRAMs typically also support column accesses with automatic
pre-charge, which implicitly pre-charges the DRAM bank as soon as possible after the column access. The
shared address and data resources serialize access to the different DRAM banks. While the state machines
for the individual banks are independent, only a single bank can perform a transition requiring a particular
shared resource each cycle. For many DRAMs, the bank, row, and column addresses share a single set of
lines. Hence, the scheduler must arbitrate between pre-charge, row, and column operations that all need
to use this single resource.
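The two-state bank behaviour described above can be captured in a small state machine. The sketch below is our own illustration, not USIMM's implementation; the 3-cycle activate/pre-charge latencies follow the example figures in the text rather than any specific DRAM part.

```python
# Minimal sketch of a DRAM bank state machine. While one bank is busy
# activating or pre-charging, other banks may accept commands, but only
# one command per cycle can use the shared address lines (not modeled).
IDLE, ACTIVE = "IDLE", "ACTIVE"
T_RCD = T_RP = 3  # activate / pre-charge latency in cycles (illustrative)

class Bank:
    def __init__(self):
        self.state = IDLE
        self.open_row = None
        self.busy_until = 0   # first cycle the bank can accept a new op

    def activate(self, row, now):
        # Row activation: IDLE -> ACTIVE, row lands in the row buffer.
        assert self.state == IDLE and now >= self.busy_until
        self.state, self.open_row = ACTIVE, row
        self.busy_until = now + T_RCD

    def precharge(self, now):
        # Bank pre-charge: ACTIVE -> IDLE, row buffer written back.
        assert self.state == ACTIVE and now >= self.busy_until
        self.state, self.open_row = IDLE, None
        self.busy_until = now + T_RP
```

Column accesses would be permitted only in the ACTIVE state, against `open_row`.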
Some DRAMs provide separate row and column address lines (each with its own associated bank
address) so that column and row accesses can be initiated simultaneously. To approach the peak data rate
with serialized resources, there must be enough column accesses to each row to hide the
pre-charge/activate latencies of other banks. Whether this can be achieved depends on the data
reference patterns and the order in which the DRAM is accessed to satisfy those references. The need to
hide the pre-charge/activate latency of the banks in order to sustain high bandwidth cannot be eliminated
by any DRAM architecture without reducing that latency itself, which would likely come at
the cost of decreased bandwidth or capacity, both of which are undesirable. Several scheduling schemes
have therefore been proposed to reduce the effective latency.
Memory Access Scheduling:
In a memory controller, the execution of a memory access instruction must adhere to the rules and timing
constraints of the hardware when accessing data in a modern DRAM. As shown in Figure 1, modern DRAMs
are three-dimensional memory devices with dimensions of bank, row, and column. Thus, a location in the
DRAM is identified by an address that consists of bank, row, and column fields. The steps of accessing a
location include a pre-charge, a row access, and then a column access. Due to the DRAM structure and its
hardware implementation, sequential accesses to different rows within one bank have high latency,
whereas accesses to different banks or to different words within a single row have low latency [9]. Memory
access scheduling can effectively reduce the average memory access latency and improve memory
bandwidth utilization by reducing cross-row data accesses. For example, prioritizing memory requests to
the same bank and the same row can improve performance.
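The bank/row/column address split and the row-hit versus row-miss latency gap can be sketched as follows. The field widths and the two latency constants are made-up illustrative values, not the configuration used in this project:

```python
# Illustrative address decomposition and open-row latency model showing
# why same-row accesses are cheap and cross-row accesses are costly.
NUM_BANKS, ROWS, COLS = 8, 4096, 128   # hypothetical geometry
ROW_HIT_LAT, ROW_MISS_LAT = 4, 10      # hypothetical cycle counts

def decode(addr):
    # Split a flat address into (bank, row, column) fields.
    col = addr % COLS
    row = (addr // COLS) % ROWS
    bank = (addr // (COLS * ROWS)) % NUM_BANKS
    return bank, row, col

open_rows = {}  # bank id -> currently open row

def access_latency(addr):
    bank, row, _ = decode(addr)
    if open_rows.get(bank) == row:
        return ROW_HIT_LAT           # column access only (row hit)
    open_rows[bank] = row            # pre-charge + activate + column access
    return ROW_MISS_LAT
```

Two consecutive accesses to the same row therefore pay the miss latency once and the hit latency afterwards, which is exactly the locality the schedulers below exploit.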
IMPLEMENTATION:
In this section we describe the two implemented memory scheduling algorithms, both of which aim to
exploit the locality properties of DRAM memory systems: Bank First memory scheduling and Row First
memory scheduling.
Bank First Memory Scheduling:
The bank-first policy [9] arranges all memory requests by bank and schedules them in a round-robin manner
according to the bank identifier. This policy is beneficial because requests to different banks can be
serviced simultaneously. For the request sequence shown in Table 1, the sequence of issued requests under
the bank-first policy is A-C-D-F-E-B-G-H-J-I.
Table 1: Sequence of Issued requests
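The round-robin reordering can be demonstrated on a small hypothetical request list (Table 1's actual contents are not reproduced here, so the `(name, bank)` pairs below are an illustrative stand-in):

```python
from collections import deque

def bank_first(reqs):
    # Group requests by bank in arrival order, then drain the banks
    # round-robin by bank identifier.
    queues = {}                        # bank id -> FIFO of request names
    for name, bank in reqs:
        queues.setdefault(bank, deque()).append(name)
    order = []
    while any(queues.values()):        # round-robin over bank identifiers
        for bank in sorted(queues):
            if queues[bank]:
                order.append(queues[bank].popleft())
    return order

print(bank_first([("A", 0), ("B", 0), ("C", 1), ("D", 2), ("E", 1), ("F", 3)]))
# → ['A', 'C', 'D', 'F', 'B', 'E']
```

B waits behind A even though it arrived second, because consecutive issues to the same bank would serialize on that bank.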
Scheduler Implementation:
1. The scheduler receives a random sequence of memory requests from the memory controller.
2. The scheduler checks both the read queue and the write queue (for draining conditions).
3. It initiates a write drain if either the write queue occupancy has reached the high watermark
(HI_WM) or there are no pending read requests.
4. If not in write drain mode, it initiates read drain mode.
5. In both draining modes, the write or read queue is scheduled according to the bank-first policy.
6. In either draining mode, requests are issued in a round-robin fashion based on the bank id:
a. First, look through all the requests (already arranged in order of arrival) in the respective
queue for the current draining mode.
b. If the next request is from the same bank as the previous one, do not schedule it;
instead schedule a request from a different bank.
c. The bank to be scheduled next is selected in a round-robin fashion.
Algorithm:
INPUT: Random sequence of memory access requests from m cores
OUTPUT: Scheduled sequence of memory requests to the memory controller
BEGIN:
B = 0
If write_drain == true
    Foreach request in write_queue
        If request.bank == B and request is issuable
            Issue request
            B = (B + 1) mod num_banks
        Else
            Precharge
Else
    Foreach request in read_queue
        If request.bank == B and request is issuable
            Issue request
            B = (B + 1) mod num_banks
        Else
            Precharge
END
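A minimal runnable sketch of the drain-mode selection and round-robin bank pick described above. `HI_WM`, `NUM_BANKS`, and the request dictionaries are illustrative stand-ins; a real USIMM scheduler would also check DRAM timing constraints before issuing:

```python
HI_WM = 40      # write-queue high watermark (illustrative)
NUM_BANKS = 8   # illustrative bank count

def choose_queue(read_queue, write_queue):
    # Steps 3-4: drain writes past the watermark or when no reads pend.
    if len(write_queue) >= HI_WM or not read_queue:
        return write_queue
    return read_queue

def next_request(queue, rr_bank):
    # Step 6: take the oldest request from bank rr_bank; if that bank has
    # nothing pending, advance round-robin to the next bank that does.
    for offset in range(NUM_BANKS):
        bank = (rr_bank + offset) % NUM_BANKS
        for i, req in enumerate(queue):
            if req["bank"] == bank:
                return queue.pop(i), (bank + 1) % NUM_BANKS
    return None, rr_bank
```

Each call removes one request from the chosen queue and returns the bank id to try next, so back-to-back issues to the same bank are avoided whenever another bank has work.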
Row First Memory Scheduling:
The row-first policy gives the highest priority to accesses to the same row of the same bank [10]. The row-
first policy essentially enhances the bank-first policy by grouping accesses to the same bank and same
row together. This optimization reduces row misses. For the request sequence shown in
Table 1, the sequence of issued requests under the row-first policy is A-B-J-C-D-G-I-F-H-E.
Scheduler Implementation:
This algorithm aims to maximize row hits and thus increase the overall hit rate across all requests. As
row buffer hits have much shorter latency and consume less power than row buffer misses, the scheduler
exploits row buffer hits as much as possible.
1. The scheduler receives a random sequence of memory requests from the memory controller.
2. The scheduler checks both the read queue and the write queue (for draining conditions).
3. It initiates a write drain if either the write queue occupancy has reached the high watermark
(HI_WM) or there are no pending read requests.
4. If not in write drain mode, it initiates read drain mode.
5. In both draining modes, the write or read queue is scheduled according to the row-first policy.
6. In either draining mode, requests are issued as follows:
a. First, look through all the requests (already arranged in order of arrival) in the respective
queue for the current draining mode.
b. If the next request is from the same bank and the same row, schedule it for execution.
c. If the next request is not from the same bank and row, set a flag bit and record the new
bank and row.
Algorithm:
INPUT: Random sequence of memory access requests from m cores
OUTPUT: Scheduled sequence of memory requests to the memory controller
BEGIN:
B = 0
R = 0
Flag = true
If write_drain == true
    Foreach request in write_queue
        If Flag == true
            B = request.bank
            R = request.row
            Flag = false
            Issue request
        Else If request.bank == B and request.row == R and request is issuable
            Issue request
        Else
            Flag = true
Else
    Foreach request in read_queue
        If Flag == true
            B = request.bank
            R = request.row
            Flag = false
            Issue request
        Else If request.bank == B and request.row == R and request is issuable
            Issue request
        Else
            Flag = true
END
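The core of the row-first pick can be sketched in a few lines. The queue entries and the `open_row` tuple are illustrative stand-ins for the scheduler's state; timing checks are omitted:

```python
# Prefer the oldest request that hits the currently open (bank, row);
# on a miss, fall back to the oldest request overall — the caller then
# records the newly opened row (the pseudocode's flag mechanism).
def pick_row_first(queue, open_row):
    for i, req in enumerate(queue):
        if (req["bank"], req["row"]) == open_row:
            return queue.pop(i)              # row-buffer hit
    return queue.pop(0) if queue else None   # row miss: oldest request
```

Repeatedly calling this with the (bank, row) of the last issued request drains all pending hits to an open row before paying the pre-charge/activate cost of opening another.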
SIMULATOR DESCRIPTION:
This project uses the USIMM simulation infrastructure, a trace-based full-system simulator, to build
and simulate the memory schedulers. Table 2 gives the memory system configuration used in our evaluation.
Table 2: Memory System Configuration
DRAM Commands
The memory commands can be partitioned into two groups: commands that advance the execution of a
pending memory request (read or write), and commands that manage general DRAM state.
Advancing the execution of a memory request involves four commands:
 PRE: Precharge the bitlines of a bank so a new row can be read out.
 ACT: Bring the contents of a bank's DRAM row into the bank's row buffer.
 COL-RD: Bring a cache line from the row buffer to the processor.
 COL-WR: Write a cache line from the processor to the row buffer.
DRAM state management commands include five memory commands, as follows:
 PWR-DN-FAST: Puts a rank into the low-power-mode with quick exit times.
 PWR-DN-SLOW: Puts a rank into the precharge-powerdown (slow) mode with longer time to
transition into the activate state.
 PWR-UP: Brings a rank out of low-power mode.
 Refresh: Forces a refresh to multiple rows in all banks in a rank.
 PRE-ALL: Forces a precharge to all banks in a rank.
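The first group of commands maps directly onto a request's progress through a bank, using USIMM-style command names. The request and row-buffer representation below is our own illustration:

```python
# Command sequence a single request needs, given the bank's row-buffer
# state (open_row is a (bank, row) tuple, or None if the bank is idle).
def commands_for(req, open_row):
    col = "COL-RD" if req["type"] == "read" else "COL-WR"
    if open_row == (req["bank"], req["row"]):
        return [col]                  # row hit: column access only
    if open_row is None:
        return ["ACT", col]           # bank idle: activate, then access
    return ["PRE", "ACT", col]        # row conflict: precharge first
```

The three cases correspond directly to the hit/idle/conflict latencies the schedulers try to manage.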
When the memory system is not busy, the PWR-DN-FAST and PWR-DN-SLOW commands can put memory
ranks into low-power mode to save power, and the PWR-UP command is needed to bring a rank out of
low-power mode. This project uses workloads from the USIMM simulator in one-channel and four-channel
configurations. The memory workloads with corresponding traces are divided into two sets, Set 1 and Set
2; the comparison of evaluation metrics with FCFS is done on Set 1, and detailed results are described in
the Results section of this report.
RESULTS:
Using USIMM traces, we simulated the Bank First (BF) and Row First (RF) memory scheduling
algorithms and compared them with the baseline FCFS algorithm. Three metrics were used in the
evaluation: the sum of thread execution times, the threads' maximum slowdown, and the energy-delay
product (EDP). We used several mixed workloads: 1-, 2-, and 4-thread workloads in the 1-channel model
and 1-, 2-, 4-, 8-, and 16-thread workloads in the 4-channel model. All results are tabulated in Table 3.
Table 3: Comparison of key metrics on baseline and implemented schedulers
| Workload | Config | Sum of exec times (10M cycles), FCFS / BF / RF | Max slowdown, FCFS / BF / RF | EDP (J·s), FCFS / BF / RF |
|---|---|---|---|---|
| MT-canneal | 1 Ch | 418 / 404 / 403 | NA / NA / NA | 4.23 / 3.98 / 3.97 |
| MT-canneal | 4 Ch | 179 / 167 / 167 | NA / NA / NA | 1.78 / 1.56 / 1.56 |
| bl-bl-fr-fr | 1 Ch | 149 / 147 / 147 | 1.20 / 1.18 / 1.18 | 0.50 / 0.48 / 0.48 |
| bl-bl-fr-fr | 4 Ch | 80 / 75.9 / 75.7 | 1.11 / 1.05 / 1.05 | 0.36 / 0.32 / 0.32 |
| c1-c1 | 1 Ch | 83 / 82.3 / 82.6 | 1.12 / 1.10 / 1.10 | 0.41 / 0.40 / 0.40 |
| c1-c1 | 4 Ch | 51 / 46.7 / 46.4 | 1.05 / 0.95 / 0.94 | 0.44 / 0.36 / 0.36 |
| c1-c1-c2-c2 | 1 Ch | 242 / 235 / 235 | 1.48 / 1.45 / 1.45 | 1.52 / 1.43 / 1.44 |
| c1-c1-c2-c2 | 4 Ch | 127 / 117 / 117 | 1.18 / 1.10 / 1.10 | 1.00 / 0.85 / 0.84 |
| c2 | 1 Ch | 44 / 43 / 43.1 | NA / NA / NA | 0.38 / 0.36 / 0.36 |
| c2 | 4 Ch | 30 / 27 / 27 | NA / NA / NA | 0.50 / 0.39 / 0.39 |
| fa-fa-fe-fe | 1 Ch | 228 / 224 / 224 | 1.52 / 1.47 / 1.47 | 1.19 / 1.15 / 1.14 |
| fa-fa-fe-fe | 4 Ch | 106 / 99.5 / 99.2 | 1.22 / 1.14 / 1.14 | 0.64 / 0.56 / 0.55 |
| fl-fl-sw-sw-c2-c2-fe-fe | 4 Ch | 295 / 279 / 279 | 1.40 / 1.31 / 1.31 | 2.14 / 1.89 / 1.88 |
| fl-fl-sw-sw-c2-c2-fe-fe-bl-bl-fr-fr-c1-c1-st-st | 4 Ch | 651 / 620 / 620 | 1.90 / 1.80 / 1.80 | 5.31 / 4.78 / 4.76 |
| fl-sw-c2-c2 | 1 Ch | 249 / 243 / 243 | 1.48 / 1.43 / 1.42 | 1.52 / 1.44 / 1.43 |
| fl-sw-c2-c2 | 4 Ch | 130 / 121 / 120 | 1.13 / 1.05 / 1.05 | 0.99 / 0.83 / 0.82 |
| st-st-st-st | 1 Ch | 162 / 160 / 158 | 1.28 / 1.25 / 1.24 | 0.58 / 0.56 / 0.56 |
| st-st-st-st | 4 Ch | 86 / 81.5 / 80 | 1.14 / 1.08 / 1.08 | 0.39 / 0.35 / 0.34 |
| Overall (PFP) | | 3312 / 3173 / 3167 | 1.90 / 1.80 / 1.80 | 23.88 / 21.69 / 21.60 |
Execution time:
Figures 2 to 5 show the total execution time of the implemented BF and RF algorithms. The BF and
RF algorithms outperform FCFS by 0.843% to 8.431% and 0.4812% to 9.019% respectively, as
evident from Table 3. The overall execution time is reduced by 4.14% for the BF scheduler and 4.32%
for the RF scheduler. For the multi-core cases, total execution time is calculated by summing the
execution times of all cores.
Figure 2: Total Execution Time-RF-Set1 Figure 3: Total Execution Time-RF-Set2
Figure 4: Total Execution Time-BF-Set1 Figure 5: Total Execution Time-BF-Set2
Energy delay product:
Figures 6 to 9 show the energy-delay product of the BF and RF algorithms respectively. The BF scheduler
improves the EDP by 3.613% to 18.1815% compared with the FCFS scheduler, while the RF scheduler
improves the EDP by 4.201% to 18.1812%. The overall EDP is reduced by 9.17% for the BF scheduler
and 9.54% for the RF scheduler.
Figure 6: EDP-RF Set1 Figure 7: EDP-RF Set2
Figure 8: EDP-BF Set1 Figure 9: EDP-BF Set2
Maximum Slowdown:
Figures 10 to 13 show the maximum slowdown metric of the BF and RF algorithms respectively. An
improvement of around 1.667% to 9.523% is seen for the BF algorithm compared with the FCFS
scheduler, and an improvement of around 1.667% to 10.476% is seen for the RF algorithm.
Figure 10: Max Slowdown-RF-Set1 Figure 11: Max Slowdown-RF-Set2
Figure 12: Max Slowdown-BF-Set1 Figure 13: Max Slowdown-BF-Set2
CONCLUSIONS:
We performed a comprehensive study to analyze existing scheduling policies, and the experimental
results confirm that memory scheduling policies have a great influence on memory waiting latency. We
considered results from the 3rd JILP Workshop on Computer Architecture Competitions (JWAC-3)
for comparison with the schemes we implemented.
Our results show better performance than FCFS and are on par with some of the schemes proposed in
the competition. The total EDP obtained is 49.4782 for the Row First scheme and 49.6698 for the Bank
First scheme, which is better than the Stride- and Global History-based DRAM Page Management
scheme. The execution time, PFP, and maximum slowdown of both schemes are on par with that
scheme. Performance could be improved further by adding a core-aware scheme on top of these basic
schemes.
REFERENCES:
[1] Thread-Fair Memory Request Reordering, Kun Fang, Nick Iliev, Ehsan Noohi, Suyu Zhang, and
Zhichun Zhu (University of Illinois at Chicago)
[2] The Compact Memory Scheduling Maximizing Row Buffer Locality, Young-Suk Moon, Yongkee
Kwon, Hong-Sik Kim, Dong-gun Kim, Hyungdong Hayden Lee, and Kunwoo Park (SK Hynix)
[3] High Performance Memory Access Scheduling using Compute-Phase Prediction and Writeback-
Refresh Overlap, Yasuo Ishii (The University of Tokyo, NEC Corporation) and Kouhei Hosokawa,
Mary Inaba, and Kei Hiraki (The University of Tokyo)
[4] Pre-Read and Write-Leak Memory Scheduling Algorithm, Long Chen, Yanan Cao, Sarah Kabala,
and Parijat Shukla (Iowa State University)
[5] Request Density Aware Fair Memory Scheduling, Takakazu Ikeda (Tokyo Institute of Technology),
Shinya Takamaeda-Yamazaki (Tokyo Institute of Technology / JSPS Research Fellow), and Naoki
Fujieda, Shimpei Sato, and Kenji Kise (Tokyo Institute of Technology)
[6] Priority Based Fair Scheduling: A Memory Scheduler Design for Chip-Multiprocessor
Systems, Chongmin Li, Dongsheng Wang, Haixia Wang, and Yibo Xue (Department of Computer
Science & Technology, Tsinghua University)
[7] Service Value Aware Memory Scheduler by Estimating Request Weight and Using per-Thread
Traffic Lights, Keisuke Kuroyanagi (INRIA/IRISA, The University of Tokyo) and Andre Seznec
(INRIA/IRISA)
[8] Stride- and Global History-based DRAM Page Management, Mushfique Junayed Khurshid, Mohit
Chainani, Alekhya Perugupalli, and Rahul Srikumar (University of Wisconsin-Madison)
[9] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens, “Memory
Access Scheduling”, Proceedings of the 27th International Symposium on Computer Architecture, 2000
[10] Jun Shao and Brian T. Davis, “A Burst Scheduling Access Reordering Mechanism”, Proceedings of
the 13th International Symposium on High-Performance Computer Architecture, 2007