In modern processors, which are multi-core (or multithreaded), the concurrently executing threads share the DRAM system, and different threads running on different cores can delay each other through resource contention. As the number of on-chip cores increases, the pressure on the DRAM system increases, as does the interference among threads sharing the system.
Uncontrolled inter-thread interference in DRAM scheduling results in major problems:
◦ The DRAM controller can unfairly prioritize some threads while starving more important threads for long time periods, as they wait to access memory.
◦ Low system performance.
We propose a new approach to providing fair and high-performance DRAM scheduling. Our scheduling algorithm, called parallelism-aware batch scheduling (PAR-BS), is based on two new key ideas: request batching and parallelism-aware DRAM scheduling.
DRAM requests are very long-latency operations that greatly impact the performance of modern processors. When a load instruction misses in the last-level on-chip cache and needs to access DRAM, the processor cannot commit that (and any subsequent) instruction because instructions are committed in program order to support precise exceptions.
The processor stalls until the miss is serviced by DRAM. Current processors try to reduce the performance loss due to a DRAM access by servicing other DRAM accesses in parallel with it.
Such techniques strive to overlap the latency of future DRAM accesses with the current access so that the processor does not need to stall (for long) on future DRAM accesses. Instead, at an abstract level, the processor stalls once for all overlapped accesses rather than stalling once for each access in a serialized fashion.
In a single-threaded, single-core system, a thread has exclusive access to the DRAM banks, so its concurrent DRAM accesses are serviced in parallel as long as they aren’t to the same bank.
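The difference between stalling once and stalling per access can be made concrete with a small sketch. It assumes an idealized model that is not part of the original material: every bank access takes the same latency, accesses to different banks overlap fully, and accesses to the same bank are serviced one after another. The function names and the `bank_latency` parameter are hypothetical.

```python
from collections import defaultdict


def serialized_stall(banks, bank_latency=1):
    """Stall time if every access is serviced one after another."""
    return len(banks) * bank_latency


def overlapped_stall(banks, bank_latency=1):
    """Stall time if accesses to different banks overlap.

    `banks` lists the bank id targeted by each of one thread's
    concurrent accesses. Accesses to the same bank still serialize,
    so the exposed stall is the busiest bank's total service time.
    """
    busy = defaultdict(int)
    for bank in banks:
        busy[bank] += bank_latency
    return max(busy.values(), default=0)


# Two concurrent misses to different banks: one latency is exposed.
print(overlapped_stall([0, 1]), serialized_stall([0, 1]))
# Two concurrent misses to the same bank: no overlap is possible.
print(overlapped_stall([0, 0]), serialized_stall([0, 0]))
```

Under these assumptions the thread sees a single bank access latency whenever its concurrent accesses fall in different banks, which is the bank-level parallelism the rest of the discussion tries to preserve.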
Request1’s (Req1) latency is hidden by the latency of Request0 (Req0), effectively exposing only a single bank access latency to the thread’s processing core. Once Req0 is serviced, the core can commit Load 0 and thus enable the decode/execution of future instructions. When Load 1 becomes the oldest instruction in the window, its miss has already been serviced and therefore the processor can continue computation without stalling.
If multiple threads are generating memory requests concurrently (e.g. in a CMP system), modern DRAM controllers schedule the outstanding requests in a way that completely ignores the inherent memory-level parallelism of threads. Instead, current DRAM controllers exclusively seek to maximize the DRAM data throughput, i.e., the number of DRAM requests serviced per second. As we show in this paper, blindly maximizing the DRAM data throughput does not minimize a thread’s stall-time (which directly correlates with system throughput).
The example in Figure 2 illustrates how parallelism-unawareness can result in suboptimal CMP system throughput and increased stall times. We assume two cores, each running a single thread, Thread 0 (T0) and Thread 1 (T1). Each thread has two concurrent DRAM requests caused by consecutive independent load misses (Load 0 and Load 1), and the requests go to two different DRAM banks.
With a conventional parallelism-unaware DRAM scheduler, the requests can be serviced in their arrival order shown in Figure 2. First, T0’s request to Bank 0 is serviced in parallel with T1’s request to Bank 1. Later, T1’s request to Bank 0 is serviced in parallel with T0’s request to Bank 1. This service order serializes each thread’s concurrent requests and therefore exposes two bank access latencies to each core.
As shown in the execution timeline, instead of stalling once (i.e. for one bank access latency) for the two requests, both cores stall twice. Core 0 first stalls for Load 0, and shortly thereafter also for Load 1. Core 1 stalls for its Load 0 for two bank access latencies.
In contrast, a parallelism-aware scheduler services each thread’s concurrent requests in parallel, resulting in the service order and execution timeline shown in Figure 2. The scheduler preserves bank-parallelism by first scheduling both of T0’s requests in parallel, and then T1’s requests. This enables Core 0 to execute faster (shown as “Saved cycles” in the figure) as it stalls for only one bank access latency. Core 1’s stall time remains unchanged: although its second request (T1-Req1) is serviced later than with a conventional scheduler, T1-Req0 still hides T1-Req1’s latency.
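The two service orders described for Figure 2 can be replayed with a short sketch. It is a simplified model rather than the paper's simulator: each bank services its queue in order, every request takes one bank access latency, and a thread's stall is approximated by the time at which its last request completes (the small amount of computation between Load 0 and Load 1 is ignored). The function and variable names are hypothetical.

```python
def completion_times(bank_queues, bank_latency=1):
    """Return {thread: time at which its last request finishes}.

    `bank_queues` maps a bank name to the ordered list of threads
    whose requests that bank services, one request per entry.
    """
    finish = {}
    for queue in bank_queues.values():
        for position, thread in enumerate(queue, start=1):
            done = position * bank_latency
            finish[thread] = max(finish.get(thread, 0), done)
    return finish


# Arrival order: the two banks service the threads in opposite order.
conventional = {"bank0": ["T0", "T1"], "bank1": ["T1", "T0"]}
# Parallelism-aware order: both of T0's requests go first, then T1's.
parallelism_aware = {"bank0": ["T0", "T1"], "bank1": ["T0", "T1"]}

print(completion_times(conventional))       # {'T0': 2, 'T1': 2}
print(completion_times(parallelism_aware))  # {'T0': 1, 'T1': 2}
```

Both orders service the same four requests in two bank access latencies, so DRAM throughput is identical. Only the parallelism-aware order lets T0 finish after a single latency, while T1 is no worse off.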
Based on the observation that inter-thread interference destroys the bank-level parallelism of the threads running concurrently on a CMP and therefore degrades system throughput, we incorporate parallelism-awareness into the design of our fair and high-performance memory access scheduler.
Our PAR-BS controller is based on two key principles. The first principle is parallelism-awareness. To preserve a thread’s bank-level parallelism, a DRAM controller must service a thread’s requests (to different banks) back to back (that is, one right after another, without any interfering requests from other threads). This way, each thread’s request service latencies overlap.
The second principle is request batching. If performed greedily, servicing requests from a thread back to back could cause unfairness and even request starvation. To prevent this, PAR-BS groups a fixed number of oldest requests from each thread into a batch, and services the requests from the current batch before all other requests.
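A minimal sketch of batch formation, following only the description above: take up to a fixed number of the oldest requests from each thread and service them before anything else. The request tuple layout, the `form_batch` name, and the `cap` parameter are assumptions made for illustration. The sketch does not model how requests inside a batch are ordered, and it is not presented as the controller's actual implementation.

```python
from collections import defaultdict


def form_batch(requests, cap):
    """Split pending requests into (batch, remaining).

    `requests` is a list of (arrival_time, thread, bank) tuples.
    Up to `cap` of the oldest requests of each thread join the batch;
    everything else waits for a later batch.
    """
    taken = defaultdict(int)
    batch, remaining = [], []
    for request in sorted(requests, key=lambda r: r[0]):  # oldest first
        thread = request[1]
        if taken[thread] < cap:
            taken[thread] += 1
            batch.append(request)
        else:
            remaining.append(request)
    return batch, remaining


pending = [
    (0, "T0", 0), (1, "T0", 1), (2, "T0", 0),
    (3, "T1", 1), (4, "T0", 1),
]
batch, remaining = form_batch(pending, cap=2)
print(batch)      # [(0, 'T0', 0), (1, 'T0', 1), (3, 'T1', 1)]
print(remaining)  # [(2, 'T0', 0), (4, 'T0', 1)]
```

In this example T0 issues many requests, yet T1's single request is placed in the current batch ahead of T0's third and fourth requests. This is how a per-thread cap bounds how long any thread can be delayed by a more memory-intensive one.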