Name: Lin Yang, Department of EE&CS,

Ohio University, Stocker Center, Athens, OH 45701,

linyang@bobcat.ent.ohiou.edu.




                                             CS 558

                                 OPERATING SYSTEM 2

                                             Spring 2003

                                   Instructor: Dr. Frank Drews

                                          Due: 05/31/2003

                                           (Final Version)




                    The Virtual Memory Management of Linux

                                      (Research Report)

                                              Author:
                                              Lin Yang




1. Introduction

Linux is outstanding in the area of memory management; it uses every scrap of memory in a system to
its full potential. For example: (1) The Linux kernel itself is smaller and more efficient than the NT
kernel. NT typically occupies more memory than the Linux kernel, leaving extra memory for applications
instead of just the operating system. (2) Linux uses a copy-on-write scheme. If two or more programs use
the same block of memory, only one copy is actually kept in RAM, and all the programs read that same
block. If one program writes to the block, a copy is made for that program alone; all the other
programs still share the original. When loading things like shared objects, this is a major memory
saver. (3) Demand loading is very useful as well. Linux loads into RAM only the portions of a program
that are actually being used, which reduces overall RAM requirements significantly. Likewise, when
swapping is necessary, only portions of programs are swapped out to disk, not entire processes, which
greatly improves multiprocessing performance. (4) Finally, any RAM not being used by the kernel or
applications is automatically used as a disk cache, which speeds access to the disk as long as there is
unused memory.

The Linux virtual memory system is responsible for maintaining the address space visible to each process.
It creates pages of virtual memory on demand and manages the loading and swapping of those pages.
Virtual memory provides a way of running more processes than can fit within a computer's physical
address space at once. Each process that is a candidate for running on a processor is allocated its own
virtual memory area, which defines the logical set of addresses the process can access to carry out its
task. Because this total virtual address space is very large (constrained mainly by the number of
address bits the processor has and the maximum number of processes it supports), each process can be
allocated a large logical address space (typically 3 GB) in which to operate. It is the job of the virtual
memory manager to ensure that active processes, and the areas they wish to access, are remapped to
physical memory as required. This is achieved by swapping or paging the required sections (pages) into
and out of physical memory. Swapping replaces a complete process in memory with another, whereas
paging removes a 'page' (typically 2-4 KB) of one process's mapped memory and replaces it with a page
from another process. As this can be a compute-intensive and time-consuming task, care is taken to
minimize its overhead. This is done by using algorithms designed to exploit the locality of related
sections of code, and by carrying out operations such as memory duplication or reading only when
absolutely required (techniques known as copy-on-write, lazy paging, and demand paging).
The virtual memory owned by a process may contain code and data from many sources. Executable code
may be shared between processes in the form of shared libraries; as these areas are read-only, there is
little chance of them becoming corrupted. Processes can also allocate and link virtual memory to use
during their processing. The memory management techniques used by Linux include the following:

      Page-based protection        Each virtual page has a set of flags which determine the
      mechanism                    types of access allowed in user mode or kernel mode.

      Demand paging / lazy         The virtual memory of a process is brought into physical
      reading                      memory only when the process attempts to use it.

      Kernel and user modes of     A process has unrestricted access to memory in kernel mode,
      operation                    but access only to its own memory in user mode.

      Mapped files                 Memory is extended by allowing disk files to be used as a
                                   staging area for pages swapped out of physical memory.

      Copy-on-write memory         When two processes require access to a common area of
                                   code, the virtual memory manager does not copy the section
                                   immediately; if only read access is required, the section
                                   may be used safely by both processes. Only when a write is
                                   requested does the copy take place.

      Shared memory                An area of memory may be mapped into the address space of
                                   more than one process by calling privileged operations.

      Memory locking               To ensure a critical page can never be swapped out of
                                   memory, it may be locked in; the virtual memory manager
                                   will then not remove it.

In this research report, we focus on the virtual memory management of Linux, especially its page
replacement and swapping technology. The rest of the report is organized as follows: section 2
introduces the page replacement algorithm in Linux; section 3 introduces the swapping and caching
technology in Linux; section 4 discusses some problems of virtual memory management; section 5
concludes the report.



                                  2. Page replacement algorithm in Linux

Before we introduce the algorithm used in Linux, we need to introduce the concept of the PTE cache. All
modern computers designed for virtual memory incorporate a special hardware cache called a PTE cache
or TLB (Translation Lookaside Buffer), which caches page table entries in the CPU, so that the CPU
usually does not have to probe the page table to find the PTE that lets it translate an address.


The PTE cache is the magic gadget that makes virtual memory practical. Without it the CPU would have
to do extra main memory reads for every read or write instruction executed by the running program, just
to look up the PTE that lets it translate a virtual address into a physical one.


Rather than looking up a PTE in the page table each time it needs to translate an address, the CPU looks
in its page table entry cache to find the right page table entry. If it is there already, the CPU reuses
it without actually traversing the page table. Occasionally the PTE cache does not hold the PTE it
needs, so the CPU loads the needed entry from the page table and caches it. Note that a PTE cache does
not cache normal data---it caches only address translation information from the page table. A page table
entry is very small, and the PTE cache holds a relatively small number of them (depending on the CPU,
usually somewhere from 32 to 1024). This means that PTE cache misses are a couple of orders of
magnitude more common than page faults---any time you touch a page you have not touched fairly
recently, you are likely to miss the PTE cache. This is not usually a big deal, because PTE cache misses
are many orders of magnitude cheaper than page faults---you only need to fetch a PTE from main memory,
not fetch a page from disk.


A PTE cache is very fast on a hit, and is able to translate addresses in a fraction of an instruction cycle.
This translation can generally be overlapped with other parts of instruction setup, so the PTE hardware
gives you virtual memory support at essentially zero time cost.

Having introduced the PTE cache, we can now describe the core page replacement algorithm used in Linux.
The main component of the VM replacement mechanism is a clock algorithm. Clock algorithms are
commonly used because they provide a reasonable approximation of LRU replacement and are cheap to
implement. (All common general-purpose CPUs have hardware support for clock algorithms, in the form
of the reference bit maintained by the PTE cache. This hardware support is very simple and fast, which is
why all designers of modern general-purpose CPUs include it.)


A little refresher on the general idea of clock algorithms: a clock algorithm cycles slowly through the
pages that are in RAM, checking whether they have been touched (and perhaps dirtied) lately. For this,
the hardware-supported reference and dirty bits of the page table entries are used. The reference bit is
automatically set by the PTE cache hardware whenever the page is touched: a flag bit is set in the page
table entry, and if the PTE is evicted from the PTE cache, it is written back to its home position in the
page table. The clock algorithm can therefore examine the reference bits in page table entries to
"examine" the corresponding pages.


The basic idea of the clock algorithm is that a slow incremental sweep repeatedly cycles through all of
the cached (in-RAM) pages, noticing whether each page has been touched (and perhaps dirtied) since the
last time it was examined. If a page's reference bit is set, the clock algorithm does not consider it for
eviction on this cycle, and continues its sweep looking for a better candidate for eviction. Before
continuing its sweep, however, it resets the reference bit in the page table entry. Resetting the reference
bit ensures that the next time the page is reached in the cyclic sweep, the bit will indicate whether the
page was touched since this time. Visiting all of the pages cyclically ensures that a page is only
considered for eviction if it has not been touched for at least a whole cycle.


The clock algorithm proceeds in increments, usually sweeping a small fraction of the in-memory pages
at a time, and keeps a record of its current position between increments of sweeping. This allows it to
resume its sweep from that page at the next increment. Technically, this simple clock scheme is known
as the "second chance" algorithm, because it gives a page a second chance to stay in memory---one more
sweep cycle.


More refined versions of the clock algorithm may keep multiple bits, recording whether a page has been
touched in the last two cycles, or even three or four. Only one hardware-supported bit is needed for this,
however: rather than just testing the hardware-supported bit, the clock hand records the current value of
the bit before resetting it, for use the next time around. Intuitively, it would seem that the more bits are
used, the more precise an approximation of LRU we would get, but that is usually not the case. Once two
bits are used, clock algorithms do not generally get much better, due to fundamental weaknesses of clock
algorithms. Linux uses a simple second-chance (one-bit clock) algorithm, more or less, but with several
elaborations and complications.


The main clock algorithm is implemented by the kernel swap daemon, a kernel thread that runs the
procedure kswapd(). kswapd runs an infinite loop which incrementally scans all the normal VM pages
subject to paging, then starts over. kswapd generally does its clock sweeping in increments, and sleeps
in between increments so that normal processes may run. The page-out daemon should usually be able to
keep enough memory free, but if it cannot, programs end up calling the page-out code themselves. The
following is the Linux kernel source code that implements this algorithm:


static int swap_out(unsigned int priority, int gfp_mask)
{
	int counter;
	int __ret = 0;

	counter = (nr_threads << SWAP_SHIFT) >> priority;
	if (counter < 1)
		counter = 1;

	for (; counter >= 0; counter--) {
		struct list_head *p;
		unsigned long max_cnt = 0;
		struct mm_struct *best = NULL;
		int assign = 0;
		int found_task = 0;
	select:
		spin_lock(&mmlist_lock);
		p = init_mm.mmlist.next;
		/* Pick the mm with the largest remaining swap quota. */
		for (; p != &init_mm.mmlist; p = p->next) {
			struct mm_struct *mm = list_entry(p, struct mm_struct, mmlist);
			if (mm->rss <= 0)
				continue;
			found_task++;
			/* Refresh each mm's quota, proportional to its RSS. */
			if (assign == 1) {
				mm->swap_cnt = (mm->rss >> SWAP_SHIFT);
				if (mm->swap_cnt < SWAP_MIN)
					mm->swap_cnt = SWAP_MIN;
			}
			if (mm->swap_cnt > max_cnt) {
				max_cnt = mm->swap_cnt;
				best = mm;
			}
		}

		/* Make sure the chosen mm doesn't disappear under us. */
		if (best)
			atomic_inc(&best->mm_users);
		spin_unlock(&mmlist_lock);

		if (!best) {
			/* No quota left anywhere: assign fresh quotas and retry. */
			if (!assign && found_task > 0) {
				assign = 1;
				goto select;
			}
			break;
		} else {
			__ret = swap_out_mm(best, gfp_mask);
			mmput(best);
			break;
		}
	}
	return __ret;
}


                                  3. Swapping and caching technology in Linux

Linux performs a clock sweep over the *virtual* pages, by cycling through each process's pages in
address order. For this it uses the vm_area mappings and page tables of the processes, so that it can scan
the pages of each process sequentially.


Rather than sweeping through all of the pages of an entire process before switching to another, the main
clock tries to evict a batch of pages from one process and then moves on to another process. It visits all
of the (pageable) processes and then repeats. The effect is a large number of distinct clock sweeps, one
per pageable process, with the overall clock sweep advancing each of these smaller sweeps periodically.


The following considerations led to this design:
    •      Related pages should be paged out together, to increase locality in the paging store (so-called
           swap files or swap partitions). By evicting a moderate number of virtual pages from a given
           process, in virtual address order, the sweep through virtual address space tends to group related
           pages together in the paging store.
    •      By alternating between processes at a coarser granularity, it avoids evicting a large number of
           pages from a given victim process---after it's evicted a reasonable number of pages from a
           particular victim, it moves on to another to provide some semblance of fairness between the
           processes.
    •      The use of a main clock over processes and virtual address pages and a secondary clock over page
           frames provides a way of combining the hardware-supported virtual page reference bits to get
           recency-of-touch information about logical pages stored in page frames.



    •      The secondary clock (and the use of a separate per-page-frame PG_referenced bit maintained in
           software) can act as an additional "aging" period for pages that are evicted from the main clock.
           A page can be held in the "swap cache" after being evicted from the main clock, and allowed to
           age a while before being evicted from RAM.
The swap cache is just a set of page frames holding logical pages that have been evicted from the main
clock, but whose contents have not yet been discarded. The contents of page frames need not be
copied to "move" them into the swap cache---rather, the page frame is simply marked as "swap cached"
by the main clock algorithm, and linked into a hash table that holds all of the page frames currently
constituting the swap cache. The following is part of the source code used in Linux to operate the swap
cache.


#ifdef SWAP_CACHE_INFO
void show_swap_cache_info(void)
{
	printk("Swap cache: add %ld, delete %ld, find %ld/%ld\n",
		swap_cache_add_total,
		swap_cache_del_total,
		swap_cache_find_success, swap_cache_find_total);
}
#endif

void add_to_swap_cache(struct page *page, swp_entry_t entry)
{
	unsigned long flags;

#ifdef SWAP_CACHE_INFO
	swap_cache_add_total++;
#endif
	if (!PageLocked(page))
		BUG();
	if (PageTestandSetSwapCache(page))
		BUG();
	if (page->mapping)
		BUG();
	flags = page->flags & ~((1 << PG_error) | (1 << PG_arch_1));
	page->flags = flags | (1 << PG_uptodate);
	add_to_page_cache_locked(page, &swapper_space, entry.val);
}

static inline void remove_from_swap_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;

	if (mapping != &swapper_space)
		BUG();
	if (!PageSwapCache(page) || !PageLocked(page))
		PAGE_BUG(page);

	PageClearSwapCache(page);
	ClearPageDirty(page);
	remove_inode_page(page);
}

                              4. Problems of virtual memory management in Linux

There are several possible problems with the page replacement algorithm in Linux, in my opinion, which
can be listed as follows:
    •   The system may react badly to variable VM load or to load spikes after a period of no VM activity.
        Since kswapd, the page-out daemon, only scans when the system is low on memory, the
        system can end up in a state where some pages have reference bits from the last 5 seconds, while
        other pages have reference bits from 20 minutes ago. This means that on a load spike the system
        has no clue which are the right pages to evict from memory. This can lead to a swapping storm,
        where the wrong pages are evicted and almost immediately afterwards faulted back in, leading
        to the page-out of another random page, and so on.
    •   There is no method to prevent possible memory deadlock. With the arrival of journaling and
        delayed-allocation file systems, it is possible that the system will need to allocate memory in order
        to free memory, that is, to write out data so memory can become free. It may be useful to
        introduce some algorithm to prevent this possible deadlock under extremely low memory situations.


                                                    5. Conclusion
The virtual memory management system of Linux, especially its paging and swapping technologies, has
been introduced in this report, and some problems with these strategies have been discussed.



                                                                                                                     9

More Related Content

What's hot

Process management
Process managementProcess management
Process managementMohd Arif
 
Memory Management in OS
Memory Management in OSMemory Management in OS
Memory Management in OSvampugani
 
Memory Management in OS
Memory Management in OSMemory Management in OS
Memory Management in OSKumar Pritam
 
Virtual Memory
Virtual MemoryVirtual Memory
Virtual Memoryvampugani
 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsPeter Tröger
 
Processes and Threads in Windows Vista
Processes and Threads in Windows VistaProcesses and Threads in Windows Vista
Processes and Threads in Windows VistaTrinh Phuc Tho
 
Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Pankaj Suryawanshi
 
Operating Systems Part III-Memory Management
Operating Systems Part III-Memory ManagementOperating Systems Part III-Memory Management
Operating Systems Part III-Memory ManagementAjit Nayak
 
Chapter 9 OS
Chapter 9 OSChapter 9 OS
Chapter 9 OSC.U
 
8 memory management strategies
8 memory management strategies8 memory management strategies
8 memory management strategiesDr. Loganathan R
 
Driver development – memory management
Driver development – memory managementDriver development – memory management
Driver development – memory managementVandana Salve
 

What's hot (20)

Process management
Process managementProcess management
Process management
 
Os unit 3
Os unit 3Os unit 3
Os unit 3
 
Memory Management in OS
Memory Management in OSMemory Management in OS
Memory Management in OS
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Memory management
Memory managementMemory management
Memory management
 
Cs8493 unit 3
Cs8493 unit 3Cs8493 unit 3
Cs8493 unit 3
 
OSCh14
OSCh14OSCh14
OSCh14
 
Memory Management in OS
Memory Management in OSMemory Management in OS
Memory Management in OS
 
Virtual Memory
Virtual MemoryVirtual Memory
Virtual Memory
 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management Concepts
 
Processes and Threads in Windows Vista
Processes and Threads in Windows VistaProcesses and Threads in Windows Vista
Processes and Threads in Windows Vista
 
Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)
 
virtual memory
virtual memoryvirtual memory
virtual memory
 
Ch4 memory management
Ch4 memory managementCh4 memory management
Ch4 memory management
 
Operating Systems Part III-Memory Management
Operating Systems Part III-Memory ManagementOperating Systems Part III-Memory Management
Operating Systems Part III-Memory Management
 
Chapter 9 OS
Chapter 9 OSChapter 9 OS
Chapter 9 OS
 
8 memory management strategies
8 memory management strategies8 memory management strategies
8 memory management strategies
 
Driver development – memory management
Driver development – memory managementDriver development – memory management
Driver development – memory management
 
Cs8493 unit 4
Cs8493 unit 4Cs8493 unit 4
Cs8493 unit 4
 
Cs8493 unit 2
Cs8493 unit 2Cs8493 unit 2
Cs8493 unit 2
 

Viewers also liked

Data Breaches Preparedness (Credit Union Conference Session)
Data Breaches Preparedness (Credit Union Conference Session)Data Breaches Preparedness (Credit Union Conference Session)
Data Breaches Preparedness (Credit Union Conference Session)NAFCU Services Corporation
 
Cidade da cultura
Cidade da culturaCidade da cultura
Cidade da culturamercerey84
 
1 hdc de thi thu truong thpt chuyen le quy don quang tri nam 2015
1 hdc de thi thu truong thpt chuyen le quy don quang tri nam 20151 hdc de thi thu truong thpt chuyen le quy don quang tri nam 2015
1 hdc de thi thu truong thpt chuyen le quy don quang tri nam 2015Giang Hồ Tiếu Ngạo
 
Taller de redaccion 2da semana
Taller de redaccion 2da semanaTaller de redaccion 2da semana
Taller de redaccion 2da semanaCarlos Mendez
 
Taller para cartografos de suelos
Taller para cartografos de suelosTaller para cartografos de suelos
Taller para cartografos de suelosCarlos Mendez
 
«Как учиться эффективно?»
«Как учиться эффективно?»«Как учиться эффективно?»
«Как учиться эффективно?»Dmitrii Morovov
 
Modulo maestria fisico parte 3
Modulo maestria fisico parte 3Modulo maestria fisico parte 3
Modulo maestria fisico parte 3Carlos Mendez
 
Stephen Moker Creative Snapshot
Stephen Moker Creative SnapshotStephen Moker Creative Snapshot
Stephen Moker Creative Snapshotstephenmoker
 
Market challenge & opportunities in construction sector in ksa
Market challenge & opportunities in construction sector in ksaMarket challenge & opportunities in construction sector in ksa
Market challenge & opportunities in construction sector in ksaSamer MOBAYED
 
Spanish 2 tema 5 pronouns
Spanish 2 tema 5 pronounsSpanish 2 tema 5 pronouns
Spanish 2 tema 5 pronounssc12276405mhs
 
Двор 2.0 (презентация) ЗШ 2012
Двор 2.0 (презентация) ЗШ 2012Двор 2.0 (презентация) ЗШ 2012
Двор 2.0 (презентация) ЗШ 2012Dmitrii Morovov
 
Daily Affirmations
Daily AffirmationsDaily Affirmations
Daily Affirmationsmartyncgreen
 
Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010
Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010
Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010Jordi Recasens
 
2005 cpr 之修訂1
2005 cpr 之修訂12005 cpr 之修訂1
2005 cpr 之修訂1u001072
 

Viewers also liked (20)

Jenis jenis batuk
Jenis jenis batukJenis jenis batuk
Jenis jenis batuk
 
Data Breaches Preparedness (Credit Union Conference Session)
Data Breaches Preparedness (Credit Union Conference Session)Data Breaches Preparedness (Credit Union Conference Session)
Data Breaches Preparedness (Credit Union Conference Session)
 
Cidade da cultura
Cidade da culturaCidade da cultura
Cidade da cultura
 
1 hdc de thi thu truong thpt chuyen le quy don quang tri nam 2015
1 hdc de thi thu truong thpt chuyen le quy don quang tri nam 20151 hdc de thi thu truong thpt chuyen le quy don quang tri nam 2015
1 hdc de thi thu truong thpt chuyen le quy don quang tri nam 2015
 
Taller de redaccion 2da semana
Taller de redaccion 2da semanaTaller de redaccion 2da semana
Taller de redaccion 2da semana
 
Taller para cartografos de suelos
Taller para cartografos de suelosTaller para cartografos de suelos
Taller para cartografos de suelos
 
«Как учиться эффективно?»
«Как учиться эффективно?»«Как учиться эффективно?»
«Как учиться эффективно?»
 
Modulo maestria fisico parte 3
Modulo maestria fisico parte 3Modulo maestria fisico parte 3
Modulo maestria fisico parte 3
 
Update 1
Update 1Update 1
Update 1
 
Stephen Moker Creative Snapshot
Stephen Moker Creative SnapshotStephen Moker Creative Snapshot
Stephen Moker Creative Snapshot
 
Update 1
Update 1Update 1
Update 1
 
Market challenge & opportunities in construction sector in ksa
Market challenge & opportunities in construction sector in ksaMarket challenge & opportunities in construction sector in ksa
Market challenge & opportunities in construction sector in ksa
 
soap
soapsoap
soap
 
Tuyen tap 410 cau he phuong trinh
Tuyen tap 410 cau he phuong trinh Tuyen tap 410 cau he phuong trinh
Tuyen tap 410 cau he phuong trinh
 
Spanish 2 tema 5 pronouns
Spanish 2 tema 5 pronounsSpanish 2 tema 5 pronouns
Spanish 2 tema 5 pronouns
 
Двор 2.0 (презентация) ЗШ 2012
Двор 2.0 (презентация) ЗШ 2012Двор 2.0 (презентация) ЗШ 2012
Двор 2.0 (презентация) ЗШ 2012
 
Daily Affirmations
Daily AffirmationsDaily Affirmations
Daily Affirmations
 
Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010
Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010
Niet meer toekijken vanaf de zijlijn - TOPdesk on Tour 2010
 
2005 cpr 之修訂1
2005 cpr 之修訂12005 cpr 之修訂1
2005 cpr 之修訂1
 
Diagonisma fisikis g kat
Diagonisma fisikis g katDiagonisma fisikis g kat
Diagonisma fisikis g kat
 

Similar to Vmreport

Similar to Vmreport (20)

Opetating System Memory management
Opetating System Memory managementOpetating System Memory management
Opetating System Memory management
 
Paging +Algorithem+Segmentation+memory management
Paging +Algorithem+Segmentation+memory managementPaging +Algorithem+Segmentation+memory management
Paging +Algorithem+Segmentation+memory management
 
Operating system
Operating systemOperating system
Operating system
 
Power Point Presentation on Virtual Memory.ppt
Power Point Presentation on Virtual Memory.pptPower Point Presentation on Virtual Memory.ppt
Power Point Presentation on Virtual Memory.ppt
 
operating system
operating systemoperating system
operating system
 
Operating system Memory management
Operating system Memory management Operating system Memory management
Operating system Memory management
 
Chapter 2 part 1
Chapter 2 part 1Chapter 2 part 1
Chapter 2 part 1
 
ppt
pptppt
ppt
 
Linux%20 memory%20management
Linux%20 memory%20managementLinux%20 memory%20management
Linux%20 memory%20management
 
UNIT-2 OS.pptx
UNIT-2 OS.pptxUNIT-2 OS.pptx
UNIT-2 OS.pptx
 
OSCh9
OSCh9OSCh9
OSCh9
 
Ch9 OS
Ch9 OSCh9 OS
Ch9 OS
 
OS_Ch9
OS_Ch9OS_Ch9
OS_Ch9
 
I/O System and Case Study
I/O System and Case StudyI/O System and Case Study
I/O System and Case Study
 
Linux Internals - Interview essentials 3.0
Linux Internals - Interview essentials 3.0Linux Internals - Interview essentials 3.0
Linux Internals - Interview essentials 3.0
 
Bab 4
Bab 4Bab 4
Bab 4
 
How many total bits are required for a direct-mapped cache with 2048 .pdf
How many total bits are required for a direct-mapped cache with 2048 .pdfHow many total bits are required for a direct-mapped cache with 2048 .pdf
How many total bits are required for a direct-mapped cache with 2048 .pdf
 
Memory Management
Memory ManagementMemory Management
Memory Management
 
Virtual memory 20070222-en
Virtual memory 20070222-enVirtual memory 20070222-en
Virtual memory 20070222-en
 
virtual memory - Computer operating system
virtual memory - Computer operating systemvirtual memory - Computer operating system
virtual memory - Computer operating system
 

Vmreport

  • 1. Name: Lin Yang, Department of EE&CS, Ohio University, Stocker Center, Athens, OH 45701, linyang@bobcat.ent.ohiou.edu. CS 558 OPERATING SYSTEM 2 Spring 2003 Instructor: Dr. Frank Drews Due: 05/31/2003 (Final Version) The Virtual Memory Management of Linux (Research Report) Author: Lin Yang 1
  • 2. 1. Introduction Linux is outstanding in the area of memory management. Linux will use every scrap of memory in a system to its full potential. For example: (1) The Linux kernel itself is much smaller and more efficient than the NT kernel. NT typically takes up more memory than Linux kernel, which means extra memory can be used by applications instead of just holding the OS. (2) Linux uses a copy-on-write scheme. If two or more programs are using the same block of memory, only one copy is actually in RAM, and all the programs read the same block. If one program writes to that block, then a copy is made for just that program. All other programs still share the same memory. When loading things like shared objects, this is a major memory saver. (3) Demand-loading is very useful, as well. Linux only loads into RAM the portions of a program that are actually being used, which reduces overall RAM requirements significantly. At the same time, when swapping is necessary, only portions of programs are swapped out to disk, not entire processes. This helps to greatly enhance multiprocessing performance. (4) Finally, any RAM not being used by the kernel or applications is automatically used as a disk cache. This speeds access to the disk so long as there is unused memory. The Linux virtual memory system is responsible for maintaining the address space visible to each process. It creates pages of virtual memory on demand and it needs to manage the loading and swapping operation of the pages. Virtual memory provides a way of running more processes than can physically fit within a computer's physical address space. Each process that is a candidate for running on a processor is allocated it's own virtual memory area which defines the logical set of addresses that a process can access to carry out it's required task. 
As this total virtual memory area is very large (typically constrained by the number of address bits the processor has and the maximum number of processes it supports), each process can be allocated a large logical address space (typically 3 GB) in which to operate. It is the job of the virtual memory manager to ensure that active processes and the areas they wish to access are remapped to physical memory as required. This is achieved by swapping or paging the required sections (pages) into and out of physical memory. Swapping involves replacing a complete process with another in memory, whereas paging involves removing a page (typically 2-4 KB) of the process's mapped memory and replacing it with a page from another process. As this may be a compute-intensive and time-consuming task, care is taken to minimize its overhead. This is done by using a number of algorithms designed to take advantage of the locality of related sections of code, and by carrying out some operations, such as memory duplication or reading, only when absolutely required (techniques known as copy-on-write, lazy paging, and demand paging).

The virtual memory owned by a process may contain code and data from many sources. Executable code may be shared between processes in the form of shared libraries; as these areas are read-only, there is little chance of them becoming corrupted. Processes can also allocate and link virtual memory to use during their processing.

Some of the memory management techniques used by Linux include the following:

• Page-based protection: each virtual page has a set of flags which determine the types of access allowed in user mode or kernel mode.
• Demand paging / lazy reading: the virtual memory of a process is brought into physical memory only when the process attempts to use it.
• Kernel and user modes of operation: a process has unrestricted access to memory in kernel mode, but access only to its own memory in user mode.
• Mapped files: memory is extended by allowing disk files to be used as a staging area for pages swapped out of physical memory.
• Copy-on-write memory: when two processes require access to a common area of code, the virtual memory manager does not copy the section immediately; as long as only read access is required, the section may be used safely by both processes. Only when a write is requested does the copy take place.
• Shared memory: an area of memory may be mapped into the address space of more than one process by calling privileged operations.
• Memory locking: to ensure a critical page can never be swapped out of memory, it may be locked in; the virtual memory manager will then not remove it.

In this research report, we focus on the virtual memory management of Linux, especially its page replacement and swapping technology. The rest of the report is organized as follows: section 2 introduces the page replacement algorithm in Linux; section 3 introduces the swapping and caching technology in Linux; section 4 discusses some problems of virtual memory management; section 5 concludes the report.

2. Page replacement algorithm in Linux

Before we introduce the algorithm used in Linux, we need to introduce the concept of the PTE cache.
All modern computers designed for virtual memory incorporate a special hardware cache called a PTE cache or TLB (Translation Lookaside Buffer), which caches page table entries in the CPU so that the CPU usually doesn't have to probe the page table to find a PTE that lets it translate an address.

The PTE cache is the magic gadget that makes virtual memory practical. Without it, the CPU would have to do extra main-memory reads for every read or write instruction executed by the running program, just to look up the PTE that lets it translate a virtual address into a physical one. Rather than looking up a PTE in the page table each time it needs to translate an address, the CPU looks in its page table entry cache to find the right page table entry. If it's there already, the CPU reuses it without actually traversing the page table. Occasionally the PTE cache doesn't hold the PTE it needs, so the CPU loads the needed entry from the page table and caches that.

Note that a PTE cache does not cache normal data; it only caches address translation information from the page table. A page table entry is very small, and the PTE cache only holds a relatively small number of them (depending on the CPU, usually somewhere between 32 and 1024). This means that PTE cache misses are a couple of orders of magnitude more common than page faults: any time you touch a page you haven't touched fairly recently, you're likely to miss the PTE cache. This isn't usually a big deal, because PTE cache misses are many orders of magnitude cheaper than page faults: you only need to fetch a PTE from main memory, not fetch a page from disk. A PTE cache is very fast on a hit, and is able to translate addresses in a fraction of an instruction cycle. This translation can generally be overlapped with other parts of instruction setup, so the PTE hardware gives you virtual memory support at essentially zero time cost.

Having introduced the PTE cache, we can now describe the core page replacement algorithm used in Linux. The main component of the VM replacement mechanism is a clock algorithm.
Clock algorithms are commonly used because they provide a passable approximation of LRU replacement and are cheap to implement. (All common general-purpose CPUs have hardware support for clock algorithms, in the form of the reference bit maintained by the PTE cache. This hardware support is very simple and fast, which is why all designers of modern general-purpose CPUs include it.)

A little refresher on the general idea of clock algorithms: a clock algorithm cycles slowly through the pages that are in RAM, checking whether they have been touched (and perhaps dirtied) lately. For this, the hardware-supported reference and dirty bits of the page table entries are used. The reference bit is automatically set by the PTE cache hardware whenever the page is touched: a flag bit is set in the page table entry, and if the PTE is evicted from the PTE cache, it is written back to its home position in the page table. The clock algorithm can therefore examine the reference bits in page table entries to "examine" the corresponding pages.

The basic idea of the clock algorithm is that a slow incremental sweep repeatedly cycles through all of the cached (in-RAM) pages, noticing whether each page has been touched (and perhaps dirtied) since the last time it was examined. If a page's reference bit is set, the clock algorithm doesn't consider it for eviction on this cycle, and continues its sweep, looking for a better candidate for eviction. Before continuing its sweep, however, it resets the reference bit in the page table entry. Resetting the reference bit ensures that the next time the page is reached in the cyclic sweep, the bit will indicate whether the page was touched since this time. Visiting all of the pages cyclically ensures that a page is only considered for eviction if it hasn't been touched for at least a whole cycle.

The clock algorithm proceeds in increments, usually sweeping a small fraction of the in-memory pages at a time, and keeps a record of its current position between increments of sweeping. This allows it to resume its sweep from that page at the next increment. Technically, this simple clock scheme is known as the "second chance" algorithm, because it gives a page a second chance to stay in memory: one more sweep cycle.

More refined versions of the clock algorithm may keep multiple bits, recording whether a page has been touched in the last two cycles, or even three or four. Only one hardware-supported bit is needed for this, however: rather than just testing the hardware-supported bit, the clock hand records the current value of the bit before resetting it, for use the next time around. Intuitively, it would seem that the more bits are used, the more precise an approximation of LRU we'd get, but that's usually not the case.
Once two bits are used, clock algorithms don't generally get much better, due to fundamental weaknesses of clock algorithms. Linux uses a simple second-chance (one-bit clock) algorithm, more or less, but with several elaborations and complications. The main clock algorithm is implemented by the kernel swap daemon, a kernel thread that runs the procedure kswapd(). kswapd is an infinite loop which incrementally scans all the normal VM pages subject to paging, then starts over. kswapd generally does its clock sweeping in increments, and sleeps in between increments so that normal processes may run. The page-out daemon should usually be able to keep enough memory free, but if it can't, programs will end up calling the page-out code themselves (the following is the source code in Linux that implements this algorithm):
static int swap_out(unsigned int priority, int gfp_mask)
{
	int counter;
	int __ret = 0;

	counter = (nr_threads << SWAP_SHIFT) >> priority;
	if (counter < 1)
		counter = 1;

	for (; counter >= 0; counter--) {
		struct list_head *p;
		unsigned long max_cnt = 0;
		struct mm_struct *best = NULL;
		int assign = 0;
		int found_task = 0;
	select:
		spin_lock(&mmlist_lock);
		p = init_mm.mmlist.next;
		for (; p != &init_mm.mmlist; p = p->next) {
			struct mm_struct *mm = list_entry(p, struct mm_struct, mmlist);
			if (mm->rss <= 0)
				continue;
			found_task++;
			if (assign == 1) {
				mm->swap_cnt = (mm->rss >> SWAP_SHIFT);
				if (mm->swap_cnt < SWAP_MIN)
					mm->swap_cnt = SWAP_MIN;
			}
			if (mm->swap_cnt > max_cnt) {
				max_cnt = mm->swap_cnt;
				best = mm;
			}
		}
		if (best)
			atomic_inc(&best->mm_users);
		spin_unlock(&mmlist_lock);

		if (!best) {
			if (!assign && found_task > 0) {
				assign = 1;
				goto select;
			}
			break;
		} else {
			__ret = swap_out_mm(best, gfp_mask);
			mmput(best);
			break;
		}
	}
	return __ret;
}

3. Swapping and caching technology in Linux

Linux performs a clock sweep over the *virtual* pages, by cycling through each process's pages in address order. For this it uses the vm_area mappings and page tables of the processes, so that it can scan the pages of each process sequentially. Rather than sweeping through all of the pages of an entire process before switching to another, the main clock tries to evict a batch of pages from a process, and then moves on to another process. It visits all of the (pageable) processes and then repeats. The effect of this is that there are a large number of distinct clock sweeps, one per pageable process, and the overall clock sweep advances each of these smaller sweeps periodically. The following considerations led to this design:

• Related pages should be paged out together, to increase locality in the paging store (the so-called swap files or swap partitions). By evicting a moderate number of virtual pages from a given process, in virtual address order, the sweep through virtual address space tends to group related pages together in the paging store.

• By alternating between processes at a coarser granularity, the sweep avoids evicting a large number of pages from a single victim process: after it has evicted a reasonable number of pages from a particular victim, it moves on to another to provide some semblance of fairness between the processes.

• The use of a main clock over processes and virtual address pages, plus a secondary clock over page frames, provides a way of combining the hardware-supported virtual page reference bits to get recency-of-touch information about the logical pages stored in page frames.
The secondary clock (and the use of a separate per-page-frame PG_referenced bit maintained in software) can act as an additional "aging" period for pages that are evicted from the main clock. A page can be held in the "swap cache" after being evicted from the main clock, and allowed to age a while before being evicted from RAM. The swap cache is just a set of page frames holding logical pages that have been evicted from the main clock, but whose contents have not yet been discarded. The contents of page frames need not be copied to "move" them into the swap cache; rather, the page frame is simply marked as "swap cached" by the main clock algorithm, and linked into a hash table that holds all of the page frames that currently constitute the swap cache. The following is the part of the source code used in Linux to operate the swap cache:

#ifdef SWAP_CACHE_INFO
void show_swap_cache_info(void)
{
	printk("Swap cache: add %ld, delete %ld, find %ld/%ld\n",
	       swap_cache_add_total, swap_cache_del_total,
	       swap_cache_find_success, swap_cache_find_total);
}
#endif

void add_to_swap_cache(struct page *page, swp_entry_t entry)
{
	unsigned long flags;

#ifdef SWAP_CACHE_INFO
	swap_cache_add_total++;
#endif
	if (!PageLocked(page))
		BUG();
	if (PageTestandSetSwapCache(page))
		BUG();
	if (page->mapping)
		BUG();
	flags = page->flags & ~((1 << PG_error) | (1 << PG_arch_1));
	page->flags = flags | (1 << PG_uptodate);
	add_to_page_cache_locked(page, &swapper_space, entry.val);
}

static inline void remove_from_swap_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;

	if (mapping != &swapper_space)
		BUG();
	if (!PageSwapCache(page) || !PageLocked(page))
		PAGE_BUG(page);

	PageClearSwapCache(page);
	ClearPageDirty(page);
	remove_inode_page(page);
}

4. Problems of virtual memory management in Linux

There are, in my opinion, several possible problems with the page replacement algorithm in Linux, which can be listed as follows:

• The system may react badly to variable VM load or to load spikes after a period of no VM activity. Since kswapd, the page-out daemon, only scans when the system is low on memory, the system can end up in a state where some pages have reference bits from the last 5 seconds, while other pages have reference bits from 20 minutes ago. This means that on a load spike the system has no clue which are the right pages to evict from memory. This can lead to a swapping storm, where the wrong pages are evicted and almost immediately afterwards faulted back in, leading to the page-out of another random page, and so on.

• There is no method to prevent a possible memory deadlock. With the arrival of journaling and delayed-allocation file systems, it is possible that the system will need to allocate memory in order to free memory, that is, to write out data so memory can become free. It may be useful to introduce some algorithm to prevent this possible deadlock under extremely low memory situations.

5. Conclusion

The virtual memory management system of Linux, especially its paging and swapping technologies, has been introduced in this paper, and some problems with these strategies have been discussed.

6. References

[1] Rodrigo S. de Castro, Linux 2.4 Virtual Memory Overview (2001); http://linuxcompressed.sourceforge.net/vm24
[2] Matthew Dillon, Design Elements of the FreeBSD VM System (2000); http://www.daemonews.org/200001/freebsd_vm.html
[3] Kernelnewbies; http://kernelnewbies.org/
[4] The Linux Memory Management home page; http://linux-mm.org
[5] Yannis Smaragdakis, Scott F. Kaplan and Paul R. Wilson; EELRU: Simple and Effective Adaptive Page Replacement, SIGMETRICS '99; http://www.cs.amherst.edu/~sfkaplan/papers/index.html