OPERATING SYSTEM
K. ADISESHA, LECTURER, PRESIDENCY COLLEGE
Operating system
The most important program that runs on a computer. Every general-purpose computer must have an
operating system to run other programs. Operating systems perform basic tasks, such as recognizing
input from the keyboard, sending output to the display screen, keeping track of files and directories
on the disk, and controlling peripheral devices such as disk drives and printers.
For large systems, the operating system has even greater responsibilities and powers. It is like a
traffic cop -- it makes sure that different programs and users running at the same time do not
interfere with each other. The operating system is also responsible for security, ensuring that
unauthorized users do not access the system.
Operating systems can be classified as follows:
• multi-user : Allows two or more users to run programs at the same time. Some operating
systems permit hundreds or even thousands of concurrent users.
• multiprocessing : Supports running a program on more than one CPU.
• multitasking : Allows more than one program to run concurrently.
• multithreading : Allows different parts of a single program to run concurrently.
• real time: Responds to input within a guaranteed, bounded time. General-purpose operating
systems, such as DOS and UNIX, are not real-time.
Operating systems provide a software platform on top of which other programs, called application
programs, can run. The application programs must be written to run on top of a particular operating
system. Your choice of operating system, therefore, determines to a great extent the applications you
can run. For PCs, the most popular operating systems are DOS, OS/2, and Windows, but others are
available, such as Linux.
Process management
Process management is an integral part of any modern day operating system (OS). The OS must
allocate resources to processes, enable processes to share and exchange information, protect the
resources of each process from other processes and enable synchronisation among processes. To
meet these requirements, the OS must maintain a data structure for each process, which describes the
state and resource ownership of that process, and which enables the OS to exert control over each
process.
Scheduling
Scheduling is a key concept in computer multitasking and multiprocessing operating system design,
and in real-time operating system design. In modern operating systems, there are typically many
more processes running than there are CPUs available to run them. Scheduling refers to the way
processes are assigned to run on the available CPUs. This assignment is carried out by software
known as a scheduler.
The scheduler is concerned mainly with:
1. CPU utilization - to keep the CPU as busy as possible.
2. Throughput - number of processes that complete their execution per time unit.
3. Turnaround - amount of time to execute a particular process.
4. Waiting time - amount of time a process has been waiting in the ready queue.
5. Response time - amount of time it takes from when a request was submitted until the first
response is produced.
6. Fairness - Equal CPU time to each thread.
7. In real-time environments, such as mobile devices for automatic control in industry (for
example robotics), the scheduler also must ensure that processes can meet deadlines; this is
crucial for keeping the system stable. Scheduled tasks are sent to mobile devices and
managed through an administrative back end.
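To make criteria 2 to 4 concrete, the following short C sketch computes throughput, turnaround and waiting times for a few processes that have already completed under some schedule. The process names and timing values are invented purely for illustration.

#include <stdio.h>

/* Hypothetical record of a completed process; all times in ms. */
struct proc_stat {
    const char *name;
    int arrival;     /* time the process entered the ready queue     */
    int burst;       /* total CPU time the process actually consumed */
    int completion;  /* time the process finished                    */
};

int main(void) {
    /* Invented example data, not taken from the notes. */
    struct proc_stat p[] = {
        { "P1", 0, 24, 30 },
        { "P2", 1,  3, 33 },
        { "P3", 2,  3, 36 },
    };
    int n = sizeof p / sizeof p[0];

    for (int i = 0; i < n; i++) {
        int turnaround = p[i].completion - p[i].arrival; /* criterion 3 */
        int waiting    = turnaround - p[i].burst;        /* criterion 4 */
        printf("%s: turnaround=%d ms, waiting=%d ms\n",
               p[i].name, turnaround, waiting);
    }
    /* Throughput (criterion 2): processes completed per unit of time. */
    printf("throughput = %d processes / %d ms\n", n, p[n - 1].completion);
    return 0;
}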
Types of operating system schedulers
Operating systems may feature up to 3 distinct types of schedulers: a long-term scheduler (also
known as an admission scheduler or high-level scheduler), a mid-term or medium-term scheduler
and a short-term scheduler (also known as a dispatcher). The names suggest the relative frequency
with which these functions are performed.
Long-term Scheduler
The long-term, or admission, scheduler decides which jobs or processes are to be admitted to the
ready queue; that is, when an attempt is made to execute a program, its admission to the set of
currently executing processes is either authorized or delayed by the long-term scheduler. Thus, this
scheduler dictates what processes are to run on a system, and the degree of concurrency to be
supported at any one time - i.e., whether a large or small number of processes are to be executed
concurrently, and how the split between I/O-intensive and CPU-intensive processes is to be handled.
In modern operating systems, this is used to make sure that real-time processes get enough CPU time to finish
their tasks.
Long-term scheduling is also important in large-scale systems such as batch processing systems,
computer clusters, supercomputers and render farms. In these cases, special purpose job scheduler
software is typically used to assist these functions, in addition to any underlying admission
scheduling support in the operating system.
Mid-term Scheduler
The mid-term scheduler, present in all systems with virtual memory, temporarily removes processes
from main memory and places them on secondary memory (such as a disk drive) or vice versa. This
is commonly referred to as "swapping out" or "swapping in" (also incorrectly as "paging out" or
"paging in"). The mid-term scheduler may decide to swap out a process which has not been active
for some time, or a process which has a low priority, or a process which is page faulting frequently,
or a process which is taking up a large amount of memory in order to free up main memory for other
processes, swapping the process back in later when more memory is available, or when the process
has been unblocked and is no longer waiting for a resource. [Stallings, 396] [Stallings, 370]
In many systems today (those that support mapping virtual address space to secondary storage other
than the swap file), the mid-term scheduler may actually perform the role of the long-term scheduler,
by treating binaries as "swapped out processes" upon their execution.
Short-term Scheduler
The short-term scheduler (also known as the dispatcher) decides which of the ready, in-memory
processes are to be executed (allocated a CPU) next following a clock interrupt, an IO interrupt, an
operating system call or another form of signal. Thus the short-term scheduler makes scheduling
decisions much more frequently than the long-term or mid-term schedulers - a scheduling decision
will at a minimum have to be made after every time slice, and these are very short. This scheduler
can be preemptive, implying that it is capable of forcibly removing processes from a CPU when it
decides to allocate that CPU to another process, or non-preemptive (also known as "voluntary" or
"co-operative"), in which case the scheduler is unable to "force" processes off the CPU.
Process states
In a multitasking computer system, processes may occupy a variety of states. These distinct states
may not actually be recognized as such by the operating system kernel, however they are a useful
abstraction for the understanding of processes. The various process states can be shown in a state
diagram, with arrows indicating possible transitions between states. As the diagram shows, some processes
are stored in main memory, and some are stored in secondary (virtual) memory.
Primary process states
The following typical process states are possible on computer systems of all kinds. In most of these
states, processes are "stored" on main memory.
Created
(Also called new.) When a process is first created, it occupies the "created" or "new" state. In this
state, the process awaits admission to the "ready" state. This admission will be approved or delayed
by a long-term, or admission, scheduler. Typically in most desktop computer systems, this admission
will be approved automatically, however for real time operating systems this admission may be
delayed. In a real time system, admitting too many processes to the "ready" state may lead to
oversaturation and overcontention for the system's resources, leading to an inability to meet process
deadlines.
Ready
(Also called waiting or runnable.) A "ready" or "waiting" process has been loaded into main
memory and is awaiting execution on a CPU (to be context switched onto the CPU by the dispatcher,
or short-term scheduler). There may be many "ready" processes at any one point of the system's
execution - for example, in a one processor system, only one process can be executing at any one
time, and all other "concurrently executing" processes will be waiting for execution.
A ready queue is used in computer scheduling. Modern computers are capable of running many
different programs or processes at the same time. However, the CPU is only capable of handling one
process at a time. Processes that are ready for the CPU are kept in a queue for "ready" processes.
Other processes that are waiting for an event to occur, such as loading information from a hard drive
or waiting on an internet connection, are not in the ready queue.
Running
(Also called active or executing.) A "running", "executing" or "active" process is a process which is
currently executing on a CPU. From this state the process may exceed its allocated time slice and be
context switched out and back to "ready" by the operating system; it may indicate that it has finished
and be terminated; or it may block on some needed resource (such as an input/output resource) and
be moved to a "blocked" state.
Blocked
(Also called sleeping.) Should a process "block" on a resource (such as a file, a semaphore or a
device), it will be removed from the CPU (as a blocked process cannot continue execution) and will
be in the blocked state. The process will remain "blocked" until its resource becomes available,
which can unfortunately lead to deadlock. From the blocked state, the operating system may notify
the process of the availability of the resource it is blocking on (the operating system itself may be
alerted to the resource availability by an interrupt). Once the operating system is aware that a process
is no longer blocking, the process is again "ready" and can from there be dispatched to its "running"
state, and from there the process may make use of its newly available resource.
Terminated
A process may be terminated, either from the "running" state by completing its execution or by
explicitly being killed. In either of these cases, the process moves to the "terminated" state. If a
process is not removed from memory after entering this state, this state may also be called zombie.
Additional process states
Two additional states are available for processes in systems that support virtual memory. In both of
these states, processes are "stored" on secondary memory (typically a hard disk).
Swapped out and waiting
(Also called suspended and waiting.) In systems that support virtual memory, a process may be
swapped out, that is, removed from main memory and placed in virtual memory by the mid-term
scheduler. From here the process may be swapped back into the waiting state.
Swapped out and blocked
(Also called suspended and blocked.) Processes that are blocked may also be swapped out. In this
event the process is both swapped out and blocked, and may be swapped back in again under the
same circumstances as a swapped out and waiting process (although in this case, the process will
move to the blocked state, and may still be waiting for a resource to become available).
Scheduling algorithm
A scheduling algorithm is the method by which threads, processes or data flows are given access to
system resources (e.g. processor time, communications bandwidth). This is usually done to load
balance a system effectively or achieve a target quality of service. The need for a scheduling
algorithm arises from the requirement for most modern systems to perform multitasking (execute
more than one process at a time) and multiplexing (transmit multiple flows simultaneously).
More advanced algorithms take into account process priority, or the importance of the process. This
allows some processes to use more time than other processes. It should be noted that the kernel
always uses whatever resources it needs to ensure proper functioning of the system, and so can be
said to have infinite priority. In SMP (symmetric multiprocessing) systems, processor affinity is
considered to increase overall system performance, even if it may cause a process itself to run more
slowly. This generally improves performance by reducing cache thrashing.
Common scheduling algorithms:
• First-come first-served
• Shortest job first
• Round-robin
• Priority algorithm
• Fair queuing
First-come, first-served (FCFS)
FIFO is an acronym for First In, First Out, an abstraction for organizing and manipulating
data relative to time and prioritization. This expression describes the principle of a queue
processing technique for servicing conflicting demands by ordering processes with first-come, first-
served (FCFS) behaviour: what comes in first is handled first, what comes in next waits until the first
is finished, and so on.
Thus it is analogous to the behaviour of persons queueing (or "standing in line", in common
American parlance), where the persons leave the queue in the order they arrive, or waiting one's turn
at a traffic control signal. FCFS is also the shorthand name for the FIFO operating system
scheduling algorithm, which gives every process CPU time in the order in which it arrives. In the broader
sense, the abstraction LIFO (Last-In-First-Out) is the opposite of FIFO; the difference is perhaps
clearest when considering the less commonly used synonym of LIFO, FILO (First-In-Last-Out). In
essence, both are specific cases of a more generalized list (which could be accessed anywhere). The
difference is not in the list (data), but in the rules for accessing the content: one variant adds at one
end and removes from the other, while its opposite adds and removes items only at the same end.
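As a rough illustration of the FCFS idea, the C sketch below implements a tiny FIFO ready queue as a circular buffer: processes are dispatched strictly in the order they were enqueued. The queue size and process identifiers are arbitrary assumptions.

#include <stdio.h>

#define QSIZE 8

/* A tiny FIFO ready queue: processes leave in the order they arrived. */
struct fifo { int buf[QSIZE]; int head, tail, count; };

static void enqueue(struct fifo *q, int pid) {
    if (q->count == QSIZE) return;            /* queue full: drop (sketch only) */
    q->buf[q->tail] = pid;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
}

static int dequeue(struct fifo *q) {
    if (q->count == 0) return -1;             /* nothing ready */
    int pid = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    return pid;
}

int main(void) {
    struct fifo ready = { .head = 0, .tail = 0, .count = 0 };
    enqueue(&ready, 101);                     /* arrives first  */
    enqueue(&ready, 102);                     /* arrives second */
    enqueue(&ready, 103);                     /* arrives third  */

    int pid;
    while ((pid = dequeue(&ready)) != -1)
        printf("dispatch process %d\n", pid); /* 101, 102, 103: first come, first served */
    return 0;
}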
Shortest job First (SJF)
Shortest job First (SJF) (also known as Shortest Process Next (SPN)) is a scheduling policy that
selects the waiting process with the smallest execution time to execute next.
Shortest job next is advantageous because of its simplicity and because it maximizes process
throughput (in terms of the number of processes run to completion in a given amount of time).
However, it has the potential for process starvation for processes which will require a long time to
complete if short processes are continually added. Highest response ratio next is similar but provides
a solution to this problem.
Shortest job next scheduling is rarely used outside of specialized environments because it requires
accurate estimations of the runtime of all processes that are waiting to execute.
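One way to picture SJF selection is as a scan of the ready list for the smallest estimated run time, as in the hedged C sketch below. The job names and burst estimates are invented, and a real system would still need some way to obtain those estimates.

#include <stdio.h>

struct job { const char *name; int est_ms; };   /* est_ms: estimated run time */

/* Return the index of the waiting job with the smallest estimated run time. */
static int pick_shortest(const struct job *jobs, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (jobs[i].est_ms < jobs[best].est_ms)
            best = i;
    return best;
}

int main(void) {
    struct job ready[] = { { "compile", 400 }, { "keystroke", 5 }, { "backup", 9000 } };
    int n = sizeof ready / sizeof ready[0];
    int k = pick_shortest(ready, n);
    printf("SJF dispatches \"%s\" (%d ms estimate) next\n", ready[k].name, ready[k].est_ms);
    return 0;
}

Note how a steady stream of short jobs would keep the long "backup" job waiting indefinitely, which is the starvation problem mentioned above.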
Round-robin scheduling
Round-robin (RR) is one of the simplest scheduling algorithms for processes in an operating
system, which assigns time slices to each process in equal portions and in circular order, handling all
processes without priority. Round-robin scheduling is both simple and easy to implement, and
starvation-free. Round-robin scheduling can also be applied to other scheduling problems, such as
data packet scheduling in computer networks.
The name of the algorithm comes from the round-robin principle known from other fields, where
each person takes an equal share of something in turn.
Round-robin job scheduling may not be desirable if the sizes of the jobs or tasks vary strongly:
a process that produces large jobs would be favored over other processes. This problem may be
solved by time-sharing, i.e. by giving each job a time slot or quantum (its allowance of CPU time),
and interrupt the job if it is not completed by then. The job is resumed next time a time slot is
assigned to that process.
Example: The time slot could be 100 milliseconds. If a job1 takes a total time of 250ms to complete,
the round-robin scheduler will suspend the job after 100ms and give other jobs their time on the
CPU. Once the other jobs have had their equal share (100ms each), job1 will get another allocation
of CPU time and the cycle will repeat. This process continues until the job finishes and needs no
more time on the CPU.
• Job1 = Total time to complete 250ms (quantum 100ms).
1. First allocation = 100ms.
2. Second allocation = 100ms.
3. Third allocation = 100ms but job1 self-terminates after 50ms.
4. Total CPU time of job1 = 250ms.
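The worked example above can be replayed with a small simulation. The C sketch below assumes a 100 ms quantum; job1 is the 250 ms job from the example, and the other run times are invented.

#include <stdio.h>

#define QUANTUM 100   /* ms, as in the example above */

int main(void) {
    /* Remaining run time of each job in ms; job1 is the 250 ms job. */
    int remaining[] = { 250, 100, 180 };
    int n = sizeof remaining / sizeof remaining[0];
    int left = n, clock = 0;

    while (left > 0) {
        for (int i = 0; i < n; i++) {
            if (remaining[i] <= 0) continue;                   /* already finished */
            int slice = remaining[i] < QUANTUM ? remaining[i]  /* job ends early   */
                                               : QUANTUM;      /* full time slice  */
            clock += slice;
            remaining[i] -= slice;
            printf("t=%4d ms: job%d ran %d ms%s\n", clock, i + 1, slice,
                   remaining[i] == 0 ? " and finished" : "");
            if (remaining[i] == 0) left--;
        }
    }
    return 0;
}

As in the example, job1 receives 100 ms, 100 ms and finally 50 ms before it terminates; a shorter quantum gives better responsiveness at the cost of more context switches.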
Multiprogramming
In many modern operating systems, there can be more than one instance of a program loaded in
memory at the same time; for example, more than one user could be executing the same program,
each user having separate copies of the program loaded into memory. With some programs, it is
possible to have one copy loaded into memory, while several users have shared access to it so that
they each can execute the same program-code. Such a program is said to be re-entrant. The processor
at any instant can only be executing one instruction from one program but several processes can be
sustained over a period of time by assigning each process to the processor at intervals while the
remainder become temporarily inactive. A number of processes being executed over a period of time
instead of at the same time is called concurrent execution.
A multiprogramming or multitasking OS is a system executing many processes concurrently.
Multiprogramming requires that the processor be allocated to each process for a period of time and
de-allocated at an appropriate moment. If the processor is de-allocated during the execution of a
process, it must be done in such a way that it can be restarted later as easily as possible.
There are two possible ways for an OS to regain control of the processor during a program’s
execution in order for the OS to perform de-allocation or allocation:
1. The process issues a system call (sometimes called a software interrupt); for example, an I/O
request occurs requesting to access a file on hard disk.
2. A hardware interrupt occurs; for example, a key was pressed on the keyboard, or a timer runs
out (used in pre-emptive multitasking).
The stopping of one process and starting (or restarting) of another process is called a context switch
or context change. In many modern operating systems, processes can consist of many sub-processes.
This introduces the concept of a thread. A thread may be viewed as a sub-process; that is, a separate,
independent sequence of execution within the code of one process. Threads are becoming
increasingly important in the design of distributed and client-server systems and in software run on
multi-processor systems.
How multiprogramming increases efficiency
A common trait observed among processes associated with most computer programs is that they
alternate between CPU cycles and I/O cycles. For the portion of the time required for CPU cycles,
the process is being executed; i.e. is occupying the CPU. During the time required for I/O cycles, the
process is not using the processor. Instead, it is either waiting to perform Input/Output, or is actually
performing Input/Output. An example of this is the reading from or writing to a file on disk. Prior to
the advent of multiprogramming, computers operated as single-user systems. Users of such systems
quickly became aware that for much of the time that a computer was allocated to a single user, the
processor was idle; when the user was entering information or debugging programs for example.
Computer scientists observed that overall performance of the machine could be improved by letting
a different process use the processor whenever one process was waiting for input/output. In a uni-
programming system, if N users were to execute programs with individual execution times of t1, t2,
..., tN, then the total time, tuni, to service the N processes (consecutively) of all N users would be:
tuni = t1 + t2 + ... + tN.
However, because each process consumes both CPU cycles and I/O cycles, the time which each
process actually uses the CPU is a very small fraction of the total execution time for the process. So,
for process i:
ti (processor) << ti (execution)
where
ti (processor) is the time process i spends using the CPU, and
ti (execution) is the total execution time for the process; i.e. the time for CPU cycles plus I/O cycles to be
carried out (executed) until completion of the process.
In fact, the sum of the processor time used by all N processes rarely exceeds a small
fraction of the time taken to execute any one of the processes.
Therefore, in uni-programming systems, the processor lay idle for a considerable proportion of the
time. To overcome this inefficiency, multiprogramming is now implemented in modern operating
systems such as Linux, UNIX and Microsoft Windows. This enables the processor to switch from
one process, X, to another, Y, whenever X is involved in the I/O phase of its execution. Since the
processing time is much less than a single job's runtime, the total time to service all N users with a
multiprogramming system can be reduced to approximately:
tmulti = max(t1, t2, ..., tN)
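As a hedged numeric illustration of the two formulas, the C sketch below uses invented execution times: tuni is the sum of all the jobs' times, while tmulti approaches the longest single job.

#include <stdio.h>

int main(void) {
    /* Invented per-user execution times (ms), mostly spent waiting on I/O. */
    int t[] = { 400, 250, 900, 120 };
    int n = sizeof t / sizeof t[0];

    int t_uni = 0, t_multi = 0;
    for (int i = 0; i < n; i++) {
        t_uni += t[i];                      /* tuni  = t1 + t2 + ... + tN    */
        if (t[i] > t_multi) t_multi = t[i]; /* tmulti ~ max(t1, t2, ..., tN) */
    }
    printf("uniprogramming:   %d ms\n", t_uni);    /* 1670 ms */
    printf("multiprogramming: ~%d ms\n", t_multi); /*  900 ms */
    return 0;
}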
Process creation
Operating systems need some ways to create processes. In a very simple system designed for
running only a single application (e.g., the controller in a microwave oven), it may be possible to
have all the processes that will ever be needed be present when the system comes up. In general-
purpose systems, however, some way is needed to create and terminate processes as needed during
operation.
There are four principal events that cause a process to be created:
• System initialization.
• Execution of process creation system call by running a process.
• A user request to create a new process.
• Initiation of a batch job.
When an operating system is booted, typically several processes are created. Some of these are
foreground processes, which interact with a (human) user and perform work for them. Others are
background processes, which are not associated with particular users but instead have some specific
function. For example, one background process may be designed to accept incoming e-mails,
sleeping most of the day but suddenly springing to life when an incoming e-mail arrives. Another
background process may be designed to accept an incoming request for web pages hosted on the
machine, waking up when a request arrives to service that request.
Process termination
There are many reasons for process termination:
• Batch job issues halt instruction
• User logs off
• Process executes a service request to terminate
• Error and fault conditions
• Normal completion
• Time limit exceeded
• Memory unavailable
• Bounds violation; for example: attempted access of (non-existent) 11th element of a 10-
element array
• Protection error; for example: attempted write to read-only file
• Arithmetic error; for example: attempted division by zero
• Time overrun; for example: process waited longer than a specified maximum for an event
• I/O failure
• Invalid instruction; for example: when a process tries to execute data (text)
• Privileged instruction
• Data misuse
• Operating system intervention; for example: to resolve a deadlock
• Parent terminates so child processes terminate (cascading termination)
• Parent request
Two-state process management model
The operating system's principal responsibility is controlling the execution of processes. This
includes determining the interleaving pattern for execution and allocation of resources to processes.
One part of designing an OS is to describe the behaviour that we would like each process to exhibit.
The simplest model is based on the fact that a process is either being executed by a processor or it is
not. Thus, a process may be considered to be in one of two states, RUNNING or NOT RUNNING.
When the operating system creates a new process, that process is initially labeled as NOT
RUNNING, and is placed into a queue in the system in the NOT RUNNING state. The process (or
some portion of it) then exists in main memory, and it waits in the queue for an opportunity to be
executed. After some period of time, the currently RUNNING process will be interrupted, and moved
from the RUNNING state to the NOT RUNNING state, making the processor available for a different
process. The dispatch portion of the OS will then select, from the queue of NOT RUNNING
processes, one of the waiting processes to transfer to the processor. The chosen process is then
relabeled from a NOT RUNNING state to a RUNNING state, and its execution is either begun if it is
a new process, or is resumed if it is a process which was interrupted at an earlier time.
From this model we can identify some design elements of the OS:
• The need to represent, and keep track of each process.
• The state of a process.
• The queuing of NOT RUNNING processes.
Three-state process management model
Although the two-state process management model is a perfectly valid design for an operating
system, the absence of a BLOCKED state means that the processor lies idle when the active process
changes from CPU cycles to I/O cycles. This design does not make efficient use of the processor.
The three-state process management model is designed to overcome this problem, by introducing a
new state called the BLOCKED state. This state describes any process which is waiting for an I/O
event to take place. In this case, an I/O event can mean the use of some device or a signal from
another process. The three states in this model are:
• RUNNING: The process that is currently being executed.
• READY: A process that is queuing and prepared to execute when given the opportunity.
• BLOCKED: A process that cannot execute until some event occurs, such as the completion
of an I/O operation.
At any instant, a process is in one and only one of the three states. For a single processor computer,
only one process can be in the RUNNING state at any one instant. There can be many processes in
the READY and BLOCKED states, and each of these states will have an associated queue for
processes.
Processes entering the system must go initially into the READY state, processes can only enter the
RUNNING state via the READY state. Processes normally leave the system from the RUNNING
state. For each of the three states, the process occupies space in main memory. While the reason for
most transitions from one state to another might be obvious, some may not be so clear.
• RUNNING → READY The most common reason for this transition is that the running
process has reached the maximum allowable time for uninterrupted execution; i.e. time-out
occurs. Other reasons can be the imposition of priority levels as determined by the
scheduling policy used for the Low Level Scheduler, and the arrival of a higher priority
process into the READY state.
• RUNNING → BLOCKED A process is put into the BLOCKED state if it requests something
for which it must wait. A request to the OS is usually in the form of a system call, (i.e. a call
from the running process to a function that is part of the OS code). For example, requesting a
file from disk or saving a section of code or data from memory to a file on disk.
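The three states and the transitions just described can be captured in a few lines of C. This is only a sketch of the state bookkeeping, with invented event names, not a real scheduler.

#include <stdio.h>

enum pstate { READY, RUNNING, BLOCKED };

/* Events that drive transitions in the three-state model. */
enum pevent { DISPATCH, TIMEOUT, EVENT_WAIT, EVENT_OCCURS };

static enum pstate next_state(enum pstate s, enum pevent e) {
    switch (s) {
    case READY:   return e == DISPATCH     ? RUNNING : s;
    case RUNNING: return e == TIMEOUT      ? READY
                       : e == EVENT_WAIT   ? BLOCKED : s;
    case BLOCKED: return e == EVENT_OCCURS ? READY   : s;
    }
    return s;
}

int main(void) {
    enum pstate s = READY;
    s = next_state(s, DISPATCH);     /* READY   -> RUNNING                 */
    s = next_state(s, EVENT_WAIT);   /* RUNNING -> BLOCKED (waits for I/O) */
    s = next_state(s, EVENT_OCCURS); /* BLOCKED -> READY   (I/O completed) */
    printf("final state: %d (0 = READY)\n", s);
    return 0;
}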
Five-state process management model
While the three state model is sufficient to describe the behavior of processes with the given events,
we have to extend the model to allow for other possible events, and for more sophisticated design. In
particular, the use of a portion of the hard disk to emulate main memory (so called virtual memory)
requires additional states to describe the state of processes which are suspended from main memory,
and placed in virtual memory (on disk). Of course, such processes can, at a future time, be resumed
by being transferred back into main memory. The Medium Level Scheduler controls these events. A
process can be suspended from the RUNNING, READY or BLOCKED state, giving rise to two other
states, namely, READY SUSPEND and BLOCKED SUSPEND. A RUNNING process that is
suspended becomes READY SUSPEND, and a BLOCKED process that is suspended becomes
BLOCKED SUSPEND. A process can be suspended for a number of reasons; the most significant of
which arises from the process being swapped out of memory by the memory management system in
order to free memory for other processes. Other common reasons for a process being suspended are
when one suspends execution while debugging a program, or when the system is monitoring
processes. For the five-state process management model, consider the following transitions
described in the next sections.
• BLOCKED → BLOCKED SUSPEND If a process in the RUNNING state requires more
memory, then at least one BLOCKED process can be swapped out of memory onto disk. The
transition can also be made for the BLOCKED process if there are READY processes
available, and the OS determines that the READY process that it would like to dispatch
requires more main memory to maintain adequate performance.
• BLOCKED SUSPEND → READY SUSPEND A process in the BLOCKED SUSPEND state
is moved to the READY SUSPEND state when the event for which it has been waiting occurs.
Note that this requires that the state information concerning suspended processes be
accessible to the OS.
• READY SUSPEND → READY When there are no READY processes in main memory, the
OS will need to bring one in to continue execution. In addition, it might be the case that a
process in the READY SUSPEND state has higher priority than any of the processes in the
READY state. In that case, the OS designer may dictate that it is more important to get at the
higher priority process than to minimise swapping.
• READY → READY SUSPEND Normally, the OS would be designed so that the preference
would be to suspend a BLOCKED process rather than a READY one. This is because the
READY process can be executed as soon as the CPU becomes available for it, whereas the
BLOCKED process is taking up main memory space and cannot be executed since it is
waiting on some other event to occur. However, it may be necessary to suspend a READY
process if that is the only way to free a sufficiently large block of main memory. Finally, the
OS may choose to suspend a lower-priority READY process rather than a higher-priority
BLOCKED process if it believes that the BLOCKED process will be ready soon.
Process description and control
Each process in the system is represented by a data structure called a Process Control Block (PCB),
or Process Descriptor in Linux, which performs the same function as a traveller's passport. The PCB
contains the basic information about the job including:
• What it is
• Where it is going
• How much of its processing has been completed
• Where it is stored
• How much it has “spent” in using resources
Process Identification: Each process is uniquely identified by the user’s identification and a pointer
connecting it to its descriptor.
Process Status: This indicates the current status of the process; READY, RUNNING, BLOCKED,
READY SUSPEND, BLOCKED SUSPEND.
Process State: This contains all of the information needed to indicate the current state of the job.
Accounting: This contains information used mainly for billing purposes and for performance
measurement. It indicates what kind of resources the process has used and for how long.
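The fields described above might be laid out roughly as the following C structure. The member names and types are assumptions chosen for illustration; they do not reproduce the layout used by Linux or any other kernel.

#include <time.h>

/* A simplified Process Control Block, following the fields listed above. */
enum pstatus { PS_READY, PS_RUNNING, PS_BLOCKED, PS_READY_SUSPEND, PS_BLOCKED_SUSPEND };

struct pcb {
    int           pid;             /* process identification             */
    int           owner_uid;       /* identification of the user         */
    enum pstatus  status;          /* current status of the process      */

    /* process state: enough context to resume the job                   */
    unsigned long program_counter;
    unsigned long registers[16];   /* saved general-purpose registers    */
    void         *memory_base;     /* where the process image is stored  */

    /* accounting: resource usage for billing / performance measurement  */
    clock_t       cpu_time_used;
    long          io_operations;
};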
Processor modes
Contemporary processors incorporate a mode bit to define the execution capability of a program in
the processor. This bit can be set to kernel mode or user mode. Kernel mode is also commonly
referred to as supervisor mode, monitor mode or ring 0. In kernel mode, the processor can execute
every instruction in its hardware repertoire, whereas in user mode, it can only execute a subset of the
instructions. Instructions that can be executed only in kernel mode are called kernel, privileged or
protected instructions to distinguish them from the user mode instructions. For example, I/O
instructions are privileged. So, if an application program executes in user mode, it cannot perform its
own I/O. Instead, it must request the OS to perform I/O on its behalf. The system may logically
extend the mode bit to define areas of memory to be used when the processor is in kernel mode
versus user mode. If the mode bit is set to kernel mode, the process executing in the processor can
access either the kernel or user partition of the memory. However, if user mode is set, the process
can reference only the user memory space. We frequently refer to two classes of memory user space
and system space (or kernel, supervisor or protected space). In general, the mode bit extends the
operating system's protection rights. The mode bit is set by the user mode trap instruction, also
called a supervisor call instruction. This instruction sets the mode bit, and branches to a fixed
location in the system space. Since only system code is loaded in the system space, only system code
can be invoked via a trap. When the OS has completed the supervisor call, it resets the mode bit to
user mode prior to the return.
The kernel concept
The parts of the OS critical to its correct operation execute in kernel mode, while other software
(such as generic system software) and all application programs execute in user mode. This
fundamental distinction is usually the irrefutable distinction between the operating system and other
system software. The part of the system executing in kernel supervisor state is called the kernel, or
nucleus, of the operating system. The kernel operates as trusted software, meaning that when it was
designed and implemented, it was intended to implement protection mechanisms that could not be
covertly changed through the actions of untrusted software executing in user space. Extensions to the
OS execute in user mode, so the OS does not rely on the correctness of those parts of the system
software for correct operation of the OS. Hence, a fundamental design decision for any function to
be incorporated into the OS is whether it needs to be implemented in the kernel. If it is implemented
in the kernel, it will execute in kernel (supervisor) space, and have access to other parts of the kernel.
It will also be trusted software by the other parts of the kernel. If the function is implemented to
execute in user mode, it will have no access to kernel data structures. However, the advantage is that
it will normally require very limited effort to invoke the function. While kernel-implemented
functions may be easy to implement, the trap mechanism and authentication at the time of the call
are usually relatively expensive. The kernel code runs fast, but there is a large performance overhead
in the actual call. This is a subtle, but important point.
Requesting system services
There are two techniques by which a program executing in user mode can request the kernel's
services:
• System call
• Message passing
Operating systems are designed with one or the other of these two facilities, but not both. First,
assume that a user process wishes to invoke a particular target system function. For the system call
approach, the user process uses the trap instruction. The idea is that the system call should appear to
be an ordinary procedure call to the application program; the OS provides a library of user functions
with names corresponding to each actual system call. Each of these stub functions contains a trap to
the OS function. When the application program calls the stub, it executes the trap instruction, which
switches the CPU to kernel mode, and then branches (indirectly through an OS table), to the entry
point of the function which is to be invoked. When the function completes, it switches the processor
to user mode and then returns control to the user process; thus simulating a normal procedure return.
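The indirect branch "through an OS table" can be pictured as a kernel-side array of function pointers indexed by system call number. The C sketch below is purely conceptual: the handler names and table layout are invented, and the actual trap instruction and mode switch are not shown.

#include <stdio.h>

/* Hypothetical kernel-side handlers for two system services. */
static long sys_read_impl(long fd, long buf, long len)  { (void)fd; (void)buf; (void)len; return 0; }
static long sys_write_impl(long fd, long buf, long len) { (void)fd; (void)buf; (void)len; return len; }

typedef long (*syscall_fn)(long, long, long);

/* The "OS table": system call number -> entry point of the kernel function. */
static syscall_fn syscall_table[] = {
    [0] = sys_read_impl,
    [1] = sys_write_impl,
};

/* What the trap handler does after the stub has switched the CPU to
 * kernel mode: pick the entry point and branch to it indirectly.     */
static long dispatch(long nr, long a0, long a1, long a2) {
    if (nr < 0 || nr >= (long)(sizeof syscall_table / sizeof syscall_table[0]))
        return -1;                          /* unknown system call */
    return syscall_table[nr](a0, a1, a2);
}

int main(void) {
    long r = dispatch(1, /*fd*/1, /*buf*/0, /*len*/42);  /* "write" 42 bytes */
    printf("syscall returned %ld\n", r);
    return 0;
}

A real kernel would additionally validate the user-supplied arguments before using them, since they arrive from untrusted user space.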
In the message passing approach, the user process constructs a message, that describes the desired
service. Then it uses a trusted send function to pass the message to a trusted OS process. The send
function serves the same purpose as the trap; that is, it carefully checks the message, switches the
processor to kernel mode, and then delivers the message to a process that implements the target
functions. Meanwhile, the user process waits for the result of the service request with a message
receive operation. When the OS process completes the operation, it sends a message back to the user
process.
The distinction between the two approaches has important consequences regarding the relative
independence of the OS behavior, from the application process behavior, and the resulting
performance. As a rule of thumb, operating systems based on a system call interface can be made
more efficient than those requiring messages to be exchanged between distinct processes. This is the
case, even though the system call must be implemented with a trap instruction; that is, even though
the trap is relatively expensive to perform, it is more efficient than the message passing approach,
where there are generally higher costs associated with process multiplexing, message formation and
message copying. The system call approach has the interesting property that there is not necessarily
any OS process. Instead, a process executing in user mode changes to kernel mode when it is
executing kernel code, and switches back to user mode when it returns from the OS call. If, on the
other hand, the OS is designed as a set of separate processes, it is usually easier to design it so that it
gets control of the machine in special situations, than if the kernel is simply a collection of functions
executed by user processes in kernel mode. Even procedure-based operating systems usually find it
necessary to include at least a few system processes (called daemons in UNIX) to handle situations
in which the machine is otherwise idle, such as scheduling and handling the network.
Memory Management
The memory management subsystem is one of the most important parts of the operating system.
Since the early days of computing, there has been a need for more memory than exists physically in
a system. Strategies have been developed to overcome this limitation and the most successful of
these is virtual memory. Virtual memory makes the system appear to have more memory than it
actually has by sharing it between competing processes as they need it.
Virtual memory does more than just make your computer's memory go further. The memory
management subsystem provides:
Large Address Spaces
The operating system makes the system appear as if it has a larger amount of memory than it
actually has. The virtual memory can be many times larger than the physical memory in the
system.
Protection
Each process in the system has its own virtual address space. These virtual address spaces are
completely separate from each other and so a process running one application cannot affect
another. Also, the hardware virtual memory mechanisms allow areas of memory to be
protected against writing. This protects code and data from being overwritten by rogue
applications.
Memory Mapping
Memory mapping is used to map image and data files into a process's address space. In
memory mapping, the contents of a file are linked directly into the virtual address space of a
process.
Fair Physical Memory Allocation
The memory management subsystem allows each running process in the system a fair share
of the physical memory of the system.
Shared Virtual Memory
Although virtual memory allows processes to have separate (virtual) address spaces, there are
times when you need processes to share memory. For example there could be several
processes in the system running the bash command shell. Rather than have several copies of
bash, one in each process's virtual address space, it is better to have only one copy in
physical memory and all of the processes running bash share it. Dynamic libraries are
another common example of executing code shared between several processes.
Shared memory can also be used as an Inter Process Communication (IPC) mechanism, with
two or more processes exchanging information via memory common to all of them. Linux
supports the Unix System V shared memory IPC.
An Abstract Model of Virtual Memory
Figure: Abstract model of virtual to physical address mapping
Before considering the methods that Linux uses to support virtual memory it is useful to consider an
abstract model that is not cluttered by too much detail.
As the processor executes a program it reads an instruction from memory and decodes it. In
decoding the instruction it may need to fetch or store the contents of a location in memory. The
processor then executes the instruction and moves onto the next instruction in the program. In this
way the processor is always accessing memory either to fetch instructions or to fetch and store data.
In a virtual memory system all of these addresses are virtual addresses and not physical addresses.
These virtual addresses are converted into physical addresses by the processor based on information
held in a set of tables maintained by the operating system.
To make this translation easier, virtual and physical memory are divided into handy sized chunks
called pages. These pages are all the same size; they need not be, but if they were not, the system
would be very hard to administer. Linux on Alpha AXP systems uses 8 Kbyte pages and on Intel x86
systems it uses 4 Kbyte pages. Each of these pages is given a unique number; the page frame number
(PFN).
In this paged model, a virtual address is composed of two parts; an offset and a virtual page frame
number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12
and above are the virtual page frame number. Each time the processor encounters a virtual address it
must extract the offset and the virtual page frame number. The processor must translate the virtual
page frame number into a physical one and then access the location at the correct offset into that
physical page. To do this the processor uses page tables.
Figure shows the virtual address spaces of two processes, process X and process Y, each with their
own page tables. These page tables map each process's virtual pages into physical pages in memory.
This shows that process X's virtual page frame number 0 is mapped into memory in physical page
frame number 1 and that process Y's virtual page frame number 1 is mapped into physical page
frame number 4. Each entry in the theoretical page table contains the following information:
• Valid flag. This indicates if this page table entry is valid,
• The physical page frame number that this entry is describing,
• Access control information. This describes how the page may be used. Can it be written to?
Does it contain executable code?
The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5
would be the 6th element of the table (0 is the first element).
To translate a virtual address into a physical one, the processor must first work out the virtual
address's page frame number and the offset within that virtual page. By making the page size a
power of 2, this can easily be done by masking and shifting. Looking again at the figure and assuming a
page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual
address space then the processor would translate that address into offset 0x194 into virtual page
frame number 1.
The processor uses the virtual page frame number as an index into the process's page table to
retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the
physical page frame number from this entry. If the entry is invalid, the process has accessed a non-
existent area of its virtual memory. In this case, the processor cannot resolve the address and must
pass control to the operating system so that it can fix things up.
Just how the processor notifies the operating system that the running process has attempted to access
a virtual address for which there is no valid translation is specific to the processor. However the
processor delivers it, this is known as a page fault and the operating system is notified of the faulting
virtual address and the reason for the page fault.
Assuming that this is a valid page table entry, the processor takes that physical page frame number
and multiplies it by the page size to get the address of the base of the page in physical memory.
Finally, the processor adds in the offset to the instruction or data that it needs.
Using the above example again, process Y's virtual page frame number 1 is mapped to physical page
frame number 4 which starts at 0x8000 (4 x 0x2000). Adding in the 0x194 byte offset gives us a final
physical address of 0x8194.
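The arithmetic of that example can be written out directly. The C sketch below assumes the 0x2000-byte page size used above and a tiny, invented page table for process Y.

#include <stdio.h>

#define PAGE_SIZE   0x2000UL          /* 8192-byte pages, as in the example */
#define PAGE_SHIFT  13                /* 2^13 = 0x2000                      */
#define PAGE_MASK   (PAGE_SIZE - 1)   /* low 13 bits are the offset         */

int main(void) {
    /* Process Y's (tiny, invented) page table: virtual PFN -> physical PFN. */
    unsigned long page_table[] = { 2, 4, 7, 5 };

    unsigned long vaddr  = 0x2194;               /* virtual address        */
    unsigned long offset = vaddr & PAGE_MASK;    /* 0x194                  */
    unsigned long vpfn   = vaddr >> PAGE_SHIFT;  /* 1                      */
    unsigned long pfn    = page_table[vpfn];     /* 4                      */
    unsigned long paddr  = (pfn << PAGE_SHIFT) + offset;  /* 0x8000 + 0x194 */

    printf("vaddr 0x%lx -> vpfn %lu, offset 0x%lx -> paddr 0x%lx\n",
           vaddr, vpfn, offset, paddr);          /* prints paddr 0x8194    */
    return 0;
}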
By mapping virtual to physical addresses this way, the virtual memory can be mapped into the
system's physical pages in any order. For example, in Figure 3.1 process X's virtual page frame
number 0 is mapped to physical page frame number 1 whereas virtual page frame number 7 is
mapped to physical page frame number 0 even though it is higher in virtual memory than virtual
page frame number 0. This demonstrates an interesting byproduct of virtual memory; the pages of
virtual memory do not have to be present in physical memory in any particular order.
Demand Paging
As there is much less physical memory than virtual memory, the operating system must be careful
that it does not use the physical memory inefficiently. One way to save physical memory is to only
load virtual pages that are currently being used by the executing program. For example, a database
program may be run to query a database. In this case not all of the database needs to be loaded into
memory, just those data records that are being examined. If the database query is a search query then
it does not make sense to load the code from the database program that deals with adding new
records. This technique of only loading virtual pages into memory as they are accessed is known as
demand paging.
When a process attempts to access a virtual address that is not currently in memory the processor
cannot find a page table entry for the virtual page referenced. For example, in the figure there is no
entry in process X's page table for virtual page frame number 2 and so if process X attempts to read
from an address within virtual page frame number 2 the processor cannot translate the address into a
physical one. At this point the processor notifies the operating system that a page fault has occurred.
If the faulting virtual address is invalid this means that the process has attempted to access a virtual
address that it should not have. Maybe the application has gone wrong in some way, for example
writing to random addresses in memory. In this case the operating system will terminate it,
protecting the other processes in the system from this rogue process.
If the faulting virtual address was valid but the page that it refers to is not currently in memory, the
operating system must bring the appropriate page into memory from the image on disk. Disk access
takes a long time, relatively speaking, and so the process must wait quite a while until the page has
been fetched. If there are other processes that could run then the operating system will select one of
them to run. The fetched page is written into a free physical page frame and an entry for the virtual
page frame number is added to the process's page table. The process is then restarted at the machine
instruction where the memory fault occurred. This time the virtual memory access is made, the
processor can make the virtual to physical address translation and so the process continues to run.
Linux uses demand paging to load executable images into a process's virtual memory. Whenever a
command is executed, the file containing it is opened and its contents are mapped into the process's
virtual memory. This is done by modifying the data structures describing this process's memory
map and is known as memory mapping. However, only the first part of the image is actually brought
into physical memory. The rest of the image is left on disk. As the image executes, it generates page
faults and Linux uses the process's memory map in order to determine which parts of the image to
bring into memory for execution.
Swapping
If a process needs to bring a virtual page into physical memory and there are no free physical pages
available, the operating system must make room for this page by discarding another page from
physical memory.
If the page to be discarded from physical memory came from an image or data file and has not been
written to then the page does not need to be saved. Instead it can be discarded and if the process
needs that page again it can be brought back into memory from the image or data file.
However, if the page has been modified, the operating system must preserve the contents of that
page so that it can be accessed at a later time. This type of page is known as a dirty page and when it
is removed from memory it is saved in a special sort of file called the swap file. Accesses to the
swap file are very long relative to the speed of the processor and physical memory and the operating
system must juggle the need to write pages to disk with the need to retain them in memory to be used
again.
If the algorithm used to decide which pages to discard or swap out (the swap algorithm) is not efficient,
then a condition known as thrashing occurs. In this case, pages are constantly being written to disk
and then being read back and the operating system is too busy to allow much real work to be
performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly accessed
then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently
using is called the working set. An efficient swap scheme would make sure that all processes have
their working set in physical memory.
Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might
be removed from the system. This scheme involves every page in the system having an age which
changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is
accessed the older and more stale it becomes. Old pages are good candidates for swapping.
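A heavily simplified sketch of the aging idea is shown below in C: each page carries an age counter that is bumped on access and decays over time, and the page with the lowest counter (the most stale) is the best swap candidate. The constants are invented and do not reflect Linux's actual values.

#include <stdio.h>

#define NPAGES      4
#define INITIAL_AGE 3     /* invented starting value                          */
#define AGE_BUMP    3     /* how much an access rejuvenates a page            */
#define MAX_AGE     20    /* cap so frequently used pages do not grow forever */

/* Age counter per page: a high value means recently used ("young"),
 * a low value means stale ("old").                                   */
static int age[NPAGES];

static void touch(int page) { age[page] += AGE_BUMP; if (age[page] > MAX_AGE) age[page] = MAX_AGE; }
static void tick(void)      { for (int i = 0; i < NPAGES; i++) if (age[i] > 0) age[i]--; }

/* The stalest (lowest-age) page is the best candidate for swapping out. */
static int swap_candidate(void) {
    int best = 0;
    for (int i = 1; i < NPAGES; i++)
        if (age[i] < age[best]) best = i;
    return best;
}

int main(void) {
    for (int i = 0; i < NPAGES; i++) age[i] = INITIAL_AGE;
    touch(1); touch(1); touch(3);   /* pages 1 and 3 are in the working set */
    tick(); tick();                 /* time passes; unused pages grow stale */
    printf("swap out page %d first\n", swap_candidate());  /* page 0 (tied with 2), never 1 or 3 */
    return 0;
}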
Shared Virtual Memory
Virtual memory makes it easy for several processes to share memory. All memory accesses are made
via page tables and each process has its own separate page table. For two processes sharing a
physical page of memory, its physical page frame number must appear in a page table entry in both
of their page tables.
Figure shows two processes that each share physical page frame number 4. For process X this is
virtual page frame number 4 whereas for process Y this is virtual page frame number 6. This
illustrates an interesting point about sharing pages: the shared physical page does not have to exist at
the same place in virtual memory for any or all of the processes sharing it.
Physical and Virtual Addressing Modes
It does not make much sense for the operating system itself to run in virtual memory. This would be
a nightmare situation where the operating system must maintain page tables for itself. Most multi-
purpose processors support the notion of a physical address mode as well as a virtual address mode.
Physical addressing mode requires no page tables and the processor does not attempt to perform any
address translations in this mode. The Linux kernel is linked to run in physical address space.
The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides up
the memory space into several areas and designates two of them as physically mapped addresses.
This kernel address space is known as KSEG address space and it encompasses all addresses
upwards from 0xfffffc0000000000. In order to execute from code linked in KSEG (by definition,
kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on
Alpha is linked to execute from address 0xfffffc0000310000.
Access Control
The page table entries also contain access control information. As the processor is already using the
page table entry to map a process's virtual address to a physical one, it can easily use the access
control information to check that the process is not accessing memory in a way that it should not.
There are many reasons why you would want to restrict access to areas of memory. Some memory,
such as that containing executable code, is naturally read only memory; the operating system should
not allow a process to write data over its executable code. By contrast, pages containing data can be
written to but attempts to execute that memory as instructions should fail. Most processors have at
least two modes of execution: kernel and user. You would not want kernel code to be executed by a
user process, or kernel data structures to be accessible, except when the processor is running in kernel mode.
Alpha AXP Page Table Entry
The access control information is held in the PTE and is processor specific; figure shows the PTE for
Alpha AXP. The bit fields have the following meanings:
V Valid, if set this PTE is valid,
FOE ``Fault on Execute'', Whenever an attempt to execute instructions in this page occurs, the
processor reports a page fault and passes control to the operating system,
FOW ``Fault on Write'', as above but page fault on an attempt to write to this page,
FOR ``Fault on Read'', as above but page fault on an attempt to read from this page,
ASM Address Space Match. This is used when the operating system wishes to clear only some of
the entries from the Translation Buffer,
KRE Code running in kernel mode can read this page,
URE Code running in user mode can read this page,
GH Granularity hint used when mapping an entire block with a single Translation Buffer entry
rather than many,
KWE Code running in kernel mode can write to this page,
UWE Code running in user mode can write to this page,
page frame number
For PTEs with the V bit set, this field contains the physical Page Frame Number (page frame
number) for this PTE. For invalid PTEs, if this field is not zero, it contains information about
where the page is in the swap file.
The following two bits are defined and used by Linux:
_PAGE_DIRTY if set, the page needs to be written out to the swap file,
_PAGE_ACCESSED Used by Linux to mark a page as having been accessed.
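The bit fields listed above can be pictured as masks over a 64-bit PTE word, as in the C sketch below. The bit positions chosen here are assumptions for illustration only; the real Alpha AXP layout is defined by the architecture manual.

#include <stdint.h>
#include <stdio.h>

/* Illustrative PTE flag masks; the bit positions are assumed, not authoritative. */
#define PTE_V    (1ULL << 0)   /* valid                       */
#define PTE_FOE  (1ULL << 1)   /* fault on execute            */
#define PTE_FOW  (1ULL << 2)   /* fault on write              */
#define PTE_FOR  (1ULL << 3)   /* fault on read               */
#define PTE_KRE  (1ULL << 8)   /* kernel read enable          */
#define PTE_URE  (1ULL << 9)   /* user read enable            */
#define PTE_KWE  (1ULL << 12)  /* kernel write enable         */
#define PTE_UWE  (1ULL << 13)  /* user write enable           */
#define PTE_PFN_SHIFT 32       /* assumed position of the PFN */

int main(void) {
    uint64_t pte = PTE_V | PTE_KRE | PTE_KWE | PTE_URE    /* user may read, not write   */
                 | ((uint64_t)0x1234 << PTE_PFN_SHIFT);   /* physical page frame number */

    if ((pte & PTE_V) && !(pte & PTE_UWE))
        printf("valid page, read-only for user mode, pfn=0x%llx\n",
               (unsigned long long)(pte >> PTE_PFN_SHIFT));
    return 0;
}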
Caches
If you were to implement a system using the above theoretical model then it would work, but not
particularly efficiently. Both operating system and processor designers try hard to extract more
performance from the system. Apart from making the processors, memory and so on faster the best
approach is to maintain caches of useful information and data that make some operations faster.
Linux uses a number of memory management related caches:
Buffer Cache
The buffer cache contains data buffers that are used by the block device drivers.
These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information
that have either been read from a block device or are being written to it. A block device is
one that can only be accessed by reading and writing fixed sized blocks of data. All hard
disks are block devices.
The buffer cache is indexed via the device identifier and the desired block number and is
used to quickly find a block of data. Block devices are only ever accessed via the buffer
cache. If data can be found in the buffer cache then it does not need to be read from the
physical block device, for example a hard disk, and access to it is much faster.
Page Cache
This is used to speed up access to images and data on disk.
It is used to cache the logical contents of a file a page at a time and is accessed via the file
and offset within the file. As pages are read into memory from disk, they are cached in the
page cache.
Swap Cache
Only modified (or dirty) pages are saved in the swap file.
So long as these pages are not modified after they have been written to the swap file then the
next time the page is swapped out there is no need to write it to the swap file as the page is
already in the swap file. Instead the page can simply be discarded. In a heavily swapping
system this saves many unnecessary and costly disk operations.
Hardware Caches
One commonly implemented hardware cache is in the processor; a cache of Page Table
Entries. In this case, the processor does not always read the page table directly but instead
caches translations for pages as it needs them. These are the Translation Look-aside Buffers
and contain cached copies of the page table entries from one or more processes in the system.
When the reference to the virtual address is made, the processor will attempt to find a
matching TLB entry. If it finds one, it can directly translate the virtual address into a physical
one and perform the correct operation on the data. If the processor cannot find a matching
TLB entry then it must get the operating system to help. It does this by signalling the
operating system that a TLB miss has occurred. A system specific mechanism is used to
deliver that exception to the operating system code that can fix things up. The operating
system generates a new TLB entry for the address mapping. When the exception has been
cleared, the processor will make another attempt to translate the virtual address. This time it
will work because there is now a valid entry in the TLB for that address.
The drawback of using caches, hardware or otherwise, is that, in order to save effort elsewhere, Linux must
spend more time and space maintaining these caches and, if the caches become corrupted, the system will
crash.
Linux Page Tables
Three Level Page Tables
Linux assumes that there are three levels of page tables. Each Page Table accessed contains the page
frame number of the next level of Page Table. Figure shows how a virtual address can be broken
into a number of fields; each field providing an offset into a particular Page Table. To translate a
virtual address into a physical one, the processor must take the contents of each level field, convert it
into an offset into the physical page containing the Page Table and read the page frame number of
the next level of Page Table. This is repeated three times until the page frame number of the physical
page containing the virtual address is found. Now the final field in the virtual address, the byte
offset, is used to find the data inside the page.
Each platform that Linux runs on must provide translation macros that allow the kernel to traverse
the page tables for a particular process. This way, the kernel does not need to know the format of the
page table entries or how they are arranged.
This is so successful that Linux uses the same page table manipulation code for the Alpha processor,
which has three levels of page tables, and for Intel x86 processors, which have two levels of page
tables.
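A minimal sketch of the three-level walk described above is given below in C. The field widths (10 bits per level, a 12-bit byte offset) and the table representation are invented for illustration and do not match any particular processor.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative layout: 4 KiB pages, 10 index bits per level. */
    #define LEVEL_BITS   10
    #define LEVEL_MASK   ((1u << LEVEL_BITS) - 1)
    #define OFFSET_BITS  12
    #define OFFSET_MASK  ((1u << OFFSET_BITS) - 1)
    #define ENTRIES      (1u << LEVEL_BITS)

    /* Each table is an array of entries; a non-NULL entry points at the next
     * level of table (or, at the last level, at the physical page itself). */
    typedef void *table_t[ENTRIES];

    /* Walk the three levels and return a pointer to the byte, or NULL if any
     * level is missing (which would raise a page fault in a real system). */
    void *translate(table_t *top, uint64_t vaddr)
    {
        unsigned i1  = (vaddr >> (OFFSET_BITS + 2 * LEVEL_BITS)) & LEVEL_MASK;
        unsigned i2  = (vaddr >> (OFFSET_BITS + LEVEL_BITS)) & LEVEL_MASK;
        unsigned i3  = (vaddr >> OFFSET_BITS) & LEVEL_MASK;
        unsigned off = vaddr & OFFSET_MASK;

        table_t *mid = (*top)[i1];
        if (mid == NULL) return NULL;
        table_t *bottom = (*mid)[i2];
        if (bottom == NULL) return NULL;
        unsigned char *page = (*bottom)[i3];
        if (page == NULL) return NULL;
        return page + off;          /* byte offset inside the final page */
    }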
Page Allocation and Deallocation
There are many demands on the physical pages in the system. For example, when an image is loaded
into memory the operating system needs to allocate pages. These will be freed when the image has
finished executing and is unloaded. Another use for physical pages is to hold kernel specific data
structures such as the page tables themselves. The mechanisms and data structures used for page
allocation and deallocation are perhaps the most critical in maintaining the efficiency of the virtual
memory subsystem.
All of the physical pages in the system are described by the mem_map data structure, a list of
mem_map_t structures which is initialized at boot time. Each mem_map_t describes a single physical page in the
system. Important fields (so far as memory management is concerned) are:
count
This is a count of the number of users of this page. The count is greater than one when the
page is shared between many processes,
age
This field describes the age of the page and is used to decide if the page is a good candidate
for discarding or swapping,
map_nr
This is the physical page frame number that this mem_map_t describes.
The free_area vector is used by the page allocation code to find and free pages. The whole buffer
management scheme is supported by this mechanism and so far as the code is concerned, the size of
the page and physical paging mechanisms used by the processor are irrelevant.
Each element of free_area contains information about blocks of pages. The first element in the
array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages and so on
upwards in powers of two. The list element is used as a queue head and has pointers to the page
data structures in the mem_map array. Free blocks of pages are queued here. map is a pointer to a
bitmap which keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the
Nth block of pages is free.
The free_area figure shows the free_area structure. Element 0 has one free page (page frame
number 0) and element 2 has 2 free blocks of 4 pages, the first starting at page frame number 4 and
the second at page frame number 56.
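A simplified picture of these structures in C might look like the following; the field names loosely follow the description above and are not a faithful copy of the kernel source.

    #include <stddef.h>

    #define NR_MEM_LISTS 6   /* block sizes 1, 2, 4, ... 32 pages (illustrative) */

    /* Simplified per-page descriptor. */
    typedef struct page {
        struct page  *next;     /* link in a free list */
        unsigned long map_nr;   /* physical page frame number */
        int           count;    /* number of users of this page */
        int           age;      /* used to decide when to swap or discard */
    } mem_map_t;

    /* One entry per block size: a free list plus a bitmap of free blocks. */
    struct free_area_struct {
        mem_map_t     *free_list;  /* queue of free blocks of this size */
        unsigned long *map;        /* bit N set => Nth block of this size is free */
    };

    static struct free_area_struct free_area[NR_MEM_LISTS];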
Page Allocation
Linux uses the Buddy algorithm to allocate and deallocate blocks of pages effectively. The page
allocation code attempts to allocate a block of one or more physical pages. Pages are allocated in
blocks which are powers of 2 in size. That means that it can allocate a block of 1 page, 2 pages, 4
pages and so on. So
long as there are enough free pages in the system to grant this request (nr_free_pages
> min_free_pages) the allocation code will search the free_area for a block of pages of the size
requested. Each element of the free_area has a map of the allocated and free blocks of pages for
that sized block. For example, element 2 of the array has a memory map that describes free and
allocated blocks each of 4 pages long.
The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain
of free pages that is queued on the list element of the free_area data structure. If no blocks of pages
of the requested size are free, blocks of the next size (which is twice that of the size requested) are
looked for. This process continues until all of the free_area has been searched or until a block of
pages has been found. If the block of pages found is larger than that requested it must be broken
down until there is a block of the right size. Because the blocks are each a power of 2 pages big then
this breaking down process is easy as you simply break the blocks in half. The free blocks are
queued on the appropriate queue and the allocated block of pages is returned to the caller.
The free_area data structure
For example, in Figure if a block of 2 pages was requested, the first block of 4 pages (starting at
page frame number 4) would be broken into two 2 page blocks. The first, starting at page frame
number 4 would be returned to the caller as the allocated pages and the second block, starting at
page frame number 6 would be queued as a free block of 2 pages onto element 1 of the free_area
array.
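The search-and-split behaviour can be sketched as below. The representation (one linked list of free block start frames per power-of-two size) is a toy stand-in for the real free_area vector; the example in main reproduces the scenario just described.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_ORDER 6   /* block sizes 1..32 pages, purely illustrative */

    /* A free block is identified by its starting page frame number. */
    struct block {
        unsigned long pfn;
        struct block *next;
    };

    static struct block *free_list[MAX_ORDER];   /* one list per power-of-two size */

    static void push_block(int order, unsigned long pfn)
    {
        struct block *b = malloc(sizeof *b);     /* no error handling: a sketch */
        b->pfn = pfn;
        b->next = free_list[order];
        free_list[order] = b;
    }

    /* Allocate a block of 2^order pages: search upwards from the requested
     * size, splitting larger blocks in half until one of the right size exists. */
    long alloc_pages(int order)
    {
        int o;
        for (o = order; o < MAX_ORDER; o++)
            if (free_list[o] != NULL)
                break;
        if (o == MAX_ORDER)
            return -1;                            /* nothing large enough is free */

        struct block *b = free_list[o];
        free_list[o] = b->next;
        unsigned long pfn = b->pfn;
        free(b);

        /* Split the block in half repeatedly, queuing the unused halves. */
        while (o > order) {
            o--;
            push_block(o, pfn + (1ul << o));      /* upper half becomes a free block */
        }
        return (long)pfn;
    }

    int main(void)
    {
        push_block(2, 4);            /* a free block of 4 pages starting at frame 4 */
        long pfn = alloc_pages(1);   /* request a block of 2 pages */
        printf("allocated 2 pages at frame %ld\n", pfn);  /* frame 4; frame 6 queued free */
        return 0;
    }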
Page Deallocation
Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken
down into smaller ones. The page deallocation code
recombines pages into larger blocks of free pages whenever it can. In fact the page block size is
important as it allows for easy combination of blocks into larger blocks.
Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if
it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of
pages for the next size block of pages. Each time two blocks of pages are recombined into a bigger
block of free pages the page deallocation code attempts to recombine that block into a yet larger one.
In this way the blocks of free pages are as large as memory usage will allow.
For example, in Figure, if page frame number 1 were to be freed, then that would be combined with
the already free page frame number 0 and queued onto element 1 of the free_area as a free block of
size 2 pages.
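The coalescing step can be sketched as follows. The bookkeeping (a per-order bitmap of free block start frames) is invented for illustration; the common buddy-system formulation is used, in which the buddy of a block of 2^order pages starting at frame pfn begins at pfn XOR 2^order. The example in main reproduces the scenario just described.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_ORDER 6
    #define NR_PAGES  64

    /* Toy bookkeeping: free_map[order][pfn] is true when a free block of
     * 2^order pages starts at that frame.  Purely illustrative. */
    static bool free_map[MAX_ORDER][NR_PAGES];

    /* Free a block of 2^order pages starting at pfn, combining it with its
     * buddy (the adjacent block of the same size) for as long as possible. */
    void free_pages(unsigned long pfn, int order)
    {
        while (order < MAX_ORDER - 1) {
            unsigned long buddy = pfn ^ (1ul << order);  /* the buddy block's start frame */
            if (!free_map[order][buddy])
                break;                        /* buddy is busy: stop combining */
            free_map[order][buddy] = false;   /* merge the two blocks */
            pfn &= ~(1ul << order);           /* combined block starts at the lower buddy */
            order++;                          /* now a free block twice as large */
        }
        free_map[order][pfn] = true;
    }

    int main(void)
    {
        free_map[0][0] = true;   /* page frame 0 is already free */
        free_pages(1, 0);        /* freeing frame 1 coalesces into a 2-page block at frame 0 */
        printf("2-page block free at frame 0: %s\n", free_map[1][0] ? "yes" : "no");
        return 0;
    }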
Memory Mapping
When an image is executed, the contents of the executable image must be brought into the process's
virtual address space. The same is also true of any shared libraries that the executable image has
been linked to use. The executable file is not actually brought into physical memory, instead it is
merely linked into the process's virtual memory. Then, as the parts of the program are referenced by
the running application, the image is brought into memory from the executable image. This linking
of an image into a process's virtual address space is known as memory mapping.
Areas of Virtual Memory
Every process's virtual memory is represented by an mm_struct data structure. This contains
information about the image that it is currently executing (for example bash) and also has pointers
to a number of vm_area_struct data structures. Each vm_area_struct data structure describes the
start and end of the area of virtual memory, the process's access rights to that memory and a set of
operations for that memory. These operations are a set of routines that Linux must use when
manipulating this area of virtual memory. For example, one of the virtual memory operations
performs the correct actions when the process has attempted to access this virtual memory but finds
(via a page fault) that the memory is not actually in physical memory. This operation is the nopage
operation. The nopage operation is used when Linux demand pages the pages of an executable
image into memory.
When an executable image is mapped into a process's virtual address space a set of vm_area_struct data
structures is generated. Each vm_area_struct data structure represents a part of the executable
image; the executable code, initialized data (variables), uninitialized data and so on. Linux supports a
number of standard virtual memory operations and as the vm_area_struct data structures are
created, the correct set of virtual memory operations are associated with them.
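A much-simplified C sketch of these structures is given below; the real kernel definitions contain many more fields and the exact members shown are illustrative.

    #include <stddef.h>

    struct vm_area_struct;   /* forward declaration for the operations table */

    struct vm_operations {
        /* called when a page in this area is touched but not yet in memory */
        int (*nopage)(struct vm_area_struct *vma, unsigned long address);
        /* called to swap a page of this area back in (used by System V shm) */
        int (*swapin)(struct vm_area_struct *vma, unsigned long address);
    };

    struct vm_area_struct {
        unsigned long vm_start;            /* first virtual address of the area */
        unsigned long vm_end;              /* first address beyond the area */
        unsigned int  vm_flags;            /* read/write/execute permissions */
        struct vm_operations *vm_ops;      /* operations for this area (may be NULL) */
        struct vm_area_struct *vm_next;    /* next area of this process */
    };

    struct mm_struct {
        struct vm_area_struct *mmap;       /* list of areas of virtual memory */
        unsigned long start_code, end_code;
        unsigned long start_data, end_data;
    };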
Demand Paging
Once an executable image has been memory mapped into a process's virtual memory it can start to
execute. As only the very start of the image is physically pulled into memory it will soon access an
area of virtual memory that is not yet in physical memory. When a process accesses a virtual address
that does not have a valid page table entry, the processor will report a page fault to Linux.
The page fault describes the virtual address where the page fault occurred and the type of memory
access that caused it.
Linux must find the vm_area_struct that represents the area of memory that the page fault occurred
in. As searching through the vm_area_struct data structures is critical to the efficient handling of
page faults, these are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If there
is no vm_area_struct data structure for this faulting virtual address, this process has accessed an
illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process
does not have a handler for that signal it will be terminated.
Linux next checks the type of page fault that occurred against the types of accesses allowed for this
area of virtual memory. If the process is accessing the memory in an illegal way, say writing to an
area that it is only allowed to read from, it is also signalled with a memory error.
Now that Linux has determined that the page fault is legal, it must deal with it.
Linux must differentiate between pages that are in the swap file and those that are part of an
executable image on a disk somewhere. It does this by using the page table entry for this faulting
virtual address.
If the page's page table entry is invalid but not empty, the page fault is for a page currently being
held in the swap file. For Alpha AXP page table entries, these are entries which do not have their
valid bit set but which have a non-zero value in their PFN field. In this case the PFN field holds
information about where in the swap (and which swap file) the page is being held. How pages in the
swap file are handled is described later in this chapter.
Not all vm_area_struct data structures have a set of virtual memory operations and even those that
do may not have a nopage operation. This is because by default Linux will fix up the access by
allocating a new physical page and creating a valid page table entry for it. If there is a nopage
operation for this area of virtual memory, Linux will use it.
The generic Linux nopage operation is used for memory mapped executable images and it uses the
page cache to bring the required image page into physical memory.
However the required page is brought into physical memory, the process's page tables are updated.
It may be necessary for hardware specific actions to update those entries, particularly if the processor
uses translation look aside buffers. Now that the page fault has been handled it can be dismissed and
the process is restarted at the instruction that made the faulting virtual memory access.
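The overall decision path can be summarised in a small, self-contained sketch. The types and helper values below are invented for illustration; only the branching mirrors the steps described above.

    #include <stdio.h>
    #include <stdbool.h>

    /* A toy model of the decision taken on a page fault. */
    enum pte_state { PTE_PRESENT, PTE_SWAPPED, PTE_EMPTY };

    struct vm_area {
        unsigned long start, end;   /* virtual address range covered */
        bool may_write;             /* is writing to this area allowed? */
        bool has_nopage;            /* does the area supply a nopage operation? */
    };

    enum fault_action { ACT_SIGSEGV, ACT_SWAP_IN, ACT_NOPAGE, ACT_FRESH_PAGE };

    enum fault_action handle_fault(const struct vm_area *vma, unsigned long addr,
                                   bool write_access, enum pte_state pte)
    {
        if (vma == NULL || addr < vma->start || addr >= vma->end)
            return ACT_SIGSEGV;     /* no area covers this address */
        if (write_access && !vma->may_write)
            return ACT_SIGSEGV;     /* illegal kind of access */
        if (pte == PTE_SWAPPED)
            return ACT_SWAP_IN;     /* invalid but non-empty PTE: page is in swap */
        if (vma->has_nopage)
            return ACT_NOPAGE;      /* demand-page from the executable image */
        return ACT_FRESH_PAGE;      /* default: allocate a new physical page */
    }

    int main(void)
    {
        struct vm_area text = { 0x1000, 0x5000, false, true };
        printf("%d\n", handle_fault(&text, 0x2000, false, PTE_EMPTY)); /* ACT_NOPAGE */
        printf("%d\n", handle_fault(&text, 0x2000, true,  PTE_EMPTY)); /* ACT_SIGSEGV */
        return 0;
    }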
The Linux Page Cache
The role of the Linux page cache is to speed up access to files on disk. Memory mapped files are
read a page at a time and these pages are stored in the page cache. Figure 3.6 shows that the page
cache consists of the page_hash_table, a vector of pointers to mem_map_t data structures.
Each file in Linux is identified by a VFS inode data structure (described in the filesystem
chapter) and each VFS inode is unique and fully describes one and only one file. The index into the
page hash table is derived from the file's VFS inode and the offset into the file.
Whenever a page is read from a memory mapped file, for example when it needs to be brought back
into memory during demand paging, the page is read through the page cache. If the page is present in
the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault
handling code. Otherwise the page must be brought into memory from the file system that holds the
image. Linux allocates a physical page and reads the page from the file on disk.
If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead
means that if the process is accessing the pages in the file serially, the next page will be waiting in
memory for the process.
Over time the page cache grows as images are read and executed. Pages will be removed from the
cache as they are no longer needed, say as an image is no longer being used by any process. As
Linux uses memory it can start to run low on physical pages. In this case Linux will reduce the size
of the page cache.
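A toy version of such a lookup, keyed on the owning inode and the page-aligned file offset, might look like this; the structure and hash function are illustrative stand-ins for the real page_hash_table.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE      4096u
    #define PAGE_HASH_SIZE 1024

    /* Toy page-cache entry: a cached page of a file, identified by the owning
     * inode number and the page-aligned offset within the file. */
    struct cached_page {
        unsigned long inode;
        unsigned long offset;           /* page-aligned offset into the file */
        unsigned char data[PAGE_SIZE];
        struct cached_page *next;
    };

    static struct cached_page *page_hash_table[PAGE_HASH_SIZE];

    static unsigned page_hash(unsigned long inode, unsigned long offset)
    {
        return (unsigned)((inode ^ (offset / PAGE_SIZE)) % PAGE_HASH_SIZE);
    }

    /* Return the cached page, or NULL if the caller must read it from disk
     * (and would typically also read ahead the next page of the file). */
    struct cached_page *find_page(unsigned long inode, unsigned long offset)
    {
        offset &= ~(unsigned long)(PAGE_SIZE - 1);
        struct cached_page *p = page_hash_table[page_hash(inode, offset)];
        for (; p != NULL; p = p->next)
            if (p->inode == inode && p->offset == offset)
                return p;
        return NULL;
    }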
Swapping Out and Discarding Pages
When physical memory becomes scarce the Linux memory management subsystem must attempt to
free physical pages. This task falls to the kernel swap daemon (kswapd).
The kernel swap daemon is a special type of process, a kernel thread. Kernel threads are processes that
have no virtual memory; instead they run in kernel mode in the physical address space. The kernel
swap daemon is slightly misnamed in that it does more than merely swap pages out to the system's
swap files. Its role is to make sure that there are enough free pages in the system to keep the memory
management system operating efficiently.
The Kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits
waiting for the kernel swap timer to periodically expire.
Every time the timer expires, the swap daemon looks to see if the number of free pages in the system
is getting too low. It uses two variables, free_pages_high and free_pages_low to decide if it should
free some pages. So long as the number of free pages in the system remains above free_pages_high,
the kernel swap daemon does nothing; it sleeps again until its timer next expires. For the purposes of
this check the kernel swap daemon takes into account the number of pages currently being written
out to the swap file. It keeps a count of these in nr_async_pages; this is incremented each time a
page is queued waiting to be written out to the swap file and decremented when the write to the swap
device has completed. free_pages_low and free_pages_high are set at system startup time and are
related to the number of physical pages in the system. If the number of free pages in the system has
fallen below free_pages_high or worse still free_pages_low, the kernel swap daemon will try three
ways to reduce the number of physical pages being used by the system:
1. Reducing the size of the buffer and page caches,
2. Swapping out System V shared memory pages,
3. Swapping out and discarding pages.
If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon
will try to free 6 pages before it next runs. Otherwise it will try to free 3 pages. Each of the above
methods is tried in turn until enough pages have been freed. The kernel swap daemon remembers
which method it was using the last time that it attempted to free physical pages. Each time it runs it
will start trying to free pages using this last successful method.
After it has freed sufficient pages, the swap daemon sleeps again until its timer expires. If the reason
that the kernel swap daemon freed pages was that the number of free pages in the system had fallen
below free_pages_low, it only sleeps for half its usual time. Once the number of free pages is more
than free_pages_low the kernel swap daemon goes back to sleeping longer between checks.
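One pass of the swap daemon, as described above, can be sketched like this. The variable names follow the text, the helper bodies are stubs, and counting in-flight swap writes as free pages is one reasonable reading of the description, not a statement of the actual kernel code; the targets of 6 and 3 pages are those quoted above.

    #include <stdbool.h>
    #include <stdio.h>

    static int nr_free_pages   = 40;
    static int nr_async_pages  = 0;     /* pages currently queued for writing to swap */
    static int free_pages_low  = 30;
    static int free_pages_high = 60;

    /* Stubbed-out versions of the three methods; each returns true when it
     * managed to free enough pages on its own. */
    static bool shrink_caches(int target)          { (void)target; return false; }
    static bool swap_out_shm_pages(int target)     { (void)target; return false; }
    static bool swap_out_process_pages(int target) { (void)target; return true;  }

    /* One pass; returns the number of timer ticks to sleep before the next one. */
    int kswapd_once(int interval)
    {
        int effective_free = nr_free_pages + nr_async_pages;
        if (effective_free >= free_pages_high)
            return interval;                    /* plenty of memory: do nothing */

        int target = (effective_free < free_pages_low) ? 6 : 3;
        /* Try each method in turn until one frees enough pages. */
        if (!shrink_caches(target))
            if (!swap_out_shm_pages(target))
                swap_out_process_pages(target);

        /* Sleep for half the usual time if memory was dangerously low. */
        return (effective_free < free_pages_low) ? interval / 2 : interval;
    }

    int main(void)
    {
        printf("next wakeup in %d ticks\n", kswapd_once(100));
        return 0;
    }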
Reducing the Size of the Page and Buffer Caches
The pages held in the page and buffer caches are good candidates for being freed into the free_area
vector. The Page Cache, which contains pages of memory mapped files, may contain unnecessary
pages that are filling up the system's memory. Likewise the Buffer Cache, which contains buffers
read from or being written to physical devices, may also contain unneeded buffers. When the
physical pages in the system start to run out, discarding pages from these caches is relatively easy as
it requires no writing to physical devices (unlike swapping pages out of memory). Discarding these
pages does not have too many harmful side effects other than making access to physical devices and
memory mapped files slower. However, if the discarding of pages from these caches is done fairly,
all processes will suffer equally.
Every time the Kernel swap daemon tries to shrink these caches it examines a block of pages in the
mem_map page vector to see if any can be discarded from physical memory. The size of the block of
pages examined is higher if the kernel swap daemon is intensively swapping; that is if the number of
free pages in the system has fallen dangerously low. The blocks of pages are examined in a cyclical
manner; a different block of pages is examined each time an attempt is made to shrink the memory
map. This is known as the clock algorithm as, rather like the minute hand of a clock, the whole
mem_map page vector is examined a few pages at a time.
Each page being examined is checked to see if it is cached in either the page cache or the buffer
cache. You should note that shared pages are not considered for discarding at this time and that a
page cannot be in both caches at the same time. If the page is not in either cache then the next page
in the mem_map page vector is examined.
Pages are cached in the buffer cache (or rather the buffers within the pages are cached) to make
buffer allocation and deallocation more efficient. The memory map shrinking code tries to free the
buffers that are contained within the page being examined.
If all the buffers are freed, then the pages that contain them are also freed. If the examined page is
in the Linux page cache, it is removed from the page cache and freed.
When enough pages have been freed on this attempt then the kernel swap daemon will wait until the
next time it is periodically woken. As none of the freed pages were part of any process's virtual
memory (they were cached pages), then no page tables need updating. If there were not enough
cached pages discarded then the swap daemon will try to swap out some shared pages.
Swapping Out System V Shared Memory Pages
System V shared memory is an inter-process communication mechanism which allows two or more
processes to share virtual memory in order to pass information amongst themselves. How processes
share memory in this way is described in more detail in the IPC chapter. For now it is enough
to say that each area of System V shared memory is described by a shmid_ds data structure. This
contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area
of virtual memory. The vm_area_struct data structures describe where in each process's virtual
memory this area of System V shared memory goes. Each vm_area_struct data structure for this
System V shared memory is linked together using the vm_next_shared and vm_prev_shared
pointers. Each shmid_ds data structure also contains a list of page table entries each of which
describes the physical page that a shared virtual page maps to.
The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory
pages. Each time it runs it remembers which page of which shared virtual memory area it last swapped
out. It does this by keeping two indices, the first is an index into the set of shmid_ds data structures,
the second into the list of page table entries for this area of System V shared memory. This makes
sure that it fairly victimizes the areas of System V shared memory.
As the physical page frame number for a given virtual page of System V shared memory is
contained in the page tables of all of the processes sharing this area of virtual memory, the kernel
swap daemon must modify all of these page tables to show that the page is no longer in memory but
is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon finds
the page table entry in each of the sharing processes' page tables (by following a pointer from each
vm_area_struct data structure). If this process's page table entry for this page of System V shared
memory is valid, it converts it into an invalid but swapped out page table entry and reduces this
(shared) page's count of users by one. The format of a swapped out System V shared page table entry
contains an index into the set of shmid_ds data structures and an index into the page table entries for
this area of System V shared memory.
If the page's count is zero after the page tables of the sharing processes have all been modified, the
shared page can be written out to the swap file. The page table entry in the list pointed at by the
shmid_ds data structure for this area of System V shared memory is replaced by a swapped out page
table entry. A swapped out page table entry is invalid but contains an index into the set of open swap
files and the offset in that file where the swapped out page can be found. This information will be
used when the page has to be brought back into physical memory.
Swapping Out and Discarding Pages
The swap daemon looks at each process in the system in turn to see if it is a good candidate for
swapping.
Good candidates are processes that can be swapped (some cannot) and that have one or more pages
which can be swapped or discarded from memory. Pages are swapped out of physical memory into
the system's swap files only if the data in them cannot be retrieved another way.
A lot of the contents of an executable image come from the image's file and can easily be re-read
from that file. For example, the executable instructions of an image will never be modified by the
image and so will never be written to the swap file. These pages can simply be discarded; when they
are again referenced by the process, they will be brought back into memory from the executable
image.
Once the process to swap has been located, the swap daemon looks through all of its virtual memory
regions looking for areas which are not shared or locked.
Linux does not swap out all of the swappable pages of the process that it has selected; instead it
removes only a small number of pages.
Pages cannot be swapped or discarded if they are locked in memory.
The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data
structure) that gives the Kernel swap daemon some idea whether or not a page is worth swapping.
Pages age when they are unused and rejuvenate on access; the swap daemon only swaps out old
pages. The default action when a page is first allocated is to give it an initial age of 3. Each time it is
touched, its age is increased by 3 to a maximum of 20. Every time the Kernel swap daemon runs it
ages pages, decrementing their age by 1. These default actions can be changed and for this reason
they (and other swap related information) are stored in the swap_control data structure.
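The aging rules quoted above (initial age 3, plus 3 per touch up to a maximum of 20, minus 1 per pass of the swap daemon) can be modelled directly; the constants are the defaults from the text and the structure is a toy.

    #include <stdio.h>

    #define PAGE_INITIAL_AGE 3
    #define PAGE_TOUCH_BONUS 3
    #define PAGE_MAX_AGE     20

    struct page { int age; };

    void page_alloc(struct page *p) { p->age = PAGE_INITIAL_AGE; }

    void page_touch(struct page *p)
    {
        p->age += PAGE_TOUCH_BONUS;
        if (p->age > PAGE_MAX_AGE)
            p->age = PAGE_MAX_AGE;
    }

    void page_age_tick(struct page *p)   /* one pass of the swap daemon */
    {
        if (p->age > 0)
            p->age--;
    }

    int page_is_swap_candidate(const struct page *p) { return p->age == 0; }

    int main(void)
    {
        struct page p;
        page_alloc(&p);
        for (int i = 0; i < 3; i++)
            page_age_tick(&p);           /* three passes with no accesses */
        printf("candidate after 3 idle passes: %s\n",
               page_is_swap_candidate(&p) ? "yes" : "no");
        return 0;
    }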
If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which can
be swapped out. Linux uses an architecture specific bit in the PTE to describe pages this way.
However, not all dirty pages are necessarily written to the swap file. Every virtual memory region of
a process may have its own swap operation (pointed at by the vm_ops pointer in the
vm_area_struct) and that method is used. Otherwise, the swap daemon will allocate a page in the
swap file and write the page out to that device.
The page's page table entry is replaced by one which is marked as invalid but which contains
information about where the page is in the swap file. This is an offset into the swap file where the
page is held and an indication of which swap file is being used. Whatever the swap method used, the
original physical page is made free by putting it back into the free_area. Clean (or rather not dirty)
pages can be discarded and put back into the free_area for re-use.
If enough of the swappable processes pages have been swapped out or discarded, the swap daemon
will again sleep. The next time it wakes it will consider the next process in the system. In this way,
the swap daemon nibbles away at each process's physical pages until the system is again in balance.
This is much fairer than swapping out whole processes.
The Swap Cache
When swapping pages out to the swap files, Linux avoids writing pages if it does not have to. There
are times when a page is both in a swap file and in physical memory. This happens when a page that
was swapped out of memory was then brought back into memory when it was again accessed by a
process. So long as the page in memory is not written to, the copy in the swap file remains valid.
Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one
per physical page in the system. This is a page table entry for a swapped out page and describes
which swap file the page is being held in together with its location in the swap file. If a swap cache
entry is non-zero, it represents a page which is being held in a swap file that has not been modified.
If the page is subsequently modified (by being written to), its entry is removed from the swap cache.
When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there
is a valid entry for this page, it does not need to write the page out to the swap file. This is because
the page in memory has not been modified since it was last read from the swap file.
The entries in the swap cache are page table entries for swapped out pages. They are marked as
invalid but contain information which allow Linux to find the right swap file and the right page
within that swap file.
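A sketch of the swap-cache check might look like this, picturing the cache as one entry per physical page frame with zero meaning there is no valid copy in a swap file. The names and representation are illustrative, not the kernel's own.

    #include <stddef.h>

    #define NR_PHYS_PAGES 1024

    /* One slot per physical page, holding the swapped-out page table entry
     * (zero means "no valid copy of this page exists in a swap file"). */
    static unsigned long swap_cache[NR_PHYS_PAGES];

    /* Called when a page in memory is modified: the swap-file copy is now stale. */
    void swap_cache_invalidate(unsigned long pfn)
    {
        swap_cache[pfn] = 0;
    }

    /* When swapping a page out, return its existing swap entry if the copy in
     * the swap file is still valid, so the write can be skipped; 0 means the
     * page must be written out (and a new swap entry allocated). */
    unsigned long swap_cache_lookup(unsigned long pfn)
    {
        return swap_cache[pfn];
    }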
Swapping Pages In
The dirty pages saved in the swap files may be needed again, for example when an application writes
to an area of virtual memory whose contents are held in a swapped out physical page. Accessing a
page of virtual memory that is not held in physical memory causes a page fault to occur. The page
fault is the processor signalling the operating system that it cannot translate a virtual address into a
physical one. In this case this is because the page table entry describing this page of virtual memory
was marked as invalid when the page was swapped out. The processor cannot handle the virtual to
physical address translation and so hands control back to the operating system describing as it does
so the virtual address that faulted and the reason for the fault. The format of this information and
how the processor passes control to the operating system is processor specific.
The processor specific page fault handling code must locate the vm_area_struct data structure that
describes the area of virtual memory that contains the faulting virtual address. It does this by
searching the vm_area_struct data structures for this process until it finds the one containing the
faulting virtual address. This is very time critical code and a process's vm_area_struct data
structures are so arranged as to make this search take as little time as possible.
Having carried out the appropriate processor specific actions and found that the faulting virtual
address is for a valid area of virtual memory, the page fault processing becomes generic and
applicable to all processors that Linux runs on.
The generic page fault handling code looks for the page table entry for the faulting virtual address. If
the page table entry it finds is for a swapped out page, Linux must swap the page back into physical
memory. The format of the page table entry for a swapped out page is processor specific but all
processors mark these pages as invalid and put the information necessary to locate the page within
the swap file into the page table entry. Linux needs this information in order to bring the page back
into physical memory.
At this point, Linux knows the faulting virtual address and has a page table entry containing
information about where this page has been swapped to. The vm_area_struct data structure may
contain a pointer to a routine which will swap any page of the area of virtual memory that it
describes back into physical memory. This is its swapin operation. If there is a swapin operation for
this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared
memory pages are handled as it requires special handling because the format of a swapped out
System V shared page is a little different from that of an ordinary swapped out page. There may not
be a swapin operation, in which case Linux will assume that this is an ordinary page that does not
need to be specially handled.
It allocates a free physical page and reads the swapped out page back from the swap file. Information
telling it where in the swap file (and which swap file) the page is held is taken from the invalid page table entry.
If the access that caused the page fault was not a write access then the page is left in the swap cache
and its page table entry is not marked as writable. If the page is subsequently written to, another page
fault will occur and, at that point, the page is marked as dirty and its entry is removed from the swap
cache. If the page is not written to and it needs to be swapped out again, Linux can avoid the write of
the page to its swap file because the page is already in the swap file.
If the access that caused the page to be brought in from the swap file was a write operation, this page
is removed from the swap cache and its page table entry is marked as both dirty and writable.
File management
The term computer file management refers to the manipulation of documents and data in files on a
computer.
Specifically, one may create a new file or edit an existing file and save it; open or load a pre-existing
file into memory; or close a file without saving it. Additionally, one may group related files in
directories. These tasks are accomplished in different ways in different operating systems and
depend on the user interface design and, to some extent, the storage medium being used.
Although the file management paradigm described above is currently the dominant one in
computing, attempts have been made to create more efficient or usable paradigms. The concept of
saving a file, in particular, has been the subject of much innovation, with some applications
including an autosave feature (to periodically save changes to a file in case of a computer crash,
power outage, etc.) and others doing away with the save concept completely. In the latter case, one
typically opens and closes files without ever being given the option to save them. Such applications
usually have a multi-level undo feature to replace the concept of closing a file without saving any
changes.
Concept of the hierarchy of files
Files can also be managed based on their location on a storage device. They are stored in a storage
medium in binary form. Physically, the data is placed in a not-so-well organized structure, due to
fragmentation. However, the grouping of files into directories (for operating systems such as DOS,
Unix, Linux) or folders (for the Mac OS and Windows) is done by changing an index of file
information known as the File Allocation Table (FAT) or, on NTFS as used by recent versions of
Windows, the Master File Table (MFT), depending on the operating system used. In this index, the
physical location of a particular file on the storage medium is stored, as well as its position in the
hierarchy of directories (as we see it using commands such as DIR or LS and programs such as
Explorer or Finder).
On Unix/Linux machines the hierarchy is:
• The root directory (/)
  o Directories (/usr "user" or /dev "device")
    ▪ Sub-directories (/usr/local)
      ▪ Files: data, devices, links, etc. (/usr/local/readme.txt or /dev/hda1,
        which is the hard disk device)
For DOS/Windows the hierarchy (along with examples):
• Drive (C:)
  o Directory/Folder (C:\My Documents)
    ▪ Sub-directory/Sub-folder (C:\My Documents\My Pictures)
      ▪ File (C:\My Documents\My Pictures\VacationPhoto.jpg)
Commands such as:
• Unix/Linux: cp, mv
• DOS: copy, move
• Windows: the Cut/Copy/Paste commands in the file menu of Explorer
can be used to manage (copy or move) the files to and from other directories.
File system fragmentation
In computing, file system fragmentation, sometimes called file system aging, is the inability of a
file system to lay out related data sequentially (contiguously), an inherent phenomenon in storage-
backed file systems that allow in-place modification of their contents. It is a special case of data
fragmentation. File system fragmentation increases disk head movement or seeks, which are known
to hinder throughput. The correction to existing fragmentation is to reorganize files and free space
back into contiguous areas, a process called defragmentation.
When a file system is first initialized on a partition (the partition is formatted for the file system), the
entire space allotted is empty. This means that the allocator algorithm is completely free to position
newly created files anywhere on the disk. For some time after creation, files on the file system can
be laid out near-optimally. When the operating system and applications are installed or other
archives are unpacked, laying out separate files sequentially also means that related files are likely to
be positioned close to each other.
However, as existing files are deleted or truncated, new regions of free space are created. When
existing files are appended to, it is often impossible to resume the write exactly where the file used
to end, as another file may already be allocated there — thus, a new fragment has to be allocated. As
time goes on, and the same factors are continuously present, free space as well as frequently
appended files tend to fragment more. Shorter regions of free space also mean that the allocator is no
longer able to allocate new files contiguously, and has to break them into fragments. This is
especially true when the file system is more full — longer contiguous regions of free space are less
likely to occur.
Note that the following is a simplification of an otherwise complicated subject. The method which is
about to be explained has been the general practice for allocating files on disk and other random-
access storage, for over 30 years. Some operating systems do not simply allocate files one after the
other, and some use various methods to try to prevent fragmentation, but in general, sooner or later,
for the reasons explained below, fragmentation will occur as time goes by on
any system where files are routinely deleted or expanded. Consider the following scenario:
A new disk has had 5 files saved on it, named A, B, C, D and E, and each file is using 10 blocks of
space (here the block size is unimportant.) As the free space is contiguous the files are located one
after the other (Example (1).)
If file B is deleted, a second region of 10 blocks of free space is created, and the disk becomes
fragmented. The file system could defragment the disk immediately after a deletion, which would
incur a severe performance penalty at unpredictable times, but in general the empty space is simply
left there, marked in a table as available for later use, then used again as needed[2]
(Example (2).)
Now if a new file F requires 7 blocks of space, it can be placed into the first 7 blocks of the space
formerly holding the file B, and the 3 blocks following it will remain available (Example (3).) If
another new file G is added, and needs only three blocks, it could then occupy the space after F and
before C (Example (4).)
If subsequently F needs to be expanded, since the space immediately following it is occupied, there
are three options: (1) add a new block somewhere else and indicate that F has a second extent, (2)
move files in the way of the expansion elsewhere, to allow F to remain contiguous; or (3) move file
F so it can be one contiguous file of the new, larger size. The second option is probably impractical
for performance reasons, as is the third when the file is very large. Indeed the third option is
impossible when there is no single contiguous free space large enough to hold the new file. Thus the
usual practice is simply to create an extent somewhere else and chain the new extent onto the old one
(Example (5).)
Material added to the end of file F would be part of the same extent. But if there is so much material
that no room is available after the last extent, then another extent would have to be created, and so
on, and so on. Eventually the file system has free segments in many places and some files may be
spread over many extents. Access time for those files (or for all files) may become excessively long.
To summarize, factors that typically cause or facilitate fragmentation include:
• low free space.
• frequent deletion, truncation or extension of files.
• overuse of sparse files.
Performance implications
File system fragmentation is projected to become more problematic with newer hardware due to the
increasing disparity between sequential access speed and rotational delay (and to a lesser extent seek
time), of consumer-grade hard disks, which file systems are usually placed on. Thus, fragmentation
is an important problem in recent file system research and design. The containment of fragmentation
not only depends on the on-disk format of the file system, but also heavily on its implementation. In
simple file system benchmarks, the fragmentation factor is often omitted, as realistic aging and
fragmentation is difficult to model. Rather, for simplicity of comparison, file system benchmarks are
often run on empty file systems, and unsurprisingly, the results may vary heavily from real-life
access patterns.
Types of fragmentation
File system fragmentation may occur on several levels:
• Fragmentation within individual files and their metadata.
• Free space fragmentation, making it increasingly difficult to lay out new files contiguously.
• The decrease of locality of reference between separate, but related files.
File fragmentation
Individual file fragmentation occurs when a single file has been broken into multiple pieces (called
extents on extent-based file systems). While disk file systems attempt to keep individual files
contiguous, this is not often possible without significant performance penalties. File system check
and defragmentation tools typically only account for file fragmentation in their "fragmentation
percentage" statistic.
Free space fragmentation
Free (unallocated) space fragmentation occurs when there are several unused areas of the file system
where new files or metadata can be written to. Unwanted free space fragmentation is generally
caused by deletion or truncation of files, but file systems may also intentionally insert fragments
("bubbles") of free space in order to facilitate extending nearby files (see preemptive techniques
below).
File scattering
File segmentation, also called related-file fragmentation, or application-level (file) fragmentation,
refers to the lack of locality of reference (within the storing medium) between related files (see file
sequences for more detail). Unlike the previous two types of fragmentation, file scattering is a much
more vague concept, as it heavily depends on the access pattern of specific applications. This also
makes objectively measuring or estimating it very difficult. However, arguably, it is the most critical
type of fragmentation, as studies have found that the most frequently accessed files tend to be small
compared to available disk throughput per second.
To avoid related file fragmentation and improve locality of reference (in this case called file
contiguity), assumptions about the operation of applications have to be made. A very frequent
assumption made is that it is worthwhile to keep smaller files within a single directory together, and
lay them out in the natural file system order. While it is often a reasonable assumption, it does not
always hold. For example, an application might read several different files, perhaps in different
directories, in the exact same order they were written. Thus, a file system that simply orders all
writes successively, might work faster for the given application.
Techniques for mitigating fragmentation
Several techniques have been developed to fight fragmentation. They can usually be classified into
two categories: preemptive and retroactive. Because access patterns are hard to predict, these
techniques are most often heuristic in nature, and may degrade performance under unexpected
workloads.
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 

Recently uploaded (20)

Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 

OS-What is an operating system

Types of operating system schedulers
Operating systems may feature up to three distinct types of schedulers: a long-term scheduler (also known as an admission scheduler or high-level scheduler), a mid-term or medium-term scheduler, and a short-term scheduler (also known as a dispatcher). The names suggest the relative frequency with which these functions are performed.

Long-term Scheduler
The long-term, or admission, scheduler decides which jobs or processes are to be admitted to the ready queue; that is, when an attempt is made to execute a program, its admission to the set of currently executing processes is either authorized or delayed by the long-term scheduler. Thus, this scheduler dictates which processes are to run on a system, and the degree of concurrency to be supported at any one time, i.e. whether many or few processes are to be executed concurrently, and how the split between I/O-intensive and CPU-intensive processes is to be handled.
In modern OSs, this is used to make sure that real-time processes get enough CPU time to finish their tasks. Long-term scheduling is also important in large-scale systems such as batch processing systems, computer clusters, supercomputers and render farms. In these cases, special-purpose job scheduler software is typically used to assist these functions, in addition to any underlying admission scheduling support in the operating system.

Mid-term Scheduler
The mid-term scheduler, present in all systems with virtual memory, temporarily removes processes from main memory and places them on secondary memory (such as a disk drive), or vice versa. This is commonly referred to as "swapping out" or "swapping in" (also incorrectly as "paging out" or "paging in"). The mid-term scheduler may decide to swap out a process which has not been active for some time, a process which has a low priority, a process which is page faulting frequently, or a process which is taking up a large amount of memory, in order to free up main memory for other processes, swapping the process back in later when more memory is available, or when the process has been unblocked and is no longer waiting for a resource. [Stallings, 396] [Stallings, 370]

In many systems today (those that support mapping virtual address space to secondary storage other than the swap file), the mid-term scheduler may actually perform the role of the long-term scheduler, by treating binaries as "swapped-out processes" upon their execution.

Short-term Scheduler
The short-term scheduler (also known as the dispatcher) decides which of the ready, in-memory processes is to be executed (allocated a CPU) next, following a clock interrupt, an I/O interrupt, an operating system call or another form of signal. Thus the short-term scheduler makes scheduling decisions much more frequently than the long-term or mid-term schedulers: a scheduling decision will at a minimum have to be made after every time slice, and these are very short. This scheduler can be preemptive, implying that it is capable of forcibly removing processes from a CPU when it decides to allocate that CPU to another process, or non-preemptive (also known as "voluntary" or "co-operative"), in which case the scheduler is unable to "force" processes off the CPU.

Process states
In a multitasking computer system, processes may occupy a variety of states. These distinct states may not actually be recognized as such by the operating system kernel; however, they are a useful abstraction for understanding processes. The various process states can be shown in a state diagram, with arrows indicating the possible transitions between states; some processes are stored in main memory, and some are stored in secondary (virtual) memory.

Primary process states
The following typical process states are possible on computer systems of all kinds. In most of these states, processes are "stored" in main memory.

Created
(Also called new.) When a process is first created, it occupies the "created" or "new" state. In this state, the process awaits admission to the "ready" state. This admission will be approved or delayed by a long-term, or admission, scheduler. Typically in most desktop computer systems, this admission will be approved automatically; however, for real-time operating systems this admission may be delayed. In a real-time system, admitting too many processes to the "ready" state may lead to oversaturation and overcontention for the system's resources, leading to an inability to meet process deadlines.

Ready
(Also called waiting or runnable.) A "ready" or "waiting" process has been loaded into main memory and is awaiting execution on a CPU (to be context switched onto the CPU by the dispatcher, or short-term scheduler). There may be many "ready" processes at any one point of the system's execution; for example, in a single-processor system, only one process can be executing at any one time, and all other "concurrently executing" processes will be waiting for execution. A ready queue is used in computer scheduling. Modern computers are capable of running many different programs or processes at the same time, but the CPU can only handle one process at a time. Processes that are ready for the CPU are kept in a queue of "ready" processes. Other processes, which are waiting for an event to occur, such as loading information from a hard drive or waiting on an internet connection, are not in the ready queue.

Running
(Also called active or executing.) A "running", "executing" or "active" process is a process which is currently executing on a CPU. From this state the process may exceed its allocated time slice and be context switched out and back to "ready" by the operating system, it may indicate that it has finished and be terminated, or it may block on some needed resource (such as an input/output resource) and be moved to a "blocked" state.

Blocked
(Also called sleeping.) Should a process "block" on a resource (such as a file, a semaphore or a device), it will be removed from the CPU (as a blocked process cannot continue execution) and will be in the blocked state. The process will remain "blocked" until its resource becomes available, which can unfortunately lead to deadlock. From the blocked state, the operating system may notify the process of the availability of the resource it is blocking on (the operating system itself may be alerted to the resource availability by an interrupt). Once the operating system is aware that a process is no longer blocking, the process is again "ready" and can from there be dispatched to its "running" state, and from there the process may make use of its newly available resource.

Terminated
A process may be terminated, either from the "running" state by completing its execution or by being explicitly killed. In either of these cases, the process moves to the "terminated" state. If a process is not removed from memory after entering this state, this state may also be called zombie.
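The primary states above can be captured directly in code. The following is a minimal sketch in C of the simplified life cycle described here; it is an illustration only, not any particular kernel's implementation:

#include <stdio.h>

/* Simplified life cycle from the text: Created -> Ready -> Running -> (Ready | Blocked | Terminated). */
typedef enum { CREATED, READY, RUNNING, BLOCKED, TERMINATED } proc_state;

/* Returns 1 if the transition is one of the moves described above, 0 otherwise. */
static int legal_transition(proc_state from, proc_state to)
{
    switch (from) {
    case CREATED: return to == READY;                        /* admitted by the long-term scheduler */
    case READY:   return to == RUNNING;                      /* dispatched by the short-term scheduler */
    case RUNNING: return to == READY || to == BLOCKED ||     /* time-slice expiry, wait for a resource, */
                         to == TERMINATED;                    /* or completion / being killed */
    case BLOCKED: return to == READY;                        /* awaited resource became available */
    default:      return 0;                                   /* TERMINATED is final */
    }
}

int main(void)
{
    printf("RUNNING -> BLOCKED legal? %d\n", legal_transition(RUNNING, BLOCKED));  /* prints 1 */
    printf("BLOCKED -> RUNNING legal? %d\n", legal_transition(BLOCKED, RUNNING));  /* prints 0: must go via READY */
    return 0;
}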
Additional process states
Two additional states are available for processes in systems that support virtual memory. In both of these states, processes are "stored" in secondary memory (typically a hard disk).

Swapped out and waiting
(Also called suspended and waiting.) In systems that support virtual memory, a process may be swapped out, that is, removed from main memory and placed in virtual memory by the mid-term scheduler. From here the process may be swapped back into the waiting state.

Swapped out and blocked
(Also called suspended and blocked.) Processes that are blocked may also be swapped out. In this event the process is both swapped out and blocked, and may be swapped back in again under the same circumstances as a swapped out and waiting process (although in this case, the process will move to the blocked state, and may still be waiting for a resource to become available).

Scheduling algorithm
A scheduling algorithm is the method by which threads, processes or data flows are given access to system resources (e.g. processor time, communications bandwidth). This is usually done to load balance a system effectively or to achieve a target quality of service. The need for a scheduling algorithm arises from the requirement for most modern systems to perform multitasking (execute more than one process at a time) and multiplexing (transmit multiple flows simultaneously).

More advanced algorithms take into account process priority, or the importance of the process. This allows some processes to use more time than other processes. Note that the kernel always uses whatever resources it needs to ensure proper functioning of the system, and so can be said to have infinite priority. In SMP (symmetric multiprocessing) systems, processor affinity is considered to increase overall system performance, even if it may cause a process itself to run more slowly. This generally improves performance by reducing cache thrashing.

Common scheduling algorithms include:
• First-come, first-served
• Shortest job first
• Round-robin
• Priority scheduling
• Fair queuing

First-come, first-served (FCFS)
FIFO is an acronym for First In, First Out, an abstraction for organizing and manipulating data relative to time and prioritization. The expression describes the principle of a queue processing technique: conflicting demands are serviced by ordering processes by first-come, first-served (FCFS) behaviour. What comes in first is handled first, what comes in next waits until the first is finished, and so on. It is thus analogous to the behaviour of persons queueing (or "standing in line", in common American parlance), where the persons leave the queue in the order in which they arrive, or to waiting one's turn at a traffic control signal. FCFS is also the shorthand name for the FIFO operating system scheduling algorithm, which gives every process CPU time in the order in which it arrives.

In the broader sense, the abstraction LIFO, or Last-In-First-Out, is the opposite of the FIFO organization; the difference is perhaps clearest when considering the less commonly used synonym of LIFO, FILO, meaning First-In-Last-Out. In essence, both are specific cases of a more generalized list (which could be accessed anywhere). The difference is not in the list (data), but in the rules for accessing the content. One sub-type adds to one end and takes off from the other; its opposite takes and puts things only on one end.
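Returning to FCFS scheduling: as a concrete illustration, the short C sketch below computes waiting and turnaround times for a few jobs that all arrive at time 0 and are served strictly in arrival order. The burst times are made-up example values:

#include <stdio.h>

int main(void)
{
    /* Illustrative CPU burst times (ms), listed in arrival order. */
    int burst[] = { 24, 3, 3 };
    int n = sizeof burst / sizeof burst[0];
    int waiting = 0, total_wait = 0, total_turnaround = 0;

    for (int i = 0; i < n; i++) {
        /* Under FCFS a job waits for every job that arrived before it. */
        int turnaround = waiting + burst[i];
        printf("job %d: waiting %d ms, turnaround %d ms\n", i + 1, waiting, turnaround);
        total_wait += waiting;
        total_turnaround += turnaround;
        waiting += burst[i];
    }
    printf("average waiting %.1f ms, average turnaround %.1f ms\n",
           (double)total_wait / n, (double)total_turnaround / n);
    return 0;
}

Note how a long job at the head of the queue (24 ms here) inflates the waiting time of every job behind it; this is the weakness that Shortest Job First, described next, tries to address.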
Shortest Job First (SJF)
Shortest Job First (SJF), also known as Shortest Process Next (SPN), is a scheduling policy that selects the waiting process with the smallest execution time to execute next. Shortest job next is advantageous because of its simplicity and because it maximizes process throughput (in terms of the number of processes run to completion in a given amount of time). However, it has the potential for process starvation: processes which require a long time to complete may be held off indefinitely if short processes are continually added. Highest response ratio next is similar but provides a solution to this problem. Shortest job next scheduling is rarely used outside of specialized environments because it requires accurate estimations of the runtime of all processes that are waiting to execute.

Round-robin scheduling
Round-robin (RR) is one of the simplest scheduling algorithms for processes in an operating system. It assigns time slices to each process in equal portions and in circular order, handling all processes without priority. Round-robin scheduling is simple, easy to implement, and starvation-free. Round-robin scheduling can also be applied to other scheduling problems, such as data packet scheduling in computer networks. The name of the algorithm comes from the round-robin principle known from other fields, where each person takes an equal share of something in turn.

Round-robin job scheduling may not be desirable if the sizes of the jobs or tasks vary strongly. A process that produces large jobs would be favoured over other processes. This problem may be solved by time-sharing, i.e. by giving each job a time slot or quantum (its allowance of CPU time) and interrupting the job if it is not completed by then. The job is resumed the next time a time slot is assigned to that process.

Example: The time slot could be 100 milliseconds. If job1 takes a total of 250 ms to complete, the round-robin scheduler will suspend the job after 100 ms and give other jobs their time on the CPU. Once the other jobs have had their equal share (100 ms each), job1 will get another allocation of CPU time and the cycle will repeat. This process continues until the job finishes and needs no more time on the CPU.

• Job1 = total time to complete 250 ms (quantum 100 ms).
1. First allocation = 100 ms.
2. Second allocation = 100 ms.
3. Third allocation = 100 ms, but job1 self-terminates after 50 ms.
4. Total CPU time of job1 = 250 ms.
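The round-robin behaviour in the example above can be reproduced with a short simulation. This sketch assumes all jobs are ready at time 0 and uses the 100 ms quantum from the text; the second job's length is an arbitrary illustrative value:

#include <stdio.h>

int main(void)
{
    int remaining[] = { 250, 120 };   /* job1 from the example (250 ms) plus one illustrative job */
    int n = sizeof remaining / sizeof remaining[0];
    int quantum = 100, clock = 0, left = n;

    while (left > 0) {
        for (int i = 0; i < n; i++) {
            if (remaining[i] <= 0)
                continue;
            /* Each ready job gets at most one quantum per round, in circular order. */
            int slice = remaining[i] < quantum ? remaining[i] : quantum;
            clock += slice;
            remaining[i] -= slice;
            printf("t=%4d ms: job%d ran %3d ms, %3d ms left\n", clock, i + 1, slice, remaining[i]);
            if (remaining[i] == 0)
                left--;
        }
    }
    return 0;
}

Running this shows job1 receiving 100 ms, 100 ms and finally 50 ms, exactly as in the worked example, while the other job is interleaved between its allocations.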
Multiprogramming
In many modern operating systems, there can be more than one instance of a program loaded in memory at the same time; for example, more than one user could be executing the same program, each user having separate copies of the program loaded into memory. With some programs, it is possible to have one copy loaded into memory, while several users have shared access to it so that they each can execute the same program code. Such a program is said to be re-entrant. The processor at any instant can only be executing one instruction from one program, but several processes can be sustained over a period of time by assigning each process to the processor at intervals while the remainder become temporarily inactive. A number of processes being executed over a period of time instead of at the same time is called concurrent execution.

A multiprogramming or multitasking OS is a system executing many processes concurrently. Multiprogramming requires that the processor be allocated to each process for a period of time and de-allocated at an appropriate moment. If the processor is de-allocated during the execution of a process, it must be done in such a way that it can be restarted later as easily as possible. There are two possible ways for an OS to regain control of the processor during a program's execution in order for the OS to perform de-allocation or allocation:
1. The process issues a system call (sometimes called a software interrupt); for example, an I/O request occurs requesting to access a file on hard disk.
2. A hardware interrupt occurs; for example, a key was pressed on the keyboard, or a timer runs out (used in pre-emptive multitasking).

The stopping of one process and starting (or restarting) of another process is called a context switch or context change. In many modern operating systems, processes can consist of many sub-processes. This introduces the concept of a thread. A thread may be viewed as a sub-process; that is, a separate, independent sequence of execution within the code of one process. Threads are becoming increasingly important in the design of distributed and client-server systems and in software run on multi-processor systems.

How multiprogramming increases efficiency
A common trait observed among processes associated with most computer programs is that they alternate between CPU cycles and I/O cycles. For the portion of the time required for CPU cycles, the process is being executed, i.e. it is occupying the CPU. During the time required for I/O cycles, the process is not using the processor; instead, it is either waiting to perform input/output, or is actually performing input/output. An example of this is reading from or writing to a file on disk. Prior to the advent of multiprogramming, computers operated as single-user systems. Users of such systems quickly became aware that for much of the time that a computer was allocated to a single user, the processor was idle; when the user was entering information or debugging programs, for example.

Computer scientists observed that the overall performance of the machine could be improved by letting a different process use the processor whenever one process was waiting for input/output. In a uni-programming system, if N users were to execute programs with individual execution times of t1, t2, ..., tN, then the total time, tuni, to service the N processes (consecutively) of all N users would be:

tuni = t1 + t2 + ... + tN

However, because each process consumes both CPU cycles and I/O cycles, the time for which each process actually uses the CPU is a very small fraction of the total execution time for the process. So, for process i:

ti(processor) ≪ ti(execution)

where ti(processor) is the time process i spends using the CPU, and ti(execution) is the total execution time for the process, i.e. the time for CPU cycles plus I/O cycles to be carried out (executed) until completion of the process.

In fact, usually the sum of all the processor time used by the N processes rarely exceeds a small fraction of the time to execute any one of the processes. Therefore, in uni-programming systems, the processor lay idle for a considerable proportion of the time. To overcome this inefficiency, multiprogramming is now implemented in modern operating systems such as Linux, UNIX and Microsoft Windows. This enables the processor to switch from one process, X, to another, Y, whenever X is involved in the I/O phase of its execution. Since the processing time is much less than a single job's runtime, the total time to service all N users with a multiprogramming system can be reduced to approximately:

tmulti = max(t1, t2, ..., tN)
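To put illustrative numbers on these formulas (the times are assumed, not taken from the text): suppose three jobs need t1 = 4, t2 = 6 and t3 = 5 seconds each, most of it spent waiting on I/O. Run consecutively, tuni = 4 + 6 + 5 = 15 seconds; with ideal multiprogramming overlapping the I/O phases, tmulti ≈ max(4, 6, 5) = 6 seconds.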
Process creation
Operating systems need some way to create processes. In a very simple system designed for running only a single application (e.g., the controller in a microwave oven), it may be possible to have all the processes that will ever be needed present when the system comes up. In general-purpose systems, however, some way is needed to create and terminate processes as needed during operation. There are four principal events that cause a process to be created:
• System initialization.
• Execution of a process-creation system call by a running process.
• A user request to create a new process.
• Initiation of a batch job.

When an operating system is booted, typically several processes are created. Some of these are foreground processes, which interact with a (human) user and perform work for them. Others are background processes, which are not associated with particular users but instead have some specific function. For example, one background process may be designed to accept incoming e-mails, sleeping most of the day but suddenly springing to life when an incoming e-mail arrives. Another background process may be designed to accept incoming requests for web pages hosted on the machine, waking up when a request arrives to service that request.

Process termination
There are many reasons for process termination:
• Batch job issues halt instruction
• User logs off
• Process executes a service request to terminate
• Error and fault conditions
• Normal completion
• Time limit exceeded
• Memory unavailable
• Bounds violation; for example: attempted access of a (non-existent) 11th element of a 10-element array
• Protection error; for example: attempted write to a read-only file
• Arithmetic error; for example: attempted division by zero
• Time overrun; for example: process waited longer than a specified maximum for an event
• I/O failure
• Invalid instruction; for example: when a process tries to execute data (text)
• Privileged instruction
• Data misuse
• Operating system intervention; for example: to resolve a deadlock
• Parent terminates, so child processes terminate (cascading termination)
• Parent request

Two-state process management model
The operating system's principal responsibility is controlling the execution of processes. This includes determining the interleaving pattern for execution and the allocation of resources to processes. One part of designing an OS is to describe the behaviour that we would like each process to exhibit. The simplest model is based on the fact that a process is either being executed by a processor or it is not. Thus, a process may be considered to be in one of two states, RUNNING or NOT RUNNING. When the operating system creates a new process, that process is initially labeled as NOT RUNNING and is placed into a queue in the system in the NOT RUNNING state. The process (or some portion of it) then exists in main memory, and it waits in the queue for an opportunity to be executed. After some period of time, the currently RUNNING process will be interrupted and moved from the RUNNING state to the NOT RUNNING state, making the processor available for a different process. The dispatch portion of the OS will then select, from the queue of NOT RUNNING processes, one of the waiting processes to transfer to the processor. The chosen process is then relabeled from the NOT RUNNING state to the RUNNING state, and its execution is either begun if it is a new process, or resumed if it is a process which was interrupted at an earlier time.

From this model we can identify some design elements of the OS:
• The need to represent, and keep track of, each process.
• The state of a process.
• The queuing of NOT RUNNING processes.

Three-state process management model
Although the two-state process management model is a perfectly valid design for an operating system, the absence of a BLOCKED state means that the processor lies idle when the active process changes from CPU cycles to I/O cycles. This design does not make efficient use of the processor. The three-state process management model is designed to overcome this problem, by introducing a new state called the BLOCKED state. This state describes any process which is waiting for an I/O event to take place. In this case, an I/O event can mean the use of some device or a signal from another process. The three states in this model are:
• RUNNING: The process that is currently being executed.
• READY: A process that is queuing and prepared to execute when given the opportunity.
• BLOCKED: A process that cannot execute until some event occurs, such as the completion of an I/O operation.

At any instant, a process is in one and only one of the three states. For a single-processor computer, only one process can be in the RUNNING state at any one instant. There can be many processes in the READY and BLOCKED states, and each of these states will have an associated queue of processes.

Processes entering the system must go initially into the READY state; processes can only enter the RUNNING state via the READY state. Processes normally leave the system from the RUNNING state. For each of the three states, the process occupies space in main memory. While the reason for most transitions from one state to another might be obvious, some may not be so clear.

• RUNNING → READY The most common reason for this transition is that the running process has reached the maximum allowable time for uninterrupted execution, i.e. a time-out occurs. Other reasons can be the imposition of priority levels as determined by the scheduling policy used for the Low Level Scheduler, and the arrival of a higher-priority process into the READY state.
• RUNNING → BLOCKED A process is put into the BLOCKED state if it requests something for which it must wait. A request to the OS is usually in the form of a system call (i.e. a call from the running process to a function that is part of the OS code). For example, requesting a file from disk, or saving a section of code or data from memory to a file on disk.

Five-state process management model
While the three-state model is sufficient to describe the behavior of processes with the given events, we have to extend the model to allow for other possible events and for more sophisticated design. In particular, the use of a portion of the hard disk to emulate main memory (so-called virtual memory) requires additional states to describe the state of processes which are suspended from main memory and placed in virtual memory (on disk). Of course, such processes can, at a future time, be resumed by being transferred back into main memory. The Medium Level Scheduler controls these events. A process can be suspended from the RUNNING, READY or BLOCKED state, giving rise to two other states, namely READY SUSPEND and BLOCKED SUSPEND. A RUNNING process that is suspended becomes READY SUSPEND, and a BLOCKED process that is suspended becomes BLOCKED SUSPEND. A process can be suspended for a number of reasons; the most significant arises from the process being swapped out of memory by the memory management system in order to free memory for other processes. Other common reasons for a process being suspended are when one suspends execution while debugging a program, or when the system is monitoring processes. For the five-state process management model, consider the transitions described below.

• BLOCKED → BLOCKED SUSPEND If a process in the RUNNING state requires more memory, then at least one BLOCKED process can be swapped out of memory onto disk. The transition can also be made for a BLOCKED process if there are READY processes available, and the OS determines that the READY process it would like to dispatch requires more main memory to maintain adequate performance.
• BLOCKED SUSPEND → READY SUSPEND A process in the BLOCKED SUSPEND state is moved to the READY SUSPEND state when the event for which it has been waiting occurs. Note that this requires that the state information concerning suspended processes be accessible to the OS.
• READY SUSPEND → READY When there are no READY processes in main memory, the OS will need to bring one in to continue execution. In addition, it might be the case that a process in the READY SUSPEND state has higher priority than any of the processes in the READY state. In that case, the OS designer may dictate that it is more important to get at the higher-priority process than to minimise swapping.
• READY → READY SUSPEND Normally, the OS would be designed so that the preference would be to suspend a BLOCKED process rather than a READY one. This is because the READY process can be executed as soon as the CPU becomes available for it, whereas the BLOCKED process is taking up main memory space and cannot be executed since it is waiting on some other event to occur. However, it may be necessary to suspend a READY process if that is the only way to free a sufficiently large block of main memory. Finally, the OS may choose to suspend a lower-priority READY process rather than a higher-priority BLOCKED process if it believes that the BLOCKED process will be ready soon.

Process description and control
Each process in the system is represented by a data structure called a Process Control Block (PCB), or Process Descriptor in Linux, which performs the same function as a traveller's passport. The PCB contains the basic information about the job, including:
• What it is
• Where it is going
• How much of its processing has been completed
• Where it is stored
• How much it has "spent" in using resources

Process Identification: Each process is uniquely identified by the user's identification and a pointer connecting it to its descriptor.
Process Status: This indicates the current status of the process: READY, RUNNING, BLOCKED, READY SUSPEND or BLOCKED SUSPEND.
Process State: This contains all of the information needed to indicate the current state of the job.
Accounting: This contains information used mainly for billing purposes and for performance measurement. It indicates what kind of resources the process has used and for how long.
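A PCB can be pictured as a plain record. The C sketch below shows the kind of fields described above; the field names, types and sizes are illustrative only and are not those of any real kernel (Linux's actual descriptor, struct task_struct, is far larger):

#include <stdint.h>

typedef enum { READY, RUNNING, BLOCKED, READY_SUSPEND, BLOCKED_SUSPEND } proc_status;

struct pcb {
    int          pid;              /* process identification */
    int          owner_uid;        /* identification of the user who created it */
    proc_status  status;           /* READY, RUNNING, BLOCKED, ... */

    /* Process state: enough to stop the job and resume it later. */
    uint64_t     program_counter;
    uint64_t     registers[16];    /* saved general-purpose registers (count is illustrative) */
    void        *page_table;       /* where its memory is */

    /* Accounting: what the process has "spent" so far. */
    uint64_t     cpu_time_used;    /* e.g. in microseconds */
    uint64_t     io_bytes;

    struct pcb  *next;             /* link for the READY or BLOCKED queue it currently sits on */
};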
Processor modes
Contemporary processors incorporate a mode bit to define the execution capability of a program on the processor. This bit can be set to kernel mode or user mode. Kernel mode is also commonly referred to as supervisor mode, monitor mode or ring 0. In kernel mode, the processor can execute every instruction in its hardware repertoire, whereas in user mode it can only execute a subset of the instructions. Instructions that can be executed only in kernel mode are called kernel, privileged or protected instructions, to distinguish them from the user-mode instructions. For example, I/O instructions are privileged. So, if an application program executes in user mode, it cannot perform its own I/O; instead, it must request the OS to perform I/O on its behalf.

The system may logically extend the mode bit to define areas of memory to be used when the processor is in kernel mode versus user mode. If the mode bit is set to kernel mode, the process executing on the processor can access either the kernel or the user partition of the memory. However, if user mode is set, the process can reference only the user memory space. We frequently refer to two classes of memory: user space and system space (or kernel, supervisor or protected space). In general, the mode bit extends the operating system's protection rights.

The mode bit is set by the user-mode trap instruction, also called a supervisor call instruction. This instruction sets the mode bit and branches to a fixed location in the system space. Since only system code is loaded in the system space, only system code can be invoked via a trap. When the OS has completed the supervisor call, it resets the mode bit to user mode prior to the return.

The kernel concept
The parts of the OS critical to its correct operation execute in kernel mode, while other software (such as generic system software) and all application programs execute in user mode. This fundamental distinction is usually the irrefutable distinction between the operating system and other system software. The part of the system executing in kernel (supervisor) state is called the kernel, or nucleus, of the operating system. The kernel operates as trusted software, meaning that when it was designed and implemented, it was intended to implement protection mechanisms that could not be covertly changed through the actions of untrusted software executing in user space. Extensions to the OS execute in user mode, so the OS does not rely on the correctness of those parts of the system software for correct operation of the OS. Hence, a fundamental design decision for any function to be incorporated into the OS is whether it needs to be implemented in the kernel. If it is implemented in the kernel, it will execute in kernel (supervisor) space and have access to other parts of the kernel. It will also be trusted software by the other parts of the kernel. If the function is implemented to execute in user mode, it will have no access to kernel data structures. However, the advantage is that it will normally require very limited effort to invoke the function. While kernel-implemented functions may be easy to implement, the trap mechanism and authentication at the time of the call are usually relatively expensive. The kernel code runs fast, but there is a large performance overhead in the actual call. This is a subtle, but important, point.
Requesting system services
There are two techniques by which a program executing in user mode can request the kernel's services:
• System call
• Message passing
Operating systems are designed with one or the other of these two facilities, but not both. First, assume that a user process wishes to invoke a particular target system function. For the system call approach, the user process uses the trap instruction. The idea is that the system call should appear to be an ordinary procedure call to the application program; the OS provides a library of user functions with names corresponding to each actual system call. Each of these stub functions contains a trap to the OS function. When the application program calls the stub, it executes the trap instruction, which switches the CPU to kernel mode and then branches (indirectly, through an OS table) to the entry point of the function which is to be invoked. When the function completes, it switches the processor to user mode and then returns control to the user process, thus simulating a normal procedure return.

In the message passing approach, the user process constructs a message that describes the desired service. Then it uses a trusted send function to pass the message to a trusted OS process. The send function serves the same purpose as the trap; that is, it carefully checks the message, switches the processor to kernel mode, and then delivers the message to a process that implements the target functions. Meanwhile, the user process waits for the result of the service request with a message receive operation. When the OS process completes the operation, it sends a message back to the user process.

The distinction between the two approaches has important consequences regarding the relative independence of the OS behavior from the application process behavior, and the resulting performance. As a rule of thumb, operating systems based on a system call interface can be made more efficient than those requiring messages to be exchanged between distinct processes. This is the case even though the system call must be implemented with a trap instruction; that is, even though the trap is relatively expensive to perform, it is more efficient than the message passing approach, where there are generally higher costs associated with process multiplexing, message formation and message copying.

The system call approach has the interesting property that there is not necessarily any OS process. Instead, a process executing in user mode changes to kernel mode when it is executing kernel code, and switches back to user mode when it returns from the OS call. If, on the other hand, the OS is designed as a set of separate processes, it is usually easier to design it so that it gets control of the machine in special situations than if the kernel is simply a collection of functions executed by user processes in kernel mode. Even procedure-based operating systems usually find it necessary to include at least a few system processes (called daemons in UNIX) to handle situations where the machine is otherwise idle, such as scheduling and handling the network.
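The stub idea can be sketched in C as follows. Everything here is hypothetical: trap_to_kernel() stands in for whatever trap or supervisor-call instruction the processor provides (on a real system it is a few lines of architecture-specific assembly), and the system-call numbers are invented, not a real ABI:

/* Hypothetical system-call numbers; a real OS publishes its own table. */
enum { SYS_READ_FILE = 3, SYS_WRITE_FILE = 4 };

/* Assumed primitive: executes the trap instruction, switching the CPU to kernel
 * mode and branching indirectly through the OS's system-call table. */
extern long trap_to_kernel(int number, long a0, long a1, long a2);

/* User-mode stub: to the application it looks like an ordinary procedure call. */
long read_file(int fd, void *buf, long nbytes)
{
    return trap_to_kernel(SYS_READ_FILE, fd, (long)buf, nbytes);
}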
Memory Management
The memory management subsystem is one of the most important parts of the operating system. Since the early days of computing, there has been a need for more memory than exists physically in a system. Strategies have been developed to overcome this limitation, and the most successful of these is virtual memory. Virtual memory makes the system appear to have more memory than it actually has by sharing it between competing processes as they need it.

Virtual memory does more than just make your computer's memory go further. The memory management subsystem provides:

Large Address Spaces
The operating system makes the system appear as if it has a larger amount of memory than it actually has. The virtual memory can be many times larger than the physical memory in the system.

Protection
Each process in the system has its own virtual address space. These virtual address spaces are completely separate from each other, and so a process running one application cannot affect another. Also, the hardware virtual memory mechanisms allow areas of memory to be protected against writing. This protects code and data from being overwritten by rogue applications.

Memory Mapping
Memory mapping is used to map image and data files into a process's address space. In memory mapping, the contents of a file are linked directly into the virtual address space of a process.

Fair Physical Memory Allocation
The memory management subsystem allows each running process in the system a fair share of the physical memory of the system.

Shared Virtual Memory
Although virtual memory allows processes to have separate (virtual) address spaces, there are times when you need processes to share memory. For example, there could be several processes in the system running the bash command shell. Rather than have several copies of bash, one in each process's virtual address space, it is better to have only one copy in physical memory and have all of the processes running bash share it. Dynamic libraries are another common example of executing code shared between several processes. Shared memory can also be used as an Inter-Process Communication (IPC) mechanism, with two or more processes exchanging information via memory common to all of them. Linux supports the UNIX System V shared memory IPC.
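As a concrete, if minimal, example of shared memory used as an IPC mechanism, the sketch below uses the System V shared-memory calls mentioned above to create a segment, attach it and write to it. Error handling is omitted, and the key and segment size are illustrative:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Create (or find) a 4 KB shared segment; a second process using the
     * same key would attach the very same physical pages. */
    key_t key = ftok("/tmp", 'S');                 /* illustrative key */
    int shmid = shmget(key, 4096, IPC_CREAT | 0666);
    char *mem = shmat(shmid, NULL, 0);             /* map it into this process's address space */

    strcpy(mem, "hello from process A");           /* any attached process can now read this */
    printf("%s\n", mem);

    shmdt(mem);                                    /* detach from this address space */
    shmctl(shmid, IPC_RMID, NULL);                 /* remove the segment when no longer needed */
    return 0;
}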
An Abstract Model of Virtual Memory
[Figure: abstract model of virtual-to-physical address mapping]

Before considering the methods that Linux uses to support virtual memory, it is useful to consider an abstract model that is not cluttered by too much detail.

As the processor executes a program it reads an instruction from memory and decodes it. In decoding the instruction it may need to fetch or store the contents of a location in memory. The processor then executes the instruction and moves on to the next instruction in the program. In this way the processor is always accessing memory, either to fetch instructions or to fetch and store data.

In a virtual memory system all of these addresses are virtual addresses and not physical addresses. These virtual addresses are converted into physical addresses by the processor, based on information held in a set of tables maintained by the operating system.

To make this translation easier, virtual and physical memory are divided into handy-sized chunks called pages. These pages are all the same size; they need not be, but if they were not, the system would be very hard to administer. Linux on Alpha AXP systems uses 8 Kbyte pages, and on Intel x86 systems it uses 4 Kbyte pages. Each of these pages is given a unique number: the page frame number (PFN).

In this paged model, a virtual address is composed of two parts: an offset and a virtual page frame number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above are the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do this the processor uses page tables.

The figure shows the virtual address spaces of two processes, process X and process Y, each with its own page tables. These page tables map each process's virtual pages into physical pages in memory. It shows that process X's virtual page frame number 0 is mapped into memory in physical page frame number 1, and that process Y's virtual page frame number 1 is mapped into physical page frame number 4.
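Because the page size is a power of two, splitting a virtual address into its virtual page frame number and offset is just masking and shifting. A small sketch in C, assuming the 4 Kbyte page size mentioned above for Intel x86 (the sample address is arbitrary):

#include <stdio.h>

#define PAGE_SHIFT 12                      /* 4 Kbyte pages: 2^12 bytes */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
    unsigned long vaddr  = 0x12345;                     /* illustrative virtual address */
    unsigned long vpfn   = vaddr >> PAGE_SHIFT;         /* virtual page frame number: bits 12 and above */
    unsigned long offset = vaddr & (PAGE_SIZE - 1);     /* offset within the page: bits 11:0 */

    printf("vaddr 0x%lx -> virtual page frame %lu, offset 0x%lx\n", vaddr, vpfn, offset);
    return 0;
}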
Each entry in the theoretical page table contains the following information:
• A valid flag. This indicates whether this page table entry is valid.
• The physical page frame number that this entry is describing.
• Access control information. This describes how the page may be used. Can it be written to? Does it contain executable code?

The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5 would be the 6th element of the table (0 is the first element).

To translate a virtual address into a physical one, the processor must first work out the virtual address's page frame number and the offset within that virtual page. By making the page size a power of 2 this can easily be done by masking and shifting. Looking again at the figure, and assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual address space, the processor would translate that address into offset 0x194 within virtual page frame number 1.

The processor uses the virtual page frame number as an index into the process's page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must pass control to the operating system so that it can fix things up.

Just how the processor notifies the operating system that the current process has attempted to access a virtual address for which there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault, and the operating system is notified of the faulting virtual address and the reason for the page fault.

Assuming that this is a valid page table entry, the processor takes that physical page frame number and multiplies it by the page size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to the instruction or data that it needs. Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4, which starts at 0x8000 (4 x 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194.

By mapping virtual to physical addresses this way, the virtual memory can be mapped into the system's physical pages in any order. For example, process X's virtual page frame number 0 is mapped to physical page frame number 1, whereas virtual page frame number 7 is mapped to physical page frame number 0, even though it is higher in virtual memory than virtual page frame number 0. This demonstrates an interesting byproduct of virtual memory: the pages of virtual memory do not have to be present in physical memory in any particular order.
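The worked example above (process Y, 0x2000-byte pages, virtual address 0x2194 mapping to physical address 0x8194) can be followed step by step in code. The page table contents mirror the mapping described in the text; everything else is an illustration:

#include <stdio.h>

#define PAGE_SHIFT 13                       /* 0x2000-byte (8 Kbyte) pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Process Y's page table from the example: virtual page frame 1 -> physical page frame 4.
 * -1 marks an invalid entry (no translation: the OS would handle a page fault). */
static long page_table_y[] = { -1, 4, -1, -1, -1, -1, -1, -1 };

static long translate(unsigned long vaddr)
{
    unsigned long vpfn   = vaddr >> PAGE_SHIFT;
    unsigned long offset = vaddr & (PAGE_SIZE - 1);

    if (page_table_y[vpfn] < 0)
        return -1;                                          /* invalid entry: page fault */
    return (page_table_y[vpfn] << PAGE_SHIFT) | offset;     /* base of the physical page plus offset */
}

int main(void)
{
    printf("0x2194 -> 0x%lx\n", (unsigned long)translate(0x2194));   /* prints 0x8194 */
    return 0;
}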
Demand Paging
As there is much less physical memory than virtual memory, the operating system must be careful that it does not use the physical memory inefficiently. One way to save physical memory is to load only the virtual pages that are currently being used by the executing program. For example, a database program may be run to query a database. In this case not all of the database needs to be loaded into memory, just those data records that are being examined. If the database query is a search query, then it does not make sense to load the code from the database program that deals with adding new records. This technique of only loading virtual pages into memory as they are accessed is known as demand paging.

When a process attempts to access a virtual address that is not currently in memory, the processor cannot find a page table entry for the virtual page referenced. For example, if there is no entry in process X's page table for virtual page frame number 2, then if process X attempts to read from an address within virtual page frame number 2, the processor cannot translate the address into a physical one. At this point the processor notifies the operating system that a page fault has occurred.

If the faulting virtual address is invalid, this means that the process has attempted to access a virtual address that it should not have. Maybe the application has gone wrong in some way, for example writing to random addresses in memory. In this case the operating system will terminate it, protecting the other processes in the system from this rogue process.

If the faulting virtual address was valid but the page that it refers to is not currently in memory, the operating system must bring the appropriate page into memory from the image on disk. Disk access takes a long time, relatively speaking, and so the process must wait quite a while until the page has been fetched. If there are other processes that could run, then the operating system will select one of them to run. The fetched page is written into a free physical page frame and an entry for the virtual page frame number is added to the process's page table. The process is then restarted at the machine instruction where the memory fault occurred. This time the virtual memory access is made, the processor can make the virtual-to-physical address translation, and so the process continues to run.

Linux uses demand paging to load executable images into a process's virtual memory. Whenever a command is executed, the file containing it is opened and its contents are mapped into the process's virtual memory. This is done by modifying the data structures describing this process's memory map and is known as memory mapping. However, only the first part of the image is actually brought into physical memory. The rest of the image is left on disk. As the image executes, it generates page faults and Linux uses the process's memory map in order to determine which parts of the image to bring into memory for execution.

Swapping
If a process needs to bring a virtual page into physical memory and there are no free physical pages available, the operating system must make room for this page by discarding another page from physical memory. If the page to be discarded from physical memory came from an image or data file and has not been written to, then the page does not need to be saved. Instead it can be discarded, and if the process needs that page again it can be brought back into memory from the image or data file. However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at a later time. This type of page is known as a dirty page, and when it is removed from memory it is saved in a special sort of file called the swap file. Accesses to the swap file are very slow relative to the speed of the processor and physical memory, and the operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again.
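Putting demand paging and swapping together, the page-fault path just described can be summarised in C-flavoured pseudocode. Every helper here (find_free_frame, choose_victim_frame, is_dirty and so on) is an assumed name standing in for real kernel machinery, declared only so the sketch is self-contained:

struct process;                                   /* opaque; its details do not matter for the sketch */

/* Assumed helpers: */
int  address_is_valid(struct process *p, unsigned long vaddr);
void terminate_process(struct process *p);
long find_free_frame(void);
long choose_victim_frame(void);
int  is_dirty(long frame);
void write_to_swap_file(long frame);
void evict(long frame);
void load_page_from_disk(struct process *p, unsigned long vaddr, long frame);
void update_page_table(struct process *p, unsigned long vaddr, long frame);
void restart_faulting_instruction(struct process *p);

void handle_page_fault(struct process *p, unsigned long vaddr)
{
    if (!address_is_valid(p, vaddr)) {            /* fault on an address the process should not touch */
        terminate_process(p);                     /* protect the rest of the system from the rogue process */
        return;
    }

    long frame = find_free_frame();
    if (frame < 0) {                              /* no free physical page: make room */
        frame = choose_victim_frame();            /* e.g. the oldest page under LRU aging */
        if (is_dirty(frame))
            write_to_swap_file(frame);            /* modified pages must be preserved */
        evict(frame);
    }

    load_page_from_disk(p, vaddr, frame);         /* from the program image or the swap file */
    update_page_table(p, vaddr, frame);           /* add a valid entry for the faulting virtual page */
    restart_faulting_instruction(p);              /* the access is retried and now succeeds */
}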
If the algorithm used to decide which pages to discard or swap (the swap algorithm) is not efficient, then a condition known as thrashing occurs. In this case, pages are constantly being written to disk and then read back, and the operating system is too busy to allow much real work to be performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly accessed then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently using is called the working set. An efficient swap scheme would make sure that all processes have their working set in physical memory.

Linux uses a Least Recently Used (LRU) page aging technique to choose fairly the pages which might be removed from the system. This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is accessed, the older and more stale it becomes. Old pages are good candidates for swapping.

Shared Virtual Memory
Virtual memory makes it easy for several processes to share memory. All memory accesses are made via page tables and each process has its own separate page table. For two processes sharing a physical page of memory, its physical page frame number must appear in a page table entry in both of their page tables. The figure shows two processes that each share physical page frame number 4. For process X this is virtual page frame number 4 whereas for process Y this is virtual page frame number 6. This illustrates an interesting point about sharing pages: the shared physical page does not have to exist at the same place in virtual memory for any or all of the processes sharing it.

Physical and Virtual Addressing Modes
It does not make much sense for the operating system itself to run in virtual memory. This would be a nightmare situation where the operating system must maintain page tables for itself. Most multi-purpose processors support the notion of a physical address mode as well as a virtual address mode. Physical addressing mode requires no page tables and the processor does not attempt to perform any address translations in this mode. The Linux kernel is linked to run in physical address space. The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides up the memory space into several areas and designates two of them as physically mapped addresses. This kernel address space is known as the KSEG address space and it encompasses all addresses upwards from 0xfffffc0000000000. In order to execute code linked in KSEG (by definition, kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on Alpha is linked to execute from address 0xfffffc0000310000.

Access Control
The page table entries also contain access control information. As the processor is already using the page table entry to map a process's virtual address to a physical one, it can easily use the access control information to check that the process is not accessing memory in a way that it should not. There are many reasons why you would want to restrict access to areas of memory. Some memory, such as that containing executable code, is naturally read only memory; the operating system should
not allow a process to write data over its executable code. By contrast, pages containing data can be written to, but attempts to execute that memory as instructions should fail. Most processors have at least two modes of execution: kernel and user. You would not want kernel code to be executed by a user process, or kernel data structures to be accessible, except when the processor is running in kernel mode.

Alpha AXP Page Table Entry
The access control information is held in the PTE and is processor specific; the figure shows the PTE for Alpha AXP. The bit fields have the following meanings:
• V: Valid. If set, this PTE is valid.
• FOE: "Fault on Execute". Whenever an attempt to execute instructions in this page occurs, the processor reports a page fault and passes control to the operating system.
• FOW: "Fault on Write". As above, but the page fault occurs on an attempt to write to this page.
• FOR: "Fault on Read". As above, but the page fault occurs on an attempt to read from this page.
• ASM: Address Space Match. This is used when the operating system wishes to clear only some of the entries from the Translation Buffer.
• KRE: Code running in kernel mode can read this page.
• URE: Code running in user mode can read this page.
• GH: Granularity hint, used when mapping an entire block with a single Translation Buffer entry rather than many.
• KWE: Code running in kernel mode can write to this page.
• UWE: Code running in user mode can write to this page.
• page frame number: For PTEs with the V bit set, this field contains the physical Page Frame Number for this PTE. For invalid PTEs, if this field is not zero, it contains information about where the page is in the swap file.
The following two bits are defined and used by Linux:
• _PAGE_DIRTY: if set, the page needs to be written out to the swap file.
• _PAGE_ACCESSED: used by Linux to mark a page as having been accessed.

Caches
If you were to implement a system using the above theoretical model then it would work, but not particularly efficiently. Both operating system and processor designers try hard to extract more performance from the system. Apart from making the processors, memory and so on faster, the best
  • 20. OPERATING SYSTEM approach is to maintain caches of useful information and data that make some operations faster. Linux uses a number of memory management related caches: Buffer Cache The buffer cache contains data buffers that are used by the block device drivers. These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information that have either been read from a block device or are being written to it. A block device is one that can only be accessed by reading and writing fixed sized blocks of data. All hard disks are block devices. The buffer cache is indexed via the device identifier and the desired block number and is used to quickly find a block of data. Block devices are only ever accessed via the buffer cache. If data can be found in the buffer cache then it does not need to be read from the physical block device, for example a hard disk, and access to it is much faster. Page Cache This is used to speed up access to images and data on disk. It is used to cache the logical contents of a file a page at a time and is accessed via the file and offset within the file. As pages are read into memory from disk, they are cached in the page cache. Swap Cache Only modified (or dirty) pages are saved in the swap file. So long as these pages are not modified after they have been written to the swap file then the next time the page is swapped out there is no need to write it to the swap file as the page is already in the swap file. Instead the page can simply be discarded. In a heavily swapping system this saves many unnecessary and costly disk operations. Hardware Caches One commonly implemented hardware cache is in the processor; a cache of Page Table Entries. In this case, the processor does not always read the page table directly but instead caches translations for pages as it needs them. These are the Translation Look-aside Buffers and contain cached copies of the page table entries from one or more processes in the system. When the reference to the virtual address is made, the processor will attempt to find a matching TLB entry. If it finds one, it can directly translate the virtual address into a physical one and perform the correct operation on the data. If the processor cannot find a matching TLB entry then it must get the operating system to help. It does this by signalling the operating system that a TLB miss has occurred. A system specific mechanism is used to deliver that exception to the operating system code that can fix things up. The operating system generates a new TLB entry for the address mapping. When the exception has been 20 K. ADISESHA LECTURER, PRESIDENCY COLLEGE
  • 21. OPERATING SYSTEM cleared, the processor will make another attempt to translate the virtual address. This time it will work because there is now a valid entry in the TLB for that address. The drawback of using caches, hardware or otherwise, is that in order to save effort Linux must use more time and space maintaining these caches and, if the caches become corrupted, the system will crash. Linux Page Tables Three Level Page Tables Linux assumes that there are three levels of page tables. Each Page Table accessed contains the page frame number of the next level of Page Table. Figure shows how a virtual address can be broken into a number of fields; each field providing an offset into a particular Page Table. To translate a virtual address into a physical one, the processor must take the contents of each level field, convert it into an offset into the physical page containing the Page Table and read the page frame number of the next level of Page Table. This is repeated three times until the page frame number of the physical page containing the virtual address is found. Now the final field in the virtual address, the byte offset, is used to find the data inside the page. Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular process. This way, the kernel does not need to know the format of the page table entries or how they are arranged. This is so successful that Linux uses the same page table manipulation code for the Alpha processor, which has three levels of page tables, and for Intel x86 processors, which have two levels of page tables. Page Allocation and Deallocation There are many demands on the physical pages in the system. For example, when an image is loaded into memory the operating system needs to allocate pages. These will be freed when the image has finished executing and is unloaded. Another use for physical pages is to hold kernel specific data structures such as the page tables themselves. The mechanisms and data structures used for page allocation and deallocation are perhaps the most critical in maintaining the efficiency of the virtual memory subsystem. 21 K. ADISESHA LECTURER, PRESIDENCY COLLEGE
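Since the text above describes the three-level walk only abstractly, the following sketch shows one way the repeated look-up could be expressed. It is illustrative only: the 10-bit index fields, the toy tables array and the helper names are assumptions, not the kernel's real per-platform macros or data structures.

    /* Illustrative field widths: three 10-bit table indices plus a 13-bit
     * byte offset.  Real widths are per-architecture; this is only a sketch. */
    #define OFFSET_BITS 13
    #define INDEX_BITS  10
    #define INDEX_MASK  ((1UL << INDEX_BITS) - 1)

    /* Toy "physical memory" holding the page tables so the sketch is
     * self-contained; each entry is the page frame number of the next level. */
    static unsigned long tables[16][1UL << INDEX_BITS];

    static unsigned long read_entry(unsigned long table_pfn, unsigned long index)
    {
        return tables[table_pfn][index];
    }

    /* Walk the three levels and return the physical address. */
    unsigned long translate(unsigned long top_pfn, unsigned long vaddr)
    {
        unsigned long off  = vaddr & ((1UL << OFFSET_BITS) - 1);
        unsigned long idx1 = (vaddr >> (OFFSET_BITS + 2 * INDEX_BITS)) & INDEX_MASK;
        unsigned long idx2 = (vaddr >> (OFFSET_BITS + INDEX_BITS)) & INDEX_MASK;
        unsigned long idx3 = (vaddr >> OFFSET_BITS) & INDEX_MASK;

        unsigned long mid_pfn  = read_entry(top_pfn, idx1);   /* level 1 */
        unsigned long low_pfn  = read_entry(mid_pfn, idx2);   /* level 2 */
        unsigned long page_pfn = read_entry(low_pfn, idx3);   /* level 3 */

        return (page_pfn << OFFSET_BITS) | off;
    }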
  • 22. OPERATING SYSTEM All of the physical pages in the system are described by the mem_map data structure which is a list of mem_map_t structures which is initialized at boot time. Each mem_map_t describes a single physical page in the system. Important fields (so far as memory management is concerned) are: count This is a count of the number of users of this page. The count is greater than one when the page is shared between many processes, age This field describes the age of the page and is used to decide if the page is a good candidate for discarding or swapping, map_nr This is the physical page frame number that this mem_map_t describes. The free_area vector is used by the page allocation code to find and free pages. The whole buffer management scheme is supported by this mechanism and so far as the code is concerned, the size of the page and physical paging mechanisms used by the processor are irrelevant. Each element of free_area contains information about blocks of pages. The first element in the array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages and so on upwards in powers of two. The list element is used as a queue head and has pointers to the page data structures in the mem_map array. Free blocks of pages are queued here. map is a pointer to a bitmap which keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the Nth block of pages is free. Figure free-area-figure shows the free_area structure. Element 0 has one free page (page frame number 0) and element 2 has 2 free blocks of 4 pages, the first starting at page frame number 4 and the second at page frame number 56. Page Allocation Linux uses the Buddy algorithm 2 to effectively allocate and deallocate blocks of pages. The page allocation code attempts to allocate a block of one or more physical pages. Pages are allocated in blocks which are powers of 2 in size. That means that it can allocate a block 1 page, 2 pages, 4 pages and so on. So long as there are enough free pages in the system to grant this request (nr_free_pages > min_free_pages) the allocation code will search the free_area for a block of pages of the size requested. Each element of the free_area has a map of the allocated and free blocks of pages for that sized block. For example, element 2 of the array has a memory map that describes free and allocated blocks each of 4 pages long. The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain of free pages that is queued on the list element of the free_area data structure. If no blocks of pages of the requested size are free, blocks of the next size (which is twice that of the size requested) are looked for. This process continues until all of the free_area has been searched or until a block of pages has been found. If the block of pages found is larger than that requested it must be broken down until there is a block of the right size. Because the blocks are each a power of 2 pages big then 22 K. ADISESHA LECTURER, PRESIDENCY COLLEGE
this breaking down process is easy: you simply break the blocks in half. The free blocks are queued on the appropriate queue and the allocated block of pages is returned to the caller. For example, in the figure, if a block of 2 pages was requested, the first block of 4 pages (starting at page frame number 4) would be broken into two 2-page blocks. The first, starting at page frame number 4, would be returned to the caller as the allocated pages and the second block, starting at page frame number 6, would be queued as a free block of 2 pages onto element 1 of the free_area array.

Page Deallocation
Allocating blocks of pages tends to fragment memory, with larger blocks of free pages being broken down into smaller ones. The page deallocation code recombines pages into larger blocks of free pages whenever it can. In fact the page block size is important as it allows for easy combination of blocks into larger blocks. Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of pages of the next block size. Each time two blocks of pages are recombined into a bigger block of free pages, the page deallocation code attempts to recombine that block into a yet larger one. In this way the blocks of free pages are as large as memory usage will allow. For example, in the figure, if page frame number 1 were to be freed, then it would be combined with the already free page frame number 0 and queued onto element 1 of the free_area as a free block of size 2 pages. A minimal sketch of this allocate-and-split, free-and-coalesce scheme is given below.
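The sketch below is a simplified stand-in for the real free_area code: the free_map bitmaps, the MAX_ORDER value and the function names are hypothetical, and the real allocator keeps free blocks on queues rather than scanning bitmaps. It does, however, implement the split-on-allocation and coalesce-on-free behaviour described above.

    #include <stdbool.h>

    #define MAX_ORDER 6                     /* block sizes 1, 2, 4, ... 64 pages */
    #define NPAGES    (1u << MAX_ORDER)

    /* free_map[order][n] is true when the block of (1 << order) pages starting
     * at page frame (n << order) is free; a toy stand-in for free_area.        */
    static bool free_map[MAX_ORDER + 1][NPAGES];

    /* Allocate a block of (1 << order) pages, splitting a larger block if
     * necessary.  Returns the first page frame number, or -1 if none is free. */
    static int buddy_alloc(int order)
    {
        for (int o = order; o <= MAX_ORDER; o++) {
            for (unsigned n = 0; n < (NPAGES >> o); n++) {
                if (!free_map[o][n])
                    continue;
                free_map[o][n] = false;
                unsigned pfn = n << o;
                while (o > order) {              /* split: free the upper half */
                    o--;
                    free_map[o][(pfn >> o) + 1] = true;
                }
                return (int)pfn;
            }
        }
        return -1;
    }

    /* Free a block and keep merging it with its buddy while the buddy is free. */
    static void buddy_free(unsigned pfn, int order)
    {
        while (order < MAX_ORDER) {
            unsigned idx   = pfn >> order;
            unsigned buddy = idx ^ 1u;           /* adjacent block of same size */
            if (!free_map[order][buddy])
                break;
            free_map[order][buddy] = false;      /* merge and move up one order */
            pfn = (idx & ~1u) << order;
            order++;
        }
        free_map[order][pfn >> order] = true;
    }

Given a free 4-page block at page frame 4, buddy_alloc(1) returns frame 4 and leaves a free 2-page block at frame 6; freeing frame 1 at order 0 while frame 0 is also free coalesces them into a 2-page block at frame 0, matching the two examples in the text.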
Memory Mapping
When an image is executed, the contents of the executable image must be brought into the process's virtual address space. The same is also true of any shared libraries that the executable image has been linked to use. The executable file is not actually brought into physical memory; instead it is merely linked into the process's virtual memory. Then, as the parts of the program are referenced by the running application, the image is brought into memory from the executable image. This linking of an image into a process's virtual address space is known as memory mapping.

Areas of Virtual Memory
Every process's virtual memory is represented by an mm_struct data structure. This contains information about the image that it is currently executing (for example bash) and also has pointers to a number of vm_area_struct data structures. Each vm_area_struct data structure describes the start and end of the area of virtual memory, the process's access rights to that memory and a set of operations for that memory. These operations are a set of routines that Linux must use when manipulating this area of virtual memory. For example, one of the virtual memory operations performs the correct actions when the process has attempted to access this virtual memory but finds (via a page fault) that the memory is not actually in physical memory. This operation is the nopage operation. The nopage operation is used when Linux demand pages the pages of an executable image into memory. When an executable image is mapped into a process's virtual address space, a set of vm_area_struct data structures is generated. Each vm_area_struct data structure represents a part of the executable image: the executable code, initialized data (variables), uninitialized data and so on. Linux supports a number of standard virtual memory operations and, as the vm_area_struct data structures are created, the correct set of virtual memory operations is associated with them. (A pared-down sketch of these data structures is given at the end of this page.)

Demand Paging
Once an executable image has been memory mapped into a process's virtual memory it can start to execute. As only the very start of the image is physically pulled into memory, it will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor will report a page fault to Linux.
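As a rough picture of the relationships just described, the sketch below shows an mm-like structure pointing at a chain of area descriptors, each carrying a table of operations that includes a nopage-style handler. The struct and field names are illustrative only; the real mm_struct and vm_area_struct contain many more fields.

    /* A pared-down, hypothetical picture of the idea behind mm_struct and
     * vm_area_struct; the real structures contain many more fields.           */
    struct page;                               /* opaque physical page          */
    struct vm_area;

    struct vm_operations {
        /* Called when the process touches an address in this area that is not
         * yet in physical memory (a nopage-style handler); returns the page
         * that should be mapped in.                                            */
        struct page *(*nopage)(struct vm_area *area, unsigned long address);
    };

    struct vm_area {
        unsigned long start, end;              /* virtual address range         */
        unsigned int  prot;                    /* read/write/execute rights     */
        const struct vm_operations *ops;       /* NULL means use default fixup  */
        struct vm_area *next;                  /* next area of this process     */
    };

    struct mm {
        struct vm_area *areas;                 /* all areas of one process      */
    };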
The page fault describes the virtual address where the page fault occurred and the type of memory access that caused it. Linux must find the vm_area_struct that represents the area of memory that the page fault occurred in. As searching through the vm_area_struct data structures is critical to the efficient handling of page faults, these are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If there is no vm_area_struct data structure for this faulting virtual address, this process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process does not have a handler for that signal it will be terminated.

Linux next checks the type of page fault that occurred against the types of accesses allowed for this area of virtual memory. If the process is accessing the memory in an illegal way, say writing to an area that it is only allowed to read from, it is also signalled with a memory error.

Now that Linux has determined that the page fault is legal, it must deal with it. Linux must differentiate between pages that are in the swap file and those that are part of an executable image on a disk somewhere. It does this by using the page table entry for this faulting virtual address. If the page's page table entry is invalid but not empty, the page fault is for a page currently being held in the swap file. For Alpha AXP page table entries, these are entries which do not have their valid bit set but which have a non-zero value in their PFN field. In this case the PFN field holds information about where in the swap file (and which swap file) the page is being held. How pages in the swap file are handled is described later in this chapter.

Not all vm_area_struct data structures have a set of virtual memory operations, and even those that do may not have a nopage operation. This is because, by default, Linux will fix up the access by allocating a new physical page and creating a valid page table entry for it. If there is a nopage operation for this area of virtual memory, Linux will use it. The generic Linux nopage operation is used for memory mapped executable images and it uses the page cache to bring the required image page into physical memory. However the required page is brought into physical memory, the process's page tables are updated. It may be necessary for hardware specific actions to update those entries, particularly if the processor uses translation look-aside buffers. Now that the page fault has been handled it can be dismissed and the process is restarted at the instruction that made the faulting virtual memory access.

The Linux Page Cache
The role of the Linux page cache is to speed up access to files on disk. Memory mapped files are read a page at a time and these pages are stored in the page cache. Figure 3.6 shows that the page cache consists of the page_hash_table, a vector of pointers to mem_map_t data structures. Each file in Linux is identified by a VFS inode data structure (described in the filesystem chapter) and each VFS inode is unique and fully describes one and only one file. The index into the page_hash_table is derived from the file's VFS inode and the offset into the file.

Whenever a page is read from a memory mapped file, for example when it needs to be brought back into memory during demand paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault handling code. Otherwise the page must be brought into memory from the file system that holds the image. Linux allocates a physical page and reads the page from the file on disk. If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead means that if the process is accessing the pages in the file serially, the next page will be waiting in memory for the process.

Over time the page cache grows as images are read and executed. Pages will be removed from the cache as they are no longer needed, say as an image is no longer being used by any process. As Linux uses memory it can start to run low on physical pages. In this case Linux will reduce the size of the page cache.

Swapping Out and Discarding Pages
When physical memory becomes scarce the Linux memory management subsystem must attempt to free physical pages. This task falls to the kernel swap daemon (kswapd). The kernel swap daemon is a special type of process, a kernel thread. Kernel threads are processes that have no virtual memory; instead they run in kernel mode in the physical address space. The kernel swap daemon is slightly misnamed in that it does more than merely swap pages out to the system's swap files. Its role is to make sure that there are enough free pages in the system to keep the memory management system operating efficiently.

The kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire. Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. It uses two variables, free_pages_high and free_pages_low, to decide if it should free some pages. So long as the number of free pages in the system remains above free_pages_high, the kernel swap daemon does nothing; it sleeps again until its timer next expires. For the purposes of this check the kernel swap daemon takes into account the number of pages currently being written out to the swap file. It keeps a count of these in nr_async_pages; this is incremented each time a page is queued waiting to be written out to the swap file and decremented when the write to the swap device has completed. free_pages_low and free_pages_high are set at system startup time and are related to the number of physical pages in the system. If the number of free pages in the system has fallen below free_pages_high or, worse still, free_pages_low, the kernel swap daemon will try three ways to reduce the number of physical pages being used by the system:
1. Reducing the size of the buffer and page caches,
2. Swapping out System V shared memory pages,
3. Swapping out and discarding pages.
A minimal sketch of this periodic check follows.
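The variable names below mirror those used in the text, but the three functions standing in for the freeing methods are hypothetical, not real kernel interfaces; this is only a sketch of the decision described above.

    /* Hypothetical counters mirroring the variables named in the text. */
    static unsigned long nr_free_pages, nr_async_pages;
    static unsigned long free_pages_low, free_pages_high;  /* low: try harder */

    /* Stand-ins for the three ways of freeing pages; each returns how many
     * pages it managed to free.  These are not real kernel interfaces.       */
    extern int shrink_caches(void);         /* 1: page and buffer caches      */
    extern int swap_out_shm(void);          /* 2: System V shared memory      */
    extern int swap_out_processes(void);    /* 3: swap out and discard pages  */

    /* One periodic pass of a kswapd-like daemon: pages already queued for
     * write-out (nr_async_pages) count towards the free total.               */
    static void kswapd_one_pass(int pages_wanted)
    {
        if (nr_free_pages + nr_async_pages >= free_pages_high)
            return;                         /* enough free pages; sleep again */

        int freed = shrink_caches();        /* the methods are tried in turn  */
        if (freed < pages_wanted)
            freed += swap_out_shm();
        if (freed < pages_wanted)
            freed += swap_out_processes();  /* until enough pages are freed   */
    }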
If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon will try to free 6 pages before it next runs. Otherwise it will try to free 3 pages. Each of the above methods is tried in turn until enough pages have been freed. The kernel swap daemon remembers which method it was using the last time that it attempted to free physical pages. Each time it runs it will start trying to free pages using this last successful method. After it has freed sufficient pages, the swap daemon sleeps again until its timer expires. If the reason that the kernel swap daemon freed pages was that the number of free pages in the system had fallen below free_pages_low, it only sleeps for half its usual time. Once the number of free pages is more than free_pages_low, the kernel swap daemon goes back to sleeping longer between checks.

Reducing the Size of the Page and Buffer Caches
The pages held in the page and buffer caches are good candidates for being freed into the free_area vector. The Page Cache, which contains pages of memory mapped files, may contain unnecessary pages that are filling up the system's memory. Likewise the Buffer Cache, which contains buffers read from or being written to physical devices, may also contain unneeded buffers. When the physical pages in the system start to run out, discarding pages from these caches is relatively easy as it requires no writing to physical devices (unlike swapping pages out of memory). Discarding these pages does not have too many harmful side effects other than making access to physical devices and memory mapped files slower. However, if the discarding of pages from these caches is done fairly, all processes will suffer equally.

Every time the kernel swap daemon tries to shrink these caches it examines a block of pages in the mem_map page vector to see if any can be discarded from physical memory. The size of the block of pages examined is higher if the kernel swap daemon is intensively swapping; that is, if the number of free pages in the system has fallen dangerously low. The blocks of pages are examined in a cyclical manner; a different block of pages is examined each time an attempt is made to shrink the memory map. This is known as the clock algorithm as, rather like the minute hand of a clock, the whole mem_map page vector is examined a few pages at a time. Each page being examined is checked to see if it is cached in either the page cache or the buffer cache. You should note that shared pages are not considered for discarding at this time and that a page cannot be in both caches at the same time. If the page is not in either cache then the next page in the mem_map page vector is examined.

Pages are cached in the buffer cache (or rather the buffers within the pages are cached) to make buffer allocation and deallocation more efficient. The memory map shrinking code tries to free the buffers that are contained within the page being examined. If all the buffers are freed, then the pages that contain them are also freed. If the examined page is in the Linux page cache, it is removed from the page cache and freed. When enough pages have been freed on this attempt, the kernel swap daemon will wait until the next time it is periodically woken. As none of the freed pages were part of any process's virtual memory (they were cached pages), no page tables need updating. If there were not enough cached pages discarded then the swap daemon will try to swap out some shared pages. A minimal sketch of this clock-style scan follows.
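The page_desc structure and the two try_free_* helpers below are stand-ins for the real mem_map_t bookkeeping and cache-freeing code; only the cyclic scan, the skipping of shared pages and the ordering of the checks are taken from the text.

    #include <stdbool.h>
    #include <stddef.h>

    #define NR_PAGES 32768

    /* Hypothetical per-page flags standing in for mem_map_t bookkeeping. */
    struct page_desc {
        bool in_page_cache;
        bool in_buffer_cache;
        bool shared;                  /* shared pages are skipped by the scan */
    };

    static struct page_desc mem_map[NR_PAGES];
    static size_t clock_hand;         /* where the previous scan stopped      */

    /* Stand-ins for the real freeing routines; each returns true on success. */
    extern bool try_free_page_cache_page(struct page_desc *p);
    extern bool try_free_buffers(struct page_desc *p);

    /* Examine a block of pages, clock fashion, and try to free cached ones.  */
    static int shrink_mmap(size_t pages_to_scan)
    {
        int freed = 0;
        while (pages_to_scan--) {
            struct page_desc *p = &mem_map[clock_hand];
            clock_hand = (clock_hand + 1) % NR_PAGES;     /* wrap like a clock */
            if (p->shared)
                continue;                                 /* never discarded   */
            if (p->in_page_cache && try_free_page_cache_page(p))
                freed++;
            else if (p->in_buffer_cache && try_free_buffers(p))
                freed++;
        }
        return freed;
    }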
Swapping Out System V Shared Memory Pages
  • 28. OPERATING SYSTEM System V shared memory is an inter-process communication mechanism which allows two or more processes to share virtual memory in order to pass information amongst themselves. How processes share memory in this way is described in more detail in Chapter IPC-chapter. For now it is enough to say that each area of System V shared memory is described by a shmid_ds data structure. This contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area of virtual memory. The vm_area_struct data structures describe where in each processes virtual memory this area of System V shared memory goes. Each vm_area_struct data structure for this System V shared memory is linked together using the vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries each of which describes the physical page that a shared virtual page maps to. The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages. . Each time it runs it remembers which page of which shared virtual memory area it last swapped out. It does this by keeping two indices, the first is an index into the set of shmid_ds data structures, the second into the list of page table entries for this area of System V shared memory. This makes sure that it fairly victimizes the areas of System V shared memory. As the physical page frame number for a given virtual page of System V shared memory is contained in the page tables of all of the processes sharing this area of virtual memory, the kernel swap daemon must modify all of these page tables to show that the page is no longer in memory but is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon finds the page table entry in each of the sharing processes page tables (by following a pointer from each vm_area_struct data structure). If this processes page table entry for this page of System V shared memory is valid, it converts it into an invalid but swapped out page table entry and reduces this (shared) page's count of users by one. The format of a swapped out System V shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for this area of System V shared memory. If the page's count is zero after the page tables of the sharing processes have all been modified, the shared page can be written out to the swap file. The page table entry in the list pointed at by the shmid_ds data structure for this area of System V shared memory is replaced by a swapped out page table entry. A swapped out page table entry is invalid but contains an index into the set of open swap files and the offset in that file where the swapped out page can be found. This information will be used when the page has to be brought back into physical memory. Swapping Out and Discarding Pages The swap daemon looks at each process in the system in turn to see if it is a good candidate for swapping. Good candidates are processes that can be swapped (some cannot) and that have one or more pages which can be swapped or discarded from memory. Pages are swapped out of physical memory into the system's swap files only if the data in them cannot be retrieved another way. A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the executable instructions of an image will never be modified by the 28 K. 
image and so will never be written to the swap file. These pages can simply be discarded; when they are again referenced by the process, they will be brought back into memory from the executable image.

Once the process to swap has been located, the swap daemon looks through all of its virtual memory regions looking for areas which are not shared or locked. Linux does not swap out all of the swappable pages of the process that it has selected; instead it removes only a small number of pages. Pages cannot be swapped or discarded if they are locked in memory.

The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that gives the kernel swap daemon some idea whether or not a page is worth swapping. Pages age when they are unused and rejuvenate on access; the swap daemon only swaps out old pages. The default action when a page is first allocated is to give it an initial age of 3. Each time it is touched, its age is increased by 3, to a maximum of 20. Every time the kernel swap daemon runs it ages pages, decrementing their age by 1. These default actions can be changed and for this reason they (and other swap related information) are stored in the swap_control data structure.

If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which have been modified and so may need to be written to the swap file; Linux uses an architecture specific bit in the PTE to describe pages this way. However, not all dirty pages are necessarily written to the swap file. Every virtual memory region of a process may have its own swap operation (pointed at by the vm_ops pointer in the vm_area_struct), and in that case that method is used. Otherwise, the swap daemon will allocate a page in the swap file and write the page out to that device. The page's page table entry is replaced by one which is marked as invalid but which contains information about where the page is in the swap file. This is an offset into the swap file where the page is held and an indication of which swap file is being used. Whatever the swap method used, the original physical page is made free by putting it back into the free_area. Clean (or rather not dirty) pages can be discarded and put back into the free_area for re-use.

If enough of the selected process's swappable pages have been swapped out or discarded, the swap daemon will again sleep. The next time it wakes it will consider the next process in the system. In this way, the swap daemon nibbles away at each process's physical pages until the system is again in balance. This is much fairer than swapping out whole processes.

The Swap Cache
When swapping pages out to the swap files, Linux avoids writing pages if it does not have to. There are times when a page is both in a swap file and in physical memory. This happens when a page that was swapped out of memory was then brought back into memory when it was again accessed by a process. So long as the page in memory is not written to, the copy in the swap file remains valid.

Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one per physical page in the system. Each entry is a page table entry for a swapped out page and describes which swap file the page is being held in together with its location in the swap file. If a swap cache entry is non-zero, it represents a page which is being held in a swap file and which has not been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache. A minimal sketch of this bookkeeping follows.
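The swap-cache bookkeeping just described amounts to a per-physical-page note saying "an up-to-date copy of this page already exists in a swap file". Below is a hypothetical sketch, with a plain array standing in for the real list of page table entries.

    #include <stdbool.h>

    #define NR_PAGES 32768

    /* Hypothetical swap cache: one slot per physical page, holding either 0
     * (no entry) or a swapped-out-style page table entry that records which
     * swap file holds the page and where within it.                          */
    static unsigned long swap_cache[NR_PAGES];

    /* When a page is chosen for swap-out, a non-zero entry means the copy in
     * the swap file is still up to date, so the page can just be discarded.  */
    static bool must_write_to_swap(unsigned long pfn)
    {
        return swap_cache[pfn] == 0;
    }

    /* Called when the page is modified: the on-disk copy is now stale. */
    static void page_written_to(unsigned long pfn)
    {
        swap_cache[pfn] = 0;
    }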
When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there is a valid entry for this page, it does not need to write the page out to the swap file. This is because the page in memory has not been modified since it was last read from the swap file. The entries in the swap cache are page table entries for swapped out pages. They are marked as invalid but contain information which allows Linux to find the right swap file and the right page within that swap file.

Swapping Pages In
The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual memory whose contents are held in a swapped out physical page. Accessing a page of virtual memory that is not held in physical memory causes a page fault to occur. The page fault is the processor signalling the operating system that it cannot translate a virtual address into a physical one. In this case it is because the page table entry describing this page of virtual memory was marked as invalid when the page was swapped out. The processor cannot handle the virtual to physical address translation and so hands control back to the operating system, describing as it does so the virtual address that faulted and the reason for the fault. The format of this information and how the processor passes control to the operating system is processor specific.

The processor specific page fault handling code must locate the vm_area_struct data structure that describes the area of virtual memory that contains the faulting virtual address. It does this by searching the vm_area_struct data structures for this process until it finds the one containing the faulting virtual address. This is very time critical code and a process's vm_area_struct data structures are so arranged as to make this search take as little time as possible.

Having carried out the appropriate processor specific actions and found that the faulting virtual address is for a valid area of virtual memory, the page fault processing becomes generic and applicable to all processors that Linux runs on. The generic page fault handling code looks for the page table entry for the faulting virtual address. If the page table entry it finds is for a swapped out page, Linux must swap the page back into physical memory. The format of the page table entry for a swapped out page is processor specific, but all processors mark these pages as invalid and put the information necessary to locate the page within the swap file into the page table entry. Linux needs this information in order to bring the page back into physical memory.

At this point, Linux knows the faulting virtual address and has a page table entry containing information about where this page has been swapped to. The vm_area_struct data structure may contain a pointer to a routine which will swap any page of the area of virtual memory that it describes back into physical memory. This is its swapin operation. If there is a swapin operation for this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared memory pages are handled; they require special handling because the format of a swapped out System V shared page is a little different from that of an ordinary swapped out page. There may not be a swapin operation, in which case Linux will assume that this is an ordinary page that does not need to be specially handled.
It allocates a free physical page and reads the swapped out page back from the swap file. Information telling it where in the swap file (and which swap file) the page is held is taken from the invalid page table entry.
If the access that caused the page fault was not a write access then the page is left in the swap cache and its page table entry is not marked as writable. If the page is subsequently written to, another page fault will occur and, at that point, the page is marked as dirty and its entry is removed from the swap cache. If the page is not written to and it needs to be swapped out again, Linux can avoid writing the page to its swap file because the page is already in the swap file. If the access that caused the page to be brought in from the swap file was a write operation, this page is removed from the swap cache and its page table entry is marked as both dirty and writable.

File management
The term computer file management refers to the manipulation of documents and data in files on a computer. Specifically, one may create a new file or edit an existing file and save it; open or load a pre-existing file into memory; or close a file without saving it. Additionally, one may group related files in directories. These tasks are accomplished in different ways in different operating systems and depend on the user interface design and, to some extent, the storage medium being used.

Although the file management paradigm described above is currently the dominant one in computing, attempts have been made to create more efficient or usable paradigms. The concept of saving a file, in particular, has been the subject of much innovation, with some applications including an autosave feature (to periodically save changes to a file in case of a computer crash, power outage, etc.) and others doing away with the save concept completely. In the latter case, one typically opens and closes files without ever being given the option to save them. Such applications usually have a multi-level undo feature to replace the concept of closing a file without saving any changes.

Concept of the hierarchy of files
Files can also be managed based on their location on a storage device. They are stored in a storage medium in binary form. Physically, the data is placed in a not-so-well organized structure, due to fragmentation. However, the grouping of files into directories (for operating systems such as DOS, Unix, Linux) or folders (for the Mac OS and Windows) is recorded in an index of file information such as the File Allocation Table (FAT) or, for NTFS volumes on recent versions of Windows, the Master File Table (MFT), depending on the operating system and file system used. In this index, the physical location of a particular file on the storage medium is stored, as well as its position in the hierarchy of directories (as we see it using commands such as DIR, LS and programs such as Explorer, Finder). On Unix/Linux machines the hierarchy is:
• The root directory (/)
o Directories (/usr "user" or /dev "device")
Sub-directories (/usr/local)
Files: data, devices, links, etc. (/usr/local/readme.txt or /dev/hda1, which is a hard disk device)
For DOS/Windows the hierarchy is (along with examples):
• Drive (C:)
o Directory/Folder (C:\My Documents)
Sub-directory/Sub-folder (C:\My Documents\My Pictures)
File (C:\My Documents\My Pictures\VacationPhoto.jpg)
Commands such as:
• Unix/Linux: cp, mv
• DOS: copy, move
• Windows: the Cut/Copy/Paste commands in the file menu of Explorer
can be used to manage (copy or move) files to and from other directories.

File system fragmentation
In computing, file system fragmentation, sometimes called file system aging, is the inability of a file system to lay out related data sequentially (contiguously), an inherent phenomenon in storage-backed file systems that allow in-place modification of their contents. It is a special case of data fragmentation. File system fragmentation increases disk head movement, or seeks, which are known to hinder throughput. The correction to existing fragmentation is to reorganize files and free space back into contiguous areas, a process called defragmentation.

When a file system is first initialized on a partition (the partition is formatted for the file system), the entire space allotted is empty. This means that the allocator algorithm is completely free to position newly created files anywhere on the disk. For some time after creation, files on the file system can be laid out near-optimally. When the operating system and applications are installed or other archives are unpacked, laying out separate files sequentially also means that related files are likely to be positioned close to each other.

However, as existing files are deleted or truncated, new regions of free space are created. When existing files are appended to, it is often impossible to resume the write exactly where the file used to end, as another file may already be allocated there; thus, a new fragment has to be allocated. As time goes on, and the same factors are continuously present, free space as well as frequently appended files tend to fragment more. Shorter regions of free space also mean that the allocator is no longer able to allocate new files contiguously, and has to break them into fragments. This is especially true when the file system becomes more full: longer contiguous regions of free space are less likely to occur.

Note that the following is a simplification of an otherwise complicated subject. The method about to be explained has been the general practice for allocating files on disk and other random-access storage for over 30 years. Some operating systems do not simply allocate files one after the other, and some use various methods to try to prevent fragmentation, but in general, sooner or later, for the reasons explained below, fragmentation will occur as time goes by on any system where files are routinely deleted or expanded. Consider the following scenario:
  • 33. OPERATING SYSTEM A new disk has had 5 files saved on it, named A, B, C, D and E, and each file is using 10 blocks of space (here the block size is unimportant.) As the free space is contiguous the files are located one after the other (Example (1).) If file B is deleted, a second region of 10 blocks of free space is created, and the disk becomes fragmented. The file system could defragment the disk immediately after a deletion, which would incur a severe performance penalty at unpredictable times, but in general the empty space is simply left there, marked in a table as available for later use, then used again as needed[2] (Example (2).) Now if a new file F requires 7 blocks of space, it can be placed into the first 7 blocks of the space formerly holding the file B, and the 3 blocks following it will remain available (Example (3).) If another new file G is added, and needs only three blocks, it could then occupy the space after F and before C (Example (4).) If subsequently F needs to be expanded, since the space immediately following it is occupied, there are three options: (1) add a new block somewhere else and indicate that F has a second extent, (2) move files in the way of the expansion elsewhere, to allow F to remain contiguous; or (3) move file F so it can be one contiguous file of the new, larger size. The second option is probably impractical for performance reasons, as is the third when the file is very large. Indeed the third option is impossible when there is no single contiguous free space large enough to hold the new file. Thus the usual practice is simply to create an extent somewhere else and chain the new extent onto the old one (Example (5).) Material added to the end of file F would be part of the same extent. But if there is so much material that no room is available after the last extent, then another extent would have to be created, and so on, and so on. Eventually the file system has free segments in many places and some files may be spread over many extents. Access time for those files (or for all files) may become excessively long. To summarize, factors that typically cause or facilitate fragmentation, include: • low free space. • frequent deletion, truncation or extension of files. • overuse of sparse files. Performance implications File system fragmentation is projected to become more problematic with newer hardware due to the increasing disparity between sequential access speed and rotational delay (and to a lesser extent seek time), of consumer-grade hard disks, which file systems are usually placed on. Thus, fragmentation is an important problem in recent file system research and design. The containment of fragmentation not only depends on the on-disk format of the file system, but also heavily on its implementation. In simple file system benchmarks, the fragmentation factor is often omitted, as realistic aging and fragmentation is difficult to model. Rather, for simplicity of comparison, file system benchmarks are often run on empty file systems, and unsurprisingly, the results may vary heavily from real-life access patterns. Types of fragmentation 33 K. ADISESHA LECTURER, PRESIDENCY COLLEGE
File system fragmentation may occur on several levels:
• Fragmentation within individual files and their metadata.
• Free space fragmentation, making it increasingly difficult to lay out new files contiguously.
• The decrease of locality of reference between separate, but related, files.

File fragmentation
Individual file fragmentation occurs when a single file has been broken into multiple pieces (called extents on extent-based file systems). While disk file systems attempt to keep individual files contiguous, this is not often possible without significant performance penalties. File system check and defragmentation tools typically only account for file fragmentation in their "fragmentation percentage" statistic (a minimal sketch of such a statistic is given at the end of this page).

Free space fragmentation
Free (unallocated) space fragmentation occurs when there are several unused areas of the file system where new files or metadata can be written to. Unwanted free space fragmentation is generally caused by deletion or truncation of files, but file systems may also intentionally insert fragments ("bubbles") of free space in order to facilitate extending nearby files (see preemptive techniques below).

File scattering
File scattering, also called related-file fragmentation or application-level (file) fragmentation, refers to the lack of locality of reference (within the storing medium) between related files (see file sequences for more detail). Unlike the previous two types of fragmentation, file scattering is a much more vague concept, as it heavily depends on the access pattern of specific applications. This also makes objectively measuring or estimating it very difficult. However, arguably, it is the most critical type of fragmentation, as studies have found that the most frequently accessed files tend to be small compared to available disk throughput per second. To avoid related-file fragmentation and improve locality of reference (in this case called file contiguity), assumptions about the operation of applications have to be made. A very frequent assumption made is that it is worthwhile to keep smaller files within a single directory together, and lay them out in the natural file system order. While it is often a reasonable assumption, it does not always hold. For example, an application might read several different files, perhaps in different directories, in the exact same order they were written. Thus, a file system that simply orders all writes successively might work faster for the given application.

Techniques for mitigating fragmentation
Several techniques have been developed to fight fragmentation. They can usually be classified into two categories: preemptive and retroactive. Due to the difficulty of predicting access patterns, these techniques are most often heuristic in nature and may degrade performance under unexpected workloads.
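As mentioned under file fragmentation above, check and defragmentation tools usually report a simple "fragmentation percentage". One plausible way such a statistic could be computed is sketched below; the file_info record is hypothetical, and free-space and related-file fragmentation are deliberately ignored, as the text notes.

    #include <stddef.h>

    /* Hypothetical per-file record: the number of extents (contiguous pieces)
     * the file occupies on disk.  One extent means the file is unfragmented. */
    struct file_info {
        unsigned extents;
    };

    /* The share of files split into more than one piece, as a percentage.
     * Free-space and related-file fragmentation are not captured at all.     */
    static double fragmentation_percentage(const struct file_info *files, size_t n)
    {
        if (n == 0)
            return 0.0;
        size_t fragmented = 0;
        for (size_t i = 0; i < n; i++)
            if (files[i].extents > 1)
                fragmented++;
        return 100.0 * (double)fragmented / (double)n;
    }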