Process Scheduling
Hao-Ran Liu
Objective
• Decide which process runs, when, and for
how long
• Considering the overhead of context
switches, we need to balance two
conflicting goals
– CPU utilization (high throughput)
– Better Interactive performance (low latency)
Multitasking
• Cooperative
– A process does not stop running until it voluntarily
decides to do so
– Anyone can monopolize the processor; a hung
process that never yields can lock the entire system
– A technique used in many user-mode threading
libraries.
• Preemptive
– A running process can be suspended at any time
(usually because it exhausts its time slice)
Types of processes
• I/O-bound processes
– spend most of their time waiting for I/O
– should be executed often (for short durations)
when they are runnable
• CPU-bound processes
– spend most of their time executing code; tend
to run until they are preempted
– should be executed for longer durations (to
improve throughput)
Scheduling policies
• Check the sched(7) man page for more details
Name            Description
• Normal
SCHED_NORMAL    The standard time-sharing policy for regular tasks
SCHED_BATCH     For CPU-bound tasks that do not preempt often
SCHED_IDLE      For running very low priority background jobs (lower than a +19 nice value)
• Real-time
SCHED_FIFO      FIFO without time slice
SCHED_RR        Round robin with maximum time slice
SCHED_DEADLINE  Earliest Deadline First + Constant Bandwidth Server; accepts a task only if its periodic job can be done before its deadline
Process priority
• Processes with a higher priority
– run before those with a lower priority
– receive a longer time slice
• Priority range [static, dynamic]
– Normal, batch: [always 0, -20~+19], default: [0, 0].
The dynamic priority is the nice value you adjust in user
space. A larger “nice” value corresponds to a lower
priority
– FIFO, RR: [1~99, 0]; a higher value means a greater
priority. FIFO and RR processes always have a higher priority
than normal processes
– Deadline: not applicable. Deadline processes
always have the highest priority in the system
Time slice
• How long a task can run until it is
preempted
• The value of the time slice:
– higher: better throughput
– lower: better interactive performance (shorter
scheduling latency), but more CPU time
wasted on context switches
– The default value is usually pretty small (for good
interactive performance)
Completely Fair Scheduler
• The scheduler for SCHED_NORMAL,
SCHED_BATCH, SCHED_IDLE classes
• CFS assigns a proportion of the processor,
instead of time slices, to processes
– A process with a higher nice value receives a smaller
proportion of the CPU
• If a process enters runnable state and has
consumed a smaller proportion of the CPU than
the currently executing one, it runs immediately,
preempting the current one.
CFS scheduler in action
• Two processes
– Video encoder (CPU-bound) and text editor (I/O-bound)
– Both processes have the same nice value
• We want the text editor to preempt the video encoder
when the editor is runnable
– The text editor consumes a smaller proportion of the
CPU than the video encoder, so it will preempt the
video encoder once it is runnable.
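The editor-versus-encoder decision above can be modelled in a few lines of user-space C. This is a minimal sketch, not kernel code: CFS tracks consumed CPU time as a weighted per-task "virtual runtime" and keeps tasks in a red-black tree, whereas the sketch uses a linear scan. The `struct task` layout and the helper names `account` and `pick_next` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024   /* load weight of a nice-0 task */

/* Hypothetical stand-in for the scheduler's per-task bookkeeping. */
struct task {
    const char *name;
    uint32_t weight;         /* load weight derived from the nice value */
    uint64_t vruntime;       /* weighted CPU time consumed so far, in ns */
};

/* Charge delta_ns of real CPU time to a task, scaled by its weight:
 * a heavier (higher priority) task accumulates vruntime more slowly. */
static void account(struct task *t, uint64_t delta_ns)
{
    assert(t->weight > 0);
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* Pick the task that has consumed the least weighted CPU time.
 * Returns NULL when there are no runnable tasks. */
static struct task *pick_next(struct task *tasks, size_t n)
{
    struct task *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (!best || tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}
```

If the encoder has been charged 10 ms and the editor only 1 ms, `pick_next()` returns the editor, which matches the behavior described on the slide.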
“timeslice” in CFS
• Target latency
– /proc/sys/kernel/sched_latency_ns
– the period in which all run queue tasks are scheduled at least
once
• Timeslice_CFS = target latency * (nice weight of the process /
total nice weight of all runnable processes)
– Ex: target latency = 20ms, two runnable processes at the same
priority, each will run for 10ms before preemption
• If the number of runnable processes => ∞,
timeslice_CFS => 0
– Unacceptable switching costs
– CFS imposes a floor on the “timeslice”:
/proc/sys/kernel/sched_min_granularity_ns, default value is 1ms
• CFS is not “fair” if the number of processes is extremely
large
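The timeslice rule and its floor can be written down as a small calculation. The sketch below is a simplified user-space model that applies the floor as a plain clamp, as the slide describes; the function name `cfs_timeslice_ns` and its parameters are hypothetical and do not correspond to a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of the CFS "timeslice" (hypothetical helper):
 * a task's share of the target latency is proportional to its weight,
 * and the result is never allowed to drop below the floor.
 */
static uint64_t cfs_timeslice_ns(uint64_t latency_ns, uint64_t weight,
                                 uint64_t total_weight, uint64_t min_gran_ns)
{
    uint64_t slice;

    assert(weight > 0 && total_weight >= weight);
    slice = latency_ns * weight / total_weight;
    return slice < min_gran_ns ? min_gran_ns : slice;
}
```

With a 20 ms target latency and a 1 ms floor, two equal-weight tasks get 10 ms each. With 100 equal-weight tasks the proportional share would be 0.2 ms, so the floor applies and each task gets 1 ms, which is why the schedule is no longer strictly "fair" within one target latency period.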
CFS example again
• Two processes, nice values = 0 and 5
– The weight for a nice value of 5 is about 1/3 of the weight for a nice value of 0
– If target latency = 20ms, the two processes receive about 15ms
and 5ms of “timeslice” respectively
– If we change the nice values to 10 and 15, they still receive the
same “timeslice”
• The proportion of processor time that any
process receives is determined only by the
relative difference in niceness between it and
the other runnable processes
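The 15 ms / 5 ms split can be checked numerically. The sketch below assumes the nice-to-weight mapping of the kernel's `sched_prio_to_weight[]` table (nice 0 maps to 1024, and each nice step changes the weight by roughly a factor of 1.25); the table values are reproduced here and are worth checking against your kernel source. The helper names `weight_of` and `share_us` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Load weights for nice -20..+19 (assumed to match sched_prio_to_weight[]). */
static const uint32_t nice_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

/* Look up the load weight for a nice value in [-20, 19]. */
static uint32_t weight_of(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return nice_to_weight[nice + 20];
}

/* Share of the target latency (in microseconds) that task "a" receives
 * when it competes with exactly one other runnable task "b". */
static uint64_t share_us(uint64_t latency_us, int nice_a, int nice_b)
{
    uint64_t wa = weight_of(nice_a);
    uint64_t wb = weight_of(nice_b);

    return latency_us * wa / (wa + wb);
}
```

With a 20 ms target latency, nice 0 against nice 5 gives roughly 15.07 ms and 4.93 ms, and nice 10 against nice 15 gives almost exactly the same split, because only the ratio of the two weights matters.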
CFS group scheduling
• Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group
• Kernel config required:
– CONFIG_FAIR_GROUP_SCHED
– CONFIG_RT_GROUP_SCHED
• Example:
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu
# mkdir multimedia # create "multimedia" group of tasks
# mkdir browser # create "browser" group of tasks
# #Configure the multimedia group to receive twice the CPU bandwidth
# #that of browser group
# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares
# firefox & # Launch firefox and move it to "browser" group
# echo <firefox_pid> > browser/tasks
# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
Sporadic task model
deadline scheduling
• Each SCHED_DEADLINE task is characterized by the
"runtime", "deadline", and "period" parameters
• The kernel performs an admission test when setting or
changing the SCHED_DEADLINE policy and attributes with the
sched_setattr() system call.
arrival/wakeup                  absolute deadline
     |   start time                     |
     |        |                         |
     v        v                         v
-----x--------xooooooooooooooooo--------x--------x---
              |<--- Runtime --->|
     |<----------- Deadline ----------->|
     |<-------------- Period ------------------->|
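The idea behind the admission test can be sketched as a utilization check: each task reserves runtime/period of a CPU, and a new task is accepted only if the total stays within a limit. The code below is a simplified single-CPU model in user space, not the kernel's implementation. The `struct dl_params` type, the `dl_admit` helper, and the percentage limit are assumptions for illustration, and the fixed-point scale is chosen only to avoid floating point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BW_SHIFT 20   /* fixed-point scale: 1.0 is represented as 1 << 20 */

/* Hypothetical container for the per-task parameters named on the slide. */
struct dl_params {
    uint64_t runtime_ns;
    uint64_t period_ns;
};

/* Fraction of one CPU that a task reserves, as a fixed-point number. */
static uint64_t to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
    assert(period_ns > 0);
    return (runtime_ns << BW_SHIFT) / period_ns;
}

/* Return 1 if the whole task set fits within limit_percent of one CPU,
 * otherwise 0 (the request would be rejected). */
static int dl_admit(const struct dl_params *tasks, size_t n,
                    unsigned limit_percent)
{
    uint64_t limit = ((uint64_t)limit_percent << BW_SHIFT) / 100;
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += to_ratio(tasks[i].period_ns, tasks[i].runtime_ns);
    return total <= limit;
}
```

For example, tasks that need 10 ms and 30 ms out of every 100 ms use 40% of the CPU and fit under a 95% limit; adding a third task that needs 60 ms per 100 ms pushes the total to 100% and would be refused.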
Some tools for real-time tasks
• chrt sets or retrieves the real-time scheduling attributes
of an existing pid, or runs a command with the given
attributes.
chrt [options] [<policy>] <priority> [-p <pid> | <command> [<arg>...]]
• Limiting the CPU usage of real-time and deadline
processes
– A nonblocking infinite loop in a thread scheduled under the
FIFO, RR, or DEADLINE policy will block all threads with lower
priority forever
– Two /proc files can be used to reserve a certain amount of CPU
time for non-real-time processes. With the defaults, real-time
tasks may use at most 0.95s of every 1s
• /proc/sys/kernel/sched_rt_period_us (default: 1000000)
• /proc/sys/kernel/sched_rt_runtime_us (default: 950000)
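From a program, the valid priority range for a real-time policy can be queried with the POSIX calls `sched_get_priority_min()` and `sched_get_priority_max()`, which is roughly what `chrt -m` reports. The wrapper below is a small sketch; the name `rt_priority_range` is hypothetical.

```c
#include <assert.h>
#include <sched.h>

/*
 * Query the valid static priority range for a policy such as SCHED_FIFO.
 * Returns 0 on success and -1 if the policy is not supported.
 */
static int rt_priority_range(int policy, int *min, int *max)
{
    assert(min && max);
    *min = sched_get_priority_min(policy);
    *max = sched_get_priority_max(policy);
    return (*min == -1 || *max == -1) ? -1 : 0;
}
```

A program would then pick a value inside that range and pass it in a `struct sched_param` to `sched_setscheduler()`. Switching to SCHED_FIFO or SCHED_RR normally requires elevated privileges, so that call is left out of the sketch.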
Context switches
• schedule() calls context_switch() after a
new process has been selected to run
• context_switch()
– switch_mm(): switch virtual memory mapping
– switch_to(): switch processor state.
• The kernel knows it should reschedule when the
need_resched flag is set
– Set by scheduler_tick() when a process should be
preempted
– Set by try_to_wake_up() when a process with a higher
priority than the current process is awakened
Example: creating kernel thread
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

/* prefix every message with the name of the calling function */
#define DPRINTK(fmt, args...) \
        printk("%s(): " fmt, __func__, ##args)

static struct task_struct *kth_test_task;
static int data;

static int kth_test(void *arg)
{
        long timeout;   /* schedule_timeout() returns a signed long */
        int *d = (int *) arg;

        while (!kthread_should_stop()) {
                DPRINTK("data=%d\n", ++(*d));
                /* sleep for up to 10 seconds; kthread_stop() wakes us early */
                set_current_state(TASK_INTERRUPTIBLE);
                timeout = schedule_timeout(10 * HZ);
                if (timeout)
                        DPRINTK("schedule_timeout returned early.\n");
        }
        DPRINTK("exit.\n");
        return 0;
}

static int __init init_modules(void)
{
        int ret;

        /* the new thread stays asleep until wake_up_process() is called */
        kth_test_task = kthread_create(kth_test, &data, "kth_test");
        if (IS_ERR(kth_test_task)) {
                ret = PTR_ERR(kth_test_task);
                kth_test_task = NULL;
                goto out;
        }
        wake_up_process(kth_test_task);
        return 0;
out:
        return ret;
}

static void __exit exit_modules(void)
{
        /* block until kth_test_task exits */
        kthread_stop(kth_test_task);
}

module_init(init_modules);
module_exit(exit_modules);
MODULE_LICENSE("GPL");
Process sleeping
• Processes need to sleep when requests cannot be
satisfied immediately
– Kernel output buffer is full or no data is available
• Rules for sleeping
– Never sleep in an atomic context
• Holding a spinlock, seqlock or RCU lock
• Interrupts are disabled
– Always check that the condition the process
was waiting for is indeed true after the process wakes up
Wait queue
• A wait queue contains a list of processes, all
waiting for a specific event
• Declaration and initialization of a wait queue
// defined and initialized statically with
DECLARE_WAIT_QUEUE_HEAD(name);
// initialized dynamically
wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);
wait_event macros
// queue: the wait queue head to use. Note that it is passed “by value”
// condition: arbitrary boolean expression, evaluated by the macro before
// and after sleeping until the condition becomes true. It may
// be evaluated an arbitrary number of times, so it should not
// have any side effects.
// timeout: wait for the specific number of clock ticks (in jiffies)
// uninterruptible sleep until a condition gets true
wait_event(queue, condition);
// interruptible sleep until a condition gets true, return -ERESTARTSYS if
// interrupted by a signal, return 0 if condition evaluated to be true
wait_event_interruptible(queue, condition);
// uninterruptible sleep until a condition gets true or a timeout elapses
// return 0 if the timeout elapsed, and the remaining jiffies if the
// condition evaluated to true before the timeout elapsed
wait_event_timeout(queue, condition, timeout);
// interruptible sleep until a condition gets true or a timeout elapses
// return 0 if the timeout elapsed, -ERESTARTSYS if interrupted by a
// signal, and the remaining jiffies if the condition evaluated to true
// before the timeout elapsed
wait_event_interruptible_timeout(queue, condition, timeout);
wake_up macros
// Wake processes that are sleeping on the queue q. The _interruptible
// form wakes only interruptible processes. Normally, only one exclusive
// waiter is awakened (to avoid thundering herd problem), but that
// behavior can be changed with the _nr or _all forms. The _sync version
// does not reschedule the CPU before returning.
void wake_up(wait_queue_head_t *q);
void wake_up_interruptible(wait_queue_head_t *q);
void wake_up_nr(wait_queue_head_t *q, int nr);
void wake_up_interruptible_nr(wait_queue_head_t *q, int nr);
void wake_up_all(wait_queue_head_t *q);
void wake_up_interruptible_all(wait_queue_head_t *q);
void wake_up_interruptible_sync(wait_queue_head_t *q);
• Within a real device driver, a process blocked in a read call is
awakened when data arrives; usually the hardware issues an
interrupt to signal such an event, and the driver awakens
waiting processes as part of handling the interrupt
A simple example of putting
processes to sleep
• sleepy device behavior: any process that
attempts to read from the device is put to
sleep. Whenever a process writes to the
device, all sleeping processes are awakened
• Note that on a single processor, the second
process to wake up would immediately go
back to sleep, because the first one has already
reset the flag
sleepy’s read and write
ssize_t sleepy_read(struct file *filp, char __user *buf,
                    size_t count, loff_t *pos)
{
        printk(KERN_DEBUG "process %i (%s) going to sleep\n",
               current->pid, current->comm);
        /* a nonzero return means a signal interrupted the sleep */
        if (wait_event_interruptible(wq, flag != 0))
                return -ERESTARTSYS;
        flag = 0;
        printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
        return 0; /* EOF */
}

ssize_t sleepy_write(struct file *filp, const char __user *buf,
                     size_t count, loff_t *pos)
{
        printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
               current->pid, current->comm);
        flag = 1;
        wake_up_interruptible(&wq);
        return count; /* succeed, to avoid retrial */
}
Implementation of wait_event:
How to implement sleep manually
#define wait_event(wq, condition) \
do { \
        if (condition) \
                break; \
        __wait_event(wq, condition); \
} while (0)

#define __wait_event(wq, condition) \
do { \
        DEFINE_WAIT(__wait); \
        \
        for (;;) { \
                prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE); \
                if (condition) \
                        break; \
                schedule(); \
        } \
        finish_wait(&wq, &__wait); \
} while (0)
Implementation of wait_event:
How to implement sleep manually
• prepare_to_wait
– add the wait queue entry to the wait queue and set the
process state
• finish_wait
– set the task state to TASK_RUNNING and remove the wait queue
entry from the wait queue
• Questions:
– What if the ‘if (condition) ..’ statement is moved
before prepare_to_wait()?
– What if the ‘wake_up’ event happens just after the ‘if
(condition) ..’ statement but before the execution of
the schedule() function?
User Preemption
• It can occur if need_resched is true when
returning to user-space
– from a system call
– from an interrupt handler
Kernel Preemption
• In nonpreemptive kernels, kernel code runs until
completion.
– The scheduler cannot reschedule a task while it is in
the kernel
– kernel code is scheduled cooperatively, not
preemptively
• In the 2.6+ kernel, however, the Linux kernel
became preemptive:
– It is now possible to preempt a task at any point, so
long as the kernel is in a state in which it is safe to
reschedule
• Safe => preempt_count == 0 (kernel doesn’t hold any lock
and isn’t in any atomic context like softirq or hardirq)
Kernel Preemption
• preempt_count
– a variable in each process’s thread_info
– Begins at zero, increments when the kernel
enters an atomic context, and decrements when
it leaves.
– If this counter is zero, kernel is preemptible
Cases that need preemption disabled
• Per-CPU data structures
• Some registers must be protected
– On x86, kernel does not save FPU state
except for user tasks. Entering and exiting
FPU mode is a critical section that must occur
while preemption is disabled
struct this_needs_locking tux[NR_CPUS];
tux[smp_processor_id()] = some_value;
/* task is preempted here... */
something = tux[smp_processor_id()];
preempt_count
/*
* We put the hardirq and softirq counter into the preemption
* counter. The bitmask has the following meaning:
*
* - bits 0-7 are the preemption count (max preemption depth: 256)
* - bits 8-15 are the softirq count (max # of softirqs: 256)
*
* The hardirq count can in theory reach the same as NR_IRQS.
* In reality, the number of nested IRQS is limited to the stack
* size as well. For archs with over 1000 IRQS it is not practical
* to expect that they will all nest. We give a max of 10 bits for
* hardirq nesting. An arch may choose to give less than 10 bits.
* m68k expects it to be 8.
*
* - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024)
* - bit 26 is the NMI_MASK
* - bit 28 is the PREEMPT_ACTIVE flag
*
* PREEMPT_MASK: 0x000000ff
* SOFTIRQ_MASK: 0x0000ff00
* HARDIRQ_MASK: 0x03ff0000
* NMI_MASK: 0x04000000
*/
include/linux/hardirq.h
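The bit layout in the comment can be exercised with a small user-space decoder. This is a sketch that only reuses the four mask values listed above; the shift constants follow from the bit positions in the comment, and the helper names (`preempt_depth`, `softirq_depth`, `hardirq_depth`, `in_nmi_ctx`, `is_preemptible`) are hypothetical rather than kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Mask values copied from the comment above. */
#define PREEMPT_MASK 0x000000ffu
#define SOFTIRQ_MASK 0x0000ff00u
#define HARDIRQ_MASK 0x03ff0000u
#define NMI_MASK     0x04000000u

/* Shifts follow from the bit ranges: softirq starts at bit 8, hardirq at 16. */
#define SOFTIRQ_SHIFT 8
#define HARDIRQ_SHIFT 16

static_assert((PREEMPT_MASK & SOFTIRQ_MASK) == 0, "fields must not overlap");
static_assert((SOFTIRQ_MASK & HARDIRQ_MASK) == 0, "fields must not overlap");

static unsigned preempt_depth(uint32_t count)
{
    return count & PREEMPT_MASK;
}

static unsigned softirq_depth(uint32_t count)
{
    return (count & SOFTIRQ_MASK) >> SOFTIRQ_SHIFT;
}

static unsigned hardirq_depth(uint32_t count)
{
    return (count & HARDIRQ_MASK) >> HARDIRQ_SHIFT;
}

static int in_nmi_ctx(uint32_t count)
{
    return (count & NMI_MASK) != 0;
}

/* In this model the task may be preempted only when every field is zero. */
static int is_preemptible(uint32_t count)
{
    return (count & (PREEMPT_MASK | SOFTIRQ_MASK |
                     HARDIRQ_MASK | NMI_MASK)) == 0;
}
```

A value of 0x00010102, for instance, decodes to a preemption depth of 2 with one softirq level and one hardirq level, so the task is not preemptible.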
References
• Linux Kernel Development, 3rd Edition,
Robert Love, 2010
• Linux kernel source, http://lxr.free-electrons.com
