Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze
CSE 5343/7343
Fall 2006
Case Studies
Comparing Windows XP and Linux
Copyright Notice
© 2000-2005 David A. Solomon and Mark Russinovich
These materials are part of the Windows Operating System Internals Curriculum Development Kit, developed by David A. Solomon and Mark E. Russinovich with Andreas Polze
Microsoft has licensed these materials from David Solomon Expert Seminars, Inc. for distribution to academic organizations solely for use in academic environments (and not for commercial use)
Background
Architecture
Linus and Linux
In 1991 Linus Torvalds took a college computer science course that used the Minix operating system
Minix is a “toy” UNIX-like OS written by Andrew Tanenbaum as a learning workbench
Linus wanted to make Minix more usable, but Tanenbaum wanted to keep it ultra-simple
Linus went in his own direction and began working on Linux
In October 1991 he announced Linux v0.02
In March 1994 he released Linux v1.0
Windows and Linux
Both Linux and Windows are based on foundations developed in the mid-1970s
[Figure: two timelines spanning 1970-2000. UNIX/Linux: UNIX born, UNIX public, UNIX V6, Linux v1.0, v2.0, v2.1, v2.2, v2.3, v2.4, v2.6. VMS/Windows: VMS v1.0, Windows NT 3.1, NT 4.0, Windows 2000, Windows XP, Server 2003]
Comparing the Architectures
Both Linux and Windows are monolithic
All core operating system services run in a shared address space in kernel mode
All core operating system services are part of a single module
Linux: vmlinuz
Windows: ntoskrnl.exe
Windowing is handled differently:
Windows has a kernel-mode windowing subsystem
Linux has a user-mode windowing system (the X Window System)
Kernel Architectures
[Figure: side-by-side block diagrams. Linux: Application and X-Windows in user mode; System Services, Process Management, Memory Management, I/O Management, etc., Device Drivers and Hardware Dependent Code in kernel mode. Windows: Application in user mode; System Services, Process Management, Memory Management, I/O Management, etc., Win32 Windowing, Device Drivers and Hardware Dependent Code in kernel mode]
Linux Kernel
Linux is a monolithic but modular system
All kernel subsystems form a single piece of code with no protection between them
Modularity is supported in two ways:
Compile-time options
Most kernel components can be built as a dynamically loadable kernel module (DLKM)
DLKMs
Built separately from the main kernel
Loaded into the kernel at runtime and on demand (infrequently used components take up kernel memory only when needed)
Kernel modules can be upgraded incrementally
Support for minimal kernels that automatically adapt to the machine and load only those kernel components that are used
Windows Kernel
Windows is a monolithic but modular system
No protection among pieces of kernel code and drivers
Support for modularity is somewhat weak:
Windows drivers allow for dynamic extension of kernel functionality
Windows XP Embedded has special tools / packaging rules that allow coarse-grained configuration of the OS
Windows drivers are dynamically loadable kernel modules
Significant amount of code runs as drivers (including network stacks such as TCP/IP and many services)
Built independently from the kernel
Can be loaded on demand
Dependencies among drivers can be specified
Comparing Portability
Both Linux and Windows kernels are portable
Mainly written in C
Have been ported to a range of processor architectures
Windows
i486, MIPS, PowerPC, Alpha, IA-64, x86-64
Only x86, x86-64 and IA-64 currently supported
> 64MB memory required
Linux
Alpha, ARM, ARM26, CRIS, H8300, i386, IA-64, M68000, MIPS, PA-RISC, PowerPC, S/390, SuperH, SPARC, VAX, v850, x86-64
DLKMs allow for minimal kernels for microcontrollers
> 4MB memory required
Comparing Layering, APIs, Complexity
Windows
Kernel exports about 250 system calls (accessed via ntdll.dll)
Layered Windows/POSIX subsystems
Rich Windows API (17,500 functions on top of native APIs)
Linux
Kernel supports about 200 different system calls
Layered BSD, Unix Sys V, POSIX shared system libraries
Compact APIs (1742 functions in Single Unix Specification Version 3; not including X Window APIs)
Comparing Architectures
Processes and scheduling
SMP support
Memory management
I/O
File caching
Security
Process Management
Windows
Process
Address space, handle table, statistics and at least one thread
No inherent parent/child relationship
Threads
Basic scheduling unit
Fibers - cooperative user-mode threads
Linux
Process is called a Task
Basic address space, handle table, statistics
Parent/child relationship
Basic scheduling unit
Threads
No threads per se
Tasks can act like Windows threads by sharing handle table, PID and address space
PThreads – POSIX thread library layered on such tasks (kernel-scheduled, not cooperative)
Scheduling Priorities
Windows
Two scheduling classes
“Real time” (fixed) - priority 16-31
Dynamic - priority 1-15
Higher priorities are favored
Priorities of dynamic threads get boosted on wakeups
Thread priorities are never lowered below their base priority
Linux
Has 3 scheduling classes:
Normal – priority 100-139
Fixed Round Robin – priority 0-99
Fixed FIFO – priority 0-99
Lower priority values are favored
Priority values of normal threads go up (decay) as they use CPU
Priority values of interactive threads go down (boost)
Scheduling Priorities (cont)
[Figure: priority scales. Windows: 0-15 Dynamic, 16-31 Fixed, with an arrow labeled I/O. Linux: 0-99 Fixed FIFO and Fixed Round-Robin, 100-140 Normal, with arrows labeled CPU and I/O]
Linux Scheduling Details
Most threads use a dynamic priority policy
Normal class - similar to the classic UNIX scheduler
A newly created thread starts with a base priority
Threads that block frequently (I/O bound) will have their priority gradually increased
Threads that always exhaust their time slice (CPU bound) will have their priority gradually decreased
“Nice value” sets a thread’s base priority
Larger values = less priority, lower values = higher priority
Valid nice values are in the range of -20 to +19
Nonprivileged users can only raise the nice value (lower the priority)
Dynamic priority policy threads have static priority zero
Execute only when there are no runnable real-time threads
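The relationship between nice values and the normal-class priority range (100-139) can be sketched as follows. This is a minimal illustration, assuming the 2.6-era convention in which nice 0 lands on priority 120 and valid nice values run from -20 through 19; the function name is hypothetical and is not a kernel or library API.

```python
# Hypothetical helper: shows how a nice value lands in the normal-class
# priority range 100-139 (a lower number means the task is more favored).
MAX_RT_PRIO = 100            # priorities 0-99 belong to the real-time policies
NICE_MIN, NICE_MAX = -20, 19


def nice_to_static_priority(nice: int) -> int:
    """Map a nice value (-20..19) onto the normal range 100..139."""
    if not NICE_MIN <= nice <= NICE_MAX:
        raise ValueError(f"nice value {nice} is outside {NICE_MIN}..{NICE_MAX}")
    return MAX_RT_PRIO + nice + 20


if __name__ == "__main__":
    for nice in (-20, 0, 19):
        print(nice, "->", nice_to_static_priority(nice))
```

The dynamic boosts and decays described above are then applied on top of this base value.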
Real-Time Scheduling on Linux
Linux supports two static priority scheduling policies:
Round-robin and FIFO (first in, first out)
Selected with the sched_setscheduler() system call
Use static priority values in the range of 1 to 99
Executed strictly in order of decreasing static priority
FIFO policy lets a thread run to completion
Thread needs to indicate completion by calling sched_yield()
Round-robin lets threads run for up to one time slice
Then switches to the next thread with the same static priority
RT threads can easily starve lower-priority threads
Root privileges or the CAP_SYS_NICE capability are required for the selection of a real-time scheduling policy
Long-running system calls can cause priority inversion
Same as in Windows; but compare RTLinux
Windows Scheduling Details
Most threads run in variable priority levels
Priorities 1-15
A newly created thread starts with a base priority
Threads that complete I/O operations experience priority boosts (but never higher than 15)
A thread’s priority will never be below base priority
The Windows API function SetThreadPriority() sets the priority value for a specified thread
This value, together with the priority class of the thread's process, determines the thread's base priority level
Windows will dynamically adjust priorities for non-realtime threads
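How the process priority class and the value passed to SetThreadPriority() combine into a base priority can be sketched as a small lookup. This is a simplified model, not Windows code: the dictionary and function names are hypothetical, and only the five common relative levels plus the two saturating levels (idle and time-critical) are modeled.

```python
# Simplified model (names are illustrative) of how a thread's base priority
# follows from its process priority class and its relative thread priority.
CLASS_BASE = {
    "IDLE": 4,
    "BELOW_NORMAL": 6,
    "NORMAL": 8,
    "ABOVE_NORMAL": 10,
    "HIGH": 13,
    "REALTIME": 24,
}

# Relative thread priorities expressed as offsets from the class base.
RELATIVE_OFFSET = {
    "LOWEST": -2,
    "BELOW_NORMAL": -1,
    "NORMAL": 0,
    "ABOVE_NORMAL": 1,
    "HIGHEST": 2,
}


def base_priority(priority_class: str, relative: str) -> int:
    """Return the base priority (1-31) for a class/relative-level pair."""
    realtime = priority_class == "REALTIME"
    if relative == "IDLE":             # saturates at the bottom of the range
        return 16 if realtime else 1
    if relative == "TIME_CRITICAL":    # saturates at the top of the range
        return 31 if realtime else 15
    return CLASS_BASE[priority_class] + RELATIVE_OFFSET[relative]


if __name__ == "__main__":
    print(base_priority("NORMAL", "NORMAL"))    # dynamic range
    print(base_priority("REALTIME", "LOWEST"))  # real-time range
```

Note that every result for a non-realtime class stays within 1-15, while the realtime class stays within 16-31, matching the two ranges on the slides.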
Real-Time Scheduling on Windows
Windows supports static round-robin scheduling policy for threads with priorities in real-time range (16-31)
Threads run for up to one quantum
Quantum is reset to full turn on preemption
Priorities never get boosted
RT threads can starve important system services
Such as CSRSS.EXE
SeIncreaseBasePriorityPrivilege required to elevate a thread’s priority into real-time range (this privilege is assigned to members of Administrators group)
System calls and DPC/APC handling can cause priority inversion
Scheduling Timeslices
Windows
The thread timeslice (quantum) is 10ms-120ms
When quanta can vary, a quantum has one of 2 values
Reentrant and preemptible
[Figure: Windows quanta. Fixed: 120ms. Variable: 20ms (background), 60ms (foreground)]
Linux
The thread quantum is 10ms-200ms
Default is 100ms
Varies across entire range based on priority, which is based on interactivity level
Reentrant and preemptible
[Figure: Linux quantum range from 10ms to 200ms, with 100ms as the default]
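The idea that the Linux quantum varies across the whole 10ms-200ms range with priority can be illustrated with a linear mapping. This is only a sketch: the linear interpolation and all names below are assumptions chosen to reproduce the 10ms, roughly 100ms and 200ms figures quoted above, and actual kernels compute timeslices differently from version to version.

```python
# Illustrative only: a linear mapping from a normal-class priority (100-139)
# to a timeslice between 10ms and 200ms. Not taken from any kernel source.
MIN_TIMESLICE_MS = 10
MAX_TIMESLICE_MS = 200
MOST_FAVORED_PRIO = 100
LEAST_FAVORED_PRIO = 139


def timeslice_ms(priority: int) -> int:
    """Return a quantum in milliseconds; more favored priorities run longer."""
    if not MOST_FAVORED_PRIO <= priority <= LEAST_FAVORED_PRIO:
        raise ValueError(f"priority {priority} is outside the normal class")
    span = LEAST_FAVORED_PRIO - MOST_FAVORED_PRIO          # 39 steps
    steps_above_minimum = LEAST_FAVORED_PRIO - priority
    extra = (MAX_TIMESLICE_MS - MIN_TIMESLICE_MS) * steps_above_minimum // span
    return MIN_TIMESLICE_MS + extra


if __name__ == "__main__":
    for prio in (100, 120, 139):
        print(prio, "->", timeslice_ms(prio), "ms")
```

Under this mapping a task at the middle priority (120) receives about 100ms, which is consistent with the default stated on the slide.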
Kernel Reentrancy
Mark Russinovich’s April 1999 Windows NT Magazine article, “Linux and the Enterprise”, pointed out that much of the Linux 2.2 kernel was not reentrant
Ingo Molnar stated in rebuttal:
“his example is a clear red herring.”
A month later he made all major paths reentrant
[Figure: cpu 1 and cpu 2 executing kernel code, shown for a non-reentrant and a reentrant kernel; the difference is labeled “Time Saved”]
Kernel Preemptibility
A preemptible kernel is more responsive to high-priority tasks
Through the base release of v2.4 Linux was only cooperatively preemptible
There are well-defined safe places where a thread running in the kernel can be preempted
The kernel is preemptible in v2.4 patches and v2.6
Windows NT has always been preemptible
Scheduling
The Linux 2.4 scheduler is O(n)
If there are 10 active tasks, it scans all 10 of them in a list in order to decide which should execute next
This means long scans and long durations under the scheduler lock
[Figure: ready list holding tasks with priorities 103, 112, 112, 101; the scan locates the highest-priority task]
Scheduling
Linux 2.6 has a revamped O(1) scheduler from Ingo Molnar that:
Calculates a task’s priority at the time it makes a scheduling decision
Has per-CPU ready queues where the tasks are pre-sorted by priority
[Figure: per-priority queues holding tasks with priorities 101, 103 and 112, 112; the scheduler picks from the highest-priority non-empty queue]
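The difference between scanning a ready list and picking from pre-sorted priority queues can be sketched as follows. This is a minimal model under stated assumptions: the class and function names are hypothetical, tasks are plain (name, priority) tuples, lower numbers are more favored as on Linux, and the bitmap trick stands in for the find-first-set instruction a real kernel would use.

```python
from collections import deque

NUM_PRIORITIES = 140  # 0 is the most favored priority in this model


def pick_next_by_scan(ready_list):
    """O(n) approach: walk the whole ready list to find the best task."""
    return min(ready_list, key=lambda task: task[1])


class PriorityArray:
    """One FIFO queue per priority plus a bitmap of the non-empty queues."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.bitmap = 0

    def enqueue(self, name, priority):
        self.queues[priority].append(name)
        self.bitmap |= 1 << priority

    def pick_next(self):
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit: the most favored non-empty queue.
        # The cost depends on the fixed number of priorities, not on the
        # number of runnable tasks.
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        queue = self.queues[priority]
        name = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << priority)
        return name, priority


if __name__ == "__main__":
    tasks = [("A", 103), ("B", 112), ("C", 112), ("D", 101)]
    array = PriorityArray()
    for name, priority in tasks:
        array.enqueue(name, priority)
    print(pick_next_by_scan(tasks))
    print(array.pick_next())
```

Both approaches select the same task; the queue-based version simply avoids touching every runnable task while holding the scheduler lock.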
Scheduling
Windows NT has always had an O(1) scheduler based on pre-sorted thread priority queues
Server 2003 introduced per-CPU ready queues
Linux load balances queues
Windows does not
Not seen as an issue in performance testing by Microsoft
Applications where it might be an issue are expected to use affinity
Zero-Copy Sendfile
Linux 2.2 introduced Sendfile to efficiently send file data over a socket
I pointed out that the initial implementation incurred a copy operation, even if the file data was cached
Linux 2.4 introduced zero-copy Sendfile
Windows NT pioneered zero-copy file sending with TransmitFile, the Sendfile equivalent, in Windows NT 4
[Figure: 1-copy vs. 0-copy send paths. 1-Copy: file data buffer, network buffer, network driver, network adapter. 0-Copy: file data buffer, network driver, network adapter]
Wake-one Socket Semantics
Linux 2.2 kernel had the thundering herd or overscheduling problem
In a network server application there are typically several threads waiting for a new connection
In v2.2 when a new connection came in all the waiters would race to get it
Ingo Molnar’s response:
5/2/99: “here he again forgets to _prove_ that overscheduling happens in Linux.”
5/7/99: “as of 2.3.1 my wake-one implementation and waitqueues rewrite went in”
In Linux 2.4 only one thread wakes up to claim the new connection
Windows NT has always had wake-one semantics
Light-Weight Synchronization
Linux 2.6 introduces Futexes
There’s only a transition to kernel mode when there’s contention
Windows has always had CriticalSections
Same behavior
Futexes go further:
Allow for prioritization of waits
Work across processes as well
Memory Management
Virtual Memory Management
Windows
32-bit versions split user-mode/kernel-mode from 2GB/2GB to 3GB/1GB
Demand-paged virtual memory
32 or 64-bits
Copy-on-write
Shared memory
Memory mapped files
[Figure: 32-bit address space with User from 0 to 2GB and System from 2GB to 4GB]
Linux
Splits user-mode/kernel-mode from 1GB/3GB to 3GB/1GB
2.6 has “4/4 split” option where kernel has its own address space
Demand-paged virtual memory
32-bits and/or 64-bits
Copy-on-write
Shared memory
Memory mapped files
[Figure: 32-bit address space with User from 0 to 3GB and System from 3GB to 4GB]
Physical Memory Management
Windows
Per-process working sets
Working set tuner adjusts sets according to memory needs using the “clock” algorithm
No “swapper”
[Figure: a process reuses a page taken from its own LRU list]
Linux
Global working set management uses “clock” algorithm
No “swapper” (the working set trimmer code is called the swap daemon, however)
[Figure: a process reuses a page taken from an LRU list shared with other processes]
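Both systems are described as using the “clock” algorithm, which approximates LRU with one reference bit per page. The following is a generic textbook sketch of that algorithm, not code from either kernel; the class name and the small frame count are illustrative assumptions.

```python
class ClockReplacer:
    """Generic clock (second-chance) page replacement over a fixed frame set."""

    def __init__(self, frame_count: int):
        self.frames = [None] * frame_count       # page held by each frame
        self.referenced = [False] * frame_count  # one reference bit per frame
        self.hand = 0                            # the sweeping clock hand
        self.faults = 0

    def access(self, page) -> bool:
        """Touch a page; return True if the access caused a page fault."""
        if page in self.frames:
            self.referenced[self.frames.index(page)] = True
            return False
        self.faults += 1
        # Sweep: a set bit buys the page a second chance and is cleared;
        # the first frame found with a clear bit becomes the victim.
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = page
        self.referenced[self.hand] = True
        self.hand = (self.hand + 1) % len(self.frames)
        return True


if __name__ == "__main__":
    replacer = ClockReplacer(frame_count=3)
    for page in (1, 2, 3, 1, 4, 2, 5):
        replacer.access(page)
    print(replacer.frames, replacer.faults)
```

In the sample run, page 2 survives the final eviction because it was touched after the previous sweep cleared its bit, while the untouched page 3 is replaced. The Windows and Linux designs differ mainly in scope: the sweep runs per process working set on Windows and globally on Linux.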
I/O Management
Windows
Centered around the file object
Layered driver architecture throughout driver types
Most I/O supports asynchronous operation
Internal interrupt request level (IRQL) controls interruptibility
Interrupts are split between an Interrupt Service Routine (ISR) and a Deferred Procedure Call (DPC)
Supports plug-and-play
Linux
Centered around the vnode
No layered I/O model
Most I/O is synchronous
Only sockets and direct disk I/O support asynchronous I/O
Internal interrupt request level (IRQL) controls interruptibility
Interrupts are split between an ISR and soft IRQ or tasklet
Supports plug-and-play
[Figure labels: IRQL, Masked]
I/O & File System Management
File Caching
Windows
Single global common cache
Virtual file cache
Caching is at file vs. disk block level
Files are memory mapped into kernel memory
Cache allows for zero-copy file serving
[Figure: layers from top to bottom: File Cache, File System Driver, Disk Driver]
Linux
Single global common cache
Virtual file cache
Caching is at file vs. disk block level
Files are memory mapped into kernel memory
Cache allows for zero-copy file serving
[Figure: layers from top to bottom: File Cache, File System Driver, Disk Driver]
Monitoring - Linux procfs
Linux supports a number of special filesystems
Like special files, they are of a more dynamic nature and tend to have side effects when accessed
Prime example is procfs (mounted at /proc)
Provides access to and control over various aspects of Linux (e.g., scheduling and memory management)
/proc/meminfo contains detailed statistics on the current memory usage of Linux
Content changes as memory usage changes over time
Services for Unix implements procfs on Windows
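Because procfs entries are read like ordinary text files, monitoring tools can be very small. The sketch below parses the `Name: value kB` lines of /proc/meminfo into a dictionary; the function name is hypothetical, and the field names printed at the end (MemTotal, MemFree) are the ones commonly present on Linux but are looked up defensively.

```python
def parse_meminfo(text: str) -> dict:
    """Parse 'Name:   value kB' lines into {name: integer value}."""
    stats = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines of the form "Name: <number> [unit]".
        if separator and fields and fields[0].isdigit():
            stats[key.strip()] = int(fields[0])
    return stats


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as meminfo:
            info = parse_meminfo(meminfo.read())
    except OSError:
        print("/proc/meminfo is not available on this system")
    else:
        print(info.get("MemTotal"), "kB total,", info.get("MemFree"), "kB free")
```

Running it twice a few seconds apart shows the values changing, since the file content is generated by the kernel at read time.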
I/O Processing
Linux 2.2 had the notion of bottom halves (BH) for low-priority interrupt processing
Fixed number of BHs
Only one BH of a given type could be active on an SMP
Linux 2.4 introduced tasklets, which are non-preemptible procedures called with interrupts enabled
Tasklets are the equivalent of Windows Deferred Procedure Calls (DPCs)
Asynchronous I/O
Linux 2.2 only supported asynchronous I/O on socket connect operations and ttys
Linux 2.6 adds asynchronous I/O for direct-disk access
AIO model includes efficient management of asynchronous I/O
Also added alternate epoll model
Useful for database servers managing their database on a dedicated raw partition
Database servers that manage a file-based database suffer from synchronous I/O
Windows I/O is inherently asynchronous
Windows has had completion ports since NT 3.5
More advanced form of AIO
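The readiness-notification style that epoll provides can be sketched with Python's standard `selectors` module, whose default selector is typically backed by epoll on Linux. This is a minimal illustration using a local socket pair rather than a real server; the function name is hypothetical. Completion ports follow a different model (the application is told that an operation has finished, not that one may be started), which this sketch does not show.

```python
import selectors
import socket


def wait_for_message(payload: bytes = b"ping") -> bytes:
    """Register a socket for readiness events and read once it is readable."""
    selector = selectors.DefaultSelector()
    reader, writer = socket.socketpair()
    try:
        reader.setblocking(False)
        selector.register(reader, selectors.EVENT_READ)
        writer.sendall(payload)
        # Block until the kernel reports a registered socket as readable.
        received = b""
        for key, _mask in selector.select(timeout=1.0):
            received += key.fileobj.recv(1024)
        return received
    finally:
        selector.close()
        reader.close()
        writer.close()


if __name__ == "__main__":
    print(wait_for_message())
```

A single thread can register many sockets with one selector in the same way, which is what makes the model attractive for servers with many mostly idle connections.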
Security
Windows
Very flexible security model based on Access Control Lists
Users are defined with
Privileges
Member groups
Security can be applied to any Object Manager object
Files, processes, synchronization objects, …
Supports auditing
Linux
Two models:
Standard UNIX model
Access Control Lists (SELinux)
Users are defined with:
Capabilities (privileges)
Member groups
Security is implemented on an object-by-object basis
Has no built-in auditing support
Version 2.6 includes Linux Security Module framework for add-on security models
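The two access-control styles named above can be contrasted in a short sketch. Both functions are simplified models with hypothetical names: the ACL check walks ordered allow/deny entries in the spirit of the Windows model, and the mode check applies the standard UNIX owner/group/other permission bits. Privileges, capabilities, superuser bypass and inheritance are deliberately left out.

```python
from dataclasses import dataclass

R, W, X = 4, 2, 1  # permission bits shared by both models in this sketch


@dataclass(frozen=True)
class Ace:
    """One access control entry: allow or deny some rights to a principal."""
    allow: bool
    principal: str
    rights: int


def acl_check(acl, principals, requested: int) -> bool:
    """Walk the entries in order; a matching deny on a still-needed right fails."""
    granted = 0
    for ace in acl:
        if ace.principal not in principals:
            continue
        if not ace.allow:
            if ace.rights & requested & ~granted:
                return False
            continue
        granted |= ace.rights & requested
        if granted == requested:
            return True
    return False


def unix_mode_check(mode, uid, gids, owner_uid, owner_gid, requested: int) -> bool:
    """Pick the owner, group or other bits, then test the requested rights."""
    if uid == owner_uid:
        bits = (mode >> 6) & 7
    elif owner_gid in gids:
        bits = (mode >> 3) & 7
    else:
        bits = mode & 7
    return (bits & requested) == requested


if __name__ == "__main__":
    acl = [Ace(False, "guests", W), Ace(True, "staff", R | W), Ace(True, "everyone", R)]
    print(acl_check(acl, {"alice", "staff", "everyone"}, R | W))
    print(unix_mode_check(0o640, 1001, {100}, 1000, 100, W))
```

The ACL form can express rules such as "staff may write, except guests" for arbitrary principals, whereas the mode bits distinguish only three fixed classes of user.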
A Look at the Future
The kernel architectures are fundamentally similar
There are differences in the details
Linux implementation is adopting more of the good ideas used in Windows
For the next 2-4 years Windows has and will maintain an edge
Linux is still behind on the cutting edge of performance tricks
Large performance team and lab at Microsoft has direct ties into the kernel developers
As time goes on the technological gap will narrow
Open Source Development Labs (OSDL) will feed performance test results to the kernel team
IBM and other vendors have Linux technology centers
Squeezing performance out of the OS gets much harder as the OS gets more tuned
Linux Technology Unknowns
Linux kernel forking
Red Hat has already done it: Red Hat Enterprise Linux v3.0 is Linux 2.4 with some Linux 2.6 features
Backward compatibility philosophy
Linus Torvalds makes decisions on kernel APIs and architecture based on technical reasons, not business reasons
44
Further Reading
Transaction Processing Performance Council (TPC): www.tpc.org
SPEC: www.spec.org
NT vs Linux benchmarks: www.kegel.com/nt-linux-benchmarks.html
The C10K problem: http://www.kegel.com/c10k.html
Linus Torvalds's home: http://www.osdl.org/
Linux Kernel Archives: http://www.kernel.org/
Linux history: http://www.firstmonday.dk/issues/issue5_11/moon/
Veritest Netbench result: http://www.veritest.com/clients/reports/microsoft/ms_netbench.pdf
Mark Russinovich’s 1999 article, “Linux and the Enterprise”: http://www.winntmag.com/Articles/Index.cfm?ArticleID=5048
The Open Group's Single UNIX Specification: http://www.unix.org/version3/

casecomp.ppt. shsjsi sjsjjsjsjsjsuaiajjajwjsjsksks

  • 1.
    Windows Operating SystemInternals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze CSE 5343/7343 CSE 5343/7343 Fall 2006 Fall 2006 Case Studies Case Studies Comparing Windows XP and Linux Comparing Windows XP and Linux
  • 2.
    2 Copyright Notice Copyright Notice ©2000-2005 David A. Solomon and Mark Russinovich © 2000-2005 David A. Solomon and Mark Russinovich These materials are part of the These materials are part of the Windows Operating Windows Operating System Internals Curriculum Development Kit, System Internals Curriculum Development Kit, developed by David A. Solomon and Mark E. developed by David A. Solomon and Mark E. Russinovich with Andreas Polze Russinovich with Andreas Polze Microsoft has licensed these materials from David Microsoft has licensed these materials from David Solomon Expert Seminars, Inc. for distribution to Solomon Expert Seminars, Inc. for distribution to academic organizations solely for use in academic academic organizations solely for use in academic environments (and not for commercial use) environments (and not for commercial use)
  • 3.
    Windows Operating SystemInternals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Background Background Architecture Architecture
  • 4.
    4 Linus and Linux Linusand Linux In 1991 Linus Torvalds took a college computer science In 1991 Linus Torvalds took a college computer science course that used the Minix operating system course that used the Minix operating system Minix is a “toy” UNIX-like OS written by Andrew Tanenbaum as a Minix is a “toy” UNIX-like OS written by Andrew Tanenbaum as a learning workbench learning workbench Linus wanted to make MINIX more usable, but Tanenbaum Linus wanted to make MINIX more usable, but Tanenbaum wanted to keep it ultra-simple wanted to keep it ultra-simple Linus went in his own direction and began working on Linus went in his own direction and began working on Linux Linux In October 1991 he announced Linux v0.02 In October 1991 he announced Linux v0.02 In March 1994 he released Linux v1.0 In March 1994 he released Linux v1.0
  • 5.
    5 Windows and Linux Windowsand Linux Both Linux and Windows are based on Both Linux and Windows are based on foundations developed in the mid-1970s foundations developed in the mid-1970s 1970 1980 1990 2000 U N I X b o r n U N I X p u b l i c U N I X V 6 L i n u x v 1 . 0 v 2 . 0 v 2 . 1 v 2 . 2 v 2 . 3 v 2 . 4 v 2 . 6 1970 1980 1990 2000 V M S v 1 . 0 W i n d o w s N T 3 . 1 N T 4 . 0 W i n d o w s 2 0 0 0 W i n d o w s X P S e r v e r 2 0 0 3
  • 6.
    6 Comparing the Architectures Comparingthe Architectures Both Linux and Windows are monolithic Both Linux and Windows are monolithic All core operating system services run in a shared address space All core operating system services run in a shared address space in kernel-mode in kernel-mode All core operating system services are part of a single module All core operating system services are part of a single module Linux: vmlinuz Linux: vmlinuz Windows: ntoskrnl.exe Windows: ntoskrnl.exe Windowing is handled differently: Windowing is handled differently: Windows has a kernel-mode Windowing subsystem Windows has a kernel-mode Windowing subsystem Linux has a user-mode X-Windowing system Linux has a user-mode X-Windowing system
  • 7.
    7 Kernel Architectures Kernel Architectures Device Drivers ProcessManagement, Memory Management, I/O Management, etc. X-Windows Application System Services User Mode Kernel Mode Hardware Dependent Code Linux Device Drivers Process Management, Memory Management, I/O Management, etc. Win32 Windowing Application System Services User Mode Kernel Mode Hardware Dependent Code Windows
  • 8.
    8 Linux Kernel Linux Kernel Linuxis a monolithic but modular system Linux is a monolithic but modular system All kernel subsystems form a single piece of code with no All kernel subsystems form a single piece of code with no protection between them protection between them Modularity is supported in two ways: Modularity is supported in two ways: Compile-time options Compile-time options Most kernel components can be built as a dynamically loadable Most kernel components can be built as a dynamically loadable kernel module (DLKM) kernel module (DLKM) DLKMs DLKMs Built separately from the main kernel Built separately from the main kernel Loaded into the kernel at runtime and on demand (infrequently Loaded into the kernel at runtime and on demand (infrequently used components take up kernel memory only when needed) used components take up kernel memory only when needed) Kernel modules can be upgraded incrementally Kernel modules can be upgraded incrementally Support for minimal kernels that automatically adapt to the Support for minimal kernels that automatically adapt to the machine and load only those kernel components that are used machine and load only those kernel components that are used
  • 9.
    9 Windows Kernel Windows Kernel Windowsis a monolithic but modular system Windows is a monolithic but modular system No protection among pieces of kernel code and drivers No protection among pieces of kernel code and drivers Support for Modularity is somewhat weak: Support for Modularity is somewhat weak: Windows Drivers allow for dynamic extension of kernel Windows Drivers allow for dynamic extension of kernel functionality functionality Windows XP Embedded has special tools / packaging rules that Windows XP Embedded has special tools / packaging rules that allow coarse-grained configuration of the OS allow coarse-grained configuration of the OS Windows Drivers are dynamically loadable kernel modules Windows Drivers are dynamically loadable kernel modules Significant amount of code run as drivers (including network Significant amount of code run as drivers (including network stacks such as TCP/IP and many services) stacks such as TCP/IP and many services) Built independently from the kernel Built independently from the kernel Can be loaded on-demand Can be loaded on-demand Dependencies among drivers can be specified Dependencies among drivers can be specified
  • 10.
    10 Comparing Portability Comparing Portability BothLinux and Windows kernels are portable Both Linux and Windows kernels are portable Mainly written in C Mainly written in C Have been ported to a range of processor architectures Have been ported to a range of processor architectures Windows Windows i486, MIPS, PowerPC, Alpha, IA-64, x86-64 i486, MIPS, PowerPC, Alpha, IA-64, x86-64 Only x86-64 and IA-64 currently supported Only x86-64 and IA-64 currently supported > 64MB memory required > 64MB memory required Linux Linux Alpha, ARM, ARM26, CRIS, H8300, i386, IA-64, M68000, MIPS, Alpha, ARM, ARM26, CRIS, H8300, i386, IA-64, M68000, MIPS, PA-RISC, PowerPC, S/390, SuperH, SPARC, VAX, v850, x86- PA-RISC, PowerPC, S/390, SuperH, SPARC, VAX, v850, x86- 64 64 DLKMs allow for minimal kernels for microcontrollers DLKMs allow for minimal kernels for microcontrollers > 4MB memory required > 4MB memory required
  • 11.
    11 Comparing Layering, APIs,Complexity Comparing Layering, APIs, Complexity Windows Windows Kernel exports about 250 system calls (accessed via ntdll.dll) Kernel exports about 250 system calls (accessed via ntdll.dll) Layered Windows/POSIX subsystems Layered Windows/POSIX subsystems Rich Windows API (17 500 functions on top of native APIs) Rich Windows API (17 500 functions on top of native APIs) Linux Linux Kernel supports about 200 different system calls Kernel supports about 200 different system calls Layered BSD, Unix Sys V, POSIX shared system libraries Layered BSD, Unix Sys V, POSIX shared system libraries Compact APIs (1742 functions in Single Unix Specification Compact APIs (1742 functions in Single Unix Specification Version 3; not including X Window APIs) Version 3; not including X Window APIs)
  • 12.
    12 Comparing Architectures Comparing Architectures Processesand scheduling Processes and scheduling SMP support SMP support Memory management Memory management I/O I/O File Caching File Caching Security Security
  • 13.
    Windows Operating SystemInternals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Process Management Process Management
  • 14.
    14 Process Management Process Management Windows Windows Process Process Addressspace, handle Address space, handle table, statistics and at least table, statistics and at least one thread one thread No inherent parent/child No inherent parent/child relationship relationship Threads Threads Basic scheduling unit Basic scheduling unit Fibers - cooperative user- Fibers - cooperative user- mode threads mode threads Linux Linux Process is called a Task Process is called a Task Basic Address space, Basic Address space, handle table, statistics handle table, statistics Parent/child relationship Parent/child relationship Basic scheduling unit Basic scheduling unit Threads Threads No threads per-se No threads per-se Tasks can act like Windows Tasks can act like Windows threads by sharing handle threads by sharing handle table, PID and address table, PID and address space space PThreads – cooperative PThreads – cooperative user-mode threads user-mode threads
  • 15.
    15 Scheduling Priorities Scheduling Priorities Windows Windows Twoscheduling classes Two scheduling classes “ “Real time” (fixed) - Real time” (fixed) - priority 16-31 priority 16-31 Dynamic - priority 1-15 Dynamic - priority 1-15 Higher priorities are Higher priorities are favored favored Priorities of dynamic Priorities of dynamic threads get boosted on threads get boosted on wakeups wakeups Thread priorities are Thread priorities are never lowered never lowered 31 15 16 0 Fixed Dynamic I/O Windows
  • 16.
    16 Scheduling Priorities Scheduling Priorities Windows Windows Twoscheduling classes Two scheduling classes “ “Real time” (fixed) - Real time” (fixed) - priority 16-31 priority 16-31 Dynamic - priority 1-15 Dynamic - priority 1-15 Higher priorities are Higher priorities are favored favored Priorities of dynamic Priorities of dynamic threads get boosted on threads get boosted on wakeups wakeups Thread priorities are Thread priorities are never lowered never lowered Linux Linux Has 3 scheduling classes: Has 3 scheduling classes: Normal – priority 100-139 Normal – priority 100-139 Fixed Round Robin – priority Fixed Round Robin – priority 0-99 0-99 Fixed FIFO – priority 0-99 Fixed FIFO – priority 0-99 Lower priorities are favored Lower priorities are favored Priorities of normal threads Priorities of normal threads go up (decay) as they use go up (decay) as they use CPU CPU Priorities of interactive Priorities of interactive threads go down (boost) threads go down (boost)
  • 17.
    17 Scheduling Priorities (cont) SchedulingPriorities (cont) 31 15 16 0 Fixed Dynamic I/O Windows 140 100 99 0 Fixed FIFO Fixed Round-Robin Normal CPU I/O Linux
  • 18.
    18 Linux Scheduling Details LinuxScheduling Details Most threads use a dynamic priority policy Most threads use a dynamic priority policy Normal class - similar to the classic UNIX scheduler Normal class - similar to the classic UNIX scheduler A newly created thread starts with a base priority A newly created thread starts with a base priority Threads that block frequently (I/O bound) will have their priority Threads that block frequently (I/O bound) will have their priority gradually increased gradually increased Threads that always exhaust their time slice (CPU bound) will Threads that always exhaust their time slice (CPU bound) will have their priority gradually decreased have their priority gradually decreased “ “Nice value” sets a thread’s base priority Nice value” sets a thread’s base priority Larger values = less priority, lower values = higher priority Larger values = less priority, lower values = higher priority Valid nice values are in the range of -20 to +20 Valid nice values are in the range of -20 to +20 Nonprivileged users can only specify positive nice value Nonprivileged users can only specify positive nice value Dynamic priority policy threads have static priority zero Dynamic priority policy threads have static priority zero Execute only when there are no runnable real-time threads Execute only when there are no runnable real-time threads
  • 19.
    19 Real-Time Scheduling onLinux Real-Time Scheduling on Linux Linux supports two static priority scheduling policies: Linux supports two static priority scheduling policies: Round-robin and FIFO (first in, first out) Round-robin and FIFO (first in, first out) Selected with the sched-setscheduler( ) system call Selected with the sched-setscheduler( ) system call Use static priority values in the range of 1 to 99 Use static priority values in the range of 1 to 99 Executed strictly in order of decreasing static priority Executed strictly in order of decreasing static priority FIFO policy lets a thread run to completion FIFO policy lets a thread run to completion Thread needs to indicate completion by calling the sched-yield( ) Thread needs to indicate completion by calling the sched-yield( ) Round-robin lets threads run for up to one time slice Round-robin lets threads run for up to one time slice Then switches to the next thread with the same static priority Then switches to the next thread with the same static priority RT threads can easily starve lower-prio threads from executing RT threads can easily starve lower-prio threads from executing Root privileges or the CAP-SYS-NICE capability are required for the Root privileges or the CAP-SYS-NICE capability are required for the selection of a real-time scheduling policy selection of a real-time scheduling policy Long running system calls can cause priority-inversion Long running system calls can cause priority-inversion Same as in Windows; but cmp. rtLinux Same as in Windows; but cmp. rtLinux
  • 20.
    20 Windows Scheduling Details WindowsScheduling Details Most threads run in variable priority levels Most threads run in variable priority levels Priorities 1-15; Priorities 1-15; A newly created thread starts with a base priority A newly created thread starts with a base priority Threads that complete I/O operations experience priority Threads that complete I/O operations experience priority boosts (but never higher than 15) boosts (but never higher than 15) A thread’s priority will never be below base priority A thread’s priority will never be below base priority The Windows API function SetThreadPriority() sets the The Windows API function SetThreadPriority() sets the priority value for a specified thread priority value for a specified thread This value, together with the priority class of the thread's This value, together with the priority class of the thread's process, determines the thread's base priority level process, determines the thread's base priority level Windows will dynamically adjust priorities for non-realtime Windows will dynamically adjust priorities for non-realtime threads threads
  • 21.
    21 Real-Time Scheduling onWindows Real-Time Scheduling on Windows Windows supports static round-robin scheduling policy for Windows supports static round-robin scheduling policy for threads with priorities in real-time range (16-31) threads with priorities in real-time range (16-31) Threads run for up to one quantum Threads run for up to one quantum Quantum is reset to full turn on preemption Quantum is reset to full turn on preemption Priorities never get boosted Priorities never get boosted RT threads can starve important system services RT threads can starve important system services Such as CSRSS.EXE Such as CSRSS.EXE SeIncreaseBasePriorityPrivilege required to elevate a thread’s SeIncreaseBasePriorityPrivilege required to elevate a thread’s priority into real-time range (this privilege is assigned to priority into real-time range (this privilege is assigned to members of Administrators group) members of Administrators group) System calls and DPC/APC handling can cause priority System calls and DPC/APC handling can cause priority inversion inversion
  • 22.
    22 Scheduling Timeslices Scheduling Timeslices Windows Windows Thethread timeslice The thread timeslice (quantum) is 10ms-120ms (quantum) is 10ms-120ms When quanta can vary, When quanta can vary, has one of 2 values has one of 2 values Reentrant and Reentrant and preemptible preemptible Fixed: 120ms 20ms Foreground: 60ms Background Linux Linux The thread quantum is The thread quantum is 10ms-200ms 10ms-200ms Default is 100ms Default is 100ms Varies across entire Varies across entire range based on priority, range based on priority, which is based on which is based on interactivity level interactivity level Reentrant and Reentrant and preemptible preemptible 100ms 200ms 10ms
  • 23.
    23 Kernel Reentrancy Kernel Reentrancy MarkRussinovich’s April 1999 Windows NT Magazine article, “Linux Mark Russinovich’s April 1999 Windows NT Magazine article, “Linux and the Enterprise”, pointed out that much of the Linux 2.2 was not and the Enterprise”, pointed out that much of the Linux 2.2 was not reentrant reentrant Ingo Molnar stated in rebuttal: Ingo Molnar stated in rebuttal: “ “his example is a clear red herring.” his example is a clear red herring.” A month later he made all major paths reentrant A month later he made all major paths reentrant cpu 1 cpu 2 cpu 1 cpu 2 Non-reentrant Reentrant Time Saved
  • 24.
    24 Kernel Preemptibility Kernel Preemptibility Apreemptible kernel is more responsive to high-priority A preemptible kernel is more responsive to high-priority tasks tasks Through the base release of v2.4 Linux was only Through the base release of v2.4 Linux was only cooperatively cooperatively preemptible preemptible There are well-defined safe places where a thread running in the There are well-defined safe places where a thread running in the kernel can be preempted kernel can be preempted The kernel is preemptible in v2.4 patches and v2.6 The kernel is preemptible in v2.4 patches and v2.6 Windows NT has always been preemptible Windows NT has always been preemptible
  • 25.
    25 Scheduling Scheduling The Linux 2.4scheduler is O(n) The Linux 2.4 scheduler is O(n) If there are 10 active tasks, it scans 10 of them in a list in order to If there are 10 active tasks, it scans 10 of them in a list in order to decide which should execute next decide which should execute next This means long scans and long durations under the scheduler lock This means long scans and long durations under the scheduler lock 103 112 112 101 Ready List Highest Priority Task
  • 26.
    26 Scheduling Scheduling Linux 2.6 hasa revamped scheduler that’s O(1) from Ingo Molnar Linux 2.6 has a revamped scheduler that’s O(1) from Ingo Molnar that: that: Calculates a task’s priority at the time it makes scheduling decision Calculates a task’s priority at the time it makes scheduling decision Has per-CPU ready queues where the tasks are pre-sorted by priority Has per-CPU ready queues where the tasks are pre-sorted by priority 112 112 101 103 Highest-priority Non-empty Queue
  • 27.
    27 Scheduling Scheduling Windows NT hasalways had an O(1) scheduler based Windows NT has always had an O(1) scheduler based on pre-sorted thread priority queues on pre-sorted thread priority queues Server 2003 introduced per-CPU ready queues Server 2003 introduced per-CPU ready queues Linux load balances queues Linux load balances queues Windows does not Windows does not Not seen as an issue in performance testing by Microsoft Not seen as an issue in performance testing by Microsoft Applications where it might be an issue are expected to use affinity Applications where it might be an issue are expected to use affinity
  • 28.
    28 Zero-Copy Sendfile Zero-Copy Sendfile Linux2.2 introduced Sendfile to efficiently send file data over a Linux 2.2 introduced Sendfile to efficiently send file data over a socket socket I pointed out that the initial implementation incurred a copy operation, I pointed out that the initial implementation incurred a copy operation, even if the file data was cached even if the file data was cached Linux 2.4 introduced zero-copy Sendfile Linux 2.4 introduced zero-copy Sendfile Windows NT pioneered zero-copy file sending with TransmitFile, the Windows NT pioneered zero-copy file sending with TransmitFile, the Sendfile equivalent, in Windows NT 4 Sendfile equivalent, in Windows NT 4 File Data Buffer Network Adapter Buffer Network File Data Buffer Network Driver Network Network Driver 1-Copy 0-Copy
  • 29.
    29 Wake-one Socket Semantics Wake-oneSocket Semantics Linux 2.2 kernel had the Linux 2.2 kernel had the thundering herd thundering herd or or overscheduling overscheduling problem problem In a network server application there are typically several In a network server application there are typically several threads waiting for a new connection threads waiting for a new connection In v2.2 when a new connection came in all the waiters would In v2.2 when a new connection came in all the waiters would race to get it race to get it Ingo Molnar’s response: Ingo Molnar’s response: 5/2/99: “here he again forgets to _prove_ that overscheduling 5/2/99: “here he again forgets to _prove_ that overscheduling happens in Linux.” happens in Linux.” 5/7/99: “as of 2.3.1 my wake-one implementation and 5/7/99: “as of 2.3.1 my wake-one implementation and waitqueues rewrite went in” waitqueues rewrite went in” In Linux 2.4 only one thread wakes up to claim the new In Linux 2.4 only one thread wakes up to claim the new connection connection Windows NT has always had wake-1 semantics Windows NT has always had wake-1 semantics
  • 30.
    30 Light-Weight Synchronization Light-Weight Synchronization Linux2.6 introduces Futexes Linux 2.6 introduces Futexes There’s only a transition to kernel-mode when there’s There’s only a transition to kernel-mode when there’s contention contention Windows has always had CriticalSections Windows has always had CriticalSections Same behavior Same behavior Futexes go further: Futexes go further: Allow for prioritization of waits Allow for prioritization of waits Works interprocess as well Works interprocess as well
  • 31.
    Windows Operating SystemInternals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Memory Management Memory Management
  • 32.
    32 Virtual Memory Management VirtualMemory Management Windows Windows 32-bit versions split 32-bit versions split user-mode/kernel-mode from user-mode/kernel-mode from 2GB/2GB to 3GB/1GB 2GB/2GB to 3GB/1GB Demand-paged virtual memory Demand-paged virtual memory 32 or 64-bits 32 or 64-bits Copy-on-write Copy-on-write Shared memory Shared memory Memory mapped files Memory mapped files User System 0 2GB 4GB Linux Linux Splits user-mode/kernel-mode Splits user-mode/kernel-mode from 1GB/3GB to 3GB/1GB from 1GB/3GB to 3GB/1GB 2.6 has “4/4 split” option where 2.6 has “4/4 split” option where kernel has its own address kernel has its own address space space Demand-paged virtual memory Demand-paged virtual memory 32-bits and/or 64-bits 32-bits and/or 64-bits Copy-on-write Copy-on-write Shared memory Shared memory Memory mapped files Memory mapped files User System 0 3GB 4GB
  • 33.
    33 Physical Memory Management PhysicalMemory Management Windows Windows Per-process working sets Per-process working sets Working set tuner adjust Working set tuner adjust sets according to memory sets according to memory needs using the “clock” needs using the “clock” algorithm algorithm No “swapper” No “swapper” Process LRU Reused Page Linux Linux Global working set Global working set management management uses “clock” algorithm uses “clock” algorithm No “swapper” (the working No “swapper” (the working set trimmer code is called set trimmer code is called the swap daemon, however) the swap daemon, however) LRU Reused Page Other Process LRU
  • 34.
    34 I/O Management I/O Management Windows Windows Centeredaround the file object Centered around the file object Layered driver architecture Layered driver architecture throughout driver types throughout driver types Most I/O supports asynchronous Most I/O supports asynchronous operation operation Internal interrupt request level Internal interrupt request level (IRQL) controls interruptability (IRQL) controls interruptability Interrupts are split between an Interrupts are split between an Interrupt Service Routine (ISR) Interrupt Service Routine (ISR) and a Deferred Procedure Call and a Deferred Procedure Call (DPC) (DPC) Supports plug-and-play Supports plug-and-play Linux Linux Centered around the vnode Centered around the vnode No layered I/O model No layered I/O model Most I/O is synchronous Most I/O is synchronous Only sockets and direct disk Only sockets and direct disk I/O support asynchronous I/O support asynchronous I/O I/O Internal interrupt request level Internal interrupt request level (IRQL) controls interruptability (IRQL) controls interruptability Interrupts are split between an Interrupts are split between an ISR and soft IRQ or tasklet ISR and soft IRQ or tasklet Supports plug-and-play Supports plug-and-play IRQL Masked
  • 35.
    Windows Operating SystemInternals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze I/O & File System I/O & File System Management Management
  • 36.
    36 File Caching File Caching Windows Windows Singleglobal common cache Single global common cache Virtual file cache Virtual file cache Caching is at file vs. disk block Caching is at file vs. disk block level level Files are memory mapped into Files are memory mapped into kernel memory kernel memory Cache allows for zero-copy file Cache allows for zero-copy file serving serving File Cache File System Driver Disk Driver Linux Linux Single global common cache Single global common cache Virtual file cache Virtual file cache Caching is at file vs. disk block Caching is at file vs. disk block level level Files are memory mapped into Files are memory mapped into kernel memory kernel memory Cache allows for zero-copy file Cache allows for zero-copy file serving serving File Cache File System Driver Disk Driver
  • 37.
    37 Monitoring - Linuxprocfs Monitoring - Linux procfs Linux supports a number of special filesystems Linux supports a number of special filesystems Like special files, they are of a more dynamic nature and tend to have side Like special files, they are of a more dynamic nature and tend to have side effects when accessed effects when accessed Prime example is procfs Prime example is procfs (mounted at /proc) (mounted at /proc) provides access to and control over various aspects of Linux (I.e.; scheduling provides access to and control over various aspects of Linux (I.e.; scheduling and memory management) and memory management) /proc/meminfo contains detailed statistics on the current memory usage of Linux /proc/meminfo contains detailed statistics on the current memory usage of Linux Content changes as memory usage changes over time Content changes as memory usage changes over time Services for Unix implements procfs on Windows Services for Unix implements procfs on Windows
  • 38.
    38 I/O Processing I/O Processing Linux2.2 had the notion of bottom halves (BH) for low- Linux 2.2 had the notion of bottom halves (BH) for low- priority interrupt processing priority interrupt processing Fixed number of BHs Fixed number of BHs Only one BH of a given type could be active on a SMP Only one BH of a given type could be active on a SMP Linux 2.4 introduced Linux 2.4 introduced tasklets tasklets, which are non-preemptible , which are non-preemptible procedures called with interrupts enabled procedures called with interrupts enabled Tasklets are the equivalent of Windows Deferred Tasklets are the equivalent of Windows Deferred Procedure Calls (DPCs) Procedure Calls (DPCs)
  • 39.
    39 Asynchronous I/O Asynchronous I/O Linux2.2 only supported asynchronous I/O on socket Linux 2.2 only supported asynchronous I/O on socket connect operations and tty’s connect operations and tty’s Linux 2.6 adds asynchronous I/O for direct-disk access Linux 2.6 adds asynchronous I/O for direct-disk access AIO model includes efficient management of asynchronous I/O AIO model includes efficient management of asynchronous I/O Also added alternate epoll model Also added alternate epoll model Useful for database servers managing their database on a Useful for database servers managing their database on a dedicated raw partition dedicated raw partition Database servers that manage a file-based database suffer from Database servers that manage a file-based database suffer from synchronous I/O synchronous I/O Windows I/O is inherently asynchronous Windows I/O is inherently asynchronous Windows has had completion ports since NT 3.5 Windows has had completion ports since NT 3.5 More advanced form of AIO More advanced form of AIO
  • 40.
    Windows Operating SystemInternals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze Security Security
  • 41.
    41 Security Security Windows Windows Very flexible securitymodel based on Very flexible security model based on Access Control Lists Access Control Lists Users are defined with Users are defined with Privileges Privileges Member groups Member groups Security can be applied to any Object Security can be applied to any Object Manager object Manager object Files, processes, synchronization Files, processes, synchronization objects, … objects, … Supports auditing Supports auditing Linux Linux Two models: Two models: Standard UNIX model Standard UNIX model Access Control Lists (SELinux) Access Control Lists (SELinux) Users are defined with: Users are defined with: Capabilities (privileges) Capabilities (privileges) Member groups Member groups Security is implemented on an Security is implemented on an object-by-object basis object-by-object basis Has no built-in auditing support Has no built-in auditing support Version 2.6 includes Linux Security Version 2.6 includes Linux Security Module framework for add-on Module framework for add-on security models security models
  • 42.
    A Look at the Future
    The kernel architectures are fundamentally similar
      There are differences in the details
      Linux implementation is adopting more of the good ideas used in Windows
    For the next 2-4 years Windows has and will maintain an edge
      Linux is still behind on the cutting edge of performance tricks
      Large performance team and lab at Microsoft has direct ties into the kernel developers
    As time goes on the technological gap will narrow
      Open Source Development Labs (OSDL) will feed performance test results to the kernel team
      IBM and other vendors have Linux technology centers
      Squeezing performance out of the OS gets much harder as the OS gets more tuned
  • 43.
    Linux Technology Unknowns
    Linux kernel forking
      RedHat has already done it: Red Hat Enterprise Server v3.0 is Linux 2.4 with some Linux 2.6 features
    Backward compatibility philosophy
      Linus Torvalds makes decisions on kernel APIs and architecture based on technical reasons, not business reasons
  • 44.
    Further Reading
    Transaction Processing Performance Council: www.tpc.org
    SPEC: www.spec.org
    NT vs Linux benchmarks: www.kegel.com/nt-linux-benchmarks.html
    The C10K problem: http://www.kegel.com/c10k.html
    Linus Torvalds's home: http://www.osdl.org/
    Linux Kernel Archives: http://www.kernel.org/
    Linux history: http://www.firstmonday.dk/issues/issue5_11/moon/
    Veritest Netbench result: http://www.veritest.com/clients/reports/microsoft/ms_netbench.pdf
    Mark Russinovich's 1999 article, "Linux and the Enterprise": http://www.winntmag.com/Articles/Index.cfm?ArticleID=5048
    The Open Group's Single UNIX Specification: http://www.unix.org/version3/