Introduction to Windows Kernel

         Sisimon Soman
How a system call works
• User mode code cannot enter kernel space directly with a jmp or a call
  instruction.
• When an application makes a system call (for example through CreateFile
  or ReadFile), the processor enters kernel mode (Ring 0) with the
  instruction int 2E, which is dispatched through an interrupt gate.
• The code segment descriptor records the ‘Ring’ at which the code can
  run. For kernel mode modules it is always Ring 0. If a user mode program
  tries ‘jmp <kernel mode address>’, it causes an access violation, because
  the segment descriptor says the processor must be in Ring 0.
• Kernel mode is entered very often (most Windows API calls end up in
  kernel mode), so sysenter was introduced as a newer, optimized
  instruction for entering kernel mode.
Rings continued..
System Call continued..
• Windows maintains a system service dispatch table, which is
  similar to the IDT. Each entry in the system service table points
  to a kernel mode system call routine.
• The int 2E handler probes and copies the parameters from the user
  mode stack to the thread’s kernel mode stack, then fetches and
  executes the correct system call routine from the system service
  table.
• There are multiple system service tables: one table for the NT
  Native APIs, one table for IIS, one for GDI, etc.
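
The dispatch step described above can be sketched as a toy user mode model. This is not kernel code: the routine names, the service numbers (plain list indexes) and the function `system_service_trap` are assumptions made for illustration only.

```python
# Toy user-mode model of system service dispatch; not real kernel code.
# Routine names and service numbers are illustrative assumptions.

def nt_read_file(handle, length):
    return f"read {length} bytes from handle {handle}"

def nt_write_file(handle, data):
    return f"wrote {len(data)} bytes to handle {handle}"

# One table per component; each entry points to a "kernel mode" routine.
NATIVE_TABLE = [nt_read_file, nt_write_file]
SERVICE_TABLES = [NATIVE_TABLE]

def system_service_trap(table_index, service_number, user_stack):
    """Model of the trap handler: validate, copy arguments, dispatch."""
    table = SERVICE_TABLES[table_index]
    if not 0 <= service_number < len(table):
        raise ValueError("invalid system service number")
    # Copy the parameters from the user stack to the kernel stack.
    kernel_stack = list(user_stack)
    return table[service_number](*kernel_stack)
```

The caller only supplies a number and a parameter block, much like eax and the parameter pointer in the stub on the next slide; it never gets the address of the routine itself.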
Let's try it in WinDbg..
• NtWriteFile:
  mov eax, 0x0E ; build 2195 system service number for NtWriteFile
  mov ebx, esp ; point to parameters
  int 0x2E ; execute system service trap
  ret 0x2C ; pop parameters off stack and return to caller
App issues ReadFile
• User Land: the application calls ReadFile, which calls NtReadFile.
• Kernel Land: the IO Manager creates an IRP and sends it to the driver
  stack:
  – File System
  – Volume Manager
  – Disk Class Driver
  – Hardware Driver
What is an IO Request Packet (IRP)
• An IO operation passes through,
  – Different stages.
  – Different threads.
  – Different drivers.
• The IRP encapsulates the IO request.
• The IRP is thread independent.
IRP Continued..
• Compare an IRP with the MSG structure used for Windows
  messages.
• Each driver in the stack does its own task and then
  forwards the IRP to the next lower driver in the
  stack.
• An IRP can be processed synchronously or
  asynchronously.
IRP Continued..

• The lower level hardware driver usually takes the most time.
  The H/W driver can mark the IRP as pending and return.
• When the H/W finishes the IO, the H/W driver completes the IRP
  by calling IoCompleteRequest().
• IoCompleteRequest() calls the IO completion routines set by
  the drivers in the stack and completes the IO.
Structure of IRP
• Fixed IRP Header
• Variable number of stack locations, one sub stack per driver:
  – Stack Location 1
  – Stack Location 2
  – Stack Location 3
  – Stack Location N
Flow of IRP
The IRP is forwarded to the lower driver in the stack.

              Storage Stack                     IRP for Storage Stack

                                                IRP Header
              File System                       Stack Location 1
              Volume Manager                    Stack Location 2
              Disk Class Driver                 Stack Location 3
              Hardware Driver                   Stack Location 4
Flow of IRP Completion
The completion routines are called while the IRP is being completed.

   Storage Stack                              IRP for Storage Stack

                                              IRP Header
   File System – Completion Routine           Stack Location 1
   Volume Manager – Completion Routine        Stack Location 2
   Disk Class Driver – Completion Routine     Stack Location 3
   Hardware Driver – Complete the IRP         Stack Location 4
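
The two flows above can be sketched together as a toy model, assuming the four-driver storage stack from the slides. The class `Irp` and the functions `send_down` and `complete_request` are hypothetical names; `complete_request` only imitates the role of IoCompleteRequest() and simplifies where a real driver stores its completion routine.

```python
# Toy model of an IRP travelling down a driver stack and completing back up.
# All names here are illustrative; this is not the WDK API.

DRIVERS = ["File System", "Volume Manager", "Disk Class Driver", "Hardware Driver"]

class Irp:
    def __init__(self, stack_size):
        # Fixed header fields plus one stack location per driver.
        self.status = "pending"
        self.stack = [{"driver": None, "completion": None} for _ in range(stack_size)]
        self.log = []

def complete_request(irp):
    """Mark the IRP complete and call completion routines bottom to top."""
    irp.status = "success"
    for location in reversed(irp.stack):
        if location["completion"]:
            location["completion"](irp)

def send_down(irp):
    """Forward the IRP from the top driver to the hardware driver."""
    for level, name in enumerate(DRIVERS):
        irp.stack[level]["driver"] = name
        irp.log.append(f"dispatch: {name}")
        if level < len(DRIVERS) - 1:
            # Upper drivers set a completion routine before forwarding.
            irp.stack[level]["completion"] = (
                lambda i, n=name: i.log.append(f"completion: {n}")
            )
    # The lowest driver finishes the IO and completes the IRP.
    complete_request(irp)
```

Calling `send_down(Irp(len(DRIVERS)))` records the dispatch order top to bottom and the completion order bottom to top, which is the point of the two diagrams.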
IRP Header
• IO buffer information.
• Flags
  – Paging IO flag
  – Non-cached IO flag
• IO status – set on completion to the final status of
  the IO.
• IRP cancel routine
IRP Stack Location
• The IO Manager gets the number of drivers in the stack
  from the top device object in the stack (its StackSize field).
• When it creates an IRP, the IO Manager allocates as many
  IO stack locations as that count.
Contents of IO Stack Location
• IO Completion routine specific to the driver.
• File object specific to the request.
Software Interrupt Request Levels (IRQLs)
• Windows has its own interrupt priority scheme, known as
  IRQL.
• IRQLs range from 0 to 31; a higher number means a higher
  priority interrupt level.
• The HAL maps hardware interrupts to IRQL 3 (Device 1) through
  IRQL 31 (High).
• When a higher priority interrupt occurs, it masks all lower
  priority interrupts and its ISR is executed.
• After that ISR has run, the kernel lowers the interrupt level and
  executes the ISR of the lower priority interrupt.
• An ISR should do minimal work and defer the major chunk of
  the work to a Deferred Procedure Call (DPC), which runs at
  the lower IRQL 2.
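
The masking rule can be sketched with a small simulation. It is only a model under simplifying assumptions: the class `Processor`, its methods, and the interrupt names and levels used in the usage note are made up for illustration.

```python
# Toy model of IRQL masking on one processor; names and levels are illustrative.

PASSIVE_LEVEL = 0

class Processor:
    def __init__(self):
        self.irql = PASSIVE_LEVEL
        self.pending = []  # interrupts masked by the current IRQL
        self.trace = []

    def interrupt(self, level, name, isr=None):
        """Run the ISR now only if the interrupt outranks the current IRQL."""
        if level <= self.irql:
            self.pending.append((level, name, isr))
            return
        previous, self.irql = self.irql, level
        self.trace.append(f"enter {name}")
        if isr:
            isr(self)  # the ISR body may itself be interrupted
        self.trace.append(f"leave {name}")
        self.lower_irql(previous)

    def lower_irql(self, level):
        """Lower the IRQL and deliver interrupts that are no longer masked."""
        self.irql = level
        while True:
            ready = [p for p in self.pending if p[0] > self.irql]
            if not ready:
                break
            highest = max(ready, key=lambda p: p[0])
            self.pending.remove(highest)
            self.interrupt(*highest)
```

For example, if a level 5 ISR is running and both a level 28 and a level 3 interrupt arrive, the level 28 ISR runs immediately (nested), while the level 3 ISR waits until the level 5 ISR has finished and the IRQL has dropped.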
Software Interrupt Request Levels (IRQLs)
IRQL and DPC
• The DPC concept also exists in other operating systems;
  in Linux it is called a bottom half.
• The DPC queue is per processor, which means a dual
  processor SMP box has two DPC queues.
• The ISR generally fetches data from the
  hardware and queues a DPC for further
  processing.
• IRQL priority is different from thread
  scheduling priority.
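
The ISR and DPC split can be sketched as follows. This is a toy model, not driver code: `Cpu`, `isr` and `drain_dpcs` are hypothetical names, and upper-casing a string stands in for the "further processing".

```python
# Toy model of an ISR deferring work to a per-processor DPC queue.
from collections import deque

class Cpu:
    def __init__(self, number):
        self.number = number
        self.dpc_queue = deque()  # each processor has its own DPC queue

def isr(cpu, device_data, processed):
    """Do the minimum at device IRQL: capture the data and queue a DPC."""
    cpu.dpc_queue.append(
        lambda: processed.append((cpu.number, device_data.upper()))
    )

def drain_dpcs(cpu):
    """Run the queued DPCs, as happens once the IRQL drops to dispatch level."""
    while cpu.dpc_queue:
        dpc = cpu.dpc_queue.popleft()
        dpc()
```

Nothing is processed inside `isr` itself; the work happens later, in queue order, when `drain_dpcs` runs for that processor.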
IRQL and DPC
• The scheduler (dispatcher) also runs at IRQL 2.
• So code that executes at or above IRQL 2 (dispatch
  level) cannot be preempted by the scheduler.
• In the diagram, only hardware interrupts and some
  higher priority interrupts such as clock and power fail
  are above IRQL 2.
• Most of the time the OS is at IRQL 0 (passive level).
• All user programs and most of the kernel code
  execute at passive level only.
IRQL continued..
• The scheduler runs at IRQL 2, so what happens if my driver tries to wait
  at or above dispatch level?
• Simple: the system will crash with a ‘Blue Screen’, usually with the bug
  check ID IRQL_NOT_LESS_OR_EQUAL.
• If a thread waits at or above dispatch level, the scheduler cannot run to
  switch to another thread.
• What happens if code tries to access PagedPool at or above dispatch level?
• If the pages are on disk, a page fault exception occurs. The current
  thread needs to wait while the page fault handler reads the pages from
  the page file into page frames in memory.
• If the page fault happens at or above dispatch level, nothing can suspend
  the current thread and schedule the page fault handler. Thus PagedPool
  cannot be accessed at or above dispatch level.
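
The two rules above can be written down as simple checks. This is a sketch of the reasoning only: `BugCheck`, `wait_for_event` and `touch_paged_pool` are hypothetical names, and a Python exception stands in for the blue screen.

```python
# Toy model of the "no waiting, no paged memory at dispatch level" rules.

PASSIVE_LEVEL, APC_LEVEL, DISPATCH_LEVEL = 0, 1, 2

class BugCheck(Exception):
    """Stands in for a system crash (blue screen)."""

def wait_for_event(current_irql):
    """Waiting needs the scheduler, which cannot run at dispatch level or above."""
    if current_irql >= DISPATCH_LEVEL:
        raise BugCheck("IRQL_NOT_LESS_OR_EQUAL")
    return "thread blocked; scheduler switches to another thread"

def touch_paged_pool(current_irql, page_resident):
    """A non-resident page needs a page fault, which implies a wait."""
    if page_resident:
        return "access ok"
    if current_irql >= DISPATCH_LEVEL:
        raise BugCheck("IRQL_NOT_LESS_OR_EQUAL")
    return "page fault handled; page read from page file"
```

Note that an access at dispatch level only fails in this model when the page is not resident, which is why such bugs can stay hidden until memory pressure pages the data out.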
IRQL 1 - APCs
• Asynchronous Procedure Calls (APCs) run at IRQL 1.
• The main duty of an APC is to deliver data in the context of a
  particular user thread.
• The APC queue is thread specific; each thread has its own APC
  queue.
• A user space thread initiates a read operation on a device
  and either waits for it to finish or continues with another job.
• The IO may finish some time later, and the buffer then needs to
  be delivered in the calling thread’s process context. That is the
  duty of the APC.
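
The hand-off described above can be sketched as a toy model. `Thread`, `complete_read` and `deliver_apcs` are hypothetical names, and setting an attribute stands in for copying the buffer into the requesting thread's address space.

```python
# Toy model of per-thread APC queues delivering IO results.
from collections import deque

class Thread:
    def __init__(self, name):
        self.name = name
        self.apc_queue = deque()  # each thread has its own APC queue
        self.user_buffer = None

def complete_read(thread, data):
    """The IO finished in some other context: queue an APC to the requester."""
    thread.apc_queue.append(lambda: setattr(thread, "user_buffer", data))

def deliver_apcs(thread):
    """Run the queued APCs in the context of the target thread."""
    while thread.apc_queue:
        apc = thread.apc_queue.popleft()
        apc()
```

The buffer only becomes visible to the requesting thread once its own APC queue is drained; other threads' queues are untouched.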
