FlexSC
Flexible System Call Scheduling with
    Exception-Less System Calls

 Livio Soares and Michael Stumm
        University of Toronto
Motivation
  The synchronous system call interface is a
       legacy from the single core era


                       Expensive! Costs are:
                        ➔ direct: mode-switch
                        ➔ indirect: processor

                               structure pollution


  FlexSC implements efficient and flexible
      system calls for the multicore era
                                                2
FlexSC overview
Two contributions: FlexSC and FlexSC-Threads




Results in:
  1) MySQL throughput increase of up to 40%
     and latency reduction of 30%
  2) Apache throughput increase of up to 115%
     and latency reduction of 50%               3
Performance impact of synchronous syscalls
➔   Xalan from SPEC CPU 2006
    ➔   Virtually no time in the OS
➔   Linux on Intel Core i7 (Nehalem)
➔   Injected exceptions with varying frequencies
    ➔ Direct: emulate null system call
      Direct
    ➔ Indirect: emulate “write()” system call
      Indirect
➔   Measured only user-mode time
    ➔   Kernel time ignored

    Ideally, user-mode performance is unaltered
                                                   4
Degradation due to sync. syscalls
 Degradation (lower is faster)
                                              Xalan (SPEC CPU 2006)
                                 70%
                                 60%
                                             Apache                      Indirect
                                 50%                                     Direct
                                 40%                        MySQL
                                 30%
                                 20%
                                 10%
                                 0%
                                   1K       2K     5K     10K     20K    50K       100K
                                       user-mode instructions between exceptions
                                                       (log scale)

        System calls can half processor efficiency;
            indirect cause is major contributor
                                                                                          5
Processor state pollution

➔   Key source of performance impact

➔   On a Linux write() call:
                rd
    ➔   up to 2/3 of the L1 data cache and data
        TLB are evicted

➔   Kernel performance equally affected
    ➔   Processor efficiency for OS code is also cut
        in half
                                                       6
Synchronous system calls are expensive




User

Kernel


  Traditional system calls are synchronous
    and use exceptions to cross domains
                                             7
Alternative: side-step the boundary




User

Kernel



Exception-less syscalls remove synchronicity
  by decoupling invocation from execution
                                          8
Benefits of exception-less system calls
 ➔   Significantly reduce direct costs
     ➔   Fewer mode switches

User
 ➔ Allow for batching


Kernel
   ➔ Reduce indirect costs




 ➔   Allow for dynamic multicore specialization
     ➔   Further reduce direct and indirect costs
                                                    9
Exception-less interface: syscall page
write(fd, buf, 4096);



entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;
entry->status = SUBMIT;
                SUBMIT

while (entry->status != DONE)
                        DONE
   do_something_else();

return entry->return_code;

                                         10
Exception-less interface: syscall page
write(fd, buf, 4096);



entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;            SUBMIT
entry->status = SUBMIT;
                SUBMIT

while (entry->status != DONE)
                        DONE
   do_something_else();

return entry->return_code;

                                           11
Exception-less interface: syscall page
write(fd, buf, 4096);



entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;            DONE
entry->status = SUBMIT;
                SUBMIT

while (entry->status != DONE)
                        DONE
   do_something_else();

return entry->return_code;

                                         12
Syscall threads
➔   Kernel-only threads
    ➔   Part of application process
➔   Execute requests from syscall page
➔   Schedulable on a per-core basis




                                         13
System call batching




  Request as many system calls as possible
  Switch to kernel-mode
  Start executing all posted system calls

     Avoids direct and indirect costs,
          even on a single core          14
Dynamic multicore specialization




 FlexSC makes specializing cores simple
 Dynamically adapts to workload needs
                                          15
What programs can benefit from FlexSC?
Event-driven servers
(e.g., memcached, nginx webserver)
  ➔   Use asynchoronous calls, similar to FlexSC
  ➔   Can use FlexSC directly
  ➔   Mix sync and exception-less system calls

Multi-threaded servers: FlexSC-Threads
  ➔   Thread library, compatible with Pthreads
  ➔   No changes to app. code or recompilation required
  ➔   Transparently converts legacy syscalls into
      exception-less ones
                                                      16
FlexSC-Threads library
➔   Hybrid (M-on-N) threading model
    ➔ One kernel visible thread per core
    ➔ Many user threads per kernel-visible thread



➔   Redirects system calls (libc wrappers)
    ➔ Posts exception-less syscall to syscall page
    ➔ Switches to other user-level thread

    ➔ Resumes thread upon syscall completion



     Benefits of exception-less syscalls
while maintaining sequential syscall interface
                                                     17
FlexSC-Threads in action




User




                           18
FlexSC-Threads in action




On a syscall:
  Post request to system call page
  Block user-level thread
                                     19
FlexSC-Threads in action




Kernel

 On a syscall:
   Post request to system call page
   Block user-level thread
   Switch to next ready thread        20
FlexSC-Threads in action




User

Kernel

  If all user-level threads become blocked:
  1) enter kernel
  2) wait for completion of at least 1 syscall
                                                 21
Evaluation
➔   Linux 2.6.33

➔   Nehalem (Core i7) server, 2.3GHz
    ➔   4 cores on a chip

➔   Clients connected on 1 Gbps network

➔   Workloads
    ➔   Sysbench on MySQL (80% user, 20% kernel)
    ➔   ApacheBench on Apache (50% user, 50% kernel)

➔   Default Linux NTPL (“sync”) vs.
                         sync
         FlexSC-Threads (“flexsc”)
                          flexsc
                                                       22
Sysbench: “OLTP” on MySQL (1 core)

                  500

                  400
(requests/sec.)
  Throughput




                  300
                         15% improvement
                  200
                                                    flexsc
                  100
                                                    sync
                    0
                     0   50     100    150    200   250      300
                              Request Concurrency

                                                               23
Sysbench: “OLTP” on MySQL (4 cores)

                  1,000

                   800
(requests/sec.)
  Throughput




                   600
                               40% improvement
                   400

                   200                                flexsc
                                                      sync
                     0
                      0   50      100    150   200   250   300
                               Request Concurrency

                                                               24
MySQL latency per client request
                                     256 connections
                        1900
               1,000
                 900                                   95th
                 800                                   percentile
Latency (ms)




                 700                                   average
                 600
                 500
                 400
                 300
                 200
                 100
                   0
                       sync flexsc       sync flexsc   sync flexsc
                         1 core            2 cores       4 cores

                       Up to 30% reduction of average
                              request latencies
                                                                     25
MySQL processor metrics
                                               SysBench (4 cores)
                       1.4
                       1.2
Relative Performance




                                        User                             Kernel
                        1
    (flexsc/sync)




                       0.8
                       0.6
                       0.4
                       0.2
                        0
                                   L3    d-cache TLB          IPC      L2   i-cache Branch
                             IPC        L2   i-cache Branch         L3  d-cache TLB

                       Performance improvements consequence of
                            more efficient processor execution
                                                                                         26
ApacheBench throughput (1 core)

                  45,000
                  40,000                               flexsc
                  35,000                               sync
(requests/sec.)
  Throughput




                  30,000
                  25,000
                  20,000
                  15,000           80-90% improvement
                  10,000
                   5,000
                       0
                        0   200   400     600    800    1000
                             Request Concurrency
                                                                27
ApacheBench throughput (4 cores)

                  45,000
                  40,000
                  35,000
(requests/sec.)




                             115% improvement
  Throughput




                  30,000
                  25,000
                  20,000
                  15,000
                  10,000                               flexsc
                   5,000                               sync
                       0
                        0   200   400     600    800   1000
                             Request Concurrency
                                                                28
Apache latency per client request
                                256 concurrent requests
                      238
                 30
                                                   99th
                 25                                percentile
  Latency (ms)




                 20                                average

                 15

                 10

                  5

                  0
                      sync flexsc    sync flexsc    sync flexsc
                       1 core         2 cores        4 cores

                      Up to 50% reduction of average
                             request latencies
                                                                  29
Apache processor metrics
                                               Apache (1 core)
                        2
Relative Performance




                       1.5
    (flexsc/sync)




                                        User                             Kernel
                        1


                       0.5


                        0
                                   L3    d-cache TLB          IPC      L2   i-cache Branch
                             IPC        L2   i-cache Branch         L3  d-cache TLB

                             Processor efficiency doubles for kernel
                                   and user-mode execution
                                                                                         30
Discussion

➔   New OS architecture not necessary
    ➔   Exception-less syscalls can coexist with legacy ones
➔   Foundation for non-blocking system calls
    ➔   select() / poll() in user-space
    ➔   Interesting case of non-blocking free()
➔   Multicore ultra -specialization
    ➔   TCP Servers (Rutgers; Iftode et.al), FS Servers
➔   Single-ISA asymmetric cores
    ➔   OS-friendly cores (HP Labs; Mogul et. al)

                                                               31
Concluding Remarks
➔   System calls degrade server performance
    ➔   Processor pollution is inherent to synchronous
        system calls
➔   Exception-less syscalls
    ➔   Flexible and efficient system call execution
➔   FlexSC-Threads
    ➔   Leverages exception-less syscalls
    ➔   No modifications to multi-threaded applications
➔   Throughput & latency gains
    ➔   2x throughput improvement for Apache and BIND
    ➔   1.4x throughput improvement for MySQL             32
FlexSC
Flexible System Call Scheduling with
    Exception-Less System Calls

 Livio Soares and Michael Stumm
        University of Toronto

FlexSC: Exception-Less System Calls - presented @ OSDI 2010

  • 1.
    FlexSC Flexible System CallScheduling with Exception-Less System Calls Livio Soares and Michael Stumm University of Toronto
  • 2.
    Motivation Thesynchronous system call interface is a legacy from the single core era Expensive! Costs are: ➔ direct: mode-switch ➔ indirect: processor structure pollution FlexSC implements efficient and flexible system calls for the multicore era 2
  • 3.
    FlexSC overview Two contributions:FlexSC and FlexSC-Threads Results in: 1) MySQL throughput increase of up to 40% and latency reduction of 30% 2) Apache throughput increase of up to 115% and latency reduction of 50% 3
  • 4.
    Performance impact ofsynchronous syscalls ➔ Xalan from SPEC CPU 2006 ➔ Virtually no time in the OS ➔ Linux on Intel Core i7 (Nehalem) ➔ Injected exceptions with varying frequencies ➔ Direct: emulate null system call Direct ➔ Indirect: emulate “write()” system call Indirect ➔ Measured only user-mode time ➔ Kernel time ignored Ideally, user-mode performance is unaltered 4
  • 5.
    Degradation due tosync. syscalls Degradation (lower is faster) Xalan (SPEC CPU 2006) 70% 60% Apache Indirect 50% Direct 40% MySQL 30% 20% 10% 0% 1K 2K 5K 10K 20K 50K 100K user-mode instructions between exceptions (log scale) System calls can half processor efficiency; indirect cause is major contributor 5
  • 6.
    Processor state pollution ➔ Key source of performance impact ➔ On a Linux write() call: rd ➔ up to 2/3 of the L1 data cache and data TLB are evicted ➔ Kernel performance equally affected ➔ Processor efficiency for OS code is also cut in half 6
  • 7.
    Synchronous system callsare expensive User Kernel Traditional system calls are synchronous and use exceptions to cross domains 7
  • 8.
    Alternative: side-step theboundary User Kernel Exception-less syscalls remove synchronicity by decoupling invocation from execution 8
  • 9.
    Benefits of exception-lesssystem calls ➔ Significantly reduce direct costs ➔ Fewer mode switches User ➔ Allow for batching Kernel ➔ Reduce indirect costs ➔ Allow for dynamic multicore specialization ➔ Further reduce direct and indirect costs 9
  • 10.
    Exception-less interface: syscallpage write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; entry->status = SUBMIT; SUBMIT while (entry->status != DONE) DONE do_something_else(); return entry->return_code; 10
  • 11.
    Exception-less interface: syscallpage write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; SUBMIT entry->status = SUBMIT; SUBMIT while (entry->status != DONE) DONE do_something_else(); return entry->return_code; 11
  • 12.
    Exception-less interface: syscallpage write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; DONE entry->status = SUBMIT; SUBMIT while (entry->status != DONE) DONE do_something_else(); return entry->return_code; 12
  • 13.
    Syscall threads ➔ Kernel-only threads ➔ Part of application process ➔ Execute requests from syscall page ➔ Schedulable on a per-core basis 13
  • 14.
    System call batching Request as many system calls as possible Switch to kernel-mode Start executing all posted system calls Avoids direct and indirect costs, even on a single core 14
  • 15.
    Dynamic multicore specialization FlexSC makes specializing cores simple Dynamically adapts to workload needs 15
  • 16.
    What programs canbenefit from FlexSC? Event-driven servers (e.g., memcached, nginx webserver) ➔ Use asynchoronous calls, similar to FlexSC ➔ Can use FlexSC directly ➔ Mix sync and exception-less system calls Multi-threaded servers: FlexSC-Threads ➔ Thread library, compatible with Pthreads ➔ No changes to app. code or recompilation required ➔ Transparently converts legacy syscalls into exception-less ones 16
  • 17.
    FlexSC-Threads library ➔ Hybrid (M-on-N) threading model ➔ One kernel visible thread per core ➔ Many user threads per kernel-visible thread ➔ Redirects system calls (libc wrappers) ➔ Posts exception-less syscall to syscall page ➔ Switches to other user-level thread ➔ Resumes thread upon syscall completion Benefits of exception-less syscalls while maintaining sequential syscall interface 17
  • 18.
  • 19.
    FlexSC-Threads in action Ona syscall: Post request to system call page Block user-level thread 19
  • 20.
    FlexSC-Threads in action Kernel On a syscall: Post request to system call page Block user-level thread Switch to next ready thread 20
  • 21.
    FlexSC-Threads in action User Kernel If all user-level threads become blocked: 1) enter kernel 2) wait for completion of at least 1 syscall 21
  • 22.
    Evaluation ➔ Linux 2.6.33 ➔ Nehalem (Core i7) server, 2.3GHz ➔ 4 cores on a chip ➔ Clients connected on 1 Gbps network ➔ Workloads ➔ Sysbench on MySQL (80% user, 20% kernel) ➔ ApacheBench on Apache (50% user, 50% kernel) ➔ Default Linux NTPL (“sync”) vs. sync FlexSC-Threads (“flexsc”) flexsc 22
  • 23.
    Sysbench: “OLTP” onMySQL (1 core) 500 400 (requests/sec.) Throughput 300 15% improvement 200 flexsc 100 sync 0 0 50 100 150 200 250 300 Request Concurrency 23
  • 24.
    Sysbench: “OLTP” onMySQL (4 cores) 1,000 800 (requests/sec.) Throughput 600 40% improvement 400 200 flexsc sync 0 0 50 100 150 200 250 300 Request Concurrency 24
  • 25.
    MySQL latency perclient request 256 connections 1900 1,000 900 95th 800 percentile Latency (ms) 700 average 600 500 400 300 200 100 0 sync flexsc sync flexsc sync flexsc 1 core 2 cores 4 cores Up to 30% reduction of average request latencies 25
  • 26.
    MySQL processor metrics SysBench (4 cores) 1.4 1.2 Relative Performance User Kernel 1 (flexsc/sync) 0.8 0.6 0.4 0.2 0 L3 d-cache TLB IPC L2 i-cache Branch IPC L2 i-cache Branch L3 d-cache TLB Performance improvements consequence of more efficient processor execution 26
  • 27.
    ApacheBench throughput (1core) 45,000 40,000 flexsc 35,000 sync (requests/sec.) Throughput 30,000 25,000 20,000 15,000 80-90% improvement 10,000 5,000 0 0 200 400 600 800 1000 Request Concurrency 27
  • 28.
    ApacheBench throughput (4cores) 45,000 40,000 35,000 (requests/sec.) 115% improvement Throughput 30,000 25,000 20,000 15,000 10,000 flexsc 5,000 sync 0 0 200 400 600 800 1000 Request Concurrency 28
  • 29.
    Apache latency perclient request 256 concurrent requests 238 30 99th 25 percentile Latency (ms) 20 average 15 10 5 0 sync flexsc sync flexsc sync flexsc 1 core 2 cores 4 cores Up to 50% reduction of average request latencies 29
  • 30.
    Apache processor metrics Apache (1 core) 2 Relative Performance 1.5 (flexsc/sync) User Kernel 1 0.5 0 L3 d-cache TLB IPC L2 i-cache Branch IPC L2 i-cache Branch L3 d-cache TLB Processor efficiency doubles for kernel and user-mode execution 30
  • 31.
    Discussion ➔ New OS architecture not necessary ➔ Exception-less syscalls can coexist with legacy ones ➔ Foundation for non-blocking system calls ➔ select() / poll() in user-space ➔ Interesting case of non-blocking free() ➔ Multicore ultra -specialization ➔ TCP Servers (Rutgers; Iftode et.al), FS Servers ➔ Single-ISA asymmetric cores ➔ OS-friendly cores (HP Labs; Mogul et. al) 31
  • 32.
    Concluding Remarks ➔ System calls degrade server performance ➔ Processor pollution is inherent to synchronous system calls ➔ Exception-less syscalls ➔ Flexible and efficient system call execution ➔ FlexSC-Threads ➔ Leverages exception-less syscalls ➔ No modifications to multi-threaded applications ➔ Throughput & latency gains ➔ 2x throughput improvement for Apache and BIND ➔ 1.4x throughput improvement for MySQL 32
  • 33.
    FlexSC Flexible System CallScheduling with Exception-Less System Calls Livio Soares and Michael Stumm University of Toronto