FlexSC
Flexible System Call Scheduling with
    Exception-Less System Calls

 Livio Soares and Michael Stumm
        Unive...

FlexSC: Exception-Less System Calls - presented @ OSDI 2010


  1. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls
     Livio Soares and Michael Stumm, University of Toronto
  2. Motivation
     The synchronous system call interface is a legacy from the single-core era. It is expensive. The costs are:
     ➔ direct: mode switch
     ➔ indirect: processor structure pollution
     FlexSC implements efficient and flexible system calls for the multicore era.
  3. FlexSC overview
     Two contributions: FlexSC and FlexSC-Threads. Results in:
     1) MySQL throughput increase of up to 40% and latency reduction of 30%
     2) Apache throughput increase of up to 115% and latency reduction of 50%
  4. Performance impact of synchronous syscalls
     ➔ Xalan from SPEC CPU 2006
       ➔ Virtually no time in the OS
     ➔ Linux on Intel Core i7 (Nehalem)
     ➔ Injected exceptions with varying frequencies
       ➔ Direct: emulate null system call
       ➔ Indirect: emulate “write()” system call
     ➔ Measured only user-mode time
       ➔ Kernel time ignored
     Ideally, user-mode performance is unaltered.
  5. Degradation due to sync. syscalls
     [Figure: degradation of Xalan (SPEC CPU 2006), 0% to 70%, lower is faster, versus user-mode instructions between exceptions, 1K to 100K on a log scale. Curves for direct and indirect costs, with Apache and MySQL marked.]
     System calls can halve processor efficiency; the indirect cost is the major contributor.
  6. Processor state pollution
     ➔ Key source of performance impact
     ➔ On a Linux write() call:
       ➔ up to 2/3 of the L1 data cache and data TLB are evicted
     ➔ Kernel performance equally affected
       ➔ Processor efficiency for OS code is also cut in half
  7. Synchronous system calls are expensive
     [Diagram: user and kernel domains.]
     Traditional system calls are synchronous and use exceptions to cross domains.
  8. Alternative: side-step the boundary
     [Diagram: user and kernel domains.]
     Exception-less syscalls remove synchronicity by decoupling invocation from execution.
  9. Benefits of exception-less system calls
     ➔ Significantly reduce direct costs
       ➔ Fewer mode switches
     ➔ Allow for batching
       ➔ Reduce indirect costs
     ➔ Allow for dynamic multicore specialization
       ➔ Further reduce direct and indirect costs
  10. Exception-less interface: syscall page
      The synchronous call write(fd, buf, 4096); is expressed as:
          entry = free_syscall_entry();
          /* write syscall */
          entry->syscall = 1;
          entry->num_args = 3;
          entry->args[0] = fd;
          entry->args[1] = buf;
          entry->args[2] = 4096;
          entry->status = SUBMIT;
          while (entry->status != DONE)
              do_something_else();
          return entry->return_code;
  11. Exception-less interface: syscall page
      Same code as slide 10; the syscall page entry is shown with status SUBMIT.
  12. Exception-less interface: syscall page
      Same code as slide 10; the syscall page entry is shown with status DONE.
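The syscall page interface on slides 10 to 12 can be sketched in C as follows. This is a minimal user-space illustration, not the FlexSC implementation: the entry layout, the FREE state, the page size, and the helpers `post_write`, `poll_result`, and `fake_kernel_complete` are assumptions made for the example. Only the field names, `free_syscall_entry()`, and the SUBMIT and DONE states come from the slides. The kernel side is replaced by a stand-in function so the sketch can run without kernel support.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_ENTRIES 64   /* assumed capacity; the slides give no page size */

/* FREE is an assumed third state; the slides show only SUBMIT and DONE. */
enum entry_status { FREE = 0, SUBMIT, DONE };

/* Hypothetical entry layout built from the field names on the slides. */
struct syscall_entry {
    int syscall;          /* system call number (1 is write on x86-64 Linux) */
    int num_args;
    intptr_t args[6];
    long return_code;
    _Atomic int status;   /* FREE -> SUBMIT -> DONE -> FREE */
};

/* Static storage is zero-initialized, so every entry starts out FREE. */
static struct syscall_entry syscall_page[PAGE_ENTRIES];

/* Return the first FREE entry, or NULL if the page is full. */
static struct syscall_entry *free_syscall_entry(void)
{
    for (size_t i = 0; i < PAGE_ENTRIES; i++)
        if (atomic_load(&syscall_page[i].status) == FREE)
            return &syscall_page[i];
    return NULL;
}

/* Fill in a write request and publish it without entering the kernel. */
static struct syscall_entry *post_write(int fd, const void *buf, size_t len)
{
    struct syscall_entry *e = free_syscall_entry();
    if (e == NULL)
        return NULL;
    e->syscall = 1;
    e->num_args = 3;
    e->args[0] = fd;
    e->args[1] = (intptr_t)buf;
    e->args[2] = (intptr_t)len;
    /* Release store: the arguments become visible before the status flips. */
    atomic_store_explicit(&e->status, SUBMIT, memory_order_release);
    return e;
}

/* Non-blocking completion check; on success the entry is recycled. */
static int poll_result(struct syscall_entry *e, long *ret)
{
    if (atomic_load_explicit(&e->status, memory_order_acquire) != DONE)
        return 0;
    *ret = e->return_code;
    atomic_store_explicit(&e->status, FREE, memory_order_relaxed);
    return 1;
}

/* Stand-in for the kernel side: pretend the write succeeded in full. */
static void fake_kernel_complete(struct syscall_entry *e)
{
    assert(atomic_load(&e->status) == SUBMIT);
    e->return_code = (long)e->args[2];
    atomic_store_explicit(&e->status, DONE, memory_order_release);
}
```

A caller would post with `post_write(fd, buf, 4096)`, keep doing other work while `poll_result` returns 0, and read the return code once it returns 1. This mirrors the `while (entry->status != DONE) do_something_else();` loop on the slide.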
  13. Syscall threads
      ➔ Kernel-only threads
        ➔ Part of application process
      ➔ Execute requests from syscall page
      ➔ Schedulable on a per-core basis
  14. System call batching
      1) Request as many system calls as possible
      2) Switch to kernel-mode
      3) Start executing all posted system calls
      Avoids direct and indirect costs, even on a single core.
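The batching steps on slide 14 can be illustrated with a small single-threaded simulation. It is a sketch under stated assumptions: `post_request`, `enter_kernel_and_drain`, the `mode_switches` counter, and the simplified one-argument entry are hypothetical names invented here, and the "kernel" is an ordinary function that echoes the argument back as the result. The point is only that several posted requests cost one simulated mode switch instead of one each.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_ENTRIES 8    /* assumed page capacity, for illustration only */

enum entry_status { FREE = 0, SUBMIT, DONE };

/* Simplified, hypothetical entry: one argument instead of an args array. */
struct syscall_entry {
    int syscall;
    long arg;
    long return_code;
    enum entry_status status;
};

struct syscall_page {
    struct syscall_entry entries[PAGE_ENTRIES];
    unsigned mode_switches;   /* counts simulated user-to-kernel transitions */
};

/* Step 1: post a request without entering the kernel.
 * Returns the entry index, or -1 if the page is full. */
static int post_request(struct syscall_page *page, int syscall, long arg)
{
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        struct syscall_entry *e = &page->entries[i];
        if (e->status == FREE) {
            e->syscall = syscall;
            e->arg = arg;
            e->status = SUBMIT;
            return i;
        }
    }
    return -1;
}

/* Steps 2 and 3: a single simulated mode switch, then execute every
 * posted request. The "execution" just echoes the argument back. */
static int enter_kernel_and_drain(struct syscall_page *page)
{
    int executed = 0;
    page->mode_switches++;
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        struct syscall_entry *e = &page->entries[i];
        if (e->status == SUBMIT) {
            e->return_code = e->arg;
            e->status = DONE;
            executed++;
        }
    }
    return executed;
}
```

With a synchronous interface, the same five requests would cross the user/kernel boundary five times; here the counter records one crossing for the whole batch.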
  15. Dynamic multicore specialization
      FlexSC makes specializing cores simple and dynamically adapts to workload needs.
  16. What programs can benefit from FlexSC?
      Event-driven servers (e.g., memcached, nginx webserver)
      ➔ Use asynchronous calls, similar to FlexSC
      ➔ Can use FlexSC directly
      ➔ Mix sync and exception-less system calls
      Multi-threaded servers: FlexSC-Threads
      ➔ Thread library, compatible with Pthreads
      ➔ No changes to app. code or recompilation required
      ➔ Transparently converts legacy syscalls into exception-less ones
  17. FlexSC-Threads library
      ➔ Hybrid (M-on-N) threading model
        ➔ One kernel-visible thread per core
        ➔ Many user threads per kernel-visible thread
      ➔ Redirects system calls (libc wrappers)
        ➔ Posts exception-less syscall to syscall page
        ➔ Switches to other user-level thread
        ➔ Resumes thread upon syscall completion
      Benefits of exception-less syscalls while maintaining sequential syscall interface.
  18. FlexSC-Threads in action
      [Diagram: user-level threads.]
  19. FlexSC-Threads in action
      On a syscall:
      ➔ Post request to system call page
      ➔ Block user-level thread
  20. FlexSC-Threads in action
      On a syscall:
      ➔ Post request to system call page
      ➔ Block user-level thread
      ➔ Switch to next ready thread
  21. FlexSC-Threads in action
      If all user-level threads become blocked:
      1) enter kernel
      2) wait for completion of at least 1 syscall
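The scheduling rule on slides 19 to 21 can be sketched as a small state machine. This is an illustrative simulation only, not the FlexSC-Threads library: there is no real context switching, and `user_thread`, `sched_stats`, and `run_all` are hypothetical names. One simplification is labeled in the code: when the scheduler enters the simulated kernel, it treats every pending syscall as completed, whereas the slide only requires waiting for at least one.

```c
#include <assert.h>
#include <stddef.h>

enum ut_state { READY, BLOCKED, FINISHED };

/* Hypothetical user-level thread: it only issues a fixed number of syscalls. */
struct user_thread {
    enum ut_state state;
    int syscalls_left;
};

struct sched_stats {
    unsigned syscalls_posted;   /* requests written to the syscall page */
    unsigned kernel_entries;    /* times the kernel-visible thread entered the kernel */
};

/* Run all user threads on one kernel-visible thread until they finish. */
static struct sched_stats run_all(struct user_thread *t, size_t n)
{
    struct sched_stats s = { 0, 0 };

    for (;;) {
        size_t ready = n;       /* index of the first READY thread, n if none */
        size_t blocked = 0;

        for (size_t i = 0; i < n; i++) {
            if (t[i].state == READY && ready == n)
                ready = i;
            if (t[i].state == BLOCKED)
                blocked++;
        }

        if (ready != n) {
            if (t[ready].syscalls_left == 0) {
                t[ready].state = FINISHED;
            } else {
                /* On a syscall: post the request, block this thread,
                 * and let the loop switch to the next ready thread. */
                t[ready].syscalls_left--;
                t[ready].state = BLOCKED;
                s.syscalls_posted++;
            }
            continue;
        }

        if (blocked == 0)
            break;              /* every thread has finished */

        /* All threads blocked: enter the kernel and wait for completions.
         * Simplification: all pending syscalls complete on this entry. */
        s.kernel_entries++;
        for (size_t i = 0; i < n; i++)
            if (t[i].state == BLOCKED)
                t[i].state = READY;
    }
    return s;
}
```

For four threads that each issue three syscalls, this simulation posts twelve requests but enters the kernel only three times, which is the effect the slides attribute to switching between user-level threads instead of trapping on every call.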
  22. Evaluation
      ➔ Linux 2.6.33
      ➔ Nehalem (Core i7) server, 2.3GHz
        ➔ 4 cores on a chip
      ➔ Clients connected on 1 Gbps network
      ➔ Workloads
        ➔ Sysbench on MySQL (80% user, 20% kernel)
        ➔ ApacheBench on Apache (50% user, 50% kernel)
      ➔ Default Linux NPTL (“sync”) vs. FlexSC-Threads (“flexsc”)
  23. Sysbench: “OLTP” on MySQL (1 core)
      [Figure: throughput (requests/sec., 0 to 500) versus request concurrency (0 to 300) for flexsc and sync.]
      15% improvement.
  24. Sysbench: “OLTP” on MySQL (4 cores)
      [Figure: throughput (requests/sec., 0 to 1,000) versus request concurrency (0 to 300) for flexsc and sync.]
      40% improvement.
  25. MySQL latency per client request
      [Figure: average and 95th-percentile latency (ms) at 256 connections for sync and flexsc on 1, 2, and 4 cores; one off-scale value is labeled 1900.]
      Up to 30% reduction of average request latencies.
  26. MySQL processor metrics
      [Figure: SysBench (4 cores), relative performance (flexsc/sync) of user and kernel execution for IPC, L3, L2, d-cache, i-cache, TLB, and branch metrics.]
      Performance improvements are a consequence of more efficient processor execution.
  27. ApacheBench throughput (1 core)
      [Figure: throughput (requests/sec., 0 to 45,000) versus request concurrency (0 to 1000) for flexsc and sync.]
      80-90% improvement.
  28. ApacheBench throughput (4 cores)
      [Figure: throughput (requests/sec., 0 to 45,000) versus request concurrency (0 to 1000) for flexsc and sync.]
      115% improvement.
  29. Apache latency per client request
      [Figure: average and 99th-percentile latency (ms) at 256 concurrent requests for sync and flexsc on 1, 2, and 4 cores; one off-scale value is labeled 238.]
      Up to 50% reduction of average request latencies.
  30. Apache processor metrics
      [Figure: Apache (1 core), relative performance (flexsc/sync) of user and kernel execution for IPC, L3, L2, d-cache, i-cache, TLB, and branch metrics.]
      Processor efficiency doubles for kernel and user-mode execution.
  31. Discussion
      ➔ New OS architecture not necessary
        ➔ Exception-less syscalls can coexist with legacy ones
      ➔ Foundation for non-blocking system calls
        ➔ select() / poll() in user-space
        ➔ Interesting case of non-blocking free()
      ➔ Multicore ultra-specialization
        ➔ TCP Servers (Rutgers; Iftode et al.), FS Servers
      ➔ Single-ISA asymmetric cores
        ➔ OS-friendly cores (HP Labs; Mogul et al.)
  32. Concluding Remarks
      ➔ System calls degrade server performance
        ➔ Processor pollution is inherent to synchronous system calls
      ➔ Exception-less syscalls
        ➔ Flexible and efficient system call execution
      ➔ FlexSC-Threads
        ➔ Leverages exception-less syscalls
        ➔ No modifications to multi-threaded applications
      ➔ Throughput & latency gains
        ➔ 2x throughput improvement for Apache and BIND
        ➔ 1.4x throughput improvement for MySQL
  33. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls
      Livio Soares and Michael Stumm, University of Toronto
