1) FlexSC implements efficient and flexible system calls for the multicore era by introducing exception-less system calls and syscall threads.
2) Results show MySQL throughput increased by up to 40% with latency reduced by 30%, and Apache throughput increased by up to 115% with latency reduced by 50%.
3) Synchronous system calls are expensive due to direct mode switching costs and indirect processor structure pollution costs, which exception-less system calls help reduce.
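The direct mode-switch cost is easy to make visible from user space. A minimal sketch (assuming CPython on Linux; `os.getpid` is just a stand-in for any cheap system call, and the indirect cost FlexSC targets, processor-structure pollution, will not show up in a microbenchmark like this):

```python
import os
import timeit

# Compare a call that enters the kernel on every invocation (glibc >= 2.25
# no longer caches getpid) with a pure user-space operation of similar size.
def measure(iterations=100_000):
    syscall_time = timeit.timeit(os.getpid, number=iterations)
    table = {"key": 1}
    user_time = timeit.timeit(lambda: table["key"], number=iterations)
    return syscall_time, user_time
```

Even this crude comparison typically shows the kernel-crossing call costing noticeably more per invocation than the in-process lookup.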
The document discusses exception-less system calls which can improve the performance of event-driven servers. It describes how event-driven servers use non-blocking I/O and asynchronous callbacks instead of threads to handle requests. The authors modified memcached and nginx to use their library called libflexsc that provides asynchronous system calls and callbacks. This approach speeds up memcached by 25-35% and nginx by 70-120% by reducing mode switches between user and kernel mode.
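The event-driven structure the paper builds on can be sketched with the standard library: non-blocking sockets plus per-descriptor callbacks dispatched from a readiness loop. This is illustrative only; libflexsc's actual asynchronous-syscall API differs, and the server address here is arbitrary.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def on_readable(conn):
    data = conn.recv(1024)
    if data:
        conn.sendall(data.upper())      # echo back, uppercased
    else:                               # peer closed the connection
        sel.unregister(conn)
        conn.close()

def on_accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, on_readable)

def run_events(timeout=0.1):
    # One turn of the event loop: dispatch each ready descriptor's callback.
    for key, _ in sel.select(timeout):
        key.data(key.fileobj)

server = socket.socket()
server.setblocking(False)
server.bind(("127.0.0.1", 0))
server.listen()
sel.register(server, selectors.EVENT_READ, on_accept)
```

A client that connects and sends `b"hi"` gets `b"HI"` back after a couple of `run_events()` turns; the point is that one thread multiplexes many connections with no blocking I/O calls.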
Falcon Storage Engine Designed For Speed Presentation (elliando dias)
This document discusses the Falcon database system and its approach to achieving high performance. It provides an overview of Falcon's architecture including its use of record caching, multi-version concurrency control, and separate storage of indexes and data. It then covers Falcon's history and goals, techniques for handling different workload characteristics, challenges in competing with InnoDB on benchmarks, approaches to multi-threading, and its use of cycle locking to enable concurrent operations while avoiding locking conflicts.
The document summarizes steps taken to update the database (db) nodes on an Exadata system from software version 11.2.3.2.1 to 11.2.3.3.0 using the dbnodeupdate.sh utility. It provides details of prerequisites, running the utility in update mode with the -u flag and specified media source, and validations performed during the process including dependency checks. The summary includes checking for known issues, collecting diagnostics, and noting next steps after upgrade completion.
Upgrading MySQL databases does not come without risk. There is no guarantee that problems won't occur when you move to a new major MySQL version.
Should we just upgrade and rollback immediately if problems occur? But what if these problems only happen a few days after migrating to this new version?
You might have a risk-averse database environment, where you really have to be sure that the new MySQL version will handle the workload properly.
Examples:
- Both MySQL 5.6 and 5.7 bring many changes to the MySQL Optimizer. These are expected to improve query performance, but is that really the case? What if there is a performance regression, and how will it affect my database?
- There are also many incompatible changes documented in the release notes. How do I know whether my workload is affected? It's a lot to read.
- Can I go immediately from MySQL 5.5 to 5.7 and skip MySQL 5.6 even though the MySQL documentation states that this is not supported?
- Many companies have staging environments, but is there a QA team and do they really test all functionality, under a similar workload?
This presentation will show you a process, using open source tools, for these types of migrations, with a focus on assessing risk and fixing any problems you might run into prior to the migration.
This process can then be used for various changes:
- MySQL upgrades for major version upgrades
- Switching storage engines
- Changing hardware architecture
Additionally, we will describe ways to do the actual migration and rollback with the least amount of downtime.
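The risk-assessment step boils down to replaying the production workload against both versions and comparing results (tools such as Percona's pt-upgrade automate the capture and replay). A toy sketch of the comparison itself, assuming you already have mean per-digest timings from both servers; the 1.5x threshold is an arbitrary illustrative choice:

```python
def find_regressions(old_times, new_times, threshold=1.5):
    """Flag query digests whose mean latency (seconds) grew by more than
    `threshold`x between the old and the new server version."""
    regressions = {}
    for digest, old in old_times.items():
        new = new_times.get(digest)
        if new is not None and old > 0 and new > old * threshold:
            regressions[digest] = (old, new, new / old)
    return regressions
```

Queries flagged here are the ones to investigate (optimizer plan changes, missing indexes) before the production cutover, not after.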
The monitoring we had been using at our company was Zabbix.
As we moved into container monitoring, a change was needed, and we naturally began looking into a monitoring approach based on Prometheus.
이영주 presented a tech session on this, and the slides are posted here.
It is organized into five parts and also covers how to set everything up.
01. Prometheus?
02. Usage
03. Alertmanager
04. Cluster
05. Performance
The document summarizes the steps to upgrade an Oracle VM (OVM) 2.2 server and manager to OVM 3.0.1. It involves installing a new OVM 3.0.1 manager on a Linux system using VirtualBox. Then installing new OVM 3.0.1 servers and importing existing virtual servers, templates, and resources from the 2.2 environment. It provides details on the manager and server installation and configuration, including network settings and access URLs for the new OVM 3.0.1 environment.
vSphere defines VMware's virtualization product suite, including the ESXi hypervisor, vCenter management server, and vSphere Client interface. ESXi uses a proprietary kernel called vmkernel along with some open source components. Key features of vSphere include VMware HA, vMotion, and DRS for managing and migrating VMs across hosts. Troubleshooting performance issues involves tools like esxtop to monitor CPU, memory, and swap usage on ESXi hosts and VMs.
Stackless Python is used extensively in EVE Online to provide cooperative multitasking. It allows tasks to be split across multiple tasklets that run concurrently without preemption. Channels are used to synchronize tasklets in a way similar to coroutines. This approach allows EVE's massive multiplayer server to handle over 120,000 concurrent users on a single database shard using Stackless Python's lightweight tasklets and channels for communication and synchronization.
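The tasklet-and-channel model can be emulated in plain CPython with generators, which is handy when Stackless itself is not installed. A sketch under that assumption; unlike real Stackless channels, this toy channel only blocks the receiver, never the sender:

```python
from collections import deque

class Channel:
    """Toy channel: receive() cooperatively waits until a value arrives."""
    def __init__(self):
        self.queue = deque()

    def send(self, value):
        self.queue.append(value)

    def receive(self):
        while not self.queue:
            yield                       # cooperate until a value arrives
        return self.queue.popleft()

def scheduler(tasklets):
    # Round-robin the generator-based "tasklets" until all have finished.
    run = deque(tasklets)
    while run:
        t = run.popleft()
        try:
            next(t)                     # run until the tasklet yields
            run.append(t)
        except StopIteration:
            pass                        # tasklet finished

results = []
ch = Channel()

def producer():
    for i in range(3):
        ch.send(i)
        yield

def consumer():
    for _ in range(3):
        value = yield from ch.receive()
        results.append(value)

scheduler([consumer(), producer()])     # results becomes [0, 1, 2]
```

No preemption is involved: each tasklet runs until it explicitly yields, which is exactly the cooperative property EVE relies on to avoid locking around shared state.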
Instructions on creating a cluster in a JBoss EAP environment (Madhusudan Pisipati)
The document describes setting up an active-passive JBoss EAP cluster configuration with two active nodes in one cluster and two passive nodes in another cluster. Key steps include:
1. Creating users on each node for management and application access.
2. Configuring the first node as the domain controller and defining server groups and profiles for the active and passive clusters.
3. Configuring the second active node and passive nodes to connect to the domain controller and join the appropriate cluster.
4. Installing Apache HTTP Server on a separate machine and configuring mod_cluster to load balance between the active clusters.
This document provides a summary of the state of JBoss EAP/WildFly application servers. It discusses the history and key releases of JBoss AS, including the path to Java EE 6 compliance and the major changes and improvements in JBoss AS 7. It then outlines the goals and key features for the next major versions, WildFly 8 and JBoss EAP 6, including support for Java EE 7, single instance patching, role-based access control, and a new web container.
Technologies for Working with Disk Storage and File Systems in Windows Serve... (Виталий Стародубцев)
##What is Storage Replica
##Architecture and scenarios
##Synchronous and asynchronous replication
##Disk-to-disk, server-to-server, intra-cluster and cluster-to-cluster replication
##Storage Replica design and planning
##What's new in Windows Server 2016 TP5
##The management GUI and other capabilities - demo and development plans
##Storage Replica integration with Storage Spaces Direct
This document provides an overview of asynchronous I/O programming. It begins with an outline and introduction to asynchronous I/O. It then discusses specific asynchronous I/O APIs like Berkeley sockets, select, poll, epoll, KQueue, and Posix AIO. It covers advantages and drawbacks of asynchronous programming. Examples are provided of asynchronous programming with these different APIs. The document also discusses libraries like libevent and frameworks like Twisted that provide asynchronous functionality.
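The common primitive behind all of those APIs is readiness notification: ask the kernel which descriptors are ready instead of blocking in a read. A minimal sketch with `select()`, the oldest of them (poll/epoll/kqueue refine the same idea with better scaling):

```python
import select
import socket

# A connected pair of sockets stands in for any file descriptors.
a, b = socket.socketpair()

ready, _, _ = select.select([a], [], [], 0)     # nothing written yet
assert ready == []

b.sendall(b"ping")
ready, _, _ = select.select([a], [], [], 1.0)   # now `a` reports readable
data = a.recv(4) if ready else None
```

The first `select` returns immediately with nothing ready; only after the peer writes does the descriptor show up, so the caller never blocks inside `recv` itself.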
Overview of Citrix XenServer 6 - Tools and Licensing (Lorscheider Santiago)
This document provides an overview of Citrix XenServer, including:
- Why use XenServer over VMware, citing XenServer's market-share leadership and lower costs.
- An overview of XenServer's key features like virtual memory licensing, clusters and pools, live migration, snapshots, and high availability.
- A comparison of XenServer and VMware features around licensing, importing VMs, backup solutions, and more.
- Details on newer versions of XenServer that include integrated disaster recovery, provisioning services, and monitoring solutions.
Benchmark: EMC VNX7500, EMC FAST Suite, EMC SnapSure and Oracle RAC on VMware (solarisyougood)
This document describes a scalable virtualized Oracle RAC 11g database deployment using EMC VNX7500 storage with EMC FAST Suite. Testing showed that using FAST Cache improved transactions per minute by 133% and response time by over 90%, while FAST Suite improved TPM by 136% and response time by over 95%. The solution also enabled rapid provisioning of Oracle databases through SnapSure checkpoints and Oracle dNFS clonedb. It provided high availability with automatic failover during network or storage hardware failures.
The document discusses new features in Windows Server 2016 related to cluster rolling upgrades. It describes the process for performing an in-place upgrade of a Hyper-V cluster from Windows Server 2012 R2 to Windows Server 2016 without downtime. The process involves pausing each node, upgrading its OS, then rejoining it to the cluster. Once all nodes are upgraded, the cluster functional level can be upgraded to enable new Windows Server 2016 features. The document also covers new storage replication capabilities in Windows Server 2016 technical preview called Storage Replica.
This is a presentation for TechNet 2015 in Korea.
I changed the format to pptx.
The agenda is as follows:
Building an OpenStack infrastructure (4-node setup) [30 min]
Creating VMs on OpenStack [20 min]
Docker setup basics [30 min]
Connecting Docker to OpenStack [30 min]
Building a web service with Docker [15 min]
Building a web service with Docker on OpenStack [15 min]
Implementing Jenkins with Docker [30 min]
Windows Internals for Linux Kernel Developers (Kernel TLV)
Agenda:
The Windows kernel has an honorable history of more than a quarter of a century. Since its inception in 1989, Windows NT supported a variety of modern OS features -- symmetric multiprocessing, interrupt prioritization, virtual memory, deferred interrupt processing, and many others. In this talk, targeted for Linux kernel developers, we will highlight the key features of the Windows NT kernel that are interesting or different from Linux's perspective. We will begin with a brief overview of processes, threads, and virtual memory on Windows. Next, we will talk about interrupt handling, interrupt priorities (IRQLs), bottom-half processing (DPC, APC, kernel worker threads, kernel thread pool), and I/O request flow. Among other things, we will look at device driver structure on Windows, application to driver communication (handles, IOCTLs), and the logical \DosDevices filesystem. Finally, we will discuss some features introduced in newer Windows versions, such as user-mode drivers (UMDF).
Speaker:
Sasha is the CTO of Sela Group, a training and consulting company based in Israel that employs over 400 developers world-wide. Most of Sasha's work revolves around performance optimization, production debugging, and low-level system diagnostics, but he also dabbles in mobile application development on iOS and Android. Sasha is the author of two books and three Pluralsight courses, and a contributor to multiple open-source projects. He blogs at http://blog.sashag.net.
This document provides instructions for implementing an Oracle 11g R2 Real Application Cluster on a Red Hat Enterprise Linux 5.0 system using a two-node configuration. It describes pre-installation steps including hardware and network configuration, installing prerequisite packages and libraries, and configuring the Oracle ASM library driver. Detailed steps are provided for installing Oracle Grid Infrastructure and database software, and configuring the single client access name and storage area network.
The document summarizes troubleshooting techniques for the XenServer virtualization platform. It describes the XenServer architecture including the Xen domain 0, guest OS internals, XAPI management interface, and storage manager. It then covers various troubleshooting tools and techniques such as useful Linux commands, XenServer log files, debugging XenCenter logs, and examining the object model and CLI.
Uponor Exadata e-Business Suite Migration Case Study (Simo Vilmunen)
Uponor, a plumbing solutions company, migrated their Oracle E-Business Suite and Oracle Business Intelligence environments from traditional hardware to Oracle Exadata in order to improve performance, scalability, availability and manageability. The migration was completed within 3 months and resulted in significant performance gains across key business processes. Lessons learned included benefits of using Exadata-specific tools and configurations and importance of testing database-specific functionality during migration.
Consistency between Engine and Binlog under Reduced Durability (Yoshinori Matsunobu)
- When MySQL instances fail and recover, the binary logs and storage engines can become inconsistent due to different levels of durability settings. This can cause issues when trying to rejoin instances to replication.
- The document discusses challenges in ensuring consistency between binary logs and storage engines like InnoDB under reduced durability settings. It also addresses issues that can occur when restarting masters or replicas due to potential inconsistencies.
- Solutions discussed include using the max GTID from the storage engine to determine where to start replication, truncating binary logs on restart if they are ahead of the engines, and using idempotent recovery techniques to handle potential duplicate or missing rows. Ensuring consistency across multiple storage engines is also challenging.
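The truncate-on-restart idea from the last bullet can be sketched as pure bookkeeping: compare the last transaction the engine made durable with what the binary log recorded. Plain integers stand in for real GTIDs (which are actually `server_uuid:transaction_id` pairs), so this is an illustration of the decision logic, not MySQL's implementation:

```python
def recovery_plan(engine_max_gtid, binlog_gtids):
    """Decide which binlog entries to keep and which to truncate after a
    crash: anything logged beyond what the engine durably committed was
    never applied, so it must not survive into replication."""
    keep = [g for g in binlog_gtids if g <= engine_max_gtid]
    truncated = [g for g in binlog_gtids if g > engine_max_gtid]
    restart_from = engine_max_gtid + 1   # ask the source to resend from here
    return keep, truncated, restart_from
```

With `engine_max_gtid=7` and a binlog containing 5..9, transactions 8 and 9 are truncated and replication restarts at 8, keeping engine and binlog consistent.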
This document discusses techniques for troubleshooting issues with Red Hat JBoss EAP 6. It covers generating thread dumps, heap dumps, and log files to analyze where threads are stuck or understand memory usage. The JBoss Diagnostic Reporter (JDR) subsystem can collect troubleshooting information. Byteman allows inserting extra Java code into applications to aid debugging. The log subsystem level and GC logging can be configured for additional troubleshooting data in log files.
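The first diagnostic above, a thread dump showing where every thread is stuck, has a direct Python analogue (for JBoss itself you would use `jstack` or `kill -3`; this sketch just shows what such a dump contains):

```python
import sys
import threading
import traceback

def thread_dump():
    """Snapshot every live thread's current stack, jstack-style."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append(f'Thread "{names.get(ident, ident)}":')
        lines.extend(l.rstrip() for l in traceback.format_stack(frame))
    return "\n".join(lines)
```

A thread blocked in `Event.wait` or a lock acquire shows up frozen at that frame, which is exactly the pattern you grep for when diagnosing stuck request threads.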
The document discusses upcoming releases of the WildFly application server. It notes that WildFly 8 is Java EE7 certified and includes performance and usability improvements. WildFly 9 is still in development and will include splitting the codebase into modular feature packs, improved security features like Elytron, and support for capabilities and requirements. The goal is to make WildFly more customizable and modular and reduce testing time.
WildFly v9 - State of the Union. Session at Voxxed, Istanbul, May 9th, 2015 (Dimitris Andreadis)
This document summarizes the history and future of the WildFly application server project. It discusses the evolution from JBoss AS to WildFly, recaps key features of WildFly 8 including full Java EE 7 certification. It provides an update on WildFly 9 including new features like the WildFly-Core project and distribution, and previews plans for WildFly 10 such as replacing HornetQ with Artemis and improving security with the new Elytron module.
This document provides an overview and introduction to virtual storage concepts in VMware vSphere, including NFS, iSCSI, VMFS, and Virtual SAN datastores. It discusses storage protocols, multipathing, and best practices for configuring and managing different types of datastores. The document is divided into several sections covering storage concepts, iSCSI, NFS, VMFS, and Virtual SAN datastores.
This document provides an overview of troubleshooting storage performance issues in vSphere environments. It discusses using vCenter performance charts and ESXTop to analyze latency and I/O statistics at the storage path, disk, and LUN level. The document also covers topics like disk alignment, considerations for using SCSI versus SATA disks, identifying APD issues, multipathing, and how VMware uses SCSI reservations for metadata locking on shared VMFS datastores.
20 Real-World Use Cases to help pick a better MySQL Replication scheme, 2012 (Darpan Dinker)
MySQL Replication: Pros and Cons
Achieve Higher Performance, Uptime, Reliability and Simplicity
for Real-World Use Cases.
Synchronous, semi-synchronous, asynchronous replication with parallel Slave appliers.
High availability.
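The latency side of the sync/semi-sync/async trade-off can be put in one back-of-the-envelope model (my simplification, not from the slides; local commit cost and failure handling are ignored): async acknowledges the client immediately, semi-sync waits for the fastest replica, fully synchronous waits for every replica.

```python
def commit_wait(replica_latencies_ms, scheme):
    """Extra client-visible commit latency each replication scheme adds,
    given round-trip latencies to each replica in milliseconds."""
    if scheme == "async":
        return 0.0                          # no replica is waited on
    if scheme == "semi-sync":
        return min(replica_latencies_ms)    # first ack wins
    if scheme == "sync":
        return max(replica_latencies_ms)    # slowest replica gates commit
    raise ValueError(f"unknown scheme: {scheme}")
```

The model makes the durability/latency tension explicit: the stronger the guarantee, the more the slowest participant dictates your commit time.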
Fastsocket is software that improves the scalability and performance of socket-based applications on multicore systems. It addresses kernel inefficiencies, such as synchronization overhead, that can consume over 90% of CPU cycles. Fastsocket introduces techniques like receive flow delivery, local listen and established tables, and a fastsocket-aware VFS to partition resources and process connections locally on each CPU core. In production at SINA, Fastsocket improved HTTP load-balancing throughput by 45% on a 16-core system. Future work aims to further optimize performance through techniques like improved interrupt handling and system call batching.
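Stock Linux has a related (much more limited) mechanism in the same spirit as Fastsocket's per-core partitioning: `SO_REUSEPORT` lets several worker sockets bind the same port so the kernel spreads incoming connections across them instead of funneling all workers through one shared accept queue. A sketch assuming Linux (the option's semantics differ on other platforms):

```python
import socket

def worker_listener(port=0):
    """One per-worker listening socket; several of these can share a port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s
```

Binding one listener per worker process (typically pinned to a core) removes the single accept-queue bottleneck, which is one of the contention points Fastsocket attacks inside the kernel.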
Achieving Performance Isolation with Lightweight Co-KernelsJiannan Ouyang, PhD
This slides were presented at the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15)
Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next generation HPC workloads. Each enclave consists of a specialized lightweight OS co-kernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement and present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally we will show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6David Pasek
We are observing different network throughputs on Intel X710 NICs and QLogic FastLinQ QL41xxx NIC. ESXi hardware supports NIC hardware offloading and queueing on 10Gb, 25Gb, 40Gb and 100Gb NIC adapters. Multiple hardware queues per NIC interface (vmnic) and multiple software threads on ESXi VMkernel is depicted and documented in this paper which may or may not be the root cause of the observed problem. The key objective of this document is to clearly document and collect NIC information on two specific Network Adapters and do a comparison to find the difference or at least root cause hypothesis for further troubleshooting.
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...netvis
This document discusses using sFlow for scalable, unified monitoring of networks, systems, and applications. sFlow exports standard performance counters from network devices, hosts, and applications. It uses a lightweight "push" protocol over UDP that is scalable and cloud-friendly. In addition to counters, sFlow also exports random packet samples which provide insight into issues like top URLs, clients, and servers without high overhead. Tagged, a social networking company, uses sFlow for comprehensive monitoring across their infrastructure through integration with tools like Ganglia.
Anton Moldovan "Building an efficient replication system for thousands of ter...Fwdays
For one of our projects, we needed to improve the current content delivery system for terminals. In this talk, I will share our experience in building an efficient data replication system for thousands of terminals. We will touch on architecture decisions and tradeoffs, technologies that we used, and a bit of load testing.
Spoiler: We didn't use Kafka.
Shak larry-jeder-perf-and-tuning-summit14-part2-finalTommy Lee
This document provides an overview of performance analysis and tuning techniques in Red Hat Enterprise Linux (RHEL). It discusses the tuned profile packages and how they optimize systems for different workloads. Specific topics covered include disk I/O tuning, memory tuning, network performance tuning, and power management techniques. A variety of Linux performance analysis tools are also introduced, including tuned, turbostat, netsniff-ng, and Performance Co-Pilot.
High Performance Storage Devices in the Linux KernelKernel TLV
Agenda:
In this talk we will present the Linux kernel storage layers and dive into blk-mq, a scalable, parallel block layer for high performance block devices, and how it is used to unleash the performance of NVMe, flash and beyond.
Speaker:
Evgeny Budilovsky, Kernel Developer at E8 Storage
https://www.linkedin.com/company/e8-storage
The document discusses several technologies for using MySQL as a NoSQL database system by bypassing SQL overhead: HandlerSocket allows direct access to indexes; the NDB API provides a NoSQL interface to MySQL Cluster; the BLOB Streaming Engine improves BLOB handling; the Handler Interface skips parsing and optimization; and the OQGRAPH storage engine enables graph queries and computations. Benchmarks show these techniques providing significant performance gains over standard SQL.
Load Balancing MySQL with HAProxy - SlidesSeveralnines
Agenda:
* What is HAProxy?
* SQL Load balancing for MySQL
* Failure detection using MySQL health checks
* High Availability with Keepalived and Virtual IP
* Use cases: MySQL Cluster, Galera Cluster and MySQL Replication
* Alternative methods: Database drivers with inbuilt cluster support, MySQL proxy, MaxScale, ProxySQL
This document provides an introduction to parallel synchronous replication using Percona XtraDB Cluster (PXC). It discusses the limitations of traditional MySQL replication and how PXC implements a data-centric approach with synchronous multi-master replication between nodes. Key features of PXC highlighted include parallel replication, data consistency, and automatic provisioning of new nodes. The document also covers integration with load balancers and limitations to be aware of for write-intensive or large transaction workloads.
A talk I gave at the Boston Web Performance Meetup in August 2014.
Performance is one of the most challenging issues in modern web app design, in large part because modeling, testing, and validating performance before deploying to production is so challenging. While many ops teams have nailed down the problem of re-creating pre-production environments that closely mimic production, those environments frequently rely on known-good components beyond the application code itself: AWS ELB, F5 load balancers, CDNs, Varnish, and more.
Testing plug-in components like that can be challenging, because their performance characteristics don't directly align with application metrics.
- How many simultaneous users can my load balancer support? - What sort of network load will I put on my CDN (i.e., how much will it cost?) - How do different user behavior patterns affect performance?
In this meetup, we'll introduce a novel tool in this toolbox: tcpreplay, an open-source tool for replaying packet capture files back at an application. By replaying user traffic to a staging environment, you can test the effects of
- Network saturation to the load balancer - High numbers of users / IPs - Lots of traffic to your other monitoring tools!
Chicago Flink Meetup: Flink's streaming architectureRobert Metzger
This document summarizes the architecture of Apache Flink's streaming runtime. Flink is a stream processor that embraces the streaming nature of data with low latency, high throughput, and exactly-once guarantees. It achieves this through pipelining to keep data moving efficiently and distributed snapshots for fault tolerance. Flink also supports batch processing as a special case of streaming by running bounded streams as a single global window.
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
Direct Code Execution (DCE) is a userspace kernel network stack that allows running real network stack code in a single process. DCE provides a testing platform that enables reproducible testing, fine-grained parameter tuning, and a development framework for network protocols. It achieves this through a virtualization core layer that runs multiple network nodes within a single process, a kernel layer that replaces the kernel with a shared library, and a POSIX layer that redirects system calls to the kernel library. This allows full control and observability for testing and debugging the network stack.
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
In diesem Vortrag wird es zunächst einen kurzen Überblick über den aktuellen Stand im Bereich der Streaming-Datenanalyse geben. Danach wird es mit einer kleinen Einführung in das Apache-Flink-System zur Echtzeit-Datenanalyse weitergehen, bevor wir tiefer in einige der interessanten Eigenschaften eintauchen werden, die Flink von den anderen Spielern in diesem Bereich unterscheidet. Dazu werden wir beispielhafte Anwendungsfälle betrachten, die entweder direkt von Nutzern stammen oder auf unserer Erfahrung mit Nutzern basieren. Spezielle Eigenschaften, die wir betrachten werden, sind beispielsweise die Unterstützung für die Zerlegung von Events in einzelnen Sessions basierend auf der Zeit, zu der ein Ereignis passierte (event-time), Bestimmung von Zeitpunkten zum jeweiligen Speichern des Zustands eines Streaming-Programms für spätere Neustarts, die effiziente Abwicklung bei sehr großen zustandsorientierten Streaming-Berechnungen und die Zugänglichkeit des Zustandes von außerhalb.
SR-IOV: The Key Enabling Technology for Fully Virtualized HPC ClustersGlenn K. Lockwood
How well does InfiniBand virtualized with SR-IOV really perform? SDSC carried out some initial application benchmarking studies and compared to the best-available commercial alternative to determine whether or not SR-IOV was a viable technology for closing the performance gap of virtualized HPC. The results were promising, and this technology will be used in Comet, SDSC's two-petaflop supercomputer being deployed in 2015.
VMworld 2013
Lenin Singaravelu, VMware
Haoqiang Zheng, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
The document discusses different types of MySQL replication including asynchronous, semi-synchronous, and synchronous. It provides pros and cons of each type and describes how they handle transaction ordering, parallelism, flow control, and consistency. The key points are that asynchronous replication has potential for data loss and stale reads, semi-synchronous reduces but does not eliminate data loss risk, while SchoonerSQL's synchronous replication guarantees no data loss and failover without stalled transactions.
Similar to FlexSC: Exception-Less System Calls - presented @ OSDI 2010 (20)
FlexSC: Exception-Less System Calls - presented @ OSDI 2010
1. FlexSC
Flexible System Call Scheduling with Exception-Less System Calls
Livio Soares and Michael Stumm
University of Toronto
2. Motivation
The synchronous system call interface is a legacy from the single-core era
Expensive! Costs are:
➔ direct: mode switch
➔ indirect: processor structure pollution
FlexSC implements efficient and flexible system calls for the multicore era
3. FlexSC overview
Two contributions: FlexSC and FlexSC-Threads
Results in:
1) MySQL throughput increase of up to 40% and latency reduction of 30%
2) Apache throughput increase of up to 115% and latency reduction of 50%
4. Performance impact of synchronous syscalls
➔ Xalan from SPEC CPU 2006: virtually no time in the OS
➔ Linux on Intel Core i7 (Nehalem)
➔ Injected exceptions with varying frequencies
➔ Direct: emulate a null system call
➔ Indirect: emulate a write() system call
➔ Measured only user-mode time; kernel time ignored
Ideally, user-mode performance is unaltered
5. Degradation due to sync. syscalls
[Chart: degradation (lower is faster) for Xalan (SPEC CPU 2006), Apache, and MySQL versus user-mode instructions between exceptions, 1K–100K on a log scale, split into direct and indirect components; degradation ranges from 0% to 70%]
System calls can halve processor efficiency; the indirect cost is the major contributor
6. Processor state pollution
➔ Key source of performance impact
➔ On a Linux write() call, up to 2/3 of the L1 data cache and data TLB entries are evicted
➔ Kernel performance equally affected
➔ Processor efficiency for OS code is also cut in half
7. Synchronous system calls are expensive
[Diagram: execution crossing between User and Kernel mode on every call]
Traditional system calls are synchronous and use exceptions to cross domains
8. Alternative: side-step the boundary
[Diagram: User and Kernel execute independently, communicating through shared memory]
Exception-less syscalls remove synchronicity by decoupling invocation from execution
9. Benefits of exception-less system calls
➔ Significantly reduce direct costs
  ➔ Fewer mode switches
➔ Allow for batching
  ➔ Reduce indirect costs
➔ Allow for dynamic multicore specialization
  ➔ Further reduce direct and indirect costs
13. Syscall threads
➔ Kernel-only threads
➔ Part of application process
➔ Execute requests from syscall page
➔ Schedulable on a per-core basis
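The syscall page and syscall threads described above can be sketched as a user-space simulation in Python (not kernel code; the free/submitted/done entry states mirror the design described here, while the field and function names are illustrative):

```python
from dataclasses import dataclass

FREE, SUBMITTED, DONE = "free", "submitted", "done"

@dataclass
class SyscallEntry:
    """One slot in a shared syscall page."""
    status: str = FREE
    number: int = 0      # syscall number
    args: tuple = ()
    retval: int = 0

def post(page, number, *args):
    """User side: claim a free entry and mark it submitted (no mode switch)."""
    for entry in page:
        if entry.status == FREE:
            entry.number, entry.args, entry.status = number, args, SUBMITTED
            return entry
    return None  # page full; the caller would wait or fall back

def syscall_thread(page, table):
    """Kernel-side syscall thread: execute every submitted entry it finds."""
    for entry in page:
        if entry.status == SUBMITTED:
            entry.retval = table[entry.number](*entry.args)
            entry.status = DONE

# Usage: post two "syscalls", then let the syscall thread drain the page.
page = [SyscallEntry() for _ in range(8)]
table = {0: lambda x, y: x + y, 1: lambda s: len(s)}
e1 = post(page, 0, 2, 3)
e2 = post(page, 1, "hello")
syscall_thread(page, table)
print(e1.status, e1.retval)  # done 5
print(e2.status, e2.retval)  # done 5
```

The point of the shared page is that posting an entry is an ordinary memory write, so invocation costs no exception; only the syscall thread, schedulable per core, runs in kernel mode.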
14. System call batching
Request as many system calls as possible
Switch to kernel-mode
Start executing all posted system calls
Avoids direct and indirect costs, even on a single core
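A toy model of why batching pays off, counting only mode switches under the assumption that each synchronous call costs one switch while a batch of exception-less calls costs one switch total (the real savings also include the indirect cache/TLB effects, which this sketch ignores):

```python
def sync_cost(n_calls, switch_cost=1):
    # Traditional: every call crosses the user/kernel boundary.
    return n_calls * switch_cost

def batched_cost(n_calls, batch_size, switch_cost=1):
    # Exception-less: one boundary crossing executes a whole batch.
    full, rem = divmod(n_calls, batch_size)
    return (full + (1 if rem else 0)) * switch_cost

print(sync_cost(100))         # 100
print(batched_cost(100, 32))  # 4
```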
16. What programs can benefit from FlexSC?
Event-driven servers (e.g., memcached, nginx webserver)
➔ Use asynchronous calls, similar to FlexSC
➔ Can use FlexSC directly
➔ Mix sync and exception-less system calls
Multi-threaded servers: FlexSC-Threads
➔ Thread library, compatible with Pthreads
➔ No changes to app. code or recompilation required
➔ Transparently converts legacy syscalls into exception-less ones
17. FlexSC-Threads library
➔ Hybrid (M-on-N) threading model
  ➔ One kernel-visible thread per core
  ➔ Many user threads per kernel-visible thread
➔ Redirects system calls (libc wrappers)
  ➔ Posts exception-less syscall to syscall page
  ➔ Switches to another user-level thread
  ➔ Resumes the thread upon syscall completion
Benefits of exception-less syscalls while maintaining the sequential syscall interface
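The post-switch-resume cycle above can be sketched with Python generators standing in for user-level threads (purely illustrative: FlexSC-Threads does this in libc wrappers with real thread contexts, and a syscall thread, not the scheduler, would execute the posted request):

```python
from collections import deque

def scheduler(threads, execute):
    """Run user-level threads round-robin. A thread yields a (number, args)
    request when it makes a 'syscall' and is resumed with the result later."""
    ready = deque((t, None) for t in threads)
    results = []
    while ready:
        thread, value = ready.popleft()
        try:
            request = thread.send(value)
        except StopIteration as fin:
            results.append(fin.value)
            continue
        # Post the exception-less syscall, then switch to another thread.
        # Here we compute the result inline; a real syscall thread would
        # run it on another core while this thread waits at the back.
        ready.append((thread, execute(*request)))
    return results

def worker(name):
    # Stand-in for a libc wrapper: yields the request instead of trapping.
    n = yield (0, (name,))
    return f"{name}:{n}"

table = {0: lambda s: len(s)}
execute = lambda num, args: table[num](*args)
print(scheduler([worker("alpha"), worker("be")], execute))  # ['alpha:5', 'be:2']
```

Because the wrapper never traps, each core keeps executing user code from other threads while syscalls are serviced, which is exactly what preserves the sequential syscall interface for the application.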
31. Discussion
➔ New OS architecture not necessary
  ➔ Exception-less syscalls can coexist with legacy ones
➔ Foundation for non-blocking system calls
  ➔ select() / poll() in user-space
  ➔ Interesting case of a non-blocking free()
➔ Multicore ultra-specialization
  ➔ TCP servers (Rutgers; Iftode et al.), FS servers
➔ Single-ISA asymmetric cores
  ➔ OS-friendly cores (HP Labs; Mogul et al.)
32. Concluding Remarks
➔ System calls degrade server performance
  ➔ Processor pollution is inherent to synchronous system calls
➔ Exception-less syscalls
  ➔ Flexible and efficient system call execution
➔ FlexSC-Threads
  ➔ Leverages exception-less syscalls
  ➔ No modifications to multi-threaded applications
➔ Throughput & latency gains
  ➔ 2x throughput improvement for Apache and BIND
  ➔ 1.4x throughput improvement for MySQL
33. FlexSC
Flexible System Call Scheduling with Exception-Less System Calls
Livio Soares and Michael Stumm
University of Toronto