A ScyllaDB Community
Performance Pitfalls of Rust Async Function Pointers (And Why It Might Not Matter)
Byron Wasti
Founder of Balter Load Testing
Byron Wasti (he/him)
Founder of Balter Load Testing
■ Programming with Rust for 7+ years, 4+ professionally
■ Focused on building robust, high-performance, low-latency systems
■ Developer of Balter, an open source load testing framework for Rust (github.com/BalterLoadTesting/balter)
Motivation: Building a Load Testing Framework
■ Ability to run a user-provided function repeatedly and in parallel
// Balter code
pub fn load_test(user_func: fn()) {
    loop {
        user_func();
    }
}

// User Code
fn main() {
    balter::load_test(my_load_test_scenario);
}

fn my_load_test_scenario() {
    ...
}
Function Pointers
■ Any function with the same signature
■ Run the function in multiple threads
■ Send the function to other machines (it's just a pointer)
use std::thread;

pub fn load_test(user_func: fn()) {
    for _ in 0..THREADS {
        // `move` copies the function pointer into each thread's closure
        thread::spawn(move || {
            loop {
                user_func();
            }
        });
    }
}

load_test(my_func_a);
load_test(my_func_b);
load_test(my_func_c);
Async Function Pointers
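For I/O-bound tasks (e.g. HTTP requests), async promises better performance
// What we would like to write. Note that `async fn()` is not a valid type in Rust.
pub async fn load_test(user_func: async fn()) {
    for _ in 0..TASKS {
        // `tokio::spawn` takes a future, not a closure
        tokio::spawn(async move {
            loop {
                user_func().await;
            }
        });
    }
}

async fn foo() {
}

load_test(foo).await;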
Async Functions in Rust
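■ Desugar into normal functions returning `impl Future<Output=?>`
■ The compiler auto-generates a distinct opaque type for each `impl Trait`, so `foo` and `bar` below have different types even though their signatures look identical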
async fn foo() -> i32 {
}
async fn bar() -> i32 {
}
// Compiler error!
let arr: [fn() -> impl Future<Output=i32>] = [foo, bar];
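To make the compiler error concrete, here is a minimal sketch (not from the talk) that prints the type name the compiler assigns to each future. The helpers `type_name_of` and `future_type_name` are hypothetical names introduced only for illustration, and the function bodies return placeholder values. No executor is needed because the futures are constructed but never polled.

```rust
use std::any::type_name;
use std::future::Future;

async fn foo() -> i32 {
    1 // placeholder body
}

async fn bar() -> i32 {
    2 // placeholder body
}

// Hypothetical helper: the compiler's name for the type of `_value`.
fn type_name_of<T>(_value: &T) -> &'static str {
    type_name::<T>()
}

// Accepts any future yielding i32, which is all that
// `impl Future<Output = i32>` promises to the caller.
fn future_type_name<F: Future<Output = i32>>(fut: F) -> &'static str {
    type_name_of(&fut)
}

fn main() {
    // Calling an async fn only builds its state machine; nothing runs yet.
    println!("foo() returns {}", future_type_name(foo()));
    println!("bar() returns {}", future_type_name(bar()));
}
```

The two printed names differ, which is why no single `fn() -> impl Future<Output=i32>` pointer type can describe both functions.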
Type-Erased Async Function Pointers
■ Common workaround is to use `Box::pin()`
use std::{future::Future, pin::Pin};

fn foo() -> Pin<Box<dyn Future<Output=i32>>> {
    Box::pin(async {
        // Our usual async code
    })
}

fn bar() -> Pin<Box<dyn Future<Output=i32>>> {
    Box::pin(async {
        // Our usual async code
    })
}

// This works now!
let arr = [foo, bar];
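A runnable version of this workaround might look like the sketch below. It assumes placeholder bodies for `foo` and `bar`, and it uses a hypothetical `block_on` built on a no-op waker as a stand-in for a real executor such as Tokio, so that the example depends only on the standard library. That stand-in is only adequate for futures that never actually wait.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type BoxFuture = Pin<Box<dyn Future<Output = i32>>>;

fn foo() -> BoxFuture {
    Box::pin(async { 1 }) // placeholder body
}

fn bar() -> BoxFuture {
    Box::pin(async { 2 }) // placeholder body
}

// Waker that does nothing; enough for futures that never wait on I/O.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical stand-in for a real executor: polls until the future is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn run_all() -> Vec<i32> {
    // Both functions now share the single type `fn() -> BoxFuture`.
    let arr: [fn() -> BoxFuture; 2] = [foo, bar];
    arr.iter().map(|f| block_on(f())).collect()
}

fn main() {
    println!("{:?}", run_all());
}
```

Note that every call to `foo` or `bar` performs a fresh heap allocation for the returned future.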
Performance Characteristics
use std::hint::black_box;

fn main() {
    load_test(black_box(foo));
}

fn load_test(func: fn(i32) -> i32) {
    for i in 0..250_000_000 {
        let _res = func(i);
    }
}

fn foo(arg: i32) -> i32 {
    black_box(arg * 2)
}
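The slide shows only the plain function-pointer variant of the benchmark. As a rough guess at what a boxed async variant could look like, here is a sketch; the names `foo_boxed`, `load_test_boxed` and `block_on` are assumptions, the iteration count is a parameter rather than the talk's 250,000,000, and the repository linked with the results is the authoritative source for the real benchmark code.

```rust
use std::future::Future;
use std::hint::black_box;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::Instant;

type BoxFuture = Pin<Box<dyn Future<Output = i32>>>;

// Boxed async variant: every call allocates a new future on the heap.
fn foo_boxed(arg: i32) -> BoxFuture {
    Box::pin(async move { black_box(arg * 2) })
}

// Calls the boxed async function `iterations` times and returns the last result.
async fn load_test_boxed(func: fn(i32) -> BoxFuture, iterations: i32) -> i32 {
    let mut last = 0;
    for i in 0..iterations {
        last = func(i).await;
    }
    last
}

// Waker that does nothing; the futures here never wait on anything.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical stand-in for a real async runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    let func = black_box(foo_boxed as fn(i32) -> BoxFuture);
    let start = Instant::now();
    let last = block_on(load_test_boxed(func, 1_000_000));
    println!("last = {last}, elapsed = {:?}", start.elapsed());
}
```

Timings from a sketch like this depend on the allocator, optimization level and machine, so they should not be compared directly with the numbers reported in the talk.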
Performance Characteristics
                          Time (mean ± σ)        Range (min … max)
Function Pointer          429.1 ms ± 7.0 ms      418.9 ms … 436.7 ms
Boxed Function Pointer    537.9 ms ± 2.5 ms      536.1 ms … 544.0 ms
Async Function            407.6 ms ± 3.6 ms      403.7 ms … 411.6 ms
Boxed Async Function      4.985 s ± 0.090 s      4.922 s … 5.198 s
Source: https://github.com/byronwasti/async-fn-pointer-perf
What is (Probably) Going On?
■ Boxed async functions are an order of magnitude slower than boxed functions (4.985 s vs. 537.9 ms, roughly 9x)
■ The heap allocation for a boxed async function has to hold the opaque state-machine struct the compiler generates
● A normal boxed function is just… a pointer on the heap
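One way to see the state-machine struct is to measure it. The sketch below is an illustration under assumptions, not the talk's code: `with_buffer` is a hypothetical async function that keeps a 64-byte buffer alive across an await point, which forces the buffer into the compiler-generated future, and the result is compared with the size of a plain function pointer.

```rust
use std::future::ready;
use std::mem::{size_of, size_of_val};

// The buffer is used after the await, so it has to be stored inside the
// compiler-generated state machine rather than on the stack.
async fn with_buffer(seed: u8) -> u8 {
    let buf = [seed; 64];
    ready(()).await;
    buf.iter().fold(0u8, |acc, b| acc.wrapping_add(*b))
}

// Size in bytes of the future returned by `with_buffer`.
fn future_size() -> usize {
    let fut = with_buffer(1);
    size_of_val(&fut)
}

// Size in bytes of a plain function pointer.
fn fn_pointer_size() -> usize {
    size_of::<fn(i32) -> i32>()
}

fn main() {
    println!("future:     {} bytes", future_size());
    println!("fn pointer: {} bytes", fn_pointer_size());
}
```

Boxing the future means allocating at least that many bytes on every call, whereas a function pointer is a single machine word regardless of what the function does.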
Alternative 1: `Box::pin()` at the Edge
■ Make use of generics to have one `Box::pin()` call per load test instead of one per function call.
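async fn load_test<T, F>(func: T)
where
    T: Fn() -> F,
    F: Future<Output=i32>,
{
    loop {
        func().await;
    }
}

async fn foo() -> i32 {
}

// The annotation lets both boxes coerce to the same trait-object type
let arr: [Pin<Box<dyn Future<Output=()>>>; 2] =
    [Box::pin(load_test(foo)), Box::pin(load_test(bar))];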
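For a version that terminates and can be run, the sketch below bounds the loop with an `iterations` parameter and sums the results. Those changes, the placeholder bodies of `foo` and `bar`, and the no-op-waker `block_on` are all assumptions added for illustration; a real program would use an async runtime instead.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Generic over the function, so `func().await` is statically dispatched
// and needs no allocation per call.
async fn load_test<T, F>(func: T, iterations: u32) -> i32
where
    T: Fn() -> F,
    F: Future<Output = i32>,
{
    let mut total = 0;
    for _ in 0..iterations {
        total += func().await;
    }
    total
}

async fn foo() -> i32 {
    1 // placeholder body
}

async fn bar() -> i32 {
    2 // placeholder body
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical stand-in for a real executor.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn run_all(iterations: u32) -> Vec<i32> {
    // One allocation per load test, made once at the edge.
    let tests: [Pin<Box<dyn Future<Output = i32>>>; 2] = [
        Box::pin(load_test(foo, iterations)),
        Box::pin(load_test(bar, iterations)),
    ];
    let mut results = Vec::new();
    for test in tests {
        results.push(block_on(test));
    }
    results
}

fn main() {
    println!("{:?}", run_all(3));
}
```

The type erasure still happens, but only once per load test, so the inner loop pays no allocation cost.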
Performance Characteristics
                          Time (mean ± σ)        Range (min … max)
Function Pointer          429.1 ms ± 7.0 ms      418.9 ms … 436.7 ms
Boxed Function Pointer    537.9 ms ± 2.5 ms      536.1 ms … 544.0 ms
Async Function            407.6 ms ± 3.6 ms      403.7 ms … 411.6 ms
Boxed Async Function      4.985 s ± 0.090 s      4.922 s … 5.198 s
Generic Async Boxed       318.1 ms ± 1.2 ms      317.1 ms … 320.9 ms
Source: https://github.com/byronwasti/async-fn-pointer-perf
Alternative 2: Use an Enum
async fn load_test(func: Func)
{
    loop {
        func.run().await;
    }
}

async fn foo() -> i32 {
}

async fn bar() -> i32 {
}

enum Func {
    Foo,
    Bar,
}

impl Func {
    async fn run(&self) -> i32 {
        match self {
            Func::Foo => foo().await,
            Func::Bar => bar().await,
        }
    }
}
Performance Characteristics
                          Time (mean ± σ)        Range (min … max)
Function Pointer          429.1 ms ± 7.0 ms      418.9 ms … 436.7 ms
Boxed Function Pointer    537.9 ms ± 2.5 ms      536.1 ms … 544.0 ms
Async Function            407.6 ms ± 3.6 ms      403.7 ms … 411.6 ms
Boxed Async Function      4.985 s ± 0.090 s      4.922 s … 5.198 s
Generic Async Boxed       318.1 ms ± 1.2 ms      317.1 ms … 320.9 ms
Async Enum Dispatch       526.5 ms ± 0.8 ms      525.6 ms … 528.1 ms
Source: https://github.com/byronwasti/async-fn-pointer-perf
Alternative 3: Reset the Future
■ Used by the Tower rate-limiting functionality [1]
■ Unfortunately no generic way to implement
pub struct RateLimit {
    ...
    sleep: Pin<Box<Sleep>>,
}

impl RateLimit {
    fn call() {
        ...
        // The service is disabled until further notice
        // Reset the sleep future in place, so that we don't have to
        // deallocate the existing box and allocate a new one.
        self.sleep.as_mut().reset(until);
    }
}
[1] https://docs.rs/tower/latest/src/tower/limit/rate/service.rs.html#106-109
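The same idea can be sketched with only the standard library. The example below is not Tower's code: `Countdown` is a hypothetical future that becomes ready after a fixed number of polls and exposes its own `reset` method, and `polls_until_ready` and `demo` are helper names invented here. It works only because the concrete future type is known and was written to be resettable, which is consistent with the point that there is no generic way to do this.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical future: pending for `remaining` polls, then ready.
struct Countdown {
    remaining: u32,
}

impl Countdown {
    // Rearm the future in place instead of allocating a new box.
    fn reset(self: Pin<&mut Self>, remaining: u32) {
        // `Countdown` is `Unpin`, so `get_mut` is allowed here.
        self.get_mut().remaining = remaining;
    }
}

impl Future for Countdown {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.remaining == 0 {
            Poll::Ready(())
        } else {
            self.remaining -= 1;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Polls the boxed future to completion and counts the polls it took.
fn polls_until_ready(fut: &mut Pin<Box<Countdown>>) -> u32 {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 1;
    while fut.as_mut().poll(&mut cx).is_pending() {
        polls += 1;
    }
    polls
}

// Returns the poll counts before and after the reset, and whether the
// heap address of the future stayed the same.
fn demo() -> (u32, u32, bool) {
    let mut fut = Box::pin(Countdown { remaining: 2 });
    let addr_before = &*fut as *const Countdown;
    let first = polls_until_ready(&mut fut);
    fut.as_mut().reset(3); // reuse the existing allocation
    let second = polls_until_ready(&mut fut);
    let addr_after = &*fut as *const Countdown;
    (first, second, addr_before == addr_after)
}

fn main() {
    println!("{:?}", demo());
}
```

With a type-erased `Pin<Box<dyn Future>>`, the concrete type and its `reset` method are not available, so the box has to be replaced instead.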
Implementing In Practice
■ Converted Balter to use Generics (pushing the Box::pin() to the edge)
■ Saw no performance difference
■ Function calls are ridiculously fast; a 10x slowdown is… still really fast
■ There is a Storage RFC for Rust which may add new options in the future
Thank you!
Byron Wasti
p99@byronwasti.com
www.byronwasti.com
github.com/byronwasti
