Adventures in Thread-per-core Async with Redpanda & Seastar
Travis Downs (He/Him)
Software Engineer at Redpanda
■ I love going deep on performance – all the way to assembly, if necessary
■ I’ve held principal staff positions at Salesforce & architect roles at SAP and Business Objects
■ I had hobbies like writing a software performance blog, but now I’m a parent, so…
Redpanda in 60 seconds
Redpanda is a streaming storage engine
Clients speak the Apache Kafka API to Redpanda nodes to produce to and consume from topic partitions.
Partitions are logs (~10,000s per cluster)
Each partition is a Raft group (~3 members)
Scale up and scale out should be ~equivalent
Thread-per-core
What is thread-per-core?
One thread per core and pinned: make scheduling decisions in userspace.
This thread must not block.
Question: how do we replace blocking calls?
Answer: …
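A minimal sketch of the shape of the answer (read_line_async, connection, and process are hypothetical names, not Seastar API): every call that could block is replaced by one that returns a future immediately.
// Illustrative sketch: the call returns at once; the continuation runs
// on this same core when the data is ready, so the thread never blocks.
ss::future<> handle_request(connection& conn) {
    return read_line_async(conn).then([](std::string line) {
        process(line);   // hypothetical synchronous work
    });
}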
Seastar
Seastar was created by the ScyllaDB project.
Redpanda is built on Seastar. We 😍 it.
Shared nothing architecture made up of “shards”:
■ A CPU core
■ A pool of memory NUMA-local to that core
■ All-to-all mesh of SPSC message queues (sketch below)
■ Cooperative multitasking
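Shards talk to each other only by message passing over those SPSC queues. In Seastar this surfaces as smp::submit_to, which runs a function on another shard and returns its result as a future; a minimal sketch (the lambda body is illustrative):
// Run a lambda on shard 2 and get its result back as a future
seastar::future<int> ask_shard_two() {
    return seastar::smp::submit_to(2, [] {
        return 42;   // executes on shard 2, touching only shard-2-local memory
    });
}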
Async C++ with coroutines
Continuation style
ss::future<> consensus::stop() {
    return _event_manager.stop()
      .then([this] { return _append_requests_buffer.stop(); })
      .then([this] { return _batcher.stop(); })
      .then([this] { return _bg.close(); })
      .then([this] {
          if (likely(!_snapshot_writer)) {
              return ss::now();
          }
          return _snapshot_writer->close().then(
            [this] { _snapshot_writer.reset(); });
      });
}
C++ coroutines
seastar::future<std::string> my_coroutine() {
    co_await seastar::sleep(100ms); // returns future<>
    co_return "hello world";
}
New in C++20: three new keywords
co_await
co_yield
co_return
The language provides a future concept but not an implementation: Seastar still defines the future/promise type.
When the compiler sees a co_* keyword, the function is rewritten to stash stack variables on the heap as needed to support suspension and resumption of execution.
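For intuition, the coroutine above behaves like this hand-written continuation chain (a sketch of the effect, not the compiler's actual rewriting):
seastar::future<std::string> my_coroutine_cont() {
    return seastar::sleep(100ms).then([] {
        return seastar::make_ready_future<std::string>("hello world");
    });
}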
C++20 coroutines: after
ss::future<> consensus::stop() {
    …
    co_await _event_manager.stop();
    co_await _append_requests_buffer.stop();
    co_await _batcher.stop();
    _op_lock.broken();
    co_await _bg.close();
    if (unlikely(_snapshot_writer)) {
        co_await _snapshot_writer->close();
        _snapshot_writer.reset();
    }
}
New vs old
New (coroutines):
ss::future<> consensus::stop() {
    …
    co_await _event_manager.stop();
    co_await _append_requests_buffer.stop();
    co_await _batcher.stop();
    _op_lock.broken();
    co_await _bg.close();
    if (unlikely(_snapshot_writer)) {
        co_await _snapshot_writer->close();
        _snapshot_writer.reset();
    }
}

Old (continuations):
ss::future<> consensus::stop() {
    …
    return _event_manager.stop()
      .then([this] { return _append_requests_buffer.stop(); })
      .then([this] { return _batcher.stop(); })
      .then([this] { return _bg.close(); })
      .then([this] {
          if (likely(!_snapshot_writer)) {
              return ss::now();
          }
          return _snapshot_writer->close().then(
            [this] { _snapshot_writer.reset(); });
      });
}
Coroutine Performance
Coroutine performance depends on both the framework implementing the promise type and the compiler.
Here we talk about Seastar’s implementation and clang++.
Preview: coroutines are not transparent when it comes to performance
Frame allocations
Observation: almost every coroutine allocates
Exception: if the compiler can statically prove the coro never suspends
- No suspension points (co_await or co_yield) in the function
- A suspension point exists but is never reachable
- Suspension point is reachable but never suspends
Frame allocations 2
This coroutine:
- Never suspends
- Never even executes co_await
- ~200 instructions and ~80 cycles
- Always allocates
seastar::future<> empty_coro() {
    if (always_false) {
        co_await make_ready_future<>();
    }
}
Case study: varint decode
Let’s look at a case study drawn from Redpanda code
Decode an unsigned 32-bit varint
Encoded in 1-5 bytes; an MSB of 0 indicates the final byte
Widely used in Kafka protocol (and other places)
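detail::var_decoder is Redpanda-internal; as a rough sketch of the contract the decoders below rely on, accept() folds in one byte and reports whether it was the final one (assuming standard LEB128-style decoding; error and length-bound handling omitted):
struct var_decoder {
    uint32_t value = 0;
    int shift = 0;
    // Fold in one byte; true means the final byte (MSB == 0) was consumed
    bool accept(char c) {
        value |= uint32_t(uint8_t(c) & 0x7f) << shift;
        shift += 7;
        return (uint8_t(c) & 0x80) == 0;
    }
    uint32_t result() const { return value; }
};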
Case study: coroutine decoder
read1() is async
Almost the same as the synchronous version
Allocates once per decode
ss::future<result_type> coro_decode(input_stream& s) {
    detail::var_decoder decoder;
    while (true) {
        char c = co_await s.read1();
        if (decoder.accept(c)) {
            break;
        }
    }
    co_return decoder.result();
}
~680 instructions
~220 cycles
176 bytes allocated
Case study: continuation decoder
Much harder to read (and write)
Does not allocate
Recursion is bounded by the decoder
ss::future<result_type> cont_recurse(iobuf_reader& s, var_decoder decoder) {
    return s.read1().then([&s, decoder](char c) mutable {
        if (decoder.accept(c)) {
            return ss::make_ready_future<result_type>(decoder.result());
        }
        return cont_recurse(s, decoder);
    });
}
So is it faster?
Case study: runtime comparison
Case study: mystery method 1
Optimistic approach
Avoid any async machinery if possible
Doubles the amount of code
auto cont_tricky(iobuf_reader& s, var_decoder decoder) {
    auto f = s.read1();
    while (f.available()) {
        if (decoder.accept(f.get())) {
            return decoder.result_as_future();
        }
        f = s.read1();
    }
    return std::move(f).then([&s, decoder](char c) mutable {
        if (decoder.accept(c)) {
            return decoder.result_as_future();
        }
        return cont_tricky(s, decoder);
    });
}
Case study: mystery method 2
Synchronous version
Almost identical to the coro version
Speedup varies from 4x to 9x
auto sync_decode(input_stream& s) {
    detail::var_decoder decoder;
    while (true) {
        char c = s.read1_sync();
        if (decoder.accept(c)) {
            break;
        }
    }
    return decoder.result();
}
Sync with async fallback
So how should we really do this?
Use sync with async fallback.
Peek at 5 bytes; fall back if they are not available.
The fallback must be in its own method, so the fast path never becomes a coroutine!
auto decode_fallback(iobuf_reader& s) {
    auto [buf, filled] = s.peek<5>();
    if (filled) {
        auto result = decode_u32(buf.data());
        s.skip(result.second);
        return ss::make_ready_future<result_type>(result);
    }
    return coro_decode(s);
}
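Why the fallback must be its own method: if the co_await lived in decode_fallback itself, the whole function would become a coroutine, and even the peeked fast path would pay for a frame allocation. A sketch of that anti-pattern (decode_fused is hypothetical):
// Anti-pattern: fusing the fast path and the co_await into one function
// makes the entire function a coroutine, so the fast path allocates too.
ss::future<result_type> decode_fused(iobuf_reader& s) {
    auto [buf, filled] = s.peek<5>();
    if (filled) {
        auto result = decode_u32(buf.data());
        s.skip(result.second);
        co_return result;   // still pays the coroutine frame allocation
    }
    detail::var_decoder decoder;
    while (!decoder.accept(co_await s.read1())) {
    }
    co_return decoder.result();
}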
Performance Bottom Line
Async is still cheap in the large
- Context switches are 1,000s of cycles, large cache impact
Very short coroutines may be expensive: consider continuations
Continuations have a per-continuation cost: consider coroutines
Consider sync with async fallback
Drive the above decisions via profiling
Summary: are coroutines “async made easy”?
⚠️ C++ is not memory safe, and async makes it (even) easier to write a segfault with a careless reference. Sometimes coroutines help with this.
ℹ️ Compiler bugs: LLVM is great and things get fixed fast, but coroutines are at the “early adopter” stage. Use the latest release! (e.g. llvm/llvm-project#51843)
ℹ️ Performance: it’s complicated.
✅ Net win for maintainability and robust use of RAII, and it opens the door to future compiler optimization of async code.
Travis Downs
travis.downs@redpanda.com
@trav_downs
travisdowns
Thank you! Let’s connect.
Trade offs
Alternatives
Seastar is not the only option for writing fast async code:
■ C++/asio
■ Rust/tokio
■ Various GC language options (goroutines, Java lightweight threads)
Main difference: these do not adopt Seastar’s strict share-nothing model, do not avoid atomics, and tend to only softly bind tasks to a core (e.g. tokio does work stealing).
Possibility of hybrid approaches (e.g. use Biased Reference Counting to avoid atomics while avoiding pinning all memory to cores).
Seastar also has “alien threads” for mixing in non-async code (Redpanda uses this for Kerberos libs).
Trade offs
Using C++20 & Seastar is clear net benefit for Redpanda.
It might be right for you too if you can answer yes to one or more of these:
■ Are you starting a new project where high throughput and low latency are important?
■ Does your work decompose into shard-affine units?
■ Do you need to scale to more than a few cores?
■ Is C++ your language of choice?
What makes good high-throughput software?
Keep the disk/network fed with I/Os
Conform to the system’s topology
Not just high throughput: reliably low latency
Primary success metric: P99.9 latency
Why Redpanda?
Fast
● 10x lower tail latency vs Apache Kafka
● 6x faster transactions
● Written in C++ with async, shared nothing design
● No page cache, no virtual memory
Easy
● Fully Kafka API-compatible
● Single binary
● No JVM, no ZooKeeper
● Auto tuning & balancing
● Prometheus metrics
Efficient
● Thread-per-core architecture
● Saturates your infrastructure
● Extreme throughput
● Scales both vertically and horizontally
Cost-Effective
● Reduces Kafka infra costs by 6X
● Lower admin overhead
● Limitless data ingestion and retention without local disk
Coroutines and lifetimes: example 1
A real example: a helper function for constructing and writing a message batch, from PR #9154
ss::future<std::error_code> metadata::mark_clean(model::offset clean_offset) {
    // Construct a batch builder
    auto builder = batch_start();
    // Add one message
    builder.mark_clean(clean_offset);
    // Replicate using raft, return future for replication complete
    return builder.replicate();
}
Coroutines and lifetimes: example 1
ss::future<std::error_code> metadata::mark_clean(model::offset clean_offset) {
    auto builder = batch_start();
    builder.mark_clean(clean_offset);
    return builder.replicate();
    // … builder falls out of scope here, the returned future still references it
}

// Imagine replicate() might generate a future that captures `this`
ss::future<> batch_builder::replicate() {
    return something.then([this] {
        // update some member variable here
    });
}
Coroutines and lifetimes: example 1
ss::future<std::error_code> metadata::mark_clean(model::offset clean_offset) {
    auto builder = batch_start();
    builder.mark_clean(clean_offset);
    co_return co_await builder.replicate();
}
co_awaiting the future inline ensures it completes before the referenced object falls out of scope.
Thank you coroutines! 🎉
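The general rule this illustrates, as a hedged sketch (make_object and work are hypothetical): locals live in the coroutine frame, so they survive across the co_await and the awaited future can safely reference them.
ss::future<int> use_local() {
    auto obj = make_object();        // hypothetical local object
    co_return co_await obj.work();   // obj stays alive until work() completes
}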
Coroutines and lifetimes: example 2
// Print a string after a delay
seastar::future<> delayed_print(const std::string& msg) {
    co_await seastar::sleep(100ms);
    std::cout << "delayed_print: " << msg << std::endl;
}

// Print hello world after a delay
seastar::future<> delayed_hello_world() {
    return delayed_print(std::string("hello world!"));
}
Coroutines and lifetimes: example 2
// Print a string after a delay
seastar::future<> delayed_print(std::string msg) {
    co_await seastar::sleep(100ms);
    std::cout << "delayed_print: " << msg << std::endl;
}

// Print hello world after a delay
seastar::future<> delayed_hello_world() {
    return delayed_print(std::string("hello world!"));
}

Pass by value is not expensive in this case: temporaries are rvalues, so the argument is moved, not copied.
Always pass by value if you can, to avoid this class of issue.
Hardware evolution
Not just CPUs:
■ Disk (SSD -> NVMe)
■ Network (100Gbps, 400Gbps)
Usually partitioned for virtualized workloads
What if we want to run one high throughput application on the whole machine?