"Intro to Stateful Services or How to get 1 million RPS from a single node", Anton Moldovan

AGENDA
- intro
- why stateless is slow and less reliable
- tools for building stateful services

PART I
intro to sportsbook domain
and
how we came to stateful

Dynamo Kyiv vs Chelsea, 2 : 1
Red card, Score changed, Odds changed
PUSH / PULL

- quite big payloads: 30 KB compressed data (1.5 MB uncompressed)
- update rate: 2K RPS (per tenant)
- user query rate: 3-4K RPS (per tenant)
- live data is very dynamic: it makes little sense to cache it
- data should be queryable: simple KV is not enough
- we need secondary indexes
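The last two requirements can be illustrated with a minimal sketch of an in-process store that keeps a primary index plus one secondary index. This is not the implementation from the talk: the `LiveEvent` record, its fields, and the `EventStore` class are hypothetical names chosen for illustration, and the sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; the field names are illustrative only.
public sealed record LiveEvent(int Id, string Sport, string Score);

// In-process store: a primary index by Id plus a secondary index by Sport.
public sealed class EventStore
{
    private readonly Dictionary<int, LiveEvent> _byId = new();
    private readonly Dictionary<string, HashSet<int>> _bySport = new();

    public void Upsert(LiveEvent e)
    {
        // Drop the stale secondary-index entry if the event already exists.
        if (_byId.TryGetValue(e.Id, out var old) &&
            _bySport.TryGetValue(old.Sport, out var oldIds))
        {
            oldIds.Remove(e.Id);
        }

        _byId[e.Id] = e;
        if (!_bySport.TryGetValue(e.Sport, out var ids))
        {
            ids = new HashSet<int>();
            _bySport[e.Sport] = ids;
        }
        ids.Add(e.Id);
    }

    // Primary-key lookup; returns null when the id is unknown.
    public LiveEvent? Get(int id) => _byId.TryGetValue(id, out var e) ? e : null;

    // Secondary-index query: all events of one sport, ordered by Id.
    public IReadOnlyList<LiveEvent> QueryBySport(string sport) =>
        _bySport.TryGetValue(sport, out var ids)
            ? ids.Select(id => _byId[id]).OrderBy(e => e.Id).ToList()
            : new List<LiveEvent>();
}
```

A query such as `store.QueryBySport("football")` is answered from memory without a full scan, which a plain key-value cache cannot do.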

20 KB payload for concurrent read and write
Redis, single node: 4vcpu - 8gb
Redis write: 4K RPS, p75 = 543, p95 = 688, p99 = 842
Redis read: 7K RPS, p75 = 970, p95 = 1278, p99 = 1597

API
Cache
DB
but Cache is not queryable

How to handle a case when
your data is larger than RAM?
10 GB 30 GB

Solution 1: use memory DB that supports data larger than RAM
10 GB
20 GB

UA PL FR
Solution 2: use partition by tenant

Solution 3: use range-based sharding
users
(1-500)
users
(501-1000)
shard A shard B
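Range-based sharding as drawn on this slide can be sketched as a lookup over sorted upper bounds. The `ShardRouter` class is a hypothetical name; only the two ranges (users 1-500 and 501-1000) come from the slide.

```csharp
using System;

public static class ShardRouter
{
    // Inclusive upper bounds, sorted ascending: users 1-500 -> shard A, 501-1000 -> shard B.
    private static readonly (int UpperBound, string Shard)[] Ranges =
    {
        (500, "shard A"),
        (1000, "shard B"),
    };

    // Returns the shard that owns the given user id.
    public static string ShardFor(int userId)
    {
        if (userId < 1)
            throw new ArgumentOutOfRangeException(nameof(userId));

        foreach (var (upperBound, shard) in Ranges)
        {
            if (userId <= upperBound) return shard;
        }
        throw new ArgumentOutOfRangeException(nameof(userId), "no shard covers this id");
    }
}
```

For example, `ShardRouter.ShardFor(501)` returns `"shard B"`. Adding a shard means adding a range, so each node only has to hold its own slice of the data in RAM.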

API + DB
Stateful Service
API
Cache
DB
network latency
network latency

Latency Numbers
Operation                              2010     2020
Compress 1KB with Zippy                2μs      2μs
Read 1 MB sequentially from RAM        30μs     3μs
Read 1 MB sequentially from SSD        494μs    49μs
Read 1 MB sequentially from disk       3ms      825μs
Round trip within same datacenter      500μs    500μs
Send packet CA -> Netherlands -> CA    150ms    150ms
Source: https://colin-scott.github.io/personal_website/research/interactive_latency.html
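To make the gap concrete, here is a rough back-of-the-envelope sketch using the 2020 figures above. The model is an assumption made for illustration, not something stated in the talk: every network hop costs one datacenter round trip, and the data is then read from RAM once. `LatencyEstimate` and `RequestUs` are hypothetical names.

```csharp
public static class LatencyEstimate
{
    // 2020 figures from the table above, in microseconds.
    public const double ReadOneMbFromRamUs = 3;
    public const double DatacenterRoundTripUs = 500;

    // Assumed model: each network hop costs one round trip,
    // then 1 MB is read sequentially from RAM once.
    public static double RequestUs(int networkHops) =>
        networkHops * DatacenterRoundTripUs + ReadOneMbFromRamUs;
}
```

Under this model an in-process read (`RequestUs(0)`) costs 3μs, while one hop to a cache (`RequestUs(1)`) costs 503μs, so the network dominates long before the memory read does.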

API
Cache
DB
CPU: for serialize/deserialize
CPU: serialize/deserialize
API + DB
Stateful Service
CPU for serialize (we don’t need to deserialize)

API
Cache
DB
CPU for ASYNC request handling
CPU: ASYNC request handling
API + DB
Stateful Service

API
Cache
DB
CPU: managing sockets
API + DB
Stateful Service
CPU for managing sockets (only client sockets)

API
Cache
DB
API + DB
Stateful Service
CPU for handling query (very cheap compared to
serialization)

API
Cache
DB
Overreads
API + DB
Stateful Service
CPU for handling query (very cheap compared to
serialization)

Object hit rate / Transactional hit rate

A B
C
API
In order to fulfill our transactional flow we need to
fetch records: A, B, C
Records A and B will not impact our latency
Overall Latency = Latency of record C

Most existing cache eviction algorithms focus on maximizing
object hit rate, or the fraction of single object requests served
from cache. However, this approach fails to capture the
inter-object dependencies within transactions.
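The difference between the two metrics can be shown with a small sketch. It assumes a transaction counts as a hit only when every record it needs is cached; the `HitRates` class and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class HitRates
{
    // Fraction of individual record lookups served from cache.
    public static double ObjectHitRate(IEnumerable<string[]> transactions, ISet<string> cached)
    {
        var keys = transactions.SelectMany(t => t).ToList();
        if (keys.Count == 0) return 0;
        return (double)keys.Count(k => cached.Contains(k)) / keys.Count;
    }

    // Fraction of transactions whose records are ALL served from cache.
    public static double TransactionalHitRate(IEnumerable<string[]> transactions, ISet<string> cached)
    {
        var list = transactions.ToList();
        if (list.Count == 0) return 0;
        return (double)list.Count(t => t.All(k => cached.Contains(k))) / list.Count;
    }
}
```

With one transaction that needs A, B and C, and a cache holding only A and B, the object hit rate is 2/3 while the transactional hit rate is 0: the request still waits for C.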

async / await
Imagine that we run Redis on localhost. Even with such a setup, we
usually use async request handling.

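// Baseline: a plain synchronous call in a loop.
// Iterations is the benchmark's loop count, defined elsewhere in the class.
public void SimpleMethod()
{
    var k = 0;
    for (int i = 0; i < Iterations; i++)
    {
        k = Add(i, i);
    }
}

// NoInlining (from System.Runtime.CompilerServices) keeps the JIT from inlining the call away.
[MethodImpl(MethodImplOptions.NoInlining)]
private int Add(int a, int b) => a + b;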

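// Async call that completes synchronously: the task is already finished when awaited.
public async Task SimpleMethodAsync()
{
    var k = 0;
    for (int i = 0; i < Iterations; i++)
    {
        k = await AddAsync(i, i);
    }
}

private Task<int> AddAsync(int a, int b)
{
    return Task.FromResult(a + b);
}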

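// Async call that really yields: Task.Yield forces the continuation to be rescheduled.
public async Task SimpleMethodAsyncYield()
{
    var k = 0;
    for (int i = 0; i < Iterations; i++)
    {
        k = await AddAsync(i, i);
    }
}

private async Task<int> AddAsync(int a, int b)
{
    await Task.Yield();
    return a + b;
}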

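// Async call that yields and then hops to the thread pool via Task.Run.
public async Task SimpleMethodAsyncYield()
{
    var k = 0;
    for (int i = 0; i < Iterations; i++)
    {
        k = await AddAsync(i, i);
    }
}

private async Task<int> AddAsync(int a, int b)
{
    await Task.Yield();
    return await Task.Run(() => a + b);
}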

PART III
why stateless is less reliable

API
Cache
DB
API + DB
Stateful Service
We have a higher probability of failure
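The claim follows from simple arithmetic: if a request needs every component on its path to be up, and failures are independent, the availabilities multiply. The sketch below assumes exactly that; the `Availability` class and the 99.9% figure in the usage note are hypothetical, not numbers from the talk.

```csharp
using System.Linq;

public static class Availability
{
    // Availability of a request path that needs every component to be up,
    // assuming the components fail independently of each other.
    public static double Serial(params double[] componentAvailabilities) =>
        componentAvailabilities.Aggregate(1.0, (acc, a) => acc * a);
}
```

With three components at 99.9% each (API, Cache, DB), `Availability.Serial(0.999, 0.999, 0.999)` is about 0.997, which is lower than the 0.999 of a single stateful node under the same assumption.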

API
Cache
DB
circuit breaker
retry
fallback
timeout
bulkhead isolation
circuit breaker
retry
fallback
timeout
bulkhead isolation
API + DB
Stateful Service

API
Cache
DB
What about cache invalidation
and data consistency?
API + DB
Stateful Service

API
Cache
DB
What about the predictable scale-out?
Will your RPS increase if you add an
additional API or Cache node?
API + DB
Stateful Service

- Metastable failures occur in open systems with an uncontrolled source of
load where a trigger causes the system to enter a bad state that persists
even when the trigger is removed.
- Paradoxically, the root cause of these failures is often features that
improve the efficiency or reliability of the system.
- The characteristic of a metastable failure is that the sustaining effect keeps
the system in the metastable failure state even after the trigger is
removed.
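A toy simulation can make the sustaining effect visible. Everything in it is an assumption chosen for illustration: clients retry every failed request on the next tick without limit, and once the offered load exceeds capacity, goodput falls as capacity² / offered because work is wasted on requests that will be retried anyway. `MetastableModel` and its parameters are hypothetical names.

```csharp
using System;

public static class MetastableModel
{
    // Assumed overload behaviour: below capacity everything is served;
    // above it, goodput shrinks as more work is wasted.
    public static double Goodput(double offered, double capacity) =>
        offered <= capacity ? offered : capacity * capacity / offered;

    // Returns goodput per tick. Capacity drops to `degraded`
    // for ticks in [triggerStart, triggerEnd), then returns to normal.
    public static double[] Run(int ticks, int triggerStart, int triggerEnd,
                               double load = 80, double capacity = 100, double degraded = 40)
    {
        var goodput = new double[ticks];
        double retries = 0;
        for (int t = 0; t < ticks; t++)
        {
            bool triggered = t >= triggerStart && t < triggerEnd;
            double offered = load + retries;   // new requests plus retried failures
            double served = Goodput(offered, triggered ? degraded : capacity);
            retries = offered - served;        // every failure is retried next tick
            goodput[t] = served;
        }
        return goodput;
    }
}
```

In `MetastableModel.Run(30, 10, 15)` the system serves the full load of 80 before the trigger, but goodput stays below 80 long after capacity is restored at tick 15, because the retry backlog keeps the offered load above capacity.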

At least 4 out of 15 major outages in the
last decade at Amazon Web Services
were caused by metastable failures.

PART IV
tools for building stateful services

distributed log with sync replication
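As a rough illustration of what synchronous replication means for a log, the sketch below acknowledges an append only after every follower holds the entry. It is an in-memory toy under stated assumptions, not the tool used in the talk: `Follower`, `SyncReplicatedLog`, and the `IsUp` flag are hypothetical, followers are plain objects instead of network peers, and there is no leader election or recovery.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory follower; a real one would sit behind a network call.
public sealed class Follower
{
    public List<string> Log { get; } = new();
    public bool IsUp { get; set; } = true;

    public void Append(string entry) => Log.Add(entry);
}

public sealed class SyncReplicatedLog
{
    private readonly List<string> _log = new();
    private readonly IReadOnlyList<Follower> _followers;

    public SyncReplicatedLog(IReadOnlyList<Follower> followers) => _followers = followers;

    public int Count => _log.Count;

    // Returns the offset of the entry, and only after every follower has it.
    public int Append(string entry)
    {
        // Refuse the write up front if any follower is unavailable,
        // so no replica ends up ahead of the others.
        foreach (var f in _followers)
        {
            if (!f.IsUp) throw new InvalidOperationException("follower unavailable");
        }

        foreach (var f in _followers)
        {
            f.Append(entry);   // synchronous: the caller waits for each replica
        }

        _log.Add(entry);
        return _log.Count - 1;
    }
}
```

The trade-off in this sketch is that an acknowledged entry exists on every replica, at the cost of write latency and of rejecting writes while any follower is down.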

2 : 1
Red card, Score changed, Odds changed
PUSH / PULL
- quite big payloads: 30 KB compressed data (1.5 MB uncompressed)
- update rate: 2K RPS (per tenant)
- user query rate: 3-4K RPS (per tenant)
- live data is very dynamic: it makes little sense to cache it
- data should be queryable: simple KV is not enough
- we need secondary indexes
At peak, to handle a big load for 1 tenant, we have:
5-10 nodes, 0.5-2 CPU, 6GB RAM

THANKS
always benchmark
https://twitter.com/antyadev

