cachegrand: A Take on High Performance Caching

Daniele Salvatore Albano
Senior SWE II at Microsoft

■ cachegrand is a personal project to provide a blazing fast caching solution
■ I love performance: there is no good reason to let the
hardware go underutilized!
■ Outside work, I spend a lot of time playing with embedded
hardware (e.g., ESP32, RPi); my latest project was a
security camera built on an ESP32 that uses both a normal camera
and a thermal one for better motion detection

What is cachegrand?
■ cachegrand is a modern, blazing fast OSS caching platform, designed for
performance 🚀.
■ cachegrand is built for speed, written in C, scales vertically, almost linearly
● It’s a modern general-purpose solution: it’s fast across the board, not only in specific cases
● Network stack bypass and the on-disk database are in the works!
■ Aims to be protocol & command compatible with the best-known caching
solutions

Why cachegrand?
■ Modern hardware requires modern software to express all its power
■ The architecture of the most popular alternative (Redis) is outdated and it
doesn’t scale vertically: more cores won't result in better performance
■ Up to 5.1 Mops GET / 4.5 Mops SET, 40x faster than Redis with 64x more load
■ Up to 60 Mops GET / 26 Mops SET with batching
Benchmarked on 1 x AMD EPYC 7502, 32 cores, 64 threads, 256GB RAM @ 3200 MHz, Ubuntu 22.04, memtier with 64-byte payloads

This Is How It Looks Under the Hood

Requests per Second, no Batching
Benchmarked with 3 x AMD EPYC 7502 machines with 2 x 25Gbit network links: one machine for cachegrand and two for memtier_benchmark

Latencies

Requests per Second with Batching

How can it be so Fast?
Benchmarks done on 1 x AMD EPYC 7502, 32 cores, 64 threads, 256GB RAM @ 3200 MHz,
using Ubuntu 22.04

Custom Memory Allocator
■ cachegrand has its own memory allocator
■ A mix of the kernel slab allocator and tcmalloc
■ It uses Huge Pages to alloc and free mem in O(1)
● Rounding an address down to a 2MB boundary (address - address % 2MB)
gives a pointer to the start of the Huge Page, where the metadata lives
● Statistics and double-free catching at no cost
● Optimized for long-lived threads (although a cross-thread
free only requires 1 CAS) and 2^x memory allocation sizes
● Needs improvements, e.g., it does unnecessary branching
■ Some memory is wasted but perfs are amazing
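
The address trick above can be sketched in a few lines of C. This is a minimal illustration, not cachegrand's actual allocator: the `page_metadata_t` layout and the helper names are hypothetical, and a 2MB-aligned `aligned_alloc` region stands in for a real Huge Page. It assumes the metadata sits at the very start of each 2MB page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define HUGEPAGE_SIZE ((size_t)2 * 1024 * 1024)

// Hypothetical metadata stored at the very start of each 2MB page.
typedef struct page_metadata {
    uint32_t object_size;     // all objects in a page share one size
    uint32_t objects_in_use;  // cheap per-page statistics
} page_metadata_t;

// Stand-in for a Huge Page: a 2MB region aligned to 2MB. A real allocator
// would more likely obtain it from the kernel (assumption, not shown here).
static page_metadata_t *page_create(uint32_t object_size) {
    assert(object_size >= sizeof(page_metadata_t));
    page_metadata_t *page = aligned_alloc(HUGEPAGE_SIZE, HUGEPAGE_SIZE);
    if (page != NULL) {
        page->object_size = object_size;
        page->objects_in_use = 0;
    }
    return page;
}

// Slot 0 is reserved for the metadata, objects start at slot 1.
static void *page_object_at(page_metadata_t *page, uint32_t index) {
    return (char *)page + (size_t)(index + 1) * page->object_size;
}

// O(1) lookup: round the address down to the previous 2MB boundary.
static page_metadata_t *metadata_from_pointer(const void *ptr) {
    uintptr_t address = (uintptr_t)ptr;
    return (page_metadata_t *)(address - (address % HUGEPAGE_SIZE));
}
```

Because the page is aligned to its own size, a free operation can find the owning page from the object pointer alone, without any lookup table.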

Fibers
■ Context switching threads is slow, especially non-pinned ones
■ cachegrand uses fibers, which switch up to 60x faster
● A fiber switch costs a bit more than a couple of function calls
● 10k ctx switches take less than 0.25ms
● Pinned threads would need 14ms!
● Thread pools help, but they have to carry the user data
context around and require help for I/O
■ Cooperative switching is a win-win
● Thread pools in general do preemptive switching, so a ctx
switch might happen in the middle of a critical section
● A critical section will never be paused to run another
fiber, the app decides
● 10k fibers with 64 cores and 1 thread per core means
64 ops in flight at most, with no risk of having to deal with
half-started operations interrupted by a context switch
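
Cooperative switching can be sketched with the POSIX `ucontext` API. This is only an illustration of the idea, not how cachegrand implements its fibers: the `fiber_t` type, the function names and the round-robin scheduler are hypothetical, and on Linux with glibc `swapcontext` also saves the signal mask, so a performance-oriented fiber library would typically use a lighter, hand-written switch.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

#define FIBER_STACK_SIZE (64 * 1024)

typedef struct fiber {
    ucontext_t context;            // saved registers and stack pointer
    ucontext_t *scheduler;         // where to go back to on yield
    char stack[FIBER_STACK_SIZE];  // private stack of the fiber
    int steps_done;
    int finished;
} fiber_t;

static fiber_t *current_fiber;     // fiber currently running on this thread

// The fiber itself decides when to give the CPU back to the scheduler.
static void fiber_yield(void) {
    fiber_t *self = current_fiber;
    swapcontext(&self->context, self->scheduler);
}

// Example workload: three steps, yielding only between steps, never in the
// middle of one.
static void fiber_entry(void) {
    fiber_t *self = current_fiber;
    for (int i = 0; i < 3; i++) {
        self->steps_done++;        // a "critical section" that is never interrupted
        fiber_yield();
    }
    self->finished = 1;            // returning resumes uc_link (the scheduler)
}

static void fiber_init(fiber_t *fiber, ucontext_t *scheduler) {
    getcontext(&fiber->context);
    fiber->context.uc_stack.ss_sp = fiber->stack;
    fiber->context.uc_stack.ss_size = sizeof(fiber->stack);
    fiber->context.uc_link = scheduler;
    fiber->scheduler = scheduler;
    fiber->steps_done = 0;
    fiber->finished = 0;
    makecontext(&fiber->context, fiber_entry, 0);
}

// Runs all the fibers round-robin on the calling thread and returns the
// number of times a fiber was resumed.
static int fibers_run_round_robin(fiber_t *fibers, size_t count) {
    ucontext_t scheduler;
    int resumes = 0;
    size_t alive = count;
    for (size_t i = 0; i < count; i++) {
        fiber_init(&fibers[i], &scheduler);
    }
    while (alive > 0) {
        alive = 0;
        for (size_t i = 0; i < count; i++) {
            if (fibers[i].finished) continue;
            current_fiber = &fibers[i];
            swapcontext(&scheduler, &fibers[i].context);
            resumes++;
            if (!fibers[i].finished) alive++;
        }
    }
    return resumes;
}
```

With one such scheduler per pinned thread, the number of operations running at any instant is bounded by the number of threads, no matter how many fibers exist.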

Data Structures - Optimize for the CPU (1/2)
■ Data is often organized logically to make the code more readable, but that
doesn’t help performance
■ The CPU processes data in a very different way; Data Oriented Development
helps to provide specific optimizations by organizing data so that the CPU can
better leverage cachelines
■ A hash search in a hashtable using a linear search is a typical example
■ cachegrand’s hashtable uses 2 separate arrays: one with only the hashes, to
have all the values stored sequentially, and another with keys and values
● Normally hashtables use only 1 array with hashes, keys and values
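
The split layout can be sketched as follows. This is a simplified illustration under stated assumptions, not cachegrand's hashtable: the table size, the `kv_t` type, the FNV-1a hash and the function names are all hypothetical, and there is no resizing or deletion. The point is that the probe loop only touches the dense `hashes` array and dereferences the larger key/value entry only on a hash match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUCKETS 64  // hypothetical, tiny on purpose

typedef struct kv {
    const char *key;
    const char *value;
} kv_t;

typedef struct hashtable {
    uint32_t hashes[BUCKETS];  // dense array: 16 hashes per 64-byte cacheline
    kv_t entries[BUCKETS];     // touched only when a hash matches
} hashtable_t;

// FNV-1a, with 0 reserved to mark an empty bucket.
static uint32_t hash_key(const char *key) {
    uint32_t hash = 2166136261u;
    for (; *key != '\0'; key++) {
        hash ^= (uint8_t)*key;
        hash *= 16777619u;
    }
    return hash == 0 ? 1 : hash;
}

// Returns 1 on success, 0 if the table is full.
static int hashtable_set(hashtable_t *ht, const char *key, const char *value) {
    uint32_t hash = hash_key(key);
    for (uint32_t i = 0; i < BUCKETS; i++) {
        uint32_t slot = (hash + i) % BUCKETS;
        int is_empty = ht->hashes[slot] == 0;
        if (is_empty || (ht->hashes[slot] == hash &&
                         strcmp(ht->entries[slot].key, key) == 0)) {
            ht->hashes[slot] = hash;
            ht->entries[slot].key = key;
            ht->entries[slot].value = value;
            return 1;
        }
    }
    return 0;
}

// Linear probing over the hashes only; keys are compared on a hash match.
static const char *hashtable_get(const hashtable_t *ht, const char *key) {
    uint32_t hash = hash_key(key);
    for (uint32_t i = 0; i < BUCKETS; i++) {
        uint32_t slot = (hash + i) % BUCKETS;
        if (ht->hashes[slot] == 0) return NULL;  // empty bucket ends the probe
        if (ht->hashes[slot] == hash &&
            strcmp(ht->entries[slot].key, key) == 0) {
            return ht->entries[slot].value;
        }
    }
    return NULL;
}
```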

Data Structures - Optimize for the CPU (2/2)
cachegrand’s hashtable uses linear open addressing and stores values up to 448
buckets away from the initial position to improve the load factor.
This optimization provides up to 6x better performance in the worst-case
scenario.

Data Structures - SIMD (1/2)
■ Another very common pattern to improve performance when processing
data is the usage of Single Instruction Multiple Data (aka SIMD)
● SIMD allows performing the same operation on multiple data elements at once (e.g., AVX2 works on up to 256 bits)
● It covers a wide range of cases, from complex math calculations to string comparisons
■ SIMD is able to handle very complex scenarios but it also works great for
simple use cases, e.g., linear searches
■ The amount of parallel SIMD operations is often limited by the memory
bandwidth and the processing units in the CPU
● With AVX-512 the temperature also becomes a factor that needs to be taken into account
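
A SIMD linear search over an array of 32-bit hashes can be sketched with AVX2 intrinsics. This is an illustrative sketch, not cachegrand's code: the function names are hypothetical, the GCC/Clang `target` attribute and `__builtin_cpu_supports` are used so that the file builds without special flags, and a scalar fallback covers CPUs without AVX2.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_SIMD 1
#endif

// Scalar fallback: one comparison per iteration.
static int find_hash_scalar(const uint32_t *hashes, int count, uint32_t needle) {
    for (int i = 0; i < count; i++) {
        if (hashes[i] == needle) return i;
    }
    return -1;
}

#ifdef HAVE_X86_SIMD
// AVX2 version: compares 8 hashes (256 bits) per iteration.
__attribute__((target("avx2")))
static int find_hash_avx2(const uint32_t *hashes, int count, uint32_t needle) {
    __m256i needle_vec = _mm256_set1_epi32((int)needle);
    int i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(hashes + i));
        __m256i equal = _mm256_cmpeq_epi32(chunk, needle_vec);
        // One bit per 32-bit lane, set when the lane matched.
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(equal));
        if (mask != 0) {
            return i + __builtin_ctz((unsigned)mask);  // first matching lane
        }
    }
    for (; i < count; i++) {  // remaining elements, fewer than 8
        if (hashes[i] == needle) return i;
    }
    return -1;
}
#endif

// Returns the index of the first match, or -1 if the hash is not present.
static int find_hash(const uint32_t *hashes, int count, uint32_t needle) {
#ifdef HAVE_X86_SIMD
    if (__builtin_cpu_supports("avx2")) {
        return find_hash_avx2(hashes, count, needle);
    }
#endif
    return find_hash_scalar(hashes, count, needle);
}
```

Keeping the hashes in their own contiguous array is what makes this kind of search possible: every 256-bit load contains only values worth comparing.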

Data Structures - SIMD (2/2)
cachegrand’s hashtable leverages SIMD to get up to 2.5x better performance
in the worst-case scenario.

Data Structures - Localized Spinlocks (1/2)
■ Often a single lock is used to sync access to the data: easy, but terrible perfs
■ Localizing locks inside the data can lead to a massive waste of memory
● E.g., an array where each element has a lock
● In cachegrand’s hashtable it’s even more complex, as it’s required to lock a sequence of buckets
■ It’s possible to split the data in sections and provide localized locks
■ Useful in a hashtable, where this reduces the risk of contention
■ With less contention, spinlocks help to reduce the latency, thanks to less ctx switching
● Spinlocks in user space are “fake”, they can’t prevent the kernel from preempting the thread
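
Splitting the data in sections with one spinlock each can be sketched with C11 atomics. This is a minimal illustration with hypothetical names and sizes, not cachegrand's locking scheme: one `atomic_flag` protects a section of 16 buckets, so two threads contend only when they touch the same section. Locking a sequence of buckets that crosses a section boundary, as mentioned above, would need more than one lock and is not covered here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUCKETS 1024
#define BUCKETS_PER_LOCK 16  // hypothetical section size
#define LOCKS (BUCKETS / BUCKETS_PER_LOCK)

typedef struct striped_table {
    atomic_flag locks[LOCKS];    // one spinlock per section, not per bucket
    uint64_t counters[BUCKETS];  // the data protected by the locks
} striped_table_t;

static void table_init(striped_table_t *table) {
    for (int i = 0; i < LOCKS; i++) {
        atomic_flag_clear(&table->locks[i]);
    }
    for (int i = 0; i < BUCKETS; i++) {
        table->counters[i] = 0;
    }
}

// Returns true if the lock was acquired without waiting.
static bool section_trylock(atomic_flag *lock) {
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

// Busy-waits in user space, with no system call and no voluntary ctx switch.
static void section_lock(atomic_flag *lock) {
    while (!section_trylock(lock)) {
        // spin until the current holder releases the section
    }
}

static void section_unlock(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
}

static atomic_flag *lock_for_bucket(striped_table_t *table, uint32_t bucket) {
    return &table->locks[bucket / BUCKETS_PER_LOCK];
}

// Updates a bucket while holding only the lock of its own section.
static void bucket_increment(striped_table_t *table, uint32_t bucket) {
    atomic_flag *lock = lock_for_bucket(table, bucket);
    section_lock(lock);
    table->counters[bucket]++;
    section_unlock(lock);
}
```

The memory overhead is one flag per 16 buckets instead of one per bucket, which is the trade-off between a single global lock and a lock in every element.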

Data Structures - Localized Spinlocks (2/2)
cachegrand’s hashtable, thanks to the various performance patterns implemented,
can perform up to ~85 Mops on inserts, which are more expensive than updates.
Doubling the number of cores provides almost twice the perfs: with 64 threads
the real perfs improvement is 170%, only 30 points less than the ideal target of 200%.

io_uring - networking
io_uring is a “new” async API which
provides the ability to batch various I/O
ops using rings shared between an app
and the kernel.
io_uring dramatically reduces the time spent
switching from/to kernel space; the
extra CPU time can be spent by the OS on
actually doing the I/O or by the
application.
These benchmarks have been carried out on
Linux kernel 5.8; kernel 6.0 introduces
several improvements which will provide
even better performance

To Try it Out…
> curl https://raw.githubusercontent.com/danielealbano/cachegrand/main/etc/cachegrand.yaml.skel -o /path/to/cachegrand.yaml
> nano / vim / … /path/to/cachegrand.yaml # edit the config file if needed
> docker run \
    -v /path/to/cachegrand.yaml:/etc/cachegrand/cachegrand.yaml \
    --ulimit memlock=-1:-1 \
    --ulimit nofile=262144:262144 \
    -p 6379:6379 \
    cachegrand/cachegrand-server:latest

d.albano@gmail.com
@daniele_dll
