cachegrand: A Take on High Performance Caching
Daniele Salvatore Albano
Senior SWE II at Microsoft
■ Personal project to provide a blazing fast caching solution
■ Love performance, there are no good reasons to let the
hardware go underutilized!
■ Outside work, I spend a lot of time playing with embedded
hardware (e.g., ESP32, RPi); my last project was a security
camera built on an ESP32 that combines a normal camera and
a thermal one for better motion detection
What is cachegrand?
■ cachegrand is a modern, blazing fast OSS caching platform, designed for
performance 🚀.
■ cachegrand is built for speed, written in C, and scales vertically, almost linearly
● It’s a modern general-purpose solution: it isn’t fast only in specific cases
● Work is ongoing on the network stack bypass and the on-disk database!
■ Aims to be protocol and command compatible with the best-known caching
solutions
Why cachegrand?
■ Modern hardware requires modern software to express all its power
■ The architecture of the most popular alternative (Redis) is outdated and it
doesn’t scale vertically: more cores won't result in better performance
■ Up to 5.1 Mops GET / 4.5 Mops SET, 40x faster than Redis with 64x more load
■ Up to 60 Mops GET / 26 Mops SET with batching
Benchmarked on 1 x AMD EPYC 7502 (32 cores, 64 threads), 256 GB RAM @ 3200 MHz, Ubuntu 22.04, memtier_benchmark with 64-byte payloads
This is How it Looks Under the Hood
Some Numbers First
Requests per Second, no Batching
Used 3 x AMD EPYC 7502 with 2 x 25Gbit network links, one for cachegrand and two for memtier_benchmark
Latencies
Used 3 x AMD EPYC 7502 with 2 x 25Gbit network links, one for cachegrand and two for memtier_benchmark
Requests per Second with Batching
Used 3 x AMD EPYC 7502 with 2 x 25Gbit network links, one for cachegrand and two for memtier_benchmark
How can it be so Fast?
Benchmarks done on 1 x AMD EPYC 7502 (32 cores, 64 threads), 256 GB RAM @ 3200 MHz,
running Ubuntu 22.04
Custom Memory Allocator
■ cachegrand has its own memory allocator
■ Mix between the kernel slab allocator and tcmalloc
■ It uses Huge Pages to allocate and free memory in O(1)
● Rounding an address down to a 2MB boundary, i.e., address - (address % 2MB),
gives a pointer to the start of the Huge Page, which holds the metadata
● Statistics and double-free catching at no cost
● Optimized for long-lived threads (although a cross-thread
free only requires 1 CAS) and for 2^x memory allocation sizes
● Needs improvements, e.g., it does unnecessary branching
■ Some memory is wasted, but performance is amazing
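The Huge Page lookup above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not cachegrand's actual allocator: the `page_metadata_t` layout and the function names are hypothetical, and the sketch assumes every Huge Page is 2MB-aligned with its metadata at the very start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HUGEPAGE_SIZE ((uintptr_t)2 * 1024 * 1024) /* 2MB */

/* Hypothetical metadata header, assumed to sit at the start of each Huge Page. */
typedef struct {
    uint32_t object_size;    /* size class served by this page */
    uint32_t objects_in_use; /* simple usage statistic */
} page_metadata_t;

/* Offset of the address inside its Huge Page: address % 2MB. */
static inline uintptr_t hugepage_offset(const void *ptr) {
    assert(ptr != NULL);
    return (uintptr_t)ptr & (HUGEPAGE_SIZE - 1);
}

/* Start of the Huge Page: the address minus its offset. No lookup
 * table is needed, so finding the metadata on free() is O(1). */
static inline page_metadata_t *hugepage_metadata(const void *ptr) {
    return (page_metadata_t *)((uintptr_t)ptr - hugepage_offset(ptr));
}
```

Because the page size is a power of two, the modulo reduces to a single bit mask, which keeps the free path short.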
Fibers
■ Context switching between threads is slow, especially for non-pinned threads
■ cachegrand uses fibers, up to 60x faster
● A bit more costly than a couple of function calls
● 10k context switches take less than 0.25 ms
● Pinned threads would need 14 ms for the same 10k switches!
● Thread pools help, but they have to carry the user data
context around and need extra support for I/O
■ Cooperative switching is a win-win
● Thread pools in general do preemptive switching, so a context
switch might happen in the middle of a critical section
● A critical section will never be paused to run another
fiber: the app decides when to switch
● 10k fibers with 64 cores and 1 thread per core means at most
64 operations in flight at once, with no risk of half-started
operations being interrupted by a context switch
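Cooperative switching can be illustrated with the POSIX `ucontext` API (Linux/glibc). This is a stand-in chosen for brevity, not cachegrand's fiber implementation: `swapcontext` typically also saves the signal mask, so it is slower than a hand-written switch. The names `fiber_yield`, `fiber_body`, `run_demo` and `trace` are hypothetical.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];
static int trace[8];
static int trace_len = 0;

/* The fiber gives up the CPU only here, at a point it chooses itself. */
static void fiber_yield(void) {
    swapcontext(&fiber_ctx, &main_ctx);
}

static void fiber_body(void) {
    trace[trace_len++] = 1; /* first half of the work */
    fiber_yield();          /* explicit, cooperative switch */
    trace[trace_len++] = 3; /* resumed exactly where it stopped */
}

/* Runs the fiber in two steps and returns the number of trace entries. */
static int run_demo(void) {
    int rc = getcontext(&fiber_ctx);
    assert(rc == 0);
    (void)rc;
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof(fiber_stack);
    fiber_ctx.uc_link = &main_ctx; /* where to go when fiber_body returns */
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx); /* run the fiber until it yields */
    trace[trace_len++] = 2;
    swapcontext(&main_ctx, &fiber_ctx); /* resume the fiber until it ends */
    trace[trace_len++] = 4;
    return trace_len;
}
```

After `run_demo()` the trace reads 1, 2, 3, 4: the fiber is never interrupted between its own yield points, which is the property the slide relies on.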
Data Structures - Optimize for the CPU (1/2)
■ Data is often organized logically to make the code more readable, but that
doesn’t help performance
■ The CPU processes data in a very different way; Data-Oriented Design
helps to provide specific optimizations by organizing data so the CPU can better
leverage cachelines
■ A hash lookup in a hashtable that uses linear search is a typical example
■ cachegrand’s hashtable uses 2 separate arrays: one with only the hashes, so
that all the values are sequential, and another with the keys and values
● Normally hashtables use only 1 array with hashes, keys and values
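The two-array layout can be sketched as follows. This is a minimal illustration of the idea, not cachegrand's data structure: the table size, the FNV-1a hash, the `split_table_t` type and the function names are all assumptions, and the sketch supports no deletions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUCKETS 16

typedef struct {
    const char *key;
    int value;
} key_value_t;

typedef struct {
    uint32_t hashes[BUCKETS];     /* 0 = empty; 16 x 4 bytes fit one 64-byte cacheline when aligned */
    key_value_t entries[BUCKETS]; /* only touched when a hash matches */
} split_table_t;

/* FNV-1a, used here only because it is short. */
static uint32_t hash_key(const char *key) {
    assert(key != NULL);
    uint32_t h = 2166136261u;
    for (; *key; key++) {
        h ^= (uint8_t)*key;
        h *= 16777619u;
    }
    return h ? h : 1; /* 0 is reserved for empty slots */
}

/* Inserts or updates a key; returns 0 when the table is full. */
static int table_set(split_table_t *t, const char *key, int value) {
    uint32_t h = hash_key(key);
    for (size_t i = 0; i < BUCKETS; i++) {
        size_t slot = (h + i) % BUCKETS;
        if (t->hashes[slot] == 0 ||
            (t->hashes[slot] == h && strcmp(t->entries[slot].key, key) == 0)) {
            t->hashes[slot] = h;
            t->entries[slot].key = key;
            t->entries[slot].value = value;
            return 1;
        }
    }
    return 0;
}

/* The scan reads only the dense hashes array until a candidate is found. */
static const int *table_get(const split_table_t *t, const char *key) {
    uint32_t h = hash_key(key);
    for (size_t i = 0; i < BUCKETS; i++) {
        size_t slot = (h + i) % BUCKETS;
        if (t->hashes[slot] == 0) return NULL;
        if (t->hashes[slot] == h && strcmp(t->entries[slot].key, key) == 0)
            return &t->entries[slot].value;
    }
    return NULL;
}
```

With a single array of `{hash, key, value}` records, the same scan would pull keys and values through the cache even for slots whose hash does not match.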
Data Structures - Optimize for the CPU (2/2)
cachegrand’s hashtable uses linear open addressing and can store a value
up to 448 buckets away from its initial position to improve the load factor.
This optimization provides up to 6x better performance in the worst-case
scenario.
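A bounded probe window of this kind might look like the sketch below. Only the 448 figure comes from the slide; the function name, the convention that 0 marks an empty slot, and the exact boundary handling (distances 0 to window - 1) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROBE_DISTANCE 448 /* figure taken from the slide */

/* Stores `hash` (non-zero) in the first free slot within `window` slots of
 * its initial position. Returns the distance used, or -1 if the whole
 * window is already occupied. */
static int probe_insert(uint32_t *slots, size_t size, uint32_t hash, size_t window) {
    assert(hash != 0 && size > 0);
    size_t start = hash % size;
    for (size_t d = 0; d < window && d < size; d++) {
        size_t slot = (start + d) % size;
        if (slots[slot] == 0) {
            slots[slot] = hash;
            return (int)d;
        }
    }
    return -1;
}
```

A wider window lets the table fill up further before an insert fails, at the price of longer scans, which is where the sequential hashes array and SIMD help.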
Data Structures - SIMD (1/2)
■ Another very common pattern to improve performance when processing
data is the use of Single Instruction Multiple Data (aka SIMD)
● SIMD performs the same operation on multiple data elements at once (e.g., AVX2 works on up to 256 bits)
● It covers a wide range of cases, from complex math calculations to string comparisons
■ SIMD is able to handle very complex scenarios, but it also works great for
simple use cases, e.g., linear searches
■ The number of parallel SIMD operations is often limited by the memory
bandwidth and by the processing units in the CPU
● With AVX-512 the temperature also becomes a factor that needs to be taken into account
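A linear search over 32-bit hashes is a simple case where AVX2 helps: one comparison covers 8 hashes. The sketch below assumes GCC or Clang on x86-64 and falls back to a scalar loop elsewhere; the function names are hypothetical and this is not cachegrand's search routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Compares 8 hashes per iteration; `count` must be a multiple of 8. */
__attribute__((target("avx2")))
static int find_hash_avx2(const uint32_t *hashes, size_t count, uint32_t needle) {
    __m256i wanted = _mm256_set1_epi32((int)needle);
    for (size_t i = 0; i < count; i += 8) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(hashes + i));
        __m256i equal = _mm256_cmpeq_epi32(chunk, wanted);
        /* One bit per 32-bit lane: bit n is set when lane n matched. */
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(equal));
        if (mask != 0) return (int)i + __builtin_ctz((unsigned)mask);
    }
    return -1;
}
#endif

/* Plain loop, used as fallback and as a reference for the SIMD version. */
static int find_hash_scalar(const uint32_t *hashes, size_t count, uint32_t needle) {
    for (size_t i = 0; i < count; i++) {
        if (hashes[i] == needle) return (int)i;
    }
    return -1;
}

/* Returns the index of the first matching hash, or -1. */
static int find_hash(const uint32_t *hashes, size_t count, uint32_t needle) {
    assert(count % 8 == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) return find_hash_avx2(hashes, count, needle);
#endif
    return find_hash_scalar(hashes, count, needle);
}
```

This only works well when the hashes are stored contiguously, which is why a dedicated hashes array pairs naturally with SIMD.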
Data Structures - SIMD (2/2)
cachegrand’s hashtable leverages SIMD to get up to 2.5x better performance
in the worst-case scenario.
Data Structures - Localized Spinlocks (1/2)
■ Often a single lock is used to sync access to the data: easy, but performance is terrible
■ Localizing locks inside the data can lead to a massive waste of memory
● E.g., an array where each element has a lock
● In cachegrand’s hashtable it is even more complex, as a sequence of buckets has to be locked
■ It’s possible to split the data into sections and provide localized locks
■ Useful in a hashtable, where the risk of contention is reduced
■ With less contention, spinlocks help to reduce latency thanks to less context switching
● Spinlocks in user space are “fake”: they can’t prevent the kernel from preempting the thread
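Splitting the data into sections with one spinlock each can be sketched with C11 atomics. The section size of 16 buckets, the `striped_table_t` type and the function names are assumptions for illustration; locking a sequence of buckets that spans two sections, as cachegrand needs, is not covered here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKETS 1024
#define BUCKETS_PER_LOCK 16
#define LOCKS (BUCKETS / BUCKETS_PER_LOCK)

typedef struct {
    atomic_flag locks[LOCKS]; /* one spinlock per section of 16 buckets */
    uint32_t values[BUCKETS];
} striped_table_t;

static void table_init(striped_table_t *t) {
    for (size_t i = 0; i < LOCKS; i++) atomic_flag_clear(&t->locks[i]);
    for (size_t i = 0; i < BUCKETS; i++) t->values[i] = 0;
}

/* Maps a bucket to the lock that guards its section. */
static size_t lock_index(size_t bucket) {
    assert(bucket < BUCKETS);
    return bucket / BUCKETS_PER_LOCK;
}

/* Returns 1 if the section lock was acquired, 0 if it is already held. */
static int section_try_lock(striped_table_t *t, size_t bucket) {
    return !atomic_flag_test_and_set_explicit(&t->locks[lock_index(bucket)],
                                              memory_order_acquire);
}

static void section_lock(striped_table_t *t, size_t bucket) {
    while (!section_try_lock(t, bucket)) {
        /* spin: no syscall, no voluntary context switch */
    }
}

static void section_unlock(striped_table_t *t, size_t bucket) {
    atomic_flag_clear_explicit(&t->locks[lock_index(bucket)], memory_order_release);
}

/* Threads working on different sections never contend with each other. */
static void table_add(striped_table_t *t, size_t bucket, uint32_t delta) {
    section_lock(t, bucket);
    t->values[bucket] += delta;
    section_unlock(t, bucket);
}
```

With these numbers, 64 one-byte-sized flags guard 1024 buckets, instead of one lock per bucket or one lock for the whole table.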
Data Structures - Localized Spinlocks (2/2)
Thanks to the various performance patterns implemented, cachegrand’s hashtable
can perform up to ~85 Mops inserts, which are more expensive than updates.
Doubling the number of cores almost doubles the performance: with 64 threads
the real improvement is 170%, only 30 percentage points short of the ideal target of 200%.
io_uring - networking
io_uring is a “new” async API which
provides the ability to batch various I/O
ops using rings shared between an app
and the kernel.
io_uring dramatically reduces the time spent
switching to and from kernel space; the
extra CPU time can be spent by the OS on
actually doing the I/O, or by the
application.
These benchmarks were carried out on
Linux kernel 5.8; kernel 6.0 introduces
several improvements which will provide
even better performance
To Try it Out…
> curl https://raw.githubusercontent.com/danielealbano/cachegrand/main/etc/cachegrand.yaml.skel -o /path/to/cachegrand.yaml
> nano / vim / … /path/to/cachegrand.yaml # edit the config file if needed
> docker run \
    -v /path/to/cachegrand.yaml:/etc/cachegrand/cachegrand.yaml \
    --ulimit memlock=-1:-1 \
    --ulimit nofile=262144:262144 \
    -p 6379:6379 \
    cachegrand/cachegrand-server:latest
Daniele Salvatore Albano
d.albano@gmail.com
@daniele_dll

Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

cachegrand: A Take on High Performance Caching

  • 1. Brought to you by cachegrand: A Take on High Performance Caching Daniele Salvatore Albano Senior SWE II at Microsoft
  • 2. Daniele Salvatore Albano, Senior SWE II at Microsoft
    ■ Personal project to provide a blazing fast caching solution
    ■ Love performance: there is no good reason to let the hardware go underutilized!
    ■ Outside work, I spend a lot of time playing with embedded hardware (e.g., ESP32, RPi). My last project was a security camera built on an ESP32 that combines a normal camera with a thermal one for better motion detection
  • 4. What is cachegrand?
    ■ cachegrand is a modern, blazing fast OSS caching platform, designed for performance 🚀.
    ■ cachegrand is built for speed, written in C, and scales vertically, almost linearly
      ● It’s a modern general-purpose solution: it’s not super fast for specific cases only
      ● Work is ongoing on the network stack bypass and the on-disk database!
    ■ Aims to be protocol- and command-compatible with the best-known caching solutions
  • 5. Why cachegrand?
    ■ Modern hardware requires modern software to express all its power
    ■ The architecture of the most popular alternative (Redis) is outdated and doesn’t scale vertically: more cores won't result in better performance
    ■ Up to 5.1 Mops GET / 4.5 Mops SET, 40x faster than Redis with 64x more load
    ■ Up to 60 Mops GET / 26 Mops SET with batching
    Benchmarked on 1 x AMD EPYC 7502 (32 cores, 64 threads), 256GB RAM @ 3200MHz, Ubuntu 22.04, memtier with 64-byte payloads
  • 6. This Is How It Looks Under the Hood
  • 8. Requests per Second, no Batching
    Used 3 x AMD EPYC 7502 with 2 x 25Gbit network links, one for cachegrand and two for memtier_benchmark
  • 9. Latencies
    Used 3 x AMD EPYC 7502 with 2 x 25Gbit network links, one for cachegrand and two for memtier_benchmark
  • 10. Requests per Second with Batching
    Used 3 x AMD EPYC 7502 with 2 x 25Gbit network links, one for cachegrand and two for memtier_benchmark
  • 11. How can it be so Fast?
    Benchmarks done on 1 x AMD EPYC 7502 (32 cores, 64 threads), 256GB RAM @ 3200MHz, using Ubuntu 22.04
  • 12. Custom Memory Allocator
    ■ cachegrand has its own memory allocator
    ■ A mix between the kernel slab allocator and tcmalloc
    ■ It uses Huge Pages to allocate and free memory in O(1)
      ● Rounding an address down to a 2MB boundary (address - address % 2MB) gives a pointer to the start of the Huge Page, which holds the metadata
      ● Statistics and double-free catching at no cost
      ● Optimized for long-lived threads (although a cross-thread free only requires 1 CAS) and for 2^x allocation sizes
      ● Needs improvements, e.g., it does unnecessary branching
    ■ Some memory is wasted but the performance is amazing
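The address trick from the slide above can be sketched in a few lines of C. This is only an illustration, not cachegrand's allocator: `region_header_t`, `region_create`, `region_object_at` and `region_from_ptr` are hypothetical names, and `aligned_alloc` stands in for a real Huge Page mapping. The point is that any pointer inside a 2MB-aligned region leads back to the region's metadata with one subtraction, with no lookup table involved.

```c
#include <stdint.h>
#include <stdlib.h>

#define HUGEPAGE_SIZE ((uintptr_t)2 * 1024 * 1024) /* 2MB huge page */

/* Hypothetical metadata header stored at the start of each 2MB region. */
typedef struct {
    uint32_t object_size;    /* size class served by this region */
    uint32_t objects_in_use; /* live allocations, e.g. for statistics */
} region_header_t;

/* Round any address inside the region down to its 2MB boundary. */
static region_header_t *region_from_ptr(const void *ptr) {
    uintptr_t addr = (uintptr_t)ptr;
    return (region_header_t *)(addr - (addr % HUGEPAGE_SIZE));
}

/* Create a region. aligned_alloc is an assumption standing in for a
 * real huge page mapping; it only guarantees the 2MB alignment. */
static region_header_t *region_create(uint32_t object_size) {
    region_header_t *region = aligned_alloc(HUGEPAGE_SIZE, HUGEPAGE_SIZE);
    if (region != NULL) {
        region->object_size = object_size;
        region->objects_in_use = 0;
    }
    return region;
}

/* Address of the index-th object slot. The first slot-sized chunk is
 * reserved for the header, so object_size must be >= sizeof(header). */
static void *region_object_at(region_header_t *region, uint32_t index) {
    uintptr_t first = (uintptr_t)region + region->object_size;
    return (void *)(first + (uintptr_t)index * region->object_size);
}
```

Given a pointer handed back to `free`, `region_from_ptr` recovers the size class and counters in constant time, which is what makes O(1) frees and cheap statistics plausible.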
  • 13. Fibers
    ■ Context switching between threads is slow, especially for non-pinned ones
    ■ cachegrand uses fibers, up to 60x faster
      ● A switch costs a bit more than a couple of function calls
      ● 10k context switches take less than 0.25ms
      ● Pinned threads would need 14ms!
      ● Thread pools help but have to carry around the user data context and require help for I/O
    ■ Cooperative switching is a win-win
      ● Thread pools in general switch preemptively: a context switch might happen in the middle of a critical section
      ● A critical section will never be paused to run another fiber: the app decides when to switch
      ● 10k fibers with 64 cores and 1 thread per core means at most 64 operations in flight, with no risk of half-started operations being interrupted by a context switch
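Cooperative switching can be demonstrated with the POSIX `ucontext` API. This is a swapped-in technique chosen because it is portable across glibc systems; the slides do not say how cachegrand implements its fibers, and a hand-written context switch would be faster. All names below (`fiber_yield`, `fiber_entry`, `run_fibers`, `trace`) are hypothetical. Two fibers each run a step, yield voluntarily, and are resumed later by a round-robin scheduler.

```c
#include <ucontext.h>

#define FIBER_COUNT 2
#define FIBER_STACK_SIZE (64 * 1024)

static ucontext_t scheduler_ctx;               /* the scheduler loop */
static ucontext_t fiber_ctx[FIBER_COUNT];      /* one context per fiber */
static char fiber_stack[FIBER_COUNT][FIBER_STACK_SIZE];
static int current_fiber;                      /* fiber currently running */
static int trace[2 * FIBER_COUNT];             /* order in which steps ran */
static int trace_len;

/* Cooperative yield: the fiber itself decides when to give up the CPU. */
static void fiber_yield(void) {
    swapcontext(&fiber_ctx[current_fiber], &scheduler_ctx);
}

/* Each fiber records step 1, yields, then records step 2 and finishes. */
static void fiber_entry(void) {
    int id = current_fiber;
    trace[trace_len++] = id * 10 + 1;
    fiber_yield();
    trace[trace_len++] = id * 10 + 2;
    /* Returning resumes uc_link, i.e. the scheduler. */
}

/* Round-robin scheduler: every fiber needs exactly two time slices. */
static void run_fibers(void) {
    trace_len = 0;
    for (int i = 0; i < FIBER_COUNT; i++) {
        getcontext(&fiber_ctx[i]);
        fiber_ctx[i].uc_stack.ss_sp = fiber_stack[i];
        fiber_ctx[i].uc_stack.ss_size = FIBER_STACK_SIZE;
        fiber_ctx[i].uc_link = &scheduler_ctx;
        makecontext(&fiber_ctx[i], fiber_entry, 0);
    }
    for (int slice = 0; slice < 2; slice++) {
        for (int i = 0; i < FIBER_COUNT; i++) {
            current_fiber = i;
            swapcontext(&scheduler_ctx, &fiber_ctx[i]);
        }
    }
}
```

Because a fiber only loses the CPU inside `fiber_yield`, the code between two yields behaves like a critical section with respect to the other fibers on the same thread.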
  • 14. Data Structures - Optimize for the CPU (1/2)
    ■ Data is often organized logically to make the code more readable, but that doesn’t help performance
    ■ The CPU processes data in a very different way; Data Oriented Development helps by organizing data so the CPU can make better use of cache lines
    ■ Searching for a hash in a hashtable with a linear search is a typical example
    ■ cachegrand’s hashtable uses 2 separate arrays: one with only the hashes, so that all the values are sequential, and another with the keys and values
      ● Hashtables normally use a single array holding hashes, keys and values
  • 15. Data Structures - Optimize for the CPU (2/2)
    cachegrand’s hashtable uses linear open addressing and stores values up to 448 buckets away from the initial position to improve the load factor. This optimization provides up to 6x better performance in the worst-case scenario
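A minimal sketch of the two-array layout with bounded linear probing follows. It is an assumption-laden illustration, not cachegrand's hashtable: the table size, the FNV-1a hash, the use of 0 as the "empty" marker, the lack of deletion and resizing, and the names `table_t`, `table_set` and `table_get` are all made up for the example. Only the 448-bucket probe limit comes from the slide.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUCKETS 1024u   /* illustrative table size, a power of two */
#define MAX_PROBE 448u  /* max distance from the initial bucket */

typedef struct {
    const char *key;    /* caller-owned string */
    uint64_t value;
} kv_t;

typedef struct {
    uint32_t hashes[BUCKETS]; /* dense array scanned on lookup, 0 = empty */
    kv_t kvs[BUCKETS];        /* touched only when a hash matches */
} table_t;

/* FNV-1a, used here only as a stand-in hash function. */
static uint32_t hash_key(const char *key) {
    uint32_t h = 2166136261u;
    for (; *key != '\0'; key++) {
        h ^= (uint8_t)*key;
        h *= 16777619u;
    }
    return h == 0 ? 1 : h; /* 0 is reserved for empty buckets */
}

/* Insert or update; fails if the whole probe window is occupied. */
static bool table_set(table_t *t, const char *key, uint64_t value) {
    uint32_t h = hash_key(key);
    for (uint32_t i = 0; i < MAX_PROBE; i++) {
        uint32_t b = (h + i) & (BUCKETS - 1);
        if (t->hashes[b] == 0 ||
            (t->hashes[b] == h && strcmp(t->kvs[b].key, key) == 0)) {
            t->kvs[b].key = key;
            t->kvs[b].value = value;
            t->hashes[b] = h;
            return true;
        }
    }
    return false;
}

/* Lookup: walks the hashes array and compares keys only on a hash hit. */
static bool table_get(const table_t *t, const char *key, uint64_t *value) {
    uint32_t h = hash_key(key);
    for (uint32_t i = 0; i < MAX_PROBE; i++) {
        uint32_t b = (h + i) & (BUCKETS - 1);
        if (t->hashes[b] == 0) return false; /* no deletions in this sketch */
        if (t->hashes[b] == h && strcmp(t->kvs[b].key, key) == 0) {
            *value = t->kvs[b].value;
            return true;
        }
    }
    return false;
}
```

With 4-byte hashes, one 64-byte cache line holds 16 candidates, so a long probe sequence touches far fewer cache lines than it would if hashes were interleaved with keys and values.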
  • 16. Data Structures - SIMD (1/2)
    ■ Another very common pattern to improve performance when processing data is Single Instruction Multiple Data (SIMD)
      ● SIMD performs the same operation on multiple data elements at once (e.g., AVX2 works on up to 256 bits)
      ● It covers a wide range of cases, from complex math calculations to string comparisons
    ■ SIMD can handle very complex scenarios but it also works great for simple use cases, e.g., linear searches
    ■ The number of parallel SIMD operations is often limited by memory bandwidth and by the processing units in the CPU
      ● With AVX-512, temperature also becomes a factor that needs to be taken into account
  • 17. Data Structures - SIMD (2/2)
    cachegrand’s hashtable leverages SIMD for up to 2.5x better performance in the worst-case scenario.
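A linear search over an array of 32-bit hashes is a natural fit for SIMD. The sketch below uses SSE2 (128 bits, 4 hashes per comparison) rather than the AVX2 mentioned in the slides, purely so that it compiles on any x86-64 toolchain without extra flags; on other targets it falls back to the scalar loop. The function name `find_hash_simd` is hypothetical, and `__builtin_ctz` assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first element equal to needle, or -1. */
static int find_hash_simd(const uint32_t *hashes, size_t count,
                          uint32_t needle) {
    size_t i = 0;
#if defined(__SSE2__)
    /* Broadcast the needle into all four 32-bit lanes. */
    const __m128i wanted = _mm_set1_epi32((int)needle);
    for (; i + 4 <= count; i += 4) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(hashes + i));
        __m128i eq = _mm_cmpeq_epi32(chunk, wanted);
        /* One bit per lane: bit n is set when lane n matched. */
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
        if (mask != 0) {
            return (int)i + __builtin_ctz((unsigned)mask);
        }
    }
#endif
    /* Scalar tail, and the full fallback when SSE2 is unavailable. */
    for (; i < count; i++) {
        if (hashes[i] == needle) return (int)i;
    }
    return -1;
}
```

An AVX2 variant would follow the same shape with 256-bit registers and 8 hashes per comparison.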
  • 18. Data Structures - Localized Spinlocks (1/2)
    ■ Often a single lock is used to synchronize access to the data: easy, but terrible performance
    ■ Localizing locks inside the data can lead to a massive waste of memory
      ● E.g., an array where each element has a lock
      ● In cachegrand’s hashtable it is even more complex, as a sequence of buckets has to be locked
    ■ It’s possible to split the data into sections and provide localized locks
    ■ Useful in a hashtable, where the risk of contention is reduced
    ■ With less contention, spinlocks help to reduce latency thanks to less context switching
      ● Spinlocks in user space are “fake”: they can’t prevent the kernel from preempting the thread
  • 19. Data Structures - Localized Spinlocks (2/2)
    Thanks to the various performance patterns implemented, cachegrand’s hashtable can perform up to ~85 Mops inserts, which are more expensive than updates. Doubling the number of cores provides almost twice the performance: with 64 threads the real improvement is 170%, only 30 percentage points short of the ideal target of 200%.
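The idea of one lock per section of buckets can be sketched with C11 atomics. This is an illustration under assumptions, not cachegrand's locking scheme: the section size of 16 buckets and the names `sharded_table_t`, `section_lock`, `section_unlock` and `table_add` are hypothetical, and the sketch ignores the harder case from the slides where an operation spans several sections.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUCKET_COUNT 1024u
#define BUCKETS_PER_LOCK 16u /* illustrative section size */
#define LOCK_COUNT (BUCKET_COUNT / BUCKETS_PER_LOCK)

typedef struct {
    atomic_flag locks[LOCK_COUNT]; /* one spinlock per section */
    uint64_t values[BUCKET_COUNT];
} sharded_table_t;

static void table_init(sharded_table_t *t) {
    for (uint32_t i = 0; i < LOCK_COUNT; i++) atomic_flag_clear(&t->locks[i]);
    for (uint32_t i = 0; i < BUCKET_COUNT; i++) t->values[i] = 0;
}

/* Map a bucket to the lock that guards its section. */
static uint32_t lock_index_for_bucket(uint32_t bucket) {
    return bucket / BUCKETS_PER_LOCK;
}

/* Returns true if the section lock was free and is now held. */
static bool section_try_lock(sharded_table_t *t, uint32_t bucket) {
    return !atomic_flag_test_and_set_explicit(
        &t->locks[lock_index_for_bucket(bucket)], memory_order_acquire);
}

/* Spin until the section lock is acquired. */
static void section_lock(sharded_table_t *t, uint32_t bucket) {
    while (!section_try_lock(t, bucket)) {
        /* busy-wait; a real implementation would add a pause hint */
    }
}

static void section_unlock(sharded_table_t *t, uint32_t bucket) {
    atomic_flag_clear_explicit(
        &t->locks[lock_index_for_bucket(bucket)], memory_order_release);
}

/* Update one bucket while holding only its section's lock. */
static void table_add(sharded_table_t *t, uint32_t bucket, uint64_t delta) {
    section_lock(t, bucket);
    t->values[bucket] += delta;
    section_unlock(t, bucket);
}
```

Threads working on different sections never touch the same lock, so contention only occurs when two operations hash into the same 16-bucket section, while memory overhead stays at one flag per section instead of one per bucket.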
  • 20. io_uring - networking
    io_uring is a “new” async API that can batch various I/O operations using rings shared between an app and the kernel. io_uring dramatically reduces the time spent switching to and from kernel space; the CPU time saved can be spent by the OS actually doing the I/O, or by the application. These benchmarks were carried out on Linux kernel 5.8; kernel 6.0 introduces several improvements that will provide even better performance
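The batching benefit of a shared ring can be shown with a small conceptual model. To be clear, this is not the io_uring API and makes no system calls: it is a single-producer, single-consumer ring in plain C11 where the producer plays the application and the consumer plays the kernel. The names `ring_t`, `ring_push` and `ring_drain` are hypothetical. Several requests are queued with plain memory writes, then handed over in one step, which is the part a single kernel transition would cover.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u /* power of two so indices can be masked */

typedef struct {
    uint64_t user_data; /* opaque tag identifying the request */
} request_t;

typedef struct {
    _Atomic uint32_t head; /* advanced by the consumer ("kernel") */
    _Atomic uint32_t tail; /* advanced by the producer ("app") */
    request_t entries[RING_ENTRIES];
} ring_t;

/* Producer side: queue a request using only memory writes. */
static bool ring_push(ring_t *r, request_t req) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES) return false; /* ring is full */
    r->entries[tail & (RING_ENTRIES - 1)] = req;
    /* Publish the entry only after it has been fully written. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: process everything queued so far in one pass.
 * Adds each user_data to *sum and returns the number of requests. */
static uint32_t ring_drain(ring_t *r, uint64_t *sum) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    uint32_t drained = tail - head;
    for (; head != tail; head++) {
        *sum += r->entries[head & (RING_ENTRIES - 1)].user_data;
    }
    atomic_store_explicit(&r->head, head, memory_order_release);
    return drained;
}
```

In this model the cost of crossing into the consumer is paid once per `ring_drain` call rather than once per request, which mirrors why batching submissions reduces time spent switching to and from kernel space.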
  • 21. To Try it Out…
    > curl https://raw.githubusercontent.com/danielealbano/cachegrand/main/etc/cachegrand.yaml.skel -o /path/to/cachegrand.yaml
    > nano / vim / … /path/to/cachegrand.yaml # edit the config file if needed
    > docker run -v /path/to/cachegrand.yaml:/etc/cachegrand/cachegrand.yaml --ulimit memlock=-1:-1 --ulimit nofile=262144:262144 -p 6379:6379 cachegrand/cachegrand-server:latest
  • 22. Brought to you by Daniele Salvatore Albano d.albano@gmail.com @daniele_dll