This document proposes a unified read-only cache for Ceph using a standalone SSD caching library that can be reused for librbd and RGW. It describes the general architecture including a common libcachefile, policy, and hooks. It then provides more details on shared read-only caching implementations for librbd and RGW, including initial results showing a 4x performance improvement for librbd. Issues discussed include different block vs object caching semantics and status of the RGW caching PR.
2. Design goals & current status
• A standalone SSD caching library that can be re-used between librbd and RGW
• Current status:
• librbd read-only cache: caches block contents on SSD
• librbd parent/clone images: parent RBD contents are cached on SSD, and all cloned images can read from the parent image cache before COW happens
• PR will be sent out soon
• RGW immutable caching: caches RADOS objects on SSD
• A small CDN farm behind the RGW cluster
• PR against Jewel is ready (#13144) but needs cleanup
3. General architecture
• libcachefile: common library that does read/write on SSD (a minimal API sketch follows the diagram below)
• Sparse-file based cache
• Policy: controls cache promotion/demotion and the sizing of the cache
• Simple LRU based
• librbd/librgw hooks: call the API from libcachefile
[Architecture diagram: librbd clients (RBD_0, RBD_1, RBD_2) and RGW civetweb instances call into libCacheStore through hooks and a policy layer; FileImageCache and RGW_DataCache store cached data on SSD, with librbd/librados and RADOS as the backing store]
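A minimal sketch of what the libcachefile layer described above could look like, assuming a block-granular sparse cache file on the SSD and a simple LRU policy. The class and method names (LRUPolicy, CacheFile, read_block, write_block) and the bookkeeping details are illustrative assumptions, not the actual Ceph API.

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// Simple LRU policy: tracks block accesses and picks a victim to demote
// once the configured cache size is exceeded.
class LRUPolicy {
public:
  explicit LRUPolicy(size_t max_blocks) : max_blocks_(max_blocks) {}

  // Record an access; returns the block id to demote, or -1 if none.
  int64_t touch(int64_t block_id) {
    auto it = index_.find(block_id);
    if (it != index_.end())
      lru_.erase(it->second);
    lru_.push_front(block_id);
    index_[block_id] = lru_.begin();
    if (lru_.size() <= max_blocks_)
      return -1;
    int64_t victim = lru_.back();
    lru_.pop_back();
    index_.erase(victim);
    return victim;
  }

private:
  size_t max_blocks_;
  std::list<int64_t> lru_;
  std::unordered_map<int64_t, std::list<int64_t>::iterator> index_;
};

// Common cache-file layer: the librbd/librgw hooks call read_block()/write_block(),
// which map block ids to fixed offsets inside a sparse file on the SSD.
class CacheFile {
public:
  CacheFile(const std::string &path, uint32_t block_size, size_t max_blocks)
      : block_size_(block_size), policy_(max_blocks) {
    fd_ = ::open(path.c_str(), O_RDWR | O_CREAT, 0644);  // sparse cache file
  }
  ~CacheFile() { if (fd_ >= 0) ::close(fd_); }

  // Presence is tracked by the caller's mapping metadata; the file itself is
  // sparse, so blocks that were never promoted are simply holes.
  bool read_block(int64_t block_id, char *buf) {
    ssize_t n = ::pread(fd_, buf, block_size_, block_id * block_size_);
    return n == static_cast<ssize_t>(block_size_);
  }

  // Promote a block: write it at its offset and let the policy pick a victim.
  int64_t write_block(int64_t block_id, const char *buf) {
    ::pwrite(fd_, buf, block_size_, block_id * block_size_);
    return policy_.touch(block_id);
  }

private:
  int fd_ = -1;
  uint32_t block_size_;
  LRUPolicy policy_;
};

The hooks in librbd/librgw would sit on top of an interface like this, reading on cache hits and promoting blocks on read misses, as described later in the deck.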
5. Shared Read-only cache for RBD – rbd clone flow
[Diagram: a template image RBD_0 with a protected snapshot RBD_0@snap1, from which cloned images RBD_1, RBD_2, …, RBD_N are created; the protected snapshot is the shared image content]
6. Shared Read-only cache for RBD -- overview
• There will be a shared cache (from the parent image) on each compute node
• A cloned image will read from the shared cache unless COW has happened
[Diagram: each compute node holds per-image local caches and a shared cache on an SSD backend; read I/O is served from these caches, while write I/O goes to the RADOS cluster (OSDs)]
7. Shared Read-only cache for RBD -- Cache metadata
Each cloned image will have its own COW cache mapping:
- For each read hit, the block is either in the shared cache or in the clone's own cache
- Cache mapping bits track COWed data
- Updated when a COW happens
Mapping entry layout (64 bits):
- 2 bits: state (not_in_cache, in_shared_cache, in_cache)
- 62 bits: block_id
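A possible packing of one 64-bit mapping entry, shown as a sketch. Only the 2-bit/62-bit split and the three states come from the slide; the type and field names are assumptions.

#include <cstdint>

// Three states tracked per block, matching the 2 bits described above.
enum CacheState : uint64_t {
  NOT_IN_CACHE    = 0,
  IN_SHARED_CACHE = 1,  // block can be read from the shared (parent) cache
  IN_CACHE        = 2,  // block lives in the clone's own (COW) cache
};

// One mapping entry: 2 state bits plus a 62-bit block id.
struct CowMapEntry {
  uint64_t state    : 2;
  uint64_t block_id : 62;
};

static_assert(sizeof(CowMapEntry) == sizeof(uint64_t),
              "an entry packs into a single 64-bit word");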
8. Shared Read-only cache for RBD – IO flow (read)
[Diagram: on a compute node, the parent image RBD_0 (image_store) owns a shared cache file on SSD, fully promoted when the first cloned image is opened; cloned images RBD_1 and RBD_2 (librbd + FileImageCache) each have their own cache file holding COW data, with RADOS as the backing store]
Read path:
1. Cache lookup
2. If in the shared cache: read from the shared cache
2'. If in the COW cache: read from the COW cache

Example COW cache mappings:
rbd_id | lba | length
rbd_1 | 8192 | 4096
rbd_1 | 1048576 | 4096

rbd_id | lba | length
rbd_2 | 8192 | 4096
rbd_2 | 1048576 | 4096
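A minimal sketch of this read path. The State enum mirrors the 2-bit mapping state from the cache-metadata slide; lookup, shared_cache_read, cow_cache_read, and rados_read are hypothetical placeholders standing in for the real FileImageCache internals.

#include <cstdint>

enum class State { NOT_IN_CACHE, IN_SHARED_CACHE, IN_CACHE };

State lookup(uint64_t block_id);                        // COW cache mapping lookup
bool  shared_cache_read(uint64_t block_id, char *buf);  // shared cache file on SSD
bool  cow_cache_read(uint64_t block_id, char *buf);     // clone's own COW cache file
void  rados_read(uint64_t block_id, char *buf);         // backing RADOS objects

void read_block(uint64_t block_id, char *buf) {
  switch (lookup(block_id)) {                 // 1. cache lookup
  case State::IN_SHARED_CACHE:
    if (shared_cache_read(block_id, buf))     // 2. read from the shared cache
      return;
    break;
  case State::IN_CACHE:
    if (cow_cache_read(block_id, buf))        // 2'. read from the COW cache
      return;
    break;
  case State::NOT_IN_CACHE:
    break;
  }
  rados_read(block_id, buf);                  // miss: read from RADOS
}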
9. Shared Read-only cache for RBD – IO flow (write)
[Diagram: same layout as the read flow; cloned images RBD_1 and RBD_2 keep COW data in their own cache files next to the shared cache file of parent RBD_0 (fully promoted on first cloned image open), with RADOS as the backing store]
Write path:
1. Cache lookup
2. If in the shared cache:
- Create an entry in the COW mapping table
- Write to RADOS
2'. If in the COW cache:
- Invalidate the chunk in the cache file
- Write to RADOS

Example COW cache mappings:
rbd_id | lba | length
rbd_1 | 8192 | 4096
rbd_1 | 1048576 | 4096

rbd_id | lba | length
rbd_2 | 81920 | 4096
rbd_2 | 1048576 | 4096
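A matching sketch of the write path, reusing State and lookup from the read-path sketch; cow_map_insert, cow_cache_invalidate, and rados_write are again hypothetical placeholders. Since the cache is read-only, data is never written to the SSD on this path; only mapping updates and invalidations happen there.

void cow_map_insert(uint64_t block_id);         // add an entry to the COW mapping table
void cow_cache_invalidate(uint64_t block_id);   // drop the stale chunk from the cache file
void rados_write(uint64_t block_id, const char *buf);

void write_block(uint64_t block_id, const char *buf) {
  switch (lookup(block_id)) {                   // 1. cache lookup
  case State::IN_SHARED_CACHE:
    cow_map_insert(block_id);                   // 2. create an entry in the COW mapping table
    break;
  case State::IN_CACHE:
    cow_cache_invalidate(block_id);             // 2'. invalidate the chunk in the cache file
    break;
  case State::NOT_IN_CACHE:
    break;
  }
  rados_write(block_id, buf);                   // the write itself always goes to RADOS
}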
12. Shared Read-only cache for RGW
Example mapping entry:
chunk_id: 7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
RGW instance id: Rgw_1
Cache_chunk_id: 7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
• A CDN cluster behind the RGW clusters
• L1 cache: allows reads from the SSD cache of the local RGW instance
• L2 cache (configurable): allows reads from the SSD caches on other remote RGW instances
• Each object/chunk has a unique ID
• A centralized/distributed K/V store is needed to hold the mapping, as the chunks may be spread across different RGW instances (see the sketch below)
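A sketch of the kind of chunk directory such a K/V store would hold: the key is the chunk id, and the value records which RGW instance has the chunk in its SSD cache and under what cache chunk id. The in-memory map and the names ChunkDirectory, publish, and locate are illustrative stand-ins for a real centralized/distributed K/V service.

#include <optional>
#include <string>
#include <unordered_map>

struct ChunkLocation {
  std::string rgw_instance_id;  // e.g. "rgw_1"
  std::string cache_chunk_id;   // id of the chunk inside that instance's SSD cache
};

class ChunkDirectory {
public:
  // Called after an instance promotes a chunk to its local cache.
  void publish(const std::string &chunk_id, ChunkLocation loc) {
    dir_[chunk_id] = std::move(loc);
  }

  // L1: the result names the local instance; L2 (if enabled): it names a
  // remote instance whose SSD cache can be read; nullopt: fall back to RADOS.
  std::optional<ChunkLocation> locate(const std::string &chunk_id) const {
    auto it = dir_.find(chunk_id);
    if (it == dir_.end())
      return std::nullopt;
    return it->second;
  }

private:
  std::unordered_map<std::string, ChunkLocation> dir_;
};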
13. Shared Read-only cache for RGW
[Diagram: S3 and Swift requests enter through rgw_frontend on rgw_1 and rgw_2; rgw_rados/rgw_cache consult a datacache policy to serve from the local SSD immutable cache (L1) or a remote instance's cache (L2), falling back to RADOS via librados]
14. Issues
• Different caching semantics for block and object?
• Promoting at the block level (default 8k) for librbd
• Promoting at the object level for RGW
• #13144 is not compiling
• https://github.com/maniaabdi/engage1.git
• Jewel-based, needs to be rebased against master
• Currently the logic is inside rgw_rados; it needs to be decoupled to fit our design (libcachefile + policy)
15. RGW datacache (PR #13144)
[Diagram: the same L1/L2 immutable-cache architecture as in slide 13, with the datacache and policy components sitting inside rgw_rados/rgw_cache on rgw_1 and rgw_2]
Editor's Notes
How to maintain the librbd parent/clone image table?
When to promote the shared cache file?
-> When the first cloned image is opened, the cache is promoted to the local node; this could be optimized.
What data should we promote? parent_image@snapshot
librbd caching will promote at block-size (4k default) granularity.
What is the cache file format?
-> Sparse-file based (see the sketch below).
Only promote on reads.
Writes go directly to the OSDs and invalidate the cache on a cache hit.
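A minimal sketch of the sparse-file cache format mentioned above: the cache file is truncated to the image's logical size up front, and promoted blocks are written at their native offsets, so unpromoted ranges remain holes and use no SSD space. The function names and the 4k block size are illustrative.

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

int open_cache_file(const char *path, uint64_t image_size) {
  int fd = ::open(path, O_RDWR | O_CREAT, 0644);
  if (fd >= 0)
    ::ftruncate(fd, image_size);   // sets the logical size; no blocks are allocated yet
  return fd;
}

void promote_block(int fd, uint64_t block_id, const char *data,
                   uint32_t block_size /* e.g. 4096 */) {
  // Writing at the block's native offset allocates only that extent.
  ::pwrite(fd, data, block_size, static_cast<off_t>(block_id) * block_size);
}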