
Unified readonly cache for ceph updates sep cdm

Updates on RBD shared readonly SSD cache

  1. 1. Unified read-only cache proposal
  2. 2. Design goals
     • A standalone SSD caching library that can be reused by both librbd and RGW
     • Use cases:
       • librbd read-only cache: caching block contents on SSD
       • librbd parent/clone images: caching parent RBD contents on SSD, so that all cloned images can read from the parent image cache before COW happens
       • RGW immutable caching: caching RADOS objects on SSD
       • A small CDN farm behind the RGW cluster
  3. 3. Cache daemon general architecture
     • Libcachestore: common library that does reads/writes on SSD
       • Sparse-file-based cache
     • Cache daemon: controls cache promotion/demotion and the sizing of the cache
       • Simple LRU based
     • librbd/librgw hooks: call the API from libcachefile
     [Diagram labels: RBD_0, RBD_1, RBD_2 (librbd, FileImageCache hooks); three RGW_civetweb instances (librgw, RGW_DataCache hooks); libCacheStore; policy; SSD; librbd; librados; RADOS]
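The "simple LRU based" policy above could look roughly like the following. This is a minimal sketch, not the daemon's code: the class name `LRUPolicy`, the `touch` method, and the notion of a capacity counted in blocks are all hypothetical.

```python
from collections import OrderedDict


class LRUPolicy:
    """Hypothetical LRU policy: tracks cached block ids, oldest first."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block id -> None, least recently used first

    def touch(self, block_id):
        """Record an access (promoting on a miss); return block ids to demote."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # hit: mark as most recently used
        else:
            self.blocks[block_id] = None       # miss: promote into the cache
        demoted = []
        while len(self.blocks) > self.capacity:
            victim, _ = self.blocks.popitem(last=False)  # evict the coldest block
            demoted.append(victim)
        return demoted


policy = LRUPolicy(capacity_blocks=2)
policy.touch("a")
policy.touch("b")
policy.touch("a")            # "a" becomes the most recently used block
evicted = policy.touch("c")  # capacity exceeded, so the coldest block is demoted
```

Here the third distinct block pushes out `"b"`, the block that was accessed least recently.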
  4. 4. Shared read-only RBD SSD cache
  5. 5. PR #16788
     • A generic file-based persistent cache store
       • Sparse-file-based cache
       • Sync interfaces provided
     • A generic read-only caching framework
       • Cache promotion on reads
       • Cache invalidation on writes; write requests go to RADOS directly
     • A simple shared read-only cache implementation ("happy" data path)
       • The shared cache is fully promoted when the first child is opened
     • Still missing:
       • A standalone cache daemon to control the cache state
       • A configurable policy to control promotion/demotion on the shared cache
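To illustrate the combination of a sparse-file store, promotion on reads, and invalidation on writes, here is a minimal sketch. It is not the code from the PR: `SparseFileCache`, `BLOCK`, and the `backend_read` callback (standing in for a RADOS read) are hypothetical names, and the file is only sparse on filesystems that support holes.

```python
import os
import tempfile

BLOCK = 4096  # hypothetical cache block size


class SparseFileCache:
    """Hypothetical read-only cache backed by one sparse file."""

    def __init__(self, path, image_size, backend_read):
        self.backend_read = backend_read  # callable(offset, length) -> bytes
        self.present = set()              # indices of promoted blocks
        self.f = open(path, "w+b")
        self.f.truncate(image_size)       # sets the size without writing data

    def read_block(self, idx):
        if idx not in self.present:       # miss: promote from the backend
            data = self.backend_read(idx * BLOCK, BLOCK)
            self.f.seek(idx * BLOCK)
            self.f.write(data)
            self.present.add(idx)
        self.f.seek(idx * BLOCK)
        return self.f.read(BLOCK)

    def invalidate(self, idx):
        """Called on a write; the write itself goes straight to the backend."""
        self.present.discard(idx)

    def close(self):
        self.f.close()


backend_reads = []


def fake_rados_read(offset, length):
    backend_reads.append(offset)
    return bytes([offset // BLOCK % 256]) * length


with tempfile.TemporaryDirectory() as d:
    cache = SparseFileCache(os.path.join(d, "cache.img"), 16 * BLOCK, fake_rados_read)
    first = cache.read_block(3)   # miss, promoted from the backend
    second = cache.read_block(3)  # hit, served from the cache file
    cache.invalidate(3)           # a write to block 3 arrived
    third = cache.read_block(3)   # miss again, re-promoted
    cache.close()
```

The backend is read twice for three cache reads: once for the initial promotion and once after the invalidation.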
  6. 6. Initial results
     4k rand read (the last two rows are 4k rand write)
     Configuration                       Op size  Op type    QD  Runtime (sec)  IOPS   BW (MB/s)  Latency (ms)  99.99% latency (ms)
     1 OSD, 1 replica, baseline          4k       randread   32  600            12927  50.5       2.437         8.889
     Read-only cache                     4k       randread   32  600            51436  200        0.563         4.832
     Shared read-only cache (2 volumes)  4k       randread   32  600            69370  270        0.868         5.024
     Cache 1G, volume 10G                4k       randread   32  600            12219  47.7       2.571         8.256
     Cache 2G, volume 10G                4k       randread   32  600            14203  55.5       2.207         10.56
     Cache 4G, volume 10G                4k       randread   32  600            19099  74.6       1.630         6.944
     Cache 8G, volume 10G                4k       randread   32  600            46633  182        0.641         5.088
     1 OSD, 1 replica, baseline          4k       randwrite  32  600            8920   34.8       3.49          125
     1 OSD, 1 replica, with cache        4k       randwrite  32  600            8895   34.7       3.51          195.584
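As a sanity check on the figures above, the bandwidth column can be recomputed from IOPS and the 4k op size. The sketch below assumes the reported "MB/s" values are actually MiB/s (2^20 bytes), which is an assumption on my part; the helper name is hypothetical.

```python
def bandwidth_mib_s(iops, block_bytes=4096):
    """Bandwidth implied by an IOPS figure at a fixed op size, in MiB/s."""
    return iops * block_bytes / 2**20


baseline = bandwidth_mib_s(12927)  # 1 OSD baseline, 4k randread
cached = bandwidth_mib_s(51436)    # read-only cache, 4k randread
speedup = 51436 / 12927            # IOPS ratio, cache vs. baseline
```

Under that assumption the recomputed values land within rounding of the reported 50.5 and 200, and the read-only cache delivers close to four times the baseline IOPS.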
  7. 7. Shared read-only cache for RBD – rbd clone flow
     [Diagram: parent image RBD_0, protected snapshot RBD_0@snap1, cloned images RBD_1, RBD_2, … RBD_N. Annotation: "This is the shared image content"]
  8. 8. Shared read-only cache for RBD – Cache Daemon
     • Read-only blocks from parent image(s) are cached in a shared area on compute node(s)
     • Reads are served from the shared cache until the first COW request
     • A cache daemon
       • Runs on each compute node to control the shared cache state
       • Policy thread: owns a policy to control promotion/demotion of the shared cache
       • RBD instances use IPC with the daemon to take read/write locks on a shared cache block
       • Upon recovery from a crash or reboot, the daemon tries to rebuild the shared cache state from persistent metadata
         • The rebuild process is simple: read the persistent metafile, then check that the image and the corresponding cache file exist
         • If the rebuild fails (for example, on a meta/cachefile read error), reset to an empty cache
     [Diagram labels: compute node with RBD_instance … RBD_instance (read I/O, write I/O, read I/O post-COW), IPC to the cache daemon (policy, promote/demote), shared RO cache and meta file on SSD; RADOS OSDs]
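The rebuild step described above can be sketched as follows. The metafile format is not given in the slides, so this sketch assumes a JSON object mapping image names to cache file paths; `rebuild_cache_state` and the `image_exists` callback are hypothetical, and dropping entries whose image or cache file is gone is likewise an assumption.

```python
import json
import os
import tempfile


def rebuild_cache_state(metafile, image_exists):
    """Rebuild {image: cachefile} from the metafile; empty cache on any failure."""
    try:
        with open(metafile) as f:
            entries = json.load(f)
        state = {}
        for image, cachefile in entries.items():
            # Keep an entry only if both the image and its cache file still exist.
            if image_exists(image) and os.path.exists(cachefile):
                state[image] = cachefile
        return state
    except (OSError, ValueError):
        return {}  # read or parse error: reset to an empty cache


with tempfile.TemporaryDirectory() as d:
    cachefile = os.path.join(d, "parent.cache")
    open(cachefile, "wb").close()
    meta = os.path.join(d, "meta.json")
    with open(meta, "w") as f:
        json.dump({"rbd_0@snap1": cachefile,
                   "gone@snap": os.path.join(d, "missing.cache")}, f)
    rebuilt = rebuild_cache_state(meta, image_exists=lambda name: True)

    with open(meta, "w") as f:
        f.write("{corrupt")  # simulate a damaged metafile
    after_corruption = rebuild_cache_state(meta, image_exists=lambda name: True)
```

The entry whose cache file is missing is dropped, and a damaged metafile yields an empty cache rather than an error.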
  9. 9. Shared read-only cache for RBD – Promote flow
     • Block belongs in the shared cache but is missing now:
       • WriterLock()
       • Promote from RADOS
       • Notify the cloned image that the block is ready to read
     [Diagram labels: compute node with RBD_1 (cloned) and RBD_2 (cloned), each with librbd, FileImageCache, and a COW cache mapping; cache lookup; Cache_daemon with policy; shared cache file and meta on SSD; RADOS; steps numbered 1 to 4]
  10. 10. Shared read-only cache for RBD – Demote flow
      • Block is in the shared cache but cold:
        • Demote the block
      [Diagram labels: compute node with RBD_1 (cloned) and RBD_2 (cloned), each with librbd, FileImageCache, and a COW cache mapping; Cache_daemon with policy; shared cache file and meta on SSD; RADOS]
  11. 11. Shared read-only cache for RBD – IO flow (read)
      • Step 1: cache lookup
      • Step 2, block is in the shared cache now:
        • ReaderLock()
        • Check meta
        • Read from the shared cache
      • Step 2', on COW:
        • Read from RADOS
      [Diagram labels: compute node with RBD_1 (cloned) and RBD_2 (cloned), each with librbd, FileImageCache, and a COW cache mapping; Cache_daemon with policy; shared cache file and meta on SSD; RADOS]
  12. 12. Shared read-only cache for RBD – IO flow (write)
      • Cache lookup
      • Write to RADOS
      [Diagram labels: compute node with RBD_1 (cloned) and RBD_2 (cloned), each with librbd, FileImageCache, and a COW cache mapping; Cache_daemon with policy; shared cache file and meta on SSD; RADOS; steps numbered 1 and 2]
  13. 13. Issues/Corner cases
      • How to handle VM migration or a VM crash?
        • We could rebuild the cache state on RBD re-open
      • RBD removed on other nodes?
        • The policy thread in the cache daemon will periodically check the local cache and eventually remove stale cache entries
      • Cache daemon crash?
        • The shared cache state is persisted to a local metafile
        • The daemon is stateless; we only need to restart the daemon process and rebuild the cache state
      • Cache file inconsistent?
        • We rely on the filesystem to do the check; if a read error happens, we simply re-issue the read from RADOS
  14. 14. Shared RGW read-only SSD cache
  15. 15. Shared read-only cache for RGW
      • A CDN cluster behind the RGW clusters
      • L1 cache: allows reads from the SSD cache of the local RGW instance
      • L2 cache (configurable): allows reads from the SSD cache on other, remote RGW instances
      • Each object/chunk has a unique ID
      • Needs a centralized distributed K/V store to hold the mapping, as the chunks may be spread across different RGW instances
      Example mapping entry:
        chunk_id:         7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
        RGW instance id:  Rgw_1
        Cache_chunk_id:   7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
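The L1/L2 lookup order described above can be sketched as follows. Plain dictionaries stand in for the per-instance SSD caches and for the centralized K/V store; `read_chunk`, its parameters, and the choice to promote into the local cache after a RADOS read are all hypothetical.

```python
def read_chunk(chunk_id, local_id, caches, directory, rados_read, l2_enabled=True):
    """Return (data, source), trying L1, then L2, then RADOS.

    caches:    {rgw_instance_id: {chunk_id: data}}, stand-in for SSD caches
    directory: {chunk_id: rgw_instance_id}, stand-in for the K/V mapping
    """
    local = caches[local_id]
    if chunk_id in local:
        return local[chunk_id], "L1"      # hit in the local instance's cache
    owner = directory.get(chunk_id)
    if l2_enabled and owner is not None and chunk_id in caches.get(owner, {}):
        return caches[owner][chunk_id], "L2"  # hit on a remote instance
    data = rados_read(chunk_id)           # miss everywhere: go to RADOS
    local[chunk_id] = data                # assumed: promote locally
    directory[chunk_id] = local_id        # assumed: publish the new location
    return data, "RADOS"


caches = {"rgw_1": {}, "rgw_2": {}}
directory = {}
fetch = lambda chunk_id: b"payload:" + chunk_id.encode()

sources = [
    read_chunk("chunk_34", "rgw_1", caches, directory, fetch)[1],
    read_chunk("chunk_34", "rgw_1", caches, directory, fetch)[1],
    read_chunk("chunk_34", "rgw_2", caches, directory, fetch)[1],
]
```

The first read on `rgw_1` goes to RADOS, the second is an L1 hit, and a read of the same chunk on `rgw_2` is served from `rgw_1` as an L2 hit.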
  16. 16. Shared read-only cache for RGW
      [Diagram labels: rgw_1 and rgw_2, each with S3 API, Swift API, rgw_frontend, rgw_rados, rgw_cache, datacache policy, Immutable Cache, and a local cache; L1 and L2 cache paths; librados; RADOS]
  17. 17. Issues
      • Different caching semantics for block and object?
        • Promoting at block level (default 8k) for librbd
        • Promoting at object level for RGW
      • #13144 is not compiling
        • https://github.com/maniaabdi/engage1.git
        • Jewel based; needs a rebase against master
        • Currently the logic is inside rgw_rados; it needs to be decoupled to fit our design (libcachefile + policy)
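To make the block-level promotion granularity concrete, the sketch below computes which promotion blocks a read extent touches, using the 8k default mentioned above. The helper name and the assumption that blocks are aligned to multiples of the block size are mine, not from the slides.

```python
def blocks_for_extent(offset, length, block_size=8192):
    """Indices of the promotion blocks covered by [offset, offset + length)."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


aligned = blocks_for_extent(8192, 4096)     # falls inside a single 8k block
straddling = blocks_for_extent(4096, 8192)  # crosses a block boundary
```

A 4k read inside one block promotes one block, while an 8k read that starts mid-block touches two, which is one place where block and object semantics differ.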
  18. 18. RGW datacache (PR #13144)
      [Diagram labels: rgw_1 and rgw_2, each with S3 API, Swift API, rgw_frontend, rgw_rados, rgw_cache, datacache policy, Immutable Cache, and a local cache; L1 and L2 cache paths; librados; RADOS]
  19. 19. Backup
  20. 20. Shared read-only cache for RBD – overview
      • Read-only blocks from parent image(s) are cached in a shared area on compute node(s)
      • A cloned image reads from the shared cache unless COW happens
      [Diagram labels: compute nodes, each with read I/O and write I/O, local caches, a shared cache, an SSD backend, and a cache daemon; RADOS OSDs]
  21. 21. Shared read-only cache for RBD – fast cache warmup
      • The state of the shared cache is persisted to a local metadata file alongside the cache file
      • The state of the local cache is persisted to RBD metadata
      • On restart, the cache controller loads the cache metadata file and reuses the shared cache file
      • On RBD instance restart, the cache state is loaded as an in-memory map that identifies the COWed parts
      • Each cloned image has its own COW cache mapping:
        • Each read hit is served either from the shared cache or from the image's own cache
        • Cache mapping bits for COWed data
        • Updated when COW happens
  22. 22. Shared read-only cache for RBD – IO flow (private cache), read
      • Step 1: cache lookup
      • Step 2, in shared cache:
        • Read from the shared cache
      • Step 2', in COW cache:
        • Read from the COW cache
      COW cache mapping (RBD_1):
        rbd_id  lba      length
        Rbd_1   8192     4096
        Rbd_1   1048576  4096
      COW cache mapping (RBD_2):
        rbd_id  lba      length
        Rbd_2   8192     4096
        Rbd_2   1048576  4096
      [Diagram labels: compute node with RBD_1 (cloned) and RBD_2 (cloned), each with librbd, FileImageCache, COW data, and a private cache file; Cache_daemon with image_store, policy, GC; shared cache file on SSD; RADOS]
  23. 23. Shared read-only cache for RBD – IO flow (private cache), write
      • Step 1: cache lookup
      • Step 2, in shared cache:
        • Create an entry in the COW mapping table
        • Write to RADOS
      • Step 2', in COW cache:
        • Invalidate the chunk in the cache file
        • Write to RADOS
      COW cache mapping (RBD_1):
        rbd_id  lba      length
        rbd_1   8192     4096
        rbd_1   1048576  4096
      COW cache mapping (RBD_2):
        rbd_id  lba      length
        rbd_2   81920    4096
        rbd_2   1048576  4096
      [Diagram labels: compute node with RBD_1 (cloned) and RBD_2 (cloned), each with librbd, FileImageCache, COW data, and a private cache file; Cache_daemon with image_store, policy, GC; shared cache file on SSD; RADOS]
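The write path with a per-image COW mapping table can be sketched as below. This is only an illustration under simplifying assumptions: writes are chunk-aligned so that a lookup by `lba` is exact, and `CowMapping`, `on_write`, and `rows` are hypothetical names rather than anything from librbd.

```python
class CowMapping:
    """Hypothetical per-clone COW mapping table (rows of rbd_id, lba, length)."""

    def __init__(self, rbd_id):
        self.rbd_id = rbd_id
        self.entries = {}    # lba -> length, one row per COWed chunk
        self.cached = set()  # lbas whose COW data is valid in the private cache file

    def on_write(self, lba, length):
        """Return the actions taken for a write, in order."""
        if lba in self.entries:
            self.cached.discard(lba)    # already COWed: invalidate the cached chunk
            action = "invalidate"
        else:
            self.entries[lba] = length  # first write: create the mapping entry
            action = "create"
        return [action, "write_to_rados"]  # the data always goes to RADOS

    def rows(self):
        return [(self.rbd_id, lba, length)
                for lba, length in sorted(self.entries.items())]


mapping = CowMapping("rbd_1")
first = mapping.on_write(8192, 4096)
mapping.on_write(1048576, 4096)
mapping.cached.add(8192)              # pretend a later read cached the COW data
again = mapping.on_write(8192, 4096)  # second write to the same chunk
table = mapping.rows()
```

After these writes the table holds the same two rows shown for rbd_1 above, and the second write to LBA 8192 invalidates the privately cached chunk instead of adding a row.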
