Seastore: Next Generation Backing Store for Ceph

Ceph is an open source distributed storage system addressing file, block, and object storage use cases. Next generation storage devices require a change in strategy, so the community has been developing crimson-osd, an eventual replacement for ceph-osd intended to minimize CPU overhead and improve throughput and latency. Seastore is a new backing store for crimson-osd targeted at emerging storage technologies including persistent memory and ZNS devices.


  1. Brought to you by
     Seastore: Next Generation Backing Store for Ceph
     Samuel Just, Senior Principal Software Engineer at Red Hat
  2. Sam Just, Senior Principal Software Engineer
     ■ Currently Senior Principal Software Engineer at Red Hat
     ■ Crimson Lead
     ■ Veteran Ceph Contributor
  3. Crimson!
     What
     ■ Rewrite the IO path in Seastar
       ● Preallocate cores
       ● One thread per core
       ● Explicitly shard all data structures and work over cores
       ● No locks and no blocking
       ● Message passing between cores (sketched below)
       ● Polling for all IO
     ■ DPDK, SPDK
       ● Kernel bypass for network and storage IO
     ■ Multi-year effort
     Why
     ■ Not just about how many IOPS we do…
     ■ More about IOPS per CPU core
     ■ Current Ceph is based on a traditional multi-threaded programming model
     ■ Context switching is too expensive when storage is almost as fast as memory
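A minimal sketch of the shard-per-core, message-passing model above (illustrative only, not Crimson code): each piece of data is owned by exactly one core, and requests are routed to the owner with seastar::smp::submit_to instead of being protected by a lock.

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>

    // Route work on `key` to the core that owns it. No other core ever
    // touches the owner's data structures, so no locking is needed.
    seastar::future<> handle_on_owner(unsigned key) {
        unsigned owner = key % seastar::smp::count;   // static sharding
        return seastar::smp::submit_to(owner, [key] {
            // Runs on the owning core's reactor; `key`'s data lives here.
            (void)key;
            return seastar::make_ready_future<>();
        });
    }

    int main(int argc, char** argv) {
        seastar::app_template app;   // one reactor thread per core
        return app.run(argc, argv, [] { return handle_on_owner(42); });
    }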
  4. Classic-OSD Threading Architecture
     [diagram: separate worker-thread pools for the inbound Messenger, PG, and ObjectStore stages, linked by an osd op queue and a kv queue, with an outbound Messenger worker pool draining an out_queue]
  5. Crimson-OSD Threading Architecture
     [diagram: several identical Seastar reactor threads, each running the whole Messenger → PG → ObjectStore pipeline with async waits between stages instead of queues between thread pools]
  6. Classic vs Crimson Programming Model

     Classic: synchronous calls that block the worker thread.

         for (;;) {
             auto req = connection.read_request();      // blocks
             auto result = handle_request(req);
             connection.send_response(result);          // blocks
         }

     Crimson: the same loop as a chain of futures; the reactor thread is never blocked.

         seastar::repeat([conn, this] {
             return conn.read_request().then([this](auto req) {
                 return handle_request(req);
             }).then([conn](auto result) {
                 return conn.send_response(result);
             }).then([] {
                 return seastar::stop_iteration::no;    // loop again
             });
         });
  7. Seastore
  8. ObjectStore
     ■ Transactional
     ■ Composed of a flat object namespace
     ■ Object names may be large (>1k)
     ■ Each object contains a key->value mapping (string->bytes) as well as a data payload
     ■ Supports COW object clones
     ■ Supports ordered listing of both the omap and the object namespace (sketched below)
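To make those properties concrete, here is a toy in-memory model of the interface (a hedged sketch with invented names, far simpler than the real ObjectStore API):

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    using bytes = std::string;  // stand-in for a raw byte buffer

    struct ToyObjectStore {
        struct Object {
            bytes data;                          // byte payload
            std::map<std::string, bytes> omap;   // ordered string->bytes map
        };
        // Flat namespace; names may be long (>1k), so plain string keys.
        std::map<std::string, Object> objects;

        // COW clone: the real store shares extents; the toy just copies.
        void clone(const std::string& src, const std::string& dst) {
            objects[dst] = objects.at(src);
        }

        // Ordered listing of the object namespace from a cursor.
        std::vector<std::string> list(const std::string& after,
                                      std::size_t max) const {
            std::vector<std::string> out;
            for (auto it = objects.upper_bound(after);
                 it != objects.end() && out.size() < max; ++it) {
                out.push_back(it->first);
            }
            return out;
        }
    };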
  9. SeaStore
     ■ New ObjectStore implementation designed natively for crimson's threading/callback model
     ■ Avoids CPU-heavy metadata designs like RocksDB
     ■ Intended to exploit emerging technologies like ZNS, fast NVMe, and persistent memory
  10. Seastore - ZNS
     ■ New NVMe specification
     ■ Intended to address challenges with conventional FTL designs
       ● High write amplification, bad for QLC flash
       ● Background garbage collection tends to impact tail latencies
     ■ Different interface (sketched below)
       ● Drive divided into zones
       ● Zones can only be opened, written sequentially, closed, and released
     ■ As it happens, this kind of write pattern tends to be good for conventional SSDs as well
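A small sketch of the zone discipline just described (hypothetical types, not the NVMe driver API): between a reset and a finish, a zone accepts only sequential writes at its write pointer.

    #include <cstdint>
    #include <stdexcept>

    struct Zone {
        enum class State { Empty, Open, Full };
        State state = State::Empty;
        uint64_t start = 0, size = 0;   // LBA range covered by the zone
        uint64_t write_pointer = 0;     // next writable LBA, only advances

        void open()   { state = State::Open;  write_pointer = start; }
        void finish() { state = State::Full; }
        void reset()  { state = State::Empty; write_pointer = start; }

        // Writes must land exactly at the write pointer: sequential only.
        void write(uint64_t lba, uint64_t len) {
            if (state != State::Open || lba != write_pointer ||
                lba + len > start + size) {
                throw std::runtime_error("non-sequential or out-of-zone write");
            }
            write_pointer += len;
        }
    };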
  11. Seastore - Persistent Memory
     ■ Characteristics:
       ● Almost DRAM-like read latencies
       ● Write latency drastically lower than flash
       ● Very high write endurance
       ● Seems like a good fit for persistently caching data and metadata!
       ● Reads from persistent memory can simply return a ready future without waiting at all (sketched below)
     ■ SeaStore approach currently being discussed:
       ● Keep the caching layer in persistent memory
       ● Update by copy-on-write with a paddr->extent mapping maintained via a write-ahead journal
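The ready-future point is worth seeing in code. A minimal sketch (invented helper names, not SeaStore's Cache): a pmem hit returns immediately with no continuation scheduled, while a miss falls back to asynchronous device IO.

    #include <seastar/core/future.hh>
    #include <cstdint>
    #include <optional>

    struct extent_t {};  // placeholder for cached block contents

    // Hypothetical stand-ins for the pmem cache and the device read path.
    std::optional<extent_t*> pmem_lookup(uint64_t) { return std::nullopt; }
    seastar::future<extent_t*> read_from_device(uint64_t) {
        return seastar::make_ready_future<extent_t*>(nullptr);
    }

    seastar::future<extent_t*> read_extent(uint64_t paddr) {
        if (auto hit = pmem_lookup(paddr)) {
            // Hit: DRAM-like latency, so hand back a ready future and let
            // the caller continue without a reactor round trip.
            return seastar::make_ready_future<extent_t*>(*hit);
        }
        return read_from_device(paddr);  // miss: asynchronous block read
    }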
  12. Design
  13. Seastore - Logical Structure
     [diagram: the Root block references two trees: onode_by_hobject (hobject_t -> onode_t), whose onodes carry block_info and an omap_tree, and lba_tree (lba_t -> block_info_t), whose entries describe extents, i.e. contiguous blocks of LBAs]
  14. Seastore - Why use an LBA indirection?
     [diagram: blocks A, B, C, with B referencing C; GC rewrites C as C']
     When GCing node C, we need to do two things:
     1. Find all incoming references (B in this case)
     2. Write out a transaction updating (and dirtying) those references as well as writing out the new block C'
     Using direct references means we still need to maintain some means of finding the parent references, and we pay the cost of updating the relatively low-fanout onode and omap trees. By contrast, using an LBA indirection requires extra reads in the lookup path, but potentially decreases read and write amplification during GC. It also makes refcounting and subtree sharing for COW clones easier, and we can use it to do sparse mapping for object data extents (see the sketch below).
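The trade-off reduces to a few lines (hypothetical structures, not the real BtreeLBAManager): lookups pay one extra hop through the mapping, but GC dirties only the mapping entry.

    #include <cstdint>
    #include <map>

    using laddr_t = uint64_t;  // logical address: what onode/omap trees store
    using paddr_t = uint64_t;  // physical location: what GC actually moves

    std::map<laddr_t, paddr_t> lba_tree;  // the indirection

    // The extra read on the lookup path.
    paddr_t resolve(laddr_t laddr) { return lba_tree.at(laddr); }

    // GC relocation: one mapping entry is dirtied; blocks holding the
    // logical address (the low-fanout onode/omap trees) are untouched.
    void gc_relocate(laddr_t laddr, paddr_t new_paddr) {
        lba_tree[laddr] = new_paddr;
    }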
  15. Seastore - Layout
     [diagram: the journal is a sequence of segments; each record carries a header, deltas (delta D', delta E'), and logical blocks (block B); the LBA Tree maps logical addresses to the current physical blocks (A, E, D before the record commits; E', D', B after)]
  16. Seastore - ZNS
     [diagram: the same record/segment layout as slide 15, shown mapped onto ZNS zones]
  17. Architecture
     [diagram: OnodeManager, OmapManager, ObjectDataHandler, ... sit above TransactionManager; Cache, Journal, LBAManager, SegmentCleaner, and the RootBlock sit below it inside SeaStore]
     ■ Broadly, SeaStore is divided into components above and below TransactionManager (see the sketch below):
       ● TransactionManager supplies a transactional interface in terms of logically addressed blocks used by data extents and metadata structures like the ghobject->onode index and omap trees.
       ● Components under TransactionManager (mainly the logical address tree) deal in terms of physically addressed extents.
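One way to picture the split (type names invented for illustration): everything above TransactionManager speaks logical addresses, and only the layers below it ever see physical segment offsets.

    #include <cstdint>

    using laddr_t = uint64_t;  // used by OnodeManager, OmapManager, data path
    using paddr_t = uint64_t;  // segment + offset; LBAManager level and below

    // Above TransactionManager: a logically addressed block reference.
    struct LogicalRef  { laddr_t laddr; uint32_t len; };

    // Below TransactionManager: where the block currently lives. This can
    // change on GC without rewriting any logically addressed metadata.
    struct PhysicalRef { paddr_t paddr; uint32_t len; };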
  18. OnodeManager -- FLTree
     ■ crimson/os/seastore/onode_manager/staged-fltree/
     ■ Btree-based structure for mapping ghobject_t -> onode
     ■ Splits ghobject_t keys into a fixed prefix (shard, pool, key), a variable middle (object name and namespace), and a fixed suffix (snapshot, generation), as sketched below
     ■ Internal nodes drop components that are uniform over the whole node to reduce space overhead
     ■ Internal nodes can also drop components that differ between adjacent keys
       ● Allows nodes close to the root to avoid recording name and namespace, improving density
     ■ Contributed mainly by Yingxin Cheng <yingxin.cheng@intel.com>
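A rough sketch of that key decomposition (field names assumed, not the staged-fltree encoding): the fixed-width pieces compare cheaply, and any component uniform across a node can be stored once per node instead of once per key.

    #include <cstdint>
    #include <string>
    #include <tuple>

    struct FixedPrefix { uint8_t shard; int64_t pool; uint32_t key_hash; };
    struct VariableMid { std::string nspace, name; };  // arbitrary length
    struct FixedSuffix { uint64_t snap, gen; };

    struct SplitKey {
        FixedPrefix prefix;  // often uniform over subtrees near the root
        VariableMid middle;  // droppable in internal nodes when redundant
        FixedSuffix suffix;

        auto as_tuple() const {
            return std::tie(prefix.shard, prefix.pool, prefix.key_hash,
                            middle.nspace, middle.name,
                            suffix.snap, suffix.gen);
        }
        bool operator<(const SplitKey& o) const {
            return as_tuple() < o.as_tuple();  // lexicographic key order
        }
    };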
  19. OmapManager -- BtreeOmapManager
     ■ crimson/os/seastore/omap_manager/btree/
     ■ Implements omap storage for each object
     ■ Fairly straightforward btree mapping string keys to string values
     ■ Contributed mainly by chunmei-liu <chunmei.liu@intel.com>
  20. ObjectDataHandler
     ■ crimson/os/seastore/object_data_handler.h/cc
     ■ Each onode maps a contiguous, sparse range of logical addresses to that object's offsets
     ■ Leverages the LBAManager to avoid a secondary extent map
     ■ Clone support requires work to enable relocating logical extent addresses (TODO)
  21. TransactionManager
     ■ crimson/os/seastore/transaction_manager.h/cc
     ■ Implements a uniform transactional interface for allocating, reading, and mutating logically addressed extents
     ■ Mutations to extents can be expressed as a compact, type-dependent "delta" which is included transparently in the commit journal record (sketched below)
       ● For example, BtreeOmapManager represents the insertion of a key into a block by encoding the key/value pair rather than needing to rewrite the full block
     ■ Components using TransactionManager can ignore extent relocation entirely
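The omap example in miniature (an invented encoding, just to show the shape of a delta): the journal record carries a small description of the mutation, and replay applies it to the cached block image instead of logging the whole block.

    #include <map>
    #include <string>

    struct OmapBlock { std::map<std::string, std::string> kvs; };

    // What the journal record carries for an insertion: key + value only.
    struct OmapInsertDelta {
        std::string key, value;
        // Applied at commit time, and again on replay after a crash.
        void apply(OmapBlock& b) const { b.kvs[key] = value; }
    };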
  22. LBAManager -- BtreeLBAManager
     ■ crimson/os/seastore/lba_manager/btree/
     ■ Btree mapping logical addresses to physical offsets (leaf entry sketched below)
     ■ Includes reference counts to be used with clone
     ■ Will include extent checksums (TODO)
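Putting those bullets together, a leaf mapping plausibly looks something like this (fields assumed, not the real on-disk value type):

    #include <cstdint>

    struct lba_mapping_sketch {
        uint64_t laddr;     // key: logical address
        uint64_t paddr;     // value: current physical location
        uint32_t length;    // extent length
        uint32_t refcount;  // >1 once shared by a COW clone
        uint32_t checksum;  // planned extent checksum (TODO per the slide)
    };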
  23. Cache
     ■ crimson/os/seastore/cache.h/cc
     ■ Manages in-memory extents, keeps dirty extents pinned in memory
     ■ Detects conflicts in transactions to be committed (sketched below)
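A simplified model of that conflict check (invented bookkeeping, not Cache's actual mechanism): each transaction remembers what it read, and a commit fails if anything it read has since been rewritten, forcing a retry.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using paddr_t = uint64_t;
    std::map<paddr_t, uint64_t> extent_version;  // bumped on every commit

    struct Txn { std::vector<std::pair<paddr_t, uint64_t>> read_set; };

    void note_read(Txn& t, paddr_t p) {
        t.read_set.emplace_back(p, extent_version[p]);
    }

    // Returns false if an extent read by `t` changed underneath it; the
    // caller then restarts the transaction against the new state.
    bool try_commit(Txn& t, const std::vector<paddr_t>& writes) {
        for (auto& [p, v] : t.read_set) {
            if (extent_version[p] != v) return false;  // conflict detected
        }
        for (auto p : writes) ++extent_version[p];
        return true;
    }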
  24. SegmentCleaner
     ■ crimson/os/seastore/segment_cleaner.h/cc
     ■ Tracks usage status of segments (victim selection sketched below)
     ■ Runs a background process (within the same reactor) to choose a segment, relocate any live extents, and release it
       ● Logical extents are simply remapped within the LBAManager
       ● Physical extents (mainly BtreeLBAManager extents) need any references updated as well (mainly btree parents)
     ■ Also responsible for throttling foreground work based on pending GC work -- the goal is to avoid abrupt GC pauses and smooth out latency
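The victim-selection step in miniature (invented bookkeeping; real policies weigh age and utilization more carefully): clean the segment with the least live data first, since it costs the least relocation IO per byte reclaimed.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Segment {
        uint32_t id = 0;
        uint64_t live_bytes = 0;  // shrinks as extents are overwritten
    };

    // Pick the segment that is cheapest to clean: fewest live bytes to
    // relocate before the whole segment can be released.
    Segment* choose_victim(std::vector<Segment>& segments) {
        auto it = std::min_element(
            segments.begin(), segments.end(),
            [](const Segment& a, const Segment& b) {
                return a.live_bytes < b.live_bytes;
            });
        return it == segments.end() ? nullptr : &*it;
    }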
  25. Current Status and Future Work
     ■ Can boot an OSD via vstart without snapshots and complete upwards of several minutes of IO before crashing!
     ■ Performance/stability: stabilize, add to Ceph upstream testing infrastructure
     ■ Add the ability to remap the location of logical extents
     ■ In-place mutation support for fast NVMe devices
     ■ Persistent memory support in Cache
     ■ Tiering
  26. Brought to you by
     Samuel Just, Senior Principal Software Engineer at Red Hat
     Email: sjust@redhat.com
