Seastore: Next Generation Backing Store for Ceph
Samuel Just
Senior Principal Software Engineer at Red Hat
Sam Just
Senior Principal Software Engineer
■ Currently Senior Principal Software Engineer at Red Hat
■ Crimson Lead
■ Veteran Ceph Contributor
Crimson!
What
■ Rewrite IO path in Seastar
● Preallocate cores
● One thread per core
● Explicitly shard all data structures and work across cores
● No locks and no blocking
● Message passing between cores (see the sketch below)
● Polling for all IO
■ DPDK, SPDK
● Kernel bypass for network and storage IO
■ Multi-year effort
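Schematically, the message-passing style looks like the following (a minimal sketch assuming Seastar's smp::submit_to; local_map() is a hypothetical per-shard accessor):

// Sketch: message passing instead of locking. To touch data owned
// by another core, hop to that core's reactor and run there.
seastar::future<int> get_value(unsigned owner_shard, int key) {
  return seastar::smp::submit_to(owner_shard, [key] {
    // Runs on owner_shard's thread; no locks are needed because
    // each shard's structures are only touched by their own core.
    return local_map()[key];  // hypothetical per-shard map
  });
}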
Why
■ Not just about how many IOPS we do…
■ More about IOPS per CPU core
■ Current Ceph is based on a traditional multi-threaded programming model
■ Context switching is too expensive when storage is almost as fast as memory
Classic-OSD Threading Architecture
[Diagram: request pipeline with a thread pool at each stage: Messenger worker threads feed the osd op queue, PG worker threads feed the kv queue, ObjectStore worker threads feed the out_queue, and Messenger worker threads send the replies]
Crimson-OSD Threading Architecture
[Diagram: five independent Seastar reactor threads, each running its own Messenger -> PG -> ObjectStore -> Messenger pipeline connected by async waits, with no cross-thread queues or locks]
Classic vs Crimson Programming Model
// Classic: one thread per connection; every call blocks the thread
for (;;) {
  auto req = connection.read_request();   // blocks until a request arrives
  auto result = handle_request(req);      // blocks on backend IO
  connection.send_response(result);       // blocks until sent
}

// Crimson: each step returns a future; the reactor runs other work
// while this chain waits (schematic Seastar-style code)
repeat([conn, this] {
  return conn.read_request().then([this](auto req) {
    return handle_request(req);
  }).then([conn](auto result) {
    return conn.send_response(result);
  });
});
Seastore
ObjectStore
■ Transactional
■ Composed of a flat object namespace
■ Object names may be large (>1 KiB)
■ Each object contains a key->value mapping (string->bytes, the "omap") as well as a data payload.
■ Supports COW object clones
■ Supports ordered listing of both the omap and the object namespace
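The contract above can be summarized in a hypothetical interface sketch (illustrative names only, not the actual ObjectStore API; oid_t and bufferlist are stand-ins):

// Illustrative sketch of the semantics listed above.
struct Transaction;   // a batch of mutations applied atomically
struct ObjectStoreLike {
  // data payload reads
  virtual seastar::future<bufferlist> read(oid_t oid, uint64_t off, size_t len) = 0;
  // per-object key->value mapping (the omap)
  virtual void omap_set(Transaction& t, oid_t oid, std::string key, bufferlist val) = 0;
  // COW object clone
  virtual void clone(Transaction& t, oid_t src, oid_t dst) = 0;
  // ordered listing of the object namespace
  virtual seastar::future<std::vector<oid_t>> list_objects(oid_t after, size_t max) = 0;
  // atomically commit everything staged in t
  virtual seastar::future<> do_transaction(Transaction&& t) = 0;
};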
SeaStore
■ New ObjectStore implementation designed natively for crimson’s threading/callback model
■ Avoids CPU-heavy metadata designs like RocksDB
■ Intended to exploit emerging technologies like ZNS, fast NVMe, and persistent memory.
Seastore - ZNS
■ New NVMe Specification
■ Intended to address challenges with conventional FTL designs
● High write amplification, bad for QLC flash
● Background garbage collection tends to impact tail latencies
■ Different interface
● Drive divided into zones
● Zones can only be opened, written sequentially, closed, and released.
■ As it happens, this kind of write pattern tends to be good for conventional SSDs as well.
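The zone lifecycle can be sketched as follows (hypothetical interface; real ZNS IO goes through the NVMe driver, and device_write here is a stand-in):

// Hypothetical sketch of the ZNS write rules described above.
struct Zone {
  uint64_t start, capacity;
  uint64_t write_pointer;  // next writable offset; only advances
};
void zone_write(Zone& z, const char* data, size_t len) {
  // Writes must land exactly at the write pointer: sequential only.
  assert(z.write_pointer + len <= z.start + z.capacity);
  device_write(z.write_pointer, data, len);  // stand-in for the driver call
  z.write_pointer += len;                    // no overwrite, no backtracking
}
void zone_reset(Zone& z) {
  // Releasing (resetting) the zone is the only way to reclaim its space.
  z.write_pointer = z.start;
}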
Seastore - Persistent Memory
■ Characteristics:
● Almost DRAM-like read latencies
● Write latency drastically lower than flash
● Very high write endurance
● Seems like a good fit for persistently caching data and metadata!
● Reads from persistent memory can simply return a ready future without waiting at all.
■ SeaStore approach currently being discussed:
● Keep caching layer in persistent memory.
● Update by copy-on-write with a paddr->extent mapping maintained via a write-ahead journal.
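The "ready future" point can be sketched like this (illustrative; pmem_cache_lookup and read_from_device are hypothetical helpers):

// If the extent is cached in persistent memory, the read completes
// synchronously: return a ready future, no reactor suspension needed.
seastar::future<extent_ref> read_extent(paddr_t addr) {
  if (auto ext = pmem_cache_lookup(addr)) {   // hypothetical pmem cache
    return seastar::make_ready_future<extent_ref>(std::move(ext));
  }
  return read_from_device(addr);  // hypothetical async device read
}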
Design
Seastore - Logical Structure
[Diagram: the Root block references two trees: onode_by_hobject (hobject_t -> onode_t) and lba_tree (lba_t -> block_info_t). Each onode carries an omap_tree and extents (contiguous blocks of LBAs).]
Seastore - Why use an LBA indirection?
[Diagram: a small tree in which node B (under root A) references node C, which gc is rewriting as C']
When GCing node C, we need to do 2 things:
1. Find all incoming references (B in this case)
2. Write out a transaction updating (and dirtying) those references as well as writing out the new block C'
Using direct references means we still need to maintain some means of finding the parent references, and we pay the cost of updating the relatively low-fanout onode and omap trees.
By contrast, using an LBA indirection requires extra reads in the lookup path, but potentially decreases read and write amplification during gc.
It also makes refcounting and subtree sharing for COW clones easier, and we can use it to do sparse mapping for object data extents.
Seastore - Layout
[Diagram: the journal is a sequence of segments; each record carries a header, deltas (delta D', delta E'), a physical block A, and a logical block B. Two LBA Tree views show extents E, D, and A before the record, and E', D', and B after it is applied.]
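The record structure the diagram implies might look like this (a hedged sketch; field names are illustrative, not SeaStore's actual encoding):

// Illustrative journal record layout.
struct delta_t {                 // compact mutation to an existing extent
  laddr_t    target;             // extent the delta applies to
  bufferlist payload;            // type-dependent encoding (e.g. omap insert)
};
struct record_t {
  record_header_t      header;   // lengths, checksum, sequence number
  std::vector<delta_t> deltas;   // e.g. delta D', delta E'
  std::vector<extent_t> blocks;  // full new extents, physical (A) or logical (B)
};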
Seastore - ZNS
[Diagram: the same segmented journal layout as the previous slide, here with journal segments placed on ZNS zones so all writes remain sequential within a zone.]
Architecture
[Diagram: layering. SeaStore on top; OnodeManager, OmapManager, ObjectDataHandler, ... in the middle; TransactionManager below them, built from Cache, Journal, LBAManager, and SegmentCleaner, rooted at the RootBlock.]
■ Broadly, SeaStore is divided into components above and below TransactionManager:
● TransactionManager supplies a transactional interface in terms of logically addressed blocks used by data extents and metadata structures like the ghobject->onode index and omap trees.
● Components under TransactionManager (mainly the logical address tree) deal in terms of physically addressed extents.
OnodeManager -- FLTree
■ crimson/os/seastore/onode_manager/staged-fltree/
■ Btree-based structure for mapping ghobject_t -> onode
■ Splits ghobject_t keys into a fixed prefix (shard, pool, key), variable middle (object name and namespace), and fixed suffix (snapshot, generation)
■ Internal nodes drop components that are uniform over the whole node to reduce space overhead.
■ Internal nodes can also drop components that differ between adjacent keys.
● Allows nodes close to the root to avoid recording name and namespace, improving density.
■ Contributed mainly by Yingxin Cheng <yingxin.cheng@intel.com>
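A rough sketch of that key decomposition (illustrative structs, not the fltree's actual types; treating the "key" component as the crush hash is an assumption):

// Illustrative decomposition of a ghobject_t btree key.
struct key_prefix_t {       // fixed-size prefix
  shard_t  shard;
  int64_t  pool;
  uint32_t hash;            // the "key" component (assumed: crush hash)
};
struct key_middle_t {       // variable-size middle
  std::string nspace;       // object namespace
  std::string name;         // object name, may exceed 1 KiB
};
struct key_suffix_t {       // fixed-size suffix
  snapid_t snap;            // snapshot
  gen_t    gen;             // generation
};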
OmapManager -- BtreeOmapManager
■ crimson/os/seastore/omap_manager/btree/
■ Implements omap storage for each object.
■ Fairly straightforward btree mapping string keys to string values.
■ Contributed mainly by chunmei-liu <chunmei.liu@intel.com>
ObjectDataHandler
■ crimson/os/seastore/object_data_handler.h/cc
■ Each onode maps a contiguous (but sparsely populated) range of logical addresses to that object's offsets
■ Leverages the LBAManager to avoid a secondary extent map.
■ Clone support requires work to enable relocating logical extent addresses (TODO)
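A minimal sketch of why no secondary extent map is needed, assuming each object reserves a contiguous logical range (data_base_laddr is a hypothetical field):

// Object offsets translate directly into the object's reserved
// logical range; the shared LBA tree then resolves laddr -> paddr.
laddr_t object_offset_to_laddr(const onode_t& o, uint64_t offset) {
  return o.data_base_laddr + offset;  // hypothetical onode field
}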
TransactionManager
■ crimson/os/seastore/TransactionManager.h/cc
■ Implements a uniform transactional interface for allocating, reading, and mutating logically addressed extents.
■ Mutations to extents can be expressed as a compact, type-dependent "delta" which will be included transparently in the commit journal record.
● For example, BtreeOmapManager represents the insertion of a key into a block by encoding the key/value pair rather than needing to rewrite the full block.
■ Components using TransactionManager can ignore extent relocation entirely.
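The delta mechanism from the omap example might be used like this (a hedged sketch; add_delta and encode_omap_insert are illustrative, not the real API):

// Illustrative: mutate a cached extent and stage only a delta.
seastar::future<> omap_insert(Transaction& t, extent_ref leaf,
                              std::string key, std::string value) {
  leaf->insert(key, value);  // mutate the in-memory omap leaf
  // Stage a compact delta (just the key/value pair); the commit
  // journal record carries it instead of the whole rewritten block.
  t.add_delta(leaf->get_laddr(), encode_omap_insert(key, value));
  return seastar::now();
}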
LBAManager -- BtreeLBAManager
■ crimson/os/seastore/lba_manager/btree/
■ Btree mapping logical addresses to physical offsets
■ Includes reference counts to be used with clone.
■ Will include extent checksums (TODO)
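The mapping entries might look roughly like this (illustrative fields, not the actual on-disk format):

// Illustrative LBA tree value: logical address -> physical location.
struct lba_map_val_t {
  paddr_t  paddr;     // physical offset of the extent
  uint32_t len;       // extent length in bytes
  uint32_t refcount;  // >1 once COW clones share the extent
  uint32_t checksum;  // extent checksum (TODO per the slide above)
};
// Conceptually: btree<laddr_t, lba_map_val_t>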
Cache
■ crimson/os/seastore/cache.h/cc
■ Manages in-memory extents, keeps dirty extents pinned in memory
■ Detects conflicts in transactions to be committed
SegmentCleaner
■ crimson/os/seastore/segment_cleaner.h/cc
■ Tracks usage status of segments.
■ Runs a background process (within the same reactor) to choose a segment, relocate any live extents, and release it.
● Logical extents are simply remapped within the LBAManager
● Physical extents (mainly BtreeLBAManager extents) need any references updated as well (mainly btree parents)
■ Also responsible for throttling foreground work based on pending gc work -- goal is to avoid abrupt gc pauses and smooth out latency.
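A rough Seastar-style sketch of that background process (illustrative helpers; the real SegmentCleaner is considerably more involved):

// Illustrative gc loop running inside the same reactor thread.
seastar::future<> run_gc() {
  return seastar::keep_doing([this] {
    auto seg = choose_segment_to_clean();   // hypothetical policy hook
    return relocate_live_extents(seg)       // remap logical extents, update
      .then([this, seg] {                   // physical parents as needed
        return release_segment(seg);        // segment becomes reusable
      });
  });
}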
Current Status and Future Work
■ Can boot an osd via vstart without snapshots and complete upwards of several minutes of IO before crashing!
■ Performance/stability: stabilize, add to ceph upstream testing infrastructure
■ Add ability to remap location of logical extents
■ In-place mutation support for fast NVMe devices
■ Persistent memory support in Cache
■ Tiering
Samuel Just
Senior Principal Software Engineer at Red Hat
Email: sjust@redhat.com
