Experiences building a distributed shared-log on RADOS
Noah Watkins
UC Santa Cruz
@noahdesu
About me
● Graduate student
● UC Santa Cruz
● Data management
● Storage systems
● High-performance computing
● Quality of service
● Ceph shop for prototyping
Ceph storage interfaces, a familiar sight...
RADOS
LIBRADOS RADOSGW RBD CEPHFS
Our research: programmability in storage
RADOS
LIBRADOS RADOSGW RBD CEPHFS Matrix Log
The agenda
● Why should I care about logs?
● How to build a really, really fast log
● Mapping techniques for fast logs onto Ceph
● A few usage examples
Why you should care about logs
A log is an ordered list of data elements
● General abstract form of a log
● Ordered list of elements
● New entries are appended to the log
● Log entries can be read directly
[Figure: clients append to a log with positions 0 through 12]
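The abstraction above can be sketched in a few lines. This is a minimal, single-process illustration only; the `InMemoryLog` class and its method names are hypothetical and are not part of ZLog or CORFU.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical single-process log: a position is an index into a vector.
class InMemoryLog {
 public:
  // Append adds an entry at the tail and returns the position it landed on.
  uint64_t Append(const std::string& entry) {
    entries_.push_back(entry);
    return entries_.size() - 1;
  }

  // Read returns the entry at `position`, or nullptr if nothing is there yet.
  // The pointer is only valid until the next Append.
  const std::string* Read(uint64_t position) const {
    if (position >= entries_.size()) return nullptr;
    return &entries_[position];
  }

 private:
  std::vector<std::string> entries_;
};
```

Everything that follows in the talk is about keeping these two operations fast and consistent once many clients share the log across a cluster.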
Everyone is going bananas for logs
● Very high throughput logs
● No global ordering
● Strongly consistent, global ordering
● Write batching
● Single writer
Apache BookKeeper
Strongly consistent, high-performance shared-log
[Figure: a shared log with positions 0 through 12]
● Database management systems
● Replicated state machines
● Cloud-based metadata services
Shared-logs are challenging to scale
Balakrishnan et al., “Tango: Distributed Data Structures over a Shared Log”, SOSP ‘13
[Figure: clients append through a Paxos group (nodes 0, 1, 2) to a log with positions 0 through 12]
● Writes are funneled through a total ordering engine
● Parallel reads supported when log is distributed
● Random read workloads perform poorly on HDDs
CORFU: A Shared Log Design for Flash Clusters
Balakrishnan et al., NSDI 2012
CORFU decouples I/O from ordering
Sequencer
● RPC: Tail, Next
● 100K-1M pos/sec
std::atomic<uint64_t> seq;
tail = seq;
next = seq++;
CORFU stripes log across a cluster of flash devices
[Figure: clients get positions 1, 2, 3, 4... from the sequencer, then append and read in parallel across log positions 0 through 12]
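The sequencer on the slides boils down to a single atomic counter. A minimal sketch, assuming an in-process class named `Sequencer` (hypothetical; the real service is reached over RPC) whose `Tail` and `Next` methods mirror the two RPCs:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical in-process stand-in for the sequencer service. In a real
// deployment these two methods would sit behind the Tail and Next RPCs.
class Sequencer {
 public:
  // Tail: report the next unassigned position without consuming it.
  uint64_t Tail() const { return seq_.load(); }

  // Next: atomically hand out a position; no two callers get the same one.
  uint64_t Next() { return seq_.fetch_add(1); }

 private:
  std::atomic<uint64_t> seq_{0};
};
```

The sequencer only assigns positions and never touches log data, which is what lets appends and reads go to the storage devices in parallel.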
You might be thinking...
[Figure: clients get positions 1, 2, 3, 4, 5, … from the sequencer (RPC: Tail, Next; 100K-1M pos/sec), decoupling append I/O from assignment of ordering]
● It can’t possibly be this simple.... you’d be correct
● Isn’t this supposed to be a Ceph talk?
The CORFU I/O path
[Figure: Log interface (e.g. Append) → ?? → Cluster of flash devices]
CORFU uses a cluster map and striping function
● Striping and addressing
CORFU replicates log entries across the cluster
● Fault tolerance: replication
The CORFU protocol relies on a custom I/O interface
● Custom hardware interface
A dedicated CORFU cluster is expensive...
● Duplicated services
○ Fault-tolerance
○ Metadata management
● Dedicated / over-provisioned hardware
○ You need a full cluster of flash devices!
● Custom storage interfaces
○ Hardware or software
● Already have a Ceph cluster?
○ Too bad
● Can we use software-defined storage?
Mapping the components of CORFU onto Ceph
[Figure: Log interface (e.g. Append) → striping and addressing, fault tolerance, custom hardware interface → storage cluster]
A cluster of OSDs rather than raw storage devices
● Cluster of flash devices → Cluster of OSDs
Ceph handles fault-tolerance transparently
● Fault tolerance → Replication
Log distribution becomes an object naming issue
● Striping and addressing → Object naming
Storage interface built using RADOS object classes
● Custom hardware interface → RADOS object class
ZLog is an implementation of CORFU on Ceph
[Figure: clients get positions 1, 2, 3, 4... from the sequencer (RPC: Tail, Next; 100K-1M pos/sec), then append and read in parallel against OSDs; labels: BlueStore, RocksDB, LMDB]
● Transparent changes to hardware, software, tuning
● Integration with features like tiering, erasure coding, and I/O hinting
The ZLog project is open-source on GitHub
● https://github.com/cruzdb/zlog
● Development backend (LMDB)
● Tools to build Ceph plugin
● High and low-level APIs
// build a backend instance
auto backend = std::unique_ptr<zlog::Backend>(
    new zlog::storage::ceph::CephBackend(ioctx));
// open the log
zlog::Log log;
int ret = zlog::Log::CreateWithBackend(std::move(backend),
    "mylog", &log);
// append to the log
uint64_t pos;
log.Append(Slice("foo"), &pos);
Building applications on RADOS using object classes
Design decisions for the ZLog I/O path
[Figure: I/O path stack: ZLog client library (striping policy), librados, OSD, RADOS object classes (CORFU storage interface)]
● Log entry → Object
● Stripe log across objects
● … but not too many objects
● Changes to striping policy
● Coordinated metadata update
● Treat an object like a consensus API
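One way to picture a striping policy is to map each log position to one of a fixed number of objects in round-robin order, so the object name alone says where an entry lives. This is a sketch under assumptions, not ZLog's actual layout: the `StripeObjectName` helper, the `<log>.<index>` naming scheme, and the `width` parameter are all hypothetical.

```cpp
#include <cstdint>
#include <string>

// Hypothetical round-robin striping: position p lives in object (p % width).
// `width` bounds how many objects the log is spread across and must be > 0.
std::string StripeObjectName(const std::string& log_name, uint64_t position,
                             uint32_t width) {
  const uint64_t index = position % width;
  return log_name + "." + std::to_string(index);
}
```

With a width of 4, positions 0, 1, 2, 3 go to `mylog.0` through `mylog.3`, and position 4 wraps back to `mylog.0`. Changing `width` changes where every later position maps, which is why a change of striping policy has to be a coordinated metadata update.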
The CORFU storage interface
● write(obj, position, data)
○ Write-once: position must be open
○ Request must be tagged with up-to-date epoch
○ Position must not fall within a GC’d range
○ Optionally update maximum position observed
○ Atomic update
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
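The write rules listed above can be modeled with a small in-memory object. This is an illustrative sketch only: the `LogObject` class is hypothetical, trimming is tracked per position rather than as a range, `invalidate` and `seal` are left out, and apart from `-ESPIPE` for a stale epoch (which appears in the object class code later in the talk) the error codes are assumptions.

```cpp
#include <cerrno>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory stand-in for one log object. On an OSD each call
// would run atomically as an object class method; here it is single-threaded.
class LogObject {
 public:
  explicit LogObject(uint64_t epoch = 0) : epoch_(epoch) {}

  int Write(uint64_t epoch, uint64_t position, const std::string& data) {
    if (epoch < epoch_) return -ESPIPE;            // stale client epoch
    if (trimmed_.count(position)) return -ERANGE;  // position was GC'd
    if (entries_.count(position)) return -EROFS;   // write-once: already taken
    entries_[position] = data;
    if (!has_max_pos_ || position > max_pos_) {    // track max position seen
      max_pos_ = position;
      has_max_pos_ = true;
    }
    return 0;
  }

  int Read(uint64_t epoch, uint64_t position, std::string* out) const {
    if (epoch < epoch_) return -ESPIPE;
    if (trimmed_.count(position)) return -ERANGE;
    auto it = entries_.find(position);
    if (it == entries_.end()) return -ENOENT;      // nothing written yet
    *out = it->second;
    return 0;
  }

  // Trim marks a position for garbage collection; it can never be reused.
  int Trim(uint64_t epoch, uint64_t position) {
    if (epoch < epoch_) return -ESPIPE;
    entries_.erase(position);
    trimmed_.insert(position);
    return 0;
  }

 private:
  uint64_t epoch_;
  uint64_t max_pos_ = 0;
  bool has_max_pos_ = false;
  std::map<uint64_t, std::string> entries_;
  std::set<uint64_t> trimmed_;
};
```

Every check and the final update have to happen as one atomic step on the storage side, which is the requirement the following slides focus on.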
librados won’t [currently] cut it for this job
CORFU write interface as an object class
[Figure: I/O path stack: ZLog client library (striping policy), librados, OSD, RADOS object classes (CORFU storage interface); label: connection to client]
int write(cls_method_context_t hctx, ceph::bufferlist *in, ceph::bufferlist *out) {
  // decoding of the request (op), the object header, and the entry key
  // is omitted here
  // ensure up-to-date client
  if (epoch_guard(header, op.epoch(), false)) {
    return -ESPIPE;
  }
  // read entry based on request position
  ceph::bufferlist bl;
  int ret = cls_cxx_map_get_val(hctx, key, &bl);
  if (ret < 0 && ret != -ENOENT)
    return ret;
  // write entry if position is not taken
  if (ret == -ENOENT) {
    ret = entry_write_entry(hctx, key, entry);
    if (ret < 0)
      return ret;
    if (!header.has_max_pos() || (op.pos() > header.max_pos()))
      header.set_max_pos(op.pos());
    ret = entry_write_header(hctx, header);
  } else {
    // position is taken: decode the existing entry, then
    // handle errors associated with write-once semantics
    if (!decode(bl, entry))
      return -EIO;
  }
  return ret;
}
Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
   a. Super well documented, easy examples
2. Build and load the plugin into Ceph
3. Old way to manage plugins
   a. Manage your own Ceph fork
4. New way to manage plugins with SDK (credit: Neha Ojha @ b7215b0)
   a. dnf install rados-objclass-devel
   b. #include <rados/objclass.h>
   c. compile cls_hello.cc as object library
5. Distribute your plugin to OSDs at <libdir>/rados-classes/cls_hello.so
6. Start making requests: ioctx::exec("hello", …)
The CORFU seal(obj, epoch) interface is tougher
● Semantics: do the following atomically
○ Verify and update the stored epoch
○ Return the largest position written
● RADOS object classes don’t support this mix of ops
○ Currently
● Solution
○ Split the operation
○ Verify
● Luckily this isn’t a fast path
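A split seal could look roughly like this from the client side: first install the new epoch, then fetch the maximum position in a second request tagged with that epoch, which fails if another seal slipped in between. The slide only says the operation is split and verified, so this is one possible reading; the `SealableObject` class, its method names, and the error codes are assumptions for illustration.

```cpp
#include <cerrno>
#include <cstdint>

// Hypothetical in-memory stand-in for the epoch and max-position state
// kept in one log object.
class SealableObject {
 public:
  // Step 1: install a strictly newer epoch; anything else is rejected.
  int Seal(uint64_t epoch) {
    if (epoch <= epoch_) return -EINVAL;
    epoch_ = epoch;
    return 0;
  }

  // Step 2: return the largest written position, but only if the stored
  // epoch is still the one the caller installed (the "verify" part).
  int MaxPos(uint64_t epoch, uint64_t* pos) const {
    if (epoch != epoch_) return -ESPIPE;
    if (!has_max_pos_) return -ENOENT;
    *pos = max_pos_;
    return 0;
  }

  // Stand-in for the write path: remember the largest position written.
  void RecordWrite(uint64_t position) {
    if (!has_max_pos_ || position > max_pos_) {
      max_pos_ = position;
      has_max_pos_ = true;
    }
  }

 private:
  uint64_t epoch_ = 0;
  uint64_t max_pos_ = 0;
  bool has_max_pos_ = false;
};

// Client-side seal assembled from the two separate requests.
int SealAndGetMaxPos(SealableObject& obj, uint64_t epoch, uint64_t* pos) {
  int ret = obj.Seal(epoch);
  if (ret < 0) return ret;
  return obj.MaxPos(epoch, pos);
}
```

The two requests are not atomic together, so a caller that gets `-ESPIPE` from the second step has lost a race with a newer seal and must start over. Since sealing is not on the fast path, the extra round trip is cheap.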
How we have been using ZLog
PostgreSQL logical replication with ZLog
[Figure: clients (C) issue queries; INSERT, UPDATE, and DELETE operations are appended to a ZLog log (positions 0 through 12, stored on OSDs) and replayed]
https://github.com/cruzdb/pg_zlog
CruzDB is an immutable key-value store on ZLog
[Figure: clients (C) issue key/value transactions; GET, PUT, and DELETE operations are appended to a ZLog log (positions 0 through 12, stored on OSDs) and replayed]
● Database is a copy-on-write red-black tree
● Efficient read-only snapshots
● Time travel capabilities
https://github.com/cruzdb/cruzdb
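To show why a copy-on-write tree gives cheap snapshots and time travel, here is a deliberately simplified sketch. It is not CruzDB's code: it uses an unbalanced binary search tree instead of a red-black tree, replays only PUT entries, and the names `Node`, `Put`, `Get`, and `Replay` are hypothetical.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
  std::string key, value;
  std::shared_ptr<const Node> left, right;
};
using Tree = std::shared_ptr<const Node>;

// Returns a new root. Only the nodes on the path to `key` are copied; every
// other subtree is shared with the old version, which stays untouched.
Tree Put(const Tree& root, const std::string& key, const std::string& value) {
  if (!root) return std::make_shared<Node>(Node{key, value, nullptr, nullptr});
  Node copy = *root;
  if (key < root->key) {
    copy.left = Put(root->left, key, value);
  } else if (key > root->key) {
    copy.right = Put(root->right, key, value);
  } else {
    copy.value = value;
  }
  return std::make_shared<Node>(std::move(copy));
}

// Looks up `key` in one version of the tree; nullptr if it is absent.
const std::string* Get(const Tree& root, const std::string& key) {
  const Node* n = root.get();
  while (n) {
    if (key < n->key) n = n->left.get();
    else if (key > n->key) n = n->right.get();
    else return &n->value;
  }
  return nullptr;
}

// Replays PUT entries in log order and keeps the root produced at each
// position, so snapshots[i] is the database as of log position i.
std::vector<Tree> Replay(
    const std::vector<std::pair<std::string, std::string>>& log) {
  std::vector<Tree> snapshots;
  Tree root;
  for (const auto& entry : log) {
    root = Put(root, entry.first, entry.second);
    snapshots.push_back(root);
  }
  return snapshots;
}
```

Reading from an older root is a read-only snapshot, and picking the root for an earlier log position is time travel; neither requires copying the whole tree.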
Wrapping up
● We use Ceph as a prototyping system
○ Application-specific interfaces
● There is broad interest in log-oriented interfaces
● We’ve built a Ceph-based implementation of a high-performance log
○ ZLog @ https://github.com/cruzdb/zlog
● Using it for several real-world use cases
○ PostgreSQL logical replication (https://github.com/cruzdb/pg_zlog)
○ CruzDB immutable database (https://github.com/cruzdb/cruzdb)
Thank you and questions
● Noah Watkins (@noahdesu)
● Learning resources
○ Object class SDK documentation
■ http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
○ ZLog is the only user of the SDK I’m aware of
■ https://github.com/cruzdb/zlog/blob/master/src/storage/ceph/cls_zlog.cc
○ The Ceph tree contains a ton of object classes for reference
■ https://github.com/ceph/ceph/tree/master/src/cls
○ Writing object classes using Lua
■ https://ceph.com/geen-categorie/dynamic-object-interfaces-with-lua/
■ https://nwat.xyz/blog/2018/03/13/video-of-my-talk-at-lua-workshop-2017/
80

Experiences building a distributed shared log on RADOS - Noah Watkins

  • 1.
    Experiences building adistributed shared-log on RADOS Noah Watkins UC Santa Cruz @noahdesu
  • 2.
    2 ● Graduate student ●UC Santa Cruz ● Data management, file systems, HPC, QoS ● Ceph as a prototyping platform About me ● Graduate student ● UC Santa Cruz
  • 3.
    3 ● Graduate student ●UC Santa Cruz ● Data management, file systems, HPC, QoS ● Ceph as a prototyping platform About me ● Graduate student ● UC Santa Cruz ● Data management ● Storage systems ● High-performance computing ● Quality of service
  • 4.
    4 ● Graduate student ●UC Santa Cruz ● Data management, file systems, HPC, QoS ● Ceph as a prototyping platform About me ● Graduate student ● UC Santa Cruz ● Data management ● Storage systems ● High-performance computing ● Quality of service ● Ceph shop for prototyping
  • 5.
    Ceph storage interfaces,a familiar sight... 5 RADOS LIBRADOS RADOSGW RBD CEPHFS
  • 6.
    Our research: programmabilityin storage 6 RADOS LIBRADOS RADOSGW RBD CEPHFSMatrix Log
  • 7.
    Our research: programmabilityin storage 7 RADOS LIBRADOS RADOSGW RBD CEPHFSMatrix Log
  • 8.
    The agenda ● Whyshould I care about logs? ● How to build a really, really fast log ● Mapping techniques for fast logs onto Ceph ● A few usage examples 8
  • 9.
    Why you shouldcare about logs 9
  • 10.
    A log isan ordered list of data elements ● General abstract form of a log ● Ordered list of elements 10 0 1 2 3 4 5 6 7 8 9 10 11 12
  • 11.
    A log isan ordered list of data elements ● General abstract form of a log ● Ordered list of elements ● New entries are appended to the log 11 0 1 2 3 4 5 6 7 8 9 10 11 12 clients append
  • 12.
    A log isan ordered list of data elements ● General abstract form of a log ● Ordered list of elements ● New entries are appended to the log ● Log entries can be read directly 12 0 1 2 3 4 5 6 7 8 9 10 11 12 clients append
  • 13.
    Everyone is goingbananas for logs 13 ● Very high throughput logs ● No global ordering ● Strongly consistent, global ordering ● Write batching ● Single writer Apache BookKeeper
  • 14.
    Strongly consistent shared-log 14 01 2 3 4 5 6 7 8 9 10 11 12 Database management systems Replicated state machines Cloud-based metadata services
  • 15.
    Strongly consistent, high-performanceshared-log 15 0 1 2 3 4 5 6 7 8 9 10 11 12 Database management systems Replicated state machines Cloud-based metadata services
  • 16.
    Shared-logs are challengingto scale Balakrishnan et al., “Tango: Distributed Data Structures over a Shared Log”, SOSP ‘13 16
  • 17.
    Shared-logs are challengingto scale Balakrishnan et al., “Tango: Distributed Data Structures over a Shared Log”, SOSP ‘13 17
  • 18.
    Shared-logs are challengingto scale clients 0 1 2 Paxos append 18 0 1 2 3 4 5 6 7 8 9 10 11 12
  • 19.
    Shared-logs are challengingto scale clients 0 1 2 Paxos append Writes are funneled through a total ordering engine 19 0 1 2 3 4 5 6 7 8 9 10 11 12
  • 20.
    Shared-logs are challengingto scale clients 0 1 2 Paxos append Writes are funneled through a total ordering engine 20 0 1 2 3 4 5 6 7 8 9 10 11 12 Parallel reads supported when log is distributed Random read workloads perform poorly on HDDs
  • 21.
    CORFU: A SharedLog Design for Flash Clusters 21 Balakrishnan, et. al, NSDI 2011
  • 22.
    CORFU decouples I/Ofrom ordering 22 clients 1, 2, 3, 4... 0 1 2 3 4 5 6 7 8 9 10 11 12 RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer reads
  • 23.
    CORFU decouples I/Ofrom ordering 23 clients 1, 2, 3, 4... 0 1 2 3 4 5 6 7 8 9 10 11 12 RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer 100K-1M pos/sec reads
  • 24.
    CORFU decouples I/Ofrom ordering 24 clients 1, 2, 3, 4... 0 1 2 3 4 5 6 7 8 9 10 11 12 RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer 100K-1M pos/sec parallel appendreads
  • 25.
    CORFU stripes logacross a cluster of flash devices 25 clients 1, 2, 3, 4... 0 1 2 3 4 5 6 7 8 9 10 11 12 RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer 100K-1M pos/sec parallel appendreads
  • 26.
    You might bethinking... clients 1, 2, 3, 4, 5, …. RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer read append Decouple append I/O from assignment of ordering● It can’t possibly be this simple.... 26 100K-1M pos/sec
  • 27.
    You might bethinking... clients 1, 2, 3, 4, 5, …. RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer read append Decouple append I/O from assignment of ordering● It can’t possibly be this simple.... you’d be correct 27 100K-1M pos/sec
  • 28.
    You might bethinking... clients 1, 2, 3, 4, 5, …. RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer read append Decouple append I/O from assignment of ordering● It can’t possibly be this simple.... you’d be correct ● Isn’t this supposed to be a Ceph talk? 28 100K-1M pos/sec
  • 29.
    The CORFU I/Opath LogInterface(e.g.Append) 29 Cluster of flash devices Append( ) ??
  • 30.
    CORFU uses acluster map and striping function LogInterface(e.g.Append) 30 Cluster of flash devices Append( ) Striping and Addressing
  • 31.
    CORFU replicates logentries across the cluster LogInterface(e.g.Append) 31 Cluster of flash devices Append( ) Striping and Addressing Fault tolerance Replication
  • 32.
    The CORFU protocolrelies on custom I/O interface LogInterface(e.g.Append) 32 Cluster of flash devices Append( ) Striping and Addressing Fault tolerance Custom hardware interface Replication
  • 33.
    A dedicated CORFUcluster is expensive... ● Duplicated services ○ Fault-tolerance ○ Metadata management ● Dedicated / over-provisioned hardware ○ You need a full cluster of flash devices! ● Custom storage interfaces ○ Hardware or software ● Already have a Ceph cluster? ○ Too bad ● Can we use software-defined storage? 33
  • 34.
    Mapping the componentsof CORFU onto Ceph LogInterface(e.g.Append) 34 Cluster of flash devices Append( ) Striping and Addressing Fault tolerance Custom hardware interface Replication
  • 35.
    A cluster ofOSDs rather than raw storage devices LogInterface(e.g.Append) 35 Cluster of OSDs Append( ) Striping and Addressing Fault tolerance Custom hardware interface OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD
  • 36.
    Ceph handles fault-tolerancetransparently LogInterface(e.g.Append) 36 Cluster of OSDs Append( ) Striping and Addressing Fault tolerance Custom hardware interface OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD Replication
  • 37.
    Log distribution becomesobject naming issue LogInterface(e.g.Append) 37 Append( ) Striping and Addressing Object naming Custom hardware interface OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD Replication Cluster of OSDs
  • 38.
    Storage interface builtusing RADOS object classes LogInterface(e.g.Append) 38 Append( ) Striping and Addressing Object naming RADOS object class OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD OSD Replication Cluster of OSDs
  • 39.
    ZLog is animplementation of CORFU on Ceph osd osd osd 39 clients 1, 2, 3, 4... 0 1 2 3 4 5 6 7 8 9 10 11 12 RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer 100K-1M pos/sec parallel appendreads BlueStore RocksDBLMDB
  • 40.
    ZLog is animplementation of CORFU on Ceph osd osd osd 40 clients 1, 2, 3, 4... 0 1 2 3 4 5 6 7 8 9 10 11 12 RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer 100K-1M pos/sec parallel appendreads BlueStore RocksDBLMDB Transparent changes to hardware, software, tuning
  • 41.
    ZLog is animplementation of CORFU on Ceph osd osd osd 41 clients 1, 2, 3, 4... 0 1 2 3 4 5 6 7 8 9 10 11 12 RPC: Tail, Next std::atomic<u64> seq; tail = seq; next = seq++; Sequencer 100K-1M pos/sec parallel appendreads BlueStore RocksDBLMDB Transparent changes to hardware, software, tuning Integration with features like tiering, erasure coding, and I/O hinting
  • 42.
    The ZLog projectis open-source on Github ● https://github.com/cruzdb/zlog ● Development backend (LMDB) ● Tools to build Ceph plugin ● High and low-level APIs 42 // build a backend instance auto backend = std::unique_ptr<zlog::Backend>( new zlog::storage::ceph::CephBackend(ioctx)); // open the log zlog::Log log; int ret = zlog::Log::CreateWithBackend(std::move(backend), "mylog", &log); // append to the log uint64_t pos; log.Append(Slice(“foo”), &pos);
  • 43.
    The ZLog projectis open-source on Github ● https://github.com/cruzdb/zlog ● Development backend (LMDB) ● Tools to build Ceph plugin ● High and low-level APIs 43 // build a backend instance auto backend = std::unique_ptr<zlog::Backend>( new zlog::storage::ceph::CephBackend(ioctx)); // open the log zlog::Log log; int ret = zlog::Log::CreateWithBackend(std::move(backend), "mylog", &log); // append to the log uint64_t pos; log.Append(Slice(“foo”), &pos);
  • 44.
    The ZLog projectis open-source on Github ● https://github.com/cruzdb/zlog ● Development backend (LMDB) ● Tools to build Ceph plugin ● High and low-level APIs 44 // build a backend instance auto backend = std::unique_ptr<zlog::Backend>( new zlog::storage::ceph::CephBackend(ioctx)); // open the log zlog::Log log; int ret = zlog::Log::CreateWithBackend(std::move(backend), "mylog", &log); // append to the log uint64_t pos; log.Append(Slice(“foo”), &pos);
  • 45.
    The ZLog projectis open-source on Github ● https://github.com/cruzdb/zlog ● Development backend (LMDB) ● Tools to build Ceph plugin ● High and low-level APIs 45 // build a backend instance auto backend = std::unique_ptr<zlog::Backend>( new zlog::storage::ceph::CephBackend(ioctx)); // open the log zlog::Log log; int ret = zlog::Log::CreateWithBackend(std::move(backend), "mylog", &log); // append to the log uint64_t pos; log.Append(Slice(“foo”), &pos);
  • 46.
    Building applications onRADOS using object classes 46
  • 47.
    Design decisions forthe ZLog I/O path 47 RADOS object classes CORFU storage interface OSD librados ZLog client library Striping policy
  • 48.
    Design decisions forthe ZLog I/O path 48 RADOS object classes CORFU storage interface OSD librados ZLog client library Striping policy ● Log entry → Object
  • 49.
    Design decisions forthe ZLog I/O path 49 RADOS object classes CORFU storage interface OSD librados ZLog client library Striping policy ● Log entry → Object ● Stripe log across objects
  • 50.
    Design decisions forthe ZLog I/O path 50 RADOS object classes CORFU storage interface OSD librados ZLog client library Striping policy ● Log entry → Object ● Stripe log across objects ● … but not too many objects
  • 51.
    Design decisions forthe ZLog I/O path 51 RADOS object classes CORFU storage interface OSD librados ZLog client library Striping policy ● Log entry → Object ● Stripe log across objects ● … but not too many objects ● Changes to striping policy
  • 52.
    Design decisions forthe ZLog I/O path 52 RADOS object classes CORFU storage interface OSD librados ZLog client library Striping policy ● Log entry → Object ● Stripe log across objects ● … but not too many objects ● Changes to striping policy ● Coordinated metadata update
  • 53.
    Design decisions forthe ZLog I/O path 53 RADOS object classes CORFU storage interface OSD librados ZLog client library Striping policy ● Log entry → Object ● Stripe log across objects ● … but not too many objects ● Changes to striping policy ● Coordinated metadata update ● Treat an object like a consensus API
  • 54.
    Design decisions for the ZLog I/O path
    [Diagram labels: RADOS object classes, CORFU storage interface, OSD, librados, ZLog client library, Striping policy]
    ● Log entry → Object
    ● Stripe log across objects
    ● … but not too many objects
    ● Changes to striping policy
    ● Coordinated metadata update
    ● Treat an object like a consensus API
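    The striping bullets above can be illustrated with a minimal sketch. It assumes a round-robin policy over a fixed stripe width and an object-naming scheme of "<log name>.<index>"; both are illustrative assumptions, and the slides do not state how ZLog actually names or selects objects.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical striping policy: log positions are dealt round-robin
// across a fixed number of objects (the stripe width).
struct StripePolicy {
  std::string log_name;  // used as the object-name prefix (assumed scheme)
  uint32_t width;        // number of objects in the stripe; must be > 0

  // Name of the object that stores the given log position.
  std::string object_for(uint64_t position) const {
    return log_name + "." + std::to_string(position % width);
  }
};
```

    With a width of 4, positions 0 and 4 land in the same object while positions 0 through 3 spread over four objects. Changing the width changes where every position lives, which is why the slide calls for a coordinated metadata update when the policy changes.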
  • 61.
    The CORFU storage interface
    ● write(obj, position, data)
      ○ Write-once: position must be open
      ○ Request must be tagged with up-to-date epoch
      ○ Position must not fall within a GC'd range
      ○ Optionally update maximum position observed
      ○ Atomic update
    ● read(obj, position)
    ● invalidate(obj, position)
    ● trim(obj, position)
      ○ Mark for GC
    ● seal(obj, epoch)
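    The interface above can be modelled in memory to make the write-once and epoch rules concrete. This is a sketch under stated assumptions: the class name, the status codes, and the per-position trim bookkeeping are hypothetical, and a single-threaded object stands in for the atomicity that a RADOS object class method would provide.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Status codes are illustrative, not ZLog's actual return values.
enum class Status { Ok, StaleEpoch, NotWritten, PositionTaken, Invalidated, Trimmed };

class StorageObject {
 public:
  Status write(uint64_t epoch, uint64_t pos, const std::string& data) {
    if (epoch < epoch_) return Status::StaleEpoch;  // request tagged with an old epoch
    auto it = entries_.find(pos);
    if (it != entries_.end())                        // write-once: position must be open
      return it->second.state == State::Trimmed ? Status::Trimmed : Status::PositionTaken;
    entries_[pos] = Entry{State::Written, data};
    if (!has_max_pos_ || pos > max_pos_) {           // track the maximum position observed
      has_max_pos_ = true;
      max_pos_ = pos;
    }
    return Status::Ok;
  }

  Status read(uint64_t epoch, uint64_t pos, std::string* out) const {
    if (epoch < epoch_) return Status::StaleEpoch;
    auto it = entries_.find(pos);
    if (it == entries_.end()) return Status::NotWritten;
    if (it->second.state == State::Trimmed) return Status::Trimmed;
    if (it->second.state == State::Invalidated) return Status::Invalidated;
    *out = it->second.data;
    return Status::Ok;
  }

  // Fill an open position so that it can never be written.
  Status invalidate(uint64_t epoch, uint64_t pos) {
    if (epoch < epoch_) return Status::StaleEpoch;
    auto it = entries_.find(pos);
    if (it == entries_.end()) {
      entries_[pos] = Entry{State::Invalidated, ""};
      return Status::Ok;
    }
    return it->second.state == State::Written ? Status::PositionTaken : Status::Ok;
  }

  // Mark a position for garbage collection and drop its payload.
  Status trim(uint64_t epoch, uint64_t pos) {
    if (epoch < epoch_) return Status::StaleEpoch;
    entries_[pos] = Entry{State::Trimmed, ""};
    return Status::Ok;
  }

  // Raise the stored epoch so that older clients are rejected.
  void seal(uint64_t epoch) {
    if (epoch > epoch_) epoch_ = epoch;
  }

  // Returns false if nothing has been written yet.
  bool max_pos(uint64_t* out) const {
    if (has_max_pos_) *out = max_pos_;
    return has_max_pos_;
  }

 private:
  enum class State { Written, Invalidated, Trimmed };
  struct Entry {
    State state;
    std::string data;
  };
  std::map<uint64_t, Entry> entries_;
  uint64_t epoch_ = 0;
  bool has_max_pos_ = false;
  uint64_t max_pos_ = 0;
};
```

    Every check and the update happen inside one call, which mirrors the "atomic update" requirement on the slide: no other request can observe the object between the epoch check and the write.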
  • 62.
    librados won't [currently] cut it for this job
    ● write(obj, position, data)
      ○ Write-once: position must be open
      ○ Request must be tagged with up-to-date epoch
      ○ Position must not fall within a GC'd range
      ○ Optionally update maximum position observed
      ○ Atomic update
    ● read(obj, position)
    ● invalidate(obj, position)
    ● trim(obj, position)
      ○ Mark for GC
    ● seal(obj, epoch)
  • 66.
    CORFU write interface as an object class
    [Diagram labels: RADOS object classes, CORFU storage interface, OSD, librados, ZLog client library, Striping policy]
    [Slide annotation: connection to client]

    // Condensed from the slide; decoding of the request (op), header, key,
    // and entry is not shown there.
    int write(context_t hctx, bufferlist *in, bufferlist *out) {
      // ensure up-to-date client
      if (epoch_guard(header, op.epoch(), false)) {
        return -ESPIPE;
      }

      // read entry based on request position
      ceph::bufferlist bl;
      int ret = cls_cxx_map_get_val(hctx, key, &bl);
      if (ret < 0 && ret != -ENOENT)
        return ret;
      if (ret == 0 && entry && !decode(bl, entry))
        return -EIO;

      // write entry if position is not taken
      if (ret == -ENOENT) {
        entry_write_entry(hctx, key, entry);
        if (!header.has_max_pos() || (op.pos() > header.max_pos()))
          header.set_max_pos(op.pos());
        ret = entry_write_header(hctx, header);
      } else {
        // handle errors associated with write-once semantics
      }
      return ret;
    }
  • 71.
    Getting started with object classes
    1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
       a. Super well documented, easy examples
    2. Build and load the plugin into Ceph
    3. Old way to manage plugins
       a. Manage your own Ceph fork
    4. New way to manage plugins with SDK (credit: Neha Ojha @ b7215b0)
       a. dnf install rados-objclass-devel
       b. #include <rados/objclass.h>
       c. compile cls_hello.cc as object library
    5. Distribute your plugin to OSDs at <libdir>/rados-classes/cls_hello.so
    6. Start making requests: ioctx::exec("hello", …)
  • 72.
    The CORFU seal(obj, epoch) interface is tougher
    ● Semantics: do the following atomically
      ○ Verify and update the stored epoch
      ○ Return the largest position written
    ● RADOS object classes don't support this mix of ops
      ○ Currently
    ● Solution
      ○ Split the operation
      ○ Verify
    ● Luckily this isn't a fast path
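    One way to read the "split the operation, then verify" solution is sketched below. The slide does not spell out what is verified, so this is an assumption: the first call only installs the new epoch, and the second call returns the largest position only if the stored epoch is still the one the caller installed. The struct and method names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// In-memory stand-in for the per-object header (assumed layout).
struct SealableObject {
  uint64_t epoch = 0;
  bool has_max_pos = false;
  uint64_t max_pos = 0;

  // Step 1 (update only): install a strictly newer epoch.
  bool seal(uint64_t new_epoch) {
    if (new_epoch <= epoch) return false;  // stale or repeated seal
    epoch = new_epoch;
    return true;
  }

  // Step 2 (read only): report the largest written position, but only if
  // no other client has sealed the object since the caller's step 1.
  // Returns false on an epoch mismatch; *found says whether any position
  // has been written.
  bool max_position(uint64_t sealed_epoch, bool* found, uint64_t* pos) const {
    if (sealed_epoch != epoch) return false;
    *found = has_max_pos;
    if (has_max_pos) *pos = max_pos;
    return true;
  }
};
```

    A client that loses the race in step 2 would retry with a newer epoch. The extra round trip is tolerable because, as the slide notes, sealing is not on the fast path.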
  • 73.
    How we have been using ZLog
  • 74.
    PostgreSQL logical replication with ZLog
    [Figure: ZLog log (positions 0 to 12) stored across OSDs; clients (C) issue INSERT, UPDATE, DELETE and queries; the log is replayed]
    https://github.com/cruzdb/pg_zlog
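    The replay idea in the figure can be sketched as a tiny replicated state machine. This is an illustration only: a std::vector stands in for the shared log, a std::map stands in for a table, and the Op and Replica types are hypothetical rather than anything from pg_zlog.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class OpKind { Insert, Update, Delete };

struct Op {
  OpKind kind;
  std::string key;
  std::string value;  // unused for Delete
};

struct Replica {
  std::map<std::string, std::string> rows;
  std::size_t applied = 0;  // next log position to replay

  // Apply every log entry this replica has not seen yet, in log order.
  void replay(const std::vector<Op>& log) {
    for (; applied < log.size(); ++applied) {
      const Op& op = log[applied];
      switch (op.kind) {
        case OpKind::Insert:
        case OpKind::Update:
          rows[op.key] = op.value;
          break;
        case OpKind::Delete:
          rows.erase(op.key);
          break;
      }
    }
  }
};
```

    Because every replica applies the same entries in the same order, replicas that have replayed up to the same position hold the same state, regardless of how often each one caught up along the way.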
  • 78.
    CruzDB is an immutable key-value store on ZLog
    [Figure: ZLog log (positions 0 to 12) stored across OSDs; clients (C) issue GET, PUT, DELETE as key/value transactions; the log is replayed]
    ● Database is a copy-on-write red-black tree
    ● Efficient read-only snapshots
    ● Time travel capabilities
    https://github.com/cruzdb/cruzdb
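    The link between a copy-on-write tree and cheap snapshots can be shown with a small sketch. To stay short it uses an unbalanced binary search tree rather than a red-black tree, and it lives in memory rather than in a log, so it illustrates path copying only; none of the names come from CruzDB.

```cpp
#include <cassert>
#include <memory>
#include <string>

struct Node;
using NodePtr = std::shared_ptr<const Node>;

// Nodes are never modified after creation.
struct Node {
  std::string key;
  std::string value;
  NodePtr left;
  NodePtr right;
};

// Returns the root of a new version. Only the nodes on the path to `key`
// are copied; everything else is shared with the old version.
NodePtr put(const NodePtr& root, const std::string& key, const std::string& value) {
  if (!root)
    return std::make_shared<Node>(Node{key, value, nullptr, nullptr});
  if (key < root->key)
    return std::make_shared<Node>(
        Node{root->key, root->value, put(root->left, key, value), root->right});
  if (root->key < key)
    return std::make_shared<Node>(
        Node{root->key, root->value, root->left, put(root->right, key, value)});
  return std::make_shared<Node>(Node{key, value, root->left, root->right});
}

// Looks up `key` in the version identified by `root`; nullptr if absent.
const std::string* get(const NodePtr& root, const std::string& key) {
  const Node* n = root.get();
  while (n) {
    if (key < n->key) {
      n = n->left.get();
    } else if (n->key < key) {
      n = n->right.get();
    } else {
      return &n->value;
    }
  }
  return nullptr;
}
```

    Each root pointer is a read-only snapshot: holding on to an old root keeps that version readable after later updates, which is the in-memory analogue of the snapshot and time-travel bullets above.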
  • 79.
    Wrapping up
    ● We use Ceph as a prototyping system
      ○ Application-specific interfaces
    ● There is a broad range of interests in log-oriented interfaces
    ● We've built a Ceph-based implementation of a high-performance log
      ○ ZLog @ https://github.com/cruzdb/zlog
    ● Using it for several real-world use cases
      ○ PostgreSQL logical replication (https://github.com/cruzdb/pg_zlog)
      ○ CruzDB immutable database (https://github.com/cruzdb/cruzdb)
  • 80.
    Thank you and questions
    ● Noah Watkins (@noahdesu)
    ● Learning resources
      ○ Object class SDK documentation
        ■ http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
      ○ ZLog is the only user of the SDK I'm aware of
        ■ https://github.com/cruzdb/zlog/blob/master/src/storage/ceph/cls_zlog.cc
      ○ The Ceph tree contains a ton of object classes for reference
        ■ https://github.com/ceph/ceph/tree/master/src/cls
      ○ Writing object classes using Lua
        ■ https://ceph.com/geen-categorie/dynamic-object-interfaces-with-lua/
        ■ https://nwat.xyz/blog/2018/03/13/video-of-my-talk-at-lua-workshop-2017/