Experiences building a distributed shared log on RADOS - Noah Watkins

Experiences building a distributed
shared-log on RADOS
Noah Watkins
UC Santa Cruz
@noahdesu

2
● Graduate student
● UC Santa Cruz
● Data management,
file systems, HPC,
QoS
● Ceph as a
prototyping
platform
About me
● UC Santa Cruz

3
● UC Santa Cruz
file systems, HPC,
QoS
● Ceph as a
prototyping
platform
About me
● UC Santa Cruz
● Data management
● Storage systems
● High-performance computing
● Quality of service

4
● UC Santa Cruz
file systems, HPC,
QoS
● Ceph as a
prototyping
platform
About me
● UC Santa Cruz
● Data management
● Storage systems
● High-performance computing
● Quality of service
● Ceph shop for prototyping

Ceph storage interfaces, a familiar sight...
5
RADOS
LIBRADOS RADOSGW RBD CEPHFS

Our research: programmability in storage
6
RADOS
LIBRADOS RADOSGW RBD CEPHFSMatrix Log

Our research: programmability in storage
7
RADOS
LIBRADOS RADOSGW RBD CEPHFSMatrix Log

The agenda
● Why should I care about logs?
● How to build a really, really fast log
● Mapping techniques for fast logs onto Ceph
● A few usage examples
8

Why you should care about logs
9

A log is an ordered list of data elements
● General abstract form of a log
● Ordered list of elements
10
0 1 2 3 4 5 6 7 8 9 10 11 12

● New entries are appended to the log
11
0 1 2 3 4 5 6 7 8 9 10 11 12
clients
append

● New entries are appended to the log
● Log entries can be read directly
12
0 1 2 3 4 5 6 7 8 9 10 11 12
clients
append

Everyone is going bananas for logs
13
● Very high throughput logs
● No global ordering
● Strongly consistent, global ordering
● Write batching
● Single writer
Apache
BookKeeper

Strongly consistent shared-log
14
0 1 2 3 4 5 6 7 8 9 10 11 12
Database management systems Replicated state machines Cloud-based metadata services

Strongly consistent, high-performance shared-log
15
0 1 2 3 4 5 6 7 8 9 10 11 12
Database management systems Replicated state machines Cloud-based metadata services

Shared-logs are challenging to scale
Balakrishnan et al., “Tango: Distributed Data Structures over a Shared Log”, SOSP ‘13
16

Balakrishnan et al., “Tango: Distributed Data Structures over a Shared Log”, SOSP ‘13
17

clients
0 1
2
Paxos
append
18
0 1 2 3 4 5 6 7 8 9 10 11 12

clients
0 1
2
Paxos
append
Writes are
funneled through
a total ordering
engine
19
0 1 2 3 4 5 6 7 8 9 10 11 12

clients
0 1
2
Paxos
append
Writes are
funneled through
a total ordering
engine
20
0 1 2 3 4 5 6 7 8 9 10 11 12
Parallel reads
supported when
log is distributed
Random read
workloads perform
poorly on HDDs

CORFU: A Shared Log Design for Flash Clusters
21
Balakrishnan, et. al, NSDI 2011

CORFU decouples I/O from ordering
22
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
std::atomic<u64> seq;
tail = seq;
next = seq++;
Sequencer
reads

23
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
reads

24
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
parallel appendreads

CORFU stripes log across a cluster of flash devices
25
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec

You might be thinking...
clients
1, 2, 3, 4, 5, ….
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
read append Decouple append I/O from
assignment of ordering● It can’t possibly be this simple....
26
100K-1M pos/sec

clients
1, 2, 3, 4, 5, ….
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
assignment of ordering● It can’t possibly be this simple.... you’d be correct
27
100K-1M pos/sec

clients
1, 2, 3, 4, 5, ….
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
assignment of ordering● It can’t possibly be this simple.... you’d be correct
● Isn’t this supposed to be a Ceph talk?
28
100K-1M pos/sec

The CORFU I/O path
LogInterface(e.g.Append)
29
Cluster of flash devices
Append( ) ??

CORFU uses a cluster map and striping function
30
Append( )
Striping and
Addressing

CORFU replicates log entries across the cluster
31
Append( )
Striping and
Addressing
Fault tolerance
Replication

The CORFU protocol relies on custom I/O interface
32
Append( )
Striping and
Addressing
Fault tolerance
Custom
hardware
interface
Replication

A dedicated CORFU cluster is expensive...
● Duplicated services
○ Fault-tolerance
○ Metadata management
● Dedicated / over-provisioned hardware
○ You need a full cluster of flash devices!
● Custom storage interfaces
○ Hardware or software
● Already have a Ceph cluster?
○ Too bad
● Can we use software-defined storage?
33

Mapping the components of CORFU onto Ceph
34
Append( )
Striping and
Addressing
Fault tolerance
Custom
hardware
interface
Replication

A cluster of OSDs rather than raw storage devices
35
Cluster of OSDs
Append( )
Striping and
Addressing
Fault tolerance
Custom
hardware
interface
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD

Ceph handles fault-tolerance transparently
36
Cluster of OSDs
Append( )
Striping and
Addressing
Fault tolerance
Custom
hardware
interface
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
Replication

Log distribution becomes object naming issue
37
Append( )
Striping and
Addressing
Object naming
Custom
hardware
interface
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
Replication
Cluster of OSDs

Storage interface built using RADOS object classes
38
Append( )
Striping and
Addressing
Object naming
RADOS
object
class
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
OSD OSD OSD
Replication
Cluster of OSDs

ZLog is an implementation of CORFU on Ceph
osd osd osd
39
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
BlueStore RocksDBLMDB

osd osd osd
40
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
Transparent
changes to
hardware,
software, tuning

osd osd osd
41
clients
1, 2, 3, 4...
0 1 2 3 4 5 6 7 8 9 10 11 12
RPC: Tail, Next
tail = seq;
next = seq++;
Sequencer
100K-1M pos/sec
Transparent
changes to
hardware,
software, tuning
Integration with
features like
tiering, erasure
coding, and I/O
hinting

The ZLog project is open-source on Github
● https://github.com/cruzdb/zlog
● Development backend (LMDB)
● Tools to build Ceph plugin
● High and low-level APIs
42
// build a backend instance
auto backend = std::unique_ptr<zlog::Backend>(
new zlog::storage::ceph::CephBackend(ioctx));
// open the log
zlog::Log log;
int ret = zlog::Log::CreateWithBackend(std::move(backend),
"mylog", &log);
// append to the log
uint64_t pos;
log.Append(Slice(“foo”), &pos);

43
// open the log
zlog::Log log;
"mylog", &log);
uint64_t pos;

44
// open the log
zlog::Log log;
"mylog", &log);
uint64_t pos;

45
// open the log
zlog::Log log;
"mylog", &log);
uint64_t pos;

Building applications on RADOS using
object classes
46

Design decisions for the ZLog I/O path
47
RADOS object classes
CORFU storage interface
OSD
librados
ZLog client library
Striping policy

48
OSD
librados
ZLog client library
Striping policy
● Log entry → Object

49
OSD
librados
ZLog client library
Striping policy
● Stripe log across objects

50
OSD
librados
ZLog client library
Striping policy
● … but not too many objects

51
OSD
librados
ZLog client library
Striping policy
● Changes to striping policy

52
OSD
librados
ZLog client library
Striping policy
● Coordinated metadata update

53
OSD
librados
ZLog client library
Striping policy
● Treat an object like a
consensus API

54
OSD
librados
ZLog client library
Striping policy
● Treat an object like a
consensus API

The CORFU storage interface
● write(obj, position, data)
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
55

○ Mark for GC
56

○ Write-once: position must be open
○ Mark for GC
57

○ Request must be tagged with up-to-date epoch
○ Mark for GC
58

○ Position must not fall within a GC’d range
○ Mark for GC
59

○ Optionally update maximum position observed
○ Mark for GC
60

○ Atomic update
○ Mark for GC
61

librados won’t [currently] cut it for this job
○ Atomic update
○ Mark for GC
62

CORFU write interface as an object class
63
OSD
librados
ZLog client library
Striping policy
int write(context_t hctx, bufferlist *in, bufferlist *out) {
// ensure up-to-date client
if (epoch_guard(header, op.epoch(), false)) {
return -ESPIPE;
}
// read entry based on request position
ceph::bufferlist bl;
int ret = cls_cxx_map_get_val(hctx, key, &bl);
if (ret < 0)
return ret;
if (entry && !decode(bl, entry))
return -EIO;
return 0;
// write entry if position is not taken
if (ret == -ENOENT) {
entry_write_entry(hctx, key, entry);
if (!header.has_max_pos() || (op.pos() > header.max_pos()))
header.set_max_pos(op.pos());
ret = entry_write_header(hctx, header);
} else {
// handle errors associated with write-once semantics
}
}

64
OSD
librados
ZLog client library
Striping policy
return -ESPIPE;
}
if (ret < 0)
return ret;
return -EIO;
return 0;
} else {
}
}
connection to client

65
OSD
librados
ZLog client library
Striping policy
return -ESPIPE;
}
if (ret < 0)
return ret;
return -EIO;
return 0;
} else {
}
}

66
OSD
librados
ZLog client library
Striping policy
return -ESPIPE;
}
if (ret < 0)
return ret;
return -EIO;
return 0;
} else {
}
}

Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
a. Super well documented, easy examples
67

2. Build and load the plugin into Ceph
68

3. Old way to manage plugins
a. Manage your own Ceph fork
69

4. New way to manage plugins with SDK (credit: Neha Ojha @ b7215b0)
a. dnf install rados-objclass-devel
b. #include <rados/objclass.h>
c. compile cls_hello.cc as object library
70

4. New way to manage plugins with SDK (credit: Neha Ojha @ b7215b0)
a. dnf install rados-objclass-devel
b. #include <rados/objclass.h>
c. compile cls_hello.cc as object library
5. Distribute your plugin to OSDs at <libdir>/rados-classes/cls_hello.so
6. Starting making requests ioctx::exec(“hello”, …)
71

The CORFU seal(obj, epoch) interface is tougher
● Semantics are do the following atomically
○ Verify and update the stored epoch
○ Return the largest position written
● RADOS object classes don’t support this mix of ops
○ Currently
● Solution
○ Split the operation
○ Verify
● Luckily this isn’t a fast path
72

How we have been using ZLog
73

PostgreSQL logical replication with ZLog
74
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
INSERT UPDATE DELETEReplay
Replay
Replay
Replay
Replay
C C C C
Queries
ZLog
https://github.com/cruzdb/pg_zlog

CruzDB is an immutable key-value store on ZLog
75
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
GET PUT DELETEReplay
Replay
Replay
Replay
Replay
C C C C
Key/Value Transactions
ZLog
https://github.com/cruzdb/cruzdb

76
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
Replay
Replay
Replay
Replay
C C C C
ZLog
Database is a
copy-on-write
red-black tree

77
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
Replay
Replay
Replay
Replay
C C C C
ZLog
Database is a
copy-on-write
red-black tree
Efficient read-only
snapshots

78
0 1 2 3 4 5 6 7 8 9 10 11 12
osd osd osd
Replay
Replay
Replay
Replay
C C C C
ZLog
Database is a
copy-on-write
red-black tree
Efficient read-only
snapshotsTime travel
capabilities

Wrapping up
● We use Ceph as a prototyping system
○ Application-specific interfaces
● There is a broad range of interests in log-oriented interfaces
● We’ve built a Ceph-based implementation of a high-performance log
○ ZLog @ https://github.com/cruzdb/zlog
● Using it for several real-world use cases
○ PostgreSQL logical replication (https://github.com/cruzdb/pg_zlog)
○ CruzDB immutable database (https://github.com/cruzdb/cruzdb)
79

Thank you and questions
● Noah Watkins (@noahdesu)
● Learning resources
○ Object class SDK documentation
■ http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
○ ZLog is the only user of the SDK I’m aware of
■ https://github.com/cruzdb/zlog/blob/master/src/storage/ceph/cls_zlog.cc
○ The Ceph tree contains a ton of object classes for reference
■ https://github.com/ceph/ceph/tree/master/src/cls
○ Writing object classes using Lua
■ https://ceph.com/geen-categorie/dynamic-object-interfaces-with-lua/
■ https://nwat.xyz/blog/2018/03/13/video-of-my-talk-at-lua-workshop-2017/
80

Experiences building a distributed shared log on RADOS - Noah Watkins

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Experiences building a distributed shared log on RADOS - Noah Watkins

Similar to Experiences building a distributed shared log on RADOS - Noah Watkins (20)

Recently uploaded

Recently uploaded (20)

Experiences building a distributed shared log on RADOS - Noah Watkins