Castle is a storage system that uses a doubling array (DA) to transform random I/O into sequential I/O for distributed, shared-nothing databases handling big data workloads. The DA improves on B-Trees by allowing for faster small random inserts and range queries after inserts compared to traditional storage systems. Castle also uses snapshots and clones to address problems with supporting new big data workloads, such as reducing space blowup from copy-on-write operations. Based on benchmarks inserting 3 billion rows and performing subsequent small random queries, Castle provides over an order of magnitude better performance than standard Cassandra for these workloads.
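The doubling-array idea described above can be sketched in a few lines. This is a minimal illustrative model (not Castle's actual implementation): level i holds either zero or one sorted run of 2^i keys, and an insert merges equal-sized runs upward, so random writes become batched sequential merges.

```python
from bisect import bisect_left

# Toy doubling array: level i is None or a sorted run of 2^i items.
# Inserts carry a run upward, merging equal-sized runs, so all disk
# writes would be large sequential merges rather than random updates.
class DoublingArray:
    def __init__(self):
        self.levels = []  # levels[i] is None or a sorted list of 2^i keys

    def insert(self, key):
        carry = [key]
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # merge two sorted runs of size 2^i into one of size 2^(i+1)
            carry = sorted(self.levels[i] + carry)
            self.levels[i] = None
            i += 1

    def search(self, key):
        # binary-search each non-empty level: O(log^2 n) comparisons
        for run in self.levels:
            if run:
                j = bisect_left(run, key)
                if j < len(run) and run[j] == key:
                    return True
        return False
```

The `sorted()` call stands in for a linear two-way merge; the point is that each level is rewritten wholesale, which is sequential I/O.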
Cassandra & the Acunu Data Platform
Tom Wilkie discusses how the Acunu data platform provides significant performance improvements for Cassandra. Acunu uses a doubling array technique for inserts and range queries that is over 100x faster than standard Cassandra for small random operations. It bridges the performance gap between traditional and modern distributed databases through its shared memory interface and kernel optimizations.
Acunu is developing an enterprise Cassandra appliance called Castle that aims to simplify Cassandra deployment and management. Castle includes a storage engine optimized for large disks and workloads, and allows for high density on commodity hardware. It also features fast disk rebuilds through its shared memory architecture. Acunu provides a web UI called the Control Center to configure, monitor, and troubleshoot Castle without deep Cassandra expertise. Acunu performs extensive automated testing of Castle to ensure reliability.
This document discusses Cassandra performance improvements using Acunu technology. It describes how Acunu powers Cassandra with features like doubling arrays to improve insert and range query performance by over 100x and 3.5x respectively compared to standard Cassandra. It also discusses Acunu's monitoring, operations and open source aspects.
TUKE MediaEval 2012: Spoken Web Search using DTW and Unsupervised SVM (MediaEval2012)
This document describes a spoken web search system that uses dynamic time warping (DTW) and an unsupervised support vector machine (SVM). It consists of 3 sections:
1) System architecture - outlines the segmentation, feature extraction, SVM method, and searching algorithm components of the system.
2) Experimental results - provides results from testing the system but no details.
3) Conclusion - the concluding remarks for the system but no specifics are given.
CCNxCon2012: Session 5: Distributed Cooperative Caching Scheme in CCN (PARC, a Xerox company)
Distributed Cooperative Caching Scheme in CCN
Dariusz Bursztynowski, Mateusz Dzida, Tomasz Janaszka (Telekomunikacja Polska, Orange Labs, Poland), Adam Dubiel (Warsaw University of Technology, Poland), Michal Rowicki (Warsaw University of Technology and Telekomunikacja Polska, Poland)
The document discusses the topics that will be covered in a .NET summer training program, including introductions to .NET framework classes, data types, OOP concepts, inheritance, multithreading, exception handling, file I/O, ADO.NET, web forms, and HTML controls. The training will cover syntax, architecture, and implementations related to these .NET and web development technologies.
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.
This document provides an overview of MXF and AAF file formats. It discusses:
1. Why these formats were developed, which was to allow for content-centric workflows with metadata handling, random access to material, and open standardized compression-independent formats.
2. What the formats are, with MXF being a wrapper format for interchange of finished audiovisual material and metadata, and AAF being a more complex wrapper of metadata and essence for post-production interchange.
3. Some key concepts around the formats, including the source reference chain that allows tracking material origins and derivations, and operational patterns that control complexity.
This document provides an overview of the FESTO line monitoring system. It describes the system components including the web application, DPWS function block, rule engine, ActiveMQ, and how messages flow between devices, through the rule engine and ActiveMQ, and are visualized. The document also describes the types of S1000 operating messages that are transmitted and processed including status, event, and property messages about operators, workstations, and workpieces.
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...) (npinto)
The document discusses parallel computing using GPUs and CUDA. It introduces CUDA as a parallel programming model that allows writing parallel code in a C/C++-like language that can execute efficiently on NVIDIA GPUs. It describes key CUDA abstractions like a hierarchy of threads organized into blocks, different memory spaces, and synchronization methods. It provides an example of implementing parallel reduction and discusses strategies for mapping algorithms to GPU architectures. The overall message is that CUDA makes massively parallel computing accessible using a familiar programming approach.
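The tree-based parallel reduction mentioned above can be sketched sequentially. This is an illustrative model, not CUDA code: at each step element i adds element i + stride, and on a GPU each iteration of the inner loop would run as one thread per i, with a block-wide barrier (`__syncthreads()`) between strides.

```python
# Sequential model of the CUDA tree-reduction pattern: n values are
# summed in O(log n) strided steps. In a real kernel, each `i` in the
# inner loop is a separate thread, and the stride doubling is separated
# by a barrier so all partial sums are visible before the next step.
def tree_reduce(values):
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):  # one "thread" per i
            if i + stride < n:             # guard for non-power-of-two n
                data[i] += data[i + stride]
        stride *= 2                        # barrier point on a real GPU
    return data[0]
```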
The document discusses the differences between single-threaded and multithreaded programming. In single-threaded programming, each process has a single thread of control running in its address space. Multithreaded programming allows multiple threads of control to run concurrently within the same address space, similar to separate processes but sharing the same memory. This allows improved performance by overlapping I/O with computation and improving processor utilization with parallelism.
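The I/O-with-computation overlap described above is easy to demonstrate. In this sketch a worker thread simulates a blocking read (with `time.sleep` standing in for real I/O) while the main thread keeps computing, so wall time approaches the maximum of the two rather than their sum.

```python
import threading
import time

# One thread "blocks on I/O" while the main thread computes; both share
# the same address space, so the result dict needs no copying between them.
def overlapped(io_seconds, compute_items):
    results = {}

    def fake_io():
        time.sleep(io_seconds)   # stands in for a blocking read
        results["io"] = "data"

    t = threading.Thread(target=fake_io)
    t.start()                    # I/O proceeds in the background
    results["compute"] = sum(i * i for i in range(compute_items))
    t.join()                     # wait for the I/O to finish
    return results
```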
Cloumon is a management and monitoring tool for cloud computing platforms like ZooKeeper, Cassandra, and Hadoop. It collects metrics from these platforms and stores them in a database. It also provides notification and action management for alarms. Cloumon monitors nodes, clusters, and hosts across ZooKeeper, Cassandra, and Hadoop installations.
This document provides a summary of key concepts in Windows Communication Foundation (WCF) including configuration, contracts, bindings, behaviors, and more. It explains that WCF provides an abstraction layer over transports that allows developers to focus on message types rather than transport details. Contracts define message structure, bindings describe how messages are sent, and addresses specify where messages are sent. The document provides overviews of common WCF configuration sections and the purpose of various contracts, bindings, and behaviors.
The document discusses network intrusion detection and anomaly detection from a research perspective. It describes using network processors to develop a device that can perform high-speed packet capturing, timestamping, and processing. The device is used to build a traffic measurements system that can analyze traffic at wire speed and online to accurately characterize network traffic.
This document summarizes a presentation on the UNICORE Server Components. It discusses how UNICORE provides a web services framework for job submission and management across different computing resources. Key points include:
- UNICORE uses a gateway, service containers, and atomic services to expose target systems through standardized web service interfaces.
- Atomic services include job management, storage management, and file transfer services that provide abstract access to computing jobs and files on remote systems.
- Security is handled through XUUDB authentication, XACML authorization policies, and message signing. Configurable security handlers provide flexibility.
Building Applications Using NoSQL Architectures on top of SQL Azure: How MSN ... (DATAVERSITY)
Building highly-available and highly-scalable applications is one of the main reasons for using NoSQL database systems and processing frameworks over traditional relational database systems. Relational database systems have taken notice and are increasingly moving to provide solutions for this class of applications.
In this presentation we will showcase how the Windows Gaming Experience is using SQL Server Azure to build a highly-available and highly-scalable application that is used to create new experiences for millions of casual gamers in the next version of the Bing search engine and to integrate Microsoft games with social-networking sites. They employ several of the NoSQL architectural patterns, such as sharding. We will present the architecture and lessons learned, and provide insight into how the SQL Server Azure service is evolving to support NoSQL application development patterns such as sharding and open schema support, making SQL Server Azure a Not Only SQL database engine.
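The sharding pattern mentioned above can be sketched as a simple hash-based router: a stable hash of the key picks one of N shard databases, spreading data and load across independent instances. The shard names and key scheme below are illustrative, not the actual MSN/SQL Azure layout.

```python
import hashlib

# Hash-based shard routing: the same key always maps to the same shard,
# and keys spread roughly evenly across the shard list.
def shard_for(key, shards):
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

Real deployments usually add a lookup layer (a shard map) on top of this so shards can be split or moved without rehashing every key.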
by James Broberg - Presentation given at the 2nd International Workshop on Web APIs and Mashups (at ICSOC2008) on December 1st, 2008 in Sydney, Australia. http://www.icsoc-mashups.org/
The Oracle Server Architecture document outlines the core components that make up an Oracle database instance, including background processes, memory structures like the system global area (SGA) and program global area (PGA), online redo logs, control files, and more. It shows how client connections are handled by the database and how resources are shared between users. Key processes keep the database functioning and recoverable, while memory areas cache data and SQL for fast access.
The document discusses steps for deploying a successful virtual network, including designing the network, building and configuring hardware, and configuring the virtual machine manager. It covers providing isolation through techniques like VLANs and software defined networking. Topics include logical network addressing, host configuration options, and creating logical switches. Tenant configuration using network virtualization is described for isolation.
This document discusses Altera's FPGA strategy for reconfigurable hardware in industry applications. It defines reconfigurable hardware as an architecture that does not require on-the-fly timing analysis because product qualification is extensively done through temperature and cycle testing without hardware architecture changes. It then shows how programmable solutions have evolved from single CPU and DSP cores to multi-core processors and coarse-grained arrays with FPGAs moving to fine-grained, massively parallel arrays with embedded hard IP blocks. Future trends include challenges of scaling CPUs due to physical limits and the benefits of parallelism through hardware reconfiguration.
This document provides an overview of the ZFS file system. It discusses ZFS's design goals of simplifying storage and replacing outdated assumptions. It also covers key aspects of ZFS like its layered architecture, use of copy-on-write, lack of need for filesystem checking, virtual devices (vdevs) including mirroring and striping of storage, and dynamic block allocation.
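ZFS's copy-on-write behaviour, mentioned above, can be illustrated with a toy store: blocks are never overwritten in place; an update appends a new block and repoints the reference, which is also why snapshots are cheap. This is a conceptual model, not ZFS's on-disk format.

```python
# Toy copy-on-write store: writes append new blocks and move a pointer;
# old blocks stay intact, so a snapshot is just a copy of the pointers.
class CowStore:
    def __init__(self):
        self.blocks = []   # append-only block storage
        self.live = {}     # name -> index of the current block

    def write(self, name, data):
        self.blocks.append(data)            # old block is untouched
        self.live[name] = len(self.blocks) - 1

    def read(self, name):
        return self.blocks[self.live[name]]

    def snapshot(self):
        return dict(self.live)              # cheap: pointers only

    def read_snapshot(self, snap, name):
        return self.blocks[snap[name]]
```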
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)... (DataStax)
Cassandra is a distributed database with features including, but not limited to, Secondary Indexes, UDFs, and Materialized Views, and with not-so-strict hardware requirements.
It is important to use those features and select hardware correctly to make sure the use of Cassandra in your business can be as painless as possible.
I will address how these features are used in the wrong way, how hardware should be selected, and how to make Cassandra work in the best possible way.
Learning Objective #1:
Learn that Cassandra hardware requirements exist (and why), and the shortcomings of some of its features (Secondary Indexes, Compaction Strategies, etc.).
Learning Objective #2:
The most misused features and the most common hardware errors, and how they might seem harmless at first (on a small cluster or even a single node).
Learning Objective #3:
How to use Cassandra and its features correctly and achieve smooth operation.
About the Speaker
Carlos Rolo Cassandra Consultant, Pythian
Carlos Rolo is a Cassandra MVP, and has deep expertise with distributed architecture technologies. Carlos is driven by challenge, and enjoys the opportunity to discover new things. He has become known and trusted by customers and colleagues for his ability to understand complex problems, and to work well under pressure. When Carlos isn't working he can be found playing water polo or enjoying his local community.
This document summarizes a presentation about data grids versus databases. It discusses how data grids provide extremely fast in-memory access to distributed data and can be used for caching to improve database performance. While data grids offer benefits like scalability, their use requires a different programming model than databases. They may replace databases for some use cases like analytics but databases will remain important for their maturity and existing implementations. Data grids are best viewed as complementing rather than replacing databases.
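The caching use of a data grid described above is usually the cache-aside pattern: check the grid first, fall back to the database on a miss, and populate the grid on the way out. In this sketch the grid is just a dict and `db_load` stands in for a real database call.

```python
# Cache-aside lookup: fast path hits the in-memory grid, slow path
# loads from the database and populates the grid for next time.
def cached_get(key, grid, db_load, stats):
    if key in grid:
        stats["hits"] += 1
        return grid[key]
    stats["misses"] += 1
    value = db_load(key)   # slow path: go to the database
    grid[key] = value      # populate the grid
    return value
```

The different programming model the summary mentions shows up here too: the application, not the database, owns consistency between grid and store (invalidation, expiry, write-through).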
Scylla Summit 2016: ScyllaDB, Present and Future (ScyllaDB)
Where is Scylla now and where is it going? ScyllaDB's CTO Avi Kivity outlines the 3 ScyllaDB Commitments, and gives an overview of the ScyllaDB road map.
Galder Zamarreño from Red Hat presented on Infinispan, an open source in-memory data grid platform. Infinispan can be used as a local cache, clustered cache, or as a data grid. As a data grid, it provides a highly available, distributed, and elastic data store. Infinispan also enables users to build their own data-as-a-service solutions in private clouds by virtualizing data and making it accessible in an elastic and scalable manner. Major companies use Infinispan both as a cache (e.g. for Hibernate) and as a data grid for applications requiring real-time access to distributed data.
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...) (DataStax)
This document discusses using Apache Ignite to enable in-memory SQL on Apache Cassandra. It provides an overview of GridGain's enterprise and open source strategies, with Ignite being based on the open source version. It then discusses EPAM's engineering capabilities. The remainder discusses Ignite's capabilities for scalable SQL queries with ACID transactions on Cassandra and provides a demo comparing performance of OLTP and OLAP queries between Cassandra and Ignite. Contact information and URLs for more information on Ignite and using it with Cassandra are also provided.
High Availability with Novell Cluster Services for Novell Open Enterprise Ser... (Novell)
High availability provides a safety net for single points of hardware failure. This session will identify the software and hardware requirements for implementing Novell Cluster Services with Novell Open Enterprise Server. We'll cover concepts related to design, installation and monitoring. We'll also show you real-world clustering examples for Novell GroupWise, Novell Teaming and Novell iFolder.
A quick intro to DevCloud, the CloudStack sandbox, and how to use CloudMonkey to manage your cloud.
DevCloud is a VirtualBox image that contains the CloudStack source code and is set up to run the storage infrastructure needed by CloudStack, plus the networking setup to build the guest network of the VMs. Tiny Linux instances can be started within the DevCloud VM, making use of nested virtualization.
This is a perfect setup to discover CloudStack, give demos, and test new code. It is used to test new releases and verify basic functionality. You can run DevCloud on your laptop and then use the command line interface CloudMonkey to make API calls to your DevCloud instance.
This is the perfect complement to the talk on CloudMonkey and shows the basic functionality of a cloud: instance creation, snapshots, networking, network offerings, and AWS EC2 compatibility.
My talk from the BACD (http://buildacloud.org) workshop in Ghent, Belgium.
All videos can be viewed at: http://www.youtube.com/playlist?list=PLb899uhkHRoZZefRW5XmCb8QBcRO7o74E
This is an introductory talk for the workshop. It introduces CloudStack and the community at the Apache Software Foundation, presents the basic layers of cloud computing (IaaS, PaaS, and SaaS), and shows how the CloudStack ecosystem addresses all of them. It covers the basic features of CloudStack: networking with a focus on SDN (Software Defined Networking), storage with a focus on large-scale object stores (Ceph), a use case with Spotify, a PaaS with Karaf and Fuse Fabric, the API using Deltacloud (which provides the CIMI standard interface), and an application integration using the CloudStack API with Activeeon.
This is the perfect complement to the videos on YouTube and serves as an introduction to CloudStack.
The document discusses the semantic web and metadata management. It defines the semantic web as a universal medium for exchanging information electronically that can be processed and still have meaning. It discusses challenges like overcoming prior integration issues and determining return on investment. It also discusses the importance of the semantic web for business needs like re-purposing data instead of re-creating it. Finally, it discusses getting past relevancy overload on the web by making concept URIs more precise to improve search results.
Paris NoSQL User Group - In Memory Data Grids in Action (without transactions... (Cyrille Le Clerc)
In Memory Data Grids in Action with Oracle Coherence presented to No SQL users.
The "transactions" chapter is missing as it has been rescheduled to another session.
This document summarizes Tom Wilkie's presentation on Acunu & OCaml. It discusses how Acunu has evolved from small databases in 1990 to distributed, shared-nothing databases today. It presents the architecture and components of Acunu's storage core, which is built using OCaml. Performance results are shown for Acunu's prototypes using different data structures like doubling arrays, demonstrating high insertion rates.
This document summarizes SQL Azure, Microsoft's cloud-based relational database service. It describes the multi-tenant architecture with accounts, servers, and databases. It explains concepts like replication for high availability and reconfiguration to handle failures. The document also discusses the hardware and software deployment methods used to provide a reliable cloud database platform at scale.
Acunu and Hailo: a realtime analytics case study on CassandraAcunu
Hailo is a taxi app that receives a hail every 4 seconds across 15 cities. It launched on AWS using MySQL but adopted Cassandra and Acunu for greater resilience during international expansion. Cassandra provided high availability and global replication. Acunu provided analytics capabilities on Cassandra data. Hailo uses Cassandra for entity storage and Acunu for analytics, seeing benefits like simplified data modeling, rich queries, and infrastructure monitoring. Choosing these platforms allowed for high availability, multi-data center operation, and scaling to support growth.
- Cassandra nodes are clustered in a ring, with each node assigned a random token range to own.
- Adding or removing nodes traditionally required manually rebalancing the token ranges, which was complex, impacted many nodes, and took the cluster offline.
- Virtual nodes assign each physical node multiple random token ranges of varying sizes, allowing incremental changes where new nodes "steal" ranges from others, distributing the load evenly without manual work or downtime.
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu
Hailo, the taxi app, has served more than 5 million passengers in 15 cities and has taken fares of $100 million this year. I'm going to talk about how that rapid growth has been powered by a platform based on Cassandra and operational analytics and insights powered by Acunu Analytics. I'll cover some challenges and lessons learned from scaling fast!
Understanding Cassandra internals to solve real-world problemsAcunu
The document summarizes Nicolas Favre-Felix's presentation on Cassandra internals at a Cassandra London meetup. It discusses four common problems encountered with Cassandra - high read latency, high CPU usage with little activity, long nodetool repair times, and optimizing write throughput. For each problem, it describes symptoms, analysis using tools like nodetool, and solutions like adjusting the data model, increasing thread pool sizes, and adding hardware resources. The key takeaways are that monitoring Cassandra is important, using the right data model impacts performance, and understanding how Cassandra stores and arranges data on disk is essential to optimization.
Talk for the Cassandra Seattle Meetup April 2013: http://www.meetup.com/cassandra-seattle/events/114988872/
Cassandra's got some properties which make it an ideal fit for building real-time analytics applications -- but getting from atomic increments to live dashboards and streaming queries is quite a stretch. In this talk, Tim Moreton, CTO at Acunu, talks about how and why they built Acunu Analytics, which adds rich SQL-like queries and a RESTful API on top of Cassandra, and looks at how it keeps Cassandra's spirit of denormalization under the hood.
The document describes how Apache Cassandra can be used for real-time analytics on streaming data. It provides an example of counting Twitter mentions of a term per day in real-time by incrementing counters in Cassandra as tweets are processed. This allows queries to be answered by reading the counters. More complex queries can be supported by storing aggregated data in a denormalized format across rows and columns in Cassandra.
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Acunu
The document discusses implementing real-time analytics on Twitter data using Cassandra. It describes incrementing counters for each tweet to track token frequencies over time. This allows querying token mentions within a date range by reading the relevant counter columns. However, Cassandra's random partitioner prevents efficient range queries on rows. Instead, the solution denormalizes the data into wide rows with time buckets as columns to allow fast counting of token mentions within each time period through a single disk read. The document provides code examples and encourages experimenting with an open source implementation.
This document discusses real-time analytics with Cassandra. It includes sections on motivation/alternatives, what real-time analytics with Cassandra is, how it works, approximate analytics, and what problems it can help solve. The document contains log data as an example of the type of data that can be analyzed with this technique.
- The document discusses Acunu Analytics, a real-time big data analytics platform.
- It addresses the motivation for developing Acunu Analytics compared to alternatives. It also briefly describes what Acunu Analytics is, how it works, and what problems it can help solve.
- The main topics covered are the product itself, its capabilities for real-time analytics of big data, and potential use cases.
Realtime Analytics on the Twitter Firehose with CassandraAcunu
This document discusses using Cassandra for real-time analytics of Twitter data. It describes incrementing counters in Cassandra as tweets are processed to track metrics like mentions over time. This allows queries to retrieve trends by reading counters with a single I/O, rather than scanning large amounts of data. The document demonstrates preparing tweet data by tokenizing and incrementing counters in time buckets. It also covers implementing a range query to retrieve mentions between dates from a wide row with time buckets as columns.
This document discusses a distributed database called Acunu that is tunably consistent, highly available, and partition tolerant. It can scale out on commodity servers and provides high performance. The database uses a multi-master architecture without single points of failure and supports data replication across multiple data centers. It also provides a simple but powerful data model and is well-suited for applications involving high-velocity data.
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
The document discusses NoSQL, NewSQL, and other database technologies that are emerging to address limitations of relational databases in scaling to meet demands for performance, availability, and flexibility. It provides an overview of different categories of NoSQL databases and NewSQL solutions, and analyzes drivers like scalability, performance, relaxed consistency, agility, and complexity of data that are contributing to adoption of these new database approaches.
Cassandra EU 2012 - Putting the X Factor into CassandraAcunu
Malcolm Box discusses Tellybug's experience using Cassandra to power voting applications for reality TV shows like Britain's Got Talent and The X Factor. They started with Cassandra to handle high write loads from millions of votes but found counting to be more challenging than expected. They implemented sharded counters in Memcached with Cassandra as the source of truth. While Cassandra scaled well for writes, reads had performance issues. Backup and data integrity also presented operational challenges as their usage of Cassandra evolved.
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Acunu
The document discusses the history and development of Cassandra Query Language (CQL), which provides an SQL-like interface for querying Apache Cassandra databases. It describes CQL evolving from versions 1.0 through 3.0 to become more standardized and user-friendly. Key points include CQL initially being introduced in Cassandra 0.8 to replace the low-level Thrift API, its goals of being simple, intuitive, and high performing, and ongoing work to improve its interface stability and driver support across languages.
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixAcunu
The document discusses Cassandra's storage internals. It describes how Cassandra writes data to memtables and commit logs in memory before flushing to immutable SSTables on disk. It also explains how compaction merges SSTables to reclaim space and improve performance. For reads, Cassandra uses memtables, bloom filters on SSTables, key caches, and row caches to minimize disk I/O. Counters are implemented by coordinating writes across replicas.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
2. Outline
• Why Castle?
• A [quick] tour of Castle
• Cassandra on Castle
• An aside into Memcache
• Cross-cluster snapshots and clones
Saturday, 24 September 2011
3. Before the Flood
1990
Small databases
BTree indexes
BTree File systems
RAID
Old hardware
4. Two Revolutions
2010
Distributed, shared-nothing databases
Write-optimised indexes
BTree file systems
RAID
New hardware
[per-node stack, repeated across the cluster]
5. Bridging the Gap
2011
Distributed, shared-nothing databases
Castle
New hardware
[per-node stack, repeated across the cluster]
6. [Full-stack architecture diagram. Userspace: keys and values enter via an async shared-memory ring and a streaming interface (range queries, buffered key/value insert and get). Kernelspace, in the Acunu kernel: the Doubling Arrays layer (insert queues, Bloom filters, array management, doubling-array merges), the Arrays layer (mod-list B-trees and the version tree), and the cache / "extent" layer (extent manager & mapper, freespace allocator, prefetcher, flusher, block mapping & page cache). Below that, the Linux kernel's block and memory-management layers.]
7. Castle
• Like ZFS+BDB for Big Data
• Open source (GPLv2; MIT for user libraries): http://bitbucket.org/acunu
• Loadable kernel module, targeting CentOS's 2.6.18
• http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
[The full-stack architecture diagram, shown alongside the bullets]
8. The Interface
[Architecture diagram with the userspace interface layer highlighted: the async shared-memory ring and the streaming interface (range queries, buffered key/value insert and get); implemented in castle_{back,objects}.c]
9. The Interface
Tree of versions, with an attachment
• Create, snapshot, clone
• Attach/detach
• Keys: any dimensional
• Values: any size
• Simple get, put, delete
• Iterator, slice interfaces
• Streaming interface
[Diagram: a version tree rooted at v0, with descendants v1, v3, v12, v13, v15, v16, v24]
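The operations listed on this slide (create, snapshot and clone over a tree of versions, plus get/put/delete) can be sketched as a toy in-memory model. This is a hypothetical illustration of the semantics only, not the real libcastle interface:

```python
class VersionedStore:
    """Toy model: each version stores only its own writes; reads walk
    up the version tree to the nearest ancestor that wrote the key."""

    def __init__(self):
        self.parent = {0: None}  # version tree, child -> parent
        self.writes = {0: {}}    # per-version deltas
        self.next_v = 1

    def clone(self, v):
        """Snapshot/clone: a new writable version whose parent is v."""
        child, self.next_v = self.next_v, self.next_v + 1
        self.parent[child] = v
        self.writes[child] = {}
        return child

    def put(self, v, key, value):
        self.writes[v][key] = value

    def delete(self, v, key):
        self.writes[v][key] = None  # tombstone shadows ancestor writes

    def get(self, v, key):
        while v is not None:
            if key in self.writes[v]:
                return self.writes[v][key]
            v = self.parent[v]
        return None

store = VersionedStore()
store.put(0, "k", "old")
v1 = store.clone(0)          # cheap snapshot of version 0
store.put(v1, "k", "new")
print(store.get(0, "k"), store.get(v1, "k"))  # old new
```

Because a clone records only its own writes, snapshots are cheap and the version tree can branch arbitrarily, matching the v0/v1/v3... tree on the slide.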
10. The Interface
[The interface diagram again, highlighting castle_{back,objects}.c]
11. Doubling Array
[Architecture diagram with the Doubling Arrays layer highlighted: insert queues, Bloom filters, array management, merges; implemented in castle_{da,bloom}.c]
12. Doubling Array: Inserts
Buffer arrays in memory until we have > B of them.
[Diagram: incoming keys (2, 9, ...) accumulating in small sorted in-memory arrays]
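The insert path described here, buffering small sorted arrays and repeatedly merging equal-sized ones, is the core doubling-array idea. A minimal in-memory sketch (illustrative assumptions only; the real structure keeps B-sized buffers and merges arrays on disk):

```python
import bisect
import heapq

class DoublingArray:
    """Toy doubling array: level i holds a sorted array of 2**i keys,
    or nothing. Inserts merge equal-sized arrays upward, so all the
    heavy work is sequential merging."""

    def __init__(self):
        self.levels = []  # levels[i]: sorted list of length 2**i, or None

    def insert(self, key):
        carry = [key]  # a new sorted array of size 1
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry  # slot free: park the array here
                return
            # Slot occupied: merge two size-2**i arrays into one of 2**(i+1).
            carry = list(heapq.merge(self.levels[i], carry))
            self.levels[i] = None
            i += 1

    def lookup(self, key):
        # Check every level; the real DA short-circuits with Bloom filters.
        for arr in self.levels:
            if arr:
                j = bisect.bisect_left(arr, key)
                if j < len(arr) and arr[j] == key:
                    return True
        return False

da = DoublingArray()
for k in [9, 2, 2, 9, 5]:
    da.insert(k)
print(da.lookup(5), da.lookup(7))  # True False
```

Every key is written through a cascade of sequential merges, which is how the DA turns small random inserts into sequential IO.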
15. Doubling Array
[The Doubling Arrays diagram again: castle_{da,bloom}.c]
16. "Mod-list" B-Tree
So how to do snapshots and clones?
[Architecture diagram with the Arrays layer highlighted: mod-list B-trees and the version tree; implemented in castle_{btree,versions}.c]
17. Copy-on-Write BTree
Idea:
• Apply path-copying [DSST] to the B-tree
Problems:
• Space blowup: each update may rewrite an entire path
• Slow updates: as above
A log file system makes updates sequential, but relies on random access and garbage collection (its Achilles' heel!)
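The path-copying idea, and the space blowup it causes, can be seen on a toy immutable binary search tree: each update copies every node on the root-to-leaf path while sharing everything off the path (a sketch for intuition, not Castle's B-tree code):

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Path-copying insert: returns a new root. Every node on the
    root-to-leaf path is copied; everything off the path is shared."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    return Node(root.key, root.left, insert(root.right, key))

v0 = None
for k in [4, 2, 6]:
    v0 = insert(v0, k)
v1 = insert(v0, 5)            # v0 survives untouched as a snapshot
print(v0.right.left is None)  # True: old version never sees the 5
print(v1.right.left.key)      # 5
print(v1.left is v0.left)     # True: off-path subtree shared, path copied
```

Each update allocates O(depth) fresh nodes even for a one-key change, which is exactly the per-update rewrite cost the slide complains about.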
18.
                               Update                       Range query             Space
CoW B-Tree                     O(log_B Nv) random IOs       O(Z/B) random IOs       O(N B log_B Nv)
"BigTable"/LevelDB-style DA    O((log N)/B) sequential IOs  O(Z/B) sequential IOs   O(VN)
Castle: "Mod-list" in a DA     O((log N)/B) sequential IOs  O(Z/B) sequential IOs   O(N)

Nv = #keys live (accessible) at version v
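To make the table concrete, a back-of-the-envelope comparison of update costs, using illustrative assumptions of B = 4096 entries per block and the talk's 3-billion-row workload:

```python
import math

B = 4096           # entries per block (assumed for illustration)
N = 3_000_000_000  # keys, as in the 3-billion-row benchmark

cow_ios_per_update = math.log(N, B)    # O(log_B N) random IOs
da_ios_per_update  = math.log2(N) / B  # O((log N)/B) sequential IOs

print(f"CoW B-tree:     ~{cow_ios_per_update:.2f} random IOs per update")
print(f"Doubling array: ~{da_ios_per_update:.4f} sequential IOs per update")
# The DA amortises each update over large sequential merges, so it needs
# orders of magnitude fewer (and far cheaper) IOs than the CoW B-tree.
```

With these assumed numbers the DA does over two orders of magnitude fewer IOs per update, and they are sequential rather than random, which is where the throughput gap in the later benchmarks comes from.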
19. Stratified B-Trees
• Retires Copy-On-Write B-Trees, the bedrock of modern storage (Sun ZFS, NetApp WAFL, ...)
• Patent-pending, next-generation data structure
• Theoretically optimal, yet highly practical

[Embedded paper: "Copy-on-write B-tree finally beaten." Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes and Tom Wilkie. Acunu and Google. http://goo.gl/INTb1 http://goo.gl/gzihe

Abstract: A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree – it underlies many of today's file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn't inherit the B-tree's optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the 'stratified B-tree', which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores.

This paper presents some recent results on new constructions for B-trees that go beyond copy-on-write, that we call 'stratified B-trees'. They solve two open problems: firstly, they offer a fully-versioned B-tree with optimal space and the same lookup time as the CoW B-tree; secondly, they are the first to offer other points on the Pareto-optimal query/update tradeoff curve, and in particular, our structures offer fully-versioned updates in o(1) IOs while using linear space. Experimental results indicate 100,000s of updates/s on a large SATA disk, two orders of magnitude faster than a CoW B-tree. Since stratified B-trees subsume CoW B-trees (and indeed all other known versioned external-memory dictionaries), we believe there is no longer a good reason to use them for versioned data stores. Acunu is developing a commercial in-kernel implementation of stratified B-trees, which we hope to release soon.]
20. "Mod-list" B-Tree
[The Arrays-layer diagram again: mod-list B-trees and the version tree; castle_{btree,versions}.c]
21. Disk Layout: RDA
[Architecture diagram with the cache / "extent" layer highlighted: extent manager & mapper, freespace allocator, prefetcher, flusher, block mapping & page cache; implemented in castle_{cache,extent,freespace,rebuild}.c]
23. SSD tiering [taster]
• Why? Key to larger-than-cache random reads
• v1: SSD for metadata structures
• Redundancy provided by disk
• SSD for selected collection data (CFs)
• 10x the write rate on SSDs compared to regular filesystems
24. [The full-stack architecture diagram again, from the userspace shared-memory and streaming interfaces down through Doubling Arrays, Arrays and the cache / "extent" layer to the Linux kernel's block and memory-management layers]
25. Cassandra on Castle
• Eliminate all 'storage heavy lifting'
• Extend ColumnFamilyStore
• Efficient JNI bindings to libcastle C library
• row, col, value, t: (row, col) -> (t, value)
• row, a|b|c|d, value, t: (row, a, b, c, d, col) -> (t, value)
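The last two bullets describe flattening Cassandra's (row, column) coordinates into a multi-dimensional Castle key whose value carries the timestamp. A sketch of the 2-D case (hypothetical helpers, not the real JNI binding; the composite a|b|c|d case just adds dimensions to the key tuple):

```python
# Sketch of the (row, col) -> (t, value) mapping described above;
# hypothetical illustration, not the real libcastle JNI binding.

def make_key(row, col):
    """Castle keys are multi-dimensional; model one as a tuple."""
    return (row, col)

def put(store, row, col, value, t):
    key = make_key(row, col)
    old = store.get(key)
    if old is None or t > old[0]:  # last-write-wins on timestamp
        store[key] = (t, value)

def get(store, row, col):
    return store.get(make_key(row, col))

store = {}
put(store, "user:42", "name", "alice", t=1)
put(store, "user:42", "name", "bob", t=2)    # newer timestamp wins
put(store, "user:42", "name", "stale", t=0)  # older write is ignored
print(get(store, "user:42", "name"))  # (2, 'bob')
```

Keeping the timestamp in the value gives Cassandra's last-write-wins semantics while letting Castle's multi-dimensional keys do the storage heavy lifting.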
26. Small random inserts
Inserting 3 billion rows
[Chart: insert throughput over time, Acunu-powered Cassandra vs 'standard' Cassandra]
27. Insert latency
While inserting 3 billion rows
[Chart: insert latency, Acunu-powered Cassandra (x) vs 'standard' Cassandra (+)]
28. Small random range queries
Performed immediately after inserts
[Chart: range-query throughput, Acunu-powered Cassandra vs 'standard' Cassandra]
29. Memcache + Cassandra
Same data! 100k random inserts/sec!
[Diagram: a Cassandra client (get/insert) and a memcached client (get/put) hitting the same per-node stacks; on each node, Cassandra and memcache, each with its own replication logic, run side by side on Castle over the same hardware]
30. v2: Cross-cluster versions
• Eventually consistent
• Spans data centers
• Tolerates node failure, network partition
• High performance, no space overhead
• Dev/Test/Staging on Prod clusters
31. So...
• Castle = ZFS + BDB for Big Data
• Cassandra on Castle runs apps unmodified
• Up to 100x throughput under load
• No GC pauses: very predictable latencies
• v2: Cross-cluster snapshot and clone
• SSD optimisation
33. Questions?
Tim Moreton // @timmoreton
http://goo.gl/INTb1 http://goo.gl/gzihe
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.