Cache Tiering and Erasure Coding

RED HAT CONFIDENTIAL | NDA ONLY
CACHE TIERING AND ERASURE CODING
#ceph-devel
shinobu

AGENDA
■ CEPH MOTIVATING PRINCIPLES
■ CEPH COMPONENTS
■ ARCHITECTURE COMPONENTS
■ RADOS
■ LIBRADOS
■ RADOS COMPONENTS
■ DATA PLACEMENT
■ CACHE TIERING
■ ERASURE CODING
1

CEPH MOTIVATING PRINCIPLES
■ All components must scale horizontally
■ There can be no single point of failure
■ The solution must be hardware agnostic
■ Should use commodity hardware
■ Self-manage whenever possible
■ Open source (LGPL)
■ Move beyond legacy approaches
  ■ Client / cluster instead of client / server
  ■ Ad hoc HA
2

CEPH COMPONENTS
Consumers: APP, HOST/VM, CLIENT
RGW: A web services gateway for object storage, compatible with S3 and Swift
RBD: A reliable, fully-distributed block device with cloud platform integration
CephFS: A distributed file system with POSIX semantics and scale-out metadata management
LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
3

ARCHITECTURE COMPONENTS
[Figure: the Ceph component stack (APP, HOST/VM, CLIENT on top of RGW, RBD, CephFS, LIBRADOS, RADOS)]
4

THE RADOS GATEWAY
[Figure: two APPLICATION instances, each in front of RADOSGW and LIBRADOS, connected to the RADOS CLUSTER with three monitors (M)]
5

[Figure: Ceph component stack, repeated as a section divider]
6

STORING VIRTUAL DISK: LIBRBD
[Figure: VM on a HYPERVISOR with LIBRBD, connected to the RADOS CLUSTER and its monitors (M)]
7

KERNEL MODULE: KRBD
[Figure: LINUX HOST with KRBD, connected to the RADOS CLUSTER and its monitors (M)]
8

RBD FEATURES
■ Stripe images across entire cluster (pool)
■ Read-only snapshots
■ Copy-on-write clones
■ Broad integration
  ■ QEMU
  ■ Linux kernel
  ■ iSCSI (STGT, LIO)
  ■ OpenStack, CloudStack, Nebula, Ganeti, Proxmox
■ Incremental backup (relative to snapshot)
9

RBD FEATURES
■ Image mirroring
  ■ Asynchronous replication to another cluster
  ■ Replica(s) are crash consistent
  ■ Replication is per-image
  ■ Each image has a data journal
  ■ The rbd-mirror daemon does the work
[Figure: CLUSTER A (HYPERVISOR, LIBRBD, Journal) and CLUSTER B (HYPERVISOR, LIBRBD), linked by rbd-mirror]
10

[Figure: Ceph component stack, repeated as a section divider]
11

SEPARATE METADATA SERVER
[Figure: LINUX HOST with KERNEL MODULE, sending metadata and data separately to the RADOS CLUSTER and its monitors (M)]
12

SCALABLE METADATA SERVERS
MDS:
■ Manages metadata for a POSIX-compliant shared filesystem
  ■ Directory hierarchy
  ■ File metadata (owner, timestamps, mode, etc.)
  ■ Snapshots on any directory
■ Clients stripe file data in RADOS
  ■ MDS not in data path
■ MDS stores metadata in RADOS
■ Dynamic MDS cluster scales to 10s or 100s
■ Only required for shared file system
13

LIBRADOS
[Figure: Ceph component stack, repeated as the section divider for LIBRADOS]
14

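LIBRADOS API
#include <rados/librados.hpp>

int main() {
  // Connect to the cluster as client.admin.
  librados::Rados rados;
  librados::IoCtx io_ctx;
  rados.init("admin");
  rados.conf_read_file(nullptr);  // search the default ceph.conf locations
  rados.connect();

  // Create a pool and open an I/O context on it.
  rados.pool_create("swimming_pool");
  rados.ioctx_create("swimming_pool", io_ctx);

  // Synchronous write that replaces the whole object.
  librados::bufferlist bl;
  bl.append("water");
  io_ctx.write_full("octopus", bl);

  // Asynchronous read: up to 4193404 bytes starting at offset 0.
  librados::bufferlist rbl;
  librados::AioCompletion *read_completion1 = librados::Rados::aio_create_completion();
  io_ctx.aio_read("octopus", read_completion1, &rbl, 4193404, 0);
  read_completion1->wait_for_safe();
  int ret = read_completion1->get_return_value();
  read_completion1->release();

  // Set an extended attribute through a compound write operation.
  librados::ObjectWriteOperation write_op;
  librados::bufferlist xbl;
  xbl.append('2');
  write_op.setxattr("version", xbl);
  io_ctx.operate("octopus", &write_op);
  return ret < 0 ? 1 : 0;
}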
15

RADOS
[Figure: Ceph component stack, repeated as the section divider for RADOS]
16

RADOS COMPONENTS
OSD:
■ 10s to 1000s in a cluster
■ One per disk (or one per SSD, RAID group…)
■ Serve stored objects to clients
■ Intelligently peer for replication & recovery
17

OBJECT STORAGE DAEMON
[Figure: RADOS cluster with three monitors (M) and four OSDs, each OSD on its own FS and DISK]
18

RADOS COMPONENTS
MON (M):
■ Maintain cluster membership and state
■ Provide consensus for distributed decision-making
■ Small, odd number (e.g., 5)
■ Not part of data path
19

CRUSH:
■ Pseudo-random placement algorithm
  ■ Fast calculation, no lookup
  ■ Repeatable, deterministic
■ Statistically uniform distribution
■ Stable mapping
  ■ Limited data migration on change
■ Rule-based configuration
  ■ Infrastructure topology aware
  ■ Adjustable replication
  ■ Weighting
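
The placement properties listed above (calculated rather than looked up, repeatable, limited migration on change) can be illustrated with a small sketch. This is not the CRUSH algorithm: it uses plain rendezvous (highest-random-weight) hashing as a stand-in, ignores topology, rules, and weights, and the OSD and object names are made up for the example.

```python
import hashlib

def score(obj_name, osd):
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{obj_name}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(obj_name, osds, replicas=3):
    """Pick the `replicas` OSDs with the highest scores; no lookup table is kept."""
    return sorted(osds, key=lambda osd: score(obj_name, osd), reverse=True)[:replicas]

# Hypothetical cluster of six OSDs and 1000 objects.
osds = [f"osd.{i}" for i in range(6)]
objects = [f"obj-{i}" for i in range(1000)]

before = {o: place(o, osds) for o in objects}
after = {o: place(o, osds + ["osd.6"]) for o in objects}  # grow the cluster by one OSD

moved = sum(1 for o in objects if before[o] != after[o])
print(f"{moved} of {len(objects)} objects changed placement after adding one OSD")
```

Any client that knows the OSD list computes the same placement, and adding an OSD only moves replicas onto the new OSD; the remaining replicas stay where they were.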
20

DATA PLACEMENT
21

DATA PLACEMENT
[Figure: objects, drawn as bit patterns, placed across the RADOS cluster]
22

DATA PLACEMENT
[Figure: objects, drawn as bit patterns, placed across the RADOS cluster]
23

DATA PLACEMENT
[Figure: objects, drawn as bit patterns, placed across the RADOS cluster, with more objects shown than in the previous frame]
24

25
CACHE TIERING

26
TWO WAYS TO CACHE

TWO WAYS TO CACHE
■ Within each OSD
  ■ Combine SSD and HDD under each OSD
  ■ Make localized promote / demote decisions
  ■ Leverage existing tools
    ■ dm-cache, bcache, flashcache
    ■ Variety of caching controllers
  ■ We can help with hints
[Figure: OSD on FS on BLOCKDEV, backed by two DISKs]
27

TWO WAYS TO CACHE
[Figure: OSD on FS on BLOCKDEV, where dm-cache combines Data, Cache, and Metadata devices]
28

TWO WAYS TO CACHE
■ Cache on separate devices / nodes
  ■ Different hardware for devices / nodes
  ■ Slow nodes for cold data
  ■ High performance nodes for hot data
■ Add, remove, scale each tier independently
  ■ Unlikely to choose right ratios at procurement time
[Figure: OSD on FS on BLOCKDEV on DISK]
29

TIERED STORAGE
[Figure: APPLICATION on top of RADOS, with a CACHE POOL (REPLICATED) in front of a BACKING POOL (ERASURE CODED)]
30

RADOS TIERING PRINCIPLES
■ Each tier is a RADOS pool
  ■ Replicated or erasure coded
■ Tiers are durable
  ■ Replicate across OSDs in multiple hosts
■ Each tier has its own CRUSH policy
  ■ Map to SSD devices / hosts only
■ librados clients adapt to tiering topology
  ■ Transparently direct requests accordingly
  ■ No changes to RBD, RGW, CephFS, etc.
[Figure: Client (Objecter) in front of RADOS, with the CACHE TIER (promotion logic, tiering agent) and the BASE TIER]
31

32
I/O PATTERN
CACHE TIERING

33
WRITE HIT
CACHE TIERING

WRITE INTO CACHE POOL
[Figure: APPLICATION, RADOS with CACHE POOL (SSD): WRITEBACK and BACKING POOL (HDD); arrows: WRITE, ACK]
34

35
WRITE MISS
CACHE TIERING

WRITE MISS
[Figure: APPLICATION, RADOS, BACKING POOL (HDD); arrows: WRITE, PROMOTE, ACK]
36

37
PROXY WRITE
CACHE TIERING

PROXY WRITE
[Figure: APPLICATION, RADOS, BACKING POOL (HDD); arrows: WRITE, PROXY WRITE, ACK]
38

39
READ: CACHE HIT
CACHE TIERING

READ: CACHE HIT
[Figure: APPLICATION, RADOS, BACKING POOL (HDD); arrows: READ, READ REPLY]
40

41
READ: CACHE MISS
CACHE TIERING

READ: CACHE MISS
[Figure: APPLICATION, RADOS, BACKING POOL (HDD); arrows: READ, READ REPLY, PROMOTE]
42

43
READFORWARD
CACHE TIERING

READFORWARD
[Figure: APPLICATION, RADOS with CACHE POOL (SSD) and BACKING POOL (HDD); arrows: READ, REDIRECT, READ, READ REPLY]
44

45
FLUSH AND EVICT
CACHE TIERING

FLUSH AND/OR EVICT COLD DATA
[Figure: APPLICATION, RADOS, BACKING POOL (HDD); arrows: FLUSH, ACK, EVICT]
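
The write-back I/O patterns above (hit, miss with promote, flush, evict) can be walked through with a toy model. This is a sketch under stated assumptions, not Ceph's implementation: both pools are plain dictionaries, the `CacheTier` class and its method names are hypothetical, the coldest object is chosen by simple LRU order, and the proxy and forward modes are not modeled.

```python
from collections import OrderedDict

class CacheTier:
    """Toy write-back cache pool in front of a backing pool."""

    def __init__(self, backing, max_objects):
        self.backing = backing        # base tier: object name -> data
        self.cache = OrderedDict()    # cache tier, ordered coldest to hottest
        self.dirty = set()            # written in the cache, not yet flushed
        self.max_objects = max_objects
        self.log = []                 # (event, object name) pairs

    def _promote(self, name):
        """Copy an object from the backing pool into the cache pool."""
        self.cache[name] = self.backing[name]
        self.log.append(("promote", name))

    def write(self, name, data):
        if name in self.cache:
            self.log.append(("write-hit", name))
        else:
            self.log.append(("write-miss", name))
            if name in self.backing:
                self._promote(name)
        self.cache[name] = data       # the write lands in the cache pool only
        self.cache.move_to_end(name)
        self.dirty.add(name)
        self._agent()

    def read(self, name):
        if name in self.cache:
            self.log.append(("read-hit", name))
        else:
            self.log.append(("read-miss", name))
            self._promote(name)
        self.cache.move_to_end(name)
        data = self.cache[name]
        self._agent()
        return data

    def _agent(self):
        """Flush (if dirty) and evict the coldest objects while over the target."""
        while len(self.cache) > self.max_objects:
            name, data = next(iter(self.cache.items()))
            if name in self.dirty:
                self.backing[name] = data
                self.dirty.discard(name)
                self.log.append(("flush", name))
            del self.cache[name]
            self.log.append(("evict", name))

backing = {"a": b"old-a"}
tier = CacheTier(backing, max_objects=2)
tier.write("a", b"new-a")  # write miss: promote "a", then write in the cache
tier.write("b", b"b1")     # write miss on a brand-new object
tier.read("a")             # read hit, "a" becomes the hottest object
tier.write("c", b"c1")     # over target: "b" is flushed, then evicted
for event in tier.log:
    print(event)
```

Note that the backing pool still holds the old contents of "a" at the end: in write-back mode a dirty object reaches the base tier only when it is flushed.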
46

47
ERASURE CODING

ERASURE CODING
REPLICATED POOL (OBJECT stored as COPY, COPY, COPY)
■ Full copy of stored objects
■ Very high durability
■ 3x (200% overhead)
■ Quick recovery
ERASURE CODED POOL (OBJECT stored as chunks 1 2 3 4 5 6)
■ One copy plus parity
■ Cost-effective durability
■ 1.5x (50% overhead)
■ Expensive recovery
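
The overhead figures above follow from simple arithmetic: N replicas store N raw bytes per user byte, while an erasure-coded pool with k data chunks and m coding chunks stores (k + m) / k. The sketch below assumes k=4, m=2 for the 1.5x case, which is an inference from the six chunks in the figure; k=3, m=1 is the profile used in the configuration example later in the deck.

```python
def replication_overhead(copies):
    """Raw bytes stored per user byte with `copies` full replicas."""
    return float(copies)

def erasure_overhead(k, m):
    """Raw bytes stored per user byte with k data chunks and m coding chunks."""
    return (k + m) / k

cases = [
    ("3 replicas", replication_overhead(3)),
    ("EC k=4, m=2", erasure_overhead(4, 2)),
    ("EC k=3, m=1", erasure_overhead(3, 1)),
]
for label, factor in cases:
    print(f"{label}: {factor:.2f}x raw, {100 * (factor - 1):.0f}% overhead")
```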
48

ERASURE CODING
[Figure: ERASURE CODED POOL in RADOS spanning OSDs 1 to 6]
49

ERASURE CODING
[Figure: the same pool, with the DATA CHUNKS marked]
50

ERASURE CODING
[Figure: the same pool, with the CODING CHUNKS marked]
51

ERASURE CODING
[Figure: an OBJECT stored across OSDs 1 to 6 of the ERASURE CODED POOL]
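
To make the split into data chunks and coding chunks concrete, here is a minimal sketch with k data chunks and a single XOR parity chunk (m=1). It is only an illustration of the idea: XOR parity is a stand-in chosen for brevity, it tolerates one lost chunk, and Ceph's erasure code plugins use more general codes. All function names are hypothetical.

```python
def split(data, k):
    """Split data into k equal-size chunks, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]

def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

def encode(data, k):
    """Return k data chunks followed by one coding (parity) chunk."""
    data_chunks = split(data, k)
    return data_chunks + [xor_chunks(data_chunks)]

def decode(chunks, original_len):
    """Rebuild the object from k+1 chunks where at most one entry is None."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost chunk")
    chunks = list(chunks)
    if missing:
        survivors = [c for c in chunks if c is not None]
        chunks[missing[0]] = xor_chunks(survivors)
    return b"".join(chunks[:-1])[:original_len]

obj = b"water in the swimming pool"
shards = encode(obj, k=3)   # one shard per OSD
shards[1] = None            # simulate losing the OSD that holds the second chunk
print(decode(shards, len(obj)) == obj)
```

Rebuilding a lost chunk needs every surviving chunk, which is one way to see why recovery is more expensive than copying a replica.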
52

53
I/O PATTERN
ERASURE CODING

54
EC READ
ERASURE CODING

EC READ
[Figure: CLIENT sends READ to the ERASURE CODED POOL in RADOS (OSDs 1 to 6)]
55

EC READ
[Figure: CLIENT READ, followed by READS among the OSDs of the pool]
56

EC READ
[Figure: READ REPLY returned to the CLIENT]
57

58
EC WRITE
ERASURE CODING

EC WRITE
[Figure: CLIENT sends WRITE to the ERASURE CODED POOL in RADOS (OSDs 1 to 6)]
59

EC WRITE
[Figure: CLIENT WRITE, followed by WRITES among the OSDs of the pool]
60

EC WRITE
[Figure: WRITE ACK returned to the CLIENT]
61

62
EC WRITE: DEGRADED
ERASURE CODING

EC WRITE: DEGRADED
[Figure: CLIENT WRITE to the ERASURE CODED POOL in RADOS (OSDs 1 to 6), followed by WRITES among the OSDs]
63

64
EC WRITE: PARTIAL FAILURE
ERASURE CODING

EC WRITE: PARTIAL FAILURE
[Figure: CLIENT WRITE to the ERASURE CODED POOL in RADOS (OSDs 1 to 6), followed by WRITES among the OSDs]
65

EC WRITE: PARTIAL FAILURE
[Figure: WRITES among the OSDs; chunk versions across the six OSDs: B B B A A A]
66

CONFIGURATION EXAMPLE
# Create pools
sudo ceph osd erasure-code-profile set myecprofile ruleset-failure-domain=osd k=3 m=1
sudo ceph osd pool create myecpool 12 12 erasure myecprofile
sudo ceph osd pool create mycache 64 64
sudo ceph osd pool set mycache crush_ruleset 3
# Set up a read/write cache pool mycache for pool myecpool
sudo ceph osd tier add myecpool mycache
sudo ceph osd tier cache-mode mycache writeback
sudo ceph osd tier set-overlay myecpool mycache
# Set the target size and enable the tiering agent
sudo ceph osd pool set mycache hit_set_type bloom
sudo ceph osd pool set mycache hit_set_count 1
sudo ceph osd pool set mycache hit_set_period 3600
sudo ceph osd pool set mycache target_max_objects 250
sudo ceph osd pool set mycache target_max_bytes 1000000000000  # 1 TB
sudo ceph osd pool set mycache min_read_recency_for_promote 1
sudo ceph osd pool set mycache min_write_recency_for_promote 1
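
How the hit set and recency settings above interact can be sketched with a simplified model. It assumes the promotion rule is "promote on read only if the object appears in each of the N most recent hit sets", uses exact Python sets where the configuration uses bloom filters, and the `HitSets` class and `read` helper are hypothetical names introduced for the example.

```python
from collections import deque

class HitSets:
    """Keeps the most recent `count` hit sets; the last one is the current period."""

    def __init__(self, count):
        self.sets = deque([set()], maxlen=count)

    def record(self, name):
        """Remember that the object was accessed in the current period."""
        self.sets[-1].add(name)

    def rotate(self):
        """Start a new period (every hit_set_period seconds in a real cluster)."""
        self.sets.append(set())

    def should_promote(self, name, min_recency):
        """True if the object is in each of the `min_recency` newest hit sets."""
        if min_recency == 0:
            return True
        recent = list(self.sets)[-min_recency:]
        return len(recent) == min_recency and all(name in s for s in recent)

hits = HitSets(count=1)  # mirrors hit_set_count 1

def read(name):
    """Decide on promotion first, then record the access."""
    promote = hits.should_promote(name, min_recency=1)
    hits.record(name)
    return promote

first = read("octopus")   # not yet in the current hit set
second = read("octopus")  # seen earlier in this period
hits.rotate()             # the period ends; with count=1 the old set is dropped
third = read("octopus")
print(first, second, third)
```

Under these assumptions a single cold read does not pull an object into the cache pool; it takes a repeated access within the same period.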
67
CONFIGURATION EXAMPLE
# CRUSH rule
root ssd {
    id -6
    # weight 8.000
    alg straw
    hash 0  # rjenkins1
    item octopus01-ssd weight 1.000
}
rule cacher {
    ruleset 3
    type replicated
    min_size 3
    max_size 10
    step take ssd
    # chooseleaf picks hosts, then one OSD under each chosen host
    step chooseleaf firstn 0 type host
    step emit
}
68
CONTRIBUTION
http://docs.ceph.com/docs/master/dev/
IRC AND MAILING LIST
http://ceph.com/resources/mailing-list-irc/
BUG REPORT
http://tracker.ceph.com/projects/ceph/issues/
BENCHMARKING
Cache Tiering
http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S303E_Zhang.pdf
Erasure Coding
http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S303E_Roy.pdf

Red Hat
shinobu@redhat.com
Shinobu Kinjo
THANK YOU!
