Email Storage with Ceph
@ SUSE Enterprise Storage
Danny Al-Gaaf
Senior Cloud Technologist
Deutsche Telekom AG
danny.al-gaaf@telekom.de
Agenda
• TelekomMail platform
• Motivation
• Ceph
• librmb and dovecot
• Hardware
• Placement
• PoC and Next Steps
• Conclusion
TelekomMail platform
TelekomMail
• DT's mail platform for customers
• dovecot
• Network-Attached Storage (NAS)
• NFS (sharded)
• ~1.3 petabyte net storage
• ~39 million accounts
NFS Operations
NFS Traffic
NFS Statistics
~42% usable raw space
NFS IOPS
• max: ~835,000
• avg: ~390,000
relevant IO:
• WRITE: 107,700 / 50,000
• READ: 65,700 / 30,900
Email Statistics
6.7 billion emails
• 1.2 petabyte net
• compression
1.2 billion index/cache/metadata files
• avg: 24 kiB
• max: ~600 MiB
Email Distribution
How are emails stored?
Emails are written once, read many (WORM)
Usage depends on:
• protocol (IMAP vs POP3)
• user frontend (mailer vs webmailer)
Metadata, caches and indexes are usually kept separately
• loss of metadata/indexes is critical
Without attachments, emails are easy to compress
Where to store emails?
Filesystem
• maildir
• mailbox
Database
• SQL
Objectstore
• S3
• Swift
Ceph
Motivation
• scale-out vs scale-up
• fast self-healing
• commodity hardware
• prevent vendor lock-in
• open source where feasible
• reduce Total Cost of Ownership (TCO)
RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby and PHP
• RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
Ceph Options
Filesystem
• CephFS
• NFS Gateway via RGW
• any filesystem on RBD
Objectstore
• S3/Swift via RGW
• RADOS
Where to store in Ceph?
CephFS
• same issues as NFS
• mail storage on POSIX layer adds complexity
• no option for emails
• usable for metadata/caches/indexes
Security
• requires direct access to storage network
• only for dedicated platform
[Diagram: Linux host with the CephFS kernel module accessing data and metadata in the RADOS cluster]
Where to store in Ceph?
RBD
• requires sharding and large RBDs
• requires RBD/fs extend scenarios
• regular account migration
• no sharing between clients
• impracticable
Security
• no direct access to storage network required
• secure through hypervisor abstraction (libvirt)
[Diagram: VM on a hypervisor using librbd to access the RADOS cluster]
Where to store in Ceph?
RadosGW
• can store emails as objects
• extra network hops
• potential bottleneck
• very likely not fast enough
Security
• no direct access to Ceph storage network required
• connection to RadosGW can be secured (WAF)
[Diagram: application talks REST to RADOSGW, which accesses the RADOS cluster via librados (socket)]
Where to store in Ceph?
Librados
• direct access to RADOS
• parallel I/O (see the sketch below)
• not optimized for emails
• how to handle metadata/caches/indexes?
Security
• requires direct access to storage network
• only for dedicated platform
[Diagram: applications linked against librados access the RADOS cluster directly]
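The parallel I/O point can be made concrete with librados' asynchronous API. The following is a minimal sketch, not librmb code; the client id, pool name and object ids are hypothetical.

// Minimal librados sketch: write two mail objects in parallel with the
// asynchronous API. Client id, pool and object names are hypothetical.
#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");                // client.admin, assumed for the example
  cluster.conf_read_file(nullptr);      // default ceph.conf search path
  if (cluster.connect() < 0) { std::cerr << "connect failed\n"; return 1; }

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);   // hypothetical pool name

  librados::bufferlist mail1, mail2;
  mail1.append(std::string("From: a@example.com\r\n\r\nbody 1\r\n"));
  mail2.append(std::string("From: b@example.com\r\n\r\nbody 2\r\n"));

  // Issue both writes without waiting in between.
  librados::AioCompletion *c1 = librados::Rados::aio_create_completion();
  librados::AioCompletion *c2 = librados::Rados::aio_create_completion();
  io.aio_write_full("mail-obj-1", c1, mail1);
  io.aio_write_full("mail-obj-2", c2, mail2);

  // Wait for both writes to complete on the cluster.
  c1->wait_for_complete();
  c2->wait_for_complete();
  std::cout << "rc1=" << c1->get_return_value()
            << " rc2=" << c2->get_return_value() << std::endl;

  c1->release();
  c2->release();
  cluster.shutdown();
  return 0;
}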
Dovecot and Ceph
Dovecot
Open source project (LGPL 2.1, MIT)
72% market share (openemailsurvey.org, 02/2017)
Objectstore plugin available (obox)
• supports only REST APIs like S3/Swift
• not open source
• requires Dovecot Pro
• large impact on TCO
Dovecot Pro obox Plugin
[Architecture diagram: the IMAP4/POP3/LMTP process uses the Storage API and the dovecot obox backend; RFC 5322 mails and index & cache bundles pass through the fs API to a local fscache/mail cache and to the object store backend, which stores RFC 5322 objects and index & cache bundles; the local index & cache (metacache) is written to local storage and synced with the object store]
DT's approach
• no open source solution on the market
• closed source is no option
• develop / sponsor a solution
• open source it
• partner with:
  – Wido den Hollander (42on.com) for consulting
  – Tallence AG for development
  – SUSE for Ceph
Ceph plugin for Dovecot
First step: hybrid approach
Emails
• store in the RADOS cluster
Metadata and indexes
• store in CephFS (see the sketch below)
Be as generic as possible
• split out code into libraries
• integrate into corresponding upstream projects
[Diagram: a Mail User Agent connects via IMAP/POP to Dovecot on a Ceph client; the rbox storage plugin stores mails via librmb/librados and indexes on CephFS (Linux kernel client), both backed by the RADOS cluster]
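As a rough illustration of this hybrid split (not the plugin's actual code), the sketch below writes a mail object to a RADOS pool while appending index data to a file on a CephFS mount; the pool name, mountpoint and file names are hypothetical.

// Hybrid-approach sketch: mail body goes to RADOS, index data to CephFS.
// Pool name, CephFS mountpoint and object/file names are hypothetical.
#include <rados/librados.hpp>
#include <fstream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0) return 1;

  // 1) The RFC 5322 mail is stored as an object in a RADOS pool.
  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);
  librados::bufferlist mail;
  mail.append(std::string("Subject: hello\r\n\r\nbody\r\n"));
  io.write_full("mail-guid-0001", mail);

  // 2) Index/metadata stay on a POSIX path that happens to be a CephFS
  //    mount, so Dovecot's existing lib-index code keeps working unchanged.
  std::ofstream index("/srv/cephfs/mailboxes/user1/index.log",
                      std::ios::binary | std::ios::app);
  index << "uid=1 oid=mail-guid-0001\n";   // illustrative content only

  cluster.shutdown();
  return 0;
}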
Librados mailbox (librmb)
Generic email abstraction on top of librados
Out of scope:
• user data and credential storage
  – the target is huge installations, where solutions are usually already in place
• full text indexes
  – solutions already exist that work outside the email storage
Librados mailbox (librmb)
[Architecture diagram: the IMAP4/POP3/LMTP process uses the Storage API and the dovecot rbox backend; RFC 5322 mails are stored as RFC 5322 objects in the RADOS cluster via librmb/librados, while dovecot lib-index keeps index & metadata on CephFS (Linux kernel client), with a cache on local storage]
librmb - Mail Object Format
Mails are immutable regarding the RFC 5322 content
RFC 5322 content is stored directly in RADOS
Immutable attributes used by Dovecot are stored as RADOS xattrs:
• rbox format version
• GUID
• received and save date
• POP3 UIDL and POP3 order
• mailbox GUID
• physical and virtual size
• mail UID
Writable attributes are stored in Dovecot index files
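As an illustration of this layout, here is a minimal librados sketch (not the actual librmb code) that writes the RFC 5322 content as an object and attaches a few immutable attributes as xattrs, mirroring the single-letter keys (U, M, G, I, ...) visible in the rmb dump below; the pool, namespace and values are hypothetical.

// Sketch only: store an RFC 5322 mail as a RADOS object and attach
// immutable Dovecot attributes as xattrs. Not the real librmb code;
// pool, namespace and attribute values are made up for the example.
#include <rados/librados.hpp>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0) return 1;

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);   // hypothetical pool
  io.set_namespace("t1");                     // e.g. one namespace per user, as in the rmb example

  const std::string oid = "a2d69f2868b49a596a1d00009c60b9f7";  // mail GUID as object id

  // The RFC 5322 content goes into the object data, written once.
  librados::bufferlist mail;
  mail.append(std::string("Date: Tue, 14 Jan 2003 00:18:11 +0000\r\n\r\nHello\r\n"));
  io.write_full(oid, mail);

  // Immutable attributes become xattrs on the same object.
  auto set_attr = [&io, &oid](const char *key, const std::string &val) {
    librados::bufferlist bl;
    bl.append(val);
    io.setxattr(oid, key, bl);
  };
  set_attr("I", "0.1");                               // rbox format version
  set_attr("M", "ad54230e65b49a59381100009c60b9f7");  // mailbox GUID
  set_attr("G", "a3d69f2868b49a596a1d00009c60b9f7");  // mail GUID
  set_attr("U", "4");                                 // mail UID

  cluster.shutdown();
  return 0;
}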
Dump email details from RADOS
$> rmb -p mail_storage -N t1 ls M=ad54230e65b49a59381100009c60b9f7
mailbox_count: 1
MAILBOX: M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7
        mail_total=2, mails_displayed=2
        mailbox_size=5539 bytes
        MAIL:   U(uid)=4
                oid = a2d69f2868b49a596a1d00009c60b9f7
                R(receive_time)=Tue Jan 14 00:18:11 2003
                S(save_time)=Mon Aug 21 12:22:32 2017
                Z(phy_size)=2919 V(v_size) = 2919 stat_size=2919
                M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7
                G(mail_guid)=a3d69f2868b49a596a1d00009c60b9f7
                I(rbox_version): 0.1
[..]
RADOS Dictionary Plugin
Makes use of the Ceph omap key/value store
RADOS namespaces:
• shared/<key>
• priv/<key>
Used by Dovecot to store metadata, quota, ...
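To make the omap idea concrete, a minimal librados sketch (assumed names, not the plugin's actual code) could store a few dictionary entries as omap key/value pairs on a per-user object:

// Sketch: store Dovecot-style dictionary entries as omap key/value pairs
// on a RADOS object. Pool, namespace, object id and keys are hypothetical.
#include <rados/librados.hpp>
#include <map>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0) return 1;

  librados::IoCtx io;
  cluster.ioctx_create("mail_dictionary", io);   // hypothetical pool
  io.set_namespace("user-12345");                // e.g. private entries per user

  std::map<std::string, librados::bufferlist> kv;
  kv["priv/quota/storage"].append(std::string("52428800"));
  kv["shared/last_login"].append(std::string("1503310952"));

  librados::ObjectWriteOperation op;
  op.omap_set(kv);                  // update the key/value pairs atomically
  io.operate("dovecot-dict", &op);  // creates the object if it does not exist

  cluster.shutdown();
  return 0;
}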
It's open source!
License: LGPLv2.1
Language: C++
Location: github.com/ceph-dovecot/
Supported Dovecot versions:
• 2.2 (>= 2.2.21)
• 2.3
Ceph Requirements
Performance
• write performance for emails is critical
• metadata/index read/write performance
Cost
• Erasure Coding (EC) for emails
• replication for CephFS
Reliability
• MUST survive failure of disks, servers, racks and even fire compartments
Which Ceph Release?
Required features:
• BlueStore
  – should be at least 2x faster than FileStore
• CephFS
  – stable release
  – Multi-MDS
• Erasure coding
SUSE Products to use
SLES 12-SP3 and SES 5
Hardware
Commodity x86_64 server
• HPE ProLiant DL380 Gen9
• dual socket
  – Intel® Xeon® E5 V4
• 2x Intel® X710-DA2 Dual-port 10G
• 2x boot SSDs, SATA, HBA
• HBA, no separate RAID controller
CephFS, RADOS, MDS and MON nodes
Storage Nodes
CephFS SSD nodes
• CPU: E5-2643v4 @ 3.4 GHz, 6 cores, turbo 3.7 GHz
• RAM: 256 GByte, DDR4, ECC
• SSD: 8x 1.6 TB SSD, 3 DWPD, SAS, RR/RW (4k Q16) 125k/93k IOPS
RADOS HDD nodes
• CPU: E5-2640v4 @ 2.4 GHz, 10 cores, turbo 3.4 GHz
• RAM: 128 GByte, DDR4, ECC
• SSD: 2x 400 GByte, 3 DWPD, SAS, RR/RW (4k Q16) 108k/49k IOPS
  – for BlueStore database etc.
• HDD: 10x 4 TByte, 7.2K, 128 MB cache, SAS, PMR
Compute Nodes
MDS
• CPU: E5-2643v4 @ 3.4 GHz, 6 cores, turbo 3.7 GHz
• RAM: 256 GByte, DDR4, ECC
MON / SUSE admin
• CPU: E5-2640v4 @ 2.4 GHz, 10 cores, turbo 3.4 GHz
• RAM: 64 GByte, DDR4, ECC
Why this specific HW?
Community recommendations?
• OSD: 1x 64-bit AMD-64, 1 GB RAM per 1 TB of storage, 2x 1 GBit NICs
• MDS: 1x 64-bit AMD-64 quad-core, 1 GB RAM minimum per MDS, 2x 1 GBit NICs
NUMA, high-clocked CPUs and large RAM overkill?
• vendor did not offer single-socket nodes for this number of drives
• MDS performance is mostly CPU clock bound and partly single-threaded
  – high-clocked CPUs for fast single-threaded performance
• large RAM: better caching!
Placement
Issues
Datacenter
• usually two independent fire compartments (FCs)
• possibly additional virtual FCs
Requirements
• loss of customer data MUST be prevented
• any server, switch or rack can fail
• one FC can fail
• data replicated at least 3 times (or equivalent)
Issues
Questions
• How to place 3 copies in two FCs?
• How independent and reliable are the virtual FCs?
• Network architecture?
• Network bandwidth?
[Rack diagram: Fire Compartment A - one rack with switches, 1x MON, 1x MDS, 6x HDD nodes and 3x SSD nodes; Fire Compartment B - one rack with switches, 1x MON, 1x MDS, 6x HDD nodes, 3x SSD nodes and the SUSE admin node; Fire Compartment C (third room) - one rack with switches, 1x MON, 1x MDS and 3x SSD nodes]
Network
1G OAM network
10G network
• 2 NICs / 4 ports per node / SFP+ DAC
Multi-chassis Link Aggregation (MC-LAG/M-LAG)
• for aggregation and fail-over
Spine-leaf architecture
• interconnect does not need to match the theoretical rack/FC bandwidth
• L2: terminated in rack
• L3: ToR <-> spine / spine <-> spine
• Border Gateway Protocol (BGP)
[Network diagram: per-rack ToR switches (L2 terminated, MC-LAG, N x 10G SFP+ to the nodes) uplink via 40G QSFP (LAG, L3, BGP) to spine switches joined by a 2x 40G QSFP L3 crosslink; the fabric spans FC1, FC2, the virtual FC and the DC router (DC-R)]
Status and Next Steps
Status
Dovecot Ceph Plugin
• open sourced
  – initial SLES12-SP3, openSUSE Leap 42.3, and Tumbleweed RPMs
• still under development
  – still includes librmb
• planned: move to Ceph project
• https://github.com/ceph-dovecot/
  – https://build.opensuse.org/package/show/home:dalgaaf:dovecot-ceph-plugin
Testing
Functional testing
• set up a small 5-node cluster
  – SLES12-SP3 GMC
  – SES5 Beta
• run Dovecot functional tests against Ceph
Proof-of-Concept
Hardware
• 9 SSD nodes for CephFS
• 12 HDD nodes
• 3 MDS / 3 MON
• 2 FCs + 1 vFC
Testing
• run load tests
• run failure scenarios against Ceph
• improve and tune Ceph setup
• verify and optimize hardware
Further Development
Goal: pure RADOS backend, store metadata/index in Ceph omap
[Architecture diagram: the IMAP4/POP3/LMTP process uses the Storage API and the dovecot rbox backend; librmb/librados store RFC 5322 objects and the index directly in the RADOS cluster, with only a cache on local storage]
Next Steps
Production
• verify that all requirements are fulfilled
• integrate into production
• migrate users step by step
• extend to final size
  – 128 HDD nodes, 1200 OSDs, 4.7 PiB
  – 15 SSD nodes, 120 OSDs, 175 TiB
Conclusion
Summary and conclusions
Ceph can replace NFS
• mails in RADOS
• metadata/indexes in CephFS
• BlueStore, EC
librmb and dovecot rbox
• Open Source, LGPLv2.1, no license costs
• librmb can be used in non-dovecot systems
• still under development
PoC with dovecot in progress
Performance optimization
You are invited to participate!
Try it, test it, give feedback and report bugs!
Contribute!
github.com/ceph-dovecot/
Thank you.
Questions?
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
https://dalgaaf.github.io/SUSECon-2017