Offload PG worker from executing queue_transaction
Alibaba Group
Motivation	
• Currently the PG worker does heavy work
• do_op() is a long, heavy function
• PG_LOCK is held during the entire path
• Can we offload some functions within do_op() to other thread pools and pipeline the PG worker with those threads?
• Start by looking at objectstore->queue_transaction()
Offload some work from PG worker
[Diagram: current path — the Messenger dispatches ops to PG workers, which prepare the op and queue transactions to the object store synchronously.]
[Diagram: proposed path — the Messenger dispatches ops to PG workers, which prepare the op and really just "queue" it to the object store.]
Asynchronously queue_transaction(): the objectstore layer allocates a thread pool to execute the logic that currently runs inside queue_transaction().
Offload queue_transaction() to a thread pool at the objectstore layer, so the PG worker returns and releases the PG lock sooner.
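As a rough illustration of this split (not the actual Ceph/BlueStore code), the sketch below shows a generic objectstore-side worker pool using std-library threading; TransactionWorkerPool and the other names are hypothetical. The PG worker, still holding PG_LOCK, only pushes the prepared transaction work onto a queue and returns, while pool threads execute what queue_transaction() does today.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical objectstore-side worker pool: executes the body of
// queue_transaction() asynchronously so the PG worker can return early.
class TransactionWorkerPool {
public:
  explicit TransactionWorkerPool(size_t nthreads) {
    for (size_t i = 0; i < nthreads; ++i)
      workers.emplace_back([this] { run(); });
  }
  ~TransactionWorkerPool() {
    {
      std::lock_guard<std::mutex> l(lock);
      stopping = true;
    }
    cond.notify_all();
    for (auto& t : workers) t.join();
  }

  // Called by the PG worker while it still holds PG_LOCK:
  // it only enqueues and returns, the heavy work runs in a pool thread.
  void queue(std::function<void()> txn_work) {
    {
      std::lock_guard<std::mutex> l(lock);
      pending.push(std::move(txn_work));
    }
    cond.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> l(lock);
    while (true) {
      cond.wait(l, [this] { return stopping || !pending.empty(); });
      if (stopping && pending.empty()) return;
      auto work = std::move(pending.front());
      pending.pop();
      l.unlock();
      work();   // the heavy part: build store transaction, reserve space, submit aio
      l.lock();
    }
  }

  std::vector<std::thread> workers;
  std::queue<std::function<void()>> pending;
  std::mutex lock;
  std::condition_variable cond;
  bool stopping = false;
};

With this split the PG worker's critical section shrinks to op preparation plus a queue push; the disk I/O and RocksDB sync no longer run under PG_LOCK.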
[Diagram: current BlueStore path — the PG worker creates the BlueStore transaction, reserves disk space and submits aio; a RocksDB ksync worker batch-syncs RocksDB metadata and BlueStore small data writes; the Finisher completes the op.]
[Diagram: proposed path — the PG worker hands off to transaction workers, which create the BlueStore transaction, reserve disk space, submit aio, and sync RocksDB metadata and small data writes individually; the Finisher completes the op.]
Deploy transaction workers to handle transaction requests enqueued by the PG worker, and submit each individual transaction (both data and metadata) within the transaction worker's context.
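A minimal sketch of the per-request work moved into a transaction worker, assuming a RocksDB-backed metadata store as in BlueStore; handle_transaction, the key/value names, and the stubbed steps are illustrative assumptions, not BlueStore code. The point is step 3: the worker commits its own RocksDB batch with a sync write instead of handing it to the batching ksync thread.

#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Hypothetical per-request work executed inside a transaction worker.
// Previously only the data path ran outside the ksync thread; here the
// worker also commits its own metadata (and small data) individually.
void handle_transaction(rocksdb::DB* db,
                        const std::string& onode_key,
                        const std::string& onode_value) {
  // 1. Create the store transaction and reserve disk space (stubbed).
  // 2. Submit aio for the large data write (stubbed).

  // 3. Commit metadata for this transaction individually with a sync
  //    write, instead of handing it to the batching ksync thread.
  rocksdb::WriteBatch batch;
  batch.Put(onode_key, onode_value);

  rocksdb::WriteOptions opts;
  opts.sync = true;                 // fsync the WAL for this commit alone
  rocksdb::Status s = db->Write(opts, &batch);
  assert(s.ok());
}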
Evaluations (1)
• Systems (roughly):
• 4 servers: 1 running the mon and fio processes, 3 running osd processes.
• 12 osd processes on the osd servers, each managing one Intel NVMe drive.
• 25 Gb NIC
• Fio workload:
• numjobs=32 or 64
• bs=4KB
• Sequential write and random write
Evaluations (2)
• Bandwidth (MB/s)
Note: the difference between the "orange" and "grey" bars is that the orange bars still use the ksync thread to commit RocksDB transactions, while the grey bars commit RocksDB transactions within the transaction worker context.
Analysis
• For seq-write, more io goes to the same PG within a small time window, so offloading the PG worker helps reduce PG_LOCK contention
• Some work can be pipelined between the PG worker and the transaction workers
• Committing RocksDB transactions individually seems to help little compared with doing it in a batch
• Does RocksDB internally serialize journal write events? (see the snippet after this list)
• Any other way to reduce RocksDB sync latency?
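On the journal-serialization question, an illustrative snippet (an assumption about the behavior, not code from the patch): even when every transaction worker issues its own sync write, RocksDB's write path serializes the WAL append/fsync and typically group-commits concurrent writers, which would explain why individual commits perform much like the batched ksync path.

#include <cassert>
#include <string>
#include <thread>
#include <vector>
#include <rocksdb/db.h>

// Several workers issuing their own sync writes concurrently: RocksDB
// serializes the WAL append/fsync and group-commits concurrent writers,
// so "individual" commits still share WAL syncs internally.
void concurrent_sync_writes(rocksdb::DB* db, int nworkers) {
  std::vector<std::thread> workers;
  for (int i = 0; i < nworkers; ++i) {
    workers.emplace_back([db, i] {
      rocksdb::WriteOptions opts;
      opts.sync = true;   // each worker asks for its own durable commit
      rocksdb::Status s =
          db->Put(opts, "key-" + std::to_string(i), "value");
      assert(s.ok());
    });
  }
  for (auto& t : workers) t.join();
}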
Summary
• We are trying to reduce individual io latency as well as improve IOPS
• IOPS may be improved by reducing lock or other resource contention
• Latency can be improved by simplifying Ceph's existing IO path
• For example, if we don't need snapshot support, can we do better?
• We would like to hear comments and feedback from the Ceph community
