Offload PG worker from executing queue_transaction
Alibaba Group
Motivation	
• Currently the PG worker does heavy work
• do_op() is a long, heavy function
• PG_LOCK is held during the entire path
• Can we offload some functions within do_op() to other thread pools and pipeline the PG worker with those threads?
• Start by looking at objectstore->queue_transaction()
Offload some work from PG worker
[Diagram: current path — the Messenger dispatches ops to PG workers, which prepare the op and queue transactions to the object store synchronously.]
[Diagram: proposed path — the Messenger dispatches ops to PG workers, which prepare the op and really just "queue" it to the object store.]
Asynchronously queue_transaction(): the objectstore layer allocates a thread pool to execute the logic that currently runs inside queue_transaction().
Offload queue_transaction() to a thread pool at the objectstore layer, so the PG worker returns and releases the PG lock sooner.
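As a rough illustration of this split (not the actual Ceph/BlueStore code), the sketch below shows a generic objectstore-side worker pool using std-library threading; TransactionWorkerPool and the other names are hypothetical. The PG worker, still holding PG_LOCK, only pushes the prepared transaction work onto a queue and returns, while pool threads execute what queue_transaction() does today.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical objectstore-side worker pool: executes the body of
// queue_transaction() asynchronously so the PG worker can return early.
class TransactionWorkerPool {
public:
  explicit TransactionWorkerPool(size_t nthreads) {
    for (size_t i = 0; i < nthreads; ++i)
      workers.emplace_back([this] { run(); });
  }
  ~TransactionWorkerPool() {
    {
      std::lock_guard<std::mutex> l(lock);
      stopping = true;
    }
    cond.notify_all();
    for (auto& t : workers) t.join();
  }

  // Called by the PG worker while it still holds PG_LOCK:
  // it only enqueues and returns, the heavy work runs in a pool thread.
  void queue(std::function<void()> txn_work) {
    {
      std::lock_guard<std::mutex> l(lock);
      pending.push(std::move(txn_work));
    }
    cond.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> l(lock);
    while (true) {
      cond.wait(l, [this] { return stopping || !pending.empty(); });
      if (stopping && pending.empty()) return;
      auto work = std::move(pending.front());
      pending.pop();
      l.unlock();
      work();   // the heavy part: build store transaction, reserve space, submit aio
      l.lock();
    }
  }

  std::vector<std::thread> workers;
  std::queue<std::function<void()>> pending;
  std::mutex lock;
  std::condition_variable cond;
  bool stopping = false;
};

With this split the PG worker's critical section shrinks to op preparation plus a queue push; the disk I/O and RocksDB sync no longer run under PG_LOCK.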
[Diagram: current BlueStore path — the PG worker creates the BlueStore transaction, reserves disk space and submits aio; a RocksDB ksync worker batch-syncs RocksDB metadata and BlueStore small data writes; the Finisher completes the op.]
[Diagram: proposed path — the PG worker hands off to transaction workers, which create the BlueStore transaction, reserve disk space, submit aio, and sync RocksDB metadata and small data writes individually; the Finisher completes the op.]
Deploy transaction workers to handle transaction requests enqueued by the PG worker, and submit each individual transaction (both data and metadata) within the transaction worker's context.
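A minimal sketch of the per-request work moved into a transaction worker, assuming a RocksDB-backed metadata store as in BlueStore; handle_transaction, the key/value names, and the stubbed steps are illustrative assumptions, not BlueStore code. The point is step 3: the worker commits its own RocksDB batch with a sync write instead of handing it to the batching ksync thread.

#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Hypothetical per-request work executed inside a transaction worker.
// Previously only the data path ran outside the ksync thread; here the
// worker also commits its own metadata (and small data) individually.
void handle_transaction(rocksdb::DB* db,
                        const std::string& onode_key,
                        const std::string& onode_value) {
  // 1. Create the store transaction and reserve disk space (stubbed).
  // 2. Submit aio for the large data write (stubbed).

  // 3. Commit metadata for this transaction individually with a sync
  //    write, instead of handing it to the batching ksync thread.
  rocksdb::WriteBatch batch;
  batch.Put(onode_key, onode_value);

  rocksdb::WriteOptions opts;
  opts.sync = true;                 // fsync the WAL for this commit alone
  rocksdb::Status s = db->Write(opts, &batch);
  assert(s.ok());
}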
Evaluations (1)
• Systems (roughly):
• 4 servers: 1 running the mon and fio processes, 3 running osd processes.
• 12 osd processes on the osd servers, each managing one Intel NVMe drive.
• 25 Gb NIC
• Fio workload:
• numjobs=32 or 64
• bs=4KB
• Sequential write and random write
Evaluations (2)
• Bandwidth (MB/s)
Note: the difference between the "orange" and "grey" bars is that the orange bars still use the ksync thread to commit RocksDB transactions, while the grey bars commit RocksDB transactions within the transaction worker context.
Analysis
• For seq-write, more io goes to the same PG within a small time window, so offloading the PG worker helps reduce PG_LOCK contention
• Some work can be pipelined between the PG worker and the transaction workers
• Committing RocksDB transactions individually seems to help little compared with doing it in a batch
• Does RocksDB internally serialize journal write events? (see the snippet after this list)
• Any other way to reduce RocksDB sync latency?
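On the journal-serialization question, an illustrative snippet (an assumption about the behavior, not code from the patch): even when every transaction worker issues its own sync write, RocksDB's write path serializes the WAL append/fsync and typically group-commits concurrent writers, which would explain why individual commits perform much like the batched ksync path.

#include <cassert>
#include <string>
#include <thread>
#include <vector>
#include <rocksdb/db.h>

// Several workers issuing their own sync writes concurrently: RocksDB
// serializes the WAL append/fsync and group-commits concurrent writers,
// so "individual" commits still share WAL syncs internally.
void concurrent_sync_writes(rocksdb::DB* db, int nworkers) {
  std::vector<std::thread> workers;
  for (int i = 0; i < nworkers; ++i) {
    workers.emplace_back([db, i] {
      rocksdb::WriteOptions opts;
      opts.sync = true;   // each worker asks for its own durable commit
      rocksdb::Status s =
          db->Put(opts, "key-" + std::to_string(i), "value");
      assert(s.ok());
    });
  }
  for (auto& t : workers) t.join();
}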
Summary
• We are trying to reduce individual io latency as well as improve IOPS
• IOPS may be improved by reducing lock or other resource contention
• Latency can be improved by simplifying Ceph's existing IO path
• For example, if we don't need snapshot support, can we do better?
• We would like to hear comments and feedback from the Ceph community
