Which Hypervisor is Best?
MySQL on Ceph
3:30pm – 4:20pm
Room 203
WHOIS
Kyle Bader
Storage Solution Architectures
Red Hat
Yves Trudeau
Principal Architect
Percona
AGENDA
• Ceph Architecture Elevator Pitch
• Tuning Ceph Block (RBD)
• Tuning QEMU Block Virtualization
• Benchmarks
Ceph Architecture
ARCHITECTURAL COMPONENTS
• RGW: A web services gateway for object storage, compatible with S3 and Swift
• LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
• RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
• RBD: A reliable, fully-distributed block device with cloud platform integration
• CEPHFS: A distributed file system with POSIX semantics and scale-out metadata
Linux Containers vs. Virtual Machines
KVM/QEMU RBD BACKEND
RADOS CLUSTER
PERCONA ON KRBD
RADOS CLUSTER
TUNING CEPH BLOCK
TUNING CEPH BLOCK
• Format
• Order
• Fancy Striping
• TCP_NO_DELAY
RBD FORMAT
• Format 1
• Deprecated
• Supported by all versions of Ceph
• No reason to use it in a greenfield environment
• Format 2
• New, default format
• Supports snapshots and clones
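For reference, a minimal sketch of creating a format 2 image with the rbd CLI (pool and image names here are examples; on recent Ceph releases format 2 is already the default):

    # Create a 100 GB format 2 RBD image in the "rbd" pool
    rbd create mysql-data --pool rbd --size 102400 --image-format 2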
RBD ORDER
• The chunk / striping boundary for the block device
• Default is 4MB, i.e. order 22
• 4MB = 2^22 bytes
• Used default during our testing
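A hedged sketch of overriding the order at image creation time (older rbd releases take --order, newer ones express the same thing as an object size; values here are illustrative, not what was used in these tests):

    # 8 MB objects instead of the 4 MB default (order 23 = 2^23 bytes)
    rbd create mysql-data --pool rbd --size 102400 --image-format 2 --order 23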
RBD: Fancy Striping
• Only available to QEMU / librbd
• Finer-grained striping that spreads small writes across multiple objects (and OSDs)
• Helps with some HDD workloads
• Used default during our testing
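A hedged sketch of enabling fancy striping at image creation (the stripe-unit/stripe-count values are illustrative only, since the defaults were used in these tests):

    # 64 KB stripe unit spread across 16 objects per stripe
    rbd create mysql-data --pool rbd --size 102400 --image-format 2 \
        --stripe-unit 65536 --stripe-count 16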
TCP_NO_DELAY
• Disables Nagle's algorithm (small-packet coalescing)
• Important for latency-sensitive workloads
• Good for maximizing IOPS -> MySQL
• Default in QEMU
• Default in KRBD
• Added in mainline kernel 4.2
• Backported to RHEL 7.2 (kernel 3.10-236+)
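For librbd clients the relevant knob is the messenger option, which already defaults to true; a hedged ceph.conf sketch for making it explicit (section placement under [client] is an assumption):

    [client]
        # disable Nagle on client messenger sockets (true is the default)
        ms tcp nodelay = true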
TUNING QEMU
BLOCK VIRTUALIZATION
TUNING QEMU BLOCK
• Paravirtual Devices
• AIO Mode
• Caching
• x-data-plane
• num_queues
QEMU: PARAVIRTUAL DEVICES
• Virtio-blk
• Virtio-scsi
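A hedged sketch of how the two paravirtual block devices are wired up on the QEMU command line (pool/image and drive IDs are examples):

    # virtio-blk: one PCI device per disk
    -drive file=rbd:rbd/mysql-data,format=raw,if=none,id=drive0,cache=none \
    -device virtio-blk-pci,drive=drive0

    # virtio-scsi: one controller, disks attach as SCSI LUNs behind it
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=rbd:rbd/mysql-data,format=raw,if=none,id=drive0,cache=none \
    -device scsi-hd,drive=drive0,bus=scsi0.0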
QEMU: AIO MODE
• Threads
• Software implementation of AIO using a thread pool
• Native
• Uses kernel AIO (io_submit)
• The way to go in the future
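A hedged sketch of selecting the AIO mode on the QEMU command line (image name is an example; aio=native generally requires a direct-IO cache mode such as cache=none or cache=directsync):

    # thread-pool AIO
    -drive file=rbd:rbd/mysql-data,format=raw,if=none,id=drive0,cache=none,aio=threads

    # Linux-native kernel AIO
    -drive file=rbd:rbd/mysql-data,format=raw,if=none,id=drive0,cache=none,aio=native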
QEMU: CACHING
Cache mode             Writeback   None      Writethrough   Directsync
Uses Host Page Cache   Yes         No        Yes            No
Guest Disk WCE         Enabled     Enabled   Disabled       Disabled
rbd_cache              True        False     True           False
rbd_max_dirty          25165824    0         0              0
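A hedged sketch of what the cache=none column looks like in practice (drive line plus the librbd settings it implies; names are examples):

    # cache=none: bypass the host page cache, librbd cache off
    -drive file=rbd:rbd/mysql-data,format=raw,if=none,id=drive0,cache=none,aio=native

    # equivalent librbd settings in ceph.conf
    [client]
        rbd cache = false
        rbd cache max dirty = 0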
QEMU: Timers
• Block storage benchmark tool – fio
• Very frequent access to CPU timing registers
• Accesses need to be emulated
• Can block main QEMU event loop with concurrent high IO load
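One way to see whether the guest is paying for emulated timer reads is to check the guest clocksource; fio can also be told to time IO with a syscall clock rather than raw CPU timestamp reads. A hedged sketch (device path and job parameters are examples):

    # inside the guest: which clocksource is in use (kvm-clock, tsc, ...)
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource

    # time IO with clock_gettime() instead of reading the TSC directly
    fio --name=randrw --filename=/dev/vdb --rw=randrw --bs=16k --iodepth=32 \
        --ioengine=libaio --direct=1 --clocksource=clock_gettime --runtime=60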
BENCHMARKS
BENCHMARKS
• Sysbench OLTP, 32 tables of 28M rows each, ~200GB
• MySQL config: 50GB buffer pool, 8MB log file size, ACID
• Filesystem: XFS with noatime, nodiratime, nobarrier
• Data reloaded before each test
• 100% reads: --oltp-point-select=100
• 100% writes: --oltp-index-updates=100
• 70%/30% reads/writes: --oltp-index-updates=28 --oltp-point-select=70
--rand-type=uniform
• 20 minute run time per test, iterations averaged
• 64 threads, 8 cores
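For reference, a minimal sketch of the MySQL settings and the 100%-read run above, assuming sysbench 0.5's OLTP Lua script; paths, credentials, and exact option spellings vary by version (the deck's flag names are kept), and "ACID" is read here as the usual full-durability settings:

    # my.cnf fragment matching the slide (50GB buffer pool, 8MB log file;
    # "ACID" assumed to mean durable flush on every commit)
    [mysqld]
    innodb_buffer_pool_size        = 50G
    innodb_log_file_size           = 8M
    innodb_flush_log_at_trx_commit = 1
    sync_binlog                    = 1

    # 100% point-select run: 32 tables x 28M rows, 64 threads, 20 minutes
    sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
        --oltp-tables-count=32 --oltp-table-size=28000000 \
        --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
        --num-threads=64 --max-time=1200 --max-requests=0 \
        --oltp-point-select=100 --rand-type=uniform run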
BASIC QEMU PERFORMANCE
[Bar chart: IOPS for Reads, Writes, and R/W 70/30 across qemu tcg, qemu-kvm-default, io=threads cache=none, and io=native cache=none; y-axis 0–35,000 IOPS]
THREAD CACHING MODES
[Bar chart: IOPS for Reads, Writes, and R/W 70/30 across io=threads cache=none, io=threads cache=writethrough, and io=threads cache=writeback; y-axis 0–30,000 IOPS]
DEDICATED DISPATCH THREADS
[Bar chart: IOPS for Reads, Writes, and R/W 70/30 across io=native cache=none, io=native cache=directsync, and io=native cache=directsync with iothread=1 and iothread=2; y-axis 0–35,000 IOPS]
DATA PLANE AND VIRTIO-SCSI QUEUES
[Bar chart: IOPS for Reads, Writes, and R/W 70/30 across x-data-plane; virtio-scsi num-queues=4; virtio-scsi num-queues=2, vectors=3; and virtio-scsi num-queues=4, vectors=5; y-axis 0–40,000 IOPS]
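A hedged sketch of the QEMU flags behind the configurations compared above (x-data-plane is the legacy experimental spelling from the QEMU 1.x/2.0 era; drive IDs and queue counts are examples taken from the chart labels):

    # virtio-blk with a dedicated data-plane thread (legacy experimental flag)
    -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on

    # virtio-scsi controller with multiple request queues and MSI-X vectors
    -device virtio-scsi-pci,id=scsi0,num_queues=4,vectors=5 \
    -device scsi-hd,drive=drive0,bus=scsi0.0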
CONTAINERS AND METAL
[Bar chart: IOPS for Reads, Writes, and R/W 70/30 across metal (taskset -c 10-17), lxc (cgroup cpu 10-17), io=threads cache=none, io=native cache=none, and virtio-scsi num-queues=2, vectors=3; y-axis 0–60,000 IOPS]
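For reference, a hedged sketch of the CPU pinning behind the metal and LXC cases (the core range is copied from the chart labels; the mysqld invocation and the LXC key name, which depends on the cgroup layout, are assumptions):

    # bare metal: pin mysqld to cores 10-17
    taskset -c 10-17 mysqld_safe &

    # LXC: restrict the container to the same cores (cgroup v1 style key)
    lxc.cgroup.cpuset.cpus = 10-17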
THANK YOU
