XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix

Storage systems continue to deliver better performance year after year. High performance solutions are now available off-the-shelf, allowing users to boost their servers with drives capable of achieving several GB/s worth of throughput per host. To fully utilise such devices, workloads with large queue depths are often necessary. In virtual environments, this translates into aggregate workloads coming from multiple virtual machines.
Having previously addressed the impact of low-latency devices on virtualised platforms, we now turn to optimising aggregate workloads. We will discuss the memory grant technologies available in Xen and compare the trade-offs and performance implications of each: grant mapping, persistent grants and grant copy. For the first time, we will present grant copy as an alternative and show measurements of over 7 GB/s, maxing out a set of local SSDs.


Presentation Transcript

  • Scaling Xen’s Aggregate Storage Performance: Going double digits on a single host. Felipe Franciosi, XenServer Engineering Performance Team. e-mail: felipe.franciosi@citrix.com | freenode: felipef #xen-api | twitter: @franciozzy
  • Agenda
    • The dimensions of storage performance ๏ What exactly are we trying to measure?
    • State of the art ๏ blkfront, blkback, blktap2+tapdisk, tapdisk3, qemu-qdisk ๏ trade-offs between traditional grant mapping, persistent grants and grant copy
    • Aggregate measurements ๏ Pushing the boundaries with very, very fast local storage
    • Where to go next?
  • The Dimensions of Storage Performance: What exactly are we trying to measure?
  • The Dimensions of Storage Performance
    • You have probably seen this:
      # hdparm -t /dev/sda
      /dev/sda:
       Timing buffered disk reads: 1116 MB in 3.00 seconds = 371.70 MB/sec
      # dd if=/dev/sda of=/dev/null bs=1M count=100 iflag=direct
      100+0 records in
      100+0 records out
      104857600 bytes (105 MB) copied, 0.269689 s, 389 MB/s
    • The average user will usually: ๏ Run a synthetic benchmark on a bare metal environment ๏ Repeat the test on a virtual machine ๏ Draw conclusions without seeing the full picture
  • The Dimensions of Storage Performance (chart: throughput vs. log(block size))
  • The Dimensions of Storage Performance (chart: throughput vs. log(block size), plus the other dimensions that affect it: sequentiality, number of threads, LBA, IO depth, C/P states, configuration, temperature, IO engine, noise, read ahead, direction (read/write))
  • The Dimensions of Storage Performance
    • The simplest of all cases: ๏ single thread ๏ iodepth=1 ๏ direct IO ๏ sequential
    • Extra notes: ๏ BIOS perf. mode set to OS ๏ Fans set to maximum power ๏ Xen Scaling Governor set to Performance (forces P0) ๏ Maximum C-State set to 1 ๏ No pinning ๏ Creedence #87433 (Kernel 3.10 + Xen 4.4)
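    A minimal fio invocation along these lines would reproduce the “simplest of all cases” above; the device path /dev/xvdb, the job name and the 30-second runtime are placeholders rather than values taken from the deck:

      # one job, queue depth 1, direct IO, sequential reads; vary --bs to walk the block-size axis
      fio --name=seqread --filename=/dev/xvdb \
          --rw=read --direct=1 --ioengine=libaio \
          --numjobs=1 --iodepth=1 --bs=4k \
          --runtime=30 --time_based --group_reporting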
  • The Dimensions of Storage Performance
    • Pushing boundaries a bit: ๏ multiple threads ๏ iodepth=1 ๏ direct IO ๏ sequential “kind of”
    • Extra notes: same environment settings as the previous slide (Creedence #87433, Kernel 3.10 + Xen 4.4)
  • The Dimensions of Storage Performance
    • Comparing dom0 vs. domU: ๏ single thread vs. single VM ๏ iodepth=1 ๏ direct IO ๏ sequential
    • Extra notes: same environment settings as the previous slide (Creedence #87433, Kernel 3.10 + Xen 4.4)
  • The Dimensions of Storage Performance
    • Comparing dom0 vs. domU: ๏ many threads vs. many VMs ๏ iodepth=1 ๏ direct IO ๏ sequential “kind of”
    • Extra notes: same environment settings as the previous slide (Creedence #87433, Kernel 3.10 + Xen 4.4)
  • State of the Art: the trade-offs between the technologies
  • State of the Art: traditional grant mapping
    (diagram: in domU, user apps go through libc/libaio, the vfs and the block layer to blkfront; blkfront talks to blkback in dom0 over Xen’s blkif protocol, i.e. shared memory and event channels; requests are associated with pages in the guest’s memory space, which blkback maps as foreign pages via page grants before submitting them through dom0’s block layer and device driver to the VDI)
    • Pros: ๏ no copies involved ๏ low-latency alternative (when done in kernel)
    • Cons: ๏ not “network-safe” ๏ hard on grant tables
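    One rough way to watch how hard a mapping-based backend is on the grant tables is Xen’s debug keys; this assumes the ‘g’ key (grant table usage dump) is available in the Xen build in use:

      # ask Xen to dump grant table usage into the hypervisor log, then read it back
      xl debug-keys g
      xl dmesg | tail -n 40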
  • State of the Art: persistent grants
    (diagram: same path as before, but the grants are persistent: blkfront memcpy()s data from/to a set of persistently granted pages on demand, and blkback keeps those foreign pages mapped instead of re-granting per request)
    • Pros: ๏ easy on grant tables ๏ copies on the front end
    • Cons: ๏ not “network-safe” ๏ copies involved
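    Whether a given VBD actually negotiated persistent grants can be checked in xenstore; the path below follows the usual blkback layout, with the frontend domid (5) and device id (51712, i.e. xvda) used as examples only:

      # "1" means blkback advertised persistent grants for this VBD
      xenstore-read /local/domain/0/backend/vbd/5/51712/feature-persistent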
  • State of the Art: tapdisk2 + blktap2 + blkback
    (diagram: blkfront’s requests reach blkback as before; blktap2 copies the data to local pages and exposes them through a TAP device to tapdisk2 in dom0 user space, which then drives the VDI via libaio, the vfs and the block layer)
    • Pros: ๏ “network-safe” ๏ neat features (VHD)
    • Cons: ๏ copies involved ๏ uses lots of memory ๏ hard on grant tables
  • State of the Art: grant copy
    (diagram: tapdisk3 in dom0 user space handles the blkif protocol directly through the gntdev and evtchn devices; it issues grant copy commands via the “gntdev”, Xen copies the data across domains into local pages, and tapdisk3 submits those to the VDI via libaio, the vfs and the block layer)
    • Pros: ๏ “network-safe” ๏ easy on grant tables ๏ neat features (VHD)
    • Cons: ๏ copies involved (back end) ๏ uses lots of memory
  • State of the Art: technologies comparison
                        extra copies    network-safe   low-latency potential    “neat” features            easy on grant tables
    grant mapping       N               N              Y (if done in blkback)   depends (not in blkback)   N
    persistent grants   Y (front end)   N              N                        Y (qcow in qemu-qdisk)     Y
    grant copy          Y (back end)    Y              N                        Y (vhd in tapdisk3)        Y
    blktap2             Y (back end)    Y              N                        Y                          N
  • Aggregate Measurements: Going double digits on a single host
  • Aggregate Measurements • Test environment:
    ๏ Dell PowerEdge R720
      • Intel E5-2643 v2 @ 3.50 GHz (2 sockets, 6 cores/socket, HT enabled); unless stated otherwise: 24 vCPUs to dom0, 2 vCPUs to each guest
      • 64 GB of RAM; unless stated otherwise: 4 GB to dom0, 512 MB to each guest
      • BIOS settings: power regulators set to “Performance per Watt (OS)”, C-States disabled, Xen Scaling Governor set to “Performance”
    ๏ Storage: 4 x Micron P320h, 2 x Intel P3700, 1 x Fusion-io ioDrive2
  • Aggregate Measurements
    (diagram: each of the 7 SSDs is an SR carved into 10 LVs, lv01…lv10; each of the 10 VMs gets one virtual disk per SR, so VM 01 uses lv01 on SRs 1 through 7, VM 02 uses lv02, and so on up to VM 10 using lv10)
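    A carve-up like the one in the diagram can be recreated with plain LVM; the device name, volume group name and LV size below are placeholders, and the deck does not say how the SRs were actually created:

      # turn one SSD into a volume group and carve ten equal LVs out of it
      pvcreate /dev/nvme0n1
      vgcreate ssd1 /dev/nvme0n1
      for i in $(seq -w 1 10); do
          lvcreate -L 20G -n lv$i ssd1
      done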
  • Aggregate Measurements • Baseline ๏ Measurements from dom0 ๏ Each line corresponds to a group of 7 threads (one for each disk) ๏ Some of the drives respond faster for small block sizes and a single thread
  • Aggregate Measurements • qemu-qdisk ๏ Persistent grants disabled ๏ With O_DIRECT
  • Aggregate Measurements • qemu-qdisk ๏ Persistent grants enabled ๏ With O_DIRECT ๏ Apparent bottleneck was a single process per VM
  • Aggregate Measurements • tapdisk2 + blktap2 ๏ With O_DIRECT ๏ Using blkback from 3.10 ๏ No persistent grants ๏ No indirect-IO
  • Aggregate Measurements • tapdisk2 + blktap2 ๏ With O_DIRECT ๏ Using blkback from 3.16 ๏ Persistent grants ๏ Indirect-IO ๏ Apparent bottleneck on some pvspinlock operations
  • Aggregate Measurements • blkback 3.16 ๏ 8 dom0 vCPUs ๏ 6 domU vCPUs ๏ Persistent grants ๏ Indirect-IO ๏ Apparent bottleneck on some pvspinlock operations
  • Aggregate Measurements • tapdisk3 ๏ Using grant copy ๏ With O_DIRECT ๏ Using libaio ๏ Apparent bottleneck is vCPU utilisation
  • Where To Go Next? Areas for improvement
  • Where To Go Next?
    • Single-VBD performance remains problematic ๏ [1/3] Latency is too high
    (diagram: the IO path VM → virtualisation subsystem → vdi → disk, with per-hop latencies ranging from ~ns through ~us to ~ms; plus the throughput vs. log(block size) curve)
  • Where To Go Next?
    • Single-VBD performance remains problematic ๏ [2/3] IO depth is limited to 32
    (diagram: VM blkfront → blkback / qdisk / tapdisk → vdi → disk, with a ring of 32 requests)
    • Are these workloads realistic? • We can use multi-page rings! • But…
  • Where To Go Next?
    • Single-VBD performance remains problematic ๏ [3/3] Backend is single threaded
    (diagram: the VM blkfront → blkback / qdisk / tapdisk → vdi → disk path, with fio { numjobs = 1, iodepth = ___, blksz = 4k, rw = read } issuing io_submit() / io_getevents(); the device itself sustains ~400k IOPS at 4k sequential reads)
      {1, 1, 4k} = 15k IOPS, 15 %
      {1, 8, 4k} = 70k IOPS, 35 %
      {1, 16, 4k} = 110k IOPS, 55 %
      {1, 24, 4k} = 165k IOPS, 85 %
      {1, 32, 4k} = 190k IOPS, 100 %
      {1, 64, 4k} = 195k IOPS, 100 %
      {5, 32, 4k} = 415k IOPS, 55 % (each)
  • Where To Go Next?
    • Single-VBD performance remains problematic ๏ [3/3] Backend is single threaded
    (same diagram and fio job as the previous slide; dom0 CPU utilisation shown in parentheses)
      {1, 1, 4k} = 10k IOPS, 30 % (30 % in dom0)
      {1, 8, 4k} = 50k IOPS, 75 % (75 % in dom0)
      {1, 16, 4k} = 70k IOPS, 100 % (100 % in dom0)
      {1, 32, 4k} = 110k IOPS, 120 % (100 % in dom0)
      {4, 32, 4k} = 115k IOPS, 400 % (125 % in dom0)
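    The {numjobs, iodepth, blocksize} tuples above map onto fio runs along these lines; the device path and runtime are placeholders, and the exact job files behind the numbers are not included in the deck:

      # sweep the queue depth at 4k sequential reads, one run per depth
      for qd in 1 8 16 32 64; do
          fio --name=qd$qd --filename=/dev/xvdb \
              --rw=read --direct=1 --ioengine=libaio \
              --numjobs=1 --iodepth=$qd --bs=4k \
              --runtime=30 --time_based --group_reporting
      done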
  • Where To Go Next?
    • Many-VBD performance could be much better:
      ๏ Both persistent grants and grant copy are interesting alternatives: • tapdisk3 with grant copy is network-“friendly” and has one process per VBD • qdisk with persistent grants does the copy on the front end
      ๏ But both add extra copies to the data path: • We should be avoiding copies… :-/ - Grant operations need to scale better - The network retransmission issues need to be addressed
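    On a host using tapdisk3, the “one process per VBD” point can be observed with blktap’s tap-ctl utility (assuming the blktap tools are installed; the exact output format varies between blktap versions):

      # lists one tapdisk process per attached VBD, with pid, minor and backing image
      tap-ctl list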
  • e-mail: felipe.franciosi@citrix.com freenode: felipef #xen-api twitter: @franciozzy
  • Support Slides • Usage of O_DIRECT with QDISK vs. 1 x Micron P320h • Number of dom0 vCPUs on Creedence #87433 + blkback from 3.16 • Temperature effects on storage performance
  • Usage of O_DIRECT in qemu-qdisk • qemu-qdisk ๏ Without O_DIRECT (default) ๏ Faster for small block sizes ๏ Faster for single-VM ๏ Scalability issue (investigation pending)
  • Usage of O_DIRECT in qemu-qdisk • qemu-qdisk ๏ With O_DIRECT (directiosafe=1) ๏ Slower for small block sizes ๏ Slower for single-VM ๏ Scales much better
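    With a plain xl toolstack, the “with O_DIRECT” case corresponds to the direct-io-safe disk flag for qemu-qdisk; this is a sketch only, the target path and vdev are placeholders, and the flag is assumed to be available in the xl version in use:

      # fragment of an xl guest config: a raw LV served by qemu-qdisk, with O_DIRECT allowed
      disk = [ 'format=raw, vdev=xvdb, access=rw, backendtype=qdisk, direct-io-safe, target=/dev/ssd1/lv01' ]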
  • Impact of dom0 vCPU count • XenServer Creedence #87433 ๏ Kernel 3.10 + internal PQ • blkback backported from 3.16 • LVs plugged directly to guests • Throughput sinks with ๏ larger blocks ๏ increased number of guests • oprofile suggests pvspinlock
  • Impact of dom0 vCPU count • XenServer Creedence #87433 ๏ Kernel 3.10 + internal PQ • blkback backported from 3.16 • LVs plugged directly to guests • Giving fewer vCPUs to dom0: aggregate throughput improves
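    On plain Xen, restricting dom0’s vCPU count is normally done from the hypervisor command line; the values below mirror the 8 dom0 vCPUs / 4 GB dom0 case mentioned earlier, and how the XenServer build applies the equivalent setting is not shown in the deck:

      # Xen command-line options (appended to the xen.gz line in the bootloader config):
      # cap dom0 at 8 vCPUs and give it a fixed 4 GiB of memory
      dom0_max_vcpus=8 dom0_mem=4096M,max:4096M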
  • Temperature Effects on Storage Performance • Workload keeps pCPUs busy with large block sizes • iDRAC Settings > Thermal > Thermal Base Algorithm set to “Maximum Performance”
  • Temperature Effects on Storage Performance • Workload keeps pCPUs busy with large block sizes • iDRAC Settings > Thermal > Thermal Base Algorithm set to “Auto” • Effects very noticeable with 3 or more guests