QEMU and Xen:
Reducing the Attack
Surface
Paul Durrant,
Senior Principal Software Engineer,
Citrix Systems R&D
1
Acknowledgements:
● Ian Jackson
● George Dunlap
● Anthony Perard
2
Background:
How do we use QEMU
with Xen?
3
Paravirtual Backend
Service Domain Guest Domain
Xen
4
I/O Drivers
Frontend Frontend
QEMU
Backend Backend
Kernel
User
XenStore
Paravirtual Backend
Service Domain Guest Domain
Xen
5
Frontend Frontend
QEMU
Backend Backend
Kernel
User
Ring and data shared
directly using grants
XenStore
I/O Drivers
Paravirtual Backend
Service Domain Guest Domain
Xen
6
Frontend Frontend
QEMU
Backend Backend
Kernel
User
XenStore
Service domain trusted by guest
but doesn’t need mapping privilege
I/O Drivers
Paravirtual Backend
Service Domain Guest Domain
Xen
7
Frontend Frontend
QEMU
Backend Backend
Kernel
User
XenStore
I/O Drivers
Backends do use
hypercalls
I/O Emulation
Emulator Domain Guest Domain
Xen
8
Driver Driver
QEMU
Device
Model
Device
Model
Kernel
User
IO / MMIO
I/O Drivers
IOREQ
Server
I/O Emulation
Emulator Domain Guest Domain
Xen
9
Driver Driver
QEMU
Device
Model
Device
Model
Kernel
User
IO / MMIO / PCI
I/O trapped
by Xen and
forwarded
to QEMU
I/O Drivers
IOREQ
Server
I/O Emulation
Emulator Domain Guest Domain
Xen
10
Driver Driver
QEMU
Device
Model
Device
Model
Kernel
User
Emulator domain has mapping
privilege to access data
I/O Drivers
IOREQ
Server
I/O Emulation
Emulator Domain Guest Domain
Xen
11
Driver Driver
QEMU
Device
Model
Device
Model
Kernel
User
IOREQ
Server
I/O Drivers
Emulator uses hypercalls and
has shared memory interface
with Xen
Attack Vector:
IOREQ Pages
12
IOREQ Pages
Emulator Domain Guest Domain
Xen
13
QEMU
Kernel
User
IOREQ
Server
Memory
SYNC
BUFFERED
E820
Reserved
Control pages
allocated from
guest memory
IOREQ Server
creation
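The speaker notes for this slide mention that older QEMU binaries (pre-dating the IOREQ Server API) discover these control pages by reading back a pair of 'magic' HVM parameters. A minimal sketch of that legacy flow, assuming the libxenctrl accessor and parameter names below (neither appears in the deck, so treat them as illustrative):

#include <xenctrl.h>
#include <xen/hvm/params.h>

/* Sketch: how a pre-IOREQ-Server-API QEMU learns where the control pages
 * live. The GFNs come back from guest memory space, which is exactly why
 * the guest can also reach them. */
static int get_ioreq_gfns(xc_interface *xch, domid_t domid,
                          uint64_t *sync_gfn, uint64_t *buf_gfn)
{
    int rc = xc_hvm_param_get(xch, domid, HVM_PARAM_IOREQ_PFN, sync_gfn);

    if (!rc)
        rc = xc_hvm_param_get(xch, domid, HVM_PARAM_BUFIOREQ_PFN, buf_gfn);

    /* The GFNs are then mapped with an ordinary foreign-memory mapping
     * operation, as shown on the next slide. */
    return rc;
}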
IOREQ Pages
Emulator Domain Guest Domain
Xen
14
QEMU
Kernel
User
IOREQ
Server
Memory
SYNC
BUFFERED
E820
Reserved
Then they are
mapped by
QEMU and Xen
Map foreign
memory hypercall
IOREQ Pages
Emulator Domain Guest Domain
Xen
15
QEMU
Kernel
User
IOREQ
Server
Memory
SYNC
BUFFERED
E820
Reserved
Requests and responses are
passed between Xen and QEMU
using the shared pages
IOREQ Pages
Emulator Domain Guest Domain
Xen
16
QEMU
Kernel
User
IOREQ
Server
Memory
E820
Reserved
But the pages could still be
manipulated directly by the guest
SYNC
BUFFERED
Mitigation:
IOREQ Server Enable
17
IOREQ Server Enable
Emulator Domain Guest Domain
Xen
18
QEMU
Kernel
User
IOREQ
Server
Memory
E820
Reserved
The protocol
is not
immediately
operational
SYNC
BUFFERED
IOREQ Server Enable
Emulator Domain Guest Domain
Xen
19
QEMU
Kernel
User
IOREQ
Server
Memory
E820
Reserved
Then they are
mapped by
QEMU
Enable IOREQ
Server hypercall
MFNs are removed from
the guest P2M...
IOREQ Server Enable
Emulator Domain Guest Domain
Xen
20
QEMU
Kernel
User
IOREQ
Server
Memory
E820
Reserved
Then they are
mapped by
QEMU
…better if they were never
there at all
Better Mitigation:
XENMEM_acquire_resource
21
IOREQ Page Mapping
Emulator Domain Guest Domain
Xen
22
QEMU
Kernel
User
IOREQ
Server
Control pages
allocated from
‘Xen’ memory
Create IOREQ
Server hypercall
SYNC
BUFFERED
IOREQ Page Mapping
Emulator Domain Guest Domain
Xen
23
QEMU
Kernel
User
IOREQ
Server
XENMEM_acquire_resource
New hypercall
SYNC
BUFFERED
/*
 * Get the pages for a particular guest resource, so that they can be
 * mapped directly by a tools domain.
 */
#define XENMEM_acquire_resource 28
struct xen_mem_acquire_resource {
    domid_t domid;
    uint16_t type;
    uint32_t id;
    uint32_t nr_frames;
    uint32_t flags;
    uint64_aligned_t frame;
    XEN_GUEST_HANDLE(xen_pfn_t) frame_list;
};
24
XENMEM_acquire_resource
/*
 * Get the pages for a particular guest resource, so that they can be
 * mapped directly by a tools domain.
 */
#define XENMEM_acquire_resource 28
struct xen_mem_acquire_resource {
    domid_t domid;
    uint16_t type;
    uint32_t id;
    uint32_t nr_frames;
    uint32_t flags;
    uint64_aligned_t frame;
    XEN_GUEST_HANDLE(xen_pfn_t) frame_list;
};
25
XENMEM_acquire_resource
XENMEM_resource_ioreq_server
Resource type for all
IOREQ Server control
pages
/*
 * Get the pages for a particular guest resource, so that they can be
 * mapped directly by a tools domain.
 */
#define XENMEM_acquire_resource 28
struct xen_mem_acquire_resource {
    domid_t domid;
    uint16_t type;
    uint32_t id;
    uint32_t nr_frames;
    uint32_t flags;
    uint64_aligned_t frame;
    XEN_GUEST_HANDLE(xen_pfn_t) frame_list;
};
26
XENMEM_acquire_resource
Frame identifiers
distinguish between
SYNC and BUFFERED
pages
XENMEM_resource_ioreq_server_frame_bufioreq
XENMEM_resource_ioreq_server_frame_ioreq(n)
/*
 * Get the pages for a particular guest resource, so that they can be
 * mapped directly by a tools domain.
 */
#define XENMEM_acquire_resource 28
struct xen_mem_acquire_resource {
    domid_t domid;
    uint16_t type;
    uint32_t id;
    uint32_t nr_frames;
    uint32_t flags;
    uint64_aligned_t frame;
    XEN_GUEST_HANDLE(xen_pfn_t) frame_list;
};
27
XENMEM_acquire_resource
If the tools domain is PV then, upon return, frame_list
will be populated with the MFNs of the resource.
If the tools domain is HVM then it is expected that, on
entry, frame_list will be populated with a list of GFNs
that will be mapped to the MFNs of the resource.
Emulator could be
running in either PV or
HVM domain
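Putting the constants from the last few slides together, a hedged sketch of how an emulator might fill in the hypercall argument to acquire the BUFFERED page plus the first SYNC page. The field usage is my reading of the structure above, and do_memory_op() stands in for whatever wrapper actually issues the memory_op hypercall:

#include <xen/xen.h>
#include <xen/memory.h>

static int acquire_ioreq_pages(domid_t guest_domid, uint32_t ioservid,
                               xen_pfn_t frames[2])
{
    struct xen_mem_acquire_resource xmar = {
        .domid     = guest_domid,
        .type      = XENMEM_resource_ioreq_server,
        .id        = ioservid,
        .frame     = XENMEM_resource_ioreq_server_frame_bufioreq,
        .nr_frames = 2,   /* the buffered frame plus the first sync frame */
        .flags     = 0,
    };

    /* PV tools domain: frames[] comes back filled with MFNs.
     * HVM tools domain: frames[] must be pre-loaded with GFNs for Xen to
     * populate in the caller's P2M. */
    set_xen_guest_handle(xmar.frame_list, frames);

    return do_memory_op(XENMEM_acquire_resource, &xmar);   /* hypothetical wrapper */
}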
struct privcmd_mmap_resource {
    domid_t dom;
    __u32 type;
    __u32 id;
    __u32 idx;
    __u64 num;
    __u64 addr;
} privcmd_mmap_resource_t;
28
IOCTL_PRIVCMD_MMAP_RESOURCE
commit 9a80bfbdd23242168a508b950ffdc80f675ce695
Author: Paul Durrant <paul.durrant@citrix.com>
Date: Fri Jul 28 11:22:49 2017 +0100
xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE
Currently queued for
upstream
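A rough userspace sketch of the new ioctl, based only on the structure above. The interpretation of idx, num and addr (starting frame, frame count, and a destination address obtained by mmap()ing the privcmd device beforehand) is an assumption on my part, as is the header location:

#include <stdint.h>
#include <sys/ioctl.h>
#include <xen/memory.h>          /* XENMEM_resource_ioreq_server */
#include <xen/sys/privcmd.h>     /* assumed location of the ioctl definitions */

/* Sketch: map nr frames of an IOREQ server's control pages at addr. */
static int map_ioreq_frames(int privcmd_fd, domid_t guest_domid,
                            uint32_t ioservid, unsigned int nr, void *addr)
{
    struct privcmd_mmap_resource res = {
        .dom  = guest_domid,
        .type = XENMEM_resource_ioreq_server,
        .id   = ioservid,
        .idx  = 0,                  /* first frame */
        .num  = nr,                 /* number of frames */
        .addr = (uintptr_t)addr,    /* from a prior mmap() of /dev/xen/privcmd */
    };

    /* A failure here typically means an older kernel: fall back to the
     * legacy GFN-based mapping path. */
    return ioctl(privcmd_fd, IOCTL_PRIVCMD_MMAP_RESOURCE, &res);
}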
Attack Vector:
Hypercall Memory
Handles
29
Hypercall Memory Handles
Emulator Domain Guest Domain
Xen
30
QEMU
Kernel
User
privcmd
Hypercall Memory Handles
Emulator Domain Guest Domain
Xen
31
Kernel
User
privcmd
Guest attacks and
compromises
QEMU
QEMU
Hypercall Memory Handles
Emulator Domain Guest Domain
Xen
32
QEMU
Kernel
User
privcmd
IOCTL_PRIVCMD_HYPERCALL
QEMU
Hypercall Memory Handles
Emulator Domain Guest Domain
Xen
33
Kernel
User
privcmd HYPERVISOR_???
34
HVMOP_track_dirty_vram
/* Track dirty VRAM. */
#define HVMOP_track_dirty_vram 6
struct xen_hvm_track_dirty_vram {
    /* Domain to be tracked. */
    domid_t domid;
    /* Number of pages to track. */
    uint32_t nr;
    /* First pfn to track. */
    uint64_aligned_t first_pfn;
    /* OUT variable. */
    /* Dirty bitmap buffer. */
    XEN_GUEST_HANDLE_64(uint8) dirty_bitmap;
};
Emulator Domain Memory
QEMU controlled
value
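To make the attack concrete, here is a hedged sketch of what a compromised QEMU could push through the old, unaudited ioctl path. struct privcmd_hypercall is paraphrased from the Linux privcmd header, the header locations and the hypercall-number constant should be checked against the real headers; the point is simply that nothing below is inspected before it reaches Xen:

#include <stdint.h>
#include <sys/ioctl.h>
#include <xen/xen.h>
#include <xen/hvm/hvm_op.h>
#include <xen/sys/privcmd.h>     /* assumed header locations */

static void attack_kernel_memory(int privcmd_fd, domid_t guest_domid,
                                 uint64_t vram_first_pfn, uint32_t vram_nr,
                                 void *kernel_address)
{
    struct xen_hvm_track_dirty_vram op = {
        .domid     = guest_domid,
        .nr        = vram_nr,
        .first_pfn = vram_first_pfn,
    };
    struct privcmd_hypercall call = {
        .op     = __HYPERVISOR_hvm_op,
        .arg[0] = HVMOP_track_dirty_vram,   /* 6, as on the slide above */
    };

    /* dirty_bitmap is entirely QEMU-controlled: point it at a kernel
     * virtual address instead of a QEMU buffer. */
    set_xen_guest_handle(op.dirty_bitmap, (uint8_t *)kernel_address);
    call.arg[1] = (uintptr_t)&op;

    /* privcmd marshals this blindly; Xen then writes the (guest-controlled)
     * bitmap pattern straight to kernel_address in the emulator domain. */
    ioctl(privcmd_fd, IOCTL_PRIVCMD_HYPERCALL, &call);
}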
Hypercall Memory Handles
Emulator Domain Guest Domain
Xen
35
Kernel
User
privcmd
Xen writes to
emulator domain
kernel memory
HVMOP_track_dirty_vram
QEMU
Mitigation:
HYPERCALL_dm_op
36
privcmd
Hypercall Memory
Emulator Domain Guest Domain
Xen
37
QEMU
Kernel
User
libxendevicemodel
privcmd
Hypercall Memory
Emulator Domain Guest Domain
Xen
38
QEMU
Kernel
User
libxendevicemodel
IOCTL_PRIVCMD_DM_OP
HYPERVISOR_dm_op
privcmd
Hypercall Memory
Emulator Domain Guest Domain
Xen
39
QEMU
Kernel
User
libxendevicemodel
IOCTL_PRIVCMD_DM_OP
HYPERVISOR_dm_op
This can be
audited
40
IOCTL_PRIVCMD_DM_OP
struct privcmd_dm_op {
    domid_t dom;
    __u16 num;
    const privcmd_dm_op_buf_t __user *ubufs;
};

struct privcmd_dm_op_buf {
    void __user *uptr;
    size_t size;
};
access_ok()?
Emulator Domain Memory
41
HYPERVISOR_dm_op
HYPERVISOR_dm_op(domid_t domid,
                 unsigned int nr_bufs,
                 xen_dm_op_buf_t bufs[]);

struct xen_dm_op_buf {
    XEN_GUEST_HANDLE(void) h;
    xen_ulong_t size;
};
Emulator Domain Memory
Operation information
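The auditing hinted at by access_ok()? on the previous slide can be sketched as follows. This paraphrases what the privcmd dm_op path does rather than quoting it, and note that the access_ok() calling convention differs between kernel versions:

/* Kernel-side sketch: every buffer handed to dm_op must lie within the
 * calling process's user address space. */
static int audit_dm_op_bufs(const struct privcmd_dm_op_buf __user *ubufs,
                            unsigned int num,
                            struct privcmd_dm_op_buf *kbufs)
{
    unsigned int i;

    if (copy_from_user(kbufs, ubufs, num * sizeof(*kbufs)))
        return -EFAULT;

    for (i = 0; i < num; i++) {
        /* Older kernels spell this access_ok(VERIFY_WRITE, uptr, size). */
        if (!access_ok(kbufs[i].uptr, kbufs[i].size))
            return -EFAULT;
    }

    /* Only now are the buffers repackaged as xen_dm_op_buf entries and
     * handed on via HYPERVISOR_dm_op(). */
    return 0;
}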
Attack Vector:
Hypercall Target
Domain
42
Hypercall Target Domain
A
Emulator Domain Guest Domain
Xen
43
QEMU
Kernel
User
privcmd
libxenforeignmemory B
Guest Domain
IOCTL_PRIVCMD_MMAP
domid = B
Hypercall Target Domain
A
Emulator Domain Guest Domain
Xen
44
QEMU
Kernel
User
privcmd
libxenforeignmemory B
Guest Domain
Target
domain
unaudited
Hypercall Target Domain
A
Emulator Domain Guest Domain
Xen
45
QEMU
Kernel
User
privcmd
libxenforeignmemory B
Guest Domain
Memory mapped
from ‘wrong’
guest
Mitigation:
IOCTL_PRIVCMD_RESTRICT
46
Hypercall Target Domain
A
Emulator Domain Guest Domain
Xen
47
QEMU
Kernel
User
privcmd
libxenforeignmemory B
Guest Domain
IOCTL_PRIVCMD_RESTRICT
domid = A
Handle now restricted to operations on domain A
Hypercall Target Domain
A
Emulator Domain Guest Domain
Xen
48
QEMU
Kernel
User
privcmd
libxenforeignmemory B
Guest Domain
domid = A
IOCTL_PRIVCMD_MMAP
domid = B
Hypercall not
issued
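A hedged sketch of the mechanism described in the speaker notes: the restrict ioctl records a domid against the open file handle, and every later operation that names a domid is checked against it. The structure and function names here are illustrative, not lifted from the driver:

/* Illustrative per-open-file state in privcmd. */
struct privcmd_file_state {
    domid_t restrict_domid;              /* DOMID_INVALID until restricted */
};

static long privcmd_do_restrict(struct privcmd_file_state *st, domid_t domid)
{
    /* One-way: once restricted, a handle can never be widened again. */
    if (st->restrict_domid != DOMID_INVALID && st->restrict_domid != domid)
        return -EACCES;
    st->restrict_domid = domid;
    return 0;
}

static long privcmd_check_target(const struct privcmd_file_state *st,
                                 domid_t domid)
{
    /* mmap and dm_op expose their target domid, so they can be audited
     * here; a restricted handle refuses raw IOCTL_PRIVCMD_HYPERCALL
     * outright because its target cannot be seen at all. */
    if (st->restrict_domid != DOMID_INVALID && st->restrict_domid != domid)
        return -EACCES;
    return 0;
}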
Multiple Handles
Emulator Domain
Xen
49
QEMU
Kernel
User
privcmd
libxenforeignmemory libxendevicemodel libxenstore libxenevtchn libxengnttab
privcmd xenbus gntdev evtchn
QEMU has lots of handles
open to many different
drivers
50
libxenforeignmemory libxendevicemodel libxenstore libxenevtchn libxengnttab
One Library to Rule Them All
QEMU
libxentoolcore
New library
51
libxenforeignmemory libxendevicemodel libxenstore libxenevtchn libxengnttab
Handle Registration
QEMU
libxentoolcore
xentoolcore__register_active_handle()
Other libraries register a restriction
callback for each open handle
52
libxenforeignmemory libxendevicemodel libxenstore libxenevtchn libxengnttab
Handle Restriction
QEMU
libxentoolcore
xentoolcore_restrict_all()
active_handle->restrict_callback()
IOCTL_PRIVCMD_RESTRICT
Restriction ‘aware’
implementation
QEMU makes single call to restrict
all handles
53
libxenforeignmemory libxendevicemodel libxenstore libxenevtchn libxengnttab
Handle Restriction
QEMU
libxentoolcore
xentoolcore_restrict_all()
xentoolcore__restrict_by_dup2_null()
Restriction ‘unaware’
implementation
QEMU still runs as
root
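The shape of the libxentoolcore machinery on the last three slides, sketched with illustrative names (the real declarations live in the Xen tools headers; only the registered-callback idea is taken from the deck):

struct active_handle {
    struct active_handle *next;
    int (*restrict_callback)(struct active_handle *ah, domid_t domid);
};

static struct active_handle *active_handles;

/* Each library registers every handle it opens, plus a callback. */
static void register_active_handle(struct active_handle *ah)
{
    ah->next = active_handles;
    active_handles = ah;
}

/* QEMU makes one call, before the guest is unpaused. */
static int restrict_all(domid_t domid)
{
    struct active_handle *ah;
    int rc = 0;

    for (ah = active_handles; ah; ah = ah->next) {
        /* Restriction-aware libraries issue e.g. IOCTL_PRIVCMD_RESTRICT;
         * unaware ones fall back to dup2()ing the fd onto /dev/null. */
        int r = ah->restrict_callback(ah, domid);

        if (r && !rc)
            rc = r;
    }

    return rc;
}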
54
Let’s not do that...
55
56
[pauldu@brixham:~] /usr/local/lib/xen/bin/qemu-system-i386 --help
.
.
.
-runas user change to user id user just before starting the VM
.
.
.
QEMU Command Line
57
[pauldu@brixham:~] /usr/local/lib/xen/bin/qemu-system-i386 --help
.
.
.
-runas user change to user id user just before starting the VM
.
.
.
Not actually a UID
but a user name
QEMU Command Line
58
Shared UID Problem
QEMU
Shared UID
QEMU QEMU
59
Shared UID Problem
QEMU
Shared UID
QEMU QEMU
Compromised
process
60
Shared UID Problem
QEMU
Shared UID
QEMU QEMU
ptrace(2)
Compromised
process
61
UID per VM
domid space: < 16 bits
uid space: 32 bits
Space is much larger so
reserve a region
System
reserved
region
UID base
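The reservation scheme boils down to a fixed offset; a trivial sketch (the base is simply whatever UID range the administrator sets aside):

#include <assert.h>
#include <sys/types.h>
#include <xen/xen.h>    /* domid_t, DOMID_FIRST_RESERVED */

/* One UID per VM: domid space is < 16 bits, so a contiguous block of
 * DOMID_FIRST_RESERVED UIDs starting at 'base' is enough. */
static uid_t qemu_uid_for_domain(uid_t base, domid_t domid)
{
    assert(domid < DOMID_FIRST_RESERVED);
    return base + domid;
}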
62
[pauldu@brixham:~] /usr/local/lib/xen/bin/qemu-system-i386 --help
.
.
.
-runas user change to user id user just before starting the VM
.
.
.
This is going to be
awkward
QEMU Command Line
63
[pauldu@brixham:~] /usr/local/lib/xen/bin/qemu-system-i386 --help
.
.
.
-runas user change to user id user just before starting the VM
.
.
.
commit 2c42f1e80103cb926c0703d4c1ac1fb9c3e2c600
Author: Ian Jackson <ian.jackson@eu.citrix.com>
Date: Fri Sep 15 18:10:44 2017 +0100
os-posix: Provide new -runas <uid>:<gid> facility
This allows the caller to specify a uid and gid to use, even if there
is no corresponding password entry. This will be useful in certain
Xen configurations.
But this makes it
better
QEMU Command Line
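With that patch in place the toolstack can compute base + domid and pass it straight through, along these lines (the numbers are purely illustrative, standing in for the reserved UID base plus the domid and a similarly reserved gid):

/usr/local/lib/xen/bin/qemu-system-i386 -runas 65538:65538 ...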
QEMU
64
Cleanup
QEMU QEMU QEMU QEMU
uid = base + 1
uid = base + DOMID_FIRST_RESERVED - 1
uid = base + 2
uid = base + 3
UIDs cycle as domains
come and go...
QEMU
65
Cleanup
QEMU QEMU QEMU QEMU
uid = base + 1
uid = base + DOMID_FIRST_RESERVED - 1
uid = base + 2
uid = base + 3
... but something is
lurking here
QEMU
66
Cleanup
QEMU
uid = base + 2
uid = base + 2
Compromised process
with the same UID that
was not killed...
QEMU
67
Cleanup
QEMU
uid = base + 2
uid = base + 2
Compromised process
with the same UID that
was not killed...
... But why was it not
killed?
QEMU
68
Killing Processes Is Tricky
while(1) {
    if(!fork())
        _exit(0);
}
kill(qemu_pid);
Toolstack
Not going to work since
PID is continuously
changing
uid = base + 2
69
Reliable Mechanism
while(1) {
    if(!fork())
        _exit(0);
}
setresuid(..., base + 2, ...);
kill(-1, SIGKILL)
QEMU
Toolstack
uid = base + 2
QEMU
70
Reliable Mechanism
while(1) {
    if(!fork())
        _exit(0);
}
Toolstack
uid = base + 2
Carefully crafted so
QEMU can’t kill
Toolstack
setresuid(..., base + 2, ...);
kill(-1, SIGKILL)
QEMU
71
Reliable Mechanism
while(1) {
    if(!fork())
        _exit(0);
}
Toolstack
uid = base + 2
Example code here
setresuid(..., base + 2, ...);
kill(-1, SIGKILL)
https://github.com/gwd/runner-reaper
● Direct resource mapping makes guest attack
on QEMU more difficult
72
Summary
● Direct resource mapping makes guest attack
on QEMU more difficult
● Hypercall auditing and restriction reduces
ability of compromised QEMU to attack host
or other guests
73
Summary
● Direct resource mapping makes guest attack
on QEMU more difficult
● Hypercall auditing and restriction reduces
ability of compromised QEMU to attack host
or other guests
● De-privileging QEMU stops it bypassing
those restrictions
74
Summary
Problems
75
● Migration
● PCI Pass-Through
76
Problems
● Migration
● PCI Pass-Through
77
Problems
Problem: Signaling uses xenstore
● Migration
● PCI Pass-Through
78
Problems
Problem: Signaling uses xenstore
Solution: Use QMP instead
79
Problems
Problem: Signaling uses xenstore
Solution: Use QMP instead
● Migration
● PCI Pass-Through
Audit Handles
80
QEMU
Kernel
User
privcmd
libxenforeignmemory libxendevicemodel libxenstore libxenevtchn libxengnttab
privcmd xenbus gntdev evtchn
Audit Handles
81
QEMU
Kernel
User
privcmd
libxenforeignmemory libxendevicemodel libxenstore libxenevtchn libxengnttab
privcmd xenbus gntdev evtchn
● Migration
● PCI Pass-Through
82
Problems
Problem: Signaling uses xenstore
Solution: Use QMP instead
Problem: pcilib(7)
● Migration
● PCI Pass-Through
83
Problems
Problem: Signaling uses xenstore
Solution: Use QMP instead
Continue to run QEMU as root
Problem: pcilib(7)
Further Restrictions
84
● chroot(2)
● setrlimit(2)
● Linux namespaces
85
Further Restrictions
● chroot(2)
● setrlimit(2)
● Linux namespaces
86
Further Restrictions
qemu-system-i386 -chroot <dir>
● chroot(2)
● setrlimit(2)
● Linux namespaces
87
Further Restrictions
qemu-system-i386 -chroot <dir>
Virtual CD-ROM?
● chroot(2)
● setrlimit(2)
● Linux namespaces
88
Further Restrictions
qemu-system-i386 -chroot <dir>
Use QMP add-fd
● chroot(2)
● setrlimit(2)
● Linux namespaces
89
Further Restrictions
qemu-system-i386 -chroot <dir>
Use QMP add-fd
RLIMIT_CORE
RLIMIT_FSIZE
RLIMIT_LOCKS
RLIMIT_NOFILE
.
.
.
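A minimal sketch of applying one of these before QEMU is exposed to guest input; RLIMIT_FSIZE is the example called out in the speaker notes, and the limit value is illustrative:

#include <sys/resource.h>

/* Stop a rogue QEMU from filling the emulator domain's filesystem. */
static int limit_qemu_file_size(rlim_t max_bytes)
{
    struct rlimit rl = { .rlim_cur = max_bytes, .rlim_max = max_bytes };

    return setrlimit(RLIMIT_FSIZE, &rl);
}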
● chroot(2)
● setrlimit(2)
● Linux namespaces
90
Further Restrictions
qemu-system-i386 -chroot <dir>
Use QMP add-fd
RLIMIT_CORE
RLIMIT_FSIZE
RLIMIT_LOCKS
RLIMIT_NOFILE
.
.
.
unshare(CLONE_NEWNS | CLONE_NEWIPC);
Q&A
91


Editor's Notes

  • #2 Hi. My name is Paul Durrant. I’m a Senior Principal Software Engineer in the XenServer group at Citrix Systems R&D, based in Cambridge UK. I’ve been a contributor to the Xen Project for many years now. I’m community lead for the Windows PV Drivers but I also contribute to Linux, QEMU and the hypervisor itself. Today I’m going to talk about some of the recent work that has been going on to reduce the attack surface, both of a guest against QEMU and a compromised QEMU against the rest of the system.
  • #3 However, before I start, I’d like to acknowledge the help of Ian Jackson and George Dunlap in creating this presentation. A lot of what you’ll see here is the result of many discussions and work done by all three of us, as well as Anthony and other members of the Xen community. Also a big thanks to Anthony for reviewing this presentation.
  • #4 So, first a bit of background… How do we typically use QEMU with an installation of Xen? Actually it plays two distinct roles...
  • #5 The first is as a set of paravirtual backends… For things like storage, network, keyboard/mouse and framebuffer
  • #6 In this role it behaves in the same way as a kernel backend would… In this picture we have QEMU running in what I’ve called a ‘service’ domain, but typically this would be dom0. It maps the shared rings which are set up by the guest and advertised through xenstore, and services the guest I/O requests as they appear. Mapping is done using grants and signalling is done using event channels. Xen provides shared libraries for all these interactions and QEMU is linked against those.
  • #7 So in this case QEMU running in the service domain has a trust relationship with the frontends in the guest (the extent of that trust relationship having been somewhat of a contentious issue on xen-devel lately) but the service domain has limited privilege… E.g. it does not need mapping privilege over the guest (since the protocols use grants)...
  • #8 However, because of the use of grants, event channels, etc. the service domain will need to make hypercalls. So in this role there is some scope for QEMU to be attacked (e.g. via bugs in the PV protocol or interactions with xenstore) and then some scope for a compromised QEMU to attack the system (e.g. via the hypercall privileges the service domain has). But for the rest of this talk I’m going to focus on QEMU’s other role...
  • #9 … which is as a set of hardware device emulation models to support an HVM guest. In this picture I’ve shown QEMU running in an ‘emulator’ domain. This could be a stub domain (i.e. an emulator domain dedicated to a single guest) but in many deployments - including XenServer - the emulator domain is dom0. The reason that stubdomains are generally not used in enterprise environments is scalability. Without the ability to share pages between VMs (which is there in Xen but not well tested or really trusted) you potentially end up with hundreds of identical stub OS images, which is a massive resource cost, and customers are just not willing to live with an overhead of this magnitude.
  • #10 So, Xen traps I/O (port, memory mapped or PCI config) accesses made by the hardware drivers running in the guest and uses the IOREQ Server subsystem to forward those requests to an emulator… which in this case, of course, is QEMU.
  • #11 QEMU then handles those I/O requests and, if the data is not immediate in the request, will need to map guest memory to do so. Of course the device models may also be emulating DMA capable hardware and so may also directly read or write guest memory for that reason. Hence the emulator domain must have mapping privilege over the guest.
  • #12 And, like the service domain, the emulator domain also needs to make hypercalls… but in addition it shares memory directly with Xen to participate in the IOREQ protocol (i.e. the protocol used for issuing the I/O emulation requests to QEMU, and getting the results back).
  • #13 So, the first attack I’ll focus on is via that shared memory interface...
  • #14 When QEMU starts up it needs to get Xen to set up the shared pages that are going to be used to drive the IOREQ protocol. I’ll just mention at this point that there are actually a pair of pages: The ‘SYNC’ page contains an array of IOREQ structures, one per guest vCPU. When a vCPU traps for I/O emulation the details are written into its slot in the array and the vCPU is paused whilst QEMU services the emulation request The ‘BUFFERED’ page contains a more traditional PV ring structure. It’s there because some emulations, mainly memory mapped writes through the VGA aperture - 0xA0000 thru 0xBFFFF - don’t actually need to be dealt with synchronously. So Xen forwards these on using the ring in the buffered page. Now, shared memory set-up can happen in a couple of ways… For older versions of QEMU (that are unaware of the IOREQ Server API… in one of Xen’s shared libraries) it happens when QEMU reads back the GFNs of the IOREQ pages (which are actually allocated in guest space, in an E820 reserved region) via a couple of magic HVM parameters. Newer versions of QEMU, that are aware of the IOREQ Server API, do the set up using specific hypercalls. The pages are still located in the same E820 reserved region in guest space though.
  • #15 So, having allocated the GFNs, they are then mapped by Xen and (using a normal map-foreign-memory operation) by QEMU.
  • #16 And now the protocol can start operation.
  • #17 But those pages are visible in guest memory space… And, whilst they are in an E820 reserved region, they actually have well known frame numbers and could easily be mapped by a malicious guest… So now the guest can play with the IOREQ protocol directly and not only attack QEMU but Xen… Doesn’t sound good.
  • #18 Fortunately, when the IOREQ Server API was introduced, it contained a mitigation against such attacks in the form of the ‘enable’ operation...
  • #19 When the IOREQ Server API is used, the pages are mapped by Xen and QEMU, as I said before. But actually the protocol does not start operating immediately. Xen will not touch its mapping of the pages until...
  • #20 QEMU issues the IOREQ Server ‘enable’ hypercall. This hypercall: Firstly removes the MFNs of the SYNC and BUFFERED pages from the guest P2M Zeroes them (blowing away whatever the guest may have put in them) And only then sets a flag to allow Xen to start using them for the IOREQ protocol So, this stops the guest from messing with them whilst the protocol is operational but it does introduce a little complexity when it comes to migration… There is an IOREQ Server ‘disable’ hypercall and this needs to be performed before the final guest memory migration to put the pages back into the guest P2M… Otherwise the IOREQ Server pages are leaked and, when trying to start QEMU on the receiving host it goes bang because there are no pages to map.
  • #21 So, it would really be better - and it would make migration simpler - if the IOREQ Server pages were never in the guest P2M at all.
  • #22 So, to this end, there is now a new memory op in Xen 4.11 called XENMEM_acquire_resource… The idea behind this is to allow an emulator domain to directly map memory-based guest resources without them ever needing to be visible to the guest itself.
  • #23 So now, when the hypercall is made to create an IOREQ Server, the SYNC and BUFFERED pages are allocated directly by Xen. They do not come from guest memory space. (Actually they are allocated using alloc_domheap_pages() and assigned to the emulator domain).
  • #24 Now, the new ‘acquire resources’ hypercall is issued and the pages are directly mapped by QEMU. For consistency and backwards compatibility, the IOREQ Server enable hypercall is still there but it is no longer so crucial for protection. Let’s take a closer look at the details of the hypercall...
  • #25 This is the snippet from the Xen public memory header… There are various fields, the first being the domid whose resource is to be mapped...
  • #26 The next is the type of resource… At the moment there is only one resource type defined and that is ‘ioreq server’. The intention is that other types of guest resource can be mapped this way in future and, in fact, I have already posted patches to add support for grant table mapping… but they didn’t make 4.11.
  • #27 Then there is the id field… For the ‘ioreq server’ resource type there are ids for the BUFFERED page and for SYNC pages… Notice I said pages (plural) here… You may recall that, in the SYNC page, there is an array of IOREQ structures... one per guest vCPU… and because of the size of that structure there is actually only space for 128 of them in a page. So, if we ever want more than 128 vCPUs in an HVM guest we’re going to need to handle more than one SYNC page and using this mechanism we now can.
  • #28 Then there is the number of frames we want to map starting from the base id, some flags, and finally the frame list… Now this is a little bit subtle because the emulator domain (i.e. tools domain in this context) could be PV or HVM. If the emulator domain is PV then, when the hypercall returns, the frame_list will be populated with the MFNs of the resource and the domain can then ask Xen to set up PTEs for those MFNs. However, if the emulator domain is HVM then it needs to allocate some GFN space (usually done by grabbing a piece of the balloon) and then pass those GFNs into the hypercall so that Xen can add the resource MFNs into its P2M.
  • #29 Happily I’ve hidden this subtlety behind a new IOCTL in privcmd - the driver used to issue most hypercalls - so the new xenforeignmemory_map_resource() API call doesn’t need to worry about it! You can see the date on that commit is pretty old but because it took such a long time to get the hypercall into Xen (due to some tangential security issues spotted during review) the Linux patch has now been queued up for 4.18.
  • #30 Now I’ll move on to look at another couple of attacks… For these, and for the rest of the presentation, we’ll assume that QEMU has been compromised (probably via a bug in one of the device models) and look at what it might attempt to do once compromised. The first attack we’ll look at are the memory handles that may be passed in a hypercall.
  • #31 So, we’ll focus on the way that a significant number of the hypercalls that QEMU issues, via the Xen shared libraries, are issued… As I mentioned briefly before these go via the privcmd driver in kernel.
  • #32 First, as I said, let’s assume that the guest has attacked and successfully compromised QEMU...
  • #33 Now it causes QEMU to issue a hypercall. It does this by sending an ioctl that encapsulates that hypercall to privcmd.
  • #34 privcmd then just blindly marshalls the arguments from that ioctl and makes whatever hypercall it was told to… No auditing is done. So the guest has carte blanche to issue arbitrary hypercalls.
  • #35 Now let’s look at a particular hypercall… ‘track dirty VRAM’... It has some slightly odd semantics but it’s basically used to get Xen to update a bitmap with all the pages of guest VRAM which have been written to since the hypercall was previously invoked. It’s used for keeping the guest VNC console up to date without having to scrape the entire VRAM each time a framebuffer update is requested. The dirty_bitmap memory handle is entirely under QEMU’s control though and, as we’ve seen, is passed completely unaudited to Xen. So let’s say the malicious guest writes a pattern into its VRAM, then uses the compromised QEMU to issue a ‘track dirty VRAM’ hypercall with dirty_bitmap set to a kernel virtual address. There’s no auditing to stop it doing this.
  • #36 Xen will then blindly write the bitmap (and hence a guest controlled bit pattern) to the address it was given… So, in this way the guest effectively has arbitrary write access, via the compromised QEMU, to the emulator domain’s kernel memory.
  • #37 So what can we do about this? Well, it’s reasonably obvious… Don’t allow unaudited hypercalls to be issued by QEMU. Now this could be done by introspecting on the hypercalls in privcmd, but that means we have to teach privcmd about specific hypercalls and keep Xen and privcmd completely in sync. But there is a better way we can tackle this… We introduce a new hypercall...
  • #38 We’ll start with broadly the same picture as before, but this time notice that I’ve included a specific shared library: libxendevicemodel. This was new in Xen 4.10.
  • #39 We start again assuming that QEMU has been compromised and is issuing a hypercall. But this time let’s say it is actually going to use libxendevicemodel to issue the hypercall. This library sends IOCTL_PRIVCMD_DM_OP, rather than IOCTL_PRIVCMD_HYPERCALL, which results in privcmd issuing (specifically) a new hypercall called dm_op. Now you may ask how we stop QEMU avoiding the library and just sending IOCTL_PRIVCMD_HYPERCALL as it did before… but we’ll come to that in a few slides’ time. The important thing about the dm op ioctl though is...
  • #40 It can be audited. Let’s have a closer look at how it is made up.
  • #41 The ioctl structure itself just contains the domid to which the hypercall relates and an array of pointers to buf structures. Each buf structure then specifies an area of emulator domain memory and so it is quite straightforward for privcmd to make an access_ok() check on each of those areas to verify that they are actually pointing into process user space.
  • #42 And then the translation into the dm_op hypercall is entirely mechanical because the hypercall structures are analogous to the ioctl structures. By convention a dm_op expects to find its operation information in buf[0] so no introspection of the detail of the hypercall is necessary in privcmd.
  • #43 So that’s how we can audit emulator memory handles but what about target domains?
  • #44 To illustrate, we’ll start with another variant of the picture… The QEMU process relating to domain A has been compromised and, this time, it’s issuing an IOCTL_PRIVCMD_MMAP (to map domain memory)… But the operation is targeting domain B’s memory. Now, as I said, we’re not considering stub domains here so the emulator domain here does have mapping privilege over domain B
  • #45 Now, privcmd is just a kernel module in the emulator domain… It doesn’t know which domain’s memory it should or should not be mapping on behalf of a particular QEMU instance so...
  • #46 Domain B’s memory is mapped and now the domain is owned by the compromised QEMU.
  • #47 So I said privcmd doesn’t know which domain’s memory it should or should not be mapping on behalf of a particular QEMU instance… Well how about we tell it!
  • #48 Before QEMU starts handling any emulation requests from the guest, and hence before there’s any possibility of it being compromised by the guest (because the guest has not yet been unpaused), QEMU can send an ioctl to privcmd to restrict the operations it can do and limit them to a particular target domain. The restriction is applied by privcmd storing the domid (A in this case) in a structure related to the process’s file handle...
  • #49 So when later, after QEMU has been compromised, privcmd can check the domid in the mmap ioctl against the one referenced by the file handle, spot the mismatch and deny the operation. Note that to get this to work it means that privcmd needs to be able to ‘see’ the domid of the operation… This is true for mmap operations but not of arbitrary hypercalls issued via IOCTL_PRIVCMD_HYPERCALL. But recall IOCTL_PRIVCMD_DM_OP… The domid is present in the ioctl structure so we can also audit the domain id of all dm_ops, as well as the memory handles. So, when IOCTL_PRIVCMD_RESTRICT is issued, privcmd will also refuse to handle any future IOCTL_PRIVCMD_HYPERCALL that may be issued… and that’s how we stop QEMU going round the side of libxendevicemodel. But there is a problem with restricting file handles...
  • #50 And that is that there are a lot of them… Because of the way that the Xen shared libraries work, it turns out that there isn’t just one handle open from QEMU to privcmd… and also QEMU doesn’t just open handles to privcmd; there are other drivers such as evtchn (for event handling) and gntdev (for grant operations). Having QEMU issue individual restrictions for each file handle is going to get messy… Also, not all APIs are currently restrictable, so what do we do about those?
  • #51 Well, we introduce a new library to control restriction… libxentoolcore
  • #52 The idea here is that each of the libraries that open handles to drivers registers those handles, along with a callback, with libxentoolcore
  • #53 QEMU now makes a single call into libxentoolcore to restrict all its handles… Then, for each of the registered handles from the other libraries, libxentoolcore invokes the registered callback… And these callbacks then do the restriction. So, for example, for libxenforeignmemory (or libxendevicemodel) the callback issues an IOCTL_PRIVCMD_RESTRICT to privcmd. libxenevtchn issues a similar ioctl to the evtchn driver, but not all the APIs (or their underlying implementations) are (yet) restriction ‘aware’ so there is a fall-back...
  • #54 For handles to those libraries libxentoolcore dup2()s them to a handle open to /dev/null so they are essentially neutralized and anything sent down them will go nowhere. So now we have a QEMU which can only issue auditable hypercalls, can only map memory from the ‘correct’ domain, but there is still a problem...
  • #55 QEMU still runs as root, so it can just open a new handle to privcmd and completely bypass all the restrictions we have set up.
  • #56 So let’s not run it as root…
  • #57 Happily QEMU already has an option to run as an alternative user after all the initial set up (i.e. after we have opened handles and restricted) but...
  • #58 Whilst it says user id in the help text, that argument is actually a user name. Anyway, let’s set up an account for a QEMU user and use that. There is a problem with this...
  • #59 Here we have a few QEMU processes all running using the id of the new user we set up...
  • #60 But now let’s say one of them is compromised...
  • #61 It could now invoke ptrace on any other QEMU process using the shared UID, so whilst it cannot gain direct access to another VM’s memory (via a now-restricted mmap operation) it can still observe the interactions between another domain and its instance of QEMU. So, to avoid this possibility, what we really want is a separate UID per VM...
  • #62 So, how are we going to do it? Well domid space is actually less than 16 bits wide (it goes from 0 to 32751, i.e. 0x7FEF) whilst uid space is generally (close to) 32 bits wide (on modern OS, although POSIX only guarantees 16 bits). So, we should be able to reserve enough consecutive UIDs to have one per domain, such that UID = base + domid. (This is good because we only need to tell the toolstack about the base uid).
  • #63 So, now we’ve reserved our UID space but that QEMU command line argument is a little awkward… It means we need to actually add an account (i.e. a passwd entry) for each of our reserved UIDs!
  • #64 Well actually we don’t… Ian recently upstreamed a patch to QEMU so we don’t need to do that. Now we can run QEMU with an individual UID for each VM but what happens when the VM is shut down? Well there is a little problem with this...
  • #65 Let’s say we’re starting and stopping guests and the QEMU UIDs are tracking the domids as we want… Eventually the domid is going to hit that limit of 32751, or DOMID_FIRST_RESERVED - 1, and is going to cycle back round (obviously not to 0)... But when the domid and hence the UID cycles back round there could be an issue...
  • #66 Notice how there is something lurking in the background there...
  • #67 It’s a compromised QEMU process that was not killed and now has the same UID as our brand new QEMU process, which makes our new QEMU instance vulnerable (for the reasons we went into earlier).
  • #68 But why was it not killed? The toolstack should have done that when the associated VM was shut down, right?
  • #69 Well it turns out killing processes is tricky. If our compromised QEMU starts executing a loop like the one illustrated here then its PID is not going to be the one that the toolstack originally forked and it’s going to be changing so rapidly that it’s likely to win the race even if the toolstack uses killall. We need a more reliable kill mechanism.
  • #70 Happily there turns out to be one… If kill is issued with a PID of -1 then this means that all other processes belonging to the same UID as the process that invokes kill will be terminated. So all the toolstack needs to do is set its UID to that of the QEMU instance it needs to kill, and then issue a kill with PID -1.
  • #71 But we have to be a little bit careful… There are various UIDs that a process can have - real, effective and saved - and if the toolstack chooses to set the wrong one it could leave itself open to being killed by the compromised QEMU (using UID -1). Happily there is an asymmetry between who can kill and who will be killed. Thus, by carefully crafting a call to setresuid, the effective UID of the toolstack process can be changed safely and thus QEMU can be reliably terminated.
  • #72 For those who are interested, George Dunlap has posted sample code to illustrate this issue which you can find at this URL
  • #73 So to summarize what we’ve covered so far… We’ve looked at direct resource mapping, which closes off one way in which a guest could attack QEMU
  • #74 We’ve looked at hypercall auditing and restriction to reduce the amount of damage that a compromised QEMU can do...
  • #75 And we’ve looked at de-privileging QEMU to stop it bypassing those restrictions...
  • #76 But there are a couple of problems with restriction and de-privileging that have yet to be solved...
  • #77 These are: Migration… this also includes save and restore, which use a lot of the same mechanisms internally. And PCI pass-through… taking a discrete piece of PCI hardware and giving the guest access to it.
  • #78 The problem with migration is the signalling done between the toolstack and QEMU: to instruct QEMU to save its state on save/migration out, and to indicate when QEMU is done parsing the state on restore/migration in. Both use xenstore, and during restriction QEMU’s xenstore handle gets dup2-ed onto /dev/null… so the signalling no longer works
  • #79 The proposed solution to this one is to do the signalling via QMP, which is arguably how it should be done anyway. (And Anthony tells me that he has candidate patches for this and has now successfully run a migration).
  • #80 There is still a question of auditing though...
  • #81 We want to be able to be sure that all the various QEMU handles are restricted before QEMU is exposed to any untrusted input. Ian Jackson has written a tool to do this and, when starting a new VM it’s fairly straightforward to run… You just start the guest paused, run the tool and...
  • #82 It will (hopefully) give QEMU a clean bill of health and you can unpause the VM. The problem is that on a restore or migration in, we need to regard the inbound state as malicious (as it may have come from a compromised QEMU) so we also need to figure out a way to run the audit tool between starting up QEMU as part of a domain restore and when QEMU starts to process the inbound state… The current proposal is to interpose on the inbound state by means of a pipe so we can block QEMU in a read operation until the audit tool has been run.
  • #83 The other main problem with restriction is PCI pass-through and this comes down to QEMU needing to use APIs which require it to be running as root e.g. things underpinned by sysfs, such as pcilib.
  • #84 And really there isn’t an obvious solution to this problem. Any solution is going to require significant re-work of PCI pass-through… and actually there are moves to bring pass-through into the hypervisor itself, so for the moment - if you want pass-through - you’re going to have to continue to run QEMU as root… and indeed this is what we do in XenServer.
  • #85 So, that brings me to the end of the talk but before I finish I’d just like to cover a few more QEMU restrictions that haven’t been implemented yet but that we’ll likely add in the near future...
  • #86 These are: Running QEMU in a chroot... Applying process limits... And using Linux namespaces
  • #87 Using a chroot is pretty easy… QEMU already has an option to run itself in a chroot, but there is an implication of doing this…
  • #88 What about virtual CD-ROMs (or any removable media device really)… Currently presenting an ISO, or a QCOW, or whatever as a virtual CD-ROM involves having QEMU open a file when instructed to do so by the toolstack. But if we’ve chrooted it then it probably can’t see that file to open it… and its new UID may not let it open the file anyway.
  • #89 Happily QEMU already has a mechanism that can be used to avoid this problem… There is a QMP command to hand QEMU the handle of an already-open file or socket. So the toolstack can open the ISO or QCOW itself and then hand the handle into the chrooted and restricted QEMU.
  • #90 The next thing to do is to use setrlimit() to apply some process limits… E.g. We should set RLIMIT_FSIZE or there’s a chance that a rogue QEMU could fill up the emulator domain’s (and hence probably dom0’s) file system.
  • #91 And lastly there are Linux namespaces… It would be a good idea to put QEMU in its own mount and IPC name space so that it can’t even name system mount points or non-file-based IPC descriptors, so there’s no way it can attack them.
  • #92 I’ll end there. Thanks for listening. If anyone has any questions then please fire away...