SlideShare a Scribd company logo
XenTT: Deterministic Systems Analysis
in Xen
Anton Burtsev, David Johnson, Chung Hwan Kim, Mike Hibler,
Eric Eide, John Regehr
University of Utah
• Record execution history of a guest VM
• Recreate it in an instruction-accurate way




                                               2
• Execution replay is a right way to analyze systems
• We need a practical tool!

                                                       3
Deterministic replay




                       4
Determinism
CPU is
  deterministic


                                      Execution history
           Interrupt,
           device I/O




                                                          5
Recording




• Determinism of the execution environment
• Instruction-accurate position of events



                                   Event log

                                               6
Instruction-accurate position of events

label: ...
       mov                  • Number of instructions
       shr                    since boot
       mov                     • Intel has a hardware
   rep movsl                     counter
       test                    • It’s not accurate
       jne label
       ...
• Hardware instruction counter
 Position = { EIP, branch #, ECX}
   • Preempt execution of a system at the same
     instruction
   • Hardware instruction counter + single-stepping
                                                        7
Determinism in Xen




                     8
Nondeterministic events


             Simple Model
 System = memory pages + registers


      Events = memory updates
        (time, device I/O, system calls)
Control flow updates = registers + stack
             (interrupts, events)



                                           9
Some examples
• Instruction emulation (e.g. cpuid, rdtsc, in/out)
   • Return the values of the original run
• Hypercalls
   • Re-execute to ensure determinism of the hypervisor
• Time
   • Shared info page + rdtsc
• Exceptions
   • Deterministic, re-execute
• Interrupts
   • Force re-execution of the interrupt frame (bounce frame) code
     in entry.S
• Shared info updates
   • Replay original values
• Memory
   • Shadow page tables

                                                                 10
11
Xen devices




              12
Device interposition




• Devd ensures determinism of updates to the guest’s
  shared ring buffers


                                                   13
Replay touches many parts of Xen
• Device discovery
   • Devd implements a concept of a device bus
   • Discovers new devices in Xenstore
   • Binds new devices with drivers

• Xenstore transactions
   • During replay, transactions from replayed guest can’t fail
   • They will not be re-executed

• Out-of-order device responses
   • Disk and network responses can arrive out-of-order

• Disk logging
   • Disk payload is deterministic
   • LVM snapshots

• Network logging
   • In-kernel logging of the network payload

                                                                  14
Low-overhead logging




                       15
Are we sure that executions identical?
• Intel branch store trace facility
• Record all taken branches in a memory buffer
   • Compare original and replay runs




                                                 16
Analysis Engine and
Virtual Machine Introspection




                                17
Analysis framework




                     18
Virtual Machine Introspection (VMI)




                                      19
Performance model

                             • Account for effects of
                               replay

                             • Translate performance
                               between original and
                               replay runs

• Re-execution approach to performance

      Virt cntr = Virt cntrstart + ∆ (Real cntr)

                                                        20
Analysis Examples




                    21
NFS request
       processing path
• How much time requests
  spend in each subsystem?

• Request tracking
   • Address of the kernel
     data structure is a
     unique identifier
   • Join identifiers when
     requests move
     between subsystems

                             22
Execution context tracking
• Execution context
  • Context switches
     • schedule.switch_tasks
  • User/kernel
     • System call transitions
     • system_call
  • Interrupts and exceptions
     • do_IRQ
     • do_pagefault
     • do_* (divide_error, debug, nmi, int3...)


• Make analysis context aware
  • Filter probes by context, e.g.
  • All pagefaults from the process “foo”
                                                  23
Control Flow Integrity (CFI)




       • Simple CFI model:
          • Returns should match calls
          • Dynamic calls are “sane”, e.g.
              within kernel address space
       • Detect ROP attacks, stack smashing,
         etc.
                                               24
Execution trace
sys_sendfile
  do_sendfile
    fget_light                    • CFI records a trace of
    rw_verify_area
    fget_light                      function calls
    rw_verify_area
    shmem_file_sendfile              • sys_sendfile is
      do_shmem_file_read
        shmem_getpage                  the last system call
          find_lock_page
             radix_tree_lookup         before control
          shmem_recalc_inode
          shmem_swp_alloc              flow jumps to 0x0
             shmem_swp_entry
               kmap_atomic
                 __kmap_atomic
                   page_address
          kunmap_atomic
          find_get_page
             radix_tree_lookup
        file_send_actor
          sock_sendpage
             UNKNOWN FUNCTION
             (addr:0x00000000)

                                                         25
Intrusion backtracking




• Track accesses to “/etc/passwd”
• Probe sys_open
   • Filter by file name
   • Find process ID, branch counter     26
Intrusion backtracking: pass 1




• Process or it’s parent escalated privileges
• Watch write accesses to &task->uid
   • Filter by parents of the offending process
                                                  27
Intrusion backtracking: pass 1




• Find the syscall inside which privileges are escalated
• Probe sys_* - all system call entry and exit points
   • Filter by the offending process ID
                                                           28
Intrusion backtracking: pass 1




• Privilege escalation is a CFI violation
• Start CFI analysis from the last system call
• Find %EIP at which CFI is violated, and location of the
  shell code (0x0)                                        29
Intrusion backtracking: pass 1




• Find at which point address 0x0000000 gets mapped
• Probe do_page_fault

                                                      30
More mechanisms
• Execution traces
   • BTS trace of all taken branches
   • Instruction traces

• Memory (variable) access traces
   • Intersect with the execution trace
   • See where variables get accessed




                                          31
How much overhead?




                     32
• 32-bit x86 PV-guests
• xen-unstable near v3.0.4
   • We rely on a working shadow page tables
• 1-CPU time-traveling guests
   • No SMP replay
   • Dom0 and Xen are SMP of course

• Test machine
   • 4 cores
   • 1Gbps network
   • 130 MB/s disks



                                               33
CPU
        140
        120
        100
        80
Score




        60
        40          Xen
                    XenTT
        20
         0




                          34
Network throughput
       140

       120

       100

       80
MB/s




                                        Xen
       60                               XenTT
                                        XenTT (32MB buffers)
       40

       20

        0
             TCP Send     TCP Receive

                                                         35
Network delay
     0.16

     0.14

     0.12

      0.1
ms




     0.08
                                                          Xen
     0.06                                                 XenTT
     0.04

     0.02

       0
            domU to dom0          domU to external host
                           Ping

                                                                36
Disk I/O
       140

       120

       100

       80
MB/s




                                                     Xen
       60                                            XenTT

       40

       20

        0
             Serial write              Serial read

                                                           37
Raw           Compressed (gzip)
Linux boot
Event log                4.3 GB        0.8 GB
Idle overnight (14 hours)
Event log                6.2 GB        1.6 GB
Growth rate              440 MB/hour   114 MB/hour (2.7GB/day)
TCP receive (1.63 GB data stream)
Event log                1.76 GB       342 MB
Payload log              1.79 GB       Payload dependent
Disk write (4 GB file)
Event log                1.8 GB        350 MB
Disk read (4 GB file)
Event log                0.59 GB
                                                             38
Lessons learnt




                 39
Scaling development
• Extending Xen with BTS support
   • Debug crashes in Xen, and dom0

• Execution comparison tools
   • BTS traces to understand what went wrong
   • Support for resolving symbols

• Run-time comparison tools
   • Compare guest’s state between original and replay runs

• Trace from all parts of your system
   • Xen, dom0, domU

• Support performance tracing
  • Xentrace messages

                                                              40
What we didn’t predict
• I/O delay goes up
   • Not sure if Linux has adequate low-latency user-level
     processing support
   • Maybe need an in-kernel interposition component

• Branch counters are fragile
   • Our code works on several server CPUs
   • Fails on a laptop with the CPU from the same model/family
     line




                                                             41
Conclusions




              42
Practical replay analysis is feasible
• Performance overheads are reasonable
   • Realistic systems
   • Realistic workloads
   • Minor setup costs (just install Xen)

• Analysis engine is an amazing tool
   • And we’re growing it

• We need your help to port it upstream
   • Porting effort
   • Shadow memory for PV guests
   • Support for HVM guests

                                                 43
Thank you.

All code is GPLv2 and will be available soon
        (available on-request now).



               Questions: aburtsev@flux.utah.edu

                                                   44
Backup slides




                45
Execution environment


                     Deterministic
• CPU                Machine
• Virtual hardware
   • Memory                                  System
   • Disk
• Software




                                External Events

                                     • External I/O
                                                      46
How do we find nondeterministic events?




                                          47
Device interface




                   48
Reading a local variable
bsym_skb = target_lookup_sym(probe->target,
                             "netif_poll.skb",
                             ".", NULL, flags);

lval_skb = bsymbol_load(bsym_skb, flags);

skb_addr = *(unsigned long*) lval_skb->buf;




                                                  49
Controlled re-execution or replay
• Types of non-deterministic events
   • Synchronous
      • Hypercalls
   • Asynchronous
      • Interrupts
   • Best effort
      • Time updates
• Branch counters
   • The biggest problem of this solution




                                             50

More Related Content

What's hot

OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
ScyllaDB
 

What's hot (20)

OpenZFS at LinuxCon
OpenZFS at LinuxConOpenZFS at LinuxCon
OpenZFS at LinuxCon
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?
 
XPDS14 - Towards Massive Server Consolidation - Filipe Manco, NEC
XPDS14 - Towards Massive Server Consolidation - Filipe Manco, NECXPDS14 - Towards Massive Server Consolidation - Filipe Manco, NEC
XPDS14 - Towards Massive Server Consolidation - Filipe Manco, NEC
 
OpenZFS code repository
OpenZFS code repositoryOpenZFS code repository
OpenZFS code repository
 
RHEVM - Live Storage Migration
RHEVM - Live Storage MigrationRHEVM - Live Storage Migration
RHEVM - Live Storage Migration
 
OpenWrt From Top to Bottom
OpenWrt From Top to BottomOpenWrt From Top to Bottom
OpenWrt From Top to Bottom
 
Xen 4.3 Roadmap
Xen 4.3 RoadmapXen 4.3 Roadmap
Xen 4.3 Roadmap
 
Linux firmware for iRMC controller on Fujitsu Primergy servers
Linux firmware for iRMC controller on Fujitsu Primergy serversLinux firmware for iRMC controller on Fujitsu Primergy servers
Linux firmware for iRMC controller on Fujitsu Primergy servers
 
Live migrating a container: pros, cons and gotchas
Live migrating a container: pros, cons and gotchasLive migrating a container: pros, cons and gotchas
Live migrating a container: pros, cons and gotchas
 
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
 
XPDS14: Removing the Xen Linux Upstream Delta of Various Linux Distros - Luis...
XPDS14: Removing the Xen Linux Upstream Delta of Various Linux Distros - Luis...XPDS14: Removing the Xen Linux Upstream Delta of Various Linux Distros - Luis...
XPDS14: Removing the Xen Linux Upstream Delta of Various Linux Distros - Luis...
 
Look Into Libvirt Osier Yang
Look Into Libvirt Osier YangLook Into Libvirt Osier Yang
Look Into Libvirt Osier Yang
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)
 
Virtunoid: Breaking out of KVM
Virtunoid: Breaking out of KVMVirtunoid: Breaking out of KVM
Virtunoid: Breaking out of KVM
 
XPDS13: VIRTUAL DISK INTEGRITY IN REAL TIME JP BLAKE, ASSURED INFORMATION SE...
XPDS13: VIRTUAL DISK INTEGRITY IN REAL TIME  JP BLAKE, ASSURED INFORMATION SE...XPDS13: VIRTUAL DISK INTEGRITY IN REAL TIME  JP BLAKE, ASSURED INFORMATION SE...
XPDS13: VIRTUAL DISK INTEGRITY IN REAL TIME JP BLAKE, ASSURED INFORMATION SE...
 
Linux based Stubdomains
Linux based StubdomainsLinux based Stubdomains
Linux based Stubdomains
 
Ceph in the GRNET cloud stack
Ceph in the GRNET cloud stackCeph in the GRNET cloud stack
Ceph in the GRNET cloud stack
 
Xen Debugging
Xen DebuggingXen Debugging
Xen Debugging
 
Libvirt/KVM Driver Update (Kilo)
Libvirt/KVM Driver Update (Kilo)Libvirt/KVM Driver Update (Kilo)
Libvirt/KVM Driver Update (Kilo)
 

Similar to XenTT: Deterministic Systems Analysis in Xen

Efficient Bytecode Analysis: Linespeed Shellcode Detection
Efficient Bytecode Analysis: Linespeed Shellcode DetectionEfficient Bytecode Analysis: Linespeed Shellcode Detection
Efficient Bytecode Analysis: Linespeed Shellcode Detection
Georg Wicherski
 
Chapter 02 modified
Chapter 02 modifiedChapter 02 modified
Chapter 02 modified
Jugal Doshi
 
Operating system enhancements to prevent misuse of systems
Operating system enhancements to prevent misuse of systemsOperating system enhancements to prevent misuse of systems
Operating system enhancements to prevent misuse of systems
Dayal Dilli
 
Xen and the Art of Virtualization
Xen and the Art of VirtualizationXen and the Art of Virtualization
Xen and the Art of Virtualization
Susheel Thakur
 
Android Mind Reading: Android Live Memory Analysis with LiME and Volatility
Android Mind Reading: Android Live Memory Analysis with LiME and VolatilityAndroid Mind Reading: Android Live Memory Analysis with LiME and Volatility
Android Mind Reading: Android Live Memory Analysis with LiME and Volatility
Joe Sylve
 

Similar to XenTT: Deterministic Systems Analysis in Xen (20)

Xen in Safety-Critical Systems - Critical Summit 2022
Xen in Safety-Critical Systems - Critical Summit 2022Xen in Safety-Critical Systems - Critical Summit 2022
Xen in Safety-Critical Systems - Critical Summit 2022
 
Process and Threads in Linux - PPT
Process and Threads in Linux - PPTProcess and Threads in Linux - PPT
Process and Threads in Linux - PPT
 
Efficient Bytecode Analysis: Linespeed Shellcode Detection
Efficient Bytecode Analysis: Linespeed Shellcode DetectionEfficient Bytecode Analysis: Linespeed Shellcode Detection
Efficient Bytecode Analysis: Linespeed Shellcode Detection
 
Larson Macaulay apt_malware_past_present_future_out_of_band_techniques
Larson Macaulay apt_malware_past_present_future_out_of_band_techniquesLarson Macaulay apt_malware_past_present_future_out_of_band_techniques
Larson Macaulay apt_malware_past_present_future_out_of_band_techniques
 
Chapter 02 modified
Chapter 02 modifiedChapter 02 modified
Chapter 02 modified
 
You suck at Memory Analysis
You suck at Memory AnalysisYou suck at Memory Analysis
You suck at Memory Analysis
 
Operating system enhancements to prevent misuse of systems
Operating system enhancements to prevent misuse of systemsOperating system enhancements to prevent misuse of systems
Operating system enhancements to prevent misuse of systems
 
Jvm memory model
Jvm memory modelJvm memory model
Jvm memory model
 
Lect02
Lect02Lect02
Lect02
 
Xen and the Art of Virtualization
Xen and the Art of VirtualizationXen and the Art of Virtualization
Xen and the Art of Virtualization
 
Os lectures
Os lecturesOs lectures
Os lectures
 
Practical Windows Kernel Exploitation
Practical Windows Kernel ExploitationPractical Windows Kernel Exploitation
Practical Windows Kernel Exploitation
 
Bridging the Semantic Gap in Virtualized Environment
Bridging the Semantic Gap in Virtualized EnvironmentBridging the Semantic Gap in Virtualized Environment
Bridging the Semantic Gap in Virtualized Environment
 
Android Mind Reading: Android Live Memory Analysis with LiME and Volatility
Android Mind Reading: Android Live Memory Analysis with LiME and VolatilityAndroid Mind Reading: Android Live Memory Analysis with LiME and Volatility
Android Mind Reading: Android Live Memory Analysis with LiME and Volatility
 
Ch3-2
Ch3-2Ch3-2
Ch3-2
 
Linux introduction
Linux introductionLinux introduction
Linux introduction
 
PerfUG 3 - perfs système
PerfUG 3 - perfs systèmePerfUG 3 - perfs système
PerfUG 3 - perfs système
 
Operating system 22 threading issues
Operating system 22 threading issuesOperating system 22 threading issues
Operating system 22 threading issues
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace
 
Multithreaded Programming Part- III.pdf
Multithreaded Programming Part- III.pdfMultithreaded Programming Part- III.pdf
Multithreaded Programming Part- III.pdf
 

More from The Linux Foundation

More from The Linux Foundation (20)

ELC2019: Static Partitioning Made Simple
ELC2019: Static Partitioning Made SimpleELC2019: Static Partitioning Made Simple
ELC2019: Static Partitioning Made Simple
 
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
 
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
 
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
 
XPDDS19 Keynote: Unikraft Weather Report
XPDDS19 Keynote:  Unikraft Weather ReportXPDDS19 Keynote:  Unikraft Weather Report
XPDDS19 Keynote: Unikraft Weather Report
 
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
 
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, XilinxXPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
 
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
 
XPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
XPDDS19: Memories of a VM Funk - Mihai Donțu, BitdefenderXPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
XPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
 
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng...
OSSJP/ALS19:  The Road to Safety Certification: Overcoming Community Challeng...OSSJP/ALS19:  The Road to Safety Certification: Overcoming Community Challeng...
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng...
 
OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
 OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making... OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
 
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, CitrixXPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
 
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltdXPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
 
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
 
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&DXPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
 
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM SystemsXPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
 
XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...
XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...
XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...
 
XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...
XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...
XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...
 
XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...
XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...
XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...
 
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSEXPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

XenTT: Deterministic Systems Analysis in Xen

  • 1. XenTT: Deterministic Systems Analysis in Xen Anton Burtsev, David Johnson, Chung Hwan Kim, Mike Hibler, Eric Eide, John Regehr University of Utah
  • 2. • Record execution history of a guest VM • Recreate it in an instruction-accurate way 2
  • 3. • Execution replay is a right way to analyze systems • We need a practical tool! 3
  • 5. Determinism CPU is deterministic Execution history Interrupt, device I/O 5
  • 6. Recording • Determinism of the execution environment • Instruction-accurate position of events Event log 6
  • 7. Instruction-accurate position of events label: ... mov • Number of instructions shr since boot mov • Intel has a hardware rep movsl counter test • It’s not accurate jne label ... • Hardware instruction counter Position = { EIP, branch #, ECX} • Preempt execution of a system at the same instruction • Hardware instruction counter + single-stepping 7
  • 9. Nondeterministic events Simple Model System = memory pages + registers Events = memory updates (time, device I/O, system calls) Control flow updates = registers + stack (interrupts, events) 9
  • 10. Some examples • Instruction emulation (e.g. cpuid, rdtsc, in/out) • Return the values of the original run • Hypercalls • Re-execute to ensure determinism of the hypervisor • Time • Shared info page + rdtsc • Exceptions • Deterministic, re-execute • Interrupts • Force re-execution of the interrupt frame (bounce frame) code in entry.S • Shared info updates • Replay original values • Memory • Shadow page tables 10
  • 11. 11
  • 13. Device interposition • Devd ensures determinism of updates to the guest’s shared ring buffers 13
  • 14. Replay touches many parts of Xen • Device discovery • Devd implements a concept of a device bus • Discovers new devices in Xenstore • Binds new devices with drivers • Xenstore transactions • During replay, transactions from replayed guest can’t fail • They will not be re-executed • Out-of-order device responses • Disk and network responses can arrive out-of-order • Disk logging • Disk payload is deterministic • LVM snapshots • Network logging • In-kernel logging of the network payload 14
  • 16. Are we sure that executions identical? • Intel branch store trace facility • Record all taken branches in a memory buffer • Compare original and replay runs 16
  • 17. Analysis Engine and Virtual Machine Introspection 17
  • 20. Performance model • Account for effects of replay • Translate performance between original and replay runs • Re-execution approach to performance Virt cntr = Virt cntrstart + ∆ (Real cntr) 20
  • 22. NFS request processing path • How much time requests spend in each subsystem? • Request tracking • Address of the kernel data structure is a unique identifier • Join identifiers when requests move between subsystems 22
  • 23. Execution context tracking • Execution context • Context switches • schedule.switch_tasks • User/kernel • System call transitions • system_call • Interrupts and exceptions • do_IRQ • do_pagefault • do_* (divide_error, debug, nmi, int3...) • Make analysis context aware • Filter probes by context, e.g. • All pagefaults from the process “foo” 23
  • 24. Control Flow Integrity (CFI) • Simple CFI model: • Returns should match calls • Dynamic calls are “sane”, e.g. within kernel address space • Detect ROP attacks, stack smashing, etc. 24
  • 25. Execution trace sys_sendfile do_sendfile fget_light • CFI records a trace of rw_verify_area fget_light function calls rw_verify_area shmem_file_sendfile • sys_sendfile is do_shmem_file_read shmem_getpage the last system call find_lock_page radix_tree_lookup before control shmem_recalc_inode shmem_swp_alloc flow jumps to 0x0 shmem_swp_entry kmap_atomic __kmap_atomic page_address kunmap_atomic find_get_page radix_tree_lookup file_send_actor sock_sendpage UNKNOWN FUNCTION (addr:0x00000000) 25
  • 26. Intrusion backtracking • Track accesses to “/etc/passwd” • Probe sys_open • Filter by file name • Find process ID, branch counter 26
  • 27. Intrusion backtracking: pass 1 • Process or it’s parent escalated privileges • Watch write accesses to &task->uid • Filter by parents of the offending process 27
  • 28. Intrusion backtracking: pass 1 • Find the syscall inside which privileges are escalated • Probe sys_* - all system call entry and exit points • Filter by the offending process ID 28
  • 29. Intrusion backtracking: pass 1 • Privilege escalation is a CFI violation • Start CFI analysis from the last system call • Find %EIP at which CFI is violated, and location of the shell code (0x0) 29
  • 30. Intrusion backtracking: pass 1 • Find at which point address 0x0000000 gets mapped • Probe do_page_fault 30
  • 31. More mechanisms • Execution traces • BTS trace of all taken branches • Instruction traces • Memory (variable) access traces • Intersect with the execution trace • See where variables get accessed 31
  • 33. • 32-bit x86 PV-guests • xen-unstable near v3.0.4 • We rely on a working shadow page tables • 1-CPU time-traveling guests • No SMP replay • Dom0 and Xen are SMP of course • Test machine • 4 cores • 1Gbps network • 130 MB/s disks 33
  • 34. CPU 140 120 100 80 Score 60 40 Xen XenTT 20 0 34
  • 35. Network throughput 140 120 100 80 MB/s Xen 60 XenTT XenTT (32MB buffers) 40 20 0 TCP Send TCP Receive 35
  • 36. Network delay 0.16 0.14 0.12 0.1 ms 0.08 Xen 0.06 XenTT 0.04 0.02 0 domU to dom0 domU to external host Ping 36
  • 37. Disk I/O 140 120 100 80 MB/s Xen 60 XenTT 40 20 0 Serial write Serial read 37
  • 38. Raw Compressed (gzip) Linux boot Event log 4.3 GB 0.8 GB Idle overnight (14 hours) Event log 6.2 GB 1.6 GB Growth rate 440 MB/hour 114 MB/hour (2.7GB/day) TCP receive (1.63 GB data stream) Event log 1.76 GB 342 MB Payload log 1.79 GB Payload dependent Disk write (4 GB file) Event log 1.8 GB 350 MB Disk read (4 GB file) Event log 0.59 GB 38
  • 40. Scaling development • Extending Xen with BTS support • Debug crashes in Xen, and dom0 • Execution comparison tools • BTS traces to understand what went wrong • Support for resolving symbols • Run-time comparison tools • Compare guest’s state between original and replay runs • Trace from all parts of your system • Xen, dom0, domU • Support performance tracing • Xentrace messages 40
  • 41. What we didn’t predict • I/O delay goes up • Not sure if Linux has adequate low-latency user-level processing support • Maybe need an in-kernel interposition component • Branch counters are fragile • Our code works on several server CPUs • Fails on a laptop with the CPU from the same model/family line 41
  • 43. Practical replay analysis is feasible • Performance overheads are reasonable • Realistic systems • Realistic workloads • Minor setup costs (just install Xen) • Analysis engine is an amazing tool • And we’re growing it • We need your help to port it upstream • Porting effort • Shadow memory for PV guests • Support for HVM guests 43
  • 44. Thank you. All code is GPLv2 and will be available soon (available on-request now). Questions: aburtsev@flux.utah.edu 44
  • 46. Execution environment Deterministic • CPU Machine • Virtual hardware • Memory System • Disk • Software External Events • External I/O 46
  • 47. How do we find nondeterministic events? 47
  • 49. Reading a local variable bsym_skb = target_lookup_sym(probe->target, "netif_poll.skb", ".", NULL, flags); lval_skb = bsymbol_load(bsym_skb, flags); skb_addr = *(unsigned long*) lval_skb->buf; 49
  • 50. Controlled re-execution or replay • Types of non-deterministic events • Synchronous • Hypercalls • Asynchronous • Interrupts • Best effort • Time updates • Branch counters • The biggest problem of this solution 50