Xen in Safety-Critical Systems - Critical Summit 2022

  1. Xen in Safety-Critical Systems • Stefano Stabellini & Bertrand Marquis • Critical Summit NA 2022
  2. What is safety? • “Safety critical embedded software applications are developed for systems whose failures contribute to hazards in the system for safety of life” • Safety certifications (ISO 26262) • Strict coding guidelines (MISRA C) • Strict testing and documentation requirements
  3. Why Xen matters for safety • It is common to have a mix of critical and non-critical components (mixed-criticality) • Xen has been enabling mixed-criticality workloads for years • Componentization • Highly secure environments • Isolate critical apps from non-critical apps • Separate the work environment from the personal environment on a laptop, e.g. Qubes OS and OpenXT • Xen as a static partitioning tool for embedded • from industrial to medical and automotive • the real-time (critical) domain must be isolated from the non-critical • the real-time domain cannot miss any deadlines • Safety-critical systems are mixed-criticality systems • [Diagram: Xen hosting one critical and two non-critical domains]
  4. Why Xen is a good fit for safety • Small codebase (less than 50K LOC) • Micro-kernel architecture • only the hypervisor requires privileges • dom0 is now optional • Xen supports disaggregation and driver domains: large amounts of code run unprivileged • No need for a “dom0” privileged environment • Linux is not required • Supports real-time and cache coloring • Thorough review process • Thorough security process • [Diagram: Xen hosting a Zephyr domain and a Linux domain]
  5. Example Industrial • Xen static partitioning configuration with dom0less • 2 domains with hardware directly assigned • no dom0 • 1 Linux VM, networking and cloud • 1 Zephyr VM, motor controller application with real-time requirements • [Diagram: Xen hosting Linux (networking, cloud APIs) and Zephyr (real-time controller)]
  6. Example Automotive • Xen static partitioning configuration with dom0less and also dom0 for monitoring • 4 domains created statically at boot time • 1 minimal dom0 VM (Zephyr) for system monitoring • 1 Linux VM, infotainment • 1 Zephyr VM, real-time sensor processing • 1 Instrument Cluster VM • [Diagram: Xen hosting Zephyr mini-Dom0, Linux (infotainment), Instrument Cluster, and Zephyr (real-time sensors)]
  7. Real-time and Xen • What is real-time? • Real-time is not fast • “I will answer on average in 5 ms” is not real-time • Real-time is a guarantee on the maximum time to respond • “I will give a response to an event in no more than 100 ms” • Why does it matter in a safety context? • Safety usually equals time constraints • If I detect a wall, stop before the wall • How long to actuate the brakes • If longer …. • [Chart: histogram of response times for “My system”; buckets <1 ms, 1-5 ms, 5-95 ms, 95-100 ms, >100 ms; vertical axis 0 to 80]
  8. Real-time and Xen • What is real-time? • Real-time is not fast • “I will answer on average in 5 ms” is not real-time • Real-time is a guarantee on the maximum time to respond • “I will give a response to an event in no more than 100 ms” • Why does it matter in a safety context? • Safety usually equals time constraints • If I detect a wall, stop before the wall • How long to actuate the brakes • If longer …. • In safety software: Worst Case Execution Time (WCET) • Established by demonstration, not by test • Usually, the worst case is impossible to trigger by test • For Xen, several subjects are being investigated
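
The difference between an average and a guaranteed maximum can be made concrete with a small Python sketch. The sample values are invented for illustration; only the 5 ms average and the 100 ms bound come from the slide.

```python
# Hypothetical response times in milliseconds (invented for illustration):
# 39 fast responses and one slow outlier.
samples_ms = [2] * 39 + [122]

DEADLINE_MS = 100  # the "no more than 100 ms" guarantee from the slide


def mean(values):
    """Arithmetic mean of the observed response times."""
    return sum(values) / len(values)


def meets_deadline(values, deadline_ms):
    """True only if every observed response is within the deadline."""
    return max(values) <= deadline_ms


print(f"average response: {mean(samples_ms):.1f} ms")
print(f"worst observed:   {max(samples_ms)} ms")
print(f"deadline met:     {meets_deadline(samples_ms, DEADLINE_MS)}")
```

This hypothetical system averages 5 ms yet misses the bound once, so it is not real-time. A measured maximum is also only a lower bound on the WCET, which is why the slide asks for demonstration rather than test.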
  9. Real-time - Interrupts • Interrupt latency • Maximum time until a guest receives an irq • Depends on the time required by the guest • Time depends on the hardware • Context of the analysis • Arm64 • Guest alone on its core • Zephyr as guest • Timer interrupt • Procedure • Code analysis/inspection • Confirm with tracing on a real target • [Diagram: timer irq enters Xen, which forwards it to the Zephyr timer irq handler]
  10. Real-time - Interrupts • Overall result: 1090 instructions • Save guest context (cpu and irq controller): 356 • Xen irq handler: 144 • Xen virtual timer: 360 • Xen exit irq handler: 97 • Restore guest: 133 • Assumptions and limitations • No hypercall from the real-time guest • No interaction with guests on other cores • After the Xen init phase • Fixed configuration (guest, communication, …etc) • [Diagram: timer irq path from Xen to the Zephyr timer irq handler, with stages: enter irq handler, save guest ctxt, virtual timer, exit irq handler, restore guest ctxt]
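
The per-phase counts above add up to the stated total, and an instruction count becomes a time bound only once a worst-case cycles-per-instruction figure and a clock rate are fixed. The sketch below checks the sum and shows that conversion. The clock rate and cycles-per-instruction values are hypothetical placeholders, not figures from the analysis.

```python
# Instruction counts per phase, taken from the slide.
PHASES = {
    "save guest context": 356,
    "Xen irq handler": 144,
    "Xen virtual timer": 360,
    "Xen exit irq handler": 97,
    "restore guest": 133,
}


def total_instructions(phases):
    """Sum of the instruction counts over all phases."""
    return sum(phases.values())


def latency_bound_us(instructions, cycles_per_instruction, clock_hz):
    """Time bound in microseconds for a given worst-case CPI and clock rate."""
    return instructions * cycles_per_instruction / clock_hz * 1e6


total = total_instructions(PHASES)

# ASSUMPTION: 1 GHz clock and a worst case of 2 cycles per instruction.
# Both numbers are hypothetical and chosen only to show the arithmetic.
bound_us = latency_bound_us(total, cycles_per_instruction=2, clock_hz=1e9)

print(f"total instructions: {total}")
print(f"latency bound under the assumed parameters: {bound_us:.2f} us")
```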
  11. Real-time - Interrupts • Issues and future work • IPI interrupts and Xen RCU tasks • Limited to a guest isolated on its own core • WFI/WFE handling disabled (power consumption) • No PV driver • No hypercalls in the real-time guest • Status: Full analysis is public, link
  12. Real-time – MPU support • MMU hard to use for real-time • TLB miss -> page table walk • Influence of other cores (TLB sync) • Influence of other guests (TLB miss) • [Diagram: MMU translating VA to PA through a TLB and page-table levels L1, L2, L3]
  13. Real-time – MPU support • MMU hard to use for real-time • TLB miss -> page table walk • Influence of other cores (TLB sync) • Influence of other guests (TLB miss) • MPU • No translation (1 to 1 mapping) • Register based (no page tables) • No cache effect • [Diagram: MPU checking VA to PA against region registers]
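
Why a register-based MPU is easier to bound than an MMU can be illustrated with a toy model. This is a simplified sketch, not a description of any real MPU: the `Region` type, the region count, and the memory layout are all hypothetical.

```python
from dataclasses import dataclass

MAX_REGIONS = 16  # hypothetical number of MPU region registers


@dataclass(frozen=True)
class Region:
    base: int       # first address covered by the region
    limit: int      # last address covered (inclusive)
    writable: bool  # whether writes are permitted


def mpu_access_ok(regions, addr, is_write):
    """Check an access against a fixed, small set of region registers.

    The loop is bounded by MAX_REGIONS, so the check has a fixed upper
    bound on its cost, unlike a page-table walk after a TLB miss.
    The address is never translated: physical address == virtual address.
    """
    if len(regions) > MAX_REGIONS:
        raise ValueError("more regions than MPU registers")
    for region in regions:
        if region.base <= addr <= region.limit:
            return region.writable or not is_write
    return False  # no region matches: the access faults


# Hypothetical layout for an RTOS guest.
rtos_regions = [
    Region(base=0x1000_0000, limit=0x100F_FFFF, writable=False),  # code
    Region(base=0x2000_0000, limit=0x200F_FFFF, writable=True),   # data
]

print(mpu_access_ok(rtos_regions, 0x2000_0010, is_write=True))
```

Because the region set lives entirely in registers, no other core or guest can evict part of it, which is the property the slide contrasts with TLB behaviour.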
  14. Real-time and Xen – MPU support • Arm Cortex-R (Cortex-R82) • Both MMU and MPU • EL2 (Xen): MPU • For Xen • Control the memory a guest is allowed to access • EL1 (RTOS): MPU • Real-time • EL1 (Linux): MMU • Not real-time • Cohabitation of MPU and MMU guests • Xen and RTOS real-time • Linux or other non-real-time OS running on the same system • Status: Proof of concept available • Upstreaming to Xen in progress • [Diagram: Xen hosting Linux and Zephyr]
  15. Real-time – Cache coloring • [Diagram: Core 1 to Core 4, each with a private L1 cache, sharing one L2 cache in front of DDR]
  17. Real-time – Cache coloring • CPU clusters often share an L2 cache • Interference via the L2 cache affects performance • App0 running on CPU0 can cause cache entry evictions, which affect App1 running on CPU1 • App1 running on CPU1 could miss a deadline due to App0’s behavior • It can happen between apps running on the same OS and between VMs on the same hypervisor • Hypervisor solution: cache partitioning, AKA cache coloring • Each VM gets its own allocation of cache entries • No shared cache entries between VMs • Allows real-time apps to run with deterministic IRQ latency • 3 µs IRQ latency • [Diagram: Core 1 to Core 4 sharing a partitioned L2 cache in front of DDR]
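
The general idea behind page-based cache coloring can be sketched in a few lines: pages that map to the same cache sets share a "color", and giving each VM a disjoint set of colors gives it a disjoint slice of the cache. The code below is a minimal illustration of that arithmetic, not Xen's implementation. The cache geometry, page size, and VM assignment are assumptions.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def num_colors(cache_size, ways, page_size=PAGE_SIZE):
    """Number of page colors: one way of the cache divided into pages."""
    way_size = cache_size // ways
    return way_size // page_size


def page_color(phys_addr, colors, page_size=PAGE_SIZE):
    """Color of the page containing phys_addr."""
    return (phys_addr // page_size) % colors


# Hypothetical L2 cache: 1 MiB, 16-way set associative.
COLORS = num_colors(cache_size=1 << 20, ways=16)

# Hypothetical assignment: disjoint color sets per VM.
vm_colors = {
    "zephyr": set(range(0, 4)),
    "linux": set(range(4, 16)),
}


def page_allowed(vm, phys_addr):
    """A VM may only be given pages whose color belongs to it."""
    return page_color(phys_addr, COLORS) in vm_colors[vm]


print(f"colors available: {COLORS}")
print(f"color of page at 0x30005000: {page_color(0x3000_5000, COLORS)}")
```

With disjoint color sets, memory traffic from the Linux VM cannot evict cache lines that belong to the Zephyr VM's pages, which is what makes the latency of the real-time VM predictable.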
  18. Static configuration with Xen • What is it? • Defining the whole system (guests and communication) statically • Why does it matter in a safety context? • No random behaviour • Same after reboot (target, task, guest, …etc) • Example: application on the same core at the same address in memory • Reduce the amount of testing • Limit possibilities • Example: only used functions, compile out the rest • No dynamic behaviour • Limit complexity • Example: allocation on boot or static, no free • Conclusion: reduce certification costs
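
The "allocation on boot or static, no free" rule can be illustrated with a toy boot-time allocator. This is a sketch of the principle only and not Xen code. The class name, pool address, and request sizes are hypothetical.

```python
class BootAllocator:
    """Bump allocator usable only during boot; there is deliberately no free()."""

    def __init__(self, base, size):
        self._next = base
        self._end = base + size
        self._sealed = False

    def alloc(self, size, align=8):
        """Return the address of a new block; align must be a power of two."""
        if self._sealed:
            raise RuntimeError("allocation after boot is not allowed")
        start = (self._next + align - 1) & ~(align - 1)
        if start + size > self._end:
            raise MemoryError("static pool exhausted")
        self._next = start + size
        return start

    def seal(self):
        """Mark the end of the boot phase."""
        self._sealed = True


def boot():
    """One simulated boot: the same requests in the same order."""
    heap = BootAllocator(base=0x4000_0000, size=0x1000)
    addresses = [heap.alloc(24), heap.alloc(100, align=64), heap.alloc(8)]
    heap.seal()
    return heap, addresses


_, first_boot = boot()
_, second_boot = boot()
print([hex(a) for a in first_boot])
print("identical across boots:", first_boot == second_boot)
```

Since nothing is ever freed and nothing is allocated after boot, every object sits at the same address on every boot, which is the "same after reboot" property the slide lists.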
  19. Static configuration – dom0less • Define the system during the design phase • How many guests • What characteristics • memory, device access, cpus • Create them directly on boot • Defined in configuration (device-tree) • Advantages for safety • No need for a complex dom0 • No dependency on Linux (not certifiable) • Faster boot • Guests start directly on boot • Reduce system complexity • No dynamic guest creation • Status: available • [Diagram: Xen hosting Linux and Zephyr] • Device tree:
         domU1 {
             #address-cells = <1>;
             #size-cells = <1>;
             compatible = "xen,domain";
             memory = <0 0x20000>;
             cpus = <1>;
             vpl011;
             module@2000000 {
                 compatible = "multiboot,kernel", "multiboot,module";
                 reg = <0x2000000 0xffffff>;
                 bootargs = "console=ttyAMA0";
             };
         };
  20. Static configuration – memory • Define the address and size of each memory region • For guest memory • For the Xen heap • Internal Xen allocation • For the Xen guest heap • Xen allocation related to a guest • Defined in configuration (device-tree) • Advantages for safety • System or guest identical upon reboot • Reduce possible interference • A guest cannot impact another • Adding a guest in the future • Current guests unchanged • Incremental certification • Status: available in next Xen release • [Diagram: Xen hosting Linux and Zephyr, each with its own region of memory] • Device tree:
         domU1 {
             compatible = "xen,domain";
             #address-cells = <0x2>;
             #size-cells = <0x2>;
             cpus = <2>;
             memory = <0x0 0x80000>;
             #xen,static-mem-address-cells = <0x1>;
             #xen,static-mem-size-cells = <0x1>;
             xen,static-mem = <0x30000000 0x20000000>;
             ...
         };
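
Because every region is fixed at design time, the claim that a guest cannot impact another can be checked before the system ever boots. The sketch below is a hypothetical design-time check, not a Xen tool. It assumes one (base, size) region per domain; the `domU1` values are those of the `xen,static-mem` property above and the `domU2` values are invented.

```python
from itertools import combinations


def overlaps(a, b):
    """True if two (base, size) regions share at least one byte."""
    a_base, a_size = a
    b_base, b_size = b
    return a_base < b_base + b_size and b_base < a_base + a_size


def find_overlaps(static_mem):
    """Return sorted pairs of domain names whose regions overlap."""
    return sorted(
        (x, y)
        for x, y in combinations(sorted(static_mem), 2)
        if overlaps(static_mem[x], static_mem[y])
    )


good = {
    "domU1": (0x3000_0000, 0x2000_0000),  # xen,static-mem from the slide
    "domU2": (0x5000_0000, 0x1000_0000),  # hypothetical second guest
}
bad = dict(good, domU2=(0x4000_0000, 0x1000_0000))  # overlaps domU1

print("good layout overlaps:", find_overlaps(good))
print("bad layout overlaps: ", find_overlaps(bad))
```

Adding a future guest then amounts to adding one entry and re-running the check, while the existing entries stay untouched, which matches the incremental certification argument.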
  21. Static configuration – communication • Xenbus is too complex for safety • Need Dom0 or Linux • Drivers are complex • One guest accesses another guest's memory • Several components for safety (not only) • Static shared memory • Area of memory accessible by several guests • Defined in configuration (device-tree) • Static event channels • Solution to ping between 2 guests • Defined in configuration (device-tree) • Any protocol can be built on top • Status: Available in next Xen release • Linux support • Zephyr support • [Diagram: Xen hosting Linux and Zephyr, connected through a shared memory (SHM) area and an event channel] • Device tree:
         shared-mem@10000000 {
             compatible = "xen,domain-shared-memory-v1";
             role = "owner";
             xen,shm-id = <0x0>;
             xen,shared-mem = <0x10000000 0x10000000 0x10000000>;
         };
         domU1 {
             ….
             domU1-shared-mem@10000000 {
                 compatible = "xen,domain-shared-memory-v1";
                 role = "borrower";
                 xen,shm-id = <0x0>;
                 xen,shared-mem = <0x10000000 0x10000000 0x50000000>;
             };
             ….
         };
  22. Static configuration - cpupools • Define which core(s) are usable by which guest • Xen cpupool: a pool of cores • One or several cores • Scheduler to use • A guest can be assigned to a cpupool • Several guests can be in one cpupool • Scheduler independent between cpupools • A core can only be assigned to one cpupool • Defined in configuration (device-tree) • Advantages for safety • Static core assignment • Isolation between guests • Scheduler per cpupool • Status: available in next Xen release • [Diagram: Xen with Pool-0 and Pool-1 hosting Linux1, Linux2, and Zephyr] • Device tree:
         cp0: cpupool0 {
             compatible = "xen,cpupool";
             cpupool-cpus = <&a72_0 &a72_1>;
             cpupool-sched = "credit2";
         };
         domU1 {
             #address-cells = <1>;
             #size-cells = <1>;
             compatible = "xen,domain";
             memory = <0 0x20000>;
             cpus = <1>;
             domain-cpupool = <&cp0>;
             …
         };
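
The rule that a core can only be assigned to one cpupool is another property that a static configuration lets you check at design time. The sketch below is a hypothetical checker, not part of Xen. The pool contents are plain Python data, with `a72_0` and `a72_1` taken from the device tree above and the remaining core names invented.

```python
from collections import defaultdict


def shared_cores(pools):
    """Map each core that appears in more than one cpupool to those pools."""
    owners = defaultdict(list)
    for pool in sorted(pools):
        for core in pools[pool]:
            owners[core].append(pool)
    return {core: names for core, names in owners.items() if len(names) > 1}


valid = {
    "cpupool0": ["a72_0", "a72_1"],  # cores from the slide
    "cpupool1": ["a53_0", "a53_1"],  # hypothetical second pool
}
invalid = {
    "cpupool0": ["a72_0", "a72_1"],
    "cpupool1": ["a72_1", "a53_0"],  # a72_1 is assigned twice
}

print("valid config conflicts:  ", shared_cores(valid))
print("invalid config conflicts:", shared_cores(invalid))
```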
  23. Safety Certification Activities • Xen can already be safety-certified, but at what cost? • It has been done already • Safety experts have analyzed the code and deemed it safety-certifiable • Requires significant downstream work • GOAL: make it easier for users to deploy Xen in safety environments • "safety-certifiable", not safety-certified • users can fill the gaps • we can be flexible: it is OK to decide not to follow certain rules • let's focus on what we do best: robustness of the code • Clarity: What does Xen support? What's missing? • Users should be able to estimate precisely the work required
  24. Code First • Robustness and safety of the code • Code is the Xen Project's primary output • The most important item for safety certifications is the robustness/safety of the codebase • Documentation, requirements, and tests can be more easily outsourced • Main code safety aspects: • Coding style and MISRA C rules • Determinism: deterministic IRQ handling and memory allocations • Enhanced Kconfig for a smaller codebase (less to certify) • Why MISRA C? • A de facto standard in all industry sectors • Maintained and backed by an authoritative organization (MISRA consortium) • A pragmatic approach and a perfect match for Xen: MISRA documents clearly state that code quality should never be sacrificed for compliance (deviation process)
  25. MISRA C: status • MISRA C Tailoring completed: ~100 rules considered relevant for Xen • MISRA C Rules adoption in progress ~15/100 rules • Xen is actually already following many MISRA C rules, just not officially • Add Xen Rules we already follow to CODING_STYLE • Discuss the others • Decide we follow a rule, add it to CODING_STYLE, check for it using MISRA C scanners • Decide we follow a rule with deviations • deviations are intentional and documented exceptions to the rule • document deviations with in-code or out-of-code comments so that MISRA C scanners will “ignore” them • check for the rule automatically using MISRA C scanners • Not follow the rule and not scan for it • cannot be scanned automatically by static analyzers
  26. MISRA C: status • Benefits: • Static code analyzers available to check for the rules, from ECLAIR to cppcheck • Check individual patches in advance, before review even starts • Ease code reviews & reduce maintainers' work • Improve code quality • Reduce defects • Working with Roberto Bagnara and Bugseng ECLAIR • Improve existing coding style and coding conventions in Xen • Improve safety of the code • Improve code security – defensive programming • Widen compiler compatibility • Ensure we do not violate the C99 standard • Ensure we do not unknowingly use language extensions that may not be available in other compilers
  27. Tooling • CPPCheck • Available to any developer without license • Good for pre-submission checks • Open Source • Good coverage but not 100% • ECLAIR • 100% coverage of MISRA C:2012 with very high accuracy • Automatically adapts to the toolchain to capture all implementation-defined aspects of the language • Very detailed and actionable reports • Made publicly available by BUGSENG at http://eclairit.com
  28. ECLAIR • Enter the system by clicking “See ECLAIR in action”
  29. Future Work • Deterministic interrupt handling code path • Deterministic memory allocations at system boot • Further reducing code size via Kconfig • Complete MISRA C rules adoption and fixing violations • Documentation • Testing • Xen Testing Framework “XTF” • Gitlab-CI
  30. Questions?