The art of virtualizing cache
Julien Grall <julien.grall@arm.com>
Xen Developer Summit 2018
© 2018 Arm Limited
Cache coherency on Arm
Cache coherent architecture
Scales from single CPU to massive SMP systems
Implementer chooses to offer caches that are
visible to software
invisible to software
... or any point between these two options
Enough abstraction to cope with these differences
Allows different PPA (Performance, Power, Area) points:
Running a VM on your smart watch? Easy.
The same VM on your $15K server? Sure.
The architecture is designed for maximum flexibility.
Cache architecture
(Modified) Harvard architecture
Multiple levels of caching (with snooping)
Separate I-cache and D-cache (no snooping between I and D)
Either PIPT or non-aliasing VIPT for D-cache
Meeting at the Point of Unification (PoU)
Controlled by attributes in the page tables
Memory type (normal, device)
Cacheability, Shareability
Two Enable bits (I and C)
Actually not really an Enable switch
More like a global "attribute override"
Generally invisible to normal software
With a few key exceptions
An example is executable code loading / generation
Interacting with caches
The Arm architecture offers the usual (mostly) privileged operations to interact
with caches:
Invalidate (I & D-cache)
Clean (D-cache)
Clean + Invalidate (D-cache)
Cache maintenance by Virtual Address
Cache maintenance by Set/Way
Set/Way operations are local to a CPU
Will break if more than one CPU is active
No ALL operation on the D side
Iteration over Sets/Ways
Only for bring-up/shutdown of a CPU
Not all the levels have to implement Set/Way
System caches only respond to VA operations
Set/Way operations are impossible to virtualize
VA operations are the only way to perform cache maintenance outside of CPU bring-up/teardown
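To make "iteration over Sets/Ways" concrete, here is a minimal sketch of how the operand of a data-cache Set/Way operation is built and how many operations one cache level needs. The field layout (way in the top bits, set above the line offset, level in bits [3:1]) follows the architecture's Set/Way operand format; the function names and the cache geometry used in the example are illustrative assumptions, not values from the talk.

```python
def setway_operand(level, set_index, way, num_ways, line_bytes):
    """Build the 32-bit operand for one Set/Way operation (illustrative helper)."""
    line_shift = (line_bytes - 1).bit_length()  # L = log2(line length in bytes)
    way_bits = (num_ways - 1).bit_length()      # A = ceil(log2(associativity))
    way_shift = 32 - way_bits                   # way field sits in the top A bits
    operand = (way << way_shift) | (set_index << line_shift) | ((level - 1) << 1)
    return operand & 0xFFFFFFFF


def all_setway_operands(level, num_sets, num_ways, line_bytes):
    """There is no ALL operation on the D side: one operand per (way, set) pair."""
    return [
        setway_operand(level, s, w, num_ways, line_bytes)
        for w in range(num_ways)
        for s in range(num_sets)
    ]


# Assumed geometry: a 4-way L1 with 256 sets and 64-byte lines.
ops = all_setway_operands(level=1, num_sets=256, num_ways=4, line_bytes=64)
print(len(ops), hex(setway_operand(1, 1, 1, 4, 64)))
```

Note that nothing in the operand names an address, which is why a cache level that does not implement Set/Way has no way to take part in the operation.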
Introducing Stage-2 translation
Virtual machines add their share of complexity:
Second stage of page tables (equivalent to EPT on x86)
Second set of memory attributes
Xen always configures RAM cacheable at Stage-2
These memory attributes get combined with those controlled by the guest:
The strongest memory type wins
Device vs Normal memory
The least cacheable memory attribute wins
Non-cacheable is always enforced
And the hypervisor doesn’t have much control over it
Some global controls, but nothing fine-grained
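The two combining rules above can be modelled in a few lines. This is a deliberately simplified sketch: it ignores Device subtypes, separate inner/outer attributes, shareability, and any global override controls, and the names `Cacheability` and `combine` are made up for the illustration.

```python
from enum import IntEnum


class Cacheability(IntEnum):
    # Ordered from least to most cacheable so that min() picks the winner.
    NON_CACHEABLE = 0
    WRITE_THROUGH = 1
    WRITE_BACK = 2


def combine(stage1, stage2):
    """Combine (memory_type, cacheability) pairs from Stage-1 and Stage-2.

    memory_type is "device" or "normal". Simplified model, see lead-in.
    """
    s1_type, s1_cache = stage1
    s2_type, s2_cache = stage2
    # Strongest memory type wins: Device beats Normal.
    if "device" in (s1_type, s2_type):
        return ("device", Cacheability.NON_CACHEABLE)
    # Least cacheable attribute wins.
    return ("normal", min(s1_cache, s2_cache))


# Xen maps RAM cacheable at Stage-2, yet a guest non-cacheable mapping still wins.
xen_stage2 = ("normal", Cacheability.WRITE_BACK)
guest_uncached = ("normal", Cacheability.NON_CACHEABLE)
print(combine(guest_uncached, xen_stage2))
```

In this model the hypervisor's cacheable Stage-2 mapping cannot upgrade a guest mapping, which is the "non-cacheable is always enforced" point on the slide.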
Linux 32-bit boot example
Booting a 32-bit guest on a 64-bit host (with an L3 system cache).
The (compressed) kernel is in RAM
The embedded decompressor:
enables the caches
decompresses the image
turns the cache off,
flushes it by Set/Way,
and jumps to the payload...
What could possibly go wrong?
System caches do not implement Set/Way ops
So our guest code sits in L3, while fetching from RAM
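The failure can be reproduced with a toy model. The sketch below is not how real hardware is organised: it assumes a single CPU-private cache, a system cache that ignores Set/Way operations, and dictionaries standing in for memory. The class and method names are invented for the example.

```python
class ToySystem:
    """Toy model: CPU cache -> system cache (L3) -> RAM."""

    def __init__(self):
        self.ram = {}
        self.cpu_cache = {}     # honours Set/Way operations
        self.system_cache = {}  # does not implement Set/Way operations

    def cached_write(self, addr, value):
        self.cpu_cache[addr] = value

    def uncached_read(self, addr):
        # With caches off, the access goes straight to RAM.
        return self.ram.get(addr)

    def flush_by_set_way(self):
        # Dirty lines leave the CPU cache but stop at the system cache.
        self.system_cache.update(self.cpu_cache)
        self.cpu_cache.clear()

    def flush_by_va(self, addr):
        # A clean+invalidate by VA is honoured by every level on the way to RAM.
        if addr in self.cpu_cache:
            self.system_cache[addr] = self.cpu_cache.pop(addr)
        if addr in self.system_cache:
            self.ram[addr] = self.system_cache.pop(addr)


def boot(flush):
    system = ToySystem()
    system.ram[0x8000] = "stale"           # whatever was in RAM before
    system.cached_write(0x8000, "kernel")  # decompressor output, caches on
    flush(system)                          # caches turned off, then flushed
    return system.uncached_read(0x8000)    # jump to the payload


set_way_result = boot(lambda s: s.flush_by_set_way())
va_result = boot(lambda s: s.flush_by_va(0x8000))
print(set_way_result, va_result)
```

With the Set/Way flush the uncached fetch sees the old RAM contents, while the VA flush makes the decompressed image visible.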
Set/Way in virtualized environment
The guest cannot directly use set/way because of:
The presence of system caches on Arm64
The vCPU can be migrated to another pCPU at any time
The new pCPU cache may not be cleaned
How can we solve this?
We need to trap these ops and convert them into VA ops
Which means iterating over all the mapped pages
Good thing we’re only doing that at boot time!
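A rough sketch of what "convert them into VA ops" costs. The 4 KiB page size and 64-byte line size are assumptions for the arithmetic (real code would read the line size from the cache type register), and the helper names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page granule
LINE_SIZE = 64    # assumed D-cache line size in bytes


def va_clean_ops(mapped_pages):
    """Yield the address of every cache line in every mapped page."""
    for page_base in mapped_pages:
        for offset in range(0, PAGE_SIZE, LINE_SIZE):
            yield page_base + offset


def ops_for_guest(ram_bytes):
    """Number of by-VA operations needed to clean a fully mapped guest."""
    return (ram_bytes // PAGE_SIZE) * (PAGE_SIZE // LINE_SIZE)


one_page = list(va_clean_ops([0x1000]))
print(len(one_page), ops_for_guest(1 << 30))
```

Under these assumptions a single trapped Set/Way operation on a 1 GiB guest turns into millions of by-VA operations, which is why the frequency of the trap matters.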
Implementation of Set/Way in Xen
Xen and Set/Way today
Set/Way instructions are not trapped
The guest is directly acting on the cache
Potential cause of a heisenbug in Osstest
https://lists.xenproject.org/archives/html/xen-devel/2017-09/msg03191.html
All guests using Set/Way are unsafe on Xen
Linux 32-bit
UEFI
...
Cleaning guest memory
We need to iterate over each mapped page and clean it.
Any problems?
Guest memory is always mapped
Lots of pages to clean
32-bit Linux is using Set/Way during CPU bring-up
Bring-up is bound by a timeout
Pages are cleaned when first assigned to the guest
We need to clean only pages used since the last flush.
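One way to picture "only pages used since the last flush" is a tracker that records touched guest frames and forgets them once cleaned. This is a conceptual sketch only: the slides do not say how accesses are detected, so `record_access` is a stand-in for whatever mechanism the hypervisor uses, and all names are hypothetical.

```python
class DirtyTracker:
    """Remember which guest frames were used since the last flush."""

    def __init__(self):
        self._touched = set()

    def record_access(self, gfn):
        # Stand-in for the hypervisor noticing that the guest used this frame.
        self._touched.add(gfn)

    def flush(self, clean_page):
        """Clean only the recorded frames, then start a new tracking period."""
        to_clean = sorted(self._touched)
        for gfn in to_clean:
            clean_page(gfn)
        self._touched.clear()
        return len(to_clean)


cleaned = []
tracker = DirtyTracker()
for gfn in (0x10, 0x11, 0x10):
    tracker.record_access(gfn)
first = tracker.flush(cleaned.append)   # cleans the two distinct frames
second = tracker.flush(cleaned.append)  # nothing used since the last flush
print(first, second, cleaned)
```

The cost of a flush then scales with recent guest activity rather than with the size of guest RAM, which matters when the flush runs inside a time-bounded CPU bring-up.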
Trapping Set/Way instructions
Set/Way instructions usually happen:
In a batch of instructions
Before turning on/off caches
A potential approach to trapping would be:
On first Set/Way instruction
Enable trapping of VM instructions (e.g. HCR_EL2.TVM)
Do a full clean of the guest memory
Subsequent Set/Way instructions will be ignored until the cache is toggled
On cache toggling
Do a full clean of the guest memory
Turn off trapping of VM instructions
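The steps on this slide amount to a small state machine, sketched below. It is a model of the proposed flow, not Xen code: `trap_vm_regs` stands in for the HCR_EL2.TVM bit, and the class name, method names, and the `clean_guest_memory` callback are assumptions made for the illustration.

```python
class SetWayEmulator:
    """Model of the proposed trap flow for one guest."""

    def __init__(self, clean_guest_memory):
        self.trap_vm_regs = False  # stands in for HCR_EL2.TVM
        self._clean = clean_guest_memory

    def on_set_way(self):
        """Called when a Set/Way instruction traps."""
        if self.trap_vm_regs:
            return  # already cleaned for this batch; ignore until the toggle
        self.trap_vm_regs = True
        self._clean()

    def on_cache_toggle(self):
        """Called when the guest turns its caches on or off (only traps if armed)."""
        if not self.trap_vm_regs:
            return
        self._clean()
        self.trap_vm_regs = False


cleans = []
emu = SetWayEmulator(lambda: cleans.append("full clean"))
for _ in range(1024):      # a guest iterating over every set and way
    emu.on_set_way()
emu.on_cache_toggle()
print(len(cleans), emu.trap_vm_regs)
```

A batch of 1024 trapped instructions costs two full cleans in this model, one on the first instruction and one when the cache is toggled.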
Current status
An approach was discussed on xen-devel in December 2017
https://lists.xen.org/archives/html/xen-devel/2017-12/msg00328.html
A PoC based on the feedback was written
Sharing page tables is not possible with this approach
More details will be posted on xen-devel
Conclusion
Caches are not just a "make it faster" block slapped on the side of the CPU
They are an essential part of the coherency protocol
Using uncached memory explicitly bypasses it
It looks logical to cope with the consequences
No magic involved!
Following the architecture rules ensures correctness on all implementations
RTFAA (Read The Fabulous ARM ARM, almost 7000 pages - and counting)
Questions?
The Arm trademarks featured in this presentation are registered trademarks or
trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights
reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks

XPDDS18: The Art of Virtualizing Cache Maintenance - Julien Grall, Arm
