Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compliant System
National Chiao Tung University & National Tsing Hua University & National Taiwan University
Yu-Ju Huang, Hsuan-Heng Wu, Yeh-Ching Chung, Wei-Chung Hsu
Agenda
• Motivation
• Background
• HSA features
• AMD’s implementation on Kaveri, the HSA-compliant platform
• Design and Implementation
• Evaluation
• Conclusion
Motivation
• Problem of heterogeneous computing
• Data communication between CPU & GPU
• Inefficiency
• Programmability inconvenience
• Heterogeneous System Architecture (HSA)
• Developed by HSA Foundation
• Goal
• Improving computation efficiency for heterogeneous computing
• Reducing the programmability barrier
• Let virtual machines benefit from HSA as well!
[Figure: an HSA hypervisor hosting two guest OSes, each running applications that use HSA.]
HSA Features
• Shared virtual memory
• I/O page faulting
• User-level queueing
• Memory based signaling
[Figure, shared virtual memory: before HSA, the CPU and GPU each have their own memory and data is copied between them; with HSA, the CPU and GPU share one virtual memory backed by the same physical memory.]
[Figure, user-level queueing: before HSA, application queues reach the GPU through the operating system and the GPU driver; with HSA, application queues reach the GPU directly.]
Shared Virtual Memory - IOMMU
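• The process page table is registered with the IOMMU, which uses it for virtual-to-physical address translation
• The CPU and GPU share the same process page table
[Figure: the CPU accesses system memory through the MMU and the GPU accesses it through the IOMMU; both use the same process page table.]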
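As a rough illustration of the sharing described above, the following toy model lets two translation units (standing in for the MMU and the IOMMU) walk one page-table object. The `PageTable` and `TranslationUnit` classes, the single-level table, and the 4 KiB page size are assumptions made for this sketch; they are not taken from the slides or from any real driver.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class PageTable:
    """Toy single-level page table: virtual page number -> physical frame number."""

    def __init__(self):
        self.entries = {}

    def map(self, vpn, pfn):
        self.entries[vpn] = pfn

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            raise KeyError(f"no mapping for virtual page {vpn:#x}")
        return self.entries[vpn] * PAGE_SIZE + offset


class TranslationUnit:
    """Hypothetical stand-in for either the CPU MMU or the IOMMU."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table  # a reference, not a copy

    def access(self, vaddr):
        return self.page_table.translate(vaddr)


# One process page table, referenced by both translation units.
process_pt = PageTable()
process_pt.map(0x10, 0x2A)

mmu = TranslationUnit("MMU", process_pt)
iommu = TranslationUnit("IOMMU", process_pt)

pointer = 0x10 * PAGE_SIZE + 0x123
cpu_pa = mmu.access(pointer)    # CPU dereferences the pointer
gpu_pa = iommu.access(pointer)  # GPU dereferences the same pointer

# A mapping added later is visible to both units without any copy.
process_pt.map(0x11, 0x2B)
late_pa = iommu.access(0x11 * PAGE_SIZE)
```

Because both units hold a reference to the same table, the same pointer value resolves to the same physical address on either side, which is what removes the explicit data copy of the pre-HSA model.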
I/O Page Faulting - PPR
• A PPR (peripheral page service request) is issued by the IOMMU as an interrupt
• The PPR log contains the faulting process ID and the faulting address
• The get_user_pages API can be used to resolve the page fault
[Figure: five numbered steps between the IOMMU and the CPU. Labels: PPR interrupt, call PPR handler, get PPR logs, fix page fault, COMPLETE command.]
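The fault-handling flow above can be sketched as a small simulation. This is a toy model under stated assumptions: `ToyIommu`, `PprEntry`, `ppr_handler`, and the `allocate_frame` callback are hypothetical names, and mapping a fresh frame merely stands in for what `get_user_pages` would do in the kernel.

```python
import itertools
from collections import deque
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass
class PprEntry:
    process_id: int  # faulting process ID, as recorded in the PPR log
    fault_addr: int  # faulting virtual address


class ToyIommu:
    """Hypothetical IOMMU model with per-process page tables and a PPR log."""

    def __init__(self):
        self.page_tables = {}  # process_id -> {vpn: pfn}
        self.ppr_log = deque()
        self.completed = []

    def device_access(self, process_id, vaddr):
        table = self.page_tables.setdefault(process_id, {})
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in table:
            # Fault: log it; a real IOMMU would now raise the PPR interrupt.
            self.ppr_log.append(PprEntry(process_id, vaddr))
            return None
        return table[vpn] * PAGE_SIZE + offset

    def complete(self, entry):
        # Stands in for the COMPLETE command sent back to the IOMMU.
        self.completed.append(entry)


def ppr_handler(iommu, allocate_frame):
    """Drain the PPR log, fix each fault, and acknowledge it."""
    handled = 0
    while iommu.ppr_log:
        entry = iommu.ppr_log.popleft()
        vpn = entry.fault_addr // PAGE_SIZE
        # Stand-in for get_user_pages: back the page with a frame.
        iommu.page_tables[entry.process_id][vpn] = allocate_frame()
        iommu.complete(entry)
        handled += 1
    return handled


frames = itertools.count(100)  # hypothetical free-frame allocator
iommu = ToyIommu()
first = iommu.device_access(process_id=1, vaddr=0x5008)   # faults
handled = ppr_handler(iommu, lambda: next(frames))
second = iommu.device_access(process_id=1, vaddr=0x5008)  # now succeeds
```

The first access returns `None` because the page is unmapped; after the handler runs, the retried access translates normally.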
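User Level Queueing - Kernel Fusion Driver (KFD)
• The KFD helps applications pass the addresses of their user-level queues to the GPU
[Figure: an application in user space hands the address of its user-level queue to the KFD in kernel space, which passes it to the GPU; computation is then submitted through the user-level queues.]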
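To make the division of labor concrete, here is a toy sketch in which the kernel driver is involved only once, at queue creation, and every later submission goes straight from the application to the GPU. `UserQueue`, `ToyKfd`, `ToyGpu`, and the `doorbell` method are hypothetical names for this illustration; passing a Python object reference stands in for passing the queue's address, and none of this mirrors the real KFD interface.

```python
class UserQueue:
    """Toy ring buffer living in application memory."""

    def __init__(self, size):
        self.ring = [None] * size
        self.write_index = 0
        self.read_index = 0

    def enqueue(self, packet):
        if self.write_index - self.read_index == len(self.ring):
            raise BufferError("queue full")
        self.ring[self.write_index % len(self.ring)] = packet
        self.write_index += 1


class ToyGpu:
    """Hypothetical GPU model that consumes packets from registered queues."""

    def __init__(self):
        self.queues = []
        self.executed = []

    def register_queue(self, queue):
        self.queues.append(queue)

    def doorbell(self):
        # Consume every packet that has been written but not yet read.
        for queue in self.queues:
            while queue.read_index < queue.write_index:
                slot = queue.read_index % len(queue.ring)
                self.executed.append(queue.ring[slot])
                queue.read_index += 1


class ToyKfd:
    """Hypothetical kernel driver: only tells the GPU where the queue is."""

    def __init__(self, gpu):
        self.gpu = gpu
        self.calls = 0

    def create_queue(self, queue):
        self.calls += 1
        self.gpu.register_queue(queue)


gpu = ToyGpu()
kfd = ToyKfd(gpu)

queue = UserQueue(size=4)
kfd.create_queue(queue)  # the only trip through the kernel driver

for task in ["kernel_a", "kernel_b", "kernel_c"]:
    queue.enqueue(task)  # written directly from user space
    gpu.doorbell()       # notify the GPU, no driver call involved
```

After the loop, `kfd.calls` is still 1 while three tasks have executed, which is the property that distinguishes user-level queueing from driver-mediated submission.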
Design - How to Virtualize
• User-level queueing
    • VirtIO-KFD
• Shared virtual memory
    • Shadow page table
    • Why not hardware-assisted nested paging?
• I/O page faulting
    • Shadow PPR
    • VirtIO-IOMMU
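Virtualize User Level Queueing - VirtIO-KFD
[Figure: guest applications and the HSA runtime library run in the guest OS together with the VirtIO-KFD front-end. The front-end shares a virtqueue with the VirtIO-KFD back-end in QEMU. The host OS contains KVM and the KFD, which reaches the GPU. Steps are numbered 1 to 4.]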
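The front-end and back-end split can be sketched with a toy shared queue. This is a minimal model of the general VirtIO request and response pattern, not the authors' code: `Virtqueue`, `FrontEnd`, `BackEnd`, `HostKfd`, the request dictionary layout, and the `create_queue` operation name are all assumptions. It also ignores how the guest queue address is translated for the host, which the slides do not describe.

```python
from collections import deque


class Virtqueue:
    """Toy shared queue: requests from the guest, responses from the host."""

    def __init__(self):
        self.avail = deque()  # posted by the front-end
        self.used = deque()   # posted by the back-end


class HostKfd:
    """Hypothetical stand-in for the host kernel driver."""

    def __init__(self):
        self.registered = []

    def create_queue(self, queue_addr):
        self.registered.append(queue_addr)
        return len(self.registered) - 1  # queue id


class BackEnd:
    """Runs on the host side and forwards requests to the host KFD."""

    def __init__(self, vq, host_kfd):
        self.vq = vq
        self.host_kfd = host_kfd

    def on_kick(self):
        while self.vq.avail:
            req = self.vq.avail.popleft()
            if req["op"] == "create_queue":
                result = self.host_kfd.create_queue(req["queue_addr"])
            else:
                result = -1  # unknown operation
            self.vq.used.append({"id": req["id"], "result": result})


class FrontEnd:
    """Runs in the guest and looks like a KFD to guest applications."""

    def __init__(self, vq, kick):
        self.vq = vq
        self.kick = kick  # notifies the host side
        self.next_id = 0

    def create_queue(self, queue_addr):
        req_id = self.next_id
        self.next_id += 1
        self.vq.avail.append(
            {"id": req_id, "op": "create_queue", "queue_addr": queue_addr}
        )
        self.kick()
        resp = self.vq.used.popleft()
        if resp["id"] != req_id:
            raise RuntimeError("response does not match request")
        return resp["result"]


vq = Virtqueue()
host_kfd = HostKfd()
back_end = BackEnd(vq, host_kfd)
front_end = FrontEnd(vq, kick=back_end.on_kick)

queue_id = front_end.create_queue(queue_addr=0x7F0000001000)
```

In this sketch the kick is a plain function call; in a real guest it would cause a transition to the host, which is the cost the queue initialization measurement later in the deck is concerned with.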
Virtualize Shared Virtual Memory - Shadow Page Table
[Figure: the same guest and host components as for VirtIO-KFD (guest applications, HSA runtime library, VirtIO-KFD front-end and back-end sharing a virtqueue, QEMU, KVM, KFD), plus the IOMMU driver in the host OS, which passes the address of the shadow page table to the IOMMU. Steps are numbered 1 to 6.]
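IOMMU Snapshot During GPU Execution
[Figure: the GPU accesses memory through the IOMMU, which holds one page-table entry per ID.]
ID    System                Page table
1     Host, process 1       Addr of PT
2     Guest 1, process 1    Addr of SPT
• Native scenario (ID=1): the page table translates HVA (host virtual address) to MPA (machine physical address)
• Guest scenario (ID=2): the shadow page table translates GVA (guest virtual address) to MPA
• More guest processes in different guest OSes are also allowed.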
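The snapshot above can be mimicked with a small model in which the IOMMU holds one table per ID. The sketch builds the shadow table by composing a guest page table with a guest-physical to machine-physical map, which is the usual meaning of shadow paging; the function names, the dictionary representation, and all page numbers are invented for illustration and say nothing about how the authors' hypervisor keeps the tables in sync.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def build_shadow_page_table(guest_pt, gpa_to_mpa):
    """Compose GVA->GPA with GPA->MPA into a direct GVA->MPA table."""
    shadow = {}
    for gvpn, gpfn in guest_pt.items():
        if gpfn in gpa_to_mpa:  # skip guest pages with no machine backing yet
            shadow[gvpn] = gpa_to_mpa[gpfn]
    return shadow


def translate(page_table, vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset


# Hypothetical page numbers, for illustration only.
host_pt = {0x40: 0x900}                # host process: HVA page -> MPA frame
guest_pt = {0x10: 0x3, 0x11: 0x4}      # guest process: GVA page -> GPA frame
gpa_to_mpa = {0x3: 0x700, 0x4: 0x701}  # hypervisor: GPA frame -> MPA frame

# One entry per ID, as in the snapshot table.
iommu_table = {
    1: host_pt,                                          # Addr of PT
    2: build_shadow_page_table(guest_pt, gpa_to_mpa),    # Addr of SPT
}

native_mpa = translate(iommu_table[1], 0x40 * PAGE_SIZE + 0x10)
guest_mpa = translate(iommu_table[2], 0x11 * PAGE_SIZE + 0x20)
```

With the composed table installed, the IOMMU resolves a guest virtual address in a single walk, just as it does for a host process.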
Virtualize I/O Page Faulting - VirtIO-IOMMU, Shadow PPR
[Figure: five numbered steps, starting with an interrupt from the IOMMU (step 1). Labels: Guest OS, Host OS, guest applications, HSA runtime library, VirtIO-IOMMU, QEMU, KVM, shadow PPR, IOMMU driver, IOMMU.]
PPR: Peripheral Page Request
System Architecture
[Figure: overall architecture. Labels: guest OS with guest applications, the HSA runtime library, VirtIO-KFD, and VirtIO-IOMMU; QEMU (a host process) with VirtIO-KFD and VirtIO-IOMMU; host OS with KVM, shadow PPR, the KFD, and the IOMMU driver; IOMMU and GPU. The paths are annotated with the three virtualized features: user-level queueing, shared virtual memory, and I/O page faulting.]
• KFD: Kernel Fusion Driver
• PPR: Peripheral Page Request
Evaluation
• Queue initialization time
    • Measures the overhead of VirtIO-KFD
• GPU execution time
    • Measures the overhead of the shadow page table and shadow PPR

Configuration        Native          Guest
Hardware platform    Kaveri          Kaveri
Memory               8 GB            4 GB
Number of CPUs       4               4
OS                   Ubuntu 13.10    Ubuntu 13.10
Queue Initialization Time
On average, queue initialization shows a 30% performance drop in the guest compared with native.
GPU Execution Time
The guest achieves about 95% of native performance on average in most cases.

GPU time (sec)
Benchmark               Native    Guest
BinarySearch            0.0108    0.0113
FastWalshTransform      0.0018    0.0019
BitonicSort             0.014     0.016
FloydWarshall           16.094    16.603
MatrixMultiplication    8.012     8.286
MatrixTranspose         0.502     0.538
MonteCarloAsian         17.458    18.342
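The relative performance behind the summary claim can be recomputed from the GPU times listed above. The sketch assumes that "percentage of native performance" means native time divided by guest time; that reading is an assumption, since the slides do not define the metric.

```python
# GPU time in seconds, (native, guest), copied from the table above.
gpu_time_sec = {
    "BinarySearch": (0.0108, 0.0113),
    "FastWalshTransform": (0.0018, 0.0019),
    "BitonicSort": (0.014, 0.016),
    "FloydWarshall": (16.094, 16.603),
    "MatrixMultiplication": (8.012, 8.286),
    "MatrixTranspose": (0.502, 0.538),
    "MonteCarloAsian": (17.458, 18.342),
}

# Assumed metric: fraction of native performance = native time / guest time.
ratios = {
    name: native / guest for name, (native, guest) in gpu_time_sec.items()
}
average_ratio = sum(ratios.values()) / len(ratios)
slowest = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of native")
print(f"average: {average_ratio:.1%} of native")
print(f"largest relative slowdown: {slowest}")
```

Under this metric the per-benchmark ratios cluster in the mid-90% range, with the short-running BitonicSort showing the largest relative slowdown.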
[Figure: timeline of a small benchmark in a guest application that enqueues tasks many times. Labels: enqueue task, kick GPU, world switch to host, switch back, wait signal, world switch to host, signal delay.]
Conclusion
• We implemented a hypervisor that virtualizes the HSA features.
• Guest systems can benefit from HSA and carry out heterogeneous computing.
• The GPU in Kaveri can be shared among multiple guest OSes and the host OS.
Thanks!
Q&A
gic4107@gmail.com

Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compliant System

  • 1. Building A KVM-based Hypervisor for A Heterogeneous System Architecture Compliant System National Chiao Tung University & National Tsing Hua University & National Taiwan University Yu-Ju Huang, Hsuan-Heng Wu, Yeh-Ching Chung, Wei-Chung Hsu
  • 2. Agenda • Motivation • Background • HSA features • AMD’s implementation on Kaveri, the HSA- compliant platform • Design and Implementation • Evaluation • Conclusion 2
  • 3. Motivation • Problem of heterogeneous computing • Data communication between CPU & GPU • Inefficiency • Programmability inconvenience • Heterogeneous System Architecture (HSA) • Developed by HSA Foundation • Goal • Improving computation efficiency for heterogeneous computing • Reducing programmability barrier • Make virtual machines also get benefit of HSA ! 3 HSA Hypervisor Guest OS Guest OS A p p A p p A p p A p p HSA!!!
  • 4. HSA Features • Shared virtual memory • I/O page faulting • User-level queueing • Memory based signaling [Figure: before HSA, the CPU and GPU have separate memories and need data copies, and applications reach the GPU through the operating system and GPU driver; with HSA, the CPU and GPU share virtual memory over one physical memory, and the GPU sees the application queues directly]
  • 5. Shared Virtual Memory - IOMMU • Set the process page table in the IOMMU to carry out virtual to physical address translation • CPU and GPU share the same process page table [Figure: the GPU reaches system memory through the IOMMU and the CPU through the MMU, both using the same process page table]
  • 6. I/O Page Faulting - PPR • PPR (peripheral page service request) issued by the IOMMU as an interrupt • PPR log contains the faulting process ID and fault address • get_user_pages API can be used to fix the page fault [Figure: PPR flow between IOMMU and CPU in five steps: PPR interrupt, call PPR handler, get PPR logs, fix page fault, COMPLETE command]
  • 7. User Level Queueing - Kernel Fusion Driver (KFD) • Helps applications set the address of user level queues in the GPU [Figure: KFD in kernel space passes the address of the user level queues to the GPU; computation then runs between the userspace queues and the GPU]
  • 8. Design - How to Virtualize • User-level queueing • VirtIO-KFD • Shared virtual memory • Shadow page table • Why not hardware-assisted nested paging? • I/O page faulting • Shadow PPR • VirtIO-IOMMU
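  • 9. Virtualize User Level Queueing: VirtIO-KFD [Figure: guest apps with the HSA runtime library use the VirtIO-KFD front-end in the guest OS, which shares a virtqueue with the VirtIO-KFD back-end in Qemu; the back-end reaches the host KFD, which programs the GPU (steps 1 to 4); KVM sits in the host OS]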
  • 10. Virtualize Shared Virtual Memory: Shadow Page Table [Figure: the VirtIO-KFD path of slide 9 (steps 1 to 4), extended with steps 5 and 6 in which KVM and the IOMMU driver set the address of the shadow page table in the IOMMU]
  • 11. IOMMU Snapshot During GPU Execution • IOMMU table (ID, system, page table): ID 1, host process 1, address of PT; ID 2, guest 1 process 1, address of SPT • Native scenario (ID=1): HVA to MPA • Guest scenario (ID=2): GVA to MPA • More guest processes in different guest OSes are also allowed. [Figure: GPU, IOMMU, and memory, with the IOMMU selecting a page table by ID]
  • 12. Virtualize I/O Page Faulting: VirtIO-IOMMU, Shadow PPR [Figure: the IOMMU interrupts the host IOMMU driver, guest faults are stored in the shadow PPR module, KVM delivers an interrupt to the guest OS, and VirtIO-IOMMU in the guest handles the fault (steps 1 to 5)] PPR: Peripheral Page Request
  • 13. System Architecture [Figure: guest apps with the HSA runtime library run on a guest OS containing VirtIO-KFD and VirtIO-IOMMU; Qemu (a host process) contains the matching VirtIO-KFD and VirtIO-IOMMU parts; the host OS contains KVM, shadow PPR, KFD, and the IOMMU driver above the IOMMU and GPU; the paths are labelled user level queuing, shared virtual memory, and I/O page faulting] KFD: Kernel Fusion Driver. PPR: Peripheral Page Request.
  • 14. Evaluation • Queue initialization time • Measuring overheads of VirtIO-KFD • GPU execution time • Measuring overheads of shadow page table and shadow PPR • Configurations (native / guest): hardware platform Kaveri; memory 8G / 4G; number of CPUs 4 / 4; OS Ubuntu 13.10
  • 15. Queue Initialization Time • Average 30% performance drop.
  • 16. GPU Execution Time • Achieves on average 95% of native performance in most cases. • GPU time in seconds (native / guest): BinarySearch 0.0108 / 0.0113; FastWalshTransform 0.0018 / 0.0019; BitonicSort 0.014 / 0.016; FloydWarshall 16.094 / 16.603; MatrixMultiplication 8.012 / 8.286; MatrixTranspose 0.502 / 0.538; MonteCarloAsian 17.458 / 18.342 [Figure: guest application loop of enqueue task, kick GPU, and wait for signal, with world switches to the host and back to the guest adding signal delay; annotations: small benchmark, enqueue many times]
  • 17. Conclusion • Successfully implemented a hypervisor that virtualizes HSA features. • Guest systems can benefit from HSA and carry out heterogeneous computing. • The GPU in Kaveri is shareable between multiple guest OSes and the host OS.
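
As a rough check on the slide 16 numbers, the sketch below recomputes guest performance as a percentage of native for each benchmark. The times are copied from the slide; defining "percent of native" as native time divided by guest time is an assumption, since the slides do not state the formula, and the dictionary and function names are illustrative only.

```python
# GPU execution times in seconds, copied from the slide 16 table.
native = {
    "BinarySearch": 0.0108,
    "FastWalshTransform": 0.0018,
    "BitonicSort": 0.014,
    "FloydWarshall": 16.094,
    "MatrixMultiplication": 8.012,
    "MatrixTranspose": 0.502,
    "MonteCarloAsian": 17.458,
}
guest = {
    "BinarySearch": 0.0113,
    "FastWalshTransform": 0.0019,
    "BitonicSort": 0.016,
    "FloydWarshall": 16.603,
    "MatrixMultiplication": 8.286,
    "MatrixTranspose": 0.538,
    "MonteCarloAsian": 18.342,
}


def percent_of_native(native_s, guest_s):
    """Guest performance as a percentage of native (assumed: native / guest)."""
    return 100.0 * native_s / guest_s


ratios = {name: percent_of_native(native[name], guest[name]) for name in native}
average = sum(ratios.values()) / len(ratios)

for name, pct in ratios.items():
    print(f"{name:22s} {pct:5.1f}% of native")
print(f"{'average':22s} {average:5.1f}% of native")
```

Under this definition BitonicSort is the weakest case and the long-running benchmarks sit close to the average, which matches the discussion of small benchmarks in the notes for slide 16.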

Editor's Notes

  1. Hello everyone. My name is Yu-Ju Huang. Here is the author list: this is me, my partner, and two professors. We are all from Taiwan, a country in East Asia. <NEED funny intro> This is my topic today. It’s a little long, right :D? So now I’m going to give you a brief introduction to this work and a picture of it. I hope you enjoy it! In this work, our target is a special hardware architecture called Heterogeneous System Architecture, or HSA for short. HSA mainly focuses on making heterogeneous computing systems more powerful and more efficient. Given an HSA-compliant hardware platform, we implement a hypervisor running on top of it. The hypervisor virtualizes the features provided by HSA so that virtual machines can also get the benefits of HSA.
  2. First, I’ll introduce the motivation of this work. Then I’ll give a brief background on HSA, including the HSA features and AMD’s implementation on Kaveri, which is the first HSA-compliant platform and also our target platform. After that, we can talk about our design and implementation, and then the evaluation and conclusion.
  3. For the motivation, we start from heterogeneous computing. The heterogeneous computing programming model requires data communication between devices. This communication causes inefficiency and makes programming inconvenient. So the HSA Foundation proposed the HSA architecture to resolve these problems. The motivation of our work is this: if we believe heterogeneous computing will become more and more popular in the future, then there must be a hypervisor that lets virtual machines get the benefits of HSA. Although our discussion is based on HSA and the implementation is based on AMD’s platform, our design philosophy can also be applied to other platforms, or even to other architectures that try to improve heterogeneous computing systems.
  4. OK, let’s start to introduce HSA. As described before, HSA tries to solve the communication inefficiency and inconvenience. Here is HSA’s solution. It proposes many features, and the ones listed here are the features that determine how a program is able to execute. These features are also what we need to virtualize. The first is shared virtual memory. Before HSA, the CPU and GPU used different memories and address spaces, so data copies were required. With HSA, all the computing resources, like the CPU, the GPU, and other HSA-aware devices, see the same virtual address space, so they can access system memory with virtual addresses. This eliminates the data copy. The I/O page faulting feature is a requirement for shared virtual memory: because we allow an I/O device to access system memory directly, the page fault service must also support it. Next is user-level queuing. Before HSA, tasks could only be dispatched to the GPU by the OS, that is, the GPU driver. With HSA, the GPU is able to see all the user-level queues, so dispatching jobs doesn’t need to trap into the GPU driver any more. This design reduces the latency of dispatching jobs. Finally, memory based signaling is also designed to reduce the latency of OS intervention. Before HSA, once the GPU finished its task, it issued an interrupt to the CPU and let the CPU notify the user-space program. This path incurs OS intervention overhead. So HSA lets the GPU write to a particular memory address to signal that a job has finished. That memory address is assigned by the application when it dispatches jobs. Of these four features, memory based signaling is achieved automatically once the GPU is able to access the process address space. So we actually only need to take care of virtualizing the first three features.
  5. Well, in the following pages, I will introduce AMD’s implementation of the HSA features. First, shared virtual memory. AMD implements an IOMMU for the GPU and other HSA-aware devices to translate virtual addresses to physical addresses. Since the CPU and GPU see the same process address space, the page table used by the IOMMU should be the same one the CPU’s MMU uses. So by setting the page table properly, the shared virtual memory feature can be achieved.
  6. For I/O page faulting, AMD designed a mechanism called PPR, peripheral page service request. This request is issued by the IOMMU as an interrupt to the CPU once a failure occurs in address translation, such as when the page doesn’t exist or there is insufficient permission to access the page. The IOMMU also writes a log entry containing the faulting process ID and the fault address. With this information, the Linux API get_user_pages can be used to fix the I/O page fault. Here is the brief flow of I/O page fault handling.
  7. As for the user-level queuing feature, the key question is how to let the GPU know the address of the user-level queues. AMD designed a driver called the kernel fusion driver, or KFD, for this purpose. During user-program initialization, the CREATE_QUEUE API sends the address of the user-level queue to the KFD, and the KFD sets this address in the GPU. After this setup, the driver no longer needs to intervene. The driver is only used during initialization; during computation, the GPU and the user program work together directly.
  8. Good? In the previous slides, I described what we need to virtualize. From now on, I will introduce how we virtualize these HSA features. You can see the overview on this page, and I will elaborate in the following pages. One thing I need to mention is that we use the shadow page table to virtualize shared virtual memory. I know it may seem strange that SPT is adopted rather than nested paging. This is due to a constraint of AMD’s IOMMU, and it’s a little complicated, so I will not describe it in this talk. But you can still find the explanation in the proceedings and the paper.
  9. As I described previously, the key to supporting user level queuing is to let the GPU know the address of the user level queue. So we implemented VirtIO-KFD, as you can see in the slide. VirtIO-KFD helps a guest application pass the address of its queue through to the real KFD, and the KFD sets it in the GPU. In this way, the GPU knows the address of the guest application’s queue.
  10. Next is shared virtual memory. As we know, the shadow page table guides the MMU in translating guest virtual addresses to machine physical addresses. So in our work, we just need to find the address of the shadow page table and set it in the IOMMU when a guest application tries to use the GPU.
  11. This is a snapshot of the GPU’s executing state. The IOMMU maintains a table that maps a process address space ID to the corresponding page table address. In this scenario, there are two processes using the GPU. For native execution, the GPU runs a program dispatched by a host application, and the IOMMU knows where to find the host application’s page table. For guest execution, the GPU runs a program dispatched by a guest application, and this program uses guest virtual addresses. So the IOMMU finds the corresponding SPT to translate GVA to MPA. As you can expect, this table can be extended. So in our design, multiple processes from different guest OSes, or even the host OS, can share the GPU. In this sense, we achieve GPU sharing in our work.
  12. The final one is I/O page faulting. One challenge in virtualizing this feature is that the PPR log region, which stores the page fault information, is inside a special I/O region. Usually, a guest system is not allowed to access this region. So we implemented a module called shadow PPR. This module stores the information about page faults caused by guest GPU programs. Once a PPR occurs, the PPR handler decides whether it was caused by a guest program. If so, it stores the information in shadow PPR. Shadow PPR then kicks KVM to send a virtual interrupt into the guest OS. Inside the guest OS, we implemented VirtIO-IOMMU to handle the I/O page fault. It gets the page fault information from shadow PPR and fixes the page fault. This is how we virtualize I/O page faulting.
  13. Whole system architecture. VirtIO-KFD for user level queuing. SPT for SVM. VirtIO-IOMMU for I/O page fault.
  14. About the experiment. We use the AMD SDK as our benchmark. We report initialization time and execution time to evaluate our design.
  15. The data is normalized against the native scenario. There is about a 30% performance drop. This drop is mainly caused by forwarding requests from VirtIO-KFD to the real KFD, since there is world switch overhead on this path. But usually an application only goes through this initialization process once, so this performance drop is not a great concern.
  16. For GPU execution time, the major cause of the performance drop is I/O page fault handling. But as you can see, our design gets a good result, around 95% of native performance in most cases. As for the two poor cases, FastWalshTransform (FWT) and BitonicSort (BS), these two benchmarks do perform a little worse. To see the reason, let’s look at this figure. It shows the flow of an application dispatching jobs, waiting for the signal, and getting the notification when the GPU finishes the job. It is possible that while the guest application waits for the signal, the CPU switches to another process. So if the GPU finishes the job and sends the notification at a moment when the CPU is owned by another process rather than the guest system, the application gets the signal late. These red arrows show this delay. Why do only these two benchmarks suffer from it? We can see the raw data here. They are small benchmarks, with only about 10 ms of GPU execution time. For the long benchmarks, this signal delay is amortized. Another reason is that these two benchmarks enqueue many times, so they keep going around this loop and the overhead becomes large. BinarySearch is also a small benchmark, but it enqueues only once, so the overhead is invisible.
  17. Conclusion of our work. We implemented a hypervisor that lets guest systems also get the benefits of HSA. Furthermore, we also achieve GPU sharing.
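
The IOMMU table from note 11 and the fault routing from notes 6 and 12 can be pictured with a small toy model. This is only a sketch under stated assumptions, not the authors' implementation: every class and method name is hypothetical, page tables are plain dictionaries, page allocation is a counter standing in for get_user_pages, and the guest-side fix writes straight into the shadow page table, whereas the real design goes through the guest page table, KVM, and VirtIO-IOMMU.

```python
from dataclasses import dataclass, field

PAGE = 4096  # page size used by this toy model (assumption)


@dataclass
class AddressSpace:
    """One row of the IOMMU ID table (hypothetical structure)."""
    owner: str  # "host" or a guest name
    page_table: dict = field(default_factory=dict)  # virtual page -> machine page


class ToyIommu:
    def __init__(self):
        self.table = {}    # address space ID -> AddressSpace (PT or SPT)
        self.ppr_log = []  # pending (ID, faulting virtual address) entries

    def translate(self, asid, vaddr):
        """Return the machine physical address, or None after logging a PPR."""
        space = self.table[asid]
        vpn, offset = divmod(vaddr, PAGE)
        if vpn not in space.page_table:
            self.ppr_log.append((asid, vaddr))
            return None
        return space.page_table[vpn] * PAGE + offset


class ToyHost:
    def __init__(self, iommu):
        self.iommu = iommu
        self.shadow_ppr = []  # guest faults waiting for the guest handler
        self.next_free_page = 100

    def _map(self, asid, vaddr):
        # Stand-in for get_user_pages: back the faulting page with a fresh page.
        self.iommu.table[asid].page_table[vaddr // PAGE] = self.next_free_page
        self.next_free_page += 1

    def handle_ppr(self):
        """Host PPR handler: fix host faults, forward guest faults to shadow PPR."""
        while self.iommu.ppr_log:
            asid, vaddr = self.iommu.ppr_log.pop(0)
            if self.iommu.table[asid].owner == "host":
                self._map(asid, vaddr)
            else:
                self.shadow_ppr.append((asid, vaddr))

    def guest_fix(self):
        """Stand-in for the guest handler that runs after the virtual interrupt."""
        while self.shadow_ppr:
            asid, vaddr = self.shadow_ppr.pop(0)
            self._map(asid, vaddr)


iommu = ToyIommu()
iommu.table[1] = AddressSpace("host")    # ID 1: host process, ordinary page table
iommu.table[2] = AddressSpace("guest1")  # ID 2: guest process, shadow page table
host = ToyHost(iommu)

host_first_try = iommu.translate(1, 0x1234)   # faults and logs a PPR
host.handle_ppr()                             # host fault is fixed directly
host_mpa = iommu.translate(1, 0x1234)

guest_first_try = iommu.translate(2, 0x1234)  # same virtual address, other ID
host.handle_ppr()                             # guest fault goes to shadow PPR
pending = list(host.shadow_ppr)
host.guest_fix()
guest_mpa = iommu.translate(2, 0x1234)

print(hex(host_mpa), hex(guest_mpa), pending)
```

The same virtual address resolves to different machine pages depending on the ID, which is what lets host and guest processes share the GPU in this picture.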