Vinetalk is a software abstraction layer that lets cluster managers such as Mesos and Kubernetes offer fractions of GPU resources, enabling more efficient sharing of accelerators. Existing cluster managers cannot share accelerators because device drivers do not support it. Vinetalk decouples executors from vendor-specific drivers and represents accelerators as virtual access queues, allowing multiple tasks to use the same physical accelerator concurrently. Vinetalk has been shown to reduce queuing times for tasks sharing a GPU compared to Mesos alone. It is also easier for developers to use, since it hides proprietary device APIs, and it adds only 1-5% overhead, due to memory transfers.
3. To make things harder: Add heterogeneity
• Accelerators (GPU/FPGA)
• $$$ / device
• Better value if:
  • Good utilization
  • Ease of use
4. Existing accelerator support
• Mesos/Kubernetes
  • They know how to manage GPU resources
  • They do not know how to offer fractions of GPUs
• Underutilization => Expensive practice
• More workloads => more hardware
5. Production workload characteristics
• Hypothesis:
  • “User facing” tasks take a few msec
  • Non-production tasks run in the background for days
• No accelerator-cluster data available to test this
• Google datacenter data from the past used instead
6. User facing task durations
• P50: 300 sec
• Long tail: millions of sec
• Hypothesis does not hold
• Production tasks last longer than “a few msec”
[Figure: CDF of task execution time for priorities 9-11; 80K tasks total. Source: Google cluster data]
7. Are tasks doing “work” all of the time?
• Not really
• Noon is busier than night
• Assuming GPU tasks follow the same pattern
• Expensive to keep GPUs idle/underutilized
Source: Reiss et al., “Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis,” SoCC’12
[Figure: portion of cluster CPU vs. time (hour) for long-running production tasks; 50% marked]
8. To sum up so far
• Cluster managers do not enable accelerator sharing
• Expensive practice
  • 10s of $K / device
  • More users => more devices
• Long reservations for production tasks
  • P50 = 300 sec, P80 = 1 hr
• The tasks are not always busy
  • 50% workload volatility
9. Our approach
• Enable offers of accelerator fractions
• Enable multiple containers to work on the same accelerator concurrently
10. Why accelerator sharing is hard in cluster managers
• Device drivers do not enable sharing
• Sharing support would have to propagate through all cluster manager modules
• Case in point: Apache Mesos
11. How Apache Mesos works
[Diagram: Apache Mesos managing three GPUs; offers do not allow sharing, and executors are bound to vendor drivers]
12. Approach: Decouple executors from vendor locking
[Diagram: an abstraction layer sits between the GPUs and Apache Mesos, so each offer carries “gpu abstract units” instead of whole devices]
Questions:
• How to implement the abstraction layer?
• What is the offer “currency”?
• Ease of use?
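On the “currency” question, one plausible mechanism is Mesos’s support for custom scalar resources; this is an assumption for illustration, not necessarily Vinetalk’s actual integration. Starting each agent with, e.g., mesos-agent --resources="gpus:4;vaq:64" (the resource name vaq is made up here) would make the abstract units show up in offers, letting a framework accept a few queues of a device rather than a whole GPU.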
13. Accelerator abstraction layer: The hardware interface
• Main idea:
  • Have a server process (“butler”) for each accelerator of a node
  • Implement all vendor-bound functionality there
• Three functions:
  • Monitor for incoming workloads
  • Load the proper kernel (assuming a library of kernels)
  • Properly transfer the data
[Diagram: each butler process receives “kernel call + data” requests and drives its device through the vendor API (Nvidia, Intel, or Xilinx)]
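As a minimal sketch of such a butler loop in C (all names below, such as task_t and load_kernel, are illustrative stand-ins rather than Vinetalk's actual code, and the vendor API calls are reduced to host-side stubs):

    #include <stdio.h>
    #include <string.h>

    typedef struct {
        const char *kernel;  /* which kernel from the node's kernel library */
        const void *data;    /* input payload (already in shared memory)    */
        size_t      size;
    } task_t;

    /* Stand-in kernel library: maps names to host functions. In a real
       butler these would dispatch into the vendor API (CUDA, OpenCL,
       SDAccel, ...). */
    static void monte_carlo(const void *d, size_t n)   { (void)d; printf("Monte Carlo on %zu bytes\n", n); }
    static void black_scholes(const void *d, size_t n) { (void)d; printf("Black-Scholes on %zu bytes\n", n); }

    typedef void (*kernel_fn)(const void *, size_t);

    static kernel_fn load_kernel(const char *name) {   /* load the proper kernel */
        if (strcmp(name, "monte_carlo") == 0)   return monte_carlo;
        if (strcmp(name, "black_scholes") == 0) return black_scholes;
        return NULL;
    }

    int main(void) {
        /* Stand-in for "monitor for incoming workloads": a fixed demo queue.
           A real butler would block on the accelerator's access queues. */
        task_t queue[] = { {"black_scholes", "p0", 2}, {"monte_carlo", "p1", 2} };
        for (size_t i = 0; i < sizeof queue / sizeof queue[0]; i++) {
            kernel_fn k = load_kernel(queue[i].kernel);
            if (!k) { fprintf(stderr, "unknown kernel %s\n", queue[i].kernel); continue; }
            k(queue[i].data, queue[i].size);   /* transfer data + launch */
        }
        return 0;
    }

With one such process per device, tasks never touch the vendor driver directly; they only place requests on a queue, which is what makes concurrent use of one accelerator possible.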
14. Accelerator abstraction layer: The software interface
• Shared interface among Mesos tasks
• IPC mechanism to transfer functions + data
  • Shared memory
• Abstract accelerators as Virtual Access Queues (VAQs)
[Diagram: the Mesos executor and tasks push “subtask + data” into VAQs in shared memory; the hardware interface (a user-space process) drains the queues and returns result data]
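To make the VAQ idea concrete, here is a minimal sketch, assuming a fixed-size single-producer/single-consumer ring buffer in POSIX shared memory; the layout, names, and sizes are illustrative, not Vinetalk's actual structures:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SLOTS   64
    #define PAYLOAD 256

    typedef struct {
        char kernel[32];     /* which kernel to run     */
        char data[PAYLOAD];  /* inline argument payload */
    } subtask_t;

    typedef struct {
        volatile unsigned head;  /* next slot the consumer reads  */
        volatile unsigned tail;  /* next slot the producer writes */
        subtask_t slot[SLOTS];
    } vaq_t;

    /* Map (creating if needed) one VAQ backed by a shared-memory object. */
    static vaq_t *vaq_open(const char *name) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return NULL;
        if (ftruncate(fd, sizeof(vaq_t)) < 0) { close(fd); return NULL; }
        void *p = mmap(NULL, sizeof(vaq_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : (vaq_t *)p;
    }

    /* Producer side: a Mesos task enqueues a subtask; returns 0 if full. */
    static int vaq_push(vaq_t *q, const char *kernel, const void *data, size_t n) {
        if (q->tail - q->head == SLOTS || n > PAYLOAD) return 0;
        subtask_t *s = &q->slot[q->tail % SLOTS];
        snprintf(s->kernel, sizeof s->kernel, "%s", kernel);
        memcpy(s->data, data, n);
        __sync_synchronize();  /* publish the slot before moving the tail */
        q->tail++;
        return 1;
    }

    int main(void) {
        vaq_t *q = vaq_open("/vaq_demo");  /* hypothetical queue name */
        if (!q) { perror("vaq_open"); return 1; }
        vaq_push(q, "black_scholes", "prices", 7);
        printf("queued: head=%u tail=%u\n", q->head, q->tail);
        shm_unlink("/vaq_demo");
        return 0;
    }

A real implementation would also need the consumer side, completion notification back to the task, and a scheme for payloads larger than a slot, but the shape is the same: tasks enqueue subtasks, the hardware interface dequeues them.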
16. A Showcase of the gain
• Two Mesos tasks trying to share a GPU:
  • One that launches a Monte Carlo GPU kernel 1000x
  • One that launches a Black & Scholes GPU kernel 1000x
• The Monte Carlo kernel runs longer than the Black & Scholes kernel
• Compare the queuing time of each task's first subtask

                     Black & Scholes (msec)   Monte Carlo (msec)
    Mesos            126K                     38
    Mesos + Vinetalk 40                       42

• With Vinetalk, the short Black & Scholes task no longer waits behind the long-running Monte Carlo task.
17. Is it easy to use?
• It is much easier (30% fewer lines of code)
• It hides all vendor-specific APIs
• Example: porting of the ICCS FPGA financial application (FPL’17)
[Diagram: before, the user application (ICCS) calls SDAccel directly to reach the FPGA Black-Scholes kernel; after, Vinetalk sits between the application and SDAccel]
• We ported all SDAccel API calls inside Vinetalk
• User applications just need to use the Vinetalk API
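To give a flavor of the porting effect described above (the vine_task_issue name below is hypothetical, a stand-in for the real Vinetalk API rather than its actual signature):

    #include <stdio.h>

    /* Hypothetical stand-in for the Vinetalk entry point (made-up name and
       signature, used only to illustrate the before/after difference). */
    static void vine_task_issue(const char *kernel, const float *in,
                                float *out, size_t n) {
        /* In the real layer this would enqueue "subtask + data" on a VAQ. */
        printf("issuing %s on %zu elements\n", kernel, n);
        for (size_t i = 0; i < n; i++) out[i] = in[i];  /* placeholder result */
    }

    int main(void) {
        float in[4] = {1, 2, 3, 4}, out[4];
        /* Before the port, this call site owned the SDAccel/OpenCL
           boilerplate: platform/context setup, clCreateProgramWithBinary,
           buffer creation and transfers, clSetKernelArg, kernel launch,
           readback, teardown. After the port, it is one call: */
        vine_task_issue("black_scholes", in, out, 4);
        return 0;
    }

The boilerplate moves inside the layer once, and every application call site shrinks to a single accelerator-agnostic request, which is where the reported 30% code reduction comes from.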
20. My gift to you
• Vinetalk becomes open source as of today
• Apache v2.0 license
• Check it out:
  • https://github.com/vineyard2020/vinetalkSuite
21. Conclusions
• Problem: Accelerators cannot be shared in a cluster
• Cause: The big forest of (proprietary) device drivers
• Solution: Abstract accelerators through Vinetalk
  • Install Vinetalk on the workers
  • Offer VAQs as resources
• 1-5% overhead, due to memory transfers
• Easy integration with all Mesos frameworks (such as Spark)
• https://github.com/vineyard2020/vinetalkSuite