Vinetalk is a software abstraction layer that lets cluster managers such as Mesos and Kubernetes offer fractions of GPU resources, enabling more efficient sharing of accelerators. Existing cluster managers cannot share accelerators because device drivers do not support it. Vinetalk decouples executors from vendor-specific drivers and represents accelerators as virtual access queues, allowing multiple tasks to use the same physical accelerator concurrently. Vinetalk has been shown to reduce queuing times for tasks sharing a GPU compared to Mesos alone. It is also easier for developers to use, since it hides proprietary device APIs, and it adds only 1-5% overhead, due to memory transfers.
3. To make things harder: Add heterogeneity
• Accelerators (GPU/FPGA)
• $$$ / device
• Better value if:
  • Good utilization
  • Ease of use
4. Existing accelerator support
• Mesos/Kubernetes
  • They know how to manage GPU resources
  • They do not know how to offer fractions of GPUs
• Underutilization => Expensive practice
• More workloads => more hardware
5. Production workload characteristics
• Hypothesis:
  • “User facing” tasks take a few msec
  • Non-production tasks run in the background for days
• No accelerator-cluster data available to test this
• Google datacenter data from the past used instead
6. User facing task durations
• P50: 300 sec
• Long tail: millions of sec
• Hypothesis does not hold
• Production tasks last longer than “a few msec”
[Figure: CDF of task execution time for priorities 9-11; 80K tasks total. Source: Google cluster data]
7. Are tasks doing “work” all of the time?
• Not really
• Noon is busier than night
• Assuming GPU tasks follow the same pattern
• Expensive to keep GPUs idle/underutilized
Source: Reiss et al., “Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis,” SoCC’12
[Figure: portion of cluster CPU vs. time (hour) for long-running production tasks; 50% marked]
8. To sum up so far
• Cluster managers do not enable accelerator sharing
• Expensive practice
  • 10s of $K / device
  • More users => more devices
• Long reservations for production tasks
  • P50 = 300 sec, P80 = 1 hr
• The tasks are not always busy
  • 50% workload volatility
9. Our approach
• Enable offers of accelerator fractions
• Enable multiple containers to work on the same accelerator concurrently
10. Why accelerator sharing is hard in cluster managers
• Device drivers do not enable sharing
• Sharing support would have to propagate through all cluster manager modules
• Case in point: Apache Mesos
11. How Apache Mesos works
[Diagram: Apache Mesos managing three GPUs; offers do not allow sharing, and executors are bound to vendor drivers]
12. Approach: Decouple executors from vendor locking
[Diagram: an abstraction layer sits between the GPUs and Apache Mesos, so each offer carries “gpu abstract units” instead of whole devices]
Questions:
• How to implement the abstraction layer?
• What is the offer “currency”?
• Ease of use?
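On the “currency” question, one plausible mechanism is Mesos’s support for custom scalar resources; this is an assumption for illustration, not necessarily Vinetalk’s actual integration. Starting each agent with, e.g., mesos-agent --resources="gpus:4;vaq:64" (the resource name vaq is made up here) would make the abstract units show up in offers, letting a framework accept a few queues of a device rather than a whole GPU.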
13. Accelerator abstraction layer: The hardware interface
• Main idea:
  • Have a server process (“butler”) for each accelerator of a node
  • Implement all vendor-bound functionality there
• Three functions:
  • Monitor for incoming workloads
  • Load the proper kernel (assuming a library of kernels)
  • Properly transfer the data
[Diagram: each butler process receives “kernel call + data” requests and drives its device through the vendor API (Nvidia, Intel, or Xilinx)]
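As a minimal sketch of such a butler loop in C (all names below, such as task_t and load_kernel, are illustrative stand-ins rather than Vinetalk's actual code, and the vendor API calls are reduced to host-side stubs):

    #include <stdio.h>
    #include <string.h>

    typedef struct {
        const char *kernel;  /* which kernel from the node's kernel library */
        const void *data;    /* input payload (already in shared memory)    */
        size_t      size;
    } task_t;

    /* Stand-in kernel library: maps names to host functions. In a real
       butler these would dispatch into the vendor API (CUDA, OpenCL,
       SDAccel, ...). */
    static void monte_carlo(const void *d, size_t n)   { (void)d; printf("Monte Carlo on %zu bytes\n", n); }
    static void black_scholes(const void *d, size_t n) { (void)d; printf("Black-Scholes on %zu bytes\n", n); }

    typedef void (*kernel_fn)(const void *, size_t);

    static kernel_fn load_kernel(const char *name) {   /* load the proper kernel */
        if (strcmp(name, "monte_carlo") == 0)   return monte_carlo;
        if (strcmp(name, "black_scholes") == 0) return black_scholes;
        return NULL;
    }

    int main(void) {
        /* Stand-in for "monitor for incoming workloads": a fixed demo queue.
           A real butler would block on the accelerator's access queues. */
        task_t queue[] = { {"black_scholes", "p0", 2}, {"monte_carlo", "p1", 2} };
        for (size_t i = 0; i < sizeof queue / sizeof queue[0]; i++) {
            kernel_fn k = load_kernel(queue[i].kernel);
            if (!k) { fprintf(stderr, "unknown kernel %s\n", queue[i].kernel); continue; }
            k(queue[i].data, queue[i].size);   /* transfer data + launch */
        }
        return 0;
    }

With one such process per device, tasks never touch the vendor driver directly; they only place requests on a queue, which is what makes concurrent use of one accelerator possible.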
14. Accelerator abstraction layer: The software interface
• Shared interface among Mesos tasks
• IPC mechanism to transfer functions + data
  • Shared memory
• Abstract accelerators as Virtual Access Queues (VAQs)
[Diagram: the Mesos executor and tasks push “subtask + data” into VAQs in shared memory; the hardware interface (a user-space process) drains the queues and returns result data]
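To make the VAQ idea concrete, here is a minimal sketch, assuming a fixed-size single-producer/single-consumer ring buffer in POSIX shared memory; the layout, names, and sizes are illustrative, not Vinetalk's actual structures:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SLOTS   64
    #define PAYLOAD 256

    typedef struct {
        char kernel[32];     /* which kernel to run     */
        char data[PAYLOAD];  /* inline argument payload */
    } subtask_t;

    typedef struct {
        volatile unsigned head;  /* next slot the consumer reads  */
        volatile unsigned tail;  /* next slot the producer writes */
        subtask_t slot[SLOTS];
    } vaq_t;

    /* Map (creating if needed) one VAQ backed by a shared-memory object. */
    static vaq_t *vaq_open(const char *name) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return NULL;
        if (ftruncate(fd, sizeof(vaq_t)) < 0) { close(fd); return NULL; }
        void *p = mmap(NULL, sizeof(vaq_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : (vaq_t *)p;
    }

    /* Producer side: a Mesos task enqueues a subtask; returns 0 if full. */
    static int vaq_push(vaq_t *q, const char *kernel, const void *data, size_t n) {
        if (q->tail - q->head == SLOTS || n > PAYLOAD) return 0;
        subtask_t *s = &q->slot[q->tail % SLOTS];
        snprintf(s->kernel, sizeof s->kernel, "%s", kernel);
        memcpy(s->data, data, n);
        __sync_synchronize();  /* publish the slot before moving the tail */
        q->tail++;
        return 1;
    }

    int main(void) {
        vaq_t *q = vaq_open("/vaq_demo");  /* hypothetical queue name */
        if (!q) { perror("vaq_open"); return 1; }
        vaq_push(q, "black_scholes", "prices", 7);
        printf("queued: head=%u tail=%u\n", q->head, q->tail);
        shm_unlink("/vaq_demo");
        return 0;
    }

A real implementation would also need the consumer side, completion notification back to the task, and a scheme for payloads larger than a slot, but the shape is the same: tasks enqueue subtasks, the hardware interface dequeues them.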
16. A Showcase of the gain
• Two Mesos tasks trying to share a GPU:
  • One that launches a Monte Carlo GPU kernel 1000x
  • One that launches a Black & Scholes GPU kernel 1000x
• The Monte Carlo kernel runs longer than the Black & Scholes kernel
• Compare the queuing time of each task's first subtask

                     Black & Scholes (msec)   Monte Carlo (msec)
    Mesos            126K                     38
    Mesos + Vinetalk 40                       42

• With Vinetalk, the short Black & Scholes task no longer waits behind the long-running Monte Carlo task.
17. Is it easy to use?
• It is much easier (30% fewer lines of code)
• It hides all vendor-specific APIs
• Example: porting of the ICCS FPGA financial application (FPL’17)
[Diagram: before, the user application (ICCS) calls SDAccel directly to reach the FPGA Black-Scholes kernel; after, Vinetalk sits between the application and SDAccel]
• We ported all SDAccel API calls inside Vinetalk
• User applications just need to use the Vinetalk API
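To give a flavor of the porting effect described above (the vine_task_issue name below is hypothetical, a stand-in for the real Vinetalk API rather than its actual signature):

    #include <stdio.h>

    /* Hypothetical stand-in for the Vinetalk entry point (made-up name and
       signature, used only to illustrate the before/after difference). */
    static void vine_task_issue(const char *kernel, const float *in,
                                float *out, size_t n) {
        /* In the real layer this would enqueue "subtask + data" on a VAQ. */
        printf("issuing %s on %zu elements\n", kernel, n);
        for (size_t i = 0; i < n; i++) out[i] = in[i];  /* placeholder result */
    }

    int main(void) {
        float in[4] = {1, 2, 3, 4}, out[4];
        /* Before the port, this call site owned the SDAccel/OpenCL
           boilerplate: platform/context setup, clCreateProgramWithBinary,
           buffer creation and transfers, clSetKernelArg, kernel launch,
           readback, teardown. After the port, it is one call: */
        vine_task_issue("black_scholes", in, out, 4);
        return 0;
    }

The boilerplate moves inside the layer once, and every application call site shrinks to a single accelerator-agnostic request, which is where the reported 30% code reduction comes from.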
20. My gift to you
• Vinetalk becomes open source as of today
• Apache v2.0 license
• Check it out:
  • https://github.com/vineyard2020/vinetalkSuite
21. Conclusions
• Problem: Accelerators cannot be shared in a cluster
• Cause: The big forest of (proprietary) device drivers
• Solution: Abstract accelerators through Vinetalk
  • Install Vinetalk on the workers
  • Offer VAQs as resources
• 1-5% overhead, due to memory transfers
• Easy integration with all Mesos frameworks (such as Spark)
• https://github.com/vineyard2020/vinetalkSuite