In the era of big data applications and heterogeneous high-performance computing architectures, deploying accelerators such as GPUs and FPGAs is unavoidable to meet performance constraints for compute-intensive application kernels. FPGAs provide energy-efficient, customized hardware accelerators that can be accessed simultaneously by multiple applications. The I/O software stack for data transfer to and from the FPGA is further challenged by the increasing number of parallel threads requesting access to accelerators and by the exponential growth in the volume of data sent to and received from FPGA accelerators. As a result, deploying FPGA-based accelerators in large software systems has been hindered by the lack of a scalable path from application software through the I/O interface that allows multiple cores to access FPGA-based accelerators at a high data transfer rate.
We need an interface scheme that scales with the number of host cores, the number of accelerators, and the volume of I/O data. Sharing accelerators as proposed in previous work introduces resource contention in the I/O software stack, i.e., a thread must wait for an accelerator while it is in use by another thread. To tackle this drawback, we present MQMAI, a multi-queue command-based access mechanism that reduces contention for access to multiple accelerators and enhances I/O parallelism. MQMAI consists of 1) a software stack on the host side and 2) a hardware controller on the FPGA side. Experimental results presented in this talk support our claim that the proposed framework provides a scalable and efficient data transfer scheme for FPGA-based multi-accelerator systems.
1. Scalability and Efficiency in Accelerator Sharing on FPGA Devices
Eli Bozorgzadeh
Computer Science Department
University of California, Irvine, USA
Seminar at NECST Lab, Politecnico di Milano - October 26, 2018
2. Why Accelerators?
• Era of big data and compute-intensive applications
• Failure of continuing Dennard scaling
• Getting close to the end of multi-core scaling
3. Accelerators in Heterogeneous Architectures
• Accelerators
  • GPUs: power hungry, SIMD
  • FPGAs: power efficient, support multiple accelerator types
  • Application-specific accelerators (ASICs)
5. FPGA based accelerator platform
• Hardware custom circuits
• Provide the environment for fine-grained parallelism and multiple accelerators
• Challenges
  • Lack of a high-level programming language
    • HLS tools (Vivado HLS, LegUp, Intel HLS Compiler)
  • Data transfer and accelerator invocation
    • System software support
    • Data interface

This talk focuses on efficient and scalable access to accelerators on FPGA.
6. FPGA Multi-Accelerator Framework
• Resource management of multiple accelerators on FPGA
• Data transfer

[Figure: a multi-core host (processor, memory, DMA) connected over PCIe to an FPGA with on-board DDR, input/output buffers, and Accelerators 1-4]
7. Related Work: Multi-Accelerator Invocation
• Open-source frameworks
  • RIFFA [TRETS 2015] by Jacobsen et al.
  • JetStream [FPL 2016] by Vesper et al.
  • ffLink [HEART 2015] by Chevallerie et al.
• Fully automated commercial tools
  • SDAccel from Xilinx
  • OPAE from Intel
8. Open-source framework: RIFFA
• RIFFA
  • Sends requests through PIO using multiple shared registers
  • Controls the data transfer process through PIO
  • Blocking I/O requests

[Figure: Cores 1 and 2 access the FPGA through the RIFFA driver; the RIFFA engine on the FPGA manages data transfer between the accelerator buffer and the on-board DDR over PCIe/DMA]
9. Commercial tools
• SDAccel
  • Command-based
  • Data transfer through buffers on the on-board DDR
  • Not efficient for streaming applications [FCCM 2018] by Ruan et al.

[Figure: Cores 1 and 2 invoke accelerator 2 through SDAccel queues and the SDAccel driver; data is staged in buffers 1-3 on the on-board DDR for Accelerators 1-3]
10. Our solution for Invoking Multiple accelerators
• Allocating accelerators to the requests
• Current platforms
  • The user defines the destination accelerator
  • Underutilization in the case of more than one accelerator of the same type
  • The user needs to know more about the hardware design structure

[Figure: Cores 1 and 2 both invoke accelerator 2 through host-side queues and the driver; a scheduler dispatches requests among Accelerators 1-3, with buffers 1-3 on the on-board DDR]

Our proposed framework addresses this drawback by moving the scheduler into hardware and adding hardware queues per accelerator type.
11. Multi-core accelerator invocation
• Accelerator requests from multiple cores
• Current platforms' policies
  • Rejecting all requests while a request is being processed
  • Conflicts between cores accessing a shared queue on the host side
  • Keeping a core locked in the case of a conflict

[Figure: Cores 1 and 2 both invoke accelerator type A through a shared host-side queue; the scheduler dispatches to two Accelerator A instances and one Accelerator B]
13. Multi-Queue Multi-Accelerator Interface (MQMAI)
Ø Proposing multiple queues for multi-core processors to manage resource contention in I/O-intensive applications
Ø Adapting the NVMe data transfer protocol to FPGA accelerator invocation
  § Each accelerator on the FPGA can be viewed as a memory block in an SSD, enabling parallel access
Ø Similar to NVMe, MQMAI bypasses the block layer and instead deploys a multi-queue mechanism

S. Rezaei, E. Bozorgzadeh, and K. Kim, "Multi-Queue Data Transfer Scheme for FPGA-based Multi-Accelerators," in ICCD 2018.
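As a rough sketch of this multi-queue idea, the per-core queue pairs can be modeled in a few lines of C; all names and sizes here are illustrative assumptions, not the actual MQMAI implementation:

```c
#include <stdint.h>

/* Per-core queue pairs, NVMe-style: each core owns its own submission
   queue (SQ) and completion queue (CQ), so no lock is shared between
   cores on the submit path. Sizes are illustrative. */
#define NCORES 4
#define QDEPTH 8

typedef struct {
    uint32_t entries[QDEPTH];
    uint32_t head, tail;        /* circular-queue indices */
} queue_t;

static queue_t sq[NCORES];      /* one submission queue per core */
static queue_t cq[NCORES];      /* one completion queue per core */

/* A core pushes only into its own SQ, touching no shared state. */
static int submit(int core, uint32_t cmd) {
    queue_t *q = &sq[core];
    uint32_t next = (q->tail + 1) % QDEPTH;
    if (next == q->head)
        return -1;              /* queue full: caller retries later */
    q->entries[q->tail] = cmd;
    q->tail = next;
    return 0;
}
```

With a single shared queue, submissions from two cores would contend on one tail pointer; with per-core queues they touch disjoint memory.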
15. MQMAI Software Stack
• Submission/completion queues
  • Circular queues with head/tail registers
• Submission command
  • Command ID, core ID, I/O command type, accelerator type
• Completion command
  • Command ID, validation bit
• Doorbell
  • The device driver writes the doorbell register on the hardware to signal a new request (PIO mode)
  • All other transfers go through DMA; only the doorbell write uses PIO
  • Frees up CPU time

[Figure: each core 1..N owns an SQ/CQ pair managed by the MQMAI driver in the OS; the MQMAI controller on the FPGA serves accelerators ACC 1..M]
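The command formats and the doorbell step above can be sketched as follows; the field widths and names are illustrative assumptions, not the actual MQMAI layout:

```c
#include <stdint.h>

/* Submission command: the fields listed on this slide. */
typedef struct {
    uint16_t cmd_id;        /* command ID                 */
    uint8_t  core_id;       /* issuing core               */
    uint8_t  io_type;       /* I/O command type           */
    uint8_t  acc_type;      /* requested accelerator type */
} sub_cmd_t;

/* Completion command: echoes the ID plus a validation bit. */
typedef struct {
    uint16_t cmd_id;
    uint8_t  valid;
} comp_cmd_t;

#define QDEPTH 16
static sub_cmd_t sq[QDEPTH];             /* circular submission queue */
static uint32_t  sq_head, sq_tail;
static volatile uint32_t doorbell;       /* stands in for the PIO doorbell register */

/* Push a command into the SQ, then "ring" the doorbell with the new
   tail; this is the only PIO write, everything else moves by DMA. */
static int sq_push(sub_cmd_t c) {
    uint32_t next = (sq_tail + 1) % QDEPTH;
    if (next == sq_head)
        return -1;                       /* queue full */
    sq[sq_tail] = c;
    sq_tail = next;
    doorbell = sq_tail;
    return 0;
}
```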
16. Hardware layer of MQMAI
• Doorbell Controller
• Command Controller
  • Command queue
  • Scheduler and accelerator allocation
• Scatter-Gather Manager
• Completion Module

[Figure: MQMAI hardware controller around the Xilinx PCIe integrated IP: a doorbell controller holding per-queue SQ head/tail addresses, a command fetch requester, a command controller (main command queue, command processor, and a lookup table acting as channel selector), a scatter-gather manager (SG commands queue, SG queue, SG processor), a channel controller with Rx/Tx channels 1..N feeding the accelerators, a completion module that writes to the completion queue, an MSI interrupt generator, and read/write requesters with RX/TX MUXes moving data to/from the DMA]
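The scheduler and lookup-table channel selector can be modeled in software as follows; this is an illustrative C model of the allocation policy, not the RTL:

```c
#include <stdint.h>

/* Lookup-table channel selector: map a requested accelerator *type*
   to any currently free channel serving that type. This is what lets
   two accelerators of the same type share the request load.
   The channel/type layout below is an illustrative example. */
#define NCHAN 4
static const uint8_t chan_type[NCHAN] = { 0, 0, 1, 2 };  /* channels 0,1: type 0 */
static uint8_t chan_busy[NCHAN];

/* Return a free channel index for acc_type, or -1 if all are busy
   (the command then waits in the main command queue). */
static int allocate_channel(uint8_t acc_type) {
    for (int ch = 0; ch < NCHAN; ch++) {
        if (chan_type[ch] == acc_type && !chan_busy[ch]) {
            chan_busy[ch] = 1;
            return ch;
        }
    }
    return -1;
}
```

Two back-to-back requests for the same accelerator type land on different instances, which is exactly the underutilization fix motivated on slide 10.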
17. Timing Diagram of a single accelerator call from host to FPGA
[Timing diagram participants: user application, CPU core, MQMAI driver, PC memory, DMA, MQMAI controller, Rx/Tx channels, accelerator]

1. The user application calls the library function.
2. The MQMAI driver makes the command and pushes it into the SQ.
3. The driver rings the doorbell.
4. The MQMAI controller requests the commands in the SQ and receives them.
5. The controller processes a command and assigns an accelerator.
6. The controller requests the scatter-gather (SG) list, receives it, and processes it.
7. The controller requests the input data, receives it, and writes it to the Rx channel dedicated to the selected accelerator.
8. The accelerator performs its computation.
9. On the Tx side, the controller requests, receives, and processes the SG list, then writes the result data to host memory.
10. The controller writes the completion command into the CQ and sends an interrupt to the core that sent the command.
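Steps 9-10 of the diagram, as seen from the host after the interrupt arrives, can be sketched as a completion-queue reap; the names and the validation-bit convention are illustrative assumptions:

```c
#include <stdint.h>

#define QDEPTH 16
typedef struct {
    uint16_t cmd_id;    /* echoes the submission command ID     */
    uint8_t  valid;     /* validation bit set by the controller */
} comp_cmd_t;

static comp_cmd_t cq[QDEPTH];    /* circular completion queue */
static uint32_t   cq_head;

/* Model of the controller writing a completion entry (step 10). */
static void post_completion(uint32_t slot, uint16_t cmd_id) {
    cq[slot].cmd_id = cmd_id;
    cq[slot].valid  = 1;
}

/* Driver side: consume the next completion after the interrupt.
   Returns the completed command's ID, or -1 if nothing valid yet. */
static int reap_completion(void) {
    comp_cmd_t *e = &cq[cq_head];
    if (!e->valid)
        return -1;                       /* head entry not written yet */
    e->valid = 0;                        /* clear so it is not re-read */
    cq_head = (cq_head + 1) % QDEPTH;
    return e->cmd_id;
}
```

The command ID in the completion lets the driver wake exactly the thread whose request finished, even when several commands are in flight.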
18. Experimental Setup
• Host side
  • Quad-core Intel i5-4590 processor at 3.3 GHz running Linux
  • 8 GB memory
  • PCI Express Gen3
• ADM-PCIE-7V3 Alpha Data board with a Xilinx Virtex-7
19. Experimental Setup
void *thread_main() {
    buffer *A;   /* input data  */
    buffer *B;   /* output data */
    …
    while (1) {
        /* Start of measuring the delay */
        for (i = 0; i < K; i++)
            ret = acc_fpga(fpga, acc_type, A, length, B, length);
        wait_fpga(fpga, acc_id, K);
        /* End of measuring the delay */
    }
}
23. Summary
• We proposed
  • An FPGA-based accelerator invocation framework that is scalable with
    • The number of accelerators on the FPGA
    • The number of parallel threads/applications invoking accelerators
  • Efficient management and allocation of multiple accelerators
    • Requests are assigned to accelerators in hardware by a scheduler
    • A command queue for each accelerator type on the hardware controller
  • Multi-core access with a minimal number of conflicts between requests
    • A pair of submission and completion queues per core
  • A full implementation from the software stack (libraries and device driver) down to the hardware controller
24. System Services for Reconfigurable Hardware Acceleration in Mobile Devices
Mobile Devices with Reconfigurable Accelerators
• Multi-core and heterogeneous compute platforms (CPUs, GPUs, FPGAs, and other DSP devices)
• Acceleration service: an approach to deliver improvements across power, performance, programmability, and portability

FPGA-based Acceleration
• Programmable according to the user application
• High performance due to hardware parallelism
• Power and energy efficient

T. Ting, E. Bozorgzadeh, and A. Amirisani, "System Services for Reconfigurable Hardware Acceleration in Mobile Devices," in IEEE ReConFig 2018.
25. Acceleration Service Prototype
• Software/hardware co-design
• Integrates mobile Android software and the FPGA design flow
• Acceleration scheduler: SW/HW libraries

[Figure: system overview on a Zynq programmable SoC: Android app code calls the acceleration client library, which reaches the acceleration service (acceleration scheduler with software and hardware libraries) over Binder IPC; the FPGA driver accesses hardware accelerators #1..#N through /dev/uio in the Linux kernel]
28. Conclusions
• Sharing FPGA-based accelerators is unavoidable
  • FPGAs are being deployed in data centers as accelerators for various applications
  • There is a need for multi-accelerator ecosystems that enable more efficient and scalable accelerator sharing in multi-accelerator, multi-core systems
• Our proposed Multi-Queue Multi-Accelerator data interface is a step toward efficient and scalable "shared" access to multiple accelerators
• Our proposed Android-based framework is a step toward an application-level solution that manages accelerators along with software library usage; application-level accelerator services can further enhance shared access to multi-accelerator platforms