In the era of big data applications and heterogeneous high-performance computing architectures, deploying accelerators such as GPUs and FPGAs is unavoidable to meet performance constraints for compute-intensive application kernels. FPGAs provide energy-efficient, customized hardware accelerators that can be accessed simultaneously by multiple applications. The I/O software stack for data transfer to and from the FPGA is further challenged by the increasing number of parallel threads requesting access to accelerators and by the exponential growth in the volume of data sent to and received from FPGA accelerators. As a result, deploying FPGA-based accelerators in large software systems has been hindered by the lack of a scalable path from application software through the I/O interface that allows multiple cores to access FPGA-based accelerators at a high data transfer rate.
We need an interface scheme that scales with the number of host cores, the number of accelerators, and the volume of I/O data. Sharing accelerators as proposed in previous work introduces resource contention in the I/O software stack, i.e., a thread must wait for an accelerator while it is in use by another thread. To tackle this drawback, we present MQMAI, a multi-queue command-based access mechanism that reduces contention for access to multiple accelerators and enhances I/O parallelism. MQMAI consists of 1) a software stack on the host side and 2) a hardware controller on the FPGA side. Experimental results presented in this talk support our claim that the proposed framework provides a scalable and efficient data transfer scheme for FPGA-based multi-accelerator systems.
1. Scalability and Efficiency in Accelerator Sharing on FPGA Devices
Eli Bozorgzadeh
Computer Science Department
University of California, Irvine, USA
Seminar at NECST Lab, Politecnico di Milano - October 26, 2018
2. Why Accelerators?
• Era of big data and compute-intensive applications
• Failure of continuing Dennard scaling
• Getting close to the end of multi-core scaling
3. Accelerators in Heterogeneous Architectures
• Accelerators
  • GPUs: power hungry, SIMD
  • FPGAs: power efficient, support multiple accelerator types
  • Application-specific accelerators (ASICs)
5. FPGA based accelerator platform
• Hardware custom circuits
• Provide the environment for fine-grained parallelism and multiple accelerators
• Challenges
  • Lack of a high-level programming language
    • HLS tools (Vivado HLS, LegUp, Intel HLS Compiler)
  • Data transfer and accelerator invocation
    • System software support
    • Data interface

This talk focuses on efficient and scalable access to accelerators on FPGA.
6. FPGA Multi-Accelerator Framework
• Resource management of multiple accelerators on FPGA
• Data transfer

[Figure: a multi-core host (processor, memory, DMA) connected over PCIe to an FPGA with on-board DDR, input/output buffers, and Accelerators 1-4]
7. Related Work: Multi-Accelerator Invocation
• Open-source frameworks
  • RIFFA [TRETS 2015] by Jacobsen et al.
  • JetStream [FPL 2016] by Vesper et al.
  • ffLink [HEART 2015] by Chevallerie et al.
• Fully automated commercial tools
  • SDAccel from Xilinx
  • OPAE from Intel
8. Open-source framework: RIFFA
• RIFFA
  • Sends requests through PIO using multiple shared registers
  • Controls the data transfer process through PIO
  • Blocking I/O requests

[Figure: Cores 1 and 2 access the FPGA through the RIFFA driver; the RIFFA engine on the FPGA manages data transfer between the accelerator buffer and the on-board DDR over PCIe/DMA]
9. Commercial tools
• SDAccel
  • Command-based
  • Data transfer through buffers on the on-board DDR
  • Not efficient for streaming applications [FCCM 2018] by Ruan et al.

[Figure: Cores 1 and 2 invoke accelerator 2 through SDAccel queues and the SDAccel driver; data is staged in buffers 1-3 on the on-board DDR for Accelerators 1-3]
10. Our solution for Invoking Multiple accelerators
• Allocating accelerators to the requests
• Current platforms
  • The user defines the destination accelerator
  • Underutilization in the case of more than one accelerator of the same type
  • The user needs to know more about the hardware design structure

[Figure: Cores 1 and 2 both invoke accelerator 2 through host-side queues and the driver; a scheduler dispatches requests among Accelerators 1-3, with buffers 1-3 on the on-board DDR]

Our proposed framework addresses this drawback by moving the scheduler into hardware and adding hardware queues per accelerator type.
11. Multi-core accelerator invocation
• Accelerator requests from multiple cores
• Current platforms' policies
  • Rejecting all requests while a request is being processed
  • Conflicts between cores accessing a shared queue on the host side
  • Keeping a core locked in the case of a conflict

[Figure: Cores 1 and 2 both invoke accelerator type A through a shared host-side queue; the scheduler dispatches to two Accelerator A instances and one Accelerator B]
13. Multi-Queue Multi-Accelerator Interface (MQMAI)
Ø Proposing multiple queues for multi-core processors to manage resource contention in I/O-intensive applications
Ø Adapting the NVMe data transfer protocol to FPGA accelerator invocation
  § Each accelerator on the FPGA can be viewed as a memory block in an SSD, enabling parallel access
Ø Similar to NVMe, MQMAI bypasses the block layer and instead deploys a multi-queue mechanism

S. Rezaei, E. Bozorgzadeh, and K. Kim, "Multi-Queue Data Transfer Scheme for FPGA-based Multi-Accelerators," in ICCD 2018.
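As a rough sketch of this multi-queue idea, the per-core queue pairs can be modeled in a few lines of C; all names and sizes here are illustrative assumptions, not the actual MQMAI implementation:

```c
#include <stdint.h>

/* Per-core queue pairs, NVMe-style: each core owns its own submission
   queue (SQ) and completion queue (CQ), so no lock is shared between
   cores on the submit path. Sizes are illustrative. */
#define NCORES 4
#define QDEPTH 8

typedef struct {
    uint32_t entries[QDEPTH];
    uint32_t head, tail;        /* circular-queue indices */
} queue_t;

static queue_t sq[NCORES];      /* one submission queue per core */
static queue_t cq[NCORES];      /* one completion queue per core */

/* A core pushes only into its own SQ, touching no shared state. */
static int submit(int core, uint32_t cmd) {
    queue_t *q = &sq[core];
    uint32_t next = (q->tail + 1) % QDEPTH;
    if (next == q->head)
        return -1;              /* queue full: caller retries later */
    q->entries[q->tail] = cmd;
    q->tail = next;
    return 0;
}
```

With a single shared queue, submissions from two cores would contend on one tail pointer; with per-core queues they touch disjoint memory.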
15. MQMAI Software Stack
• Submission/completion queues
  • Circular queues with head/tail registers
• Submission command
  • Command ID, core ID, I/O command type, accelerator type
• Completion command
  • Command ID, validation bit
• Doorbell
  • The device driver writes the doorbell register on the hardware to signal a new request (PIO mode)
  • All other transfers go through DMA; only the doorbell write uses PIO
  • Frees up CPU time

[Figure: each core 1..N owns an SQ/CQ pair managed by the MQMAI driver in the OS; the MQMAI controller on the FPGA serves accelerators ACC 1..M]
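The command formats and the doorbell step above can be sketched as follows; the field widths and names are illustrative assumptions, not the actual MQMAI layout:

```c
#include <stdint.h>

/* Submission command: the fields listed on this slide. */
typedef struct {
    uint16_t cmd_id;        /* command ID                 */
    uint8_t  core_id;       /* issuing core               */
    uint8_t  io_type;       /* I/O command type           */
    uint8_t  acc_type;      /* requested accelerator type */
} sub_cmd_t;

/* Completion command: echoes the ID plus a validation bit. */
typedef struct {
    uint16_t cmd_id;
    uint8_t  valid;
} comp_cmd_t;

#define QDEPTH 16
static sub_cmd_t sq[QDEPTH];             /* circular submission queue */
static uint32_t  sq_head, sq_tail;
static volatile uint32_t doorbell;       /* stands in for the PIO doorbell register */

/* Push a command into the SQ, then "ring" the doorbell with the new
   tail; this is the only PIO write, everything else moves by DMA. */
static int sq_push(sub_cmd_t c) {
    uint32_t next = (sq_tail + 1) % QDEPTH;
    if (next == sq_head)
        return -1;                       /* queue full */
    sq[sq_tail] = c;
    sq_tail = next;
    doorbell = sq_tail;
    return 0;
}
```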
16. Hardware layer of MQMAI
• Doorbell Controller
• Command Controller
  • Command queue
  • Scheduler and accelerator allocation
• Scatter-Gather Manager
• Completion Module

[Figure: MQMAI hardware controller around the Xilinx PCIe integrated IP: a doorbell controller holding per-queue SQ head/tail addresses, a command fetch requester, a command controller (main command queue, command processor, and a lookup table acting as channel selector), a scatter-gather manager (SG commands queue, SG queue, SG processor), a channel controller with Rx/Tx channels 1..N feeding the accelerators, a completion module that writes to the completion queue, an MSI interrupt generator, and read/write requesters with RX/TX MUXes moving data to/from the DMA]
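The scheduler and lookup-table channel selector can be modeled in software as follows; this is an illustrative C model of the allocation policy, not the RTL:

```c
#include <stdint.h>

/* Lookup-table channel selector: map a requested accelerator *type*
   to any currently free channel serving that type. This is what lets
   two accelerators of the same type share the request load.
   The channel/type layout below is an illustrative example. */
#define NCHAN 4
static const uint8_t chan_type[NCHAN] = { 0, 0, 1, 2 };  /* channels 0,1: type 0 */
static uint8_t chan_busy[NCHAN];

/* Return a free channel index for acc_type, or -1 if all are busy
   (the command then waits in the main command queue). */
static int allocate_channel(uint8_t acc_type) {
    for (int ch = 0; ch < NCHAN; ch++) {
        if (chan_type[ch] == acc_type && !chan_busy[ch]) {
            chan_busy[ch] = 1;
            return ch;
        }
    }
    return -1;
}
```

Two back-to-back requests for the same accelerator type land on different instances, which is exactly the underutilization fix motivated on slide 10.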
17. Timing Diagram of a single accelerator call from host to FPGA
[Timing diagram participants: user application, CPU core, MQMAI driver, PC memory, DMA, MQMAI controller, Rx/Tx channels, accelerator]

1. The user application calls the library function.
2. The MQMAI driver makes the command and pushes it into the SQ.
3. The driver rings the doorbell.
4. The MQMAI controller requests the commands in the SQ and receives them.
5. The controller processes a command and assigns an accelerator.
6. The controller requests the scatter-gather (SG) list, receives it, and processes it.
7. The controller requests the input data, receives it, and writes it to the Rx channel dedicated to the selected accelerator.
8. The accelerator performs its computation.
9. On the Tx side, the controller requests, receives, and processes the SG list, then writes the result data to host memory.
10. The controller writes the completion command into the CQ and sends an interrupt to the core that sent the command.
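Steps 9-10 of the diagram, as seen from the host after the interrupt arrives, can be sketched as a completion-queue reap; the names and the validation-bit convention are illustrative assumptions:

```c
#include <stdint.h>

#define QDEPTH 16
typedef struct {
    uint16_t cmd_id;    /* echoes the submission command ID     */
    uint8_t  valid;     /* validation bit set by the controller */
} comp_cmd_t;

static comp_cmd_t cq[QDEPTH];    /* circular completion queue */
static uint32_t   cq_head;

/* Model of the controller writing a completion entry (step 10). */
static void post_completion(uint32_t slot, uint16_t cmd_id) {
    cq[slot].cmd_id = cmd_id;
    cq[slot].valid  = 1;
}

/* Driver side: consume the next completion after the interrupt.
   Returns the completed command's ID, or -1 if nothing valid yet. */
static int reap_completion(void) {
    comp_cmd_t *e = &cq[cq_head];
    if (!e->valid)
        return -1;                       /* head entry not written yet */
    e->valid = 0;                        /* clear so it is not re-read */
    cq_head = (cq_head + 1) % QDEPTH;
    return e->cmd_id;
}
```

The command ID in the completion lets the driver wake exactly the thread whose request finished, even when several commands are in flight.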
18. Experimental Setup
• Host side
  • Quad-core Intel i5-4590 processor at 3.3 GHz running Linux
  • 8 GB memory
  • PCI Express Gen3
• ADM-PCIE-7V3 Alpha Data board with a Xilinx Virtex-7
19. Experimental Setup
void *thread_main() {
    buffer *A;   /* input data  */
    buffer *B;   /* output data */
    …
    while (1) {
        /* Start of measuring the delay */
        for (i = 0; i < K; i++)
            ret = acc_fpga(fpga, acc_type, A, length, B, length);
        wait_fpga(fpga, acc_id, K);
        /* End of measuring the delay */
    }
}
23. Summary
• We proposed
  • An FPGA-based accelerator invocation framework that is scalable with
    • The number of accelerators on the FPGA
    • The number of parallel threads/applications invoking accelerators
  • Efficient management and allocation of multiple accelerators
    • Requests are assigned to accelerators in hardware by a scheduler
    • A command queue for each accelerator type on the hardware controller
  • Multi-core access with a minimal number of conflicts between requests
    • A pair of submission and completion queues per core
  • A full implementation from the software stack (libraries and device driver) down to the hardware controller
24. System Services for Reconfigurable Hardware Acceleration in Mobile Devices
Mobile Devices with Reconfigurable Accelerators
• Multi-core and heterogeneous compute platforms (CPUs, GPUs, FPGAs, and other DSP devices)
• Acceleration service: an approach to deliver improvements across power, performance, programmability, and portability

FPGA-based Acceleration
• Programmable according to the user application
• High performance due to hardware parallelism
• Power and energy efficient

T. Ting, E. Bozorgzadeh, and A. Amirisani, "System Services for Reconfigurable Hardware Acceleration in Mobile Devices," in IEEE ReConFig 2018.
25. Acceleration Service Prototype
• Software/hardware co-design
• Integrates mobile Android software and the FPGA design flow
• Acceleration scheduler: SW/HW libraries

[Figure: system overview on a Zynq programmable SoC: Android app code calls the acceleration client library, which reaches the acceleration service (acceleration scheduler with software and hardware libraries) over Binder IPC; the FPGA driver accesses hardware accelerators #1..#N through /dev/uio in the Linux kernel]
28. Conclusions
• Sharing FPGA-based accelerators is unavoidable
  • FPGAs are being deployed in data centers as accelerators for various applications
  • There is a need for multi-accelerator ecosystems that enable more efficient and scalable accelerator sharing in multi-accelerator, multi-core systems
• Our proposed Multi-Queue Multi-Accelerator data interface is a step toward efficient and scalable "shared" access to multiple accelerators
• Our proposed Android-based framework is a step toward an application-level solution that manages accelerators along with software library usage; application-level accelerator services can further enhance shared access to multi-accelerator platforms