Scalability and Efficiency in
Accelerator Sharing on FPGA Devices
Eli Bozorgzadeh
Computer Science Department
University of California, Irvine, USA
Seminar at NECST Lab, Politecnico di Milano, October 26, 2018
Why Accelerators?
•  Era of big data and computationally intensive applications
•  Dennard scaling no longer continues
•  Multi-core scaling is getting close to its end
[Figure: big data applications and computationally intensive applications]
Accelerators in Heterogeneous Architectures
•  Accelerators
    •  GPUs
        •  Power hungry
        •  SIMD
    •  FPGAs
        •  Power efficient
        •  Multiple accelerator types
    •  Application-specific accelerators (ASIC)
FPGA-Based Accelerator Platform
•  Custom hardware circuits
•  Provide an environment for fine-grained parallelism and multiple accelerators
•  Challenges
    •  Lack of a high-level programming language
        •  HLS tools (Vivado HLS, LegUp, Intel HLS Compiler)
    •  Data transfer and accelerator invocation
        •  System software support
        •  Data interface
This talk focuses on efficient and scalable access to accelerators on FPGAs.
FPGA Multi-Accelerator Framework
•  Resource management of multiple accelerators on the FPGA
•  Data transfer
[Figure: host (memory, DMA, processor cores) connected over PCIe to an FPGA with on-board DDR, an input buffer, an output buffer, and Accelerators 1 to 4]
Related Work: Multi-Accelerator Invocation
•  Open-source frameworks
    •  RIFFA [TRETS 2015] by Jacobsen et al.
    •  JetStream [FPL 2016] by Vesper et al.
    •  ffLink [HEART 2015] by Chevallerie et al.
•  Fully automated commercial tools
    •  SDAccel from Xilinx
    •  OPAE from Intel
Open-Source Framework: RIFFA
•  RIFFA
    •  Sends requests through PIO using multiple shared registers
    •  Controls the data transfer process through PIO
    •  Blocking I/O requests
[Figure: host (memory, DMA, Core 1, Core 2, RIFFA driver) connected over PCIe to the FPGA (RIFFA, accelerator, buffer, on-board DDR); arrows mark data transfer management and data transfer]
Commercial Tools
•  SDAccel
    •  Command-based
    •  Data transfer
        •  Buffers on DDR
        •  Not efficient for streaming applications [FCCM 2018] by Ruan et al.
[Figure: host (memory, DMA, Core 1, Core 2, SDAccel driver with queues) connected over PCIe to the FPGA (SDAccel, Accelerators 1 to 3, Buffers 1 to 3 in on-board DDR); the example request invokes accelerator 2]
Our Solution for Invoking Multiple Accelerators
•  Allocating accelerators to the requests
•  Current platforms
    •  The user specifies the destination accelerator
    •  Underutilization when there is more than one accelerator of the same type
    •  The user needs to know more about the hardware design structure
Our proposed framework addresses this drawback by moving the scheduler to the hardware and adding hardware queues per accelerator type.
[Figure: Core 1 and Core 2 each invoke accelerator 2 through the driver queues; on the FPGA, a scheduler with a queue sits in front of Accelerator 1, two instances of Accelerator 2, and Accelerator 3, with Buffers 1 to 3 in on-board DDR]
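The allocation idea on this slide (the request names only an accelerator type, and the scheduler picks any idle instance of that type) can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware scheduler from the talk: the names `accel_t`, `allocate_accel`, and `release_accel` are hypothetical, and a first-idle-instance policy is assumed because the slides do not state the policy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one accelerator instance on the FPGA. */
typedef struct {
    int  type;  /* accelerator type, e.g. 0 for type A, 1 for type B */
    bool busy;  /* true while the instance is processing a request */
} accel_t;

/* Pick the first idle accelerator of the requested type and mark it busy.
 * Returns its index, or -1 if every instance of that type is busy, in which
 * case the request would stay in the queue for that accelerator type. */
int allocate_accel(accel_t *accels, size_t n, int type)
{
    for (size_t i = 0; i < n; i++) {
        if (accels[i].type == type && !accels[i].busy) {
            accels[i].busy = true;
            return (int)i;
        }
    }
    return -1;
}

/* Mark an accelerator idle again once its request has completed. */
void release_accel(accel_t *accels, size_t n, int idx)
{
    if (idx >= 0 && (size_t)idx < n)
        accels[idx].busy = false;
}
```

With two instances of the same type, two consecutive requests for that type land on different instances, which is the underutilization case the slide describes for platforms where the user hard-codes the destination.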
Multi-Core Accelerator Invocation
•  Accelerator requests from multiple cores
•  Policies of current platforms
    •  All other requests are rejected while a request is being processed
    •  Cores conflict when accessing a shared queue on the host side
    •  A core is kept locked in case of a conflict
[Figure: Core 1 and Core 2 each invoke accelerator type A through the driver queues; on the FPGA, a scheduler with a queue sits in front of two type-A accelerators and one type-B accelerator, with Buffers 1 to 3 in on-board DDR]
Multi-Queue Multi-Accelerator Interface (MQMAI)
[Figure: MQMAI overview. On the host, the MQMAI driver holds a submission queue and a completion queue for each of Core 1 and Core 2. On the FPGA, the MQMAI controller contains a scheduler and accelerator type queues in front of two type-A accelerators and one type-B accelerator, with Buffers 1 to 3 in on-board DDR. Both cores invoke accelerator type A.]
Multi-Queue Multi-Accelerator Interface (MQMAI)
•  Proposes multiple queues for multi-core processors to manage resource contention in I/O-intensive applications
•  Adapts the NVMe data transfer protocol for FPGA accelerator invocation
    •  Each accelerator on the FPGA can be viewed as a memory block in an SSD for parallel access
•  Similar to NVMe, MQMAI bypasses the block layer and instead deploys a multi-queue mechanism
S. Rezaei, E. Bozorgzadeh, and K. Kim, “Multi-Queue Data Transfer Scheme for FPGA-based Multi-Accelerators”, in ICCD 2018.
MQMAI Architecture
•  Software stack
    •  C/C++ library layer
    •  Device driver layer
        •  Submission/completion commands
        •  Submission/completion queues
        •  Doorbell; interrupt handler
•  Hardware stack
    •  Controller
        •  Doorbell register management, command management (scheduler and accelerator allocation), data movement controller (scatter-gather), and completion management
    •  PCIe interface
    •  Accelerator interface
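The scatter-gather data movement named in the hardware stack can be illustrated with a small software model. The slides do not give the scatter-gather entry format, so the `sg_entry_t` layout (address plus length) and the `sg_gather` helper are assumptions made for illustration only; the sketch shows how fragments scattered in host memory are gathered in order into one contiguous channel buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scatter-gather entry: one contiguous fragment in host memory. */
typedef struct {
    const unsigned char *addr;  /* start of the fragment */
    size_t               len;   /* fragment length in bytes */
} sg_entry_t;

/* Copy the fragments of an SG list, in order, into one contiguous buffer.
 * Returns the number of bytes copied, or 0 if the fragments do not fit in
 * dst_cap bytes (dst may then hold the fragments that did fit). */
size_t sg_gather(const sg_entry_t *list, size_t n,
                 unsigned char *dst, size_t dst_cap)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (list[i].len > dst_cap - total)
            return 0;  /* next fragment would overflow the buffer */
        memcpy(dst + total, list[i].addr, list[i].len);
        total += list[i].len;
    }
    return total;
}
```

In the real design this work is done by DMA on the FPGA side; the model only shows the ordering and bounds logic.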
MQMAI Software Stack
•  Submission/completion queues
    •  Circular queues with head/tail registers
•  Submission command
    •  Command ID, core ID, I/O command type, accelerator type
•  Completion command
    •  Command ID, validation bit
•  Doorbell
    •  The device driver writes the doorbell register on the hardware to signal a new request (PIO mode)
    •  All requests go through DMA except the doorbell write
    •  Frees up CPU time
[Figure: under the OS, the MQMAI driver keeps one submission queue (SQ) and one completion queue (CQ) per core for Core 1 to Core N; on the FPGA, the MQMAI controller connects them to ACC 1 to ACC M]
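A per-core submission queue as described above (a circular queue with head and tail, holding commands with a command ID, core ID, I/O command type, and accelerator type) might look like the following sketch. The field widths, the queue depth `SQ_DEPTH`, and the function names are assumptions; the slides list the fields but not their encoding.

```c
#include <stdbool.h>
#include <stdint.h>

#define SQ_DEPTH 8u  /* hypothetical number of slots per queue */

/* Submission command with the fields listed on the slide
 * (field widths are assumptions). */
typedef struct {
    uint16_t command_id;
    uint8_t  core_id;
    uint8_t  io_type;     /* I/O command type */
    uint8_t  accel_type;  /* requested accelerator type */
} sq_cmd_t;

/* Circular submission queue owned by a single core. */
typedef struct {
    sq_cmd_t slots[SQ_DEPTH];
    uint32_t head;  /* next slot the consumer (controller) reads */
    uint32_t tail;  /* next slot the producer (driver) writes */
} sq_t;

/* Append a command; returns false when the queue is full.
 * One slot is left empty so that head == tail always means "empty". */
bool sq_push(sq_t *q, sq_cmd_t cmd)
{
    uint32_t next = (q->tail + 1u) % SQ_DEPTH;
    if (next == q->head)
        return false;
    q->slots[q->tail] = cmd;
    q->tail = next;
    return true;
}

/* Remove the oldest command; returns false when the queue is empty. */
bool sq_pop(sq_t *q, sq_cmd_t *out)
{
    if (q->head == q->tail)
        return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1u) % SQ_DEPTH;
    return true;
}
```

Because each core owns its own queue, a push needs no lock shared with other cores, which is the contention the multi-queue design is meant to avoid.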
Hardware Layer of MQMAI
[Figure: block diagram of the MQMAI controller around the Xilinx PCI integrated IP, with RX and TX multiplexers and read and write requesters. Blocks: doorbell controller (SQ 1 to SQ N head and tail addresses, command fetch requester); command controller (main command queue, command processor, lookup table used as channel selector); scatter-gather manager (SG commands queue, SG queue, SG processor); channel controller (Rx and Tx channels 1 to N, data from DMA); completion module (read/write completion, write to the completion queue, MSI interrupt generator)]
•  Doorbell controller
•  Command controller
    •  Command queue
    •  Scheduler and accelerator allocation
•  Scatter-gather manager
•  Completion module
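The doorbell controller in the diagram tracks a head address and a tail address for every submission queue. One plausible use of that pair, assumed here rather than stated on the slide, is to work out how many commands are waiting to be fetched. The helper name `sq_pending` and the use of slot indices instead of addresses are illustrative choices.

```c
#include <stdint.h>

/* Number of commands waiting in a circular submission queue.
 *   head:  index of the next command the controller will fetch
 *   tail:  index where the driver will write next (updated via the doorbell)
 *   depth: number of slots; must be > 0, and head and tail must be < depth
 * Adding depth before subtracting keeps the result correct after the tail
 * has wrapped around past the end of the queue. */
uint32_t sq_pending(uint32_t head, uint32_t tail, uint32_t depth)
{
    return (tail + depth - head) % depth;
}
```

For example, with a depth of 8, a head of 6, and a tail of 1, the tail has wrapped and three commands (slots 6, 7, and 0) are pending.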
Timing Diagram of a Single Accelerator Call from Host to FPGA
[Figure: sequence diagram. Host side: user application, MQMAI driver, CPU core, PC memory, DMA. FPGA side: MQMAI controller, Rx channel, accelerator (ACC), Tx channel.]
•  Call the library function
•  Build the command and push it into the SQ
•  Ring the doorbell
•  Request the commands in the SQ; receive the commands
•  Process a command and assign an accelerator
•  Request the SG list; receive and process the SG list
•  Request the data; receive the data
•  Write the data to the Rx channel dedicated to the selected accelerator
•  Accelerator computation
•  On the Tx channel: request the SG list, receive and process it, and write the result data to host memory
•  Write the completion command into the CQ
•  Send an interrupt to the core that sent the command
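The last two steps of the call (the completion command written into the CQ, then an interrupt to the issuing core) suggest a handler on the host that drains finished commands. The sketch below is an assumption-laden model: it uses the command ID and validation bit named in the software stack slide, but the entry layout, the depth `CQ_DEPTH`, and the `cq_drain` function are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CQ_DEPTH 4u  /* hypothetical number of slots per completion queue */

/* Completion entry: command ID plus a validation bit. */
typedef struct {
    uint16_t command_id;
    bool     valid;  /* set by the controller when it writes the entry */
} cq_entry_t;

/* Circular completion queue owned by a single core. */
typedef struct {
    cq_entry_t slots[CQ_DEPTH];
    uint32_t   head;  /* next slot the driver inspects */
} cq_t;

/* Called from the interrupt handler in this model: collect the IDs of all
 * completed commands starting at head, clearing each validation bit so the
 * slot can be reused. Returns the number of IDs stored in done[]. */
unsigned cq_drain(cq_t *q, uint16_t *done, unsigned max_done)
{
    unsigned n = 0;
    while (n < max_done && q->slots[q->head].valid) {
        done[n++] = q->slots[q->head].command_id;
        q->slots[q->head].valid = false;
        q->head = (q->head + 1u) % CQ_DEPTH;
    }
    return n;
}
```

Clearing the bit on consumption is one simple convention; other designs, NVMe among them, toggle a phase bit instead.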
Experimental Setup
•  Host side
    •  Quad-core Intel Core i5-4590 processor running at 3.3 GHz with the Linux operating system
    •  8 GB memory
•  PCI Express Gen3
•  Alpha Data ADM-PCIE-7V3 board with a Xilinx Virtex-7 FPGA
Experimental Setup
void *thread_main(void) {
    buffer *A;  // input data
    buffer *B;  // output data
    …
    while (1) {
        // Start of measuring the delay
        for (i = 0; i < K; i++)
            ret = acc_fpga(fpga, acc_type, A, length, B, length);
        wait_fpga(fpga, acc_id, K);
        // End of measuring the delay
    }
}
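The "start" and "end of measuring the delay" markers in the benchmark loop imply an average end-to-end delay per request over a batch of K calls. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the timestamps are assumed to come from a monotonic clock such as POSIX `clock_gettime(CLOCK_MONOTONIC, ...)`; the slides do not say which timer was used.

```c
#include <stdint.h>

/* Convert a (seconds, nanoseconds) timestamp, such as the two fields of a
 * struct timespec, into a single nanosecond count. */
uint64_t to_ns(int64_t sec, int64_t nsec)
{
    return (uint64_t)sec * 1000000000u + (uint64_t)nsec;
}

/* Average end-to-end delay per request for a batch of k requests that
 * started at start_ns and finished at end_ns. Returns 0.0 for an empty
 * batch or for timestamps that are out of order. */
double avg_delay_ns(uint64_t start_ns, uint64_t end_ns, unsigned k)
{
    if (k == 0 || end_ns < start_ns)
        return 0.0;
    return (double)(end_ns - start_ns) / (double)k;
}
```

Taking one timestamp before the loop and one after the wait call, then dividing by K, amortizes the timer overhead across the whole batch.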
Experimental Results
•  Average end-to-end delay in the multi-thread, single-accelerator case
    •  Impact of the command-based multi-queue feature
[Figure: average end-to-end delay, annotated "18x faster"]
Experimental Results
•  Average end-to-end delay in the multi-thread, multi-accelerator case
    •  Impact of accelerator allocation management
[Figure: average end-to-end delay, annotated "60x faster"]
Experimental Results
•  Resource utilization on a Xilinx Virtex-7 FPGA for 4 I/O channels
[Figure: resource utilization]
Summary
•  We proposed
    •  An efficient FPGA-based accelerator invocation framework that scales with
        •  The number of accelerators on the FPGA
        •  The number of parallel threads/applications invoking accelerators
    •  Efficient management and allocation of multiple accelerators
        •  A scheduler on the hardware assigns requests to the accelerators
        •  A command queue for each accelerator type on the hardware controller
    •  Multi-core access that minimizes the number of conflicts between requests
        •  A pair of submission and completion queues per core
    •  A full implementation, from the software stack (libraries and device driver) down to the hardware controller
System Services for Reconfigurable Hardware Acceleration in Mobile Devices
Mobile Devices with Reconfigurable Accelerators
•  Multi-core and heterogeneous compute platforms (CPUs, GPUs, FPGAs, and DSP devices)
•  Acceleration service: an approach to deliver improvements across power, performance, programmability, and portability
FPGA-Based Acceleration
•  Programmable according to the user application
•  High performance due to hardware parallelism
•  Power and energy efficient
T. Ting, E. Bozorgzadeh, and A. Amirisani, “System Services for Reconfigurable Hardware Acceleration in Mobile Devices”, IEEE ReConFig 2018.
Acceleration Service Prototype
•  Software/hardware co-design
•  Integrates the mobile Android software flow and the FPGA design flow
•  Acceleration scheduler: SW/HW libraries
[Figure: prototype on a Zynq programmable SoC. In user space on the Android platform, Android applications (app code with the acceleration client library) communicate over IPC with the acceleration service, which contains the acceleration scheduler, the software library, the hardware library, and the FPGA driver. In kernel space, the Linux kernel exposes /dev/binder and /dev/uio. The hardware holds accelerators #1 to #N.]
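The pairing of an acceleration scheduler with a software library and a hardware library suggests a dispatch step that chooses between the two implementations of a kernel. The sketch below assumes a simple policy (use the hardware version when one exists and is available, otherwise fall back to software); the slides do not describe the actual policy, and every name here, including the stand-in kernel `sum_sw`, is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Signature shared by the software and hardware versions of a kernel. */
typedef int (*kernel_fn)(const int *in, size_t n);

/* One entry of a hypothetical acceleration service table. */
typedef struct {
    kernel_fn software;      /* software library routine, always present */
    kernel_fn hardware;      /* hardware library routine, or NULL if none */
    bool      hw_available;  /* true when the FPGA accelerator can be used */
} accel_service_entry;

/* Stand-in software kernel used for illustration: sums the input. */
int sum_sw(const int *in, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += in[i];
    return s;
}

/* Run the hardware version when it exists and is available, otherwise the
 * software version. If used_hw is not NULL, it reports which one ran. */
int run_kernel(const accel_service_entry *e, const int *in, size_t n,
               bool *used_hw)
{
    bool hw = e->hardware != NULL && e->hw_available;
    if (used_hw)
        *used_hw = hw;
    return hw ? e->hardware(in, n) : e->software(in, n);
}
```

Keeping both versions behind one call is what lets the application stay unchanged whether or not an accelerator is present.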
System Overview
[Figure: system overview; label set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; target device: Zynq SoC]
Preliminary Results
[Figure: preliminary results; label set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]
Conclusions
•  Sharing FPGA-based accelerators is unavoidable
    •  FPGAs are being deployed in data centers as accelerators for various applications
    •  There is a need for multi-accelerator ecosystems that enable more efficient and scalable accelerator sharing in multi-accelerator, multi-core systems
•  Our proposed Multi-Queue Multi-Accelerator data interface is a step toward efficient and scalable “shared” access to multiple accelerators.
•  Our proposed Android-based framework is a step toward an application-level solution that manages accelerators along with software library usage. An application-level accelerator service can further enhance shared access to multi-accelerator platforms.
Acknowledgement
•  PhD students
    •  Siavash Rezaei
    •  Hsin-Yu Ting
•  Collaborators
    •  Kanghee Kim, Soongsil Univ., Korea
Thank you!
eli@ics.uci.edu

More Related Content

What's hot

Session 8,9 PCI Express
Session 8,9 PCI ExpressSession 8,9 PCI Express
Session 8,9 PCI Express
Subhash Iyer
 
Generic and Automatic Specman Based Verification Environment
Generic and Automatic Specman Based Verification EnvironmentGeneric and Automatic Specman Based Verification Environment
Generic and Automatic Specman Based Verification Environment
DVClub
 
Xilinx fpga cores
Xilinx fpga coresXilinx fpga cores
Xilinx fpga cores
sanaz nouri
 
Sessions 6,7 Ethernet
Sessions 6,7 EthernetSessions 6,7 Ethernet
Sessions 6,7 Ethernet
Subhash Iyer
 
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSRun-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Heiko Joerg Schick
 

What's hot (20)

Implementation of Soft-core Processor on FPGA
Implementation of Soft-core Processor on FPGAImplementation of Soft-core Processor on FPGA
Implementation of Soft-core Processor on FPGA
 
BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!
 
The Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft Processor
 
Session 8,9 PCI Express
Session 8,9 PCI ExpressSession 8,9 PCI Express
Session 8,9 PCI Express
 
Thaker q3 2008
Thaker q3 2008Thaker q3 2008
Thaker q3 2008
 
100 M pps on PC.
100 M pps on PC.100 M pps on PC.
100 M pps on PC.
 
Int 1010 Tcp Offload
Int 1010 Tcp OffloadInt 1010 Tcp Offload
Int 1010 Tcp Offload
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
 
Generic and Automatic Specman Based Verification Environment
Generic and Automatic Specman Based Verification EnvironmentGeneric and Automatic Specman Based Verification Environment
Generic and Automatic Specman Based Verification Environment
 
Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)
 
Rhino labs Prese4th ntation At FPGA Camp, Santa Clara, CA
Rhino labs Prese4th ntation At FPGA Camp, Santa Clara, CARhino labs Prese4th ntation At FPGA Camp, Santa Clara, CA
Rhino labs Prese4th ntation At FPGA Camp, Santa Clara, CA
 
Smart logic
Smart logicSmart logic
Smart logic
 
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking TechnologyDesign of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
 
Microblaze
MicroblazeMicroblaze
Microblaze
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel Awareness
 
Tech Days 2015: Embedded Product Update
Tech Days 2015: Embedded Product UpdateTech Days 2015: Embedded Product Update
Tech Days 2015: Embedded Product Update
 
Xilinx fpga cores
Xilinx fpga coresXilinx fpga cores
Xilinx fpga cores
 
Sessions 6,7 Ethernet
Sessions 6,7 EthernetSessions 6,7 Ethernet
Sessions 6,7 Ethernet
 
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSRun-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
 
PCIe Gen 3.0 Presentation @ 4th FPGA Camp
PCIe Gen 3.0 Presentation @ 4th FPGA CampPCIe Gen 3.0 Presentation @ 4th FPGA Camp
PCIe Gen 3.0 Presentation @ 4th FPGA Camp
 

Similar to Scalability and Efficiency in Accelerator Sharing on FPGA Devices

SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4
UniFabric
 
Verification Strategy for PCI-Express
Verification Strategy for PCI-ExpressVerification Strategy for PCI-Express
Verification Strategy for PCI-Express
DVClub
 

Similar to Scalability and Efficiency in Accelerator Sharing on FPGA Devices (20)

Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
 
ODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & FeedsODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & Feeds
 
Synopsys User Group Presentation
Synopsys User Group PresentationSynopsys User Group Presentation
Synopsys User Group Presentation
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Exploration of Radars and Software Defined Radios using VisualSim
Exploration of  Radars and Software Defined Radios using VisualSimExploration of  Radars and Software Defined Radios using VisualSim
Exploration of Radars and Software Defined Radios using VisualSim
 
HiPEAC-Keynote.pptx
HiPEAC-Keynote.pptxHiPEAC-Keynote.pptx
HiPEAC-Keynote.pptx
 
SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
 
Verification Strategy for PCI-Express
Verification Strategy for PCI-ExpressVerification Strategy for PCI-Express
Verification Strategy for PCI-Express
 
Protocol for QoS Support Chapter 18
Protocol for QoS Support Chapter 18Protocol for QoS Support Chapter 18
Protocol for QoS Support Chapter 18
 
LEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous HardwareLEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous Hardware
 
Building a Router
Building a RouterBuilding a Router
Building a Router
 
Thaker q3 2008
Thaker q3 2008Thaker q3 2008
Thaker q3 2008
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 
High-performance 32G Fibre Channel Module on MDS 9700 Directors:
High-performance 32G Fibre Channel Module on MDS 9700 Directors:High-performance 32G Fibre Channel Module on MDS 9700 Directors:
High-performance 32G Fibre Channel Module on MDS 9700 Directors:
 
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack
Summit 16: Deploying Virtualized Mobile Infrastructures on OpenstackSummit 16: Deploying Virtualized Mobile Infrastructures on Openstack
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack
 
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
 

More from NECST Lab @ Politecnico di Milano

Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposing
NECST Lab @ Politecnico di Milano
 

More from NECST Lab @ Politecnico di Milano (20)

Mesticheria Team - WiiReflex
Mesticheria Team - WiiReflexMesticheria Team - WiiReflex
Mesticheria Team - WiiReflex
 
Punto e virgola Team - Stressometro
Punto e virgola Team - StressometroPunto e virgola Team - Stressometro
Punto e virgola Team - Stressometro
 
BitIt Team - Stay.straight
BitIt Team - Stay.straight BitIt Team - Stay.straight
BitIt Team - Stay.straight
 
BabYodini Team - Talking Gloves
BabYodini Team - Talking GlovesBabYodini Team - Talking Gloves
BabYodini Team - Talking Gloves
 
printf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTonprintf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTon
 
BlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking PlatformBlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking Platform
 
#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome
 
Flipflops Team - Wave U
Flipflops Team - Wave UFlipflops Team - Wave U
Flipflops Team - Wave U
 
Bug(atta) Team - Little Brother
Bug(atta) Team - Little BrotherBug(atta) Team - Little Brother
Bug(atta) Team - Little Brother
 
#NECSTCamp: come partecipare
#NECSTCamp: come partecipare#NECSTCamp: come partecipare
#NECSTCamp: come partecipare
 
NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1
 
NECSTLab101 2020.2021
NECSTLab101 2020.2021NECSTLab101 2020.2021
NECSTLab101 2020.2021
 
TreeHouse, nourish your community
TreeHouse, nourish your communityTreeHouse, nourish your community
TreeHouse, nourish your community
 
TiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architectureTiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architecture
 
Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposing
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
 EMPhASIS - An EMbedded Public Attention Stress Identification System EMPhASIS - An EMbedded Public Attention Stress Identification System
EMPhASIS - An EMbedded Public Attention Stress Identification System
 
Luns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural networkLuns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural network
 
BlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAsBlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAs
 
Maeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingMaeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matching
 

Recently uploaded

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 

Recently uploaded (20)

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 

Scalability and Efficiency in Accelerator Sharing on FPGA Devices

  • 1. Scalability and Efficiency in Accelerator Sharing on FPGA Devices Eli Bozorgzadeh Computer Science Department University of California, Irvine, USA Seminar at NECST lab, Politecnico di Milano- October 26,2018
  • 2. Why Accelerators? •  Era of big data and computational intensive applications •  Failure of continuing Dennard scaling •  Getting close to the end of multi-core scaling 2 Big data applications Computational intensive applications
  • 3. Accelerators in Heterogeneous Architectures •  Accelerators •  GPUs •  Power hungry •  SIMD •  FPGAs •  Power efficiency •  Multiple accelerator types •  Application specific accelerators (ASIC) 3
  • 4. Accelerators in Heterogeneous Architectures •  Accelerators •  GPUs •  Power hungry •  SIMD •  FPGAs •  Power efficiency •  Multiple accelerator types •  Application specific accelerators (ASIC) 4
  • 5. FPGA based accelerator platform •  Hardware custom circuits •  Provide the environment for fine grain parallelism and multiple accelerates •  Challenges •  Lack of a high-level programming language •  HLS tools (Vivado HLS, LegUp, Intel HLS compiler) •  Date transfer and Accelerators invocation •  System software support •  Data interface 5 This talk focuses on an efficient and scalable access to accelerators on FPGA
  • 6. FPGA Multi-Accelerator Framework •  Resource management of multiple accelerators on FPGA •  Data transfer 6 Host Memory DMA Processor FPGA PCIe On board DDR Accelerator 1 Accelerator 2 Accelerator 3 Accelerator 4 Core 1 Core 1 Core 1 Input Buffer Output Buffer
  • 7. Related Work: Multi-Accelerator Invocation •  Open-source frameworks •  RIFFA [TRETS 2015] by Jacobsen, et al. •  JetStream [FPL 2016] by Vesper, et al. •  ffLink [Heart 2015] by Chevallerie, et al. •  Fully automated commercial tools •  SDAccel from Xilinx •  OPAE from Intel 7
  • 8. Open-source framework: RIFFA •  RIFFA •  Sending request through PIO using multiple shared registers •  Controlling data transfer process through PIO •  Blocking I/O requests 8 Host Memory DMA Processor FPGA PCIe On board DDR Core 1 Core 2 RIFFA Accelerator Buffer RIFFA Driver Data transfer management Data transfer
  • 9. Commercial tools •  SDAccel •  Command-based •  Data Transfer •  Buffers on DDR •  Not efficient for streaming applications [FCCM 2018] by Ruan et al. Host Memory DMA Processor FPGA PCIe On board DDR Core 1 Core 2 SDAccel Accelerator 1 Buffer3 SDAccel Driver Accelerator 3 Accelerator 2 SDAccel Buffer2Buffer1 queue queue queue queue Invoking accelerator 2 9
  • 10. Our solution for Invoking Multiple accelerators •  Allocating accelerators to the requests •  Current platforms •  User defines the destination accelerator •  Under utilization In the case of more than one accelerator of the same type •  User needs to know more about the hardware design structure Host Memory DMA Processor FPGA PCIe On board DDR Core 1 Core 2 Accelerator 1 Buffer3 Driver Accelerator 3 Buffer2 Buffer1 queue queue Invoking accelerator 2 10 Accelerator 2 Invoking accelerator 2 queue Scheduler Our proposed framework will address this drawback by moving a scheduler to the hardware, and Adding hardware queues per accelerator type Accelerator 2
  • 11. Multi-core accelerator invocation •  Accelerator requests from multiple cores •  Current platform policies •  Rejecting all other requests while a request is being processed •  Conflicts between cores accessing a shared queue on the host side •  Keeping the core locked in the case of a conflict [Figure: Core 1 and Core 2 on the host both invoke accelerator type A; on the FPGA, a scheduler with queues dispatches to two instances of Accelerator A and one Accelerator B, with Buffer1 to Buffer3 on the on-board DDR]
  • 13. Multi-Queue Multi-Accelerator Interface (MQMAI) •  Proposes multiple queues for multi-core processors to manage resource contention in I/O-intensive applications •  Adapts the NVMe data transfer protocol for FPGA accelerator invocation •  Each accelerator on the FPGA can be viewed as a memory block in an SSD for parallel access •  Similar to NVMe, MQMAI bypasses the block layer and instead deploys a multi-queue mechanism •  S. Rezaei, E. Bozorgzadeh, and K. Kim, “Multi-Queue Data Transfer Scheme for FPGA-based Multi-Accelerators”, in ICCD 2018.
  • 14. MQMAI Architecture •  Software stack •  C/C++ library layer •  Device driver layer: •  Submission/completion command •  Submission/completion queue •  Doorbell; interrupt handler •  Hardware stack •  Controller •  Doorbell register management, command management (scheduler and accelerator allocation), data movement controller (scatter-gather), and completion management •  PCIe interface •  Accelerator interface
  • 15. MQMAI Software Stack •  Submission/completion queues •  Circular queues, head/tail registers •  Submission command •  Command ID, core ID, I/O command type, accelerator type •  Completion command •  Command ID, validation bit •  Doorbell: •  The device driver writes the doorbell register to notify the hardware of a new request (PIO mode) •  All transfers go through DMA except the doorbell register write •  Frees up CPU time [Figure: Cores 1 to N in the OS, each with its own SQ/CQ pair in the MqueueMAI driver, connected to the MqueueMAI controller on the FPGA, which serves ACC 1 to ACC M]
  • 16. Hardware layer of MQMAI •  Doorbell Controller •  Command Controller •  Command queue •  Scheduler and accelerator allocation •  Scatter Gather Manager •  Completion Module [Figure: block diagram of the controller around the Xilinx PCI integrated IP: RX/TX MUX, read and write requesters, Doorbell Controller (SQ 1 to N head and tail addresses, command fetch requester), Command Controller (main command queue, command processor, lookup table acting as channel selector), Scatter Gather manager (SG queue, SG processor), channel controller with Rx/Tx Channels 1 to N, and Completion Module with MSI interrupt generator]
  • 17. Timing diagram of a single accelerator call from host to FPGA •  Actors: User Application, MqueueMAI Driver, CPU core, and PC Memory on the host; MqueueMAI Controller, Rx Channel DMA, accelerator (ACC), and Tx Channel on the FPGA •  Call library function •  Make the command and push it into the SQ •  Ring the doorbell •  Request the commands in the SQ •  Receive commands •  Process a command and assign an accelerator •  Request the SG list •  Receive SG list •  Process SG list •  Request data •  Receive data •  Write data to the Rx channel dedicated to the selected accelerator •  Accelerator computation •  Request the SG list (Tx channel) •  Receive SG list •  Process SG list •  Write result data to memory •  Write the completion command into the CQ •  Send an interrupt to the core that sent the command
  • 18. Experimental Setup •  Host side •  Quad-core Intel i5-4590 processor running at 3.3 GHz with Linux •  8 GB memory •  PCI Express Gen3 •  Alpha Data ADM-PCIE-7V3 board with a Xilinx Virtex-7 FPGA
  • 19. Experimental Setup •  void *thread_main () { buffer *A; buffer *B; … while () { /* Start of measuring the delay */ for (i = 0; i < K; i++) /* A: input data, B: output data */ ret = acc_fpga (fpga, acc_type, A, length, B, length); wait_fpga (fpga, acc_id, K); /* End of measuring the delay */ } }
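The slide marks where the delay measurement starts and ends but not how it is taken. One plausible way to turn two timestamps around the loop into an average per-call delay is sketched below. The helpers `elapsed_ns` and `avg_delay_ns` are hypothetical, and the assumption that timestamps come from `clock_gettime(CLOCK_MONOTONIC, ...)` is not stated in the talk.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed between two timestamps, e.g. from clock_gettime(). */
int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Average end-to-end delay per accelerator call over a batch of k calls. */
double avg_delay_ns(int64_t total_ns, int k)
{
    assert(k > 0);
    return (double)total_ns / (double)k;
}
```

In the benchmark loop, the first timestamp would be taken before the K calls to `acc_fpga` and the second after `wait_fpga` returns, so the figure includes command submission, data transfer, computation, and completion.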
  • 20. Experimental Results •  Average end-to-end delay in the multi-thread, single-accelerator case •  Impact of the command-based multi-queue feature [Figure annotation: 18x faster]
  • 21. Experimental Results •  Average end-to-end delay in the multi-thread, multi-accelerator case •  Impact of accelerator allocation management [Figure annotation: 60x faster]
  • 22. Experimental Results •  Resource utilization on a Xilinx Virtex 7 FPGA for 4 I/O channels
  • 23. Summary •  We proposed •  An efficient FPGA-based accelerator invocation framework that scales with •  The number of accelerators on the FPGA •  The number of parallel threads/applications invoking accelerators •  Efficient management and allocation of multiple accelerators •  Requests are assigned to accelerators in hardware by a scheduler •  A command queue for each accelerator type in the hardware controller •  Multi-core access that minimizes the number of conflicts between requests •  A pair of submission and completion queues per core •  Developed end to end, from the software stack (libraries and device driver) to the hardware controller
  • 24. System Services for Reconfigurable Hardware Acceleration in Mobile Devices •  Mobile devices with reconfigurable accelerators •  Multi-core and heterogeneous compute platforms (CPUs, GPUs, FPGAs, and other DSP devices) •  Acceleration Service: an approach to deliver improvement across power, performance, programmability, and portability •  FPGA-based acceleration •  Programmable according to the user application •  High performance due to hardware parallelism •  Power and energy efficient •  T. Ting, E. Bozorgzadeh, and A. Amirisani, “System Services for Reconfigurable Hardware Acceleration in Mobile Devices”, IEEE ReConFig 2018.
  • 25. Acceleration Service Prototype •  Software/hardware co-design •  Integrates the mobile Android software and FPGA design flows •  Acceleration scheduler: SW/HW libraries [Figure: System overview. In user space, Android applications (app code plus acceleration client library) communicate over IPC with the Acceleration Service (acceleration scheduler, software library, hardware library, FPGA driver); in kernel space, the Linux kernel exposes /dev/binder and /dev/uio; hardware accelerators Acc. #1 to #N run on the Zynq programmable SoC]
  • 26. Preliminary Results •  Label: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} •  Target device: Zynq SoC
  • 27. Preliminary Results •  Label: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
  • 28. Conclusions •  Accelerator sharing on FPGA-based platforms is unavoidable •  FPGAs are being deployed in data centers as accelerators for various applications •  There is a need for multi-accelerator ecosystems that enable more efficient and scalable accelerator sharing in multi-accelerator, multi-core systems •  Our proposed Multi-Queue Multi-Accelerator Interface (MQMAI) is a step toward efficient and scalable “shared” access to multiple accelerators •  Our proposed Android-based framework is a step toward an application-level solution that manages accelerators alongside software library usage. An application-level accelerator service can further enhance shared access to multi-accelerator platforms.
  • 29. Acknowledgement •  PhD Students •  Siavash Rezaei •  Hsin-Yu Ting •  Collaborators •  Kanghee Kim, Soongsil Univ., Korea