VEDLIoT took part in the 33rd International Conference on Field-Programmable Logic and Applications (FPL 2023), in Gothenburg, Sweden. René Griessl (UNIBI) presented VEDLIoT and our latest achievements in the Research Projects Event session, giving a presentation entitled "Accelerators for Heterogenous Computing in AIoT".
3. 3
VEDLIoT Hardware Platform
Heterogeneous, modular, scalable microserver system
Supporting the full spectrum of IoT from embedded over the edge towards the cloud
Different technology concepts for improving performance, cost-effectiveness, maintainability, reliability, energy efficiency, and safety
[Figure: VEDLIoT Cognitive IoT Platform, combining x86, GPU, ML-ASIC, ARM v8, GPU SoC, FPGA SoC, RISC-V, and FPGA]
4. 4
RECS Architecture – RECS|BOX
RECS Server Backplane (up to 15 carriers)
Carrier (PCIe Expansion), e.g. GPU accelerator
High-Performance Carrier (up to 3 microservers)
Low-Power Carrier (up to 16 microservers)
High-Performance Microserver (COM Express): x86, ARM v8, FPGA SoC
Low-Power Microserver (Apalis/Jetson): GPU SoC, FPGA SoC, ARM SoC
High-Speed Low-Latency Network (PCIe, High-Speed Serial)
Compute Network (up to 40 GbE)
Management Network (KVM, Monitoring, …)
External connectors: HDMI/USB, iPass+ HD, QSFP+, RJ45
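The capacity limits above (up to 15 carriers per backplane, up to 3 microservers per high-performance carrier, up to 16 per low-power carrier) imply an upper bound on the number of microservers per RECS|Box. The following is a minimal sketch of that arithmetic only. The class and function names are hypothetical, and the model ignores PCIe expansion carriers as well as any power or thermal limits, which the slide does not specify.

```python
from dataclasses import dataclass

# Limits quoted on the slide; all identifiers below are illustrative.
MAX_CARRIERS = 15


@dataclass(frozen=True)
class CarrierType:
    name: str
    max_microservers: int


HIGH_PERFORMANCE = CarrierType("High-Performance Carrier", 3)
LOW_POWER = CarrierType("Low-Power Carrier", 16)


def max_microservers(carriers: list[CarrierType]) -> int:
    """Upper bound on microservers for a given carrier mix in one backplane."""
    if len(carriers) > MAX_CARRIERS:
        raise ValueError(f"backplane holds at most {MAX_CARRIERS} carriers")
    return sum(c.max_microservers for c in carriers)


# A backplane filled only with low-power carriers: 15 * 16 microservers.
all_low_power = max_microservers([LOW_POWER] * 15)

# A mixed configuration: 5 * 3 + 10 * 16 microservers.
mixed = max_microservers([HIGH_PERFORMANCE] * 5 + [LOW_POWER] * 10)
```

Under these assumptions, a fully low-power configuration tops out at 240 microservers and a fully high-performance one at 45.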
5. 5
RECS Architecture – t.RECS
t.RECS Edge Server
Optimized platform for local / edge applications
Provide interfaces for video, camera, and peripheral input (USB)
Combine FPGA and GPU acceleration
Compact dimensions: 1 RU, E-ATX form factor (2 RU / 3 RU for special cases)
Microserver #1 (COM-HPC Client)
Microserver #2 (COM-HPC Server)
Microserver #3 (COM-HPC Client)
Switched PCIe (Host to Host)
Ethernet (up to 10 GbE)
Management Network (KVM, Monitoring, …)
I/O (Camera, Display, Radar/Lidar, Audio)
External interfaces, PCIe expansion
7. 7
Microserver Overview
u.RECS: Xilinx Kria K26; NVIDIA Jetson Orin NX; Hailo-8; SMARC 2.1 (x86/ARM CPUs, FPGAs); Raspberry Pi Compute Module 4
t.RECS: COM-HPC Client (x86); COM-HPC NVIDIA AGX Adapter; COM-HPC Server (FPGA)
RECS|Box: COM Express ARM v8 Server SoC Hi1616; COM Express Xilinx Zynq 7045; COM Express AMD Ryzen V1807B; Jetson TX2 (NVIDIA Tegra X2); COM Express Intel Stratix 10; COM Express Intel Core i7 8th Gen; NVIDIA Jetson Orin NX; COM Express AMD EPYC 3451
8. 8
Reconfigurable DL accelerators
▪ VEDLIoT accelerators support a large variety of reconfigurable architectures
▪ From small embedded FPGAs to large ACAPs
▪ Large design space for FPGA-based accelerators
▪ Dynamic hardware reconfiguration
▪ Adapt to changing requirements at run-time
▪ Change characteristics of DL-accelerator
▪ Trade-off between power and performance, power and accuracy, etc.
▪ Inference and training on FPGA
▪ Supports quantization from int8 to float32
▪ DL and Deep Reinforcement Learning
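To make the precision side of the power/accuracy trade-off concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a generic textbook scheme chosen for illustration, not the quantization flow used by the VEDLIoT accelerators, which the slides do not describe. The function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to int8 codes.

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical float32 weights of one layer.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Small weights such as 0.003 collapse to zero at this step size, which is the kind of accuracy loss that has to be weighed against the lower power and memory cost of int8.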
9. 9
Peak Performance of DL Accelerators
Peak performance values of specialized accelerators, provided by the vendors (precisions varying from INT8 to FP32)
Average efficiency at 1000 GOPS/W
[Figure: performance [GOPS] (1 to 10,000,000) versus power [Watt] (0.01 to 1000), both on logarithmic axes; device classes: ASIC, GPU, FPGA; regions: Ultra Low Power, Low Power, High Performance]
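The 1000 GOPS/W average can be read as a reference line in the performance-versus-power plane: efficiency is simply peak performance divided by power. The sketch below works through that arithmetic. The two devices and their numbers are hypothetical placeholders, not values taken from the chart.

```python
def efficiency_gops_per_watt(performance_gops: float, power_watt: float) -> float:
    """Energy efficiency as peak performance divided by power draw."""
    if power_watt <= 0:
        raise ValueError("power must be positive")
    return performance_gops / power_watt


def performance_on_reference_line(power_watt: float, efficiency: float = 1000.0) -> float:
    """Performance a device would need at a given power to hit the reference efficiency."""
    return efficiency * power_watt


# Hypothetical devices: (peak performance in GOPS, power in watts).
devices = {
    "device_a": (5_000.0, 2.0),
    "device_b": (200_000.0, 400.0),
}

efficiencies = {
    name: efficiency_gops_per_watt(gops, watt) for name, (gops, watt) in devices.items()
}

# Devices above the 1000 GOPS/W line are more efficient than the average.
above_average = sorted(name for name, eff in efficiencies.items() if eff > 1000.0)
```

On the chart's axes, the reference line runs from 10 GOPS at 0.01 W to 1,000,000 GOPS at 1000 W.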
11. 11
YOLOv4 accelerator performance (2)
The performance of YOLOv4 has been evaluated on different hardware platforms
Performance measurements for other networks (ResNet, EfficientNet) are available as well
• ASICs (Hailo-8, Versal AI cores) achieve the highest energy efficiency (INT8 only)
• Embedded GPUs (Orin, Xavier) show good efficiency in all precisions
• GPGPUs (GTX1660, V100, A100) are optimized for performance
12. 12
Summary
Efficient Heterogeneous Computing: The VEDLIoT hardware platform
RECS combines diverse compute architectures, boosting communication
and energy efficiency.
Accelerator Integration: The VEDLIoT project harnesses RECS for
accelerator benchmarking and hardware/software integration.
Seamless Edge to Cloud Integration: RECS offers a unified approach
across the computing spectrum, enhancing interoperability.
15. 15
DL accelerator co-design
"FiBHA: Fixed Budget Hybrid CNN Accelerator", Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso, IEEE 34th International Symposium on Computer Architecture and High-Performance Computing (SBAC-PAD 2022), Bordeaux, France, November 2–5, 2022
Monolithic design
● One engine computes
all the core layers
● E.g. TPU
SEML
● One engine computes all
layers of the same type
● PW engine, DW engine
SESL
● One engine per layer
● E.g. FINN
FiBHA
● SESL + SEML
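The three design styles differ in how layers are mapped to compute engines. The sketch below illustrates one way a hybrid mapping could look: the first few layers each get a dedicated engine (SESL), and the remaining layers share one engine per layer type (SEML). The split point, the layer list, and all names are assumptions made for illustration; the slide only states that FiBHA combines SESL and SEML.

```python
def assign_engines(layer_types, num_sesl_layers):
    """Map each layer index to an engine name.

    Layers before `num_sesl_layers` get a dedicated engine (SESL style);
    later layers share one engine per layer type (SEML style).
    """
    mapping = {}
    for index, layer_type in enumerate(layer_types):
        if index < num_sesl_layers:
            mapping[index] = f"sesl_engine_{index}"
        else:
            mapping[index] = f"seml_{layer_type}_engine"
    return mapping


# Hypothetical CNN: a standard convolution followed by
# depthwise (dw) and pointwise (pw) layers.
layers = ["conv", "dw", "pw", "dw", "pw", "dw", "pw"]

hybrid = assign_engines(layers, num_sesl_layers=3)
engine_count = len(set(hybrid.values()))

# The two pure styles fall out as special cases of the split point.
pure_seml = assign_engines(layers, num_sesl_layers=0)
pure_sesl = assign_engines(layers, num_sesl_layers=len(layers))
```

Moving the split point trades the number of engines, and with it the hardware budget, against how closely each engine can be tailored to its layers.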
16. 16
VEDLIoT's Deep Learning Toolchain
Enabling the rapid convergence of fast-paced innovation in hardware and software
Frameworks &
Exchange Formats
Optimization
Engine
Compilers &
Runtime APIs
Heterogeneous
Hardware
Platforms
17. 17
Simulation platform for ML accelerators
▪ RISC-V SoCs and Custom Function Units
▪ Improve test and verification
▪ Co-simulate Verilog blocks
▪ Used in Google's CFU Playground
▪ Continuous integration based on GitLab and Google Cloud Platform
Safety and Robustness
Robustness verification on DL models
▪ Tuning hyperparameters
More in the hands-on session
18. 18
▪ Common environment for running distributed applications
▪ WebAssembly runtime + Trusted Execution Environment
▪ Security for edge (and cloud) devices
▪ Advances on attestation
▪ Better support for edge devices
▪ Distributed (Byzantine fault-tolerant) attestation and configuration service
▪ Secure IoT Gateway
Security
19. 19
A compositional architecture framework for AIoT
Knowledge creation (e.g. definition of safety goals).
Concept design (e.g. introduction of redundancy to fulfil safety goals).
Final design (e.g. assigning functions to independent processors to guarantee redundancy).
Monitoring concept definition (e.g. monitoring fulfilment of safety goals at run-time).
Problem Space
Solution Space
Editor's Notes
Focus on the Xilinx DPU (Deep Learning Processing Unit) and then on the different DPU configurations
Different DPU configurations
DPU for UltraScale (Zynq, Virtex, Kintex) in fabric
DPU for HBM Alveo Cards
DPU in Hardware for Versal
Different clusters are forming
Extensive Benchmarks on 15 devices generating 59 measurements
Self-measured efficiency data using the RECS system
Values are lower because they reflect real-world performance, not peak performance
SEML: Single Engine Multi Layer
SESL: Single Engine Single Layer
Matching the hardware to the accelerator
Net distributes compute: single layer in a single engine
Consider memory: what does it need to access, does it need cache?
Model optimization through pruning or precision changes, to tailor the model to the accelerator used and, for example, reduce context switching
Emulation of the robustness models in Renode
Using the WebAssembly runtime to get an encapsulated deployment on different systems
Integration of remote attestation in WebAssembly
Available on ARM, Jetson, everything that supports OP-TEE
Requirements engineering framework (RAF) for ML/AI systems and solutions, giving guidance on how to design a system so that it is secure, efficient, and works well
Challenging, as ML cannot be proven formally
Important to make sure ML/AI solutions are safe to use, for example in autonomous driving
AIoT: Artificial Intelligence of Things