“Making Edge AI Inference Programming Easier and Flexible,” a Presentation from Texas Instruments

1. © 2020 Texas Instruments
Making Edge AI Inference
Programming Easier and Flexible
Manisha Agrawal
Product Marketing Engineer
September 2020
2. © 2020 Texas Instruments
Agenda
• Challenges with edge inference deployment
• Open source inference framework advancement
• Easier and flexible programming on TI Jacinto™ 7 processors
• Programming experience with Jacinto 7 processors compared to desktop
3. © 2020 Texas Instruments
Deploying a Deep Learning Model for Edge Inference (1/2)
[Diagram: training framework → trained model → deploy → edge inference]
4. © 2020 Texas Instruments
Deploying a Deep Learning Model for Edge Inference (2/2)
[Diagram: training framework → trained model → proprietary inference framework (optimizer + runtime engine: API | interpreter | scheduler, kernel library) → embedded hardware. Vendors A–E each ship their own custom tools, which the developer must learn per vendor.]
5. © 2020 Texas Instruments
Edge Inference Programming Challenges
[Diagram: training framework → trained model → inference framework (software: optimizer / compiler, runtime engine: API | interpreter | scheduler, kernel library) → embedded hardware]
Challenges:
• Tools not easy to use
• Unsupported operators
• Accuracy goals not met
• Performance goals not met
• Power goals not met
6. © 2020 Texas Instruments
Open Source Inference Framework Advancement
[Diagram: training framework → trained model → open source inference framework (compiler / optimizer, runtime engine: API | interpreter | scheduler, kernel library) → embedded hardware, spanning both general purpose and custom hardware; ArmNN is one example.]
7. © 2020 Texas Instruments
Promising Open Source Frameworks Meeting the Majority of Customer Needs
Training framework / graph format → optimizer / compiler → RunTime engine:
• TensorFlow Lite format → TFLite RunTime
• ONNX format → ONNX RunTime
• TVM / Amazon SageMaker Neo compiler → Neo-AI-DLR
8. © 2020 Texas Instruments
Open Source TFLite RunTime
[Diagram: model artifacts → TFLite RunTime (open source RunTime API, TFLite kernel library) → general purpose hardware]
• Supporting all inference operators on CPU / GPU (see the sketch below)
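For context, a minimal sketch of what CPU inference through the open source TFLite RunTime looks like in Python; the model path and the dummy input are placeholders, not part of the original slide:

import numpy as np
import tflite_runtime.interpreter as tflite

# Load a converted .tflite model (placeholder path).
interpreter = tflite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy tensor of the model's expected shape and dtype.
x = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
result = interpreter.get_tensor(out["index"])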
9. © 2020 Texas Instruments
Open Source ONNX RunTime
[Diagram: model artifacts → ONNX RunTime (open source RunTime API, ONNX kernel library) → general purpose hardware]
• Supporting all inference operators on CPU / GPU (see the sketch below)
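Analogously, a minimal sketch of CPU inference with the open source ONNX RunTime Python API; the model path and input shape are placeholders:

import numpy as np
import onnxruntime as ort

# Open a session on the default CPU execution provider.
session = ort.InferenceSession("model.onnx",
                               providers=["CPUExecutionProvider"])
inp = session.get_inputs()[0]

# Replace symbolic dimensions (e.g. batch) with 1 for a dummy input.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.random_sample(shape).astype(np.float32)

outputs = session.run(None, {inp.name: x})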
10. © 2020 Texas Instruments
Open Source Inference Framework with Hooks for Specialized Hardware
[Diagram: model artifacts → RunTime (open source RunTime API; kernel library / generated code for general purpose hardware, plus a backend with its own kernel library for specialized hardware) → embedded SoC]
Compiler & RunTime:
• Supporting all inference operators on CPU / GPU
• Backend for specialized hardware (see the sketch below)
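ONNX RunTime, for example, exposes this hook as an ordered list of "execution providers": the runtime offloads the subgraphs a backend supports and falls back to the CPU kernels for everything else. A sketch of the idea, where the vendor provider name is purely illustrative:

import onnxruntime as ort

# An accelerator backend registers itself as an execution provider;
# "VendorExecutionProvider" is an illustrative name, not a real one.
preferred = ["VendorExecutionProvider", "CPUExecutionProvider"]

# Keep only the providers present in this build, so the session always
# falls back to CPU kernels when the backend is absent or unsupported.
available = ort.get_available_providers()
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]

session = ort.InferenceSession("model.onnx", providers=providers)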
11. © 2020 Texas Instruments
TI’s First Jacinto™ 7 SoC for Edge Inference
Accelerating key functions lowers power:
• DSP for computer vision
• Vision processing
• Video, graphics
• Deep learning
Industry's most efficient DL architecture:
• Enables passively-cooled designs
• 90% utilization of the deep learning accelerator thanks to a smart memory system
Automotive quality-ready process technology:
• Power reduction is achieved through smart architecture, not process
https://www.ti.com/product/TDA4VM
[TDA4VM block diagram: C7x DSP with MMA as the deep learning accelerator (32K/48K L1, 512KB L2); 2x C66x DSP with MMA (32K/32K L1, 288KB L2 each); 2x Arm Cortex-A72 (48K/32K L1 each, 1MB shared L2); Rogue 8XE GPU; vision acceleration (vision ISP w/ LDC, dense optical flow, stereo disparity estimation); video encode/decode acceleration; audio acceleration; display sub-system (1x DP/eDP, 1x DSI); capture sub-system (2x CSI2 x4 DL); Ethernet switch (up to 8 ports); lockstep Arm Cortex-R5F pairs in the main domain (16K/16K L1, 64KB RAM each, ASIL B / SIL 2) plus a safety MCU island with a lockstep Cortex-R5F pair (32K/32K L1, 64KB RAM each, ASIL D / SIL 3); DMSC device management & security controller; security acceleration (crypto: AES, 3DES, SHA, PKA, RNG); hardware diagnostics (CRC, RTI, ESM, DCC, BIST); 8 MB L3 RAM w/ ECC and coherency; 1MB + 0.5MB on-chip SRAM; 32b LPDDR4/4x-3732 + ECC; storage (1x UFS 2.x, 1x SD 3.0, 1x eMMC 5.x, 2x USB); connectivity (4x PCIe, 2x RGMII/RMII, 8x McSPI, 10x UART, 12x McASP, 7x I2C, I3C, SDIO, ADC, GPIO); IPC, IOMMU, SMMU, UDMA, timers, WDT; power isolation; high speed interconnect; 16nm FF process]
12. © 2020 Texas Instruments
TI Deep Learning (TIDL) Inference Framework
NRE-free, royalty-free tools enable high-performance, fixed-point inference on TI processors.
TIDL Optimizer:
• Post-training quantization for 8-bit and 16-bit
• Range calibration
• Hardware-independent optimizations
• Hardware-specific optimizations
TIDL RunTime (on the TI SoC):
• Runtime API
• Deep learning library
(A generic quantization sketch follows below.)
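TIDL's import tooling is configuration-driven, but the post-training quantization idea it implements is the same one the stock TensorFlow Lite converter exposes. A generic sketch (not TIDL's own API) in which a representative dataset drives the range calibration; the saved-model path and input shape are placeholders:

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples drive the activation-range estimation;
    # shape (1, 224, 224, 3) is a placeholder for the model's input.
    for _ in range(100):
        yield [np.random.random_sample((1, 224, 224, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
quantized_model = converter.convert()  # 8-bit post-training quantization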
13. © 2020 Texas Instruments
TIDL Accelerates All the Operators You Rely On
Refer to the latest Processor SDK user guide for the complete list of accelerated operators and tested models:
https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tidl_j7_01_02_00_09/ti_dl/docs/user_guide_html/md_tidl_layers_info.html
TIDL features:
• Accelerates all operators commonly used by CNN vision models on our deep learning accelerator cores
• Out-of-the-box support for 35+ pre-trained CNN models
• The list continues to grow
Popular supported operators include:
• Convolution
• Pooling
• Element-wise
• Inner product
• Softmax
• Bias add
• Concatenate
• Scale
• Batch normalization
• Resize
• Argmax
• Slice
• Crop
• Flatten
• Shuffle channel
• Detection output
• Deconvolution / transpose convolution
14. © 2020 Texas Instruments
Texas Instruments Adopting Open Source Frameworks with TI Deep Learning (TIDL) Integration
Open source adoption:
• Integrating the TIDL RunTime into the open source RunTime engine
[Diagram: model artifacts → Neo-AI-DLR RunTime (open source RunTime API; kernel library / generated code on the Cortex-A72, TIDL RunTime with TIDL library on the deep learning accelerator) → Jacinto 7 processor]
15. © 2020 Texas Instruments
Texas Instruments Adopting Open Source Frameworks with TI Deep Learning (TIDL) Integration
Open source adoption: TVM compiler
• Integrating the TI optimizer into the open source TVM compiler tool (see the sketch below)
[Diagram: model artifacts → open source TVM compiler (high level optimizations | graph partition → TIDL backend with TIDL optimizer tools for low level optimizations → code generation) → general purpose hardware + TI DL hardware]
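For reference, the upstream TVM compile flow this integration plugs into looks roughly as follows; the ONNX input, shapes and target triple are placeholders, and the TIDL graph partitioning itself happens through TVM's bring-your-own-codegen hooks rather than anything shown here:

import onnx
import tvm
from tvm import relay

# Import a trained model into Relay (path and input shape are placeholders).
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model,
                                       shape={"input": (1, 3, 224, 224)})

# Compile for a 64-bit Arm Linux target such as the Cortex-A72.
target = "llvm -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
lib.export_library("deploy_lib.so")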
16. © 2020 Texas Instruments
TIDL RunTime: Now Accelerate All Your Models with Open Source APIs
• Inference latency under 5 ms for all popular classification models
• Minimal accuracy loss with TI's post-training quantization and quantization-aware training tools
[Diagram: TensorFlow Lite models → TFLite RunTime; ONNX models → ONNX RunTime; TVM / Amazon SageMaker Neo compiler → Neo-AI-DLR. The user application (Python / C / C++) calls the RunTime's API | interpreter | scheduler on the Jacinto 7 processor: Linux runs on 2x Arm Cortex-A72 CPUs, with inference dispatched over IPC to the deep learning accelerator, a C7x DSP with MMA*.]
*MMA: Matrix Multiplication Accelerator (Tensor Processing Unit)
Run any model from open source inference frameworks on TI processors!
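As an example of the third path, a minimal sketch of running compiled artifacts through the Neo-AI-DLR Python API; the artifacts directory, input name and shape are placeholders:

import numpy as np
from dlr import DLRModel

# Directory produced by the TVM / SageMaker Neo compile step (placeholder).
model = DLRModel("artifacts_dir", dev_type="cpu")

x = np.random.random_sample((1, 3, 224, 224)).astype(np.float32)
outputs = model.run({"input": x})  # the input tensor name is model-specific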
17. © 2020 Texas Instruments
Seamless Migration from PC Validation to Inference Deployment
Your DL programming experience on Jacinto 7 processors is the same as desktop computer programming. This includes:
• Open source Linux-callable APIs in Python / C / C++ that are identical between the PC and the target board
• Jupyter Notebook examples
Download directly from ti.com:
http://www.ti.com/tool/PROCESSOR-SDK-DRA8X-TDA4X
[Diagram: user application (Python / C / C++) → API | interpreter | scheduler (TFLite / ONNX-RT / Neo-AI-DLR) → Jacinto 7 processor, Linux OS on 2x Arm Cortex-A72 CPUs]
18. © 2020 Texas Instruments
Programming Example: Amazon SageMaker Neo & Neo-AI-DLR
Model compilation (SageMaker Neo console):
• Provide model information
• Select TDA4VM as the target device
• Provide the output location
• Compile the model
RunTime (on TDA4VM):
• Create the RunTime
• Run inference
• Get the output
(A scripted version of the compile step is sketched below.)
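The same console steps can be scripted with boto3's SageMaker client. A hedged sketch: the job name, role ARN and S3 paths are placeholders, and the exact TargetDevice identifier for TDA4VM should be verified against the Neo documentation:

import boto3

sm = boto3.client("sagemaker")
sm.create_compilation_job(
    CompilationJobName="tda4vm-demo",                      # placeholder
    RoleArn="arn:aws:iam::123456789012:role/NeoRole",      # placeholder
    InputConfig={
        "S3Uri": "s3://my-bucket/model.tar.gz",            # placeholder
        "DataInputConfig": '{"input": [1, 3, 224, 224]}',
        "Framework": "TFLITE",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",    # placeholder
        "TargetDevice": "jacinto_tda4vm",  # verify identifier in Neo docs
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)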
19. © 2020 Texas Instruments
Programming Example: TensorFlow Lite
• Init time: hook the TIDL backend into the RunTime (see the sketch below)
• Create the RunTime
• Run inference
• Get the output
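A hedged sketch of what that init-time hook looks like with TFLite's external-delegate mechanism; the delegate library name and its options follow TI SDK conventions and may differ between SDK versions:

import tflite_runtime.interpreter as tflite

# Load TI's TIDL delegate (library name/options are SDK-version dependent).
tidl_delegate = tflite.load_delegate(
    "libtidl_tfl_delegate.so",
    {"artifacts_folder": "model_artifacts"})

# Hooking the delegate at init time routes supported subgraphs to the
# C7x/MMA accelerator; unsupported operators stay on the Cortex-A72.
interpreter = tflite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[tidl_delegate])
interpreter.allocate_tensors()
# ...then set inputs, invoke(), and read outputs as with the stock runtime.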
20. © 2020 Texas Instruments
Deep Learning System Performance on Jacinto 7 TDA4VM: Example Demo
Five simultaneous deep learning algorithms on 3x 1MP cameras, each at ~16 fps:
• Parking spot detection
• Vehicle detection
• Semantic segmentation
• Motion segmentation
• Depth estimation
[Demo screenshot: per-camera outputs for vehicle detection, parking spot detection, semantic segmentation, dense optical flow, depth estimation and motion segmentation across Cameras 1–3]
Resource loading: A72: 6% | C7x+MMA: 94% | DDR BW: 26%
Inference resolution: 3x768x384
21. © 2020 Texas Instruments
Key Takeaways: Deep Learning Edge Inference at TI
• Easier to use
  • Open source Linux-callable RunTime APIs supported to program the SoC
  • Embedded development environment is the same as a desktop computer environment
• More flexible
  • Supports TFLite, ONNX-RT or TVM/Neo-AI-DLR
  • Supports compilation at the edge or in the cloud with Amazon SageMaker Neo
• Wide model coverage
  • All TFLite, ONNX, TVM and SageMaker Neo models are supported on TI SoCs
• High compute performance, high throughput, low latency and low power consumption
  • Accelerates models on TI's C7x+MMA deep learning accelerator to provide the best combination of system power, deep learning performance and latency at the edge
Jacinto 7 processor silicon is available today!
22. © 2020 Texas Instruments
Explore Deep Learning With TI Today
• Full development: TDA4 EVM
  http://www.ti.com/tool/TDA4VMXEVM
• Turn-key designs: automotive version of TDA4V Mid
  http://www.ti.com/tool/D3-3P-TDAX-DK
• Software development kits: TI Processor SDK – seamlessly reuse and migrate Linux, Linux-RT and TI-RTOS software across TI processors
  http://www.ti.com/tool/PROCESSOR-SDK-DRA8X-TDA4X
• Support: https://e2e.ti.com
Open source RunTime engines with TIDL acceleration are packaged as part of TI's NRE-free, royalty-free Processor SDK!