“Making Edge AI Inference Programming Easier and Flexible,” a Presentation from Texas Instruments

1. © 2020 Texas Instruments
Making Edge AI Inference
Programming Easier and Flexible
Manisha Agrawal
Product Marketing Engineer
September 2020
2. © 2020 Texas Instruments
Agenda
• Challenges with edge inference deployment
• Open source inference framework advancement
• Easier and flexible programming on TI Jacinto™ 7 processors
• Programming experience with Jacinto 7 processors compared to desktop
3. © 2020 Texas Instruments
Deploying a Deep Learning Model for Edge Inference (1/2)
[Diagram: training framework → trained model → deploy → edge inference]
4. © 2020 Texas Instruments
Deploying a Deep Learning Model for Edge Inference (2/2)
[Diagram: training framework → trained model → proprietary inference framework (optimizer + runtime engine: API | interpreter | scheduler, kernel library) → embedded hardware. Vendors A–E each ship their own custom tools, which the developer must learn per vendor.]
5. © 2020 Texas Instruments
Edge Inference Programming Challenges
[Diagram: training framework → trained model → inference framework (software: optimizer / compiler, runtime engine: API | interpreter | scheduler, kernel library) → embedded hardware]
Challenges:
• Tools not easy to use
• Unsupported operators
• Accuracy goals not met
• Performance goals not met
• Power goals not met
6. © 2020 Texas Instruments
Open Source Inference Framework Advancement
[Diagram: training framework → trained model → open source inference framework (compiler / optimizer, runtime engine: API | interpreter | scheduler, kernel library) → embedded hardware, spanning both general purpose and custom hardware; ArmNN is one example.]
7. © 2020 Texas Instruments
Promising Open Source Frameworks Meeting the Majority of Customer Needs
Training framework / graph format → optimizer / compiler → RunTime engine:
• TensorFlow Lite format → TFLite RunTime
• ONNX format → ONNX RunTime
• TVM / Amazon SageMaker Neo compiler → Neo-AI-DLR
8. © 2020 Texas Instruments
Open Source TFLite RunTime
[Diagram: model artifacts → TFLite RunTime (open source RunTime API, TFLite kernel library) → general purpose hardware]
• Supporting all inference operators on CPU / GPU (see the sketch below)
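For context, a minimal sketch of what CPU inference through the open source TFLite RunTime looks like in Python; the model path and the dummy input are placeholders, not part of the original slide:

import numpy as np
import tflite_runtime.interpreter as tflite

# Load a converted .tflite model (placeholder path).
interpreter = tflite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy tensor of the model's expected shape and dtype.
x = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
result = interpreter.get_tensor(out["index"])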
9. © 2020 Texas Instruments
Open Source ONNX RunTime
[Diagram: model artifacts → ONNX RunTime (open source RunTime API, ONNX kernel library) → general purpose hardware]
• Supporting all inference operators on CPU / GPU (see the sketch below)
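Analogously, a minimal sketch of CPU inference with the open source ONNX RunTime Python API; the model path and input shape are placeholders:

import numpy as np
import onnxruntime as ort

# Open a session on the default CPU execution provider.
session = ort.InferenceSession("model.onnx",
                               providers=["CPUExecutionProvider"])
inp = session.get_inputs()[0]

# Replace symbolic dimensions (e.g. batch) with 1 for a dummy input.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.random_sample(shape).astype(np.float32)

outputs = session.run(None, {inp.name: x})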
10. © 2020 Texas Instruments
Open Source Inference Framework with Hooks for Specialized Hardware
[Diagram: model artifacts → RunTime (open source RunTime API; kernel library / generated code for general purpose hardware, plus a backend with its own kernel library for specialized hardware) → embedded SoC]
Compiler & RunTime:
• Supporting all inference operators on CPU / GPU
• Backend for specialized hardware (see the sketch below)
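ONNX RunTime, for example, exposes this hook as an ordered list of "execution providers": the runtime offloads the subgraphs a backend supports and falls back to the CPU kernels for everything else. A sketch of the idea, where the vendor provider name is purely illustrative:

import onnxruntime as ort

# An accelerator backend registers itself as an execution provider;
# "VendorExecutionProvider" is an illustrative name, not a real one.
preferred = ["VendorExecutionProvider", "CPUExecutionProvider"]

# Keep only the providers present in this build, so the session always
# falls back to CPU kernels when the backend is absent or unsupported.
available = ort.get_available_providers()
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]

session = ort.InferenceSession("model.onnx", providers=providers)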
11. © 2020 Texas Instruments
TI’s First Jacinto™ 7 SoC for Edge Inference
Accelerating key functions lowers power:
• DSP for computer vision
• Vision processing
• Video, graphics
• Deep learning
Industry's most efficient DL architecture:
• Enables passively-cooled designs
• 90% utilization of the deep learning accelerator thanks to a smart memory system
Automotive quality-ready process technology:
• Power reduction is achieved through smart architecture, not process
https://www.ti.com/product/TDA4VM
[TDA4VM block diagram: C7x DSP with MMA as the deep learning accelerator (32K/48K L1, 512KB L2); 2x C66x DSP with MMA (32K/32K L1, 288KB L2 each); 2x Arm Cortex-A72 (48K/32K L1 each, 1MB shared L2); Rogue 8XE GPU; vision acceleration (vision ISP w/ LDC, dense optical flow, stereo disparity estimation); video encode/decode acceleration; audio acceleration; display sub-system (1x DP/eDP, 1x DSI); capture sub-system (2x CSI2 x4 DL); Ethernet switch (up to 8 ports); lockstep Arm Cortex-R5F pairs in the main domain (16K/16K L1, 64KB RAM each, ASIL B / SIL 2) plus a safety MCU island with a lockstep Cortex-R5F pair (32K/32K L1, 64KB RAM each, ASIL D / SIL 3); DMSC device management & security controller; security acceleration (crypto: AES, 3DES, SHA, PKA, RNG); hardware diagnostics (CRC, RTI, ESM, DCC, BIST); 8 MB L3 RAM w/ ECC and coherency; 1MB + 0.5MB on-chip SRAM; 32b LPDDR4/4x-3732 + ECC; storage (1x UFS 2.x, 1x SD 3.0, 1x eMMC 5.x, 2x USB); connectivity (4x PCIe, 2x RGMII/RMII, 8x McSPI, 10x UART, 12x McASP, 7x I2C, I3C, SDIO, ADC, GPIO); IPC, IOMMU, SMMU, UDMA, timers, WDT; power isolation; high speed interconnect; 16nm FF process]
12. © 2020 Texas Instruments
TI Deep Learning (TIDL) Inference Framework
NRE-free, royalty-free tools enable high-performance, fixed-point inference on TI processors.
TIDL Optimizer:
• Post-training quantization for 8-bit and 16-bit
• Range calibration
• Hardware-independent optimizations
• Hardware-specific optimizations
TIDL RunTime (on the TI SoC):
• Runtime API
• Deep learning library
(A generic quantization sketch follows below.)
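TIDL's import tooling is configuration-driven, but the post-training quantization idea it implements is the same one the stock TensorFlow Lite converter exposes. A generic sketch (not TIDL's own API) in which a representative dataset drives the range calibration; the saved-model path and input shape are placeholders:

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples drive the activation-range estimation;
    # shape (1, 224, 224, 3) is a placeholder for the model's input.
    for _ in range(100):
        yield [np.random.random_sample((1, 224, 224, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
quantized_model = converter.convert()  # 8-bit post-training quantization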
13. © 2020 Texas Instruments
TIDL Accelerates All the Operators You Rely On
Refer to the latest Processor SDK user guide for the complete list of accelerated operators and tested models:
https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tidl_j7_01_02_00_09/ti_dl/docs/user_guide_html/md_tidl_layers_info.html
TIDL features:
• Accelerates all operators commonly used by CNN vision models on our deep learning accelerator cores
• Out-of-the-box support for 35+ pre-trained CNN models
• The list continues to grow
Popular supported operators include:
• Convolution
• Pooling
• Element-wise
• Inner product
• Softmax
• Bias add
• Concatenate
• Scale
• Batch normalization
• Resize
• Argmax
• Slice
• Crop
• Flatten
• Shuffle channel
• Detection output
• Deconvolution / transpose convolution
14. © 2020 Texas Instruments
Texas Instruments Adopting Open Source Frameworks with TI Deep Learning (TIDL) Integration
Open source adoption:
• Integrating the TIDL RunTime into the open source RunTime engine
[Diagram: model artifacts → Neo-AI-DLR RunTime (open source RunTime API; kernel library / generated code on the Cortex-A72, TIDL RunTime with TIDL library on the deep learning accelerator) → Jacinto 7 processor]
15. © 2020 Texas Instruments
Texas Instruments Adopting Open Source Frameworks with TI Deep Learning (TIDL) Integration
Open source adoption: TVM compiler
• Integrating the TI optimizer into the open source TVM compiler tool (see the sketch below)
[Diagram: model artifacts → open source TVM compiler (high level optimizations | graph partition → TIDL backend with TIDL optimizer tools for low level optimizations → code generation) → general purpose hardware + TI DL hardware]
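For reference, the upstream TVM compile flow this integration plugs into looks roughly as follows; the ONNX input, shapes and target triple are placeholders, and the TIDL graph partitioning itself happens through TVM's bring-your-own-codegen hooks rather than anything shown here:

import onnx
import tvm
from tvm import relay

# Import a trained model into Relay (path and input shape are placeholders).
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model,
                                       shape={"input": (1, 3, 224, 224)})

# Compile for a 64-bit Arm Linux target such as the Cortex-A72.
target = "llvm -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
lib.export_library("deploy_lib.so")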
16. © 2020 Texas Instruments
TIDL RunTime: Now Accelerate All Your Models with Open Source APIs
• Inference latency under 5 ms for all popular classification models
• Minimal accuracy loss with TI's post-training quantization and quantization-aware training tools
[Diagram: TensorFlow Lite models → TFLite RunTime; ONNX models → ONNX RunTime; TVM / Amazon SageMaker Neo compiler → Neo-AI-DLR. The user application (Python / C / C++) calls the RunTime's API | interpreter | scheduler on the Jacinto 7 processor: Linux runs on 2x Arm Cortex-A72 CPUs, with inference dispatched over IPC to the deep learning accelerator, a C7x DSP with MMA*.]
*MMA: Matrix Multiplication Accelerator (Tensor Processing Unit)
Run any model from open source inference frameworks on TI processors!
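As an example of the third path, a minimal sketch of running compiled artifacts through the Neo-AI-DLR Python API; the artifacts directory, input name and shape are placeholders:

import numpy as np
from dlr import DLRModel

# Directory produced by the TVM / SageMaker Neo compile step (placeholder).
model = DLRModel("artifacts_dir", dev_type="cpu")

x = np.random.random_sample((1, 3, 224, 224)).astype(np.float32)
outputs = model.run({"input": x})  # the input tensor name is model-specific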
17. © 2020 Texas Instruments
Seamless Migration from PC Validation to Inference Deployment
Your DL programming experience on Jacinto 7 processors is the same as desktop computer programming. This includes:
• Open source Linux-callable APIs in Python / C / C++ that are identical between the PC and the target board
• Jupyter Notebook examples
Download directly from ti.com:
http://www.ti.com/tool/PROCESSOR-SDK-DRA8X-TDA4X
[Diagram: user application (Python / C / C++) → API | interpreter | scheduler (TFLite / ONNX-RT / Neo-AI-DLR) → Jacinto 7 processor, Linux OS on 2x Arm Cortex-A72 CPUs]
18. © 2020 Texas Instruments
Programming Example: Amazon SageMaker Neo & Neo-AI-DLR
Model compilation (SageMaker Neo console):
• Provide model information
• Select TDA4VM as the target device
• Provide the output location
• Compile the model
RunTime (on TDA4VM):
• Create the RunTime
• Run inference
• Get the output
(A scripted version of the compile step is sketched below.)
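The same console steps can be scripted with boto3's SageMaker client. A hedged sketch: the job name, role ARN and S3 paths are placeholders, and the exact TargetDevice identifier for TDA4VM should be verified against the Neo documentation:

import boto3

sm = boto3.client("sagemaker")
sm.create_compilation_job(
    CompilationJobName="tda4vm-demo",                      # placeholder
    RoleArn="arn:aws:iam::123456789012:role/NeoRole",      # placeholder
    InputConfig={
        "S3Uri": "s3://my-bucket/model.tar.gz",            # placeholder
        "DataInputConfig": '{"input": [1, 3, 224, 224]}',
        "Framework": "TFLITE",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",    # placeholder
        "TargetDevice": "jacinto_tda4vm",  # verify identifier in Neo docs
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)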
19. © 2020 Texas Instruments
Programming Example: TensorFlow Lite
• Init time: hook the TIDL backend into the RunTime (see the sketch below)
• Create the RunTime
• Run inference
• Get the output
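A hedged sketch of what that init-time hook looks like with TFLite's external-delegate mechanism; the delegate library name and its options follow TI SDK conventions and may differ between SDK versions:

import tflite_runtime.interpreter as tflite

# Load TI's TIDL delegate (library name/options are SDK-version dependent).
tidl_delegate = tflite.load_delegate(
    "libtidl_tfl_delegate.so",
    {"artifacts_folder": "model_artifacts"})

# Hooking the delegate at init time routes supported subgraphs to the
# C7x/MMA accelerator; unsupported operators stay on the Cortex-A72.
interpreter = tflite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[tidl_delegate])
interpreter.allocate_tensors()
# ...then set inputs, invoke(), and read outputs as with the stock runtime.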
20. © 2020 Texas Instruments
Deep Learning System Performance on Jacinto 7 TDA4VM: Example Demo
Five simultaneous deep learning algorithms on 3x 1MP cameras, each at ~16 fps:
• Parking spot detection
• Vehicle detection
• Semantic segmentation
• Motion segmentation
• Depth estimation
[Demo screenshot: per-camera outputs for vehicle detection, parking spot detection, semantic segmentation, dense optical flow, depth estimation and motion segmentation across Cameras 1–3]
Resource loading: A72: 6% | C7x+MMA: 94% | DDR BW: 26%
Inference resolution: 3x768x384
21. © 2020 Texas Instruments
Key Takeaways: Deep Learning Edge Inference at TI
• Easier to use
  • Open source Linux-callable RunTime APIs supported to program the SoC
  • Embedded development environment is the same as a desktop computer environment
• More flexible
  • Supports TFLite, ONNX-RT or TVM/Neo-AI-DLR
  • Supports compilation at the edge or in the cloud with Amazon SageMaker Neo
• Wide model coverage
  • All TFLite, ONNX, TVM and SageMaker Neo models are supported on TI SoCs
• High compute performance, high throughput, low latency and low power consumption
  • Accelerates models on TI's C7x+MMA deep learning accelerator to provide the best combination of system power, deep learning performance and latency at the edge
Jacinto 7 processor silicon is available today!
22. © 2020 Texas Instruments
Explore Deep Learning With TI Today
• Full development: TDA4 EVM
  http://www.ti.com/tool/TDA4VMXEVM
• Turn-key designs: automotive version of TDA4V Mid
  http://www.ti.com/tool/D3-3P-TDAX-DK
• Software development kits: TI Processor SDK – seamlessly reuse and migrate Linux, Linux-RT and TI-RTOS software across TI processors
  http://www.ti.com/tool/PROCESSOR-SDK-DRA8X-TDA4X
• Support: https://e2e.ti.com
Open source RunTime engines with TIDL acceleration are packaged as part of TI's NRE-free, royalty-free Processor SDK!