In this talk, Shaoshan Liu discussed how PerceptIn enables robotic applications on IoT devices. The basic services in robotics include localization, scene understanding, and speech recognition. These tasks are complex and often require very expensive computing hardware to deliver real-time performance. However, affordability drives accessibility: to incubate the massive low-end robotics market, PerceptIn aims to enable complex AI tasks on affordable IoT devices. Previously, PerceptIn demonstrated that limited autonomous driving workloads can run on an embedded ARM SoC that peaks at 15 W of power consumption and costs a few hundred dollars. Recently, PerceptIn moved a step further in this direction, enabling computer vision, obstacle avoidance, and localization on its newly released Zuluko module, which costs about $30, contains a low-power ARM SoC, and consumes 5 W at peak.
6. Robotic Technologies: Scene Understanding
[CNN pipeline diagram: Input → Convolution Layer → Activation Layer → Pooling Layer → Fully Connected Layer → Label]
CNN Pipeline:
• Convolution layer: extracts features from the input
• Activation layer: a function that determines whether a signal should be activated or not (e.g. the sigmoid function)
• Pooling layer: reduces the spatial size of the representation
• Fully connected layer: each neuron has full connections to all activations in the previous layer, as in a regular neural network
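The four stages above can be sketched as a minimal forward pass. This is an illustrative NumPy toy (random weights, a single channel), not PerceptIn's implementation:

```python
import numpy as np

def conv2d(x, kernel):
    """Convolution layer: slide the kernel over the input to extract local features."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def sigmoid(x):
    """Activation layer: squash each response into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def max_pool(x, size=2):
    """Pooling layer: keep the max of each size x size block, shrinking the map."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def fully_connected(x, weights, bias):
    """Fully connected layer: every output connects to every input activation."""
    return x.reshape(-1) @ weights + bias

# Toy forward pass: 6x6 input -> conv(3x3) -> sigmoid -> pool(2) -> FC -> 2 scores
rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))
features = max_pool(sigmoid(conv2d(image, kernel)))  # shape (2, 2)
w, b = rng.standard_normal((4, 2)), np.zeros(2)
scores = fully_connected(features, w, b)
label = int(np.argmax(scores))  # predicted class index
```

Real CNNs stack many such conv/activation/pool blocks before the fully connected classifier, but the data flow is exactly this.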
7. Robotic Technologies: Speech Recognition
[Speech recognition pipeline diagram: Speech → Feature Extraction → Feature Vector → Decoder (Acoustic Models, Language Models, Pronunciation Dictionary) → Words]
Speech Recognition Pipeline:
• Feature extraction: converts audio signals into feature vectors
• Decoder: combines acoustic models, language models, and a pronunciation dictionary to convert the feature vectors into a list of words
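The decoder's job of combining model scores can be sketched as follows. The word probabilities here are made-up toy values, and a real decoder searches over far larger hypothesis spaces (e.g. with beam search); this greedy version only illustrates how acoustic and language scores combine:

```python
import math

# Hypothetical toy models; probabilities are illustrative, not from any real corpus.
acoustic_model = {  # summarized P(observed features | word)
    "hello": 0.6, "yellow": 0.3, "world": 0.7, "word": 0.2,
}
language_model = {  # bigram-style P(word | previous word)
    ("<s>", "hello"): 0.5, ("<s>", "yellow"): 0.1,
    ("hello", "world"): 0.6, ("hello", "word"): 0.1,
}

def decode(frames):
    """Greedy decoder: for each frame, pick the candidate word maximizing
    log P(acoustic) + log P(language | previous word)."""
    prev, words = "<s>", []
    for candidates in frames:
        best = max(
            candidates,
            key=lambda w: math.log(acoustic_model.get(w, 1e-9))
                        + math.log(language_model.get((prev, w), 1e-9)),
        )
        words.append(best)
        prev = best
    return words

# Each frame lists candidate words the pronunciation dictionary matched.
print(decode([["hello", "yellow"], ["world", "word"]]))  # ['hello', 'world']
```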
13. Bringing Intelligence to Embedded Systems
Nvidia Jetson TX1:
• CPU: quad-core ARM Cortex-A57
• GPU: Nvidia Maxwell with 256 CUDA cores
• Peak power: ~15 W
15. Bringing Intelligence to Embedded Systems
• Localization runs mostly on the CPU, with some GPU offload to accelerate feature extraction
• Speech recognition is CPU intensive
• Scene understanding runs mostly on the GPU
• Combined, they use 60% of the CPU and 70% of the GPU
16. Bringing Intelligence to Embedded Systems
• Localization consumes 4.5 W
• Speech recognition consumes 4 W
• Scene understanding consumes 7 W
• Combined, they consume 11 W
17. Bringing Intelligence to Embedded Systems
• Resource utilization: when offloading to the cloud, the device consumes only about 10% of its CPU
• High latency: ranging from 2 to 5 seconds
• Local area network: ~100 ms
18. Bringing Intelligence to IoT Devices
[Zuluko block diagram: quad-core ARM Cortex-A7 SoC; camera over USB; IMU over UART; TF card slot; 512 MB memory; 16 MB flash]
• Peak power ~ 5 W
• Bare-metal device
• Small memory size
20. Bringing Intelligence to IoT Devices
• The ARM Compute Library (ACL) is a collection of low-level software functions optimized for ARM Cortex CPU and ARM Mali GPU architectures
• ACL targets a variety of use cases, including image processing, computer vision, and machine learning
25. Bringing Intelligence to IoT Devices
• Multitasking on IoT devices
• A lightweight communication protocol is required for the different tasks to communicate
• A compiler is needed to convert existing models (MXNet, Caffe, TensorFlow) directly into executable code for different hardware
• Innovation on the computing hardware side
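To make the second point concrete, here is a minimal in-process publish/subscribe bus of the kind such a lightweight protocol might provide. The class and topic names are hypothetical stand-ins, not any protocol PerceptIn described:

```python
import queue
import threading

class LightweightBus:
    """Minimal in-process publish/subscribe bus: each task subscribes to the
    topics it needs and publishes its own results, with no shared state."""
    def __init__(self):
        self._subscribers = {}  # topic -> list of subscriber queues
        self._lock = threading.Lock()

    def subscribe(self, topic):
        """Register a new subscriber; returns the queue messages arrive on."""
        q = queue.Queue()
        with self._lock:
            self._subscribers.setdefault(topic, []).append(q)
        return q

    def publish(self, topic, message):
        """Deliver a message to every queue subscribed to the topic."""
        with self._lock:
            for q in self._subscribers.get(topic, []):
                q.put(message)

# Example: localization publishes a pose; obstacle avoidance consumes it.
bus = LightweightBus()
poses = bus.subscribe("pose")
bus.publish("pose", {"x": 1.0, "y": 2.0})
print(poses.get_nowait())  # {'x': 1.0, 'y': 2.0}
```

On a bare-metal device with 512 MB of memory, a queue-per-subscriber design like this keeps per-task overhead small while still decoupling the localization, speech, and vision pipelines.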