Introduction of Mobile CNN
1. ⓒ 2016 UEC Tokyo.
Introduction of Mobile CNN
2016/11/10(Thu)
Department of Informatics,
The University of Electro-Communications,
Yanai Laboratory,
Ryosuke Tanno
2.
Self Introduction
• Affiliation: first-year master's student at The University of
Electro-Communications (Yanai Laboratory)
• Research:
– Bachelor: Implementation and Comparative Analysis of an Image
Recognition System Based on Deep Learning on Mobile OS
– Master: Image Recognition and Image Transfer
Based on Deep Learning
3.
Contributions
• Stand-alone DCNN-based mobile image recognition
– No need for a recognition server or network communication
– Built-in trained DCNN model on UEC-FOOD100
– Implemented as an iOS/Android app
– Released as an iOS app at https://goo.gl/4m2tQz
– and as an Android app (APK) at http://foodcam.mobi/
• Excellent performance with reasonable speed and model size
– UEC-FOOD100: 78.8% (top-1), 95.2% (top-5)
in 55.7 ms with 5.5M weights (22 MB)
– Employing Network-in-Network (NIN)
– Adding batch normalization and additional layers
• Multi-scale recognition
– The user can choose the balance between speed and accuracy
• 26.2 ms for 160x160 images ⇔ 55.7 ms for 227x227 images (on iPhone 7 Plus)
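The multi-scale trade-off above can be made concrete as a tiny selection rule: given a latency budget, pick the largest (most accurate) input size whose measured time fits. A minimal sketch in Python; the function name `pick_input_size` and the budget-based policy are illustrative assumptions, and the timings are the ones reported in these slides (iPhone 7 Plus):

```python
# Measured (size -> (latency in ms, top-1 accuracy)) from the slides,
# for the 5-layer NIN on an iPhone 7 Plus.
PROFILES = {227: (55.7, 0.788), 180: (35.5, 0.760), 160: (26.2, 0.715)}

def pick_input_size(budget_ms):
    """Return the largest (most accurate) input size whose measured
    latency fits within the budget; fall back to the fastest size."""
    feasible = [s for s, (ms, _) in PROFILES.items() if ms <= budget_ms]
    return max(feasible) if feasible else min(PROFILES)
```

For example, a 40 ms budget selects 180x180, while 60 ms allows the full 227x227 input.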
4.
CNN architecture (1)
• The numbers of weights in AlexNet and VGG-16 are
too large for mobile devices.
• GoogLeNet is too complicated
for an efficient parallel implementation
(it has many branches).
5.
CNN architecture (2)
• We adopt Network-in-Network (NIN).
– No fully connected layers (which means far fewer weights)
– A straight dataflow consisting of many conv layers
⇒ It is easy to implement in parallel.
Efficient computation of the conv layers is needed!
Network-in-Network (NIN)
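The key building block of NIN is the 1x1 "cccp" (cascaded cross-channel parametric pooling) layer: a per-pixel fully connected layer, which is exactly one GEMM over the channel axis. A minimal numpy sketch (not the authors' on-device implementation; shapes and names are illustrative):

```python
import numpy as np

def cccp(x, w, b):
    """One 1x1 'cccp' layer from NIN, followed by ReLU.
    x: (C_in, H, W) feature map; w: (C_out, C_in); b: (C_out,).
    A 1x1 conv is just a matrix multiply over the channel axis."""
    c_in, h, wdt = x.shape
    y = w @ x.reshape(c_in, h * wdt) + b[:, None]   # single GEMM
    return np.maximum(y, 0.0).reshape(len(w), h, wdt)

# In NIN, each spatial conv is followed by two cccp layers
# (together forming one "mlpconv" block).
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))            # stand-in for a conv output
h1 = cccp(x, rng.standard_normal((32, 16)), np.zeros(32))
h2 = cccp(h1, rng.standard_normal((24, 32)), np.zeros(24))
print(h2.shape)  # (24, 8, 8)
```

Because every cccp layer reduces to a GEMM, speeding up GEMM (as the later slides do) speeds up most of the network.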
6.
Extension of NIN:
adding BN, a fifth layer, and multiple image sizes
• Modified models (BN, 5 layers, multi-scale)
– added BN layers just after every conv/cccp layer
– replaced the 5x5 conv with two 3x3 conv layers
– reduced the number of kernels in conv4 from 1024 to 768
– replaced fixed-size average pooling with Global Average Pooling (GAP)
• Multiple image sizes (trade-off: accuracy vs. speed)
– 227x227: 55.7 ms, 78.8% (top-1)
– 180x180: 35.5 ms, 76.0%
– 160x160: 26.3 ms, 71.5%
[Figure: 4-layer and 5-layer+BN network diagrams with Global Average Pooling (GAP)]
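Global Average Pooling is what makes the multi-size input possible: it averages each channel's feature map down to a single value, so the output vector has a fixed length regardless of the input resolution. A minimal numpy sketch (the feature-map sizes below are illustrative stand-ins, not the network's actual shapes):

```python
import numpy as np

def global_average_pool(x):
    """Global Average Pooling: average each channel's feature map
    to one value. x: (C, H, W) -> (C,)."""
    return x.mean(axis=(1, 2))

# The same head then works for any input resolution:
for size in (227, 180, 160):
    fmap = np.ones((100, size // 32, size // 32))  # hypothetical final feature map
    print(global_average_pool(fmap).shape)  # (100,) regardless of input size
```

This is why one trained model can serve all three input sizes in the speed/accuracy trade-off above.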
7.
Fast Implementation on Mobile
• Speeding up conv layers → speeding up GEMM
– The computation of a conv layer is decomposed into an "im2col"
operation and a general matrix multiplication (GEMM)
– Multi-threading: use 2 cores on iOS and 4 cores on Android in
parallel
– SIMD instructions (NEON on ARM-based processors)
• Total: iOS: 2 cores x 4 lanes = 8 parallel computations; Android: 4 cores x 4 lanes = 16 parallel computations
– BLAS library (highly optimized on iOS ⇔ not optimized for
Android)
• BLAS (iOS: BLAS in the iOS Accelerate Framework; Android: OpenBLAS)
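The im2col + GEMM decomposition works by unfolding every k x k patch of the input into a column, so the whole convolution collapses into one big matrix multiply that a tuned BLAS can execute. A minimal numpy sketch (stride 1, no padding; the real app dispatches the GEMM to Accelerate/OpenBLAS with NEON, not to numpy):

```python
import numpy as np

def im2col(x, k):
    """Unfold k x k patches of x (C, H, W) into columns:
    result shape (C*k*k, out_h*out_w)."""
    c, h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((c * k * k, oh * ow))
    idx = 0
    for ci in range(c):
        for di in range(k):
            for dj in range(k):
                cols[idx] = x[ci, di:di + oh, dj:dj + ow].reshape(-1)
                idx += 1
    return cols

def conv2d_gemm(x, weights):
    """Convolution as a single GEMM. weights: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = weights.shape
    cols = im2col(x, k)                       # im2col step
    out = weights.reshape(c_out, -1) @ cols   # one big matrix multiply (GEMM)
    oh = x.shape[1] - k + 1
    return out.reshape(c_out, oh, -1)
```

The memory cost of materializing the unfolded matrix is the price paid for turning many small dot products into one BLAS-friendly GEMM.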
9.
Evaluation: Processing time
• iOS: BLAS >> NEON; Android: BLAS << NEON
– The BLAS library in the iOS Accelerate Framework is very efficient!
• Trade-off between accuracy and speed by changing the size of the input images
• Achieved "real" real-time recognition: 26.2 ms (fastest configuration)
[Figure: processing-time charts for iOS and Android, marking the fastest and most accurate configurations]
10.
Comparison to FV-based FoodCam
with UEC-FOOD100 dataset
• Much improved: 65.3% ⇒ 81.5% (top-1)
• Even at 160x160, improved: 65.3% ⇒ 71.5%
[Figure: top-k accuracy (k = 1..10) on UEC-FOOD100 for AlexNet, NIN 5-layer [104 ms], NIN 4-layer [67 ms], NIN 4-layer at 160x160 [33 ms], and FV (Color+HOG) [65 ms]; best DCNN: 81.5% (top-1) / 96.2% (top-5), FV baseline: 65.3% (top-1) / 86.7% (top-5)]