Scalable Image Recognition
Model with Deep Embedding
Chieh-En Tsai
b01902004@cml.csie.ntu.edu.tw
Motivation
Motivation: the rise of DNNs
• Deep Neural Networks have achieved the best
performance on a variety of visual tasks.
Motivation: popular mobile devices
• Devices like smartphones, in-car cameras, GoPro,
and IoT devices are popping up everywhere.
A huge amount of valuable images is stored not on servers,
but on mobile & IoT devices
Motivation: exploit DNNs
• High performance brought by DNNs
• Valuable data brought by mobile & IoT
devices
How do we get the best of both worlds?
Solution: client-server system
La Tour Eiffel
7–12 sec per query on average
Can’t support real-time applications
Or, another way
Solution: pure mobile system
[Diagram: dataset → feature extraction → linear classification (LIBLINEAR) on the device, or send the low-dim. feature to the server for more complicated jobs]
Problem: Limited Storage &
Computing Power
• A DNN model has too many parameters to fit in a
storage- and compute-limited system such as a
mobile or IoT device
• How can we perform image classification on mobile
& IoT devices?
Krizhevsky et al. model size (AlexNet)
A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS, 2012.
Layer: Model Size (MB)
Conv1: float * (48+48) * (3*11^2) = 0.1
Conv2: float * (128+128) * (48*5^2) = 1.2
Conv3: float * (192+192) * (256*3^2) = 3.4
Conv4: float * (192+192) * (192*3^2) = 2.5
Conv5: float * (128+128) * (192*3^2) = 1.7
FC6: float * ((128+128)*6^2) * 4096 = 144 (66%)
FC7: float * 4096 * 4096 = 64 (29%)
Total = 217 MB
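The per-layer sizes in the table above can be reproduced with a short sketch (assuming 4-byte floats and the two-GPU filter split of the original AlexNet):

```python
# Parameter storage per AlexNet layer, assuming 4-byte floats.
# Filter counts follow the original two-GPU split (e.g. 48+48 in conv1).
FLOAT = 4  # bytes per parameter

layers = {
    # name: (number of output units/filters, weights per unit)
    "conv1": (48 + 48,   3 * 11 ** 2),
    "conv2": (128 + 128, 48 * 5 ** 2),
    "conv3": (192 + 192, 256 * 3 ** 2),
    "conv4": (192 + 192, 192 * 3 ** 2),
    "conv5": (128 + 128, 192 * 3 ** 2),
    "fc6":   (4096,      (128 + 128) * 6 ** 2),
    "fc7":   (4096,      4096),
}

sizes_mb = {name: FLOAT * n * w / 2 ** 20 for name, (n, w) in layers.items()}
total_mb = sum(sizes_mb.values())

for name, mb in sizes_mb.items():
    print(f"{name}: {mb:6.1f} MB ({100 * mb / total_mb:4.1f}%)")
print(f"total: {total_mb:6.1f} MB")
```

Running this confirms that FC6 and FC7 together account for roughly 95% of the 217 MB total, which is exactly the observation the next slide exploits.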
Solution:
Semantic-Rich Low-Dim. Feature
• In recent years, the activations of the fully
connected layers of AlexNet have been viewed as
general, high-level semantic features
• Yet ~95% of the model parameters sit in the fully
connected layers
Solution:
Semantic-Rich Low-Dim. Feature
Drop the fully connected layers from the final model
while still encoding their information!
How?
Kernel Preserving Projection (KPP)
• Find a linear transformation that projects
features into a lower-dimensional space
while “preserving the relevance distances of the
kernel space”
Y.-C. Su et al., “Scalable Mobile Visual Classification by Kernel Preserving Projection over High Dimensional Features,” IEEE, 2014
Kernel Preserving Projection (KPP)
• Find an explicit transform 𝜙(𝑥) such that:
  𝑘(𝑥ᵢ, 𝑥ⱼ) ≈ 𝜙(𝑥ᵢ) ∙ 𝜙(𝑥ⱼ)
• In matrix form, we want to find a matrix 𝑷 ∈ ℝ^(d×D) such that:
  𝑲 ≈ (𝑷𝑿)ᵀ(𝑷𝑿) = 𝑿ᵀ𝑷ᵀ𝑷𝑿
Kernel Preserving Projection (KPP)
• MVProjection:
  𝑷* = argmin_𝑷 ‖𝑲 − 𝑿ᵀ𝑷ᵀ𝑷𝑿‖_F − 𝜆‖𝑿ᵀ𝑷ᵀ𝑷𝑿‖_F
• L1MVProjection:
  𝑷* = argmin_𝑷 ‖𝑲 − 𝑿ᵀ𝑷ᵀ𝑷𝑿‖_F − 𝜆‖𝑿ᵀ𝑷ᵀ𝑷𝑿‖_F + 𝜂‖𝑷‖₁
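A minimal sketch of the KPP idea, ignoring the λ and η regularizers above: since the goal is 𝑲 ≈ (𝑷𝑿)ᵀ(𝑷𝑿), one simple way to fit a 𝑷 (assuming the kernel matrix is positive semi-definite) is to take the rank-d eigendecomposition of 𝑲 as the target low-dimensional embedding and solve 𝑷𝑿 = 𝑩 by least squares. This is an illustration of the objective, not the optimizer used in the paper.

```python
import numpy as np

def fit_kpp(X, K, d):
    """Fit a projection P (d x D) such that (P X)^T (P X) ~= K.

    Sketch only: take the rank-d eigendecomposition K ~= U diag(s) U^T,
    set B = diag(sqrt(s)) U^T as the d x N target embedding, then solve
    P X = B in the least-squares sense via the pseudoinverse.
    X: D x N feature matrix (one column per sample), K: N x N PSD kernel.
    """
    s, U = np.linalg.eigh(K)                    # eigenvalues, ascending
    top = np.argsort(s)[::-1][:d]               # indices of d largest
    s_top = np.clip(s[top], 0.0, None)          # guard tiny negatives
    B = np.sqrt(s_top)[:, None] * U[:, top].T   # d x N target embedding
    P = B @ np.linalg.pinv(X)                   # d x D, so that P X ~= B
    return P

# toy check with a linear kernel (hypothetical sizes, for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))   # D=20 features, N=100 samples
K = X.T @ X                          # linear kernel, rank <= 20
P = fit_kpp(X, K, d=10)
approx = (P @ X).T @ (P @ X)         # should approximate K
```

The approximation error here is exactly the energy in the dropped eigenvalues; the paper's MVProjection / L1MVProjection objectives additionally trade this off against the −λ and +η terms.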
Deep Embedding
• Experimental results show that on hand-crafted
features, the RBF kernel performs best
• Though infinite-dimensional, the RBF space itself is
semantically meaningless!
Deep Embedding
• For the RBF kernel,
  𝑘(𝑥ᵢ, 𝑥ⱼ) = 𝜙(𝑥ᵢ)ᵀ ∙ 𝜙(𝑥ⱼ) = e^(−𝛾‖𝑥ᵢ−𝑥ⱼ‖²)
• For Deep Embedding,
  𝜙(𝑥) = ReLU(𝑥_conv5 × 𝑾_fc6)
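The contrast between the two embeddings can be written out in a few lines (γ and the array shapes here are illustrative, not the values used in the experiments):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.5):
    # k(xi, xj) = exp(-gamma * ||xi - xj||^2):
    # the feature map phi is implicit and infinite-dimensional.
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def deep_embedding(x_conv5, W_fc6):
    # phi(x) = ReLU(x_conv5 @ W_fc6):
    # an explicit, finite-dimensional, semantically meaningful map,
    # whose inner product phi(xi) . phi(xj) plays the kernel's role.
    return np.maximum(x_conv5 @ W_fc6, 0.0)
```

The point of the slide is that the fc6 weights give an *explicit* 𝜙 with semantic structure, so the projection can target it directly instead of the opaque RBF space.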
Deep Embedding
Not only is the model reduced,
but so is the classifier
Result
In the experiment, we use LIBLINEAR as our
classifier and perform 10-fold cross-validation on the
Scene-15 benchmark dataset. We first compare KPP (RBF)
against other methods on a hand-crafted state-of-the-art
feature (VLAD) to show how KPP outperforms the
others.
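The 10-fold protocol can be sketched as follows; the classifier is left pluggable (in the experiments it would be LIBLINEAR on the projected features, while the nearest-centroid stand-in below is purely illustrative):

```python
import numpy as np

def ten_fold_accuracy(X, y, train_and_predict, n_folds=10, seed=0):
    """Mean accuracy over an n-fold cross-validation split.

    train_and_predict(X_tr, y_tr, X_te) -> predicted labels for X_te.
    X: (N, D) feature rows, y: (N,) integer labels.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for i in range(n_folds):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = train_and_predict(X[tr], y[tr], X[te])
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

def nearest_centroid(X_tr, y_tr, X_te):
    # stand-in classifier: assign each test point to the nearest class mean
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = ((X_te[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[np.argmin(d, axis=1)]
```

Each sample is used for testing exactly once, and the reported number is the mean accuracy over the 10 folds.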
Result
Result-Deep Embed
- The accuracy boost from 75.6% (hand-crafted) to 89.5% (AlexNet)
shows the power of DNNs
- Deep embedding outperforms the other methods by a
large margin on DNN features.
The final model:
- Requires only 14% of the parameters, saving 86% of the
space (217 MB → 30 MB)
- Loses only 1.12% accuracy (89.5% → 88.38%)
- Is suitable for mobile & IoT device computing!
Result-Deep Embed
[Chart: final model size reduced to 30 MB]
Thank you!
