"Subclass deep neural networks: re-enabling neglected classes in deep network training for multimedia classification", by N. Gkalelis, V. Mezaris. Proceedings of the 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.
Subclass deep neural networks: re-enabling neglected classes
in deep network training for multimedia classification
N. Gkalelis, V. Mezaris
CERTH-ITI, Thermi-Thessaloniki, Greece
26th Int. Conf. Multimedia Modeling
Daejeon, Korea, January 2020
Outline
• Problem statement
• Related work
• Identification of neglected classes
• Subclass DNNs
• Experiments
Problem statement

• Deep neural networks (DNNs) have shown remarkable performance in many machine learning problems
• They are being commercially deployed in various domains, e.g. multimedia understanding, self-driving cars, edge computing
Image credits: V2Gov; [1]
[1] Chen, J., Ran, X.: Deep learning with edge computing: A review. Proc. of the IEEE 107(8) (Aug 2019)
Problem statement

One weakness of DNNs is that some classes get less attention during training, for instance due to:
• Overfitting: the error on the training set becomes negligible although the generalization error is still high
• Put simply, the contributions of the misclassified observations to the updating of the weights associated with these classes cancel out
How can these neglected classes be identified?
How can the classification performance of the neglected classes be boosted?
Related work
• The problem of neglected classes is sometimes related to imbalanced class learning, where the distribution of training data across the different classes is skewed
• There are relatively few works studying the class imbalance problem in DNNs (e.g. see [1, 2])
• As observed in our experiments, neglected classes may also appear with balanced training datasets (e.g. the CIFAR datasets)
• The identification of classes getting less attention from the DNN during the training procedure is a relatively unexplored topic
[1] Sarafianos, N., Xu, X., Kakadiaris, I.A.: Deep imbalanced attribute classification using visual attention aggregation. ECCV, Munich, Germany (Sep 2018)
[2] Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE ICCV, Venice (2017)
Identification of neglected classes: motivation
• During backpropagation, a strong gradient update g_i for the weight vector w_i at the output layer is typically produced for classes with a high training error e_i
• In this way, the overall network is guided to extract nonlinear features at the different layers in order to provide a linearly separable subspace at the output layer for these classes
[Figure: the weight vector w_i of output node i (associated with class i) is updated as w_i' = w_i + g_i; the length of the gradient update ||g_i|| is large, as expected due to the high error at output node i, so w_i (as well as the weight vectors in the layers below) is updated to reduce the output node error.]
Identification of neglected classes: motivation
• However, it is observed that for certain output nodes the gradient update is close to zero, although the classes associated with these nodes still have a large error
• As a result, the DNN neglects to extract the discriminant features necessary for reducing the error of these classes
• We investigate this case in detail and, based on this analysis, propose a new measure to identify the so-called neglected classes
[Figure: here the length of the gradient update ||g_i|| is small despite a large error at output node i; the updated weight vector w_i' remains almost identical to w_i because ||g_i|| is close to zero.]
Identification of neglected classes: background
• Suppose a DNN with sigmoid (SG) output layer and cross entropy (CE) loss
Notation, for the i-th output node of the SG layer (associated with the i-th class):
• x_k = (x_{1,k}, ..., x_{j,k}, ..., x_{f,k})^T: input vector to the i-th node, associated with the k-th training observation
• w_i = (w_{1,i}, ..., w_{j,i}, ..., w_{f,i})^T and b_i: weight vector and bias term of the i-th node
• ŷ_{i,k}: output of the i-th node, associated with the k-th training observation
Identification of neglected classes: background
Gradient update of w_i (minibatch SGD, SG output layer with CE loss):

g_i = η Σ_{k=1..K} (y_{i,k} − ŷ_{i,k}) x_k

where:
• η: learning rate
• K: number of batch observations
• y_{i,k} ∈ {0, 1}: label of the k-th observation w.r.t. class i
• x_k: input vector to the i-th output node, associated with the k-th training observation in the batch
• ŷ_{i,k}: output of the i-th node, associated with the k-th training observation
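As an illustration, a minimal NumPy sketch of this update step (function and variable names are my own, not from the paper):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer_sgd_step(W, b, X, Y, lr=0.1):
    # X: (K, f) inputs to the output layer for a batch of K observations
    # Y: (K, C) 0/1 class labels; W: (f, C) weights; b: (C,) bias terms
    Y_hat = sigmoid(X @ W + b)       # node outputs ŷ_{i,k}, shape (K, C)
    # column i of G is g_i = lr * Σ_k (y_{i,k} − ŷ_{i,k}) x_k
    G = lr * (X.T @ (Y - Y_hat))     # shape (f, C)
    return W + G, b + lr * (Y - Y_hat).sum(axis=0)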
Identification of neglected classes

Unnormalized gradient update (w.r.t. the i-th node), split into the contributions of the positive- and negative-class observations (with g_i = η u_i):

u_i = Σ_{k∈P_i} (1 − ŷ_{i,k}) x_k − Σ_{k∈N_i} ŷ_{i,k} x_k = u_i⁺ + u_i⁻

where P_i (resp. N_i) is the set of batch observations labeled positive (resp. negative) w.r.t. class i; u_i⁺ is the unnormalized gradient update of the positive-class observations and u_i⁻ that of the negative-class observations.
Identification of neglected classes
Compare the length of the unnormalized gradient update w.r.t. the i-th node, ||u_i||, with the lengths of its positive- and negative-class components, ||u_i⁺|| and ||u_i⁻||. A class is flagged as neglected when ||u_i|| is close to zero although ||u_i⁺|| and ||u_i⁻|| are both large, i.e. the positive and negative contributions cancel out even though the class error is still high.
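A NumPy sketch of these quantities follows; the cancellation ratio used as the neglection score here is an illustrative proxy, not necessarily the exact measure defined in the paper:

import numpy as np

def neglection_stats(X, Y, Y_hat):
    # X: (K, f) output-layer inputs; Y, Y_hat: (K, C) labels and node outputs
    pos = Y * (1.0 - Y_hat)           # factors (1 − ŷ) of positive observations
    neg = (1.0 - Y) * Y_hat           # factors ŷ of negative observations
    U_pos = X.T @ pos                 # column i holds u_i⁺
    U_neg = -(X.T @ neg)              # column i holds u_i⁻
    len_pos = np.linalg.norm(U_pos, axis=0)
    len_neg = np.linalg.norm(U_neg, axis=0)
    len_tot = np.linalg.norm(U_pos + U_neg, axis=0)
    # proxy score: large when big positive/negative contributions cancel out
    score = (len_pos + len_neg) / (len_tot + 1e-12)
    return len_tot, len_pos, len_neg, score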
Training of neglected classes: Subclass DNNs
• A subclass-based learning framework is applied to improve the classification performance of the neglected classes
• Subclass-based classification techniques have been used successfully in the shallow learning paradigm [1-5]
• To this end, the neglected classes are augmented and partitioned into subclasses, and the resulting labeling information is used to build and train a subclass DNN
[1] Hastie, T., Tibshirani, R.: Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B 58(1) (Jul 1996)
[2] Manli, Z., Martinez, A.M.: Subclass discriminant analysis. IEEE Trans. Pattern Anal. Mach. Intell. 28(8) (Aug 2006)
[3] Escalera, S. et al.: Subclass problem-dependent design for error-correcting output codes. IEEE Trans. Pattern Anal. Mach. Intell. 30(6) (Jun 2008)
[4] You, D., Hamsici, O.C., Martinez, A.M.: Kernel optimization in discriminant analysis. IEEE Trans. Pattern Anal. Mach. Intell. 33(3) (Mar 2011)
[5] Gkalelis, N., Mezaris, V., Kompatsiaris, I., Stathaki, T.: Mixture subclass discriminant analysis link to restricted Gaussian model and other generalizations. IEEE Trans. Neural Netw. Learn. Syst. 24(1) (Jan 2013)
Training of neglected classes: Subclass DNNs
Notation for the subclass CE loss:
• N: number of training observations
• C: number of classes
• H_i: number of subclasses of class i
• y_{i,k}: label of the k-th observation w.r.t. class i
• y_{i,j,k}: label of the k-th observation w.r.t. subclass (i, j)
• ŷ_{i,j,k}: output of node (i, j), associated with the k-th training observation
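As a rough sketch of how such a loss can be set up in PyTorch: each subclass (i, j) gets its own sigmoid output node, and binary CE is applied against the subclass labels. This simplified form and the max aggregation at inference are my assumptions, not necessarily the exact loss of the paper:

import torch
import torch.nn.functional as F

def subclass_ce_loss(logits, sub_labels):
    # logits: (K, S) raw outputs, one node per subclass (i, j), S = Σ_i H_i
    # sub_labels: (K, S) float 0/1 subclass labels y_{i,j,k}, flattened likewise
    return F.binary_cross_entropy_with_logits(logits, sub_labels)

def class_scores(logits, sub_to_class, n_classes):
    # aggregate subclass node outputs back to class scores at inference time
    probs = torch.sigmoid(logits)                 # (K, S)
    scores = torch.zeros(logits.size(0), n_classes)
    for s, c in enumerate(sub_to_class):          # subclass index -> class index
        scores[:, c] = torch.maximum(scores[:, c], probs[:, s])
    return scores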
Experiments
Multiclass classification
• CIFAR10 [1]: 10 classes, 32 x 32 color images, 50000 training and 10000 testing
observations
• CIFAR100 [1]: as CIFAR10 but with 100 classes
• SVHN [2]: 10 classes, 32 x 32 color images, 73257 training, 26032 testing and
531131 extra observations
[1] Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep. (2009)
[2] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. NIPS Workshops (2011)
Experiments
Experimental setup (similar to the setup of [1])
• Images are normalized to zero mean and unit variance
• Data augmentation is applied as in [1] (cropping, mirroring, cutout, etc.)
• DNNs [2,3]: VGG16, WRN-28-10 (CIFAR-10, -100), WRN-16-8 (SVHN)
• CE loss, minibatch SGD, Nesterov momentum 0.9, batch size 128, weight decay 0.0005, 200 epochs, initial learning rate 0.1, step schedule (learning rate multiplied by 0.1 for CIFAR or 0.2 for SVHN at epochs 60, 120 and 160); see the configuration sketch after the references below
[1] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552 (2017)
[2] Zagoruyko, S., Komodakis, N.: Wide residual networks. BMVC, York, UK (Sep 2016)
[3] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR, San Diego, CA, USA (May 2015)
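In PyTorch, this optimization recipe would look roughly as follows (`net` and `train_one_epoch` are hypothetical placeholders for the model and the minibatch training loop):

import torch

optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)
# multiply the learning rate by 0.1 (CIFAR; 0.2 for SVHN) at epochs 60, 120, 160
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 120, 160],
                                                 gamma=0.1)
for epoch in range(200):
    train_one_epoch(net, optimizer)  # hypothetical CE training loop, batch size 128
    scheduler.step()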
Experiments
SDNN training
• The DNN is trained for 30 epochs in order to compute a reliable neglection score for each class
• The classes with the highest neglection scores are selected: 2 classes for CIFAR-10 and SVHN, 10 classes for CIFAR-100
• The observations of the selected classes are doubled using the augmentation approach presented in [1]
• The selected classes are partitioned into 2 subclasses each using k-means (see the sketch after the reference below)
• The SDNN is trained using the annotations at class and subclass level
[1] Inoue, H.: Data augmentation by pairing samples for images classification. arXiv:1801.02929 (2018)
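A rough sketch of the doubling and partitioning step; the pairing-style averaging (in the spirit of [1]) and the use of scikit-learn k-means are my assumptions about the details:

import numpy as np
from sklearn.cluster import KMeans

def double_and_partition(X_cls, seed=0):
    # X_cls: (n, ...) observations of one neglected class
    rng = np.random.default_rng(seed)
    # pairing-style augmentation: average each observation with a randomly
    # chosen observation of the same class, doubling the class
    partner = rng.permutation(len(X_cls))
    X_all = np.concatenate([X_cls, 0.5 * (X_cls + X_cls[partner])], axis=0)
    # partition the augmented class into 2 subclasses with k-means
    flat = X_all.reshape(len(X_all), -1)
    sub_labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(flat)
    return X_all, sub_labels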
Experiments
Results
• The proposed subclass DNNs (SVGG16, SWRN) provide improved performance over their
conventional variants (VGG16 [1], WRN [2])
• Considering that [2] is among the state-of-the-art approaches, the attained CCR performance
improvement is significant
• Training time overhead is negligible for the medium-size datasets (CIFAR-10, -100) and relatively
small for the larger SVHN; testing time is only a few seconds for all DNNs
CCR (training time):

          | VGG16 [1]       | SVGG16          | WRN [2]         | SWRN
CIFAR-10  | 93.5% (2.6 h)   | 94.8% (2.7 h)   | 96.92% (7.1 h)  | 97.14% (8.1 h)
CIFAR-100 | 71.24% (2.6 h)  | 73.67% (2.6 h)  | 81.59% (7.1 h)  | 82.17% (7.5 h)
SVHN      | 98.16% (29.1 h) | 98.35% (34.1 h) | 98.7% (33.1 h)  | 98.81% (42.7 h)
[1] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR, San Diego, CA, USA (May 2015)
[2] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552 (2017)
Experiments
Multilabel classification
• YouTube-8M (YT8M) [1]: largest public dataset for multilabel video classification
• 3862 classes, 3888919 training, 1112356 evaluation and 1133323 testing videos
• 1024- and 128-dimensional visual and audio feature vectors are provided for each
video
[1] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, A.P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675 (2016)
Experiments
Experimental setup
• Two experiments: using the visual feature vectors directly; concatenating the visual and audio feature vectors
• L2-normalization is applied to the feature vectors in all cases
• DNN: 1 conv layer with 64 1d-filters, max-pooling, dropout, SG output layer (see the model sketch after the references below)
• CE loss, minibatch SGD, batch size 512, weight decay 0.0005, 5 epochs, initial learning rate 0.001, exponential schedule (learning rate multiplied by 0.95 at every epoch)
• The evaluation metrics of the YT8M challenge [1] are utilized for performance evaluation; GAP@20 is used as the primary evaluation metric
[1] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, A.P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675 (2016)
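A minimal PyTorch sketch of this classifier; the kernel width, pooling size and dropout rate are not specified on the slide and are assumptions here:

import torch
import torch.nn as nn

class YT8MNet(nn.Module):
    # 1 conv layer with 64 1-d filters, max-pooling, dropout, SG output layer;
    # kernel size, pool size and dropout rate are assumed, not from the slide
    def __init__(self, in_dim=1152, n_classes=3862):  # 1024 visual + 128 audio
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(64 * (in_dim // 2), n_classes)

    def forward(self, x):                           # x: (batch, in_dim) features
        h = torch.relu(self.conv(x.unsqueeze(1)))   # (batch, 64, in_dim)
        h = self.pool(h)                            # (batch, 64, in_dim // 2)
        return torch.sigmoid(self.fc(self.drop(h.flatten(1))))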
Experiments

SDNN training
• The DNN is trained for 1/3 of an epoch; a neglection score for each class is computed
• The 386 classes with the highest neglection scores are selected
• The observations of the selected classes are doubled by applying extrapolation in feature space [1] (see the sketch after the references below)
• An efficient variant of the nearest neighbor-based clustering algorithm [2,3] is used to partition the neglected classes into subclasses
• The SDNN is trained using the annotations at class and subclass level
[1] DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. ICLR Workshops, Toulon, France (Apr 2017)
[2] Manli, Z., Martinez, A.M.: Subclass discriminant analysis. IEEE Trans. Pattern Anal. Mach. Intell. 28(8), 1274-1286 (Aug 2006)
[3] You, D., Hamsici, O.C., Martinez, A.M.: Kernel optimization in discriminant analysis. IEEE Trans. Pattern Anal. Mach. Intell. 33(3), 631-638 (Mar 2011)
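A small NumPy sketch of feature-space extrapolation in the spirit of [1]; pairing each feature vector with its nearest same-class neighbor and the value of lam are assumptions:

import numpy as np

def extrapolate_class_features(F_cls, lam=0.5):
    # F_cls: (n, d) feature vectors of one neglected class; each vector is
    # extrapolated away from its nearest same-class neighbor, as in
    # DeVries & Taylor (2017): x' = (x_i - x_j) * lam + x_i
    d2 = ((F_cls[:, None, :] - F_cls[None, :, :]) ** 2).sum(-1)  # pairwise dists
    np.fill_diagonal(d2, np.inf)       # exclude self-matches
    nn_idx = d2.argmin(axis=1)         # nearest neighbor of each vector
    F_new = (F_cls - F_cls[nn_idx]) * lam + F_cls
    return np.concatenate([F_cls, F_new], axis=0)  # doubled observations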
Experiments
Results
• The proposed SDNN outperforms the DNN (in terms of GAP@20) by 1% and 1.5% using visual and audiovisual features, respectively; both networks outperform Logistic Regression (LR)
• For both networks, a performance gain of roughly 3% is observed when the audio information is also exploited
          |        Visual         |    Visual + Audio
          | LR    | DNN   | SDNN  | LR    | DNN   | SDNN
Hit@1     | 82.4% | 82.5% | 83.2% | 82.3% | 85.2% | 85.7%
PERR      | 71.9% | 72.2% | 72.9% | 71.8% | 75.4% | 75.9%
mAP       | 41.2% | 42.3% | 45.2% | 40.1% | 45.6% | 47.9%
GAP@20    | 77.1% | 77.6% | 78.6% | 77%   | 80.7% | 82.2%
Ttr (min) | 18.7  | 59.2  | 66.2  | 18.9  | 60.3  | 67.1
Experiments
Results
• Comparison with the best single-model results on YT8M
• These methods use frame-level features and build upon stronger, computationally demanding feature descriptors (e.g. Fisher Vectors, VLAD, FVNet, DBoF and others)
• The proposed SDNN performs on par with the top performers in the literature, despite utilizing only the video-level features provided with YT8M
       | [1]    | [2]   | [3]   | [4]    | SDNN
GAP@20 | 82.15% | 80.9% | 82.5% | 82.25% | 82.2%

[1] Huang, P., Yuan, Y., Lan, Z., Jiang, L., Hauptmann, A.G.: Video representation learning and latent concept mining for large-scale multi-label video classification. arXiv:1707.01408 (2017)
[2] Na, S., Yu, Y., Lee, S., Kim, J., Kim, G.: Encoding video and label priors for multilabel video classification on YouTube-8M dataset. CVPR Workshops (2017)
[3] Bober-Irizar, M., Husain, S., Ong, E.J., Bober, M.: Cultivating DNN diversity for large scale video labelling. CVPR Workshops (2017)
[4] Skalic, M., Pekalski, M., Pan, X.E.: Deep learning methods for efficient large scale video labeling. CVPR Workshops (2017)
Summary and next steps
Summary
• A new criterion was suggested to identify the neglected classes, i.e. classes that get less attention from the DNN during the training procedure
• Subclass DNNs were proposed, where the identified classes are augmented and partitioned into subclasses, and a new CE loss is applied to exploit the subclass labeling information effectively
• The proposed approach was evaluated successfully in two different problem domains: multiclass classification (CIFAR-10, CIFAR-100, SVHN) and multilabel classification (YT8M)
Next steps
• Further improve the classification performance of the proposed approach using a mixture-of-experts model to combine the subclass scores at the output layer
• Extend the proposed approach to the semi-supervised paradigm in order to exploit partially labelled datasets more effectively
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/subclass_deep_neural_networks
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV.