LARGE-SCALE VIDEO CLASSIFICATION WITH
CONVOLUTIONAL NEURAL NETWORKS
ANDREJ KARPATHY LI FEI-FEI SANKETH SHETTY
RAHUL SUKHTHANKAR THOMAS LEUNG GEORGE TODERICI
PRESENTED BY: KHALID KHAN
SUMMARY
• Convolutional Neural Networks have been established as a powerful class of models for image
recognition problems.
• This paper provides an extensive empirical evaluation of CNNs on large-scale video
classification using a new dataset of approx. 1 million YouTube videos.
• Multiple approaches were studied for extending the connectivity of a CNN in the time domain.
• Suggested a multiresolution architecture as a promising way of speeding up training.
• Some performance improvements were observed compared to previous feature-based and
single-frame models.
CONVOLUTIONAL NEURAL NETWORK
• Similar to other neural networks, a CNN consists of several layers, each containing neurons
that operate independently of one another.
• Each neuron has learnable weights; it receives some input, performs an operation, and
passes its output to neurons in the next layer.
• A CNN consists of an input layer, an output layer, and multiple hidden layers, which include
convolutional, pooling, and fully connected layers.
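The stack of layers above can be sketched end to end. Below is a minimal NumPy forward pass through one convolutional layer, a ReLU, and a max-pooling layer; the sizes and weights are toy values for illustration, not the paper's architecture:

```python
import numpy as np

def conv2d(x, w, b):
    """Valid 2-D convolution (cross-correlation, as in CNN libraries)."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w) + b
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * size:(i + 1) * size,
                          j * size:(j + 1) * size].max()
    return out

x = np.random.randn(8, 8)                 # toy single-channel "image"
w = np.random.randn(3, 3)                 # one learnable 3x3 filter
h = max_pool(np.maximum(conv2d(x, w, 0.0), 0))   # conv -> ReLU -> pool
print(h.shape)                            # (3, 3): (8 - 3 + 1) // 2 = 3
```

In a full network this conv/ReLU/pool block repeats several times before the fully connected layers produce class scores.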
VIDEO CLASSIFICATION USING CNN
• A new dataset, Sports-1M, is used to train the CNN architecture.
• Sports-1M consists of 1 million YouTube videos belonging to 487 classes of sports.
• Provides an architecture that processes the input at two different resolutions – a low-resolution
context stream and a high-resolution fovea stream – to improve runtime performance.
• Applying the network to another dataset, UCF-101, yields a significant improvement
compared to the results obtained by training networks on UCF-101 alone.
RELATED WORK
• CNNs have been applied to small-scale image recognition problems on datasets such as MNIST,
CIFAR-10/100, NORB, and Caltech-101/256.
• There has been little to no work on applying CNNs to video classification.
• Available video datasets contain only a few thousand clips and a few dozen classes, which
may explain the lack of work on video classification.
MODELS
• Videos are divided into short clips,
and frames are generated from each clip.
• Three broad connectivity pattern
categories are described: Early
Fusion, Late Fusion, and Slow
Fusion.
• Early Fusion combines information
across an entire time window in the
first convolutional layer.
• Late Fusion places two separate
single-frame networks on frames far
apart in time and merges them in
the fully connected layer.
• Slow Fusion is a combination of Early
and Late Fusion, in which higher
layers have access to progressively
more global information in time.
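The three fusion patterns differ mainly in how a clip's frames are wired into the network's input. A rough NumPy sketch of the input shapes, using hypothetical clip and window sizes:

```python
import numpy as np

T, H, W, C = 10, 16, 16, 3            # hypothetical clip: 10 RGB frames
clip = np.random.randn(T, H, W, C)

# Single-frame baseline: the network sees one frame, 3 channels.
single = clip[T // 2]                 # (16, 16, 3)

# Early fusion: stack a whole time window on the channel axis, so the
# first conv layer filters all frames of the window at once.
early = clip[:4].transpose(1, 2, 0, 3).reshape(H, W, 4 * C)   # (16, 16, 12)

# Late fusion: two separate single-frame towers on frames far apart
# in time; their features merge only in the fully connected layers.
late = (clip[0], clip[-1])            # two (16, 16, 3) inputs

# Slow fusion: overlapping short windows feed parallel lower towers;
# higher layers see progressively larger temporal extents.
windows = [clip[i:i + 4] for i in range(0, T - 3, 2)]   # 4-frame windows, stride 2
print(early.shape, late[0].shape, len(windows))
```

The actual window lengths and strides in the paper differ; the point is only how the time axis enters each variant.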
MULTIRESOLUTION CNN
• CNNs take weeks to train on large-scale datasets,
so runtime performance is critically
important.
• One approach was to reduce the number of
layers, which would lower performance.
• Another approach was to reduce the size of the
input images, which would lower accuracy.
• The final solution was to feed two streams: one
at half resolution, referred to as the context
stream, and a center-cropped version of
the original frame, referred to as the fovea stream.
• An improvement was observed, since in most
online videos the object of interest often
occupies the center region.
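The two streams can be sketched as simple preprocessing. The frame size below matches the 178×178 input frames reported in the paper; the striding downsample is a stand-in for proper image resizing:

```python
import numpy as np

def context_and_fovea(frame):
    """Split a frame into a half-resolution context stream and a
    full-resolution center-crop fovea stream of the same spatial size."""
    H, W, C = frame.shape
    # Context: naive 2x downsampling by striding (stand-in for resizing).
    context = frame[::2, ::2]                    # (H/2, W/2, C)
    # Fovea: center crop at full resolution, same size as the context input.
    h, w = H // 2, W // 2
    top, left = (H - h) // 2, (W - w) // 2
    fovea = frame[top:top + h, left:left + w]    # (H/2, W/2, C)
    return context, fovea

frame = np.random.randn(178, 178, 3)
ctx, fov = context_and_fovea(frame)
print(ctx.shape, fov.shape)                      # both (89, 89, 3)
```

Because each stream is only a quarter of the original pixels, the two towers together process roughly half the input area of a single full-resolution network, which is where the runtime saving comes from.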
RESULTS
UCF-101
• A transfer learning experiment was performed
on another dataset, the UCF-101 Activity
Recognition dataset, which consists of
13,320 videos belonging to 101 categories.
• The following scenarios were considered:
• Fine-tune top layer: only the last
layer is retrained
• Fine-tune top 3 layers: along with the last
layer, two fully connected
layers are retrained
• Fine-tune all layers: all layers,
including the convolutional layers, are retrained
• Train from scratch: the full network is
trained from scratch
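The four scenarios amount to choosing which layers stay trainable. A minimal sketch with a hypothetical layer list (the names below are illustrative, not taken from the paper):

```python
# Hypothetical layer list: convolutional layers, then fully connected
# layers, then the softmax classifier.
layers = ["conv1", "conv2", "conv3", "conv4", "conv5",
          "fc6", "fc7", "softmax"]

def trainable_layers(scenario):
    """Return which layers are retrained under each transfer scenario."""
    if scenario == "fine-tune top layer":
        return layers[-1:]        # retrain only the classifier
    if scenario == "fine-tune top 3 layers":
        return layers[-3:]        # classifier + two fully connected layers
    if scenario == "fine-tune all layers":
        return layers[:]          # everything, including conv layers
    if scenario == "train from scratch":
        return layers[:]          # same set, but weights re-initialised
    raise ValueError(scenario)

print(trainable_layers("fine-tune top 3 layers"))
```

In a deep-learning framework the same choice is usually expressed by freezing the remaining layers' parameters (e.g. disabling their gradients) rather than by a list like this.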
CONCLUSION
• CNNs are capable of learning not only image recognition but also video classification.
• The Slow Fusion model consistently performs better than Early and Late Fusion.
• The transfer learning experiment on UCF-101 suggests that the highest performance is
obtained by retraining the top 3 layers.
FUTURE WORK
• Hope to incorporate broader categories in the dataset to obtain more powerful and generic
features.
• Explore recurrent neural networks as a more powerful technique for combining clip-level
predictions into a global video-level prediction.
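As a baseline for what an RNN would replace, clip-level predictions can be combined into a video-level prediction by simple averaging. A small sketch with made-up class probabilities:

```python
import numpy as np

def video_prediction(clip_probs):
    """Average per-clip class probabilities into one video-level
    prediction (the simple aggregation an RNN would replace)."""
    return np.mean(clip_probs, axis=0)

# 5 clips from one video, 4 classes; rows are clip-level softmax outputs.
clip_probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.8, 0.1, 0.0, 0.1],
    [0.5, 0.3, 0.1, 0.1],
])
video = video_prediction(clip_probs)
print(video.argmax())   # class 0 wins after averaging
```

Averaging treats every clip independently and ignores their temporal order, which is exactly the information a recurrent model could exploit.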
