LARGE-SCALE VIDEO CLASSIFICATION WITH
CONVOLUTIONAL NEURAL NETWORKS
ANDREJ KARPATHY LI FEI-FEI SANKETH SHETTY
RAHUL SUKHTHANKAR THOMAS LEUNG GEORGE TODERICI
PRESENTED BY: KHALID KHAN
SUMMARY
• Convolutional Neural Networks have been established as a powerful class of models for image
recognition problems.
• This paper provides an extensive empirical evaluation of CNNs on large-scale video
classification using a new dataset of approx. 1 million YouTube videos.
• Multiple approaches were studied for extending the connectivity of a CNN in the time domain.
• Suggested a multiresolution architecture as a promising way of speeding up training.
• Some performance improvements were observed compared to previous feature-based and
single-frame models.
CONVOLUTIONAL NEURAL NETWORK
• Similar to other neural networks, a CNN consists of several layers, each containing neurons
that operate independently of one another.
• Each neuron has learnable weights; it receives some input, performs an operation, and
passes its output to neurons in the next layer.
• A CNN consists of an input layer, an output layer, and multiple hidden layers, which include
convolutional, pooling, and fully connected layers.
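The stack of layers above can be sketched end to end. Below is a minimal NumPy forward pass through one convolutional layer, a ReLU, and a max-pooling layer; the sizes and weights are toy values for illustration, not the paper's architecture:

```python
import numpy as np

def conv2d(x, w, b):
    """Valid 2-D convolution (cross-correlation, as in CNN libraries)."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w) + b
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * size:(i + 1) * size,
                          j * size:(j + 1) * size].max()
    return out

x = np.random.randn(8, 8)                 # toy single-channel "image"
w = np.random.randn(3, 3)                 # one learnable 3x3 filter
h = max_pool(np.maximum(conv2d(x, w, 0.0), 0))   # conv -> ReLU -> pool
print(h.shape)                            # (3, 3): (8 - 3 + 1) // 2 = 3
```

In a full network this conv/ReLU/pool block repeats several times before the fully connected layers produce class scores.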
VIDEO CLASSIFICATION USING CNN
• A new dataset, Sports-1M, is used to train the CNN architecture.
• Sports-1M consists of 1 million YouTube videos belonging to 487 classes of sports.
• Provides an architecture that processes the input at two different resolutions – a low-resolution
context stream and a high-resolution fovea stream – to improve runtime performance.
• Applying the network to another dataset, UCF-101, yields a significant improvement
compared to the results obtained by training networks on UCF-101 alone.
RELATED WORK
• CNNs have been applied to small-scale image recognition problems on datasets such as MNIST,
CIFAR-10/100, NORB, and Caltech-101/256.
• There has been little to no work on applying CNNs to video classification.
• Available video datasets contain only a few thousand clips and a few dozen classes, which
may explain the lack of work on video classification.
MODELS
• Videos are divided into short clips,
and frames are generated from each clip.
• Three broad connectivity pattern
categories are described: Early
Fusion, Late Fusion, and Slow
Fusion.
• Early Fusion combines information
across an entire time window in the
first convolutional layer.
• Late Fusion places two separate
single-frame networks on frames far
apart in time and merges them in
the fully connected layer.
• Slow Fusion is a combination of Early
and Late Fusion, in which higher
layers have access to progressively
more global information in time.
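The three fusion patterns differ mainly in how a clip's frames are wired into the network's input. A rough NumPy sketch of the input shapes, using hypothetical clip and window sizes:

```python
import numpy as np

T, H, W, C = 10, 16, 16, 3            # hypothetical clip: 10 RGB frames
clip = np.random.randn(T, H, W, C)

# Single-frame baseline: the network sees one frame, 3 channels.
single = clip[T // 2]                 # (16, 16, 3)

# Early fusion: stack a whole time window on the channel axis, so the
# first conv layer filters all frames of the window at once.
early = clip[:4].transpose(1, 2, 0, 3).reshape(H, W, 4 * C)   # (16, 16, 12)

# Late fusion: two separate single-frame towers on frames far apart
# in time; their features merge only in the fully connected layers.
late = (clip[0], clip[-1])            # two (16, 16, 3) inputs

# Slow fusion: overlapping short windows feed parallel lower towers;
# higher layers see progressively larger temporal extents.
windows = [clip[i:i + 4] for i in range(0, T - 3, 2)]   # 4-frame windows, stride 2
print(early.shape, late[0].shape, len(windows))
```

The actual window lengths and strides in the paper differ; the point is only how the time axis enters each variant.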
MULTIRESOLUTION CNN
• CNNs take weeks to train on large-scale datasets,
so runtime performance is critically
important.
• One approach was to reduce the number of
layers, which would lower performance.
• Another approach was to reduce the size of the
input images, which would lower accuracy.
• The final solution was to feed two streams: one
at half resolution, referred to as the context
stream, and a center-cropped version of
the original frame, referred to as the fovea stream.
• An improvement was observed, since in most
online videos the object of interest often
occupies the center region.
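The two streams can be sketched as simple preprocessing. The frame size below matches the 178×178 input frames reported in the paper; the striding downsample is a stand-in for proper image resizing:

```python
import numpy as np

def context_and_fovea(frame):
    """Split a frame into a half-resolution context stream and a
    full-resolution center-crop fovea stream of the same spatial size."""
    H, W, C = frame.shape
    # Context: naive 2x downsampling by striding (stand-in for resizing).
    context = frame[::2, ::2]                    # (H/2, W/2, C)
    # Fovea: center crop at full resolution, same size as the context input.
    h, w = H // 2, W // 2
    top, left = (H - h) // 2, (W - w) // 2
    fovea = frame[top:top + h, left:left + w]    # (H/2, W/2, C)
    return context, fovea

frame = np.random.randn(178, 178, 3)
ctx, fov = context_and_fovea(frame)
print(ctx.shape, fov.shape)                      # both (89, 89, 3)
```

Because each stream is only a quarter of the original pixels, the two towers together process roughly half the input area of a single full-resolution network, which is where the runtime saving comes from.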
RESULTS
UCF-101
• A transfer learning experiment was performed
on another dataset, the UCF-101 Activity
Recognition dataset, which consists of
13,320 videos belonging to 101 categories.
• The following scenarios were considered:
• Fine-tune top layer: only the last
layer is retrained
• Fine-tune top 3 layers: along with the last
layer, two fully connected
layers are retrained
• Fine-tune all layers: all layers,
including the convolutional layers, are retrained
• Train from scratch: the full network is
trained from scratch
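The four scenarios amount to choosing which layers stay trainable. A minimal sketch with a hypothetical layer list (the names below are illustrative, not taken from the paper):

```python
# Hypothetical layer list: convolutional layers, then fully connected
# layers, then the softmax classifier.
layers = ["conv1", "conv2", "conv3", "conv4", "conv5",
          "fc6", "fc7", "softmax"]

def trainable_layers(scenario):
    """Return which layers are retrained under each transfer scenario."""
    if scenario == "fine-tune top layer":
        return layers[-1:]        # retrain only the classifier
    if scenario == "fine-tune top 3 layers":
        return layers[-3:]        # classifier + two fully connected layers
    if scenario == "fine-tune all layers":
        return layers[:]          # everything, including conv layers
    if scenario == "train from scratch":
        return layers[:]          # same set, but weights re-initialised
    raise ValueError(scenario)

print(trainable_layers("fine-tune top 3 layers"))
```

In a deep-learning framework the same choice is usually expressed by freezing the remaining layers' parameters (e.g. disabling their gradients) rather than by a list like this.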
CONCLUSION
• CNNs are capable of learning not only image recognition but also video classification.
• The Slow Fusion model consistently performs better than Early and Late Fusion.
• The transfer learning experiment on UCF-101 suggests that the highest performance is
obtained by retraining the top 3 layers.
FUTURE WORK
• Hope to incorporate broader categories in the dataset to obtain more powerful and generic
features.
• Explore recurrent neural networks as a more powerful technique for combining clip-level
predictions into a global video-level prediction.
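As a baseline for what an RNN would replace, clip-level predictions can be combined into a video-level prediction by simple averaging. A small sketch with made-up class probabilities:

```python
import numpy as np

def video_prediction(clip_probs):
    """Average per-clip class probabilities into one video-level
    prediction (the simple aggregation an RNN would replace)."""
    return np.mean(clip_probs, axis=0)

# 5 clips from one video, 4 classes; rows are clip-level softmax outputs.
clip_probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.8, 0.1, 0.0, 0.1],
    [0.5, 0.3, 0.1, 0.1],
])
video = video_prediction(clip_probs)
print(video.argmax())   # class 0 wins after averaging
```

Averaging treats every clip independently and ignores their temporal order, which is exactly the information a recurrent model could exploit.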
