Building a TensorFlow-based model that extracts the "best" frames from a video, which are then used as auto-generated thumbnails and thumbstrips. We used transfer learning on Google's Inception v3 model, which was pretrained on ImageNet data and retrained on JW Player's thumbnail library.
2. JW Player
1. Company
a. Open-source video player
b. Hosting platform
c. 5% of global internet video traffic
d. 150+ team members
2. Data Team
a. Handling 5MM events per minute
b. Storing 1TB+ per day
c. Stack: Storm (Trident), Kafka, Luigi, Elasticsearch, Spark, AWS, MySQL
3. Thumbnails are Important
● Your video's first impression
● Types: Upload, Manual, Auto (default)
● Manual >> Auto in Play Rate
● Current Auto is the frame at the 10th second
● Many big publishers only use Manual
● 90% of Thumbnails are Auto! :-(
source: tastingtable.com (2016-10-12)
4. What’s a “Good” Thumbnail?
It’s subjective: it depends on the viewer!
Common themes:
● Not blurry
● Balanced brightness
● Centered objects
● Large text overlay
● Relevant to subject
Good vs. bad example frames (Source: Big Buck Bunny, Blender Studios)
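Two of these themes, blurriness and balanced brightness, can be scored with simple pixel statistics. A minimal numpy sketch (illustrative heuristics only, not the features the model actually learns):

```python
import numpy as np

def brightness(gray):
    """Mean pixel intensity of a grayscale image."""
    return float(gray.mean())

def sharpness(gray):
    """Variance of a discrete Laplacian; low values suggest blur.
    np.roll wraps at the edges, which is acceptable for a rough score."""
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0) +
           np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4.0 * gray)
    return float(lap.var())

def box_blur(gray):
    """3x3 box blur; blurring should lower the sharpness score."""
    return sum(np.roll(np.roll(gray, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
```

A blurred copy of any noisy frame will score lower on `sharpness` than the original, which is the property a pre-filter would exploit.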
5. Manually Creating a Model is Hard
● Which features to extract?
● How to describe those features?
● How to weight features?
● How to penalize overfitting of models?
● Many techniques: SIFT, SURF, HOG?
Need to be an expert in Computer Vision :-(
So many image features: edge detection, color histograms, pixel segmentation, ...
6. Deep Learning
● Learn features implicitly
● Learn from examples
● Techniques to avoid overfitting
● Success in a lot of applications:
○ Image classification
○ Image captioning
○ Machine translation
○ Speech-to-Text
7. Inception
● Learn multiple filter paths (“towers”) in parallel; concatenate their outputs into a “module”
● Factorized convolutions: e.g. a 1x1 conv to reduce channels, followed by a 3x3
● Parameter reduction: GoogLeNet (5MM) vs. AlexNet (60MM), VGG (~140MM)
● Auxiliary classifiers for regularization
● Residual connections (Inception-ResNet)
● Depthwise separable convolutions (Xception)
https://www.udacity.com/course/deep-learning--ud730
https://arxiv.org/abs/1409.4842
Source: Rethinking the Inception Architecture for Computer Vision
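The parameter-reduction bullet can be checked with back-of-the-envelope arithmetic. The channel counts below follow the 5x5 branch of GoogLeNet's inception (3a) module (192 input channels, reduced to 16 by a 1x1 conv before the 5x5 producing 32); treat them as an illustration:

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k

# Direct 5x5 conv: 192 -> 32 channels
direct = conv_params(192, 32, 5)

# Bottleneck: 1x1 reduce 192 -> 16, then 5x5 to 32 channels
factored = conv_params(192, 16, 1) + conv_params(16, 32, 5)

print(direct, factored)  # 153600 vs 15872: roughly a 10x reduction
```

The same trick applied throughout the network is a large part of how GoogLeNet stays at ~5MM parameters.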
8. 1x1 Convolutions: What’s the Point?
1. Dimensionality reduction: fewer channels, strides, feature pooling
2. Parameter reduction: faster, less overfitting
3. “Cheap” nonlinearity: a 1x1 followed by a 3x3 gives an extra nonlinearity
4. Treats cross-channel and spatial correlations independently
Figures: 1x1 convolution with strides; pooling with 1x1 convolution
Source: http://iamaaditya.github.io/2016/03/one-by-one-convolution/
In Convolutional Nets, there is no such thing as
“fully-connected layers”. There are only
convolution layers with 1x1 convolution kernels. –
Yann LeCun
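LeCun's point can be made concrete: a 1x1 convolution is the same fully-connected map applied at every spatial position. A numpy sketch (shapes are illustrative):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (H, W, C_in), w is (C_in, C_out).
    Every pixel's channel vector is multiplied by the same matrix,
    i.e. one fully-connected layer shared across all positions."""
    return np.einsum('hwc,cd->hwd', x, w)

x = np.ones((4, 4, 192))       # feature map with 192 channels
w = np.zeros((192, 16))        # learnable weights: 192 -> 16 channels
y = conv1x1(x, w)
print(y.shape)                 # (4, 4, 16): channels reduced, HxW kept
```

This is exactly the dimensionality-reduction use in bullet 1: spatial extent untouched, channel count shrunk before the expensive 3x3/5x5 convs.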
10. Transfer Learning
● Use a pre-trained model (ImageNet: 1,000,000 images, 1,000 categories)
○ Cheaper (no GPU required)
○ Faster
○ Prevents overfitting
● Penultimate (“Bottleneck”) layer contains the image’s “essence” (CNN codes); acts as a feature extractor
● Just add a linear classifier (Softmax or linear SVM) on the Bottleneck features
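The "just add a linear classifier" step needs no deep-learning framework at all: train a softmax layer on fixed feature vectors. In this sketch random vectors stand in for the 2,048-dim Inception v3 bottleneck codes, with a shrunken dimension and synthetic labels, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for bottleneck features ("CNN codes"); the real ones are
# 2,048-dim activations from a forward pass through frozen Inception v3.
n, dim = 500, 64
codes = rng.normal(size=(n, dim))
labels = (codes[:, 0] > 0).astype(int)   # toy Manual-vs-Auto labels

# Softmax (multinomial logistic) classifier, batch gradient descent.
W = np.zeros((dim, 2))
onehot = np.eye(2)[labels]
for _ in range(200):
    logits = codes @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * codes.T @ (p - onehot) / n   # cross-entropy gradient step

train_acc = ((codes @ W).argmax(axis=1) == labels).mean()
```

Because the expensive convolutional layers are frozen, only this small `W` is learned, which is why transfer learning is cheap and needs no GPU.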
11. Fine-Tuning + Tips
● Change the classification layer and backprop several layers back
● Idea: early layers learn basic filters; later layers are more dataset-specific
● Generally use a pre-trained model regardless of data size or similarity

Strategy by data size (per class) vs. similarity to the original dataset:

                       < 500       > 500                  > 5,000
Similar to original    Too small   TL                     TL + FT earlier layers
Not similar            Too small   TL on earlier layers   TL + FT entire network

(TL = transfer learning; FT = fine-tuning)
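The table above reads naturally as a lookup; a sketch (thresholds taken straight from the table, strategy strings are shorthand):

```python
def tl_strategy(examples_per_class, similar_to_original):
    """Pick a transfer-learning (TL) / fine-tuning (FT) strategy from
    data size per class and similarity to the original dataset."""
    if examples_per_class < 500:
        return "too small"
    if similar_to_original:
        return ("TL + FT earlier layers" if examples_per_class > 5000
                else "TL")
    return ("TL + FT entire network" if examples_per_class > 5000
            else "TL on earlier layers")
```

With 10K+ Manual thumbnails that are quite unlike ImageNet photos in composition, this rule of thumb would point toward fine-tuning deeper into the network as the dataset grows.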
12. Other Applications of Transfer Learning
● Image Captioning: Google “Show and Tell”
https://github.com/tensorflow/models/tree/master/im2txt
● Image Search
http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
13. Training: Thesis
Train to differentiate between Manual and Auto thumbnails
● Manual thumbnails are (usually) better than Auto
● Select Manual thumbnails with high views and play rate; the Auto sample is random, drawn from low-play videos
● We have a lot of examples: 10K+ Manual
● We used Inception v3 pre-trained on ImageNet
15. Video Pre-Filter
Use FFmpeg to select the top 100 frame candidates
Methods:
● Color histogram changes to avoid dupes
● Coded macroblock information
● Remove “black” frames
● Measure motion vectors
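One way to implement the scene-change part of such a pre-filter is FFmpeg's `select` filter. A sketch that only builds the command line; the 0.3 threshold, output pattern, and frame cap are illustrative defaults, not JW Player's actual settings:

```python
def candidate_frames_cmd(video_path, out_pattern="frame_%03d.jpg",
                         scene_threshold=0.3, max_frames=100):
    """Build an ffmpeg command that keeps only frames whose scene-change
    score exceeds a threshold, capped at max_frames output images."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{scene_threshold})'",
        "-vsync", "vfr",              # re-time output after dropping frames
        "-frames:v", str(max_frames),
        out_pattern,
    ]
```

Running the returned command (e.g. via `subprocess.run`) would dump candidate frames to disk; duplicate suppression via color histograms and black-frame removal would be additional filter stages.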
20. What’s Next
● Refinements:
○ Fine-tuning earlier layers
○ Other models: ResNet v2, Xception
○ Pre-filtering: adaptive, hardware acceleration
● Products:
○ New auto thumbnails
○ Thumbstrips
21. Resources
Blog Posts:
● https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
● https://github.com/tensorflow/models/tree/master/inception
● http://iamaaditya.github.io/2016/03/one-by-one-convolution/
● http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
● https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
● http://cs231n.github.io/transfer-learning/
● https://research.googleblog.com/2015/10/improving-youtube-video-thumbnails-with.html
● https://pseudoprofound.wordpress.com/2016/08/28/notes-on-the-tensorflow-implementation-of-inception-v3/
● https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
Papers:
● Rethinking the Inception Architecture for Computer Vision. https://arxiv.org/abs/1512.00567
● Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357
● CNN Features off-the-shelf: an Astounding Baseline for Recognition. https://arxiv.org/abs/1403.6382
● DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. https://arxiv.org/abs/1310.1531
● How transferable are features in deep neural networks? https://arxiv.org/abs/1411.1792