Building a TensorFlow-based model that extracts the "best" frames from a video, which are then used as auto-generated thumbnails and thumbstrips. We used transfer learning on Google's Inception v3 model, which was pretrained on ImageNet data and retrained on JW Player's thumbnail library.
2. JW Player
1. Company
a. Open-source video player
b. Hosting platform
c. 5% of global internet video traffic
d. 150+ team members
2. Data Team
a. Handling 5MM events per minute
b. Storing 1TB+ per day
c. Stack: Storm (Trident), Kafka, Luigi, Elasticsearch, Spark, AWS, MySQL
3. Thumbnails are Important
● Your video's first impression
● Types: Upload, Manual, Auto (default)
● Manual >> Auto in Play Rate
● Current Auto is the frame at the 10th second
● Many big publishers only use Manual
● 90% of Thumbnails are Auto! :-(
source: tastingtable.com (2016-10-12)
4. What’s a “Good” Thumbnail?
It’s subjective: it depends on the viewer!
Common themes:
● Not blurry
● Balanced brightness
● Centered objects
● Large text overlay
● Relevant to subject
Good vs. bad example frames (Source: Big Buck Bunny, Blender Studios)
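Two of these themes, blurriness and balanced brightness, can be scored with simple pixel statistics. A minimal numpy sketch (illustrative heuristics only, not the features the model actually learns):

```python
import numpy as np

def brightness(gray):
    """Mean pixel intensity of a grayscale image."""
    return float(gray.mean())

def sharpness(gray):
    """Variance of a discrete Laplacian; low values suggest blur.
    np.roll wraps at the edges, which is acceptable for a rough score."""
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0) +
           np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4.0 * gray)
    return float(lap.var())

def box_blur(gray):
    """3x3 box blur; blurring should lower the sharpness score."""
    return sum(np.roll(np.roll(gray, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
```

A blurred copy of any noisy frame will score lower on `sharpness` than the original, which is the property a pre-filter would exploit.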
5. Manually Creating a Model is Hard
● Which features to extract?
● How to describe those features?
● How to weight features?
● How to penalize overfitting of models?
● Many techniques: SIFT, SURF, HOG?
Need to be an expert in Computer Vision :-(
So many image features: edge detection, color histograms, pixel segmentation, ...
6. Deep Learning
● Learn features implicitly
● Learn from examples
● Techniques to avoid overfitting
● Success in a lot of applications:
○ Image classification
○ Image captioning
○ Machine translation
○ Speech-to-Text
7. Inception
● Learn multiple filter paths (“towers”) in parallel; concatenate their outputs into a “module”
● Factorized convolutions: e.g. a 1x1 conv to reduce channels, followed by a 3x3
● Parameter reduction: GoogLeNet (5MM) vs. AlexNet (60MM), VGG (~140MM)
● Auxiliary classifiers for regularization
● Residual connections (Inception-ResNet)
● Depthwise separable convolutions (Xception)
https://www.udacity.com/course/deep-learning--ud730
https://arxiv.org/abs/1409.4842
Source: Rethinking the Inception Architecture for Computer Vision
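The parameter-reduction bullet can be checked with back-of-the-envelope arithmetic. The channel counts below follow the 5x5 branch of GoogLeNet's inception (3a) module (192 input channels, reduced to 16 by a 1x1 conv before the 5x5 producing 32); treat them as an illustration:

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k

# Direct 5x5 conv: 192 -> 32 channels
direct = conv_params(192, 32, 5)

# Bottleneck: 1x1 reduce 192 -> 16, then 5x5 to 32 channels
factored = conv_params(192, 16, 1) + conv_params(16, 32, 5)

print(direct, factored)  # 153600 vs 15872: roughly a 10x reduction
```

The same trick applied throughout the network is a large part of how GoogLeNet stays at ~5MM parameters.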
8. 1x1 Convolutions: What’s the Point?
1. Dimensionality reduction: fewer channels, strides, feature pooling
2. Parameter reduction: faster, less overfitting
3. “Cheap” nonlinearity: a 1x1 followed by a 3x3 gives an extra nonlinearity
4. Treats cross-channel and spatial correlations independently
Figures: 1x1 convolution with strides; pooling with 1x1 convolution
Source: http://iamaaditya.github.io/2016/03/one-by-one-convolution/
In Convolutional Nets, there is no such thing as
“fully-connected layers”. There are only
convolution layers with 1x1 convolution kernels. –
Yann LeCun
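LeCun's point can be made concrete: a 1x1 convolution is the same fully-connected map applied at every spatial position. A numpy sketch (shapes are illustrative):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (H, W, C_in), w is (C_in, C_out).
    Every pixel's channel vector is multiplied by the same matrix,
    i.e. one fully-connected layer shared across all positions."""
    return np.einsum('hwc,cd->hwd', x, w)

x = np.ones((4, 4, 192))       # feature map with 192 channels
w = np.zeros((192, 16))        # learnable weights: 192 -> 16 channels
y = conv1x1(x, w)
print(y.shape)                 # (4, 4, 16): channels reduced, HxW kept
```

This is exactly the dimensionality-reduction use in bullet 1: spatial extent untouched, channel count shrunk before the expensive 3x3/5x5 convs.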
10. Transfer Learning
● Use a pre-trained model (ImageNet: 1,000,000 images, 1,000 categories)
○ Cheaper (no GPU required)
○ Faster
○ Prevents overfitting
● Penultimate (“Bottleneck”) layer contains the image’s “essence” (CNN codes); acts as a feature extractor
● Just add a linear classifier (Softmax or linear SVM) on the Bottleneck features
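The "just add a linear classifier" step needs no deep-learning framework at all: train a softmax layer on fixed feature vectors. In this sketch random vectors stand in for the 2,048-dim Inception v3 bottleneck codes, with a shrunken dimension and synthetic labels, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for bottleneck features ("CNN codes"); the real ones are
# 2,048-dim activations from a forward pass through frozen Inception v3.
n, dim = 500, 64
codes = rng.normal(size=(n, dim))
labels = (codes[:, 0] > 0).astype(int)   # toy Manual-vs-Auto labels

# Softmax (multinomial logistic) classifier, batch gradient descent.
W = np.zeros((dim, 2))
onehot = np.eye(2)[labels]
for _ in range(200):
    logits = codes @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * codes.T @ (p - onehot) / n   # cross-entropy gradient step

train_acc = ((codes @ W).argmax(axis=1) == labels).mean()
```

Because the expensive convolutional layers are frozen, only this small `W` is learned, which is why transfer learning is cheap and needs no GPU.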
11. Fine-Tuning + Tips
● Change the classification layer and backprop several layers back
● Idea: early layers learn basic filters; later layers are more dataset-specific
● Generally use a pre-trained model regardless of data size or similarity

Strategy by data size (per class) vs. similarity to the original dataset:

                       < 500       > 500                  > 5,000
Similar to original    Too small   TL                     TL + FT earlier layers
Not similar            Too small   TL on earlier layers   TL + FT entire network

(TL = transfer learning; FT = fine-tuning)
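The table above reads naturally as a lookup; a sketch (thresholds taken straight from the table, strategy strings are shorthand):

```python
def tl_strategy(examples_per_class, similar_to_original):
    """Pick a transfer-learning (TL) / fine-tuning (FT) strategy from
    data size per class and similarity to the original dataset."""
    if examples_per_class < 500:
        return "too small"
    if similar_to_original:
        return ("TL + FT earlier layers" if examples_per_class > 5000
                else "TL")
    return ("TL + FT entire network" if examples_per_class > 5000
            else "TL on earlier layers")
```

With 10K+ Manual thumbnails that are quite unlike ImageNet photos in composition, this rule of thumb would point toward fine-tuning deeper into the network as the dataset grows.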
12. Other Applications of Transfer Learning
● Image Captioning: Google “Show and Tell”
https://github.com/tensorflow/models/tree/master/im2txt
● Image Search
http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
13. Training: Thesis
Train to differentiate between Manual and Auto thumbnails
● Manual thumbnails are (usually) better than Auto
● Select Manual thumbnails with high views and play rate; the Auto sample is random, drawn from low-play videos
● We have a lot of examples: 10K+ Manual
● We used Inception v3 pre-trained on ImageNet
15. Video Pre-Filter
Use FFmpeg to select the top 100 frame candidates
Methods:
● Color histogram changes to avoid dupes
● Coded macroblock information
● Remove “black” frames
● Measure motion vectors
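One way to implement the scene-change part of such a pre-filter is FFmpeg's `select` filter. A sketch that only builds the command line; the 0.3 threshold, output pattern, and frame cap are illustrative defaults, not JW Player's actual settings:

```python
def candidate_frames_cmd(video_path, out_pattern="frame_%03d.jpg",
                         scene_threshold=0.3, max_frames=100):
    """Build an ffmpeg command that keeps only frames whose scene-change
    score exceeds a threshold, capped at max_frames output images."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{scene_threshold})'",
        "-vsync", "vfr",              # re-time output after dropping frames
        "-frames:v", str(max_frames),
        out_pattern,
    ]
```

Running the returned command (e.g. via `subprocess.run`) would dump candidate frames to disk; duplicate suppression via color histograms and black-frame removal would be additional filter stages.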
20. What’s Next
● Refinements:
○ Fine-tuning earlier layers
○ Other models: ResNet v2, Xception
○ Pre-filtering: adaptive, hardware acceleration
● Products:
○ New auto thumbnails
○ Thumbstrips
21. Resources
Blog Posts:
● https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
● https://github.com/tensorflow/models/tree/master/inception
● http://iamaaditya.github.io/2016/03/one-by-one-convolution/
● http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
● https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
● http://cs231n.github.io/transfer-learning/
● https://research.googleblog.com/2015/10/improving-youtube-video-thumbnails-with.html
● https://pseudoprofound.wordpress.com/2016/08/28/notes-on-the-tensorflow-implementation-of-inception-v3/
● https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
Papers:
● Rethinking the Inception Architecture for Computer Vision. https://arxiv.org/abs/1512.00567
● Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357
● CNN Features off-the-shelf: an Astounding Baseline for Recognition. https://arxiv.org/abs/1403.6382
● DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. https://arxiv.org/abs/1310.1531
● How transferable are features in deep neural networks? https://arxiv.org/abs/1411.1792