Intelligent Thumbnail Selection
Kamil Sindi, Lead Data Scientist
JW Player
1. Company
a. Open-source video player
b. Hosting platform
c. 5% of global internet video traffic
d. Team of 150+
2. Data Team
a. Handling 5MM events per minute
b. Storing 1TB+ per day
c. Stack: Storm (Trident), Kafka, Luigi,
Elasticsearch, Spark, AWS, MySQL
Thumbnails are Important
● Your video's first impression
● Types: Upload, Manual, Auto (default)
● Manual >> Auto in Play Rate
● Current Auto is the frame at the 10th second
● Many big publishers only use Manual
● 90% of Thumbnails are Auto! :-(
source: tastingtable.com (2016-10-12)
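For reference, the current Auto behavior (grab the frame at the 10th second) can be reproduced with FFmpeg; file paths here are illustrative:

```python
import subprocess

# Seek to t=10s and write a single frame, mirroring the current Auto default.
subprocess.run([
    "ffmpeg", "-ss", "10", "-i", "input.mp4",
    "-frames:v", "1", "auto_thumbnail.jpg",
], check=True)
```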
What’s a “Good” Thumbnail?
It’s subjective; it depends on the viewer!
Common themes:
● Not blurry
● Balanced brightness
● Centered objects
● Large text overlay
● Relevant to subject
(Image: good vs. bad thumbnail comparison. Source: Big Buck Bunny, Blender Studios)
Manually Creating a Model is Hard
● Which features to extract?
● How to describe those features?
● How to weight features?
● How to penalize overfitting of models?
● Many techniques: SIFT, SURF, HOG?
Need to be an expert in Computer Vision :-(
So many image features: edge detection, color histograms, pixel segmentation, ...
Deep Learning
● Learn features implicitly
● Learn from examples
● Techniques to avoid overfitting
● Success in a lot of applications:
○ Image classification
○ Image captioning
○ Machine translation
○ Speech-to-Text
Inception
● Learn multiple models in parallel; concatenate
their outputs (“modules”)
● Factored convolutions (“towers”): e.g., a 1x1
conv followed by a 3x3
● Parameter reduction: GoogLeNet (5MM) vs.
AlexNet (60MM), VGG (~140MM)
● Auxiliary classifiers for regularization
● Residual connections (Inception-v4 / Inception-ResNet)
● Depthwise separable convolutions (Xception)
https://www.udacity.com/course/deep-learning--ud730
https://arxiv.org/abs/1409.4842
Source: Rethinking the Inception Architecture for Computer Vision
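A minimal sketch of one Inception-style module in Keras, showing the parallel towers and channel-wise concatenation described above; the filter counts are illustrative, not the paper’s exact configuration:

```python
from tensorflow.keras import layers, Input, Model

def inception_module(x):
    # Tower 1: 1x1 convolution only.
    t1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    # Tower 2: 1x1 "bottleneck" then 3x3 (factored convolution).
    t2 = layers.Conv2D(96, 1, padding="same", activation="relu")(x)
    t2 = layers.Conv2D(128, 3, padding="same", activation="relu")(t2)
    # Tower 3: 1x1 bottleneck then 5x5.
    t3 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    t3 = layers.Conv2D(32, 5, padding="same", activation="relu")(t3)
    # Tower 4: 3x3 max-pool then 1x1 projection.
    t4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    t4 = layers.Conv2D(32, 1, padding="same", activation="relu")(t4)
    # Concatenate the tower outputs along the channel axis.
    return layers.Concatenate(axis=-1)([t1, t2, t3, t4])

inputs = Input(shape=(299, 299, 3))
outputs = inception_module(inputs)
model = Model(inputs, outputs)
```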
1x1 Convolutions: what’s the point?
1. Dimensionality reduction: fewer
channels, strides, feature pooling
2. Parameter reduction: faster, less
overfitting
3. “Cheap” nonlinearity: a 1x1 followed by a 3x3 adds an extra nonlinear activation
4. Decouples cross-channel correlations from spatial correlations
(Images: 1x1 convolution with strides; pooling with 1x1 convolution)
Source: http://iamaaditya.github.io/2016/03/one-by-one-convolution/
In Convolutional Nets, there is no such thing as
“fully-connected layers”. There are only
convolution layers with 1x1 convolution kernels. –
Yann LeCun
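To make points 1 and 2 concrete, a quick parameter count for a hypothetical layer with 256 input and 256 output channels:

```python
# Direct 3x3 convolution: 256 -> 256 channels (bias terms omitted).
direct = 3 * 3 * 256 * 256  # 589,824 parameters

# 1x1 bottleneck down to 64 channels, then 3x3 back up to 256.
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 256  # 163,840 parameters

print(direct / bottleneck)  # ~3.6x fewer parameters, plus an extra nonlinearity
```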
InceptionV3 Architecture
https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
(Diagram: InceptionV3 architecture; example output: Dog (0.80), Cat (0.05), Rat (0.01), ...)
Transfer Learning
(Pre-trained on ImageNet: 1,000,000 images, 1,000 categories)
● Use a pre-trained model
○ Cheaper (no GPU required)
○ Faster
○ Prevents overfitting
● The penultimate (“bottleneck”) layer contains the
image’s “essence” (CNN codes); it acts as a
feature extractor
● Just add a linear classifier (Softmax; linear SVM)
on top of the bottleneck
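A minimal sketch of bottleneck feature extraction with a pre-trained InceptionV3, assuming the standard tensorflow.keras API; the file path is illustrative:

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.preprocessing import image

# Pre-trained on ImageNet; include_top=False drops the 1,000-way classifier,
# pooling="avg" yields one 2048-d bottleneck vector per image.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def bottleneck_features(path):
    img = image.load_img(path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return base.predict(x)  # shape (1, 2048): the image's "CNN codes"

codes = bottleneck_features("thumbnail.jpg")
```

A Softmax or linear SVM is then trained on these codes instead of on raw pixels.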
Fine-Tuning + Tips
● Replace the classification layer and
backprop a few layers back
● Idea: early layers learn generic filters;
later layers are more dataset-specific
● Generally use a pre-trained model
regardless of data size or similarity

                      Data Size (per class)
                      < 500       > 500                  > 5,000
Similar to original   Too small   TL                     TL + FT earlier layers
Not similar           Too small   TL on earlier layers   TL + FT entire network

(TL = transfer learning; FT = fine-tuning)
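A sketch of the fine-tuning recipe from the table’s right-hand cells: freeze the early generic layers, unfreeze the later ones, and retrain with a small learning rate. The 2-way head, cutoff index, and learning rate are illustrative choices, not the production configuration:

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

# New 2-way Softmax head replaces the original 1,000-way classifier.
head = layers.Dense(2, activation="softmax")(base.output)
model = models.Model(base.input, head)

# Freeze early layers (generic filters); fine-tune only the later,
# dataset-specific layers. 249 is an illustrative cutoff.
for layer in model.layers[:249]:
    layer.trainable = False
for layer in model.layers[249:]:
    layer.trainable = True

# Low learning rate so backprop nudges, rather than destroys, pre-trained weights.
model.compile(optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy")
```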
Other Applications of Transfer Learning
Image Captioning: Google “Show and Tell”
https://github.com/tensorflow/models/tree/master/im2txt
Image Search:
http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
Training: Thesis
Train a classifier to differentiate between Manual and Auto thumbnails
● Manual thumbnails are (usually) better than Auto
● Select Manual examples with high views and play rate;
select Auto examples at random from low-play videos
● We have a lot of examples: 10K+ Manual thumbnails
● We used InceptionV3 pre-trained on ImageNet
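A sketch of how this thesis might translate into a classifier: a linear model on the bottleneck codes extracted earlier, with Manual as the positive class. The .npy paths and array names are placeholders, not the production pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Bottleneck vectors extracted as in the earlier sketch, shape (n, 2048).
# Paths are illustrative placeholders.
manual_codes = np.load("manual_codes.npy")
auto_codes = np.load("auto_codes.npy")

X = np.vstack([manual_codes, auto_codes])
y = np.concatenate([np.ones(len(manual_codes)),   # 1 = Manual (positive)
                    np.zeros(len(auto_codes))])   # 0 = Auto (negative)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Score candidate frames: probability that a frame "looks Manual".
candidate_codes = np.load("candidate_codes.npy")
scores = clf.predict_proba(candidate_codes)[:, 1]
```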
Training: Examples
(Images: positive examples are Manual thumbnails; negative examples are Auto thumbnails)
Video Pre-Filter
Use FFmpeg to select the top 100 candidate
frames (sketch below)
Methods:
● Color-histogram changes to avoid
dupes
● Coded macroblock information
● Remove “black” frames
● Measure motion vectors
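A sketch of the scene-change and black-frame parts of such a pre-filter, driving FFmpeg from Python; the thresholds and paths are illustrative, not the production values:

```python
import subprocess

# Step 1: keep frames whose scene-change score exceeds 0.3 (avoids
# near-duplicate frames) and cap output at 100 candidate JPEGs.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-vf", "select='gt(scene,0.3)'",
    "-vsync", "vfr", "-frames:v", "100", "candidate_%03d.jpg",
], check=True)

# Step 2: log nearly-black frames so their timestamps can be excluded
# from the candidate set (blackframe only reports; it does not drop).
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vf", "blackframe=amount=98",
     "-f", "null", "-"],
    check=True,
)
```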
Motion Vectors
Source: Sintel, Blender Studios
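Motion vectors can be pulled straight from the codec. For example, FFmpeg’s export_mvs flag and codecview filter overlay them on the video; the output path is illustrative:

```python
import subprocess

# +export_mvs exposes the decoder's motion vectors; codecview draws them
# (pf/bf/bb = forward/backward-predicted vectors of P- and B-frames).
subprocess.run([
    "ffmpeg", "-flags2", "+export_mvs", "-i", "input.mp4",
    "-vf", "codecview=mv=pf+bf+bb", "motion_vectors.mp4",
], check=True)
```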
Engineering
Demo: Evaluation Tool
Demo: Examples
(Images: original Auto thumbnail (10th-second frame) vs. top-scored frames from the new model)
What’s Next
● Refinements:
○ Fine-tuning earlier layers
○ Other models: ResNet-v2, Xception
○ Pre-filtering: adaptive, hardware acceleration
● Products:
○ New auto thumbnails
○ Thumbstrips
Resources
Blog Posts:
● https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
● https://github.com/tensorflow/models/tree/master/inception
● http://iamaaditya.github.io/2016/03/one-by-one-convolution/
● http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
● https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
● http://cs231n.github.io/transfer-learning/
● https://research.googleblog.com/2015/10/improving-youtube-video-thumbnails-with.html
● https://pseudoprofound.wordpress.com/2016/08/28/notes-on-the-tensorflow-implementation-of-inception-v3/
● https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
Papers:
● Rethinking the Inception Architecture for Computer Vision. https://arxiv.org/abs/1512.00567
● Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357
● CNN Features off-the-shelf: an Astounding Baseline for Recognition. https://arxiv.org/abs/1403.6382
● DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. https://arxiv.org/abs/1310.1531
● How transferable are features in deep neural networks? https://arxiv.org/abs/1411.1792
