Deep Learning behind Prisma
——Image style transfer with Convolutional Neural Network
lostleaf
Agenda
• Introduce deep learning models for image style transfer via recent
papers
• Prisma is something of a stunt, but it most likely uses similar techniques
• Agenda
• A brief introduction to convolutional neural network
• Neural style
• Real-Time Style Transfer
Prisma
• A Russian mobile app
• Turns your photos into
awesome artworks
• With Deep Learning!!!
Hotel Ukraine rendered by Prisma from Premier Medvedev’s
Instagram
Image Style Transfer
+
Arch Starry Night (van Gogh)
Arch painted by van Gogh
A brief introduction to
convolutional neural network
Some of the images are from
Prof. Fei-Fei Li’s lecture notes
Neuron
• w: weight, b: bias
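A single neuron is just a weighted sum plus a bias, passed through an activation function f. A minimal NumPy sketch (the values are illustrative):

```python
import numpy as np

def neuron(x, w, b, f):
    """Single neuron: activation of the weighted sum plus bias, f(w.x + b)."""
    return f(np.dot(w, x) + b)

# Example: 3 inputs, identity activation
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5, 1.0])
b = 0.25
out = neuron(x, w, b, lambda z: z)  # 0.5*1 - 0.5*2 + 1*3 + 0.25 = 2.75
```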
Activation functions (common ones)
• Activation functions are nonlinear functions
• Thresholding (ReLU): preferred in modern network structures
• Exponential ones (sigmoid, tanh): slower to compute and harder to train (vanishing gradient)
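The vanishing-gradient problem is easy to see numerically: the sigmoid's gradient shrinks toward zero for large inputs, while ReLU's gradient stays at 1 for any positive input. A small sketch:

```python
import numpy as np

def relu(z):
    """Thresholding activation: max(0, z)."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Exponential activation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = 10.0
# Sigmoid's derivative is s(z)*(1 - s(z)): nearly zero for large |z|,
# so gradients vanish as they flow backward through many layers.
sig_grad = sigmoid(z) * (1.0 - sigmoid(z))   # ~4.5e-5
# ReLU's derivative is 1 for any positive input: gradients pass unchanged.
relu_grad = 1.0 if z > 0 else 0.0
```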
Fully connected neural network
Convolution
• The brown numbers in the
yellow part are called the
convolution kernel / filter
• Convolve the filter with the
image: slide it over the image
spatially, computing dot
products
• Right: a 3*3 convolution that
sums up the diagonals
From Prof. Andrew Ng’s UFLDL tutorial
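Sliding the filter over the image and taking dot products can be written directly in NumPy (a "valid" convolution with no padding; as in CNN practice, the kernel is not flipped). Using an identity kernel sums the main diagonal of each window, matching the example above:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' convolution: slide the kernel spatially, take dot products."""
    H, W = image.shape
    k, _ = kernel.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.eye(3)              # sums the main diagonal of each 3*3 window
result = conv2d(image, kernel)  # 4*4 input, 3*3 filter -> 2*2 output
```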
Convolutional layer
Filters always extend the full
depth of the input volume
Why *3?
3 channels: R, G & B
Convolutional layer
1 number:
the result of taking a dot product between the
filter and a small 5*5*3 chunk of the image
Convolutional layer
Transform with activation
function f
f
Convolutional layer
• A convolutional layer
consists of several filters
• For example, if we had 6
5*5 filters, we’ll get 6
separate activation maps
• Stack these up to get a
tensor of size 28*28*6
• May add padding to keep the
output the same spatial size
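The 28*28*6 shape follows from the standard output-size formula; a small sketch (assuming a 32*32*3 input as in the lecture figure):

```python
def conv_output_size(n, f, pad=0, stride=1):
    """Spatial size after a convolution: (n - f + 2*pad) // stride + 1."""
    return (n - f + 2 * pad) // stride + 1

# 32*32*3 input, six 5*5 filters, no padding -> 28*28, stacked to 28*28*6
side = conv_output_size(32, 5)          # 28
# 'Same' padding keeps the spatial size: pad = (f - 1) // 2
same = conv_output_size(32, 5, pad=2)   # 32
```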
Why convolution?
• Each value could be considered as an
output of a neuron
• Features of image data:
• pixels only related to small
neighborhood (local connection)
• repeat pattern & content move around
(weight sharing)
• Reduces the complexity and
computation of the network by
exploiting these properties of images
Pooling Layer
• Right: max pooling for example
• Operate independently on every
depth slice of the input
• Reduce the spatial size
of the activation map (fewer
parameters and less computation)
• Increase the shift invariance
Case study 1: MNIST & LeNet
• MNIST handwritten digits recognition
• “hello world” of deep learning
LeNet
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
pooling pooling
Case study 2: ImageNet & VggNet
• ImageNet: a large image dataset in thousands of classes
VggNet (Vgg19)
Image by Mark Chang
• Runner-up of the ImageNet
challenge 2014
• 19 trainable layers
• 16 convolutional layers (3*3)
• 5 max pooling layers (2*2)
• 3 fully connected layers
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition."
arXiv preprint arXiv:1409.1556 (2014).
Typical architecture
• Convolutional part & Fully connected part
• [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
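Tracing spatial sizes through this pattern makes the architecture concrete. A sketch assuming 3*3 "same" convolutions (which keep the spatial size) and 2*2 max pools (which halve it), with Vgg19's block layout of 2, 2, 4, 4, 4 convolutions:

```python
def trace_spatial_size(size, blocks):
    """blocks: list of (num_convs, has_pool) pairs.
    3*3 'same' convolutions keep the spatial size; each 2*2 pool halves it."""
    for num_convs, has_pool in blocks:
        # (CONV-RELU)*num_convs leaves the spatial size unchanged
        if has_pool:
            size //= 2
    return size

# Vgg19 convolutional part on a 224*224 input:
# 224 -> 112 -> 56 -> 28 -> 14 -> 7
final = trace_spatial_size(
    224, [(2, True), (2, True), (4, True), (4, True), (4, True)])
```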
Neural style
Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A neural algorithm of artistic style."
arXiv preprint arXiv:1508.06576 (2015).
Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "Image style transfer using convolutional neural networks."
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Intuition
• A convolutional neural network well trained on a large dataset
(e.g. VggNet) can serve as a powerful feature extractor, loosely analogous to the human visual system
• Human painters are talented at combining content and style
Goal
• Given a content image p and a style image a
• Find an image x that
• Similar to p in content
• Similar to a in style
(Diagram: x ≈ p in content, x ≈ a in style)
Formulation
• Use Vgg19 (convolutional part) for feature extraction
• Two loss functions
• Content loss: difference in content between x and p
• Style loss: difference in style between x and a
• Find an image x that minimizes the weighted sum of the content
and style losses
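In Gatys et al., both losses compare VGG feature maps rather than raw pixels; style is compared through Gram matrices of the features, which capture channel correlations while discarding spatial layout. A minimal NumPy sketch on stand-in feature maps (the shapes and weights are illustrative, not VGG's actual ones):

```python
import numpy as np

def content_loss(feat_x, feat_p):
    """Squared difference between feature maps of x and p at one layer."""
    return 0.5 * np.sum((feat_x - feat_p) ** 2)

def gram(feat):
    """Gram matrix: correlations between channels, spatial layout discarded."""
    C, H, W = feat.shape
    F = feat.reshape(C, H * W)
    return F @ F.T

def style_loss(feat_x, feat_a):
    """Squared difference between Gram matrices, normalized per Gatys et al."""
    C, H, W = feat_x.shape
    G, A = gram(feat_x), gram(feat_a)
    return np.sum((G - A) ** 2) / (4.0 * C**2 * (H * W)**2)

rng = np.random.default_rng(0)
feat_x = rng.standard_normal((3, 4, 4))  # stand-in for a VGG activation
# Weighted sum of the two losses; identical inputs give zero loss
total = 1.0 * content_loss(feat_x, feat_x) + 100.0 * style_loss(feat_x, feat_x)
```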
How to find x
Image by Mark Chang
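x is found by gradient descent on the image itself, not on network weights: start from noise (or the content image) and repeatedly step down the combined loss. A toy illustration of the idea, with two quadratic stand-in "losses" pulling x toward a content target p and a style target a (the optimum of this toy objective is their weighted average):

```python
import numpy as np

# Toy stand-ins for content and style targets
p, a = np.array([1.0, 2.0]), np.array([5.0, 6.0])
alpha, beta = 1.0, 3.0   # content and style weights

x = np.zeros(2)          # start from a blank 'image'
for _ in range(500):
    # Gradient of alpha/2*||x - p||^2 + beta/2*||x - a||^2
    grad = alpha * (x - p) + beta * (x - a)
    x -= 0.1 * grad      # gradient descent step on the image

# Converges to the weighted average (alpha*p + beta*a) / (alpha + beta)
```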
Some results
J.M.W. Turner
Vincent van Gogh Edvard Munch
Cont’d
Pablo Picasso Wassily Kandinsky
Balance content and style
• The weights of the content
and style losses are hyperparameters
• Try multiple combinations and
pick one that suits your personal aesthetic
Photorealistic style transfer
New York London
Drawbacks
• Iterative optimization
• Slow: 65 s to render the 600 * 400 arch image on a GTX 980M
• Power-hungry: not acceptable for mobile apps like Prisma
Real-Time Style Transfer
Intuition
• Style transfer is essentially an image transformation problem: image
in, image out
• Generative CNNs have proven powerful in many other image
transformation problems
Goal
• For a specific style image a, train a CNN that
• Accepts a content image p as input
• Outputs a synthesized image x whose content is similar to p and whose
style is similar to a
Generative CNN
• Pre-trained VggNet for formulating the loss function
• Style target: a fixed style image, e.g. Starry Night
• Input image & content target: images sampled from a large dataset
• Image Transform Net: a fully convolutional network (plus some fancy new tricks)
Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. "Perceptual losses for real-time style transfer and super-resolution."
arXiv preprint arXiv:1603.08155 (2016).
Details & Improvements
• Image size 256 * 256
• Trained on a large image dataset for ~4 hours on a GTX Titan X
• 200-1000x rendering speedup
Some results
Cont’d
Comparison
• Original neural style: hundreds of optimization iterations per image
• Generative CNN: tens of thousands of training iterations once, then one
forward pass per synthesized image
• Prisma's offline mode probably uses similar technology
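A rough back-of-the-envelope comparison, using the 65 s figure from the Drawbacks slide and the 200-1000x speedup reported by Johnson et al.:

```python
# Original neural style: ~65 s of optimization per image (GTX 980M figure)
per_image_optimization = 65.0

# Generative CNN: training cost is paid once per style; each image is then
# a single forward pass, 200-1000x faster
speedup_low, speedup_high = 200, 1000
fast_slowest = per_image_optimization / speedup_low    # ~0.33 s per image
fast_fastest = per_image_optimization / speedup_high   # ~0.065 s per image
```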
Parallel work — Texture Network
Ulyanov, Dmitry, et al. "Texture Networks: Feed-forward Synthesis of Textures and Stylized Images."
arXiv preprint arXiv:1603.03417 (2016).
Take home
• What makes up a CNN
• Convolution, pooling, fully connected layer...
• How neural style works
• CNN for feature extraction & iterative optimization
• Fast style transfer
• Train a generative CNN for a specific style
Some open course resources
• Introduction to Computer Vision, Udacity
• Deep Learning, Udacity
• Convolutional Neural Networks for Visual Recognition, Stanford
CS231n *
• Deep Learning for Natural Language Processing, Stanford
CS224d
