Transfer Style Convolutional Neural Network
E4040.2016Fall.CJMD.report
Carlos Espino ce2330, Juan Borgnino jb3852, Jose Ramirez jdr2162
Columbia University
Abstract
In this paper we implement the style transfer
algorithm designed by Gatys et al., which
can generate new synthesised images by combining
the style and the content of two images. We explain
the methodology developed by the authors and
implement the solution using Theano. We find that
results are highly dependent on the values specified
for the hyperparameters and that training is quite
time intensive due to the large number of parameters
required. In addition, we successfully generate new
images combining several pictures (including one of
us) with the style of several well known painters.
1. Introduction
Our paper starts with an explanation of the work
and methodology developed by Gatys et al. in
section 2. We provide an intuitive
explanation of how the algorithm works, including
the loss functions used, and provide images showing
the results in section 3. We continue with section 4,
where we describe our implementation of the
algorithm in Theano and the architecture used. Next,
in section 5 we present our final results in the form of
synthesised images combining the content and style
of two different images. Finally, section 6 presents
our conclusions and advice for individuals who wish
to obtain the same results.
2. Summary of the Original Paper
The paper introduces a Deep Learning Network
which can create artistic images by combining the
style of one image with the content of another.
In order to do so, the authors explain that the
representations of content and style in the CNN are
separable. Hence, one can take both separately to
generate a new image based on the content of one
image and the style of another. To demonstrate their
findings, the authors match the content of a
photograph from Tubingen, Germany with the style of
several well known artworks from different periods.
See Figure 1- A.
The final results show that the synthesised image
maintains the global arrangement of the original
image (the image from Tubingen), while the colours
and local structures that compose the global scenery
belong to the chosen artwork. However, before
generating the new image one can specify how much
one desires to preserve from the style and how much
from the content.
2.1 Methodology of the Original Paper
Gatys et al. design a method to synthesize an image
that mixes the content and the style of different
images using convolutional neural networks.
Specifically, they use the VGG-Network with 19
layers (VGG-19), a Convolutional
Neural Network whose performance is similar to that
of humans in basic image recognition tasks
(Russakovsky et al. 2015). They minimize a loss
function that takes into account:
1. The neural representations that capture the
content of an image.
2. The style representations of another image,
computed from the correlations
between different types of neurons.
The following diagram shows the methodology they
use to generate the desired image. More details
about the representation are given in section 3.2.
Figure 1 Image credit Gatys et al. (2015)
2.2 Key Results of the Original Paper
The key result of the paper is that the representations
of style and content in the CNN are separable. As a
result, one can take the content of one image and
combine it with the style of another image.
The following image shows the results when
combining a photograph from Tubingen, Germany
with the style of three well known artists:
Figure 2 Image credit Gatys et al. (2015)
Different results can be achieved by varying the
values of the parameters $\alpha$ and $\beta$, which
correspond to the weights of the content and style of
the two images, respectively.
In particular they try different values for the
proportion $\alpha/\beta$, combining them with style
complexity when including more layers of the
network. More details about $\alpha$, $\beta$ and style
complexity are given in section 3.2.
Different values for $\alpha/\beta$ and for style complexity
yield very different results, as we can observe from
the images below, which combine the photograph
from Tubingen, Germany with the style
corresponding to the painting Composition VII by
Wassily Kandinsky.
Figure 3 Image credit Gatys et al. (2015)
These results show how the content of one image
can be combined with the style of another one to
successfully generate a new image which can be
quite appealing.
3. Methodology
3.1. Objectives and Technical Challenges
The objective is to generate an image that contains
the content of one image with the style of another
one. To do so, we start with a white noise image and
solve a minimization problem, discussed in the
following section, to find another image that matches
the desired content and style. Thus, one of the main
technical challenges is the large number of
parameters to learn in the minimization problem,
which is 3 x width x height of the image. For instance,
if we wish to generate a color image of size 500 x 500,
the number of parameters to learn is 750,000.
Another important challenge is the selection of the
hyperparameters which balance the style and the
structure in the output image. The final results are
quite sensitive to the weights assigned to the content
and style loss functions. Given the complexity of the
minimization problem, the choice of the minimization
algorithm is important and plays a key role in the
quality of the results.
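The parameter count quoted above can be checked with a short calculation. This is only an illustrative sketch; the helper name num_parameters is hypothetical and is not taken from the project code.

```python
def num_parameters(width, height, channels=3):
    """Number of pixel values that the optimizer has to learn."""
    return channels * width * height

# The 500 x 500 color image used as an example in the text.
print(num_parameters(500, 500))

# Doubling the side length multiplies the number of parameters by four.
print(num_parameters(1000, 1000) // num_parameters(500, 500))
```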
3.2. Problem Formulation and Design
We follow the same methodology as Gatys et al.
(2015) to formulate the minimization problem. We
explain here the minimization problem mentioned in
section 2.1.
Starting from the VGG-19, Gatys et al. (2015) remove
the fully connected layers, keeping only the 5 pooling
layers and the 16 convolutional layers (see section
4.1). The trained weights of this network are publicly
available in [4].
This network is used to encode an image at each
convolutional layer using its filter responses. In this
way, a layer $l$ that has $N_l$ filters will have $N_l$
feature maps, each of size $M_l$, which corresponds to
height x width of the feature map. We can store all the
responses of layer $l$ in a matrix
$F^l \in \mathbb{R}^{N_l \times M_l}$, so $F^l_{ij}$
corresponds to the activation of the $i$-th filter at
position $j$ in layer $l$.
Let $p$ and $x$ be the original image and the generated
image, and $P^l$ and $F^l$ their respective feature
representations at layer $l$. In order to generate the
content of the original image, we need to minimize:
$$L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} (F^l_{ij} - P^l_{ij})^2$$
where the gradient with respect to $x$ can be
computed using back propagation.
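As a minimal NumPy sketch of the content loss (not the project's Theano code), the feature matrices of one layer can be compared as follows. The random matrices only stand in for real VGG-19 responses, and the function names are hypothetical. The gradient shown is the one with respect to the features of the generated image; back propagation through the network would carry it on to the pixels.

```python
import numpy as np

def content_loss(F, P):
    """Half the sum of squared differences between two (N_l, M_l) feature matrices."""
    return 0.5 * np.sum((F - P) ** 2)

def content_loss_grad(F, P):
    """Gradient of content_loss with respect to the generated features F."""
    return F - P

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 9))  # stand-in content features: 4 filters, 3x3 map
F = rng.standard_normal((4, 9))  # stand-in features of the generated image

print(content_loss(F, P))  # positive for different features
print(content_loss(P, P))  # zero when the features match exactly
```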
Having defined how to generate the content of an
image, a way to generate a style representation is
needed. To do this, the correlation between different
filter responses is computed, taking the expectation
over the spatial extent of the input image. This is
given by the Gram matrix
$G^l \in \mathbb{R}^{N_l \times N_l}$, where
$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$ is the inner product
between the feature maps $i$ and $j$, in vector form,
in layer $l$.
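With the responses of a layer stored as one row per filter, the Gram matrix reduces to a single matrix product. The sketch below uses a tiny hand-made matrix in place of real feature maps; the name gram_matrix is an illustrative choice.

```python
import numpy as np

def gram_matrix(F):
    """Inner products between all pairs of vectorised feature maps (rows of F)."""
    return F @ F.T

# Two filters, three spatial positions.
F = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
G = gram_matrix(F)

# G[i, j] is the inner product of feature maps i and j, so G is symmetric.
print(G)
```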
Having defined this, let $a$ and $x$ be the original
image and the generated image, and $A^l$ and $G^l$ their
respective style representations at layer $l$. In order
to generate the style of the original image, we need to
minimize:
$$L_{style}(a, x) = \sum_{l=0}^{L} w_l E_l$$
where $w_l$ is the weighting factor of the contribution of
each layer and $E_l$ is the contribution of layer $l$,
defined as:
$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} (G^l_{ij} - A^l_{ij})^2$$
Here the gradient with respect to $x$ can be
computed using back propagation as well.
The original paper chooses $w_l$ equal to one divided
by the number of active layers (1/5 when five layers
are used) for the convolutional layers we decide to
use and 0 for the layers we don’t want to use.
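The style loss can be sketched in NumPy along the same lines: a per-layer term that compares Gram matrices, and a weighted sum over the active layers. The random feature matrices and their shapes are toy assumptions, not VGG-19 dimensions, and the function names are hypothetical.

```python
import numpy as np

def gram_matrix(F):
    return F @ F.T

def layer_style_loss(F, A_feat):
    """Contribution E_l of one layer, from (N_l, M_l) feature matrices."""
    N, M = F.shape
    G = gram_matrix(F)       # style of the generated image
    A = gram_matrix(A_feat)  # style of the style image
    return np.sum((G - A) ** 2) / (4.0 * N ** 2 * M ** 2)

def style_loss(x_feats, a_feats, weights):
    """Weighted sum of the per-layer contributions."""
    return sum(w * layer_style_loss(F, A)
               for w, F, A in zip(weights, x_feats, a_feats))

rng = np.random.default_rng(0)
shapes = [(2, 64), (4, 16), (8, 16), (8, 4), (8, 1)]  # toy (N_l, M_l) per layer
x_feats = [rng.standard_normal(s) for s in shapes]
a_feats = [rng.standard_normal(s) for s in shapes]
weights = [1.0 / len(shapes)] * len(shapes)  # equal weight on each active layer

print(style_loss(x_feats, a_feats, weights))
```

A layer can be switched off by setting its weight to 0, which matches the weighting scheme described above.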
Having defined the loss functions to generate the
content and the style, if we wish to generate an
image with content from image $p$ and style from
image $a$, we need to minimize the following loss
function:
$$L_{total}(p, a, x) = \alpha L_{content}(p, x) + \beta L_{style}(a, x)$$
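The combined objective is a weighted sum of the two losses. The sketch below uses made-up loss values purely for illustration, together with the weights 0.01 and 1,000 reported in section 5.1; the function name total_loss is hypothetical.

```python
def total_loss(content_value, style_value, alpha, beta):
    """Weighted sum of the content loss and the style loss."""
    return alpha * content_value + beta * style_value

alpha, beta = 0.01, 1000.0  # content and style weights from section 5.1
content_value = 2.0         # made-up content loss
style_value = 0.003         # made-up style loss

print(total_loss(content_value, style_value, alpha, beta))

# Scaling both weights by the same factor only rescales the objective,
# so the balance between content and style depends on the ratio alpha / beta.
print(alpha / beta)
```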
Now that we have the loss function, we need to
choose a minimization algorithm. We compare
Adam, Adadelta and L-BFGS (limited memory BFGS)
by Liu, D. C., & Nocedal, J. (1989).
The limited memory implementation of BFGS is
important because quasi-Newton algorithms need to
compute or estimate the Hessian of the loss function.
This can lead to a huge memory problem given the
dimensionality of the variables. Hence, a limited
memory approach is needed for this kind of
minimization problem.
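A rough calculation illustrates the memory argument. The sketch assumes 8-byte floating point values and a history of m = 10 correction pairs for L-BFGS; both numbers are assumptions for illustration rather than measurements from our runs.

```python
def dense_hessian_bytes(n, bytes_per_value=8):
    """Memory needed to store a full n x n Hessian approximation."""
    return n * n * bytes_per_value

def lbfgs_history_bytes(n, m=10, bytes_per_value=8):
    """Memory for the m most recent (s, y) pairs, i.e. 2 * m vectors of length n."""
    return 2 * m * n * bytes_per_value

n = 3 * 500 * 500  # parameters of a 500 x 500 color image

print(dense_hessian_bytes(n) / 1e12, "TB for a dense Hessian")
print(lbfgs_history_bytes(n) / 1e6, "MB for the L-BFGS history")
```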
4. Implementation
In the following section, we describe the deep
learning architecture, then the overall
design of our implementation, and the
challenges and considerations involved. Our project
requires a huge number of parameters to be
optimized, therefore we run different experiments
with multiple gradient descent algorithms.
4.1. Deep Learning Network
As mentioned before, our algorithm uses the VGG-19
network, which was created by the Visual Geometry
Group of Oxford University (VGG). The VGG-19
contains five main blocks. Each main block has a set
of connected convolutional layers: the first two
blocks have two convolutions each and the last three
have four convolutions each.
Figure 4 Architectural block diagram VGG-19 [7].
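The layer structure described above can be enumerated with a few lines of Python. This is a sketch of the naming scheme only: the 'convB_K' names follow the layer names used elsewhere in this report, while the 'poolB' names and the helper vgg19_layer_names are assumptions made for illustration.

```python
CONVS_PER_BLOCK = [2, 2, 4, 4, 4]  # convolutions in each of the five main blocks

def vgg19_layer_names(convs_per_block=CONVS_PER_BLOCK):
    """List the convolutional and pooling layers in order."""
    names = []
    for block, n_convs in enumerate(convs_per_block, start=1):
        names += ["conv%d_%d" % (block, i) for i in range(1, n_convs + 1)]
        names.append("pool%d" % block)  # one pooling layer closes each block
    return names

names = vgg19_layer_names()
n_conv = sum(name.startswith("conv") for name in names)
n_pool = sum(name.startswith("pool") for name in names)
print(n_conv, n_pool)  # 16 5
```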
Replicating our results takes at least 12 hours and
requires a variant of the gradient descent algorithm
with adaptive learning rates.
Some of the most important hyperparameters in our
model are $\alpha$ and $\beta$. They represent the amount of
structure and style in the output image, respectively.
These parameters are very sensitive, and they should
be changed depending on the images involved in the
style transfer. We run multiple combinations of
$\alpha$ and $\beta$: first we fix $\alpha$ at 0.001 and test
three different values of $\beta$ (0.1e3, 0.1e4, 0.1e5). In
the following table we can observe the variation in
our desired output. With a $\beta$ of 0.1e3 the output has
little blue, the second one (0.1e4) starts to include
some yellow colors and the last one includes more
style than structure. Our final configuration was a
$\beta$ of 0.1e5, because it maintains a better balance
between style and structure.
We use ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’
and ‘conv5_1’ layers for the style and ‘conv4_2’ layer
for the content.
Figure 5 Setting hyperparameters alpha/beta
To compare our results to the ones from the paper we
use the style from Starry Night by Van Gogh and the
content from the Tubingen image.
Then we generate some other examples using the
following images:
1. The style from Circus by Joan Miro and the
content from an image of NYC.
2. The style from Diego Rivera And Frida Kahlo
Dia De Los Muertos painting by Pristine
Cartera Turkus and the content from an
image of the team members in San
Francisco.
The results and images are shown in section 5.
4.2. Software Design
Our architecture has four main components:
functions to manipulate our two input
images, the neural net architecture, different kinds of
gradient algorithms, and the components to train our
optimization problem and return our final result. All
our code was written in Theano.
Figure 6 General architecture, components
Images manipulation: This component loads the two
input images. It is responsible for cropping the
original images and rescaling them to the desired
resolution.
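The report does not specify how the cropping and rescaling are done, so the following NumPy sketch rests on two assumptions: a centered square crop and nearest-neighbour resampling. Both helper names are hypothetical, and an image library would normally be used for better interpolation.

```python
import numpy as np

def center_crop_square(img):
    """Crop an (H, W, C) image to a centered square of side min(H, W)."""
    h, w = img.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return img[top:top + side, left:left + side]

def resize_nearest(img, size):
    """Rescale an (H, W, C) image to (size, size, C) by nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows][:, cols]

img = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder for a loaded photo
out = resize_nearest(center_crop_square(img), 600)
print(out.shape)  # (600, 600, 3)
```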
VGG Model: It is the most important component. It
contains the neural net architecture described in
Section 4.1. It also has the evaluation function.
Gradient Algorithms: This component contains
multiple algorithms which are used to optimize our
loss function. It is a critical component because our
running time is long: for instance, with an image of
600 pixels, it takes 12 hours with the scipy l_bfgs_b
minimizer.
The Adam and Adadelta algorithms start converging
fast but they get stuck at a certain point where they
barely decrease the value of the objective function at
each iteration. This prevents the full expression of the
style. In contrast, L-BFGS finds better local minima
because it approximates the Hessian matrix of the
objective function.
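The project's optimizer wrapper is not reproduced in this report, so the following is only a sketch of how scipy's fmin_l_bfgs_b is typically driven: the image is flattened into a float64 vector and the objective returns both the loss and its gradient. The toy quadratic objective and the names target and loss_and_grad are assumptions that stand in for the Theano loss.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Toy "image" with 3 channels and 2 x 2 pixels that the optimizer should recover.
target = np.linspace(0.0, 1.0, 12).reshape(3, 2, 2)

def loss_and_grad(x_flat):
    """Return the loss and its gradient for a flattened image."""
    x = x_flat.reshape(target.shape)
    diff = x - target
    loss = 0.5 * np.sum(diff ** 2)
    return loss, diff.ravel()

x0 = np.zeros(target.size)  # starting point, flattened to a vector
x_opt, final_loss, info = fmin_l_bfgs_b(loss_and_grad, x0, maxiter=50)

print(final_loss)
print(info["warnflag"])  # 0 means the optimizer reports convergence
```

In the real problem the call to loss_and_grad would evaluate the network, which is where most of the running time is spent.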
Training and results: This component contains the
functions to instantiate our VGG-19 and the gradient
algorithm. It also runs multiple iterations and gives us
the result after minimizing the loss function.
Figure 7 Left Class diagram, Right Call graph
In order to follow the principles of OOP (Object
Oriented Programming), we encapsulate the
functionality of VGG-19 in a class. It contains a
public ordered dictionary with all the convolutional
and pooling layers. It also calculates the cost
function, which is the sum of alpha times the content
loss and beta times the style loss (see section 3.2).
In addition, Figure 7 shows the call graph.
Our main method calls the test_method. It first
prepares the images and creates a new instance of
VGG-19, whose inputs are the art and content images.
Then our test creates a dictionary and saves the
responses to the images at each convolutional layer
of our first instance of VGG-19. The next step is to
instantiate a second VGG-19 object, this time with a
random image as input. Finally our test uses a Train
function that requires one optimizer; in this example
we use Adadelta, which is slower than the L-BFGS
minimizer.
The pseudocode of our algorithm is the following:
Algorithm: Generate_Image
Inputs: p: content image
a: style image
alpha: weight of the content image
beta: weight of the style image
Output: x: the generated image with content from image
p and style from image a.
vgg = VGG19() # create a VGG19 network with pretrained weights
content_layer = 'conv4_2'
style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1',
'conv5_1']
p_feat = vgg(p).output(content_layer) # content features of p
a_feats = [vgg(a).output(layer) for layer in style_layers] # style features of a
x = random_image
x_content = vgg(x).output(content_layer)
x_styles = [vgg(x).output(layer) for layer in style_layers]
loss = alpha * content_loss(p_feat, x_content) + beta * style_loss(a_feats, x_styles)
x = minimize(loss, x) # using the desired minimization algorithm
return x
The content_loss and style_loss functions are defined
in Section 3.2.
5. Results
5.1. Project Results
First we replicated the main example from the paper.
Recall that it takes the style from The Starry
Night by Van Gogh.
Figure 8 Setting hyperparameters alpha/beta
For this experiment we chose values of $\alpha$ and $\beta$ of
0.01 and 1,000 respectively. The result shows the
desired output and is similar to the one in the
paper. If we wish to get more accurate results we
would need to give more iterations to the
optimization algorithm. At some point the Adam
optimization algorithm gets stuck and decreases the
value of the objective function only in small amounts.
We also wanted to try other combinations of images
to see what the algorithm is capable of. Our next
attempt was combining the style of The Circus by
Miro and an image of NY. The results are the
following:
Figure 9 Result NYC-Circus by Miro
Finally, we tried with an image of the three of us
(Carlos on the left, Jose in the middle and Juan on
the right) combined with the painting Diego Rivera
And Frida Kahlo Dia De Los Muertos by
Pristine Cartera Turkus. This case is interesting
because the painting contains a blue face and a white
face. However, we see that only the white face seems
to be translated to the synthesised image, because
our faces look white. Probably the loss function was
lower with white faces than with blue ones.
This is due to the fact that the content loss is
probably smaller if our faces are white, because this
color is closer to our skin color than the blue of the
other face.
Figure 10 San Francisco and Dia de muertos
5.2. Comparison of Results
Our first result tries to reproduce the original image
from Gatys et al.
Figure 11 Comparison with Gatys et al.
Our image successfully extracts the style from the
painting and applies it to the photo. We can see that
it captures some more small details from the
brush but doesn’t capture the stars in the sky. As we
have commented before, this may be caused by
different factors such as the choices of $\alpha$ and $\beta$
and the optimization algorithm.
We also wanted to compare our result with the
Deepart.io commercial tool. Deepart.io uses the
original algorithm by Gatys et al. (2015).
Figure 12 Deepart.io vs our result comparison
Both images look similar in the cartoon-like colors
and arrangements; however, Deepart.io generates
more polished styling. We believe this happens
because we used Adam to generate that image and it
got stuck before generating more details. Also, as we
mentioned before, the results are very sensitive to
the choice of $\alpha$ and $\beta$.
5.3. Discussion of Insights Gained
We learned some important lessons while working
on this project, especially about the importance of
the optimization problem. It’s important to note that
the number of parameters grows quadratically with
the side length of the image. If we wish to get a very
high resolution image, we will have to minimize a loss
function over millions of parameters. Hence, it’s
important to choose an algorithm that converges fast
and finds a good local minimum. This issue makes us
think about using quasi-Newton methods, but we also
have to be careful with the size of the Hessian, which
can significantly affect the performance of our
algorithm. That’s why limited memory algorithms like
L-BFGS are the right choice.
We also noticed that the choice of $\alpha$ and $\beta$ can
significantly change the result of the algorithm, and
it’s fun to play with them with different images to
create unique pieces of art. However, each image
takes some hours to generate, making it difficult to
try many different values.
6. Conclusion
We successfully implemented the style
transfer algorithm developed by Gatys et al. which
can combine the content and style of two images.
The main finding of the research of the authors is that
the representations of style and content in the CNN
are separable. Therefore, one can take the content of
one image and combine it with the style of another
image.
To implement the algorithm we used the
convolutional and pooling layers of the VGG-19
network. We then formulated and solved the
minimization problem presented in the paper using
Theano. Finally, we replicated the results of the paper
and experimented with new images and artists. We
managed to create appealing images with an artistic
style.
For individuals interested in replicating the
same results, we recommend first working with small
images (32x32) to quickly assess whether the
algorithm is working or not. Given the large number of
weights which need to be updated, the algorithm
takes a long time to run. We also advise trying
different values for $\alpha$ and $\beta$ to find the desired
balance between style and content. Finally, we
recommend using the L-BFGS algorithm, a
quasi-Newton method, for solving the minimization
problem. We tried Adam and regular steepest descent
algorithms, but obtained much faster and better
convergence with the L-BFGS algorithm.
7. Acknowledgement
We would like to acknowledge the work done in [6]. It
was a very useful guide that helped us to figure out
some implementation details.
8. References
[1] Bitbucket repo:
https://bitbucket.org/e_4040_ta/e4040_project_cjmd
[2] Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A
neural algorithm of artistic style. arXiv preprint
arXiv:1508.06576.
[3] Russakovsky, O., Deng, J., Su, H., Krause, J.,
Satheesh, S., Ma, S., ... & Berg, A. C. (2015).
Imagenet large scale visual recognition challenge.
International Journal of Computer Vision, 115(3),
211-252.
[4] https://s3.amazonaws.com/lasagne/recipes/pretrained/imagenet/vgg19_normalized.pkl
[5] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016).
Image style transfer using convolutional neural
networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (pp.
2414-2423).
[6] https://github.com/Lasagne/Recipes/blob/master/examples/styletransfer/Art%20Style%20Transfer.ipynb
[7] http://www.slideshare.net/ckmarkohchang/applied-deep-learning-1103-convolutional-neural-networks
[8] Liu, D. C., & Nocedal, J. (1989). On the limited
memory BFGS method for large scale optimization.
Mathematical programming, 45(1-3), 503-528.
9. Appendix
9.1 Individual student contributions
| | ce2330 | jb3852 | jdr2162 |
| Last Name | Espino | Borgnino | Ramirez |
| Fraction of (useful) total contribution | 1/3 | 1/3 | 1/3 |
| What I did 1 | Implement the structure of the network. | Implement the loss functions and the theano training model | Implement the classes and fixed issues on the code. |
| What I did 2 | Methodology and formulation | Introduction, results and conclusions | Implementations and Software design |
| What I did 3 | Run third example | Run first example | Run second example |