So formally, we are working in the Lab color space. The grayscale information is contained in the L, or lightness channel of the image, and is the input to our system.
The output is the ab, or color channels.
We’re looking to learn the mapping from L to ab using a CNN.
We can then take the predicted ab channels, concatenate them with the input, and hopefully get a plausible colorization of the input image. This is the graphics side of the problem.
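The decomposition and recombination described above can be sketched in a few lines, assuming scikit-image's color conversions (the array here is a random stand-in for a real image, and the "predicted" ab channels are just the ground truth for illustration):

```python
import numpy as np
from skimage import color

# Stand-in for a real RGB image, values in [0, 1].
rgb = np.random.rand(64, 64, 3)
lab = color.rgb2lab(rgb)

L = lab[:, :, :1]   # lightness channel: the grayscale input to the network
ab = lab[:, :, 1:]  # color channels: the prediction target

# At test time, the predicted ab channels are concatenated back onto L
# and the result is converted to RGB for display.
ab_pred = ab  # stand-in for the network's output
colorized = color.lab2rgb(np.concatenate([L, ab_pred], axis=2))
```

Note that every color image yields such an (L, ab) training pair for free, which is what makes colorization usable as a self-supervised task.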
We note that any image can be broken up into its grayscale and color components, and in this manner, can serve as a free supervisory signal for training a CNN. So perhaps by learning to color, we can achieve a deep representation which has higher level abstractions, or semantics.
Now, this learning problem is less straightforward than one may expect.
For example, consider this grayscale image.
This is the output after passing it through our system, and it seems to look plausible. Now here is the ground truth. Notice that these two look very different. But even though red and blue are far apart in ab space, we are just as happy with the red colorization as we are with the blue, and perhaps the red is even better...
This indicates that any loss which assumes a unimodal output distribution, such as an L2 regression loss, is likely to be inadequate.
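A toy calculation makes the failure mode concrete. Suppose a pixel is equally likely to be red or blue; the ab values below are illustrative numbers, not measured from any real image. The single prediction that minimizes expected L2 loss is the mean of the two modes, a desaturated color that matches neither plausible answer:

```python
import numpy as np

# Two equally plausible ab values for the same grayscale pixel
# (hypothetical numbers for illustration).
red_ab  = np.array([ 55.0,  40.0])
blue_ab = np.array([-20.0, -40.0])

# The L2-optimal single prediction averages the modes,
# landing on a washed-out in-between color.
l2_optimal = (red_ab + blue_ab) / 2  # neither red nor blue
```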
We reformulate the problem as multinomial classification. We divide the output ab space into discrete bins of size 10.
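A minimal sketch of such a quantization follows. The gamut bounds and the row-major class indexing are assumptions for illustration, not the paper's exact binning:

```python
import numpy as np

BIN_SIZE = 10
AB_MIN, AB_MAX = -110, 110                        # approximate extent of the ab plane
N_PER_AXIS = (AB_MAX - AB_MIN) // BIN_SIZE        # bins along each of a and b

def ab_to_class(ab):
    """Map continuous (a, b) values to a single discrete class index."""
    idx = np.clip((ab - AB_MIN) // BIN_SIZE, 0, N_PER_AXIS - 1).astype(int)
    return idx[..., 0] * N_PER_AXIS + idx[..., 1]

def class_to_ab(c):
    """Map a class index back to the center of its ab bin."""
    a_idx, b_idx = c // N_PER_AXIS, c % N_PER_AXIS
    return np.stack([a_idx, b_idx], axis=-1) * BIN_SIZE + AB_MIN + BIN_SIZE / 2
```

The network then predicts a distribution over these classes per pixel, which lets it keep multiple color hypotheses alive instead of averaging them.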
The system does have some interesting failure cases. We find that many man-made objects can be multiple colors. The system sometimes has a difficult time deciding which one to go with, leading to this type of tie-dye effect.
Also, we find other curious behaviors and biases. For example, when the system sees a dog, it sometimes expects a tongue underneath. Even when there is none, it will just go ahead and hallucinate one for us anyways.
Due to time constraints, we will not be able to discuss all of the tests, but please come by our poster for more details.
We also see units which correspond to more “thing” categories, such as human and dog faces, and flowers. The network was able to discover these units in an unsupervised regime.
The y=0 line shows the performance if we initialize the network using Gaussian weights.
The performance we are hoping to match is that of a network trained with ImageNet labels. We will see how well each of these methods makes up the difference between Gaussian initialization and training with ImageNet labels.
One method for learning features is autoencoders, which rely on a bottleneck. We find that the autoencoder does not learn very semantically meaningful features. Using stacked k-means, as implemented by Krähenbühl et al., makes up some of the ground.
Finally, our method, outside of the Doersch detection result, performs competitively relative to other self-supervision methods. We found this result surprising, as our project was primarily focused on the graphics task of colorization. However, note the large gap between self-supervision methods and pre-training on ImageNet. There is still work to be done to achieve strong semantic representations without the benefit of labels.
This is an amateur family photo from the 1950s of my father and great-grandfather.
This is a professional photograph by Henri Cartier-Bresson.