This post provides some technical details about our work and is primarily intended for readers who are quite familiar with machine learning and neural networks.
Google recently blogged about their generative art, created by running deep convolutional neural networks backwards on photographs. This results in the network fantasizing various additional features and objects in the photographs, such as animals, buildings, people, eyes, legs and wheels. Their networks can also generate images from scratch, which results in repeating patterns and psychedelic landscapes. Their inceptionism gallery has some spectacular examples.
We decided that we wanted to try and reproduce this, and make some convnet art ourselves. To save time, we downloaded the parameters for a trained VGG-16 network. This architecture with 16 trainable layers was proposed by Simonyan et al. and was used to reach the 2nd place in the 2014 ImageNet Large Scale Visual Recognition Challenge. A VGG-16 network primarily consists of convolutional layers with 3-by-3 filters, with the occasional 2-by-2 max-pooling layer in between. We obtained the parameters for a trained VGG-16 network from vlfeat.org.
The blog post briefly mentions that they use a natural image prior to get more recognizable results, but there is no additional information about this prior. A deep convolutional neural network is trained to classify images, not to generate them. As a result it throws away a lot of information about the large-scale structure of the images. It turns out that having such a prior is pretty essential to get good results.
We created a very simple prior by fitting a multivariate Gaussian (or alternatively, a Student-t distribution) on 8-by-8 patches extracted from natural images. We then applied this prior convolutionally to the generated images, at multiple scales. This seemed to work well enough to get some structure in the images. It’s far from perfect, and our approach is insufficient to generate large-scale structure. For example, if you tell it to generate images of spiders, you will see lots of spider legs and bodies scattered across the images, but a complete spider will be a rare occurrence.
To generate an image from a given class in practice, we optimize the logit of the class using gradient descent. This is the input to the softmax function at the output side of the network. This works better than optimizing the class probability or class log likelihood, because in that case the network tries to supress the activations for all other classes. Optimizing logits avoids this and only strengthens features for the desired class. We used the Adam update rule to speed up the optimization process a bit.
In order to do everything in real time, a lot of optimizations had to be made. We had to settle on only having a frame which was actually optimized every couple of seconds, and interpolate the other frames in the video from there. This interpolation however is pretty bad for keeping the interesting features the network can use, because it tends to lose detail.
All in all it took us about one week to implement, starting from scratch. We hope other trippers in the world can dig it as much as we do.