We generally see artificial intelligence classifying and predicting, but creating? That’s somethin’ spicy. 🤩
This is called Neural Style Transfer (NST): using a neural network to transfer the style of one picture onto another.
The content image is the image that receives the style, and the image whose style is transferred is called the style image. No matter what style image you pair up with the content image, the content of the content image will still be apparent in the newly generated image.
A useful analogy is making pizza. You can choose any topping (🍄🍍 🧅) you want, and it’s still a pizza. 🍕 Similarly, you can choose any style to transfer onto the content image, and the result will still contain the content image’s contents.
Basically, we want to extract the ‘content’ of any content image and the ‘style’ of any style image, aaaaaannnd mix! 🥒🥕🥬 ⇒ 🥗
How does this work?
Let’s take a bird’s-eye view of how this works, then dive into some details later.
How do you extract the content from the content image?
First, a combination/generated image is initialized (think of this as sort of a placeholder). It’s not going to be perfect the first time around, so it’ll keep being updated during the training process. It’s the only variable that needs to be updated during style transfer (i.e., it plays the role of the model parameters updated during training).
We’re going to take a pre-trained VGG-19 neural network and feed the content image into it. A pre-trained VGG-19 is used because the network has already been trained on more than a million images from the ImageNet database, so it’s able to detect high-level features in an image.
Throughout the process, we use this pre-trained CNN to extract image features and freeze its model parameters during training.
Because no classification is needed, we don’t need the entire CNN architecture. We just need the series of convolutional and pooling layers, and we don’t actually care about the end (the fully connected layers and the softmax), so we’ll focus on the convolutional part.
Now we have our initialized combination image, which needs some work. To put this in solid calculations and give us some direction, we need to define Content loss and Style loss.
Our initialized combination image is like a car, and it’s kinda just there 😐 It can’t go anywhere or be of any use without a destination (in this case, the final combination image). How does it get to its destination? With a driver and, preferably, Google Maps.
We calculate the loss function through forward propagation, and update the model parameters (the synthesized image for output) through backpropagation.
The loss function commonly used in style transfer consists of three parts:
- content loss: makes the combination image and the content image close in terms of their content features
- style loss: makes the combination image and the style image close in terms of their style features
- total variation loss: helps reduce noise in the generated image
When model training is over, we output the updated model parameters of the style transfer, i.e., the synthesized image, as the final combination image.
Let’s go deeper into the loss function.
Fundamentally, CNNs extract features so that the fully connected layers can identify the content in the image. Along the way, as the network gets deeper, it cares more about the actual content of the image and not just raw pixel values, so we want to look at the higher convolutional layers to extract content.
The content loss measures the difference between the combination image (synthesized image) and the content image by a squared loss function.
Choosing a layer l, let P = the original (content) image and F = the generated image. F[l] and P[l] are the feature representations of the images at layer l.
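In the standard NST formulation, the content loss at layer l is the squared error between these two feature representations:

```latex
L_{\mathrm{content}}(P, F, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
```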
The style loss, similar to the content loss, also uses a squared loss function, measuring the difference in style between the combined/generated image and the style image.
However, for style loss we need to use several different layers throughout the VGG network, because style information is measured as the amount of correlation present between the feature maps within a layer. (This means we perform dot products between feature maps of the same layer; more on that later.)
Style depends on some low-level elements and also on some high-level elements, so we need multiple layers throughout the network. An image’s colors and textures carry its style, while its shapes carry its content.
So a convolutional layer with, say, four filters will generate four different matrices, called feature maps, which contain low-level features of an image (such as lines, edges, dots, and curves).
If you go deeper into the convolutions, you’ll get higher-level features such as shapes, objects, etc.
Now, we know that different feature maps represent different features of an image; however, these features are independent from each other. They can capture objects and shapes, but they can’t capture the style of an image. (Style: texture, brush strokes, color distribution, etc.)
The style features of an image have occurrences that are dependent on each other, so we look for correlation. The correlations between feature maps are collected in what’s called a Gram matrix.
Because what we’re interested in is not where something happens, but just whether things co-occur, the Gram matrix comes in clutch.
The constant that we multiply each Gram matrix by can be seen as a hyperparameter for changing style levels: if you increase the value for a specific layer, the style from that layer is captured more than the style from the other layers. We previously mentioned performing dot products between feature maps of the same layer; let’s look at that more closely.
A dot product of two vectors is the sum of products of their coordinates and can be written as:
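For two n-dimensional vectors a and b, the standard form is:

```latex
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n
```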
It can be seen as a measure of how similar two vectors are.
Gram matrix: the (i, j)th element of the style matrix is computed by taking the element-wise multiplication of the i-th and j-th feature maps and summing across both width and height.
Consider two vectors (more specifically, two flattened feature vectors from a convolutional feature map of depth C) representing features of the input space. Their dot product gives us information about the relation between them: the smaller the product, the more different the learned features are, and the greater the product, the more correlated the features are. In other words, the smaller the product, the less the two features co-occur, and the greater it is, the more they occur together. This, in a sense, gives information about an image’s style (texture) and zero information about its spatial structure, since we already flattened the features before taking the dot product.
tldr; if the dot product across the activations of two filters is large, then the two channels are correlated; if it’s small, they are uncorrelated. Mathematically, we compute this for all pairs of the C feature vectors.
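A tiny NumPy sketch makes this concrete (toy shapes here, not actual VGG activations): flatten each of the C feature maps into a vector, then take all pairwise dot products, giving the Gram matrix G_ij = Σ_k F_ik F_jk:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy convolutional output: C feature maps, each H x W.
C, H, W = 4, 5, 5
feature_maps = rng.standard_normal((C, H, W))

# Flatten each feature map into a vector of length H*W.
F = feature_maps.reshape(C, H * W)

# Gram matrix: G[i, j] is the dot product of the i-th and j-th flattened
# maps, i.e. element-wise multiplication summed over width and height.
G = F @ F.T

print(G.shape)              # (4, 4): one entry per pair of channels
print(np.allclose(G, G.T))  # True: correlation of (i, j) equals (j, i)
```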
Applying the squared loss function for gram matrix, the style loss can be defined as:
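Following the standard formulation (where G^l is the Gram matrix of the generated image and A^l that of the style image at layer l, with N_l feature maps of M_l elements each, and w_l the per-layer style weights mentioned earlier):

```latex
L_{\mathrm{style}} = \sum_{l} w_l \, \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}
```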
Intuitively, the final combined loss of the content loss and style loss is defined as:
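In its usual form:

```latex
L_{\mathrm{total}} = \alpha \, L_{\mathrm{content}} + \beta \, L_{\mathrm{style}}
```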
where α and β are user-defined hyperparameters.
Total Variation Loss
Sometimes the generated/combined image has lots of noise, and we want to minimize that.
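The total variation loss is commonly defined over the pixel values x_{i,j} of the generated image as:

```latex
L_{\mathrm{TV}} = \sum_{i,j} \left( \left| x_{i,j} - x_{i+1,j} \right| + \left| x_{i,j} - x_{i,j+1} \right| \right)
```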
Intuitively, i and j can be thought of as x and y coordinates in the equation above. By reducing the total variation loss, we make the values of the neighboring pixels on the generated/combined image closer.
- Loss function: the weighted sum of content loss, style loss, and total variation loss. By adjusting the weight hyperparameters, we can balance content retention, style transfer, and noise reduction in the generated image.
- Training loop: extract content features, extract style features, calculate the loss function, and update the generated image through backpropagation.
- We use Gram matrices to represent the style outputs from the style layers.
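Putting the pieces together, here’s a plain-NumPy sketch of the three loss terms and their weighted sum (the alpha/beta/gamma weights are illustrative placeholders, and a real implementation would backpropagate through the frozen CNN with autograd rather than operate on raw arrays):

```python
import numpy as np

def content_loss(F, P):
    # Squared error between generated and content feature maps.
    return 0.5 * np.sum((F - P) ** 2)

def gram(feature_maps):
    # Pairwise dot products of flattened feature maps.
    C = feature_maps.shape[0]
    flat = feature_maps.reshape(C, -1)
    return flat @ flat.T

def style_loss(F_maps, A_maps):
    # Squared error between Gram matrices, with the usual normalization.
    C = F_maps.shape[0]
    M = F_maps[0].size
    return np.sum((gram(F_maps) - gram(A_maps)) ** 2) / (4 * C**2 * M**2)

def tv_loss(img):
    # Penalize differences between neighboring pixels to reduce noise.
    return (np.abs(img[1:, :] - img[:-1, :]).sum()
            + np.abs(img[:, 1:] - img[:, :-1]).sum())

def total_loss(F, P, F_maps, A_maps, img, alpha=1.0, beta=1e3, gamma=10.0):
    # Weighted sum balancing content retention, style, and smoothness.
    return (alpha * content_loss(F, P)
            + beta * style_loss(F_maps, A_maps)
            + gamma * tv_loss(img))
```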