Neural Networks - Convolutional Networks, Image Processing

A very common type of input that neural networks have to deal with is images. Neural networks have made great progress in this area recently. Today we will look at how to work with neural networks for image processing, what architectures are available, and we will also look at some interesting results and the methods by which they were achieved.

Convolutional networks

The main problem with image processing using neural networks is the size of the images: a typical image has more than $100 \times 100$ pixels, and each pixel of a color image has 3 color channels. Even such a small image therefore means 30,000 values on the input. This is a lot for fully connected networks, where every neuron in the first hidden layer would need its own 30,000 weights, so the network would have a very large number of parameters.
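
The arithmetic can be sketched in a few lines. The hidden layer size of 1,000 neurons is a hypothetical value chosen only for illustration; it does not come from the text.

```python
# Count the parameters of one fully connected layer on a small color image.
height, width, channels = 100, 100, 3
n_inputs = height * width * channels      # values on the input

n_hidden = 1000                           # hypothetical hidden layer size (assumption)
fc_weights = n_inputs * n_hidden          # one weight per (input, neuron) pair
fc_params = fc_weights + n_hidden         # plus one bias per neuron

print(n_inputs)    # 30000
print(fc_params)   # 30001000
```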

For this reason, so-called convolutional networks are often used in image processing. They are designed to read small (e.g. $3 \times 3$) parts of the image and apply the same operation to each of them: a dot product with the weights of the network, just as in perceptron networks. The important thing is that this operation is applied at all possible positions in the image and that the weights are shared across all positions. The length of the step by which the “window” moves is referred to as the stride; typically the stride is smaller than the window, so the windows overlap. A single such convolution (a filter) therefore has only 9 weights (27 for color images), plus possibly a bias term. Its output is again an “image” of roughly the same size as the input image (the exact size depends on the stride and on how we treat the edges of the image). Typically, a convolutional layer does not consist of just one such filter but of several that are applied to the same image, which allows us to process the image using many convolutions at once. The result is an image with dimensions similar to the input image, but with a larger number of channels, one per filter. Further convolutional layers can then be applied to it in the same way.
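
A minimal NumPy sketch of this operation follows. It makes several assumptions not stated above: the edges are not padded (so the output is slightly smaller than the input), there is no bias term, and the function name `conv2d`, the array layout, and the choice of 8 convolutions per layer are illustrative only. Real libraries implement the same idea far more efficiently.

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    """Apply several convolutions to one image, without padding.

    image:   array of shape (H, W, C_in)
    kernels: array of shape (K, K, C_in, C_out), one K x K window per output channel
    returns: array of shape (H_out, W_out, C_out)
    """
    h, w, c_in = image.shape
    k, k2, kc, c_out = kernels.shape
    assert k == k2 and kc == c_in
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    out = np.zeros((h_out, w_out, c_out))
    for i in range(h_out):
        for j in range(w_out):
            # The same weights are reused at every window position.
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Dot product of the patch with the weights of each convolution.
            out[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
image = rng.random((100, 100, 3))          # small color image
kernels = rng.normal(size=(3, 3, 3, 8))    # 8 convolutions with 3 x 3 windows
features = conv2d(image, kernels)

print(features.shape)          # (98, 98, 8)
print(kernels[..., 0].size)    # 27 weights per convolution on a color image
```

Without padding, a $3 \times 3$ window fits at 98 positions along each side of a $100 \times 100$ image, which is why the output here is $98 \times 98$ rather than exactly the input size.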

If we only used convolutional layers, the images would keep the same size and we would not be able to do much with them. Therefore, convolutional layers alternate with sub-sampling (pooling) layers in convolutional networks. The most common pooling layer is max-pooling: like a convolutional layer, it is applied to all parts of the image (e.g. windows of $2 \times 2$ pixels), and it returns the maximum of each. Unlike in convolutional layers, however, the windows of sub-sampling layers typically do not overlap, so a $2 \times 2$ max-pooling layer cuts the resolution in half in each dimension. By alternating convolutional and sub-sampling layers in this way, we finally get a relatively small representation of the input image, to which we can apply fully connected layers.
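
Non-overlapping max-pooling can be sketched with a single reshape in NumPy. The function name `max_pool` and the decision to drop rows or columns that do not fill a whole window are assumptions of this sketch.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max-pooling over size x size windows.

    feature_map: array of shape (H, W, C); returns shape (H // size, W // size, C).
    """
    h, w, c = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop trailing rows/columns that do not fill a whole window (assumption).
    trimmed = feature_map[:h_out * size, :w_out * size, :]
    # Split each spatial axis into (window index, position inside the window).
    blocks = trimmed.reshape(h_out, size, w_out, size, c)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)   # toy 4 x 4 image, one channel
pooled = max_pool(x)

print(pooled[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```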

What exactly do convolutions do? If we visualize the activations of neurons in the individual layers of a convolutional network as pictures, we see that the first layers after the input behave as edge detectors, while individual neurons in deeper layers can detect the presence of more complex objects.
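
To give a feeling for what an edge detector looks like as a $3 \times 3$ window, the sketch below applies a hand-written Sobel kernel to a toy image. This kernel is a classical hand-designed filter used here only as an illustration; the weights a trained network actually learns are not given in the text and will differ.

```python
import numpy as np

# Hand-written Sobel kernel: responds to brightness changes from left to right.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy 6 x 6 grayscale image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Slide the 3 x 3 window over all valid positions (stride 1, no padding).
response = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        response[i, j] = np.sum(image[i:i + 3, j:j + 3] * sobel_x)

print(response[0])   # [0. 4. 4. 0.]
```

The response is zero in the flat regions and large only where the window straddles the vertical edge.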

Adversarial Examples

Although convolutional neural networks achieve very good results in image processing, they have one interesting property (vulnerability): they are prone to so-called adversarial examples, i.e. images that are modified so slightly that a person is typically unable to recognize the difference, but to which the neural network gives different responses than to the original images.

The original explanation for the existence of such examples was that neural networks are highly non-linear and thus small changes can cause their outputs to change significantly. But it turns out that other machine learning models, including linear models, are also prone to adversarial examples. So the reason for adversarial examples may be exactly the opposite: neural networks behave very linearly, and many small changes spread across the input dimensions add up to a large change in the output, which then leads to misclassification.
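
The linear argument can be made concrete with a short calculation. Consider a linear model with weight vector $w$ and an input $x$ changed by a perturbation $\eta$ whose entries are all bounded by $\varepsilon$ (the symbols $w$, $\eta$, $n$ and $m$ are introduced here for illustration only):

$$w^\top (x + \eta) = w^\top x + w^\top \eta, \qquad \max_{\|\eta\|_\infty \le \varepsilon} w^\top \eta = \varepsilon \, \|w\|_1 \quad \text{attained at } \eta = \varepsilon \cdot \mathrm{sign}(w).$$

If the $n$ weights have average magnitude $m$, the output can shift by $\varepsilon m n$. This grows with the input dimension $n$ even though no single input value changes by more than $\varepsilon$; for a $100 \times 100$ color image, $n = 30{,}000$.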

A very popular technique for generating adversarial examples, FGSM (Fast Gradient Sign Method), is based on approximating neural networks by linear models. It computes the gradient of the error function $J$ with respect to the input of the neural network and adds $\varepsilon \cdot \mathrm{sign}(\nabla J)$ to this input, where $\nabla J$ is this gradient. It turns out that even small changes with $\varepsilon < 0.1$ can easily confuse many neural network models that otherwise give very good results.
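
The sketch below applies the FGSM update to the simplest possible model, a logistic regression with random weights, because its input gradient can be written by hand. The model, its weights, the input, and the value $\varepsilon = 0.1$ are all assumptions made for illustration. For a real neural network the gradient would come from automatic differentiation, and the perturbed image would usually also be clipped to the valid pixel range.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One FGSM step for logistic regression p = sigmoid(w.x + b).

    With the cross-entropy error J, the gradient with respect to the
    input is (p - y) * w, so no automatic differentiation is needed here.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
n = 1000                                   # input dimension (toy value)
w = rng.normal(size=n) / np.sqrt(n)        # random weights, not a trained model
b = 0.0
x = rng.random(n)                          # toy "image" with values in [0, 1]
y = 1.0 if w @ x + b > 0 else 0.0          # use the model's own prediction as the label

x_adv = fgsm(x, y, w, b, eps=0.1)

print("largest change per input value:", np.max(np.abs(x_adv - x)))
print("model output before:", sigmoid(w @ x + b))
print("model output after: ", sigmoid(w @ x_adv + b))
```

Every input value moves by exactly $\varepsilon$, yet the weighted sum inside the model shifts by $\varepsilon$ times the sum of the absolute weights, which is typically enough to push the output across the decision threshold in this toy setting.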

The existence of adversarial examples can be a problem for the application of neural networks in areas where errors can have serious consequences, such as autonomous vehicles. It even turns out that adversarial examples can also be created in the real world: for example, special glasses can confuse facial recognition models, and special stickers can make a vehicle fail to detect a line separating lanes, or detect a line where there is none.

Artistic style transfer

One nice application of neural networks is transferring artistic style between images. Imagine, for example, that you have a photograph and you would like to turn it into a painting in the style of Picasso. Activations in the inner layers of a neural network can be used to transfer the style. It turns out that the activations themselves correspond to the content of the image, and the correlations between these activations correspond to the style. Style transfer is then defined as an optimization problem where the goal is to find (by changing the input image) activations that are similar to those of the original photo while having correlations similar to those of the image in the desired style. Depending on which layers of the network we choose for this optimization, we get different parts of the style, from brush strokes in the first layers of the network to colors and image distortions in the deeper layers.
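
The two parts of the objective can be sketched as follows. One common way to measure correlations between activations is the Gram matrix of the channels of a layer, which is what this sketch uses. The normalization, the loss weights `alpha` and `beta`, and the random arrays standing in for real network activations are all assumptions for illustration.

```python
import numpy as np

def gram_matrix(activations):
    """Correlations between channels of one layer.

    activations: array of shape (H, W, C); returns an array of shape (C, C).
    """
    h, w, c = activations.shape
    flat = activations.reshape(h * w, c)
    return flat.T @ flat / (h * w)         # normalization is an assumption

def content_loss(act_generated, act_photo):
    # Content: the activations themselves should match the photo.
    return np.mean((act_generated - act_photo) ** 2)

def style_loss(act_generated, act_style):
    # Style: only the correlations between activations should match.
    return np.mean((gram_matrix(act_generated) - gram_matrix(act_style)) ** 2)

rng = np.random.default_rng(0)
act_photo = rng.random((8, 8, 4))          # stand-ins for activations of one layer
act_style = rng.random((8, 8, 4))
act_generated = act_photo.copy()           # start the optimization from the photo

alpha, beta = 1.0, 10.0                    # hypothetical loss weights
total = (alpha * content_loss(act_generated, act_photo)
         + beta * style_loss(act_generated, act_style))
print(total)

# The Gram matrix ignores where in the image a feature occurs:
# shuffling the spatial positions leaves it unchanged.
perm = rng.permutation(64)
act_shuffled = act_style.reshape(64, 4)[perm].reshape(8, 8, 4)
print(np.allclose(gram_matrix(act_shuffled), gram_matrix(act_style)))   # True
```

In the full method, `total` would be minimized by gradient descent on the pixels of the input image, with the activations recomputed by the network at every step.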

Generative Adversarial Networks

Today, so-called Generative Adversarial Networks are often used to transfer style between images and to generate images. They consist of two parts: a generator and a discriminator. The task of the generator is to generate images that are similar to the images in some training set, and the task of the discriminator is to decide whether the presented image was generated by the generator or came from the training set. The generator then tries to maximize the error of the discriminator, and the discriminator tries to minimize its own error, i.e. it tries to detect the generated images. In this way, both networks train each other. We typically do not use the discriminator after training.
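
The two opposing objectives can be written down as a pair of error functions. The sketch below assumes the discriminator outputs the probability that an image comes from the training set and uses the binary cross-entropy as its error. The function names and the example probabilities are made up for illustration, and no networks are actually trained here.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Error of the discriminator (binary cross-entropy).

    d_real: its outputs on training images (should be close to 1)
    d_fake: its outputs on generated images (should be close to 0)
    """
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Minimizing this maximizes the discriminator's error on generated images."""
    return np.mean(np.log(1.0 - d_fake))

d_real = np.array([0.9, 0.8])              # made-up discriminator outputs

d_fake_detected = np.array([0.1, 0.2])     # discriminator spots the generated images
d_fake_fooled = np.array([0.9, 0.8])       # discriminator is fooled by them

# The discriminator's error is higher when it is fooled ...
print(discriminator_loss(d_real, d_fake_detected) < discriminator_loss(d_real, d_fake_fooled))   # True
# ... and that is exactly the situation the generator prefers.
print(generator_loss(d_fake_fooled) < generator_loss(d_fake_detected))                           # True
```

During training, the two losses would be minimized in alternation: one step updating the discriminator's weights, then one step updating the generator's weights.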