Convolutional Neural Networks (CNNs) are a powerful class of models that perform very well across a wide range of tasks such as classification, segmentation and object detection. For any model that distinguishes between images, it is highly desirable that, in the learnt latent space, distinct properties such as texture and shape are disentangled from object pose and deformations. Pooling layers give CNNs a degree of this property by making them tolerant of small changes in the position of features in an image.

However, pooling layers typically have a tiny field of view, e.g. a simple 2×2 pooling region. Spatial invariance is therefore only achieved through a very deep hierarchy of convolutional and pooling layers. Even then, CNNs are not invariant to large spatial transformations of the input data, especially in the early layers of the network.
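To make the small field of view concrete, here is a minimal NumPy sketch. The helper `max_pool_2x2` and the toy 8×8 image are illustrative assumptions, not part of any particular framework. It shows that a 2×2 max pool absorbs a one-pixel shift that stays inside a pooling window, but not a three-pixel shift.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy input (assumption for illustration): a single bright pixel on an 8x8 canvas.
image = np.zeros((8, 8))
image[2, 2] = 1.0

# Shift right by one pixel: the feature stays inside the same 2x2 window.
small_shift = np.roll(image, 1, axis=1)
# Shift right by three pixels: the feature lands in a different window.
large_shift = np.roll(image, 3, axis=1)

pooled = max_pool_2x2(image)
same_after_small = np.array_equal(pooled, max_pool_2x2(small_shift))
same_after_large = np.array_equal(pooled, max_pool_2x2(large_shift))
print(same_after_small, same_after_large)
```

The pooled output is unchanged for the small shift and changes for the large one, which is why a single pooling layer cannot absorb large translations on its own.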

A Spatial Transformer Network (STN) is a CNN extended with a module that gives the network explicit spatial transformation capabilities. The STN uses an input feature map to predict the parameters of a spatial transformation, such as translation, scaling, rotation or more general non-rigid deformations. These predicted parameters are then used to warp the input feature map so that it is closer to the distribution the subsequent layers expect, which in turn improves inference.

Fig 1: Taken from [a]. a) Inputs that are randomly translated, scaled, rotated and filled with clutter; b) the transformation estimated by the network; c) the inputs warped using the predicted transform parameters; d) the classification results on the warped inputs, which are more accurate than they would be on the raw inputs.

Spatial Transformer Networks (STNs)

A spatial transformer is a differentiable module that applies a spatial transformation to a feature map and can be trained end-to-end with the rest of the network. It consists of three main parts: a localisation network (L), a grid generator (G) and a sampler (S).

Fig 2: Taken from [a]. U is the input feature map, which is fed to L to predict the warping parameters theta. G uses theta to generate a grid of sampling locations, and the sampler S produces the output feature map V by sampling U at those grid locations.

  1. Localisation net (L): takes a feature map as input and passes it through multiple hidden layers to predict the parameters of the spatial transformation (theta).
  2. Grid generator (G): uses the predicted theta to create a sampling grid, i.e. the set of points in the input that should be sampled to produce the transformed output.
  3. Sampler (S): produces the output feature map by sampling the input at the grid points provided by G, for example with bilinear interpolation.
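The three parts above can be sketched in a few lines of NumPy. This is a simplified forward pass for a single-channel input under stated assumptions, not the implementation from [a]: `localisation_net` is a hypothetical one-layer stand-in for a real network, coordinates are normalised to [-1, 1], and samples that fall outside the input are treated as zero. The function names are illustrative.

```python
import numpy as np

def localisation_net(u, weights, bias):
    """Toy localisation net (assumption): one linear layer mapping the
    flattened input to six affine parameters, returned as a 2x3 theta."""
    return (weights @ u.ravel() + bias).reshape(2, 3)

def affine_grid(theta, out_h, out_w):
    """Grid generator: map every output pixel to a source location,
    with both expressed in normalised [-1, 1] coordinates."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    source = theta @ target  # shape (2, out_h * out_w)
    return source[0].reshape(out_h, out_w), source[1].reshape(out_h, out_w)

def bilinear_sample(u, x_src, y_src):
    """Sampler: read the input at the source locations with bilinear
    interpolation; locations outside the input contribute zero."""
    h, w = u.shape
    x = (x_src + 1) * (w - 1) / 2  # normalised -> pixel coordinates
    y = (y_src + 1) * (h - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    wx, wy = x - x0, y - y0

    def pixel(yy, xx):
        inside = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        vals = u[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)]
        return np.where(inside, vals, 0.0)

    return ((1 - wy) * (1 - wx) * pixel(y0, x0)
            + (1 - wy) * wx * pixel(y0, x0 + 1)
            + wy * (1 - wx) * pixel(y0 + 1, x0)
            + wy * wx * pixel(y0 + 1, x0 + 1))

u = np.arange(25.0).reshape(5, 5)

# Zero weights and an identity bias make the module start as a no-op.
weights = np.zeros((6, u.size))
bias = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
theta = localisation_net(u, weights, bias)
v_identity = bilinear_sample(u, *affine_grid(theta, 5, 5))

# A translation of 0.5 in normalised x reads one pixel to the right.
theta_shift = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]])
v_shift = bilinear_sample(u, *affine_grid(theta_shift, 5, 5))
```

Note that theta maps output coordinates back to input coordinates, so the sampler pulls values from the input rather than pushing them forward. Every step is built from differentiable operations, which is what allows gradients to flow back to the localisation net during training.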

Applications in automated image editing at Flixstock

Since STNs can be injected into any CNN and be trained end-to-end, they are candidates for a wide variety of use cases like automated image alignment. 

Consider a use case where an input must be aligned with respect to another support input and the transformations possible include translation, rotation, scaling and shear. This means that one can use the STNs to predict the six parameters of a full affine transformation and use the predicted thetas to warp the input such that it aligns with the support input.
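As a rough illustration of where the six parameters come from, the sketch below composes a 2×3 affine matrix from translation, rotation, scale and shear. The function `affine_theta`, its argument names and the composition order are assumptions made for this example. In practice an STN predicts the six matrix entries directly rather than these named factors.

```python
import numpy as np

def affine_theta(tx=0.0, ty=0.0, angle=0.0, sx=1.0, sy=1.0, shear=0.0):
    """Build a 2x3 affine matrix: linear part = rotation @ shear @ scale,
    with the translation (tx, ty) as the last column."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    shear_m = np.array([[1.0, shear], [0.0, 1.0]])
    scale = np.diag([sx, sy])
    linear = rotation @ shear_m @ scale
    return np.hstack([linear, np.array([[tx], [ty]])])

identity = affine_theta()
quarter_turn = affine_theta(angle=np.pi / 2)

# Apply a transform to the point (1, 0) in homogeneous form [x, y, 1].
moved = affine_theta(tx=0.5, sx=2.0) @ np.array([1.0, 0.0, 1.0])
```

Six free numbers (two for translation, four for the linear part) are enough to express any combination of these four operations, which is why a single six-parameter prediction covers the whole alignment use case.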

This technique can also be extended to use a projective transform instead of an affine transform by simply changing the number of predicted parameters from six to eight, which increases the degrees of freedom from six to eight. There is no fixed limit on the number of parameters the STN module can predict, so more complex transformations, such as non-rigid warps (for example, thin plate splines), can also be structured as STNs, as long as the transformation is differentiable with respect to its parameters.
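A minimal sketch of the eight-parameter case, assuming the usual normalisation in which the ninth entry of the 3×3 homography is fixed to 1. The function name `projective_grid` and the normalised [-1, 1] coordinate convention are illustrative choices, not taken from the source.

```python
import numpy as np

def projective_grid(theta8, out_h, out_w):
    """Map every output pixel to a source location using an
    eight-parameter projective transform (homography)."""
    # Append the fixed ninth entry (assumed to be 1) to form the 3x3 matrix.
    H = np.append(np.asarray(theta8, dtype=float), 1.0).reshape(3, 3)
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    xh, yh, wh = H @ target
    # Perspective division turns homogeneous coordinates back into 2D points.
    return (xh / wh).reshape(out_h, out_w), (yh / wh).reshape(out_h, out_w)

# With the two extra parameters at zero, this reduces to the affine case.
x_id, y_id = projective_grid([1, 0, 0, 0, 1, 0, 0, 0], 3, 3)

# A non-zero perspective term compresses the grid towards one side.
x_persp, y_persp = projective_grid([1, 0, 0, 0, 1, 0, 0.5, 0], 3, 3)
```

The only change from the affine grid is the extra row and the division by the third homogeneous coordinate; the sampler that reads the input at these locations stays the same.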

Challenges for STNs

Even though STNs give a CNN architecture the ability to adapt to large spatial transformations in the input, they are not perfect. For example, STNs do not guarantee invariant recognition: in the presence of noise or clutter in the input feature map, inputs will not always be warped in the same manner.


[a] ‘Spatial Transformer Networks’ by Google DeepMind, UK (

[b] ‘Understanding when spatial transformer networks do not support invariance, and what to do about it’ by KTH Royal Institute of Technology, Stockholm, Sweden (