It is remarkable how advances in Generative Adversarial Networks (GANs) have made it possible to generate realistic, high-resolution images. At FlixStock, we rely on GANs to generate realistic, high-quality model images. When generating images with GANs, it is highly desirable to be able to control certain features of the generated image.

A generator learns to map a vector in latent space to an image. As training progresses, vectors that are close in latent space are mapped to similar images. This property allows us to explore the latent space more deeply. As can be seen in the following t-SNE representation of the latent space of a GAN trained on the MNIST dataset, images belonging to the same class are clustered together in the space.

Source: Hackernoon


In machine learning, a latent space is an abstract, multi-dimensional space of feature values that we cannot interpret directly. As the generator learns to map vectors drawn from the latent space to images that the discriminator cannot distinguish from real images, the latent space gradually comes to encode a meaningful representation of images. Each visual feature is controlled by a combination of values across many latent dimensions, with magnitudes determined by the weights of the network. Because features are spread across dimensions in this way, it is difficult to visualize how they are encoded in latent space.

As stated in [1], “Walking on the manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the image generations (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations.”

In [1], interpolation between a series of nine random points in Z is shown to produce smooth transitions between the generated images, from which we can conclude that the GAN has learned relevant and interesting representations.

Figure 2: Image taken from [1]
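The interpolation described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual code: `interpolate_latents` is a hypothetical helper, and the call to the generator that would turn each vector into an image is omitted.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps=9):
    """Linearly interpolate between two latent vectors.

    Returns an array of shape (steps, dim); feeding each row to the
    generator yields a smooth transition between the two images.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - a) * z_start + a * z_end for a in alphas])

# Example: two random points in a 100-dimensional Z space.
rng = np.random.default_rng(0)
z0, z1 = rng.standard_normal(100), rng.standard_normal(100)
path = interpolate_latents(z0, z1, steps=9)
```

Sharp jumps between consecutive rows of `path`, once decoded, would be the sign of memorization mentioned in the quote above.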

As shown in [2], linear arithmetic operations on latent-space vectors can be used to control features. The authors demonstrated that vector("King") – vector("Man") + vector("Woman") resulted in a vector whose nearest neighbour was the vector for "Queen".
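The arithmetic can be illustrated with a deliberately constructed toy space. The vectors below are not real word embeddings; they are hand-built so that "royalty" and "gender" occupy separate axes, which is exactly the structure the arithmetic exploits.

```python
import numpy as np

# Toy 2-D "latent" axes: one for royalty, one for gender (illustrative only).
royal = np.array([1.0, 0.0])
male = np.array([0.0, 1.0])
female = np.array([0.0, -1.0])

vectors = {
    "king": royal + male,
    "queen": royal + female,
    "man": male,
    "woman": female,
}

def nearest(query, vocab):
    """Return the vocabulary entry closest to `query` by Euclidean distance."""
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - query))

result = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(result, vectors))  # -> "queen" in this constructed toy space
```

In a real embedding space the match is only approximate, but the same nearest-neighbour lookup applies.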

At FlixStock, we study and interpret latent spaces to control features of generated images. We can either specify the magnitude of a feature or transfer a feature between images. For example, we can double the intensity of a smile on a face, or specify a source image whose smile we want to transfer to a target image.

There are also challenges associated with latent-space exploration, such as feature entanglement. Feature entanglement occurs when the latent space learns correlations between features that are logically independent but, owing to limitations of the dataset, tend to occur together, so they cannot be altered independently. For example, if a human-face dataset shows a general trend of male faces with short hair and female faces with long hair, then altering the latent vector to give a male face long hair would also end up changing the gender, producing an image of a woman. Modern architectures such as StyleGAN employ techniques to encourage feature disentanglement.


In latent space, each attribute can be represented as a vector. To deduce the direction of that vector, we train a binary classifier for the given attribute: a discriminative model that takes latent vectors as input and predicts the target attribute. The model learns a boundary that separates the two attribute classes. For example, if we fit a logistic regression, the learned weight vector is orthogonal to the decision boundary, and that orthogonal direction is the direction along which the attribute varies.
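A minimal sketch of this idea, using synthetic data in place of real latent vectors and attribute labels, and a hand-rolled logistic regression so the example stays self-contained. The ground-truth direction and the labelling rule are assumptions made purely for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 16, 2000

# Synthetic stand-in for latent vectors: the (hypothetical) attribute is
# positive exactly when the projection onto a hidden ground-truth axis is.
true_direction = rng.standard_normal(dim)
true_direction /= np.linalg.norm(true_direction)
Z = rng.standard_normal((n, dim))            # "latent" vectors
y = (Z @ true_direction > 0).astype(float)   # attribute labels (e.g. smiling)

# Minimal logistic regression fitted by gradient descent on the log-loss.
w = np.zeros(dim)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w)))       # predicted probabilities
    w -= 0.1 * (Z.T @ (p - y)) / n           # gradient step

# The weight vector is orthogonal to the decision boundary, so after
# normalisation it recovers the direction along which the attribute varies.
attribute_direction = w / np.linalg.norm(w)
alignment = attribute_direction @ true_direction
```

With real GAN latents the labels would come from an attribute classifier run on the generated images, but the recipe is the same.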

Alternatively, we can deduce the direction by sampling a large number of latent vectors for each attribute class; the difference between the centroids of the two sets gives the direction of change of that attribute. For example, by sampling many latent vectors of smiling faces and of non-smiling faces, we obtain a direction along which we can control the smile attribute of a face.
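The centroid-difference method is even simpler to sketch. Again the data is synthetic: `smile_axis` is an assumed ground-truth direction used only to check that the recipe recovers it.

```python
import numpy as np

def attribute_direction(z_positive, z_negative):
    """Difference between the centroids of attribute-positive and
    attribute-negative latent samples, normalised to unit length."""
    direction = z_positive.mean(axis=0) - z_negative.mean(axis=0)
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(1)
smile_axis = np.zeros(64)
smile_axis[0] = 1.0  # hypothetical ground-truth smile direction

# Synthetic "smiling" latents are shifted along the smile axis.
z_smiling = rng.standard_normal((500, 64)) + 2.0 * smile_axis
z_neutral = rng.standard_normal((500, 64))

direction = attribute_direction(z_smiling, z_neutral)
# Editing then moves a latent along this direction:
# z_edited = z + alpha * direction   (alpha > 0 strengthens the smile).
```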

We can also let gradients flow through the discriminative network into the source latent vector so that its target attribute is modified. This amounts to optimising the source latent vector with respect to the target attribute.
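A toy version of this optimisation, assuming the discriminative network is a simple linear classifier so the gradient can be written by hand. With a deep classifier the same update would come from backpropagation through the network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edit_latent(z, w, b=0.0, lr=0.5, steps=50):
    """Gradient-ascend the classifier score sigmoid(w.z + b) with respect
    to the latent vector z itself, nudging z toward the target attribute."""
    z = z.copy()
    for _ in range(steps):
        p = sigmoid(w @ z + b)
        # d/dz log sigmoid(w.z + b) = (1 - p) * w
        z += lr * (1.0 - p) * w
    return z

rng = np.random.default_rng(2)
w = rng.standard_normal(32)    # toy classifier weights (assumed given)
z0 = rng.standard_normal(32)   # source latent vector
z1 = edit_latent(z0, w)        # edited latent: higher attribute score
```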

We can also build a network on top of our discriminative network that transfers a target attribute from a source image to a target image.
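Learned network aside, a minimal non-learned sketch of such a transfer, assuming an attribute direction has already been found by one of the methods above: swap the target latent's component along the attribute direction for the source's, leaving everything orthogonal to it untouched.

```python
import numpy as np

def transfer_attribute(z_target, z_source, direction):
    """Replace the target's component along `direction` with the source's;
    components orthogonal to `direction` (identity, pose, ...) are kept."""
    d = direction / np.linalg.norm(direction)
    shift = (z_source @ d) - (z_target @ d)
    return z_target + shift * d

# Toy check in 3-D: only the first coordinate (the attribute axis) changes.
d = np.array([1.0, 0.0, 0.0])
z_target = np.array([0.0, 2.0, 3.0])
z_source = np.array([5.0, -1.0, 0.0])
z_edited = transfer_attribute(z_target, z_source, d)
```

This works cleanly only when the attribute is well disentangled from the rest of the latent code, which is why a learned transfer network can do better in practice.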

Following the grid, we can see that hair colour is controlled along one axis and smile intensity along the other, without affecting the identity of the person.

Figure 3: Variations in smile and hair colour feature of face


[1] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.