Foreground extraction has been an active research area in the computer vision community for the last two decades. Current state-of-the-art methods usually follow one of two methodologies for foreground extraction: (a) Semantic segmentation, and (b) Alpha matting. The following sections discuss each of these methods in detail.

Figure 1: Example of Image Segmentation; image taken from [1]


Semantic Segmentation
In this method, similar objects are grouped into the same class. For instance, all the cars in Figure 1 are grouped within one class and labeled red, while all the trees are labeled green. Semantic segmentation networks are primarily modeled with a U-Net framework as shown in Figure 2: a series of CNN blocks arranged in an encoder-decoder architecture.

Figure 2: SegNet architecture for Semantic segmentation [2]

Semantic segmentation performs well on various image segmentation tasks that require pixel-level manipulation. However, it cannot precisely extract foreground objects with sub-pixel accuracy. Human hair segmentation is a good example: the binary nature of the predicted mask produces a harsh edge around the hair boundary regions, leaving undesirable artifacts.
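This limitation can be illustrated with a toy one-dimensional example (the pixel values below are made up for illustration): thresholding a soft opacity profile into a binary mask discards all sub-pixel coverage information at the boundary.

```python
import numpy as np

# A soft opacity profile across a hair-like boundary: pixels transition
# gradually from foreground (1.0) to background (0.0).
alpha = np.array([1.0, 1.0, 0.8, 0.5, 0.2, 0.0, 0.0])

# Semantic segmentation yields a hard, binary decision per pixel.
binary_mask = (alpha >= 0.5).astype(float)

foreground = np.full_like(alpha, 200.0)  # bright hair color
background = np.full_like(alpha, 50.0)   # dark backdrop

# Compositing with the true alpha blends smoothly at the boundary...
soft = alpha * foreground + (1 - alpha) * background
# ...while the binary mask produces an abrupt 200 -> 50 jump (a harsh edge).
hard = binary_mask * foreground + (1 - binary_mask) * background

print(soft)  # [200. 200. 170. 125.  80.  50.  50.]
print(hard)  # [200. 200. 200. 200.  50.  50.  50.]
```

The smooth transition in `soft` is exactly what a binary segmentation mask cannot reproduce, which motivates alpha matting in the next section.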


Alpha Matting
Image matting is the process of accurately separating an image into foreground and background, which involves estimating the foreground opacity, known as the alpha matte [3]. The pixel values of the alpha matte lie between 0 and 1. The observed image is modeled as a per-pixel convex combination of foreground and background:

Ii = αi Fi + (1 − αi) Bi        (1)

where Ii, Fi, Bi, and αi represent the observed color value, the foreground pixel value, the background pixel value, and the foreground opacity at pixel i, respectively.

The problem is highly ill-posed and under-constrained, as seen in Equation (1): for a 3-channel RGB image, seven values must be estimated at each pixel (a foreground and a background value for each of the red, green, and blue channels, alongside the foreground opacity α) from only three observed color values [4]. Therefore, an additional rough segmentation mask, known as a trimap, is used to constrain the solution space. Typically, a trimap is labeled with three values, as shown in Figure 3: white and black pixels denote the known fully opaque and fully transparent pixels respectively, and gray pixels mark the unknown region where the foreground opacity must be estimated from the known foreground and background regions.
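As a concrete sketch of Equation (1), using made-up pixel values: the forward compositing model is trivial to evaluate, and alpha has a closed-form least-squares solution only when both F and B are known at a pixel, which is exactly the constraint the trimap's known regions supply.

```python
import numpy as np

# Hypothetical RGB values at a single boundary pixel (Equation 1).
F = np.array([180.0, 150.0, 120.0])  # foreground color
B = np.array([30.0, 60.0, 90.0])     # background color
alpha = 0.6                          # foreground opacity

# Forward model: I_i = alpha_i * F_i + (1 - alpha_i) * B_i
I = alpha * F + (1 - alpha) * B
print(I)  # [120. 114. 108.]

# Inverse problem: with F and B known, alpha has a least-squares solution
#   alpha = (I - B) . (F - B) / ||F - B||^2
# Without F and B, there are 7 unknowns per pixel and only 3 observations.
alpha_est = np.dot(I - B, F - B) / np.dot(F - B, F - B)
print(alpha_est)  # 0.6
```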

Figure 3: A few sample results of image matting, taken from the Web


Image matting is, no doubt, a very good technique for extracting the foreground object with sub-pixel precision. However, the computation of the alpha matte is quite challenging due to the following factors.

  • A trimap can either be created manually with user interaction or predicted by a classification network. The former procedure is time-consuming, while the latter always casts doubt on the accuracy of the classification.
  • The accuracy of the constructed alpha matte is inversely related to the number of unknown pixels in the trimap. A trimap with a broader gray bandwidth often fails to correctly interpolate the foreground opacity of unknown pixels that lie far from the known regions.
  • Learning-based methods require a large training dataset to learn the underlying mapping between the input and output spaces. To the best of our knowledge, only a few training datasets are available for image matting [6, 7], and the number of training images in these datasets is very small.
  • Foregrounds usually differ from the underlying background in visual appearance, and color-based sampling methods often use the color difference as a measure to identify opaque and transparent regions. However, the color of a foreground object may closely match that of its neighboring background; in such cases, the color difference lies within the empirical threshold and may produce false alarms.
  • Most deterministic matting algorithms follow an iterative process to converge to a correct alpha matte, and therefore their computational time is very high, especially for high-resolution images.
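To make the trimap-driven estimation concrete, the sketch below is a deliberately simplified, hypothetical version of color-sampling matting (not any of the cited algorithms): for each unknown pixel on a 1-D grayscale scanline, it samples the nearest known foreground and background values and inverts Equation (1) for alpha.

```python
import numpy as np

def estimate_alpha_1d(image, trimap):
    """Toy sampling-based matting on a 1-D grayscale scanline.

    trimap: 1 = known foreground, 0 = known background, 0.5 = unknown.
    For each unknown pixel, sample the nearest known F and B values and
    solve Equation (1) for alpha: alpha = (I - B) / (F - B).
    """
    fg_idx = np.where(trimap == 1)[0]
    bg_idx = np.where(trimap == 0)[0]
    alpha = trimap.astype(float).copy()
    for i in np.where(trimap == 0.5)[0]:
        F = image[fg_idx[np.argmin(np.abs(fg_idx - i))]]  # nearest known F
        B = image[bg_idx[np.argmin(np.abs(bg_idx - i))]]  # nearest known B
        a = (image[i] - B) / (F - B)                      # invert Eq. (1)
        alpha[i] = np.clip(a, 0.0, 1.0)
    return alpha

# Hypothetical scanline crossing a soft boundary: F ~ 200, B ~ 40.
image = np.array([200.0, 200.0, 160.0, 120.0, 80.0, 40.0, 40.0])
trimap = np.array([1.0, 1.0, 0.5, 0.5, 0.5, 0.0, 0.0])
print(estimate_alpha_1d(image, trimap))  # [1.   1.   0.75 0.5  0.25 0.   0.  ]
```

Note that when the sampled F and B are nearly equal, the denominator (F − B) vanishes and alpha becomes ill-defined, which is precisely the color-ambiguity failure mode described in the list above.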

Image Matting at Flixstock

We use image matting for foreground extraction at high resolution (4K) for multiple use cases such as garment extraction, hair masking, and human matting. It also forms a key component of our data augmentation pipelines. Sample results are shown in Figure 5.


References
[1] Lateef, Fahad, and Yassine Ruichek. “Survey on semantic segmentation using deep learning techniques.” Neurocomputing 338 (2019): 321-348.

[2] Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. “Segnet: A deep convolutional encoder-decoder architecture for image segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39, no. 12 (2017): 2481-2495.

[3] Li, Xiaoqiang, Jide Li, and Hong Lu. “A survey on natural image matting with closed-form solutions.” IEEE Access 7 (2019): 136658-136675.

[4] Feng, Xiaoxue, Xiaohui Liang, and Zili Zhang. “A cluster sampling method for image matting via sparse coding.” In European Conference on Computer Vision, pp. 204-219. Springer, Cham, 2016.

[5] Aksoy, Yagiz, Tunc Ozan Aydin, and Marc Pollefeys. “Designing effective inter-pixel information flow for natural image matting.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 29-37. 2017.


[7] Xu, Ning, Brian Price, Scott Cohen, and Thomas Huang. “Deep image matting.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2970-2979. 2017.