Automated fashion tagging is the process of automatically generating a product's fashion attributes from an image. These tags are valuable for cataloging.

With advances in deep learning, specifically in convolutional neural network (CNN) architectures, machines can now understand and recognize even complex visual attributes in an image. Harnessing this capability, we have built a fashion auto-tagging architecture that recognizes multiple visual semantic attributes of the clothing items in an image.

Problem Statement

Common fashion attributes include garment type, color, texture, and pattern. Other attributes are category-specific, such as neckline, sleeve type, sleeve length, and hemline. Figure 1 illustrates the fashion attributes of a women's dress.

Figure 1: An illustration of fashion attributes for a women's dress.


To generate precise fashion tags, we designed a two-stage CNN architecture. In the first stage, a fine-tuned Faster R-CNN [1] model identifies the region of interest (the garment) and recognizes the garment category. In the second stage, once the garment category is known, we select the Inception-v3 [2] multi-label model trained for that category and use it to generate fashion tags for the detected garment.

Figure 3: Flowchart of the proposed auto fashion-tagging algorithm.
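
The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the names `Detection`, `crop`, `tag_image`, `fake_detector`, and `fake_dress_tagger` are hypothetical, the `min_score` threshold is an assumed parameter, and the detector and tagger are stubs standing in for the fine-tuned Faster R-CNN and the per-category Inception-v3 models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[int]]          # toy image: rows of pixel values


@dataclass
class Detection:
    box: Box
    category: str
    score: float


def crop(image: Image, box: Box) -> Image:
    """Cut the region of interest out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def tag_image(
    image: Image,
    detector: Callable[[Image], List[Detection]],
    taggers: Dict[str, Callable[[Image], Dict[str, str]]],
    min_score: float = 0.5,  # assumed threshold, not taken from the article
) -> List[dict]:
    """Stage 1: detect garments. Stage 2: run the tagger for each category."""
    results = []
    for det in detector(image):
        # Skip weak detections and categories with no trained tagger.
        if det.score < min_score or det.category not in taggers:
            continue
        region = crop(image, det.box)
        tags = taggers[det.category](region)
        results.append({"category": det.category, "box": det.box, "tags": tags})
    return results


# Stub models (hypothetical) so the sketch runs end to end.
def fake_detector(image: Image) -> List[Detection]:
    return [
        Detection(box=(1, 0, 3, 2), category="dress", score=0.9),
        Detection(box=(0, 0, 1, 1), category="shoe", score=0.2),
    ]


def fake_dress_tagger(region: Image) -> Dict[str, str]:
    return {"neckline": "v-neck", "sleeve_length": "short"}


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
results = tag_image(image, fake_detector, {"dress": fake_dress_tagger})
print(results)
```

Keeping the taggers in a dictionary keyed by garment category mirrors the "select the appropriate model" step: adding a new category only requires registering another tagger.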

We use a common multi-label classifier architecture. The classifier adopted in our approach is Inception-v3 [2], a widely used image recognition model that has been shown to attain high accuracy on the ImageNet dataset.

Inception-v3 Deep Neural Network architecture [2]

We have modified the above Inception-v3 architecture to predict multiple garment attributes. The key idea of this approach is to use a common feature vector extractor and train multiple classifiers together, so that multiple tags are generated for the same garment with high accuracy. Figure 8 depicts our modified multi-label classifiers for three attribute sub-classes.

Modified Inception-v3 multi-label classifier network architecture.
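
To make the shared-extractor idea concrete, here is a minimal pure-Python sketch of several classifier heads reading the same feature vector. It rests on several assumptions: the class `MultiHeadClassifier`, the attribute names, and the label lists are hypothetical; each head is modeled as a softmax over its labels; and the joint training loss is taken to be the sum of per-head cross-entropies. The article does not specify these details. The 8-dimensional feature vector is a toy stand-in for the much larger vector an Inception-v3 backbone would produce.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, features):
    """One fully connected layer: weights @ features + bias."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


class MultiHeadClassifier:
    """Several independent heads that all read the same feature vector."""

    def __init__(self, feature_dim, heads, seed=0):
        rng = random.Random(seed)
        self.heads = {}
        for name, labels in heads.items():
            weights = [[rng.uniform(-0.1, 0.1) for _ in range(feature_dim)]
                       for _ in labels]
            self.heads[name] = (labels, weights, [0.0] * len(labels))

    def predict(self, features):
        """Return the most likely label and its probability for every head."""
        out = {}
        for name, (labels, weights, bias) in self.heads.items():
            probs = softmax(linear(weights, bias, features))
            best = max(range(len(labels)), key=probs.__getitem__)
            out[name] = (labels[best], probs[best])
        return out

    def loss(self, features, targets):
        """Joint loss: sum of per-head cross-entropies (an assumption)."""
        total = 0.0
        for name, (labels, weights, bias) in self.heads.items():
            probs = softmax(linear(weights, bias, features))
            total -= math.log(probs[labels.index(targets[name])])
        return total


# Hypothetical attribute sub-classes for one garment category.
HEADS = {
    "neckline": ["round", "v-neck", "collar"],
    "sleeve_length": ["sleeveless", "short", "long"],
    "pattern": ["solid", "striped", "floral", "checked"],
}

model = MultiHeadClassifier(feature_dim=8, heads=HEADS)
features = [0.2, -0.1, 0.7, 0.0, 0.3, -0.5, 0.9, 0.1]  # from the shared extractor
predictions = model.predict(features)
joint_loss = model.loss(
    features,
    {"neckline": "v-neck", "sleeve_length": "short", "pattern": "floral"},
)
print(predictions)
print(joint_loss)
```

Because the heads share one feature vector, the expensive backbone runs once per garment regardless of how many attributes are predicted.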


We adopted transfer learning as our training strategy: rather than training from scratch, we fine-tune a generic model that was pre-trained on the ImageNet dataset.
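
The following toy example illustrates one common form of transfer learning: the pre-trained layers are frozen and only a new classifier head is trained on the target task. It is a sketch under assumptions, not our training code. `BACKBONE` is a small fixed random projection standing in for ImageNet-trained layers, the data is synthetic, and the names `extract_features`, `train_head`, and `mean_loss` are hypothetical. The article does not state which layers were updated during fine-tuning, so freezing the entire backbone is an assumption made for simplicity.

```python
import math
import random

rng = random.Random(0)

# Stand-in for pre-trained layers: 3 features computed from 4 inputs.
# These weights are never updated below (the backbone is "frozen").
BACKBONE = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]


def extract_features(x):
    """Forward pass through the frozen backbone."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in BACKBONE]


def predict(head, features):
    """Logistic-regression head on top of the backbone features."""
    z = sum(w * f for w, f in zip(head["w"], features)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(head, data):
    """Mean binary cross-entropy over the dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(head, extract_features(x)), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train_head(head, data, lr=0.5, epochs=50):
    """Stochastic gradient descent that updates only the head parameters."""
    for _ in range(epochs):
        for x, y in data:
            features = extract_features(x)
            error = predict(head, features) - y  # gradient of the loss w.r.t. z
            head["w"] = [w - lr * error * f for w, f in zip(head["w"], features)]
            head["b"] -= lr * error
    return head


# Synthetic target task: the label depends on the first backbone feature.
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(20)]
data = [(x, 1 if extract_features(x)[0] > 0 else 0) for x in inputs]

backbone_before = [row[:] for row in BACKBONE]
head = {"w": [0.0, 0.0, 0.0], "b": 0.0}
loss_before = mean_loss(head, data)
train_head(head, data)
loss_after = mean_loss(head, data)
print(loss_before, loss_after)
```

In a full fine-tuning setup, some or all backbone layers would also be updated, usually with a smaller learning rate than the new head.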


[1] Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in neural information processing systems. 2015.

[2] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.