Keypoint estimation is the task of identifying and localizing the significant points of an object in an image [1]. For humans, these keypoints are the body joints, such as the shoulders, elbows, and knees; connecting a set of such keypoints yields the complete skeleton of a person, as shown in Figure 1. For other objects, the keypoints can be corners or other high-frequency details, as shown in Figure 2. Keypoint estimation has a broad range of real-world applications, including human behavior analysis, human tracking, activity recognition, gaming, virtual reality, sports motion analysis, and medical assistance [4].

Figure 1: Human skeleton keypoints [2]

Figure 2: Object keypoints, taken from [3]


In general, deep convolutional architectures follow an encoder-decoder pipeline to regress the keypoint locations. In the first stage, a CNN encoder extracts a set of convolutional feature maps. State-of-the-art methods differ mainly in how they design the decoder stage, which falls into two categories [5]: (1) direct regression-based and (2) heatmap-based.

Direct regression-based methods use a regressor to output the coordinates of each keypoint directly. In heatmap-based methods, the feature maps are passed through a decoder-like architecture that first outputs heatmaps; a heatmap represents the likelihood that a keypoint lies in a given region. A post-processing step is then applied to extract the precise keypoint location from the heatmap, as shown in Figure 3. The direct regression-based and heatmap-based pipelines are depicted in Figure 4(a) and Figure 4(b), respectively.

Figure 3: Heatmap-based estimation example: (a) Input Image, (b) Generated heatmap and (c) Predicted keypoint [1].
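The post-processing step of a heatmap-based method can be illustrated with a minimal sketch. It assumes the simplest possible decoding rule (take the highest-scoring cell and rescale it to the input resolution); real systems often add sub-pixel refinement. The function names, the Gaussian test heatmap, and the sizes used here are illustrative assumptions, not part of any method cited above.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.5):
    """Render a synthetic heatmap: a 2D Gaussian peak centred on (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_heatmap(heatmap):
    """Return (x, y, score) of the highest-scoring heatmap cell (argmax)."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score

def to_image_coords(x, y, heatmap_size, image_size):
    """Rescale heatmap coordinates to the input image resolution."""
    hm_w, hm_h = heatmap_size
    img_w, img_h = image_size
    return x * img_w / hm_w, y * img_h / hm_h

# Hypothetical 48x64 heatmap for one keypoint of a 192x256 input image.
heatmap = gaussian_heatmap(height=64, width=48, cx=20, cy=35)
x, y, score = decode_heatmap(heatmap)
print(x, y, round(score, 3))                          # 20 35 1.0
print(to_image_coords(x, y, (48, 64), (192, 256)))    # (80.0, 140.0)
```

A direct regression-based decoder would skip this step entirely, since it outputs the coordinate pair itself rather than a likelihood map.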



Figure 4: Keypoint estimation pipelines: (a) Direct regression-based, (b) Heatmap-based


Keypoint estimation methods generally follow one of two approaches: (1) Top-Down and (2) Bottom-Up.

The Top-Down approach [8] first detects all the objects in the image as a set of bounding boxes and then applies either a direct regression-based or a heatmap-based method to each bounding box, as depicted in Figure 5. In contrast, the Bottom-Up approach [6, 7] first detects all the keypoints in the image and then assembles them into the skeletons of distinct objects, as shown in Figure 7. The Top-Down and Bottom-Up pipelines are depicted in Figure 6 and Figure 8, respectively.

Figure 5: Top-Down Approach: (a) Input image, (b) Detected persons, (c) Cropped person images, (d) Keypoint estimation results on individual persons and (e) Final results [1].

Figure 6: Top-Down Pipeline
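The control flow of the Top-Down approach can be sketched as follows. This is only an outline under stated assumptions: `fake_detector` and `fake_estimator` are hypothetical stubs standing in for a real object detector and a real single-person keypoint estimator, and the estimator is assumed to return coordinates normalized to the crop, in the range [0, 1].

```python
def crop_to_image(keypoints, box):
    """Map keypoints from normalized crop coordinates to full-image pixels."""
    x_min, y_min, x_max, y_max = box
    width, height = x_max - x_min, y_max - y_min
    return [(x_min + u * width, y_min + v * height) for u, v in keypoints]

def top_down(image, detect_persons, estimate_keypoints):
    """Detect every person first, then estimate keypoints box by box."""
    poses = []
    for box in detect_persons(image):
        crop_keypoints = estimate_keypoints(image, box)
        poses.append(crop_to_image(crop_keypoints, box))
    return poses

# Hypothetical stand-ins for trained models (assumptions for illustration).
def fake_detector(image):
    # Two bounding boxes as (x_min, y_min, x_max, y_max).
    return [(10, 20, 110, 220), (200, 40, 280, 200)]

def fake_estimator(image, box):
    # Two keypoints per person, normalized to the crop.
    return [(0.5, 0.125), (0.25, 0.5)]

poses = top_down(None, fake_detector, fake_estimator)
print(poses[0])  # [(60.0, 45.0), (35.0, 120.0)]
print(poses[1])  # [(240.0, 60.0), (220.0, 120.0)]
```

Because the estimator runs once per detected box, the cost of this scheme grows with the number of objects in the image.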

Figure 7: Bottom-Up Approach: (a) Input image, (b) Keypoints of all persons and (c) Keypoints connected to form the skeleton of each person [1].

Figure 8: Bottom-Up Pipeline
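The assembly stage of the Bottom-Up approach can be illustrated with a deliberately simplified sketch. It groups keypoint candidates by plain Euclidean distance to a root keypoint, which is a stand-in chosen for brevity; the methods cited above use learned association cues instead (for example, Part Affinity Fields in [6]). The keypoint names, the choice of the neck as root, and the greedy rule are all assumptions made for illustration.

```python
import math

def group_keypoints(candidates, root_type="neck"):
    """Greedily assemble typed keypoint candidates into skeletons.

    candidates: list of (keypoint_type, (x, y)) detected over the whole image.
    Each root keypoint starts one skeleton; every other candidate joins the
    nearest skeleton that does not yet contain a keypoint of its type.
    """
    skeletons = [{root_type: pos} for kind, pos in candidates if kind == root_type]
    for kind, pos in candidates:
        if kind == root_type:
            continue
        free = [s for s in skeletons if kind not in s]
        if not free:
            continue  # no skeleton can take this candidate, so it is dropped
        nearest = min(free, key=lambda s: math.dist(s[root_type], pos))
        nearest[kind] = pos
    return skeletons

# Hypothetical detections for two persons, in no particular order.
candidates = [
    ("neck", (50, 40)), ("neck", (150, 42)),
    ("left_shoulder", (40, 45)), ("left_shoulder", (140, 47)),
    ("nose", (151, 30)), ("nose", (50, 28)),
]
for skeleton in group_keypoints(candidates):
    print(skeleton)
```

Distance alone fails when people overlap or stand close together, which is why practical Bottom-Up systems rely on learned pairwise scores for the grouping step.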


Flixstock provides several vision-centric solutions, such as Flixmodel, in which an on-mannequin garment is draped over a human body by an automated architecture. In this application, knowledge of the human skeleton joints [8] and the garment corner points [9], as shown in Figure 9, controls the spatial transformation that correctly aligns the source onto the target.



Figure 9: (a) Human pose estimation and (b) Garment keypoint estimation


[1] Q. Dang, J. Yin, B. Wang and W. Zheng, “Deep learning based 2D human pose estimation: A survey,” in Tsinghua Science and Technology, vol. 24, no. 6, pp. 663-676, Dec. 2019, doi: 10.26599/TST.2018.9010100.

[2] Bearman, Amy L., and Catherine Dong. “Human Pose Estimation and Activity Classification Using Convolutional Neural Networks.” Stanford University (2015).

[3] Tremblay, Jonathan, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. “Deep object pose estimation for semantic robotic grasping of household objects.” arXiv preprint arXiv:1809.10790 (2018).

[4] Gong, Wenjuan, Xuena Zhang, Jordi Gonzàlez, Andrews Sobral, Thierry Bouwmans, Changhe Tu, and El-hadi Zahzah. “Human Pose Estimation from Monocular Images: A Comprehensive Survey.” Sensors 16, no. 12 (2016): 1966.

[5] Chen, Yucheng, Yingli Tian, and Mingyi He. “Monocular human pose estimation: A survey of deep learning-based methods.” Computer Vision and Image Understanding 192 (2020): 102897.

[6] Cao, Zhe, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields.” arXiv preprint arXiv:1812.08008 (2018).

[7] Pishchulin, Leonid, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. “Deepcut: Joint subset partition and labeling for multi person pose estimation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4929-4937. 2016.

[8] Chen, Yilun, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. “Cascaded pyramid network for multi-person pose estimation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103-7112. 2018.

[9]