Keypoint estimation involves identifying and locating the significant points of an object in an image. In the case of humans, each keypoint corresponds to a joint of the body, such as the shoulder, elbow, or knee; a set of such keypoints can be connected to represent the complete skeleton of a person, as shown in Figure 1. For objects, the keypoints could be corners or other high-frequency details, as shown in Figure 2. Keypoint estimation has a broad range of real-life applications, such as human behavior analysis, human tracking, activity recognition, gaming, virtual reality, sports motion analysis, and medical assistance.
Figure 1: Human skeleton keypoints
Figure 2: Object keypoints
In general, deep convolutional architectures follow an encoder-decoder pipeline to regress such joint locations. In the first stage, a CNN encoder extracts a set of convolutional feature maps. Most state-of-the-art methods differ in how they design the decoder, falling into two categories: (1) direct regression-based and (2) heatmap-based.
Direct regression-based methods use a regressor to directly output the coordinates of each keypoint. Heatmap-based methods pass the feature maps through a decoder-like architecture that first outputs heatmaps; each heatmap represents the likelihood that a keypoint is located in a given region. An additional post-processing step then extracts the precise keypoint location from the heatmap, as shown in Figure 3. The direct regression-based and heatmap-based pipelines are depicted in Figures 4(a) and 4(b) respectively.
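To make the post-processing step concrete, here is a minimal NumPy sketch of decoding a single heatmap: take the argmax, then nudge the estimate a quarter pixel toward the stronger neighbour, a common sub-pixel refinement trick. This is an illustrative sketch, not the decoder used by any particular method discussed here.

```python
import numpy as np

def decode_heatmap(heatmap):
    """Return the (x, y) location of the most likely keypoint in a heatmap.

    Illustrative post-processing: argmax, then quarter-pixel refinement
    toward the larger horizontal/vertical neighbour.
    """
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    fx, fy = float(x), float(y)
    if 0 < x < w - 1:
        fx += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:
        fy += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return float(fx), float(fy)

# Toy heatmap: a Gaussian blob centred at (12, 7).
yy, xx = np.mgrid[0:32, 0:32]
hm = np.exp(-((xx - 12) ** 2 + (yy - 7) ** 2) / (2 * 2.0 ** 2))
print(decode_heatmap(hm))  # → (12.0, 7.0)
```

In practice a model predicts one such heatmap per keypoint, so this decoding runs once per joint.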
Figure 3: Heatmap-based estimation example: (a) Input image, (b) Generated heatmap and (c) Predicted keypoint.
Figure 4: Keypoint estimation pipelines: (a) Direct regression-based, (b) Heatmap-based
APPROACHES TO KEYPOINT ESTIMATION
In general, two different machine learning-based approaches are deployed for keypoint estimation: (1) the Top-Down approach and (2) the Bottom-Up approach.
The Top-Down approach first detects all the objects in the image as a set of bounding boxes and then applies either a direct regression-based or a heatmap-based method to each bounding box individually, as depicted in Figure 5. In contrast, the Bottom-Up approach [6, 7] first detects all the keypoints in the image and then assembles them into skeletons of distinct objects, as shown in Figure 7. The Top-Down and Bottom-Up pipelines are depicted in Figures 6 and 8 respectively.
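The contrast between the two approaches can be sketched as two control flows. The detector, estimator, and grouping functions below are hypothetical toy stand-ins, not real models; in practice each would be a trained network.

```python
def top_down(image, detect_boxes, estimate_keypoints):
    """Detect objects first, then estimate keypoints inside each box."""
    return [estimate_keypoints(image, box) for box in detect_boxes(image)]

def bottom_up(image, detect_all_keypoints, group_keypoints):
    """Detect every keypoint in the image first, then assemble skeletons."""
    return group_keypoints(detect_all_keypoints(image))

# --- Toy stand-ins: two "persons", two keypoints each. ---
BOXES = [(0, 0, 10, 10), (20, 0, 30, 10)]

def stub_detect_boxes(image):
    return BOXES

def stub_estimate(image, box):
    x0 = box[0]  # keypoints expressed relative to the box's left edge
    return [("shoulder", x0 + 2, 3), ("elbow", x0 + 4, 6)]

def stub_detect_all(image):
    # Each keypoint carries a person id that a real grouper would infer.
    return [(pid, name, x, y)
            for pid, box in enumerate(BOXES)
            for name, x, y in stub_estimate(image, box)]

def stub_group(keypoints):
    people = {}
    for pid, name, x, y in keypoints:
        people.setdefault(pid, []).append((name, x, y))
    return list(people.values())

img = "toy image"
print(top_down(img, stub_detect_boxes, stub_estimate))
print(bottom_up(img, stub_detect_all, stub_group))
# Both routes yield the same two skeletons.
```

The trade-off this sketch hides: top-down cost grows with the number of detected persons, while bottom-up runs the keypoint detector once but must solve the harder grouping problem.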
Figure 5: Top-Down approach: (a) Input image, (b) Detected persons, (c) Cropped person images, (d) Keypoint estimation results on individual persons and (e) Final results.
Figure 6: Top-Down Pipeline
Figure 7: Bottom-Up approach: (a) Input image, (b) Keypoints of all the persons and (c) Keypoints connected to form each person's skeleton.
Figure 8: Bottom-Up Pipeline
USE CASES AT FLIXSTOCK
Flixstock provides several vision-centric solutions, for instance Flixmodel, in which an on-mannequin garment is draped over a human body using an automated architecture. In this application, knowledge of the human skeleton joints along with the garment corner points, as shown in Figure 9, controls the spatial transformation that correctly aligns the source garment on the target body.
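As a simplified illustration of how matched keypoints can drive such an alignment, the sketch below fits an affine transform from garment corner points to target positions by least squares. This is a generic technique shown under the author's framing, not Flixstock's actual architecture; the point coordinates are invented for the example.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src keypoints onto dst.

    src, dst: (N, 2) arrays of matched keypoints, N >= 3.
    Returns a (3, 2) matrix A such that dst ≈ [x, y, 1] @ A.
    """
    src = np.asarray(src, float)
    X = np.hstack([src, np.ones((len(src), 1))])  # homogeneous coordinates
    A, *_ = np.linalg.lstsq(X, np.asarray(dst, float), rcond=None)
    return A

# Hypothetical garment corner points and their desired positions on the body.
src = [(0, 0), (10, 0), (10, 20), (0, 20)]
dst = [(5, 5), (15, 6), (14, 26), (4, 25)]
A = fit_affine(src, dst)

# Warp the source corners and check they land on the targets.
warped = np.hstack([np.asarray(src, float), np.ones((4, 1))]) @ A
```

With more keypoints than the transform's degrees of freedom, the least-squares fit averages out small estimation errors in individual joints.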
Figure 9: (a) Human pose estimation and (b) Garment keypoints estimation
REFERENCES
[1] Q. Dang, J. Yin, B. Wang and W. Zheng, "Deep learning based 2D human pose estimation: A survey," in Tsinghua Science and Technology, vol. 24, no. 6, pp. 663-676, Dec. 2019, doi: 10.26599/TST.2018.9010100.
[2] Bearman, Amy L., and Catherine Dong. "Human Pose Estimation and Activity Classification Using Convolutional Neural Networks." Stanford University (2015).
[3] Tremblay, Jonathan, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. "Deep object pose estimation for semantic robotic grasping of household objects." arXiv preprint arXiv:1809.10790 (2018).
[4] Gong, Wenjuan, Xuena Zhang, Jordi Gonzàlez, Andrews Sobral, Thierry Bouwmans, Changhe Tu, and El-hadi Zahzah. "Human Pose Estimation from Monocular Images: A Comprehensive Survey." Sensors 16, no. 12 (2016): 1966.
[5] Chen, Yucheng, Yingli Tian, and Mingyi He. "Monocular human pose estimation: A survey of deep learning-based methods." Computer Vision and Image Understanding 192 (2020): 102897.
[6] Cao, Zhe, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. "OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields." arXiv preprint arXiv:1812.08008 (2018).
[7] Pishchulin, Leonid, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. "DeepCut: Joint subset partition and labeling for multi person pose estimation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4929-4937. 2016.
[8] Chen, Yilun, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. "Cascaded pyramid network for multi-person pose estimation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103-7112. 2018. https://github.com/HiKapok/tf.fashionAI