REFERENCES 
Ballan, L., Taneja, A., Gall, J., Van Gool, L., & Pollefeys, 
M.  (2012).  Motion  capture  of  hands  in  action  using 
discriminative salient points. Proceedings of European 
Conference on Computer Vision, 7577 LNCS(PART 6), 
640–653. https://doi.org/10.1007/978-3-642-33783-3_46
Barsoum,  E.  (2016).  Articulated Hand Pose Estimation 
Review. 1–50. http://arxiv.org/abs/1604.06195 
Erol,  A.,  Bebis,  G.,  Nicolescu,  M.,  Boyle,  R.  D.,  & 
Twombly,  X.  (2007).  Vision-based  hand  pose 
estimation:  A  review.  Computer Vision and Image 
Understanding, 108(1–2), 52–73.
https://doi.org/10.1016/j.cviu.2006.10.012
Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., & Yuan, 
J. (2019).  3D hand shape and  pose  estimation from  a 
single RGB image. Proceedings of the IEEE Computer 
Society Conference on Computer Vision and Pattern 
Recognition, 2019-June, 10825–10834.
https://doi.org/10.1109/CVPR.2019.01109
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). 
Rich  feature  hierarchies  for  accurate  object  detection 
and  semantic  segmentation.  Proceedings of the IEEE 
Computer Society Conference on Computer Vision and 
Pattern Recognition, 580–587.
https://doi.org/10.1109/CVPR.2014.81
Gkioxari,  G.,  Girshick,  R.,  Dollár,  P.,  &  He,  K.  (2018). 
Detecting and Recognizing Human-Object Interactions. 
Proceedings of the IEEE Computer Society Conference 
on Computer Vision and Pattern Recognition,
8359–8367. https://doi.org/10.1109/CVPR.2018.00872
Hamer, H., Gall, J., Weise, T., & Van Gool, L. (2010). An 
object-dependent hand pose prior from sparse training 
data.  Proceedings of the IEEE Computer Society 
Conference on Computer Vision and Pattern 
Recognition, 671–678.
https://doi.org/10.1109/CVPR.2010.5540150
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017).
Mask R-CNN. Proceedings of the IEEE International
Conference on Computer Vision, 2017-October,
2980–2988. https://doi.org/10.1109/ICCV.2017.322
He,  K.,  Zhang,  X.,  Ren,  S.,  &  Sun,  J.  (2014).  Spatial 
pyramid  pooling  in  deep  convolutional  networks  for 
visual recognition. Proceedings of the European
Conference on Computer Vision, 8691 LNCS(PART 3),
346–361. https://doi.org/10.1007/978-3-319-10578-9_23
Law, H., Teng, Y., Russakovsky, O., & Deng, J. (2019).
CornerNet-Lite: Efficient Keypoint-Based Object
Detection.
Oikonomidis, I., Kyriazis, N., & Argyros, A. A.
(2011). Markerless and Efficient 26-DOF Hand Pose
Recovery. Proceedings of the 10th Asian Conference
on Computer Vision, 6978 LNCS(PART 1), 365–373.
https://doi.org/10.1007/978-3-642-24085-0_38
Law, H., & Deng, J. (2018). CornerNet: Detecting Objects
as Paired Keypoints. European Conference on Computer
Vision (ECCV), 765–781.
Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., 
Xu, W., & Theobalt, C. (2018). Monocular 3D human 
pose  estimation  in  the  wild  using  improved  CNN 
supervision.  Proceedings - 2017 International 
Conference on 3D Vision, 3DV 2017,  506–516. 
https://doi.org/10.1109/3DV.2017.00064 
Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Elgharib, 
M., Fua, P., Seidel, H. P., Rhodin, H., Pons-Moll, G., & 
Theobalt,  C.  (2020).  XNect:  Real-time  Multi-Person 
3D Motion Capture with a Single RGB Camera. ACM 
Transactions on Graphics,  39(4),  1–24. 
https://doi.org/10.1145/3386569.3392410 
Moon,  G.,  Chang,  J.  Y.,  &  Lee,  K.  M.  (2019).  Camera 
distance-aware top-down approach for 3D multi-person 
pose estimation from a single RGB image. Proceedings 
of the IEEE International Conference on Computer 
Vision, 2019-October, 10132–10141.
https://doi.org/10.1109/ICCV.2019.01023
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, 
D., & Theobalt, C.  (2017). Real-Time Hand Tracking 
under  Occlusion  from  an  Egocentric  RGB-D  Sensor. 
Proceedings of the IEEE International Conference on 
Computer Vision, 2017-October, 1163–1172.
https://doi.org/10.1109/ICCV.2017.131 
Oikonomidis,  I.,  Kyriazis,  N.,  &  Argyros,  A.  (2011). 
Efficient model-based 3D tracking of hand 
articulations using Kinect. Proceedings of the British
Machine Vision Conference, 101.1–101.11.
https://doi.org/10.5244/c.25.101
Pavllo,  D.,  Feichtenhofer,  C.,  Grangier,  D.,  &  Auli,  M. 
(2019).  3D  human  pose  estimation  in  video  with 
temporal  convolutions  and  semi-supervised  training. 
Proceedings of the IEEE Computer Society Conference 
on Computer Vision and Pattern Recognition, 2019-June,
7745–7754. https://doi.org/10.1109/CVPR.2019.00794
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). 
You only look once: Unified, real-time object detection. 
Proceedings of the IEEE Computer Society Conference 
on Computer Vision and Pattern Recognition, 2016-December,
779–788. https://doi.org/10.1109/CVPR.2016.91
Redmon,  J.,  &  Farhadi,  A.  (2018).  YOLOv3: An 
Incremental Improvement.
http://arxiv.org/abs/1804.02767
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-
CNN:  Towards  Real-Time  Object  Detection  with 
Region  Proposal  Networks.  IEEE Transactions on 
Pattern Analysis and Machine Intelligence,  39(6), 
1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., 
Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, 
Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., 
&  Izadi,  S.  (2015).  Accurate,  robust,  and  flexible 
realtime hand tracking. Conference on Human Factors
in Computing Systems - Proceedings, 2015-April,
3633–3642. https://doi.org/10.1145/2702123.2702179
Sridhar,  S.,  Mueller,  F.,  Zollhöfer,  M.,  Casas,  D., 
Oulasvirta, A., & Theobalt, C. (2016). Real-time joint 
tracking of a hand manipulating an object from RGB-D 
input. International Journal of Computer Vision, 9906