From Depth Data to Head Pose Estimation: A Siamese Approach

Marco Venturelli, Guido Borghi, Roberto Vezzani, Rita Cucchiara


Accurate head pose estimation is a problem of great importance for many applications. For instance, it is an enabling technology in the automotive field for driver attention monitoring. In this paper, we tackle the pose estimation problem with a deep network that performs regression. Traditional methods usually rely on visual facial features, such as facial landmarks or the nose tip position. In contrast, we exploit a Convolutional Neural Network (CNN) to perform head pose estimation directly from depth data. We adopt a Siamese architecture and propose a novel loss function to improve the learning of the regression network layer. The system has been tested on two public datasets, Biwi Kinect Head Pose and the ICT-3DHP database. The reported results demonstrate an improvement in accuracy with respect to current state-of-the-art approaches, as well as the real-time capabilities of the overall framework.
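
The abstract does not spell out the loss function, so the following is only an illustrative sketch of how a Siamese regression loss could be assembled, not the authors' formulation. It assumes two weight-sharing branches that each predict a three-angle pose (yaw, pitch, roll), and it assumes the pairwise term penalises the mismatch between the predicted pose difference and the ground-truth pose difference. The function names `l2_sq` and `siamese_regression_loss` are hypothetical.

```python
from typing import Sequence


def l2_sq(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def siamese_regression_loss(
    pred_a: Sequence[float],
    pred_b: Sequence[float],
    target_a: Sequence[float],
    target_b: Sequence[float],
) -> float:
    """Hypothetical combined loss for one pair of depth frames.

    Assumed form (not taken from the paper):
      per-branch regression error for each frame
      + a pairwise term comparing the predicted pose difference
        with the ground-truth pose difference.
    """
    # Per-branch regression terms: each prediction against its own label.
    reg_a = l2_sq(pred_a, target_a)
    reg_b = l2_sq(pred_b, target_b)

    # Pairwise (Siamese) term: the difference between the two predictions
    # should match the difference between the two labels.
    pred_diff = [p - q for p, q in zip(pred_a, pred_b)]
    target_diff = [s - t for s, t in zip(target_a, target_b)]
    pair = l2_sq(pred_diff, target_diff)

    return reg_a + reg_b + pair
```

Under these assumptions, the pairwise term adds a training signal that depends on the relation between the two frames, in addition to the error on each frame individually. With perfect predictions on both branches, all three terms vanish.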



Paper Citation

in Harvard Style

Venturelli M., Borghi G., Vezzani R. and Cucchiara R. (2017). From Depth Data to Head Pose Estimation: A Siamese Approach. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017), ISBN 978-989-758-226-4, pages 194-201. DOI: 10.5220/0006104501940201
