Audiovisual Data Fusion for Successive Speakers Tracking

Quentin Labourey, Olivier Aycard, Denis Pellerin, Michele Rombaut

2014

Abstract

In this paper, a human speaker tracking method based on audio and video data is presented and applied to conversation tracking with a robot. Audiovisual data fusion is performed in a two-step process. Detection is first performed independently on each modality: face detection based on skin color in the video data, and sound source localization based on the time difference of arrival (TDOA) in the audio data. The results of these detections are then fused through an adaptation of a Bayesian filter to identify the speaker. The robot is able to detect the face of the talking person and to detect a new speaker in a conversation.
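The TDOA-based sound source localization mentioned in the abstract can be sketched as a cross-correlation delay estimate for a two-microphone array. This is a minimal illustration, not the paper's implementation: the function names, the far-field geometry, and the microphone spacing are assumptions made for the example.

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    """Estimate the time difference of arrival (seconds) between two
    microphone signals as the peak lag of their cross-correlation.
    A negative value means sig_a leads sig_b."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)
    return lag / fs

def tdoa_to_azimuth(tdoa, mic_distance, c=343.0):
    """Convert a TDOA (s) to a source azimuth (rad) for a two-microphone
    array under a far-field assumption; c is the speed of sound (m/s)."""
    ratio = np.clip(c * tdoa / mic_distance, -1.0, 1.0)
    return np.arcsin(ratio)

# Synthetic check: the right channel is the left one delayed by 10 samples,
# so the left channel leads and the estimated lag is -10 samples.
fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(4096)
right = np.roll(left, 10)
tdoa = estimate_tdoa(left, right, fs)
```

In a real system the correlation would typically be computed on short windows with a generalized cross-correlation weighting (e.g. GCC-PHAT) for robustness to reverberation; the plain cross-correlation above is the simplest version of the idea.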

References

  1. Brandstein, M. S. and Silverman, H. F. (1997). A practical methodology for speech source localization with microphone arrays. Computer Speech & Language.
  2. Chai, D. and Ngan, K. (1998). Locating facial region of a head-and-shoulders color image. In Proc. Automatic Face and Gesture Recognition.
  3. Farmer, M. E., Hsu, R., and Jain, A. K. (2002). Interacting multiple model Kalman filters for robust high speed human motion tracking. ICPR'02.
  4. Gustafsson, F. and Gunnarsson, F. (2003). Positioning using time-difference of arrival measurements. In ICASSP'03.
  5. Hospedales, T. M. and Vijayakumar, S. (2008). Structure inference for bayesian multisensory scene understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  6. Nguyen, Q. and Choi, J. (2010). Audio-visual data fusion for tracking the direction of multiple speakers. In Control Automation and Systems.
  7. Osuna, E., Freund, R., and Girosi, F. (1997). Training support vector machines: an application to face detection. In CVPR'97.
  8. Rao, B. D. and Trivedi, M. M. (2008). Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition. In Acoustics, Speech, and Signal Processing (ICASSP'08).
  9. Schneiderman, H. and Kanade, T. (1998). Probabilistic modeling of local appearance and spatial relationships for object recognition. In CVPR'98.
  10. Snoek, C. G. M. (2005). Early versus late fusion in semantic video analysis. ACM Multimedia.
  11. Thrun, S., Burgard, W., and Fox, D. (2005). Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press.
  12. Vaillant, R., Monrocq, C., and Le Cun, Y. (1994). Original approach for the localisation of objects in images. IEE Proceedings - Vision, Image and Signal Processing.
  13. Valin, J.-M., Michaud, F., Hadjou, B., and Rouat, J. (2004). Localization of simultaneous moving sound sources for mobile robot using a frequency-domain steered beamformer approach. In ICRA'04.
  14. Viola, P. and Jones, M. (2004). Robust real-time face detection. IJCV'04.
  15. Zhang, C. and Zhang, Z. (2010). A survey of recent advances in face detection. Technical report, Microsoft Research.


Paper Citation


in Harvard Style

Labourey Q., Aycard O., Pellerin D. and Rombaut M. (2014). Audiovisual Data Fusion for Successive Speakers Tracking. In Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 1: VISAPP, (VISIGRAPP 2014), ISBN 978-989-758-003-1, pages 696-701. DOI: 10.5220/0004852506960701


in Bibtex Style

@conference{visapp14,
author={Quentin Labourey and Olivier Aycard and Denis Pellerin and Michele Rombaut},
title={Audiovisual Data Fusion for Successive Speakers Tracking},
booktitle={Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 1: VISAPP, (VISIGRAPP 2014)},
year={2014},
pages={696-701},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004852506960701},
isbn={978-989-758-003-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Conference on Computer Vision Theory and Applications - Volume 1: VISAPP, (VISIGRAPP 2014)
TI - Audiovisual Data Fusion for Successive Speakers Tracking
SN - 978-989-758-003-1
AU - Labourey Q.
AU - Aycard O.
AU - Pellerin D.
AU - Rombaut M.
PY - 2014
SP - 696
EP - 701
DO - 10.5220/0004852506960701