Human Activity Recognition using Deep Neural Network with Contextual Information

Li Wei, Shishir K. Shah

Abstract

Human activity recognition is an important yet challenging research topic in the computer vision community. In this paper, we propose context features and a deep model for recognizing the activity of individual subjects in videos of real-world scenes. In addition to the motion features of the subject, we use context information from multiple sources to improve recognition performance. We introduce scene context features that describe the environment of the subject at global and local levels. We design a deep neural network structure that combines motion features and context features to obtain a high-level representation of human activity. Comparisons with baseline approaches demonstrate that the proposed context features and deep model improve activity recognition performance. We also show that our approach outperforms state-of-the-art methods on the 5-activity and 6-activity versions of the Collective Activities Dataset.



Paper Citation


in Harvard Style

Wei, L. and Shah, S. K. (2017). Human Activity Recognition using Deep Neural Network with Contextual Information. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017) ISBN 978-989-758-226-4, pages 34-43. DOI: 10.5220/0006099500340043


in Bibtex Style

@conference{visapp17,
author={Li Wei and Shishir K. Shah},
title={Human Activity Recognition using Deep Neural Network with Contextual Information},
booktitle={Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)},
year={2017},
pages={34-43},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006099500340043},
isbn={978-989-758-226-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)
TI - Human Activity Recognition using Deep Neural Network with Contextual Information
SN - 978-989-758-226-4
AU - Wei, Li
AU - Shah, Shishir K.
PY - 2017
SP - 34
EP - 43
DO - 10.5220/0006099500340043