Pedestrian Counting using Deep Models Trained on Synthetically Generated Images

Sanjukta Ghosh, Peter Amon, Andreas Hutter, André Kaup

2017

Abstract

Counting pedestrians is a common task in surveillance applications. However, obtaining sufficient annotated training data is often challenging, especially for deep learning models, which require large amounts of it. To address this problem, this paper explores the possibility of training a deep convolutional neural network (CNN) entirely on synthetically generated images for the purpose of counting pedestrians. Transfer learning is exploited to derive the counting models from a base model trained for image classification. A direct approach and a hierarchical approach are used during training to enhance the model's ability to count larger numbers of pedestrians. The trained models are then tested on natural images of completely different scenes, captured by acquisition systems the model never experienced during training. Furthermore, the effectiveness of the cross-entropy and squared-error cost functions is evaluated and analyzed for the scenario where a model is trained entirely on synthetic images. The performance of the trained model on test images from the target site can be improved by fine-tuning with an image of the target site's background.
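The comparison between the cross-entropy and squared-error cost functions can be illustrated with a minimal sketch. This is not the paper's implementation; it merely assumes, hypothetically, that counting is cast as classification over count classes (0, 1, 2, ... pedestrians), with the network producing one logit per class, and shows how the two losses are computed from the same softmax output:

```python
import math

def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-likelihood of the true count class."""
    return -math.log(probs[true_class])

def squared_error(probs, true_class):
    """Sum of squared differences against a one-hot target."""
    target = [1.0 if i == true_class else 0.0 for i in range(len(probs))]
    return sum((p - t) ** 2 for p, t in zip(probs, target))

# Hypothetical network outputs for count classes 0..3 on one image.
logits = [0.5, 2.0, 0.1, -1.0]
probs = softmax(logits)
true_count = 1  # ground-truth: one pedestrian in the image

print("cross-entropy loss:", cross_entropy(probs, true_count))
print("squared-error loss:", squared_error(probs, true_count))
```

Cross-entropy penalizes only the probability assigned to the true class (and does so logarithmically), whereas squared error also penalizes mass placed on every wrong class, which is one reason the two can behave differently during training.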

Paper Citation


in Harvard Style

Ghosh, S., Amon, P., Hutter, A. and Kaup, A. (2017). Pedestrian Counting using Deep Models Trained on Synthetically Generated Images. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017), ISBN 978-989-758-226-4, pages 86-97. DOI: 10.5220/0006132600860097


in Bibtex Style

@conference{visapp17,
author={Sanjukta Ghosh and Peter Amon and Andreas Hutter and André Kaup},
title={Pedestrian Counting using Deep Models Trained on Synthetically Generated Images},
booktitle={Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)},
year={2017},
pages={86-97},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006132600860097},
isbn={978-989-758-226-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)
TI - Pedestrian Counting using Deep Models Trained on Synthetically Generated Images
SN - 978-989-758-226-4
AU - Ghosh S.
AU - Amon P.
AU - Hutter A.
AU - Kaup A.
PY - 2017
SP - 86
EP - 97
DO - 10.5220/0006132600860097
ER -