Zero-shot Object Prediction using Semantic Scene Knowledge

Rene Grzeszick, Gernot A. Fink

2017

Abstract

This work focuses on the semantic relations between scenes and objects for visual object recognition. Semantic knowledge can be a powerful source of information especially in scenarios with few or no annotated training samples. These scenarios are referred to as zero-shot or few-shot recognition and often build on visual attributes. Here, instead of relying on various visual attributes, a more direct way is pursued: after recognizing the scene that is depicted in an image, semantic relations between scenes and objects are used for predicting the presence of objects in an unsupervised manner. Most importantly, relations between scenes and objects can easily be obtained from external sources such as large scale text corpora from the web and, therefore, do not require tremendous manual labeling efforts. It will be shown that in cluttered scenes, where visual recognition is difficult, scene knowledge is an important cue for predicting objects.
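The core idea of the abstract — first recognize the scene, then predict objects by marginalizing a scene posterior over scene-object relations mined from text — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scene/object names, the co-occurrence values, and the `predict_objects` helper are all hypothetical placeholders standing in for statistics estimated from a large text corpus.

```python
import numpy as np

# Hypothetical scene-object co-occurrence statistics, e.g. estimated from
# a large web text corpus (all names and numbers here are illustrative).
scenes = ["kitchen", "office"]
objects = ["cup", "monitor", "oven"]

# P(object present | scene): rows indexed by scene, columns by object.
cooccurrence = np.array([
    [0.9, 0.1, 0.8],   # kitchen
    [0.7, 0.9, 0.0],   # office
])

def predict_objects(scene_posterior):
    """Marginalize the scene posterior over the co-occurrence table:
    P(object) = sum_s P(object | scene=s) * P(scene=s)."""
    return scene_posterior @ cooccurrence

# Output of a scene classifier that is fairly confident in "kitchen".
posterior = np.array([0.8, 0.2])
scores = predict_objects(posterior)          # [0.86, 0.26, 0.64]
ranking = [objects[i] for i in np.argsort(-scores)]
```

With this toy posterior, `ranking` comes out as `["cup", "oven", "monitor"]`: object presence is predicted without any object-level training labels, which is what makes the approach zero-shot with respect to the objects.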

References

  1. Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction for the web. In IJCAI, volume 7, pages 2670-2676.
  2. Choi, M. J., Torralba, A., and Willsky, A. S. (2012). A tree-based context model for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2):240-252.
  3. Divvala, S. K., Hoiem, D., Hays, J. H., Efros, A. A., and Hebert, M. (2009). An empirical study of context in object detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1271-1278.
  4. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  5. Etzioni, O., Fader, A., Christensen, J., Soderland, S., and Mausam (2011). Open information extraction: The second generation. In IJCAI, volume 11, pages 3-10.
  6. Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98-136.
  7. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645.
  8. Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2016). Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142-158.
  9. Grzeszick, R., Sudholt, S., and Fink, G. A. (2016). Optimistic and pessimistic neural networks for scene and object recognition. CoRR, abs/1609.07982.
  10. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D. A., Bernstein, M., and Fei-Fei, L. (2016). Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332.
  11. Lampert, C. H., Nickisch, H., and Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453-465.
  12. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Proc. European Conference on Computer Vision (ECCV), pages 740-755. Springer.
  13. Liu, C., Yuen, J., and Torralba, A. (2009). Nonparametric scene parsing: Label transfer via dense scene alignment. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1972-1979. IEEE.
  14. Manning, C. D. and Schütze, H. (1999). Foundations of statistical natural language processing, volume 999. MIT Press.
  15. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41.
  16. Modolo, D., Vezhnevets, A., and Ferrari, V. (2015). Context forest for object class detection. In Proc. British Machine Vision Conference (BMVC).
  17. Oliva, A. and Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23.
  18. Palmer, S. E. (1999). Vision science: Photons to phenomenology. MIT press Cambridge, MA.
  19. Patterson, G., Xu, C., Su, H., and Hays, J. (2014). The sun attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision, 108(1-2):59-81.
  20. Ren, S., He, K., Girshick, R. B., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99.
  21. Rohrbach, M., Stark, M., and Schiele, B. (2011). Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1641-1648. IEEE.
  22. Rohrbach, M., Wei, Q., Titov, I., Thater, S., Pinkal, M., and Schiele, B. (2013). Translating video content to natural language descriptions. In IEEE International Conference on Computer Vision.
  23. Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
  24. Tighe, J. and Lazebnik, S. (2010). Superparsing: scalable nonparametric image parsing with superpixels. In Proc. European Conference on Computer Vision (ECCV), pages 352-365. Springer.
  25. Vezhnevets, A. and Ferrari, V. (2015). Object localization in imagenet by looking out of the window. arXiv preprint arXiv:1501.01181.
  26. Wu, Q., Shen, C., Hengel, A. v. d., Wang, P., and Dick, A. (2016). Image captioning and visual question answering based on attributes and their related external knowledge. arXiv preprint arXiv:1603.02814.
  27. Xiao, J., Ehinger, K. A., Hays, J., Torralba, A., and Oliva, A. (2014). SUN Database: Exploring a Large Collection of Scene Categories. International Journal of Computer Vision (IJCV), pages 1-20.
  28. Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3485-3492. IEEE.
  29. Zhu, Y., Zhang, C., Ré, C., and Fei-Fei, L. (2015). Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:1507.05670.


Paper Citation


in Harvard Style

Grzeszick R. and Fink G. (2017). Zero-shot Object Prediction using Semantic Scene Knowledge. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017) ISBN 978-989-758-226-4, pages 120-129. DOI: 10.5220/0006240901200129


in Bibtex Style

@conference{visapp17,
author={Rene Grzeszick and Gernot A. Fink},
title={Zero-shot Object Prediction using Semantic Scene Knowledge},
booktitle={Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)},
year={2017},
pages={120-129},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006240901200129},
isbn={978-989-758-226-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)
TI - Zero-shot Object Prediction using Semantic Scene Knowledge
SN - 978-989-758-226-4
AU - Grzeszick R.
AU - Fink G.
PY - 2017
SP - 120
EP - 129
DO - 10.5220/0006240901200129