Transformer-Based Video Saliency Prediction with High Temporal Dimension Decoding

Morteza Moradi, Simone Palazzo, Concetto Spampinato

2024

Abstract

In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been an active research topic in video saliency prediction (VSP). Spatio-temporal transformers have largely overcome the weakness of earlier strategies, e.g., 3D convolutional networks and LSTM-based networks, in capturing long-range dependencies. While VSP has benefited from spatio-temporal transformers, finding the most effective way to aggregate temporal features remains challenging. To address this concern, we propose a transformer-based video saliency prediction approach with a high temporal dimension decoding network (THTD-Net). This strategy compensates for the lack of complex hierarchical interactions between the features extracted by the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and instead gradually reduces the temporal dimension of the features in the decoder. This decoder-based architecture performs comparably to multi-branch and overly complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
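
The idea of gradually reducing the temporal dimension in a single decoder can be illustrated with a small sketch. It is not the THTD-Net decoder: the abstract does not specify the layers, the clip length, or the number of stages, so the pairwise averaging used here, the power-of-two clip length, and the names `halve_temporal` and `decode_gradually` are all assumptions made for illustration. Each stage halves the number of time steps until a single map remains.

```python
from typing import List, Tuple

# Hypothetical layout: one flattened feature map (list of floats) per time step.
Frame = List[float]


def halve_temporal(features: List[Frame]) -> List[Frame]:
    """Average each pair of neighbouring time steps, halving the temporal length.

    Assumption: plain averaging stands in for whatever learned temporal
    reduction a real decoder stage would apply.
    """
    if len(features) % 2 != 0:
        raise ValueError("temporal length must be even")
    return [
        [(a + b) / 2.0 for a, b in zip(features[t], features[t + 1])]
        for t in range(0, len(features), 2)
    ]


def decode_gradually(features: List[Frame]) -> Tuple[Frame, List[int]]:
    """Apply halving stages until one time step remains.

    Returns the final single map and the temporal length after every stage.
    Assumption: the clip length is a power of two so that each stage halves it.
    """
    n = len(features)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("temporal length must be a power of two")
    trace = [n]
    while len(features) > 1:
        features = halve_temporal(features)
        trace.append(len(features))
    return features[0], trace


if __name__ == "__main__":
    # Eight time steps, three feature values each; the value equals the time index.
    clip = [[float(t)] * 3 for t in range(8)]
    saliency, lengths = decode_gradually(clip)
    print(lengths)   # temporal length after each stage
    print(saliency)  # the single remaining map
```

The trace of temporal lengths shows the point of the design: temporal information is collapsed step by step inside one decoder instead of being aggregated all at once or across several decoder branches.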


Paper Citation


in Harvard Style

Moradi M., Palazzo S. and Spampinato C. (2024). Transformer-Based Video Saliency Prediction with High Temporal Dimension Decoding. In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP; ISBN 978-989-758-679-8, SciTePress, pages 616-623. DOI: 10.5220/0012422800003660


in Bibtex Style

@conference{visapp24,
author={Morteza Moradi and Simone Palazzo and Concetto Spampinato},
title={Transformer-Based Video Saliency Prediction with High Temporal Dimension Decoding},
booktitle={Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP},
year={2024},
pages={616-623},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012422800003660},
isbn={978-989-758-679-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP
TI - Transformer-Based Video Saliency Prediction with High Temporal Dimension Decoding
SN - 978-989-758-679-8
AU - Moradi M.
AU - Palazzo S.
AU - Spampinato C.
PY - 2024
SP - 616
EP - 623
DO - 10.5220/0012422800003660
PB - SciTePress