F4D: Factorized 4D Convolutional Neural Network for Efficient Video-Level Representation Learning

Mohammad Al-Saad; Lakshmish Ramaswamy; Suchendra Bhandarkar

doi:10.5220/0012430200003636

2024

Abstract

Recent studies have shown that video-level representation learning is crucial for capturing and understanding the long-range temporal structure needed for video action recognition. Most existing 3D convolutional neural network (CNN)-based methods for video-level representation learning are clip-based and focus only on short-term motion and appearance. These CNN-based methods lack the capacity to incorporate and model the long-range spatiotemporal representation of the underlying video and ignore the long-range video-level context during training. In this study, we propose a factorized 4D CNN architecture with attention (F4D) that is capable of learning more effective, finer-grained, long-term spatiotemporal video representations. We demonstrate that the proposed F4D architecture yields significant performance improvements over the conventional 2D and 3D CNN architectures proposed in the literature. Experimental evaluation on five action recognition benchmark datasets, i.e., Something-Something-v1, Something-Something-v2, Kinetics-400, UCF101, and HMDB51, demonstrates the effectiveness of the proposed F4D network architecture for video-level action recognition.

Paper Citation

in Harvard Style

Al-Saad M., Ramaswamy L. and Bhandarkar S. (2024). F4D: Factorized 4D Convolutional Neural Network for Efficient Video-Level Representation Learning. In Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-680-4, SciTePress, pages 1002-1013. DOI: 10.5220/0012430200003636

in Bibtex Style

@conference{icaart24,
author={Mohammad Al-Saad and Lakshmish Ramaswamy and Suchendra Bhandarkar},
title={F4D: Factorized 4D Convolutional Neural Network for Efficient Video-Level Representation Learning},
booktitle={Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2024},
pages={1002-1013},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012430200003636},
isbn={978-989-758-680-4},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - F4D: Factorized 4D Convolutional Neural Network for Efficient Video-Level Representation Learning
SN - 978-989-758-680-4
AU - Al-Saad M.
AU - Ramaswamy L.
AU - Bhandarkar S.
PY - 2024
SP - 1002
EP - 1013
DO - 10.5220/0012430200003636
PB - SciTePress