Jabbar, A., Li, X., and Omar, B. (2021). A survey on generative adversarial networks: Variants, applications, and training. ACM Computing Surveys, 54(8).
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., and Zisserman, A. (2017). The Kinetics human action video dataset. CoRR, abs/1705.06950.
Kimata, J., Nitta, T., and Tamaki, T. (2022). ObjectMix: Data augmentation by copy-pasting objects in videos for action recognition. In Proceedings of the 4th ACM International Conference on Multimedia in Asia, MMAsia ’22, New York, NY, USA. Association for Computing Machinery.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. (2019). Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Kong, Y. and Fu, Y. (2022). Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5):1366–1401.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T. A., and Serre, T. (2011). HMDB: A large video database for human motion recognition. In Metaxas, D. N., Quan, L., Sanfeliu, A., and Gool, L. V., editors, IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2556–2563. IEEE Computer Society.
Lin, J., Gan, C., and Han, S. (2019). TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T., editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer.
Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., Zhao, D., Zhou, J., and Tan, T. (2023). VideoFusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10209–10218.
Park, T., Liu, M.-Y., Wang, T.-C., and Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021). Zero-shot text-to-image generation. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695.
Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6:60.
Soomro, K., Zamir, A. R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402.
Ulhaq, A., Akhtar, N., Pogrebna, G., and Mian, A. (2022). Vision transformers for action recognition: A survey. CoRR, abs/2209.05700.
Wang, G., Zhao, Y., Tang, C., Luo, C., and Zeng, W. (2022). When shift operation meets vision transformer: An extremely simple alternative to attention mechanism. CoRR, abs/2201.10801.
Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., and Qiao, Y. (2023). VideoMAE V2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., and Catanzaro, B. (2018a). Video-to-video synthesis. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
Wang, X., Girshick, R., Gupta, A., and He, K. (2018b). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Weinzaepfel, P. and Rogez, G. (2021). Mimetics: Towards understanding human actions out of context. International Journal of Computer Vision, 129(5):1675–1690.
Wu, D., Chen, J., Sharma, N., Pan, S., Long, G., and Blumenstein, M. (2019). Adversarial action data augmentation for similar gesture action recognition. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. (2021). VideoGPT: Video generation using VQ-VAE and transformers. CoRR, abs/2104.10157.
Yi, X., Walia, E., and Babyn, P. (2019). Generative adversarial network in medical imaging: A review. Medical Image Analysis, 58:101552.
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Yun, S., Oh, S. J., Heo, B., Han, D., and Kim, J. (2020). VideoMix: Rethinking data augmentation for video classification.