F4D: Factorized 4D Convolutional Neural Network for Efficient
Video-Level Representation Learning
Mohammad Al-Saad, Lakshmish Ramaswamy and Suchendra Bhandarkar
School of Computing, The University of Georgia, Athens, GA, U.S.A.
Keywords:
Video-Level Action Recognition, Factorized Convolutional Neural Network, Temporal Attention,
Spatio-Temporal Attention, Channel Attention, 3D CNN, 4D CNN.
Abstract:
Recent studies have shown that video-level representation learning is crucial to the capture and understand-
ing of the long-range temporal structure for video action recognition. Most existing 3D convolutional neural
network (CNN)-based methods for video-level representation learning are clip-based and focus only on short-
term motion and appearances. These CNN-based methods lack the capacity to incorporate and model the
long-range spatiotemporal representation of the underlying video and ignore the long-range video-level con-
text during training. In this study, we propose a factorized 4D CNN architecture with attention (F4D) that is
capable of learning more effective, finer-grained, long-term spatiotemporal video representations. We demon-
strate that the proposed F4D architecture yields significant performance improvements over the conventional
2D and 3D CNN architectures proposed in the literature. Experimental evaluation on five action recognition benchmark datasets, i.e., Something-Something-v1, Something-Something-v2, Kinetics-400, UCF101, and HMDB51, demonstrates the effectiveness of the proposed F4D network architecture for video-level action
recognition.
1 INTRODUCTION
In an era dominated by digital media, the growing volume of large-scale video has transformed the way information is conveyed and consumed. From autonomous vehicles and intelligent surveillance systems to online streaming services and social media platforms, video has emerged as a pervasive and rich source of data that captures the essence of human experiences and their surrounding environments. Nevertheless, the complexity and sheer volume of this video data create a pressing demand for effective video understanding. A foundational step in video understanding is action recognition, which aims to interpret and understand human actions, gestures, and movements.
Many 2D and 3D Convolutional Neural Network
(CNN) architectures have been proposed for the prob-
lem of video-based human action recognition. A
straightforward CNN-based approach to this problem
uses the entire video as an input to the CNN followed
by a fully convolutional inference (Yu et al., 2017).
However, the data volume in videos is huge, and running fully convolutional inference over an entire video requires a memory footprint and amount of processing power well beyond the capabilities of modern GPUs (Feichtenhofer et al., 2019).
To substantially reduce the memory footprint and computational cost, most existing deep learning (DL) models for video representation learning rely on clip-level feature learning. This allows these DL models to apply deep networks over video clips of fixed temporal length, focusing on short-term object appearance and motion, and thus learning from video clips instead of the entire video. Clip-based learning methods sample short video clips comprising 10-32 frames each and compute the prediction scores for each clip independently (Tran et al., 2018). Finally, the individual clip-level results are aggregated to generate a final video-level prediction.
In general, clip-based models often ignore long-
range spatiotemporal dependencies and the global
video-level structure during training. The temporal
dependency problem in vision-based human action
recognition refers to the challenge of correctly captur-
ing and modeling the dynamic and sequential nature
of human actions over time. It identifies that actions
are not isolated events but unfold as a sequence of distinctive motion patterns, each of which contributes to the overall understanding of the action being performed. Temporal dependency reflects the notion that
the duration, timing, and order of these motion pat-
terns are critical for interpreting and recognizing ac-
tions correctly. Capturing the temporal aspect is es-
sential for distinguishing between actions that may
share similar visual appearance but vary in their exe-
cution timing or sequence. In many cases, partial ob-
servation of the underlying video makes it very diffi-
cult to recognize an action correctly. Additionally, relying on the average of the prediction scores from individual clips often results in sub-optimal inference.
To learn from an entire video efficiently, the Tem-
poral Segment Network (TSN) architecture has been
proposed (Wang et al., 2018b). The TSN represents
the contents of the entire video by operating on a
sequence of multiple short clips (snippets) sampled
from the entire video. In the final TSN stage, a seg-
mental consensus function is used to aggregate the
predictions from the sampled snippets, thereby en-
abling the TSN to model long-range temporal struc-
tures. However, the fact that inter-clip interactions
and video-level fusion are performed in the final TSN
stage limits the ability of the TSN to capture fine tem-
poral structures. To overcome this limitation, the V4D
CNN model (Zhang et al., 2020) incorporated the 4D
CNN architecture. The 4D convolution operation has
the capacity to model long-range dependencies and
capture inter-clip interactions for efficient video-level
representation learning. To capture finer temporal
structures, the V4D CNN residual blocks are placed
at earlier stages in the network. Nevertheless, the 4D
convolution operation in the V4D CNN model is com-
plex and introduces many more parameters thereby
making the model vulnerable to overfitting. Further-
more, the V4D CNN architecture does not incorporate
an attention mechanism to focus on the regions of in-
terest (ROIs) that evolve over time.
Inspired by the above observations of the state of
the art in video-level representation learning, we pro-
pose an effective yet simple framework for video-level representation learning, termed the Factorized 4D (F4D) architecture, to model both short-range motion and long-range temporal dependency within a large-scale video sequence. This paper has two main objectives: the first is to enhance accuracy and to decrease the complexity of the 4D convolution operation introduced in the V4D CNN framework. We start by factorizing the 4D convolution operation, which renders the proposed F4D CNN model capa-
ble of representing more complex functions by cap-
turing more complex inter-clip interactions and finer
temporal structures. Furthermore, the proposed fac-
torization improves the optimization procedure dur-
ing both training and testing, yielding lower training
and testing errors. The second objective is to imple-
ment an attention mechanism that focuses on an ROI
within the video and enhances the power of the result-
ing representation. We design two attention mech-
anisms, namely the temporal attention (TA) module
and spatio-temporal attention (STA) module. These
modules will focus on the different inter-clip mo-
tion patterns that evolve over time and on the spatio-
temporal discriminative features by focusing on the
ROIs that evolve over time. We combine the proposed factorized 4D CNN with the attention modules to form a block named the F4D residual block. The F4D residual blocks can be easily inserted into a standard ResNet (He et al., 2016) architecture to form the
F4D architecture. The main contributions of our work
can be summarized as follows:
• We propose a Factorized 4D CNN that can capture more complex long-range temporal dependencies and inter-clip interactions with lower training and testing errors compared to the 4D CNN.
• We propose a temporal attention (TA) module and a spatio-temporal attention (STA) module that guide the network to focus on ROIs within the video and improve the resulting representation with negligible computational cost.
• We propose an effective yet simple network, referred to as the F4D architecture, built from F4D residual blocks that consist of the proposed factorized 4D CNN followed by the proposed attention modules, and that can be easily integrated into a standard ResNet architecture.
• Extensive experiments demonstrate the effectiveness of the proposed F4D architecture on five action recognition benchmark datasets, including Something-Something-v1 and v2 (Goyal et al., 2017), Kinetics-400 (Kay et al., 2017), UCF101 (Soomro et al., 2012) and HMDB51 (Kuehne et al., 2011).
2 RELATED WORKS
Two-Stream 2D CNN. The two-stream CNN archi-
tecture represents a very practical approach to video-
level representation learning. The earliest two-stream
CNN architecture was introduced in (Simonyan and
Zisserman, 2014a) where one CNN learns from a
stream of RGB frames and the other CNN from a
stream comprising stacks of 10 computed optical
flow frames. In the later stages, the results of both
streams are averaged to yield the final prediction.
Although the two-stream CNN architecture has
been shown to yield impressive results, the extraction
of spatial and temporal features is performed independently, making it easy to ignore their intrinsic connection, which can influence the final prediction. Another limitation of two-stream networks is the heavy computational demand of optical flow, for which parallel optimization is difficult to implement. Some re-
allel optimization is difficult to implement. Some re-
lated works have explored the idea of enhancing the
optical flow computations (Dosovitskiy et al., 2015;
Sun et al., 2018; Zhang et al., 2016; Piergiovanni and
Ryoo, 2019) in this regard.
3D CNN. Since 3D CNNs incorporate spatio-
temporal filters, they represent a natural approach to
video modelling. The biggest advantage of 3D CNNs
is their ability to create hierarchical representations of
spatio-temporal data. 3D CNNs have been explored
in several works cited in the literature. Ji et al. (Ji
et al., 2012a) pioneered the use of the 3D CNN for
human action recognition by applying 3D convolution
operation in both the spatial and temporal domains.
Tran et al. (Tran et al., 2015) propose the C3D
model and show its effectiveness when trained on
large-scale video datasets. They conducted a sys-
tematic study to show that 3D CNNs are better than 2D CNNs at learning appearance and motion infor-
mation. Moreover, they show that using 3 × 3 × 3
convolution kernels for all layers works best amongst
the explored architectures. The work in (Tran et al.,
2017) improves upon the C3D model by employ-
ing neural architecture search across multiple dimen-
sions and 3D residual networks that allow for use of
deeper networks that can be trained on large-scale
video datasets.
The two-stream 3D CNN architecture has been
explored by Carreira et al. (Carreira and Zisserman,
2017) with the goal of successfully incorporating 2D
image classification models into a 3D CNN by in-
flating all the filters and pooling kernels by adding
an extra temporal dimension. The authors use a pre-
trained Inception framework as the architectural back-
bone with one stream trained on RGB inputs and an-
other stream trained on optical flow. Recent work in
(Dong et al., 2021) improves the 3D residual architec-
ture by decoupling the 3D convolutional kernel and
also presents the design of a 3D attention mechanism
to decrease the model’s sensitivity to changes in the
background environment.
There are several disadvantages associated with
the 3D CNN architecture. First, the number of 3D
CNN model parameters increases more rapidly com-
pared to that of the 2D CNN. Second, 3D CNNs are hard to train, the learned representations are difficult to transfer, and inference is very slow compared to other approaches. Third, in some cases, the 3D
convolution operation cannot distinguish between the
human action features and the background features
making the model vulnerable to environmental fac-
tors.
Mapping from 2D to 3D CNN. Several research pa-
pers have explored techniques to transfer the bene-
fits of pre-trained 2D CNNs to 3D CNN architec-
tures. In (Hara et al., 2018), the authors consider
the 2D Resnet and replace all its 2D convolutional
filters with 3D convolutional kernels to arrive at the
ResNet3D architecture. They assume that a combi-
nation of large-scale datasets and deep 3D CNNs are
capable of replicating the success of 2D CNNs on the
ImageNet dataset. Inspired by ResNeXt architecture
(Xie et al., 2017), Chen et al. (Chen et al., 2018)
propose a multi-fiber architecture that divides a com-
plex neural network into an ensemble of lightweight
networks, thereby reducing the computational cost and simultaneously coordinating the in-
formation flow. Motivated by the SENet (Hu et al.,
2018), the STCNet architecture (Diba et al., 2018)
incorporates channel-wise information within a 3D
block to capture the correlation information between
the temporal and spatial channels throughout the net-
work.
Unifying 2D and 3D CNN. 3D CNNs have wit-
nessed great success in recognizing human action in
videos. However, the high complexity of training
the 3D convolution kernels and the need for large
quantities of training videos limits their applicabil-
ity. To reduce the complexity of 3D CNN training,
the P3D (Qiu et al., 2017) and R(2+1)D (Tran et al.,
2018) architectures explore the idea of 3D factoriza-
tion wherein a 3D kernel is factorized into two sep-
arate operations, a 2D spatial convolution and a 1D
temporal convolution. Trajectory convolution (Zhao
et al., 2018) is based on a similar concept but uti-
lizes deformable convolution for the temporal com-
ponent to better deal with motion. A different ap-
proach of simplifying 3D CNNs is to integrate 2D
and 3D convolutions within a single network. MiCT-
Net (Zhou et al., 2018b) integrates 2D and 3D CNNs
to generate richer, deeper, and more informative fea-
ture maps by decreasing the complexity of training
in each round of spatial-temporal fusion. ARTNet
(Wang et al., 2018a) establishes a relation and appear-
ance network by using a novel building block com-
prising a spatial branch using 2D CNNs and a re-
lation branch using 3D CNNs. S3D (Xie et al., 2018)
and ECO (Zolfaghari et al., 2018) combine the ad-
vantages of the aforementioned models by adopting
a top-heavy network to achieve online video under-
standing.
Long-Term Video Modelling Frameworks. In their
seminal work, Wang et al. (Wang et al., 2018b) pro-
pose a simple, flexible, and general framework for
learning action models in videos. Temporal segment
networks (TSNs) are designed by performing sparse
sampling of a long video to extract short snippets fol-
lowed by a segmental consensus function to aggre-
gate information from the sampled snippets. This al-
lows the TSN to model long-range temporal struc-
tures within the entire video. The Temporal Rela-
tional Reasoning Network (TRN) (Zhou et al., 2018a)
enables temporal relational reasoning over videos by
describing the temporal relations between observa-
tions in videos. While the TRN is shown to be capable
of discovery and learning of potential temporal rela-
tions at multiple time scales within a video, it lacks
the capacity to capture finer temporal structure. For
efficient video understanding, Lin et al. (Lin et al.,
2019) introduce a Temporal Shift Module (TSM) that
extends the shift operation to design a temporal mod-
ule to capture temporal relations. The STM archi-
tecture (Jiang et al., 2019) incorporates two channel-
wise modules, one to represent motion features and
the other to encode spatio-temporal features. Inspired
by the approach in (Hu et al., 2018), the TEA archi-
tecture (Li et al., 2020) improves the motion pattern
representation by using the motion features to cali-
brate the spatio-temporal features.
4D CNN. The V4D CNN architecture proposed by
Zhang et al. (Zhang et al., 2020) tackles the analysis
of RGB videos by incorporating a video-level sam-
pling strategy to cover the holistic duration of a given
video. A novel 4D residual block is proposed which
allows the casting of 3D CNNs into 4D CNNs for
learning long-range interactions of the 3D features,
resulting in a “time of time” video-level representa-
tion. The proposed V4D architecture has achieved
excellent results compared to its 3D counterparts.
3 F4D ARCHITECTURE
3.1 Segment Based Sampling
To model the long-range spatio-temporal dependency, we use the segment-based sampling described in (Wang et al., 2018b). Formally, given a whole video V, we divide it into U sections of equal duration and select a snippet, termed an action unit, randomly sampled from each section to represent a short-term action pattern within that section. The holistic action in the video is represented by a sequence of action units $\{A_1, A_2, \ldots, A_U\}$, where $A_i \in \mathbb{R}^{C \times T \times H \times W}$ is the action unit obtained from the $i$-th section, $C$ is the number of channels, and $T$, $H$, $W$ are the temporal length, height, and width, respectively. During the training phase, each action unit $A_i$ is randomly selected from its section. During testing, the center of each $A_i$ is located exactly at the center of the corresponding section.
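To make the sampling procedure concrete, the following is a minimal Python sketch of segment-based sampling under our own assumptions (the video is given only as a frame count, and snippets are returned as lists of frame indices); it is an illustration, not the authors' implementation.

```python
import random

def sample_action_units(num_frames, U, unit_len, training=True):
    """Split a video of `num_frames` frames into U equal sections and pick one
    snippet (action unit) of `unit_len` consecutive frames per section.

    Training: the snippet start is drawn uniformly at random within the section.
    Testing: the snippet is centered in its section.
    """
    section_len = num_frames // U
    units = []
    for i in range(U):
        sec_start, sec_end = i * section_len, (i + 1) * section_len
        max_start = max(sec_start, sec_end - unit_len)
        if training:
            start = random.randint(sec_start, max_start)
        else:  # center the action unit in its section
            start = sec_start + max(0, (section_len - unit_len) // 2)
        units.append(list(range(start, min(start + unit_len, num_frames))))
    return units  # U lists of frame indices, one per action unit

# Example: a 320-frame video, U = 4 sections, 32-frame snippets
print(sample_action_units(320, U=4, unit_len=32, training=False))
```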
3.2 Overview of 4D CNN
In recent years, the 3D CNN has been shown to be
a powerful approach for modelling short-term spatio-
temporal features in video. However, the receptive
fields of 3D kernels are usually deficient owing to the
compact sizes of kernels, and hence pooling opera-
tions are applied to enlarge the receptive fields. In
contrast, 4D convolution operations have been imple-
mented to simultaneously model short-term and long-
term spatio-temporal representations since they have
the capacity to model long-range dependencies and
capture inter-clip interactions for efficient video-level
representation learning.
The input to a 4D convolution can be denoted as a
tensor V of size (C,U,T, H,W ), where U is the num-
ber of action units (the 4th dimension). The batch di-
mension has been excluded for simplicity. Formally,
a 4D convolution operation can be viewed as follows:
$$o_j^{uthw} = b_j + \sum_{c}^{C_{in}} \sum_{s=0}^{S-1} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1} W_{jc}^{spqr} \, v_c^{(u+s)(t+p)(h+q)(w+r)} \qquad (1)$$

where $o_j^{uthw}$ is a pixel at position $(u,t,h,w)$ of the $j$-th channel in the output, following the notation in (Ji et al., 2012b), $b_j$ is a bias term, $c$ indexes the $C_{in}$ input channels of the feature maps, $S \times P \times Q \times R$ is the shape of the 4D convolutional kernel, and $W_{jc}^{spqr}$ is the weight at position $(s,p,q,r)$ of the kernel, corresponding to the $c$-th channel of the input feature maps and the $j$-th channel of the output feature maps. Since deep learning libraries do not provide an implementation for 4D convolutions, eqn. (1) can be modified to generate eqn. (2), which allows the implementation of 4D convolutions using 3D convolutions. Eqn. (2) can be formulated as follows:

$$o_j^{uthw} = b_j + \sum_{s=0}^{S-1} \left( \sum_{c}^{C_{in}} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1} W_{jc}^{spqr} \, v_c^{(u+s)(t+p)(h+q)(w+r)} \right) \qquad (2)$$
where the expression in the parentheses can be imple-
mented by 3D convolutions. Within the 4D space, the
4D convolution kernel has the ability to model both
the short-term 3D features of each action unit and the
long-term temporal evolution of several action units
at the same time. Thus, the 4D convolutions have
the power to learn more complicated interactions of
a long-range 3D spatio-temporal representation.
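As an illustration of eqn. (2), the following is a minimal PyTorch sketch of our own (not the V4D authors' code) that realizes a 4D convolution as a sum of S shifted 3D convolutions along the action-unit dimension U; the "same" padding and unit stride are simplifying assumptions.

```python
import torch
import torch.nn as nn

class Conv4dVia3d(nn.Module):
    """4D convolution over (U, T, H, W) realized with S shifted 3D convolutions,
    as in eqn. (2): the inner sums over (c, p, q, r) are a standard Conv3d, and
    the outer sum over s slides the kernel along the action-unit dimension U."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3, 3)):
        super().__init__()
        self.S = kernel[0]                       # kernel extent along U
        self.pad_u = self.S // 2                 # "same" padding along U (assumption)
        self.convs = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=kernel[1:],
                      padding=tuple(k // 2 for k in kernel[1:]),
                      bias=(s == 0))             # keep a single bias term b_j
            for s in range(self.S)
        ])

    def forward(self, x):                        # x: (N, C, U, T, H, W)
        n, c, u, t, h, w = x.shape
        x = nn.functional.pad(x, (0, 0, 0, 0, 0, 0, self.pad_u, self.pad_u))
        out = 0
        for s, conv in enumerate(self.convs):    # outer sum over s in eqn. (2)
            xs = x[:, :, s:s + u]                # shift along U by s
            xs = xs.permute(0, 2, 1, 3, 4, 5).reshape(n * u, c, t, h, w)  # U into batch
            ys = conv(xs).reshape(n, u, -1, t, h, w).permute(0, 2, 1, 3, 4, 5)
            out = out + ys
        return out                               # (N, C_out, U, T, H, W)

x = torch.randn(2, 16, 4, 8, 14, 14)             # N=2, C=16, U=4, T=8, H=W=14
print(Conv4dVia3d(16, 32)(x).shape)              # torch.Size([2, 32, 4, 8, 14, 14])
```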
Figure 1: F4D Residual Block.
3.3 F4D: Factorization of 4D CNN
In this section, we design a network block termed
as F4D to improve upon the 4D convolution dis-
cussed in the previous section. We follow the work in
(Tran et al., 2018) to approximate the 4D convolution
by a 3D convolution followed by a 1D convolution,
thereby decomposing the spatial modeling and the
temporal modeling for action units into two separate
steps. The (3+1)D block replaces the $N_i$ 4D convolutional filters of size $N_{i-1} \times u \times t \times h \times w$ with $M_i$ 3D convolutional filters of size $N_{i-1} \times u \times 1 \times h \times w$ and $N_i$ temporal convolution filters of size $M_i \times 1 \times t \times 1 \times 1$. The hyperparameter $M_i$ decides the dimensionality of the intermediate subspace where the signal is projected between the spatial convolution and the temporal convolution. In order to have a (3+1)D block with the number of parameters approximately equal to the number of parameters in the implementation of a full 4D convolution layer, we set

$$M_i = \left\lfloor \frac{u\,t\,h\,w\,N_{i-1}\,N_i}{u\,h\,w\,N_{i-1} + t\,N_i} \right\rfloor.$$
The (3+1)D decomposition provides advantages
over the full 4D convolution. First, although the num-
ber of parameters is approximately the same, the num-
ber of nonlinearities in the F4D network will increase
due to the additional ReLU between the 3D and the
1D convolution in each block. Adding more nonlin-
earities results in increased complexity of functions
that can be represented. This has been noted in VGG
(Simonyan and Zisserman, 2014b) and R(2+1)D net-
works which approximate the effect of a big filter by
applying several smaller filters with additional non-
linearities introduced between them. Second, forcing
the 4D convolution into separate spatial and temporal
modules can render the optimization easier, resulting
in lower training error compared to the 4D convolu-
tion of the same size and capacity. Hence, for the
same number of layers and parameters, the (3+1)D
block will have lower training error and lower test-
ing error compared to the V4D network. Despite the
fact that (3+1)D is a simpler architecture, experimen-
tal results show that it significantly outperforms the
V4D network.
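The sketch below illustrates one possible rendering of this (3+1)D factorization in PyTorch, including the choice of $M_i$; the exact layer ordering, padding, and placement of the intermediate ReLU are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def mid_channels(n_in, n_out, u, t, h, w):
    # M_i chosen so that the (3+1)D block has roughly as many parameters as a
    # full u x t x h x w 4D convolution (the formula given in Sec. 3.3).
    return (u * t * h * w * n_in * n_out) // (u * h * w * n_in + t * n_out)

class Factorized4dBlock(nn.Module):
    """(3+1)D sketch: a 3D convolution over (U, H, W), a ReLU, and a 1D temporal
    convolution over T (our simplified rendering of the factorization)."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3, 3)):
        super().__init__()
        u, t, h, w = kernel
        m = mid_channels(in_ch, out_ch, u, t, h, w)
        # "spatial" filters of size u x 1 x h x w -> Conv3d over (U, H, W)
        self.spatial = nn.Conv3d(in_ch, m, kernel_size=(u, h, w),
                                 padding=(u // 2, h // 2, w // 2))
        # "temporal" filters of size 1 x t x 1 x 1 -> Conv3d over (T, 1, 1)
        self.temporal = nn.Conv3d(m, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                           # x: (N, C, U, T, H, W)
        n, c, u, t, h, w = x.shape
        # spatial/clip convolution: fold T into the batch dimension
        xs = x.permute(0, 3, 1, 2, 4, 5).reshape(n * t, c, u, h, w)
        xs = self.relu(self.spatial(xs))            # extra nonlinearity of the factorization
        m = xs.shape[1]
        xs = xs.reshape(n, t, m, u, h, w).permute(0, 2, 3, 1, 4, 5)   # (N, M, U, T, H, W)
        # temporal convolution: fold U into the batch dimension
        xt = xs.permute(0, 2, 1, 3, 4, 5).reshape(n * u, m, t, h, w)
        xt = self.temporal(xt)
        return xt.reshape(n, u, -1, t, h, w).permute(0, 2, 1, 3, 4, 5)  # (N, C_out, U, T, H, W)

x = torch.randn(2, 16, 4, 8, 14, 14)
print(Factorized4dBlock(16, 32)(x).shape)           # torch.Size([2, 32, 4, 8, 14, 14])
```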
3.4 F4D Block Integration
This section discusses how to integrate the
F4D blocks into existing state-of-the-art 3D CNN
frameworks for action recognition. As in (Zhang
et al., 2020), we design a factorized 4D convolution in
the residual structure (He et al., 2016), which shows
the efficacy of combining the short-term 3D features
and the long-term spatiotemporal representations for
video action recognition. We start by defining a per-
mutation function $\varphi_{(d_i,d_j)}: A^{d_1 \times \cdots \times d_i \times \cdots \times d_j \times \cdots \times d_n} \rightarrow A^{d_1 \times \cdots \times d_j \times \cdots \times d_i \times \cdots \times d_n}$, which permutes the dimensions $d_i$ and $d_j$ of a tensor $A \in \mathbb{R}^{d_1 \times \cdots \times d_n}$. Formally, the residual factorized 4D convolution block can be formulated as:

$$Y_{3D} = X_{3D} + \varphi_{(U,C)}\Big(F_{3D+1D}\big(\varphi_{(C,U)}(X_{3D});\; W_{3D+1D}\big)\Big) \qquad (3)$$

where $F_{3D+1D}(X; W_{3D+1D})$ is the factorized 4D convolution operation, and $Y_{3D}, X_{3D} \in \mathbb{R}^{U \times C \times T \times H \times W}$. In order to process $X_{3D}, Y_{3D}$ using standard 3D CNNs, $U$ is merged into the batch dimension, whereas in order to process $X_{3D}$ using the factorized 4D convolution, we utilize the permutation function to permute the dimensions of $X_{3D}$ from $U \times C \times T \times H \times W$ to $C \times U \times T \times H \times W$. Thus, the output of the factorized 4D convolution can be permuted back to the 3D form so that the output dimensions are consistent. The factorized 4D convolution is followed by a batch normalization layer (Ioffe and
Szegedy, 2015), ReLU activation and a dropout layer. In theory, any 3D CNN architecture can be recast as a factorized 4D convolution using the proposed residual block.

Figure 2: Temporal Attention Module.
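A minimal sketch of this residual integration is given below, assuming (as is common for 3D backbones) that features are carried as (N·U, C, T, H, W) tensors with U merged into the batch; the dropout rate and the stand-in inner path used in the example are our own assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class F4DResidualWrapper(nn.Module):
    """Residual integration in the spirit of eqn. (3): permute to the 4D form,
    apply an inner path (e.g. a factorized 4D convolution), permute back, and
    add the residual. BN + ReLU + dropout follow the inner path."""
    def __init__(self, inner_path, channels, U, drop=0.2):   # drop rate is an assumption
        super().__init__()
        self.inner_path, self.U = inner_path, U
        self.bn = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(drop)

    def forward(self, x3d):                                  # x3d: (N*U, C, T, H, W)
        nu, c, t, h, w = x3d.shape
        n = nu // self.U
        # permute to the 4D form (N, C, U, T, H, W)
        x4d = x3d.reshape(n, self.U, c, t, h, w).permute(0, 2, 1, 3, 4, 5)
        y4d = self.inner_path(x4d)
        # permute back to the 3D form and merge U into the batch again
        y3d = y4d.permute(0, 2, 1, 3, 4, 5).reshape(nu, -1, t, h, w)
        y3d = self.drop(self.relu(self.bn(y3d)))
        return x3d + y3d                                     # residual connection of eqn. (3)

# usage with a stand-in inner path (in practice: the factorized 4D convolution)
block = F4DResidualWrapper(nn.Identity(), channels=32, U=4)
print(block(torch.randn(8, 32, 8, 14, 14)).shape)            # torch.Size([8, 32, 8, 14, 14])
```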
3.5 Attention in F4D Blocks
Inspired by CBAM network (Woo et al., 2018), we
implement two attention modules and embed them within
the F4D block to learn better and more refined long-
term spatiotemporal representations with negligible
computation overhead. The proposed attention has
three major components: a temporal attention map over all action units, a channel attention map, and a spatio-temporal attention map. We arrange the attention modules by placing the temporal attention map in the 4D space, and both the channel attention map and the spatio-temporal attention map after permuting back to the 3D form.
Temporal Attention (TA) Map. In order to concen-
trate on the long-term temporal evolution of all ac-
tion units, we design a temporal attention map that
focuses on the different inter-clip motion patterns that
evolve over time. Given an intermediate feature map $F \in \mathbb{R}^{C \times U \times T \times H \times W}$ as input, we infer a temporal attention map $M_T \in \mathbb{R}^{1 \times U \times T \times 1 \times 1}$ by utilizing both average pooling and max pooling along the channel and spatial dimensions to obtain two feature descriptors $F^T_{avg}$ and $F^T_{max}$. Although the CBAM network adopts a filter size of 7 × 7, a design choice with low computational cost in 2D image-related tasks, using a convolutional operation with such a large filter size in 3D or 4D space incurs a significant computational cost in our model. To obtain substantial computational cost savings, we use dilated convolution. We adopt a two-path 1D dilated temporal convolution (Cheema et al., 2018). The first path has a temporal dilated convolution with a dilation factor of 2 (skipping 1 pixel). The second path has a temporal dilated convolution with a dilation factor of 3 (skipping 2 pixels). The two paths model the multiscale global temporal interdependency between all action units. The temporal attention map is computed as follows:
$$M_T(F) = \sigma\big(\mathrm{Conv1D}([\mathrm{AvgPool}(F) + \mathrm{MaxPool}(F)])\big) \qquad (4)$$
$$M_T(F) = \sigma\big(\mathrm{Conv1D}([F^T_{avg} + F^T_{max}])\big) \qquad (5)$$
where σ denotes the sigmoid function, and Conv1D
denotes the multipath dilated temporal convolution
layer. The refined feature map after the temporal at-
tention module is computed as:
$$F_{TA} = M_T \otimes F + F \qquad (6)$$

where $\otimes$ denotes element-wise multiplication, $+$ denotes the inner residual connection, and $F_{TA}$ is the refined feature map. In the original implementation of
CBAM, feature refinement is attained by multiplying
the attention maps with the input feature map. How-
ever, it does not take into consideration the preserva-
tion of the original feature map. We use inner residual
connections in all attention modules to preserve the
original information. This helps to avoid any unre-
lated features or background noise in the current lay-
ers.
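The following is a rough PyTorch sketch of the TA module as we read it: the pooled descriptors are summed as in eqns. (4)-(5), passed through two dilated 1D convolution paths, and applied with the inner residual connection of eqn. (6). The axis over which the 1D convolution runs (the flattened U·T axis) and the summation of the two paths are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """TA-module sketch: channel- and spatial-pooled descriptors are summed,
    refined by two dilated 1D convolution paths (dilations 2 and 3), and turned
    into a sigmoid attention map applied with an inner residual connection."""
    def __init__(self, kernel_size=3):
        super().__init__()
        pad = lambda d: d * (kernel_size - 1) // 2          # keeps the length unchanged
        self.path1 = nn.Conv1d(1, 1, kernel_size, padding=pad(2), dilation=2)
        self.path2 = nn.Conv1d(1, 1, kernel_size, padding=pad(3), dilation=3)

    def forward(self, x):                        # x: (N, C, U, T, H, W)
        n, c, u, t, h, w = x.shape
        # average- and max-pool over channel and spatial dims, then sum (eqn. (4))
        flat = x.permute(0, 2, 3, 1, 4, 5).reshape(n, u, t, c * h * w)
        desc = (flat.mean(-1) + flat.amax(-1)).reshape(n, 1, u * t)
        att = torch.sigmoid(self.path1(desc) + self.path2(desc))      # two dilated paths
        att = att.reshape(n, 1, u, t, 1, 1)
        return att * x + x                       # refined map with inner residual (eqn. (6))

x = torch.randn(2, 16, 4, 8, 14, 14)
print(TemporalAttention()(x).shape)              # torch.Size([2, 16, 4, 8, 14, 14])
```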
Channel Attention (CA) Map. As in the CBAM
network, the channel attention map is produced
by exploiting the inter-channel relationship of fea-
tures. Given an intermediate feature map $F \in \mathbb{R}^{U \times C \times T \times H \times W}$, we compute the channel attention map by using both the max-pooled and average-pooled features simultaneously, generating two different descriptors. Subsequently, both descriptors are fed to a multi-layer perceptron with one hidden layer with an activation size of $\mathbb{R}^{C/r \times 1 \times 1 \times 1 \times 1}$, where $r$ is the reduction ratio (we set $r = 16$). The output feature vectors are then combined using element-wise summation. The entire process can be summarized as follows:

$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \qquad (7)$$
$$M_C(F) = \sigma\big(W_1(W_0(F^C_{avg})) + W_1(W_0(F^C_{max}))\big) \qquad (8)$$

where $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$. In this case, the weights $W_0$ and $W_1$ are shared by both inputs, and a ReLU activation follows $W_0$. The channel attention map is applied as follows:

$$F_C = M_C \otimes F + F \qquad (9)$$
During multiplication, the channel attention values are copied along the spatial dimension and the temporal dimension.

Figure 3: Spatio-Temporal Attention Module.

Spatio-Temporal Attention (STA) Map. This module is designed to focus on the spatio-temporal discriminative features by concentrating on the ROIs that evolve over time. The spatio-temporal attention map is generated by exploiting the inter-spatial relationship of features. Given an intermediate feature map $F \in \mathbb{R}^{U \times C \times T \times H \times W}$, we compute the spatio-temporal attention map by first applying both the max-pooling operation $F^{ST}_{max} \in \mathbb{R}^{1 \times 1 \times T \times H \times W}$ and the average-pooling operation $F^{ST}_{avg} \in \mathbb{R}^{1 \times 1 \times T \times H \times W}$ along the channel axis and concatenating them to generate a refined and efficient feature descriptor $m_{ST}$. Subsequently, we forward $m_{ST}$ to a two-path 2D dilated convolution layer (skipping 1 pixel and 2 pixels) and a two-path 1D dilated temporal convolution layer (skipping 1 pixel and 2 pixels). These two layers are designed to explore multiscale spatial relationships and local temporal interdependencies, respectively.
In summary, the spatio-temporal attention is computed as:

$$m_{ST} = \mathrm{Concatenate}\,[F^{ST}_{avg}, F^{ST}_{max}] \qquad (10)$$
$$M_{ST} = \sigma\big(\mathrm{Conv1D}(\mathrm{ReLU}(\mathrm{Conv2D}(m_{ST})))\big) \qquad (11)$$

where Conv2D represents the two-path 2D convolution layer. The refined feature map is computed as:

$$F_{ST} = M_{ST} \otimes F_C + F_C \qquad (12)$$

where $F_{ST}$ is the refined feature map.
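Analogously, a hedged sketch of the STA module is shown below; the summation of the two dilated paths, the kernel sizes, and the per-action-unit treatment of the first dimension are assumptions on our part rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """STA-module sketch: channel-wise avg/max pooling is concatenated (eqn. (10)),
    refined by two-path dilated 2D convolutions over (H, W) and two-path dilated
    1D convolutions over T (eqn. (11)), and applied with an inner residual
    connection to the channel-refined features (eqn. (12))."""
    def __init__(self, kernel_size=3):
        super().__init__()
        k, p = kernel_size, (kernel_size - 1) // 2
        self.spatial = nn.ModuleList([
            nn.Conv2d(2, 1, k, padding=p * d, dilation=d) for d in (2, 3)])
        self.temporal = nn.ModuleList([
            nn.Conv1d(1, 1, k, padding=p * d, dilation=d) for d in (2, 3)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_c):                      # f_c: (U, C, T, H, W), U in the batch slot
        u, c, t, h, w = f_c.shape
        # eqn. (10): concatenate channel-wise average and max pooling
        m_st = torch.cat([f_c.mean(1, keepdim=True),
                          f_c.amax(1, keepdim=True)], dim=1)       # (U, 2, T, H, W)
        # two-path dilated 2D convolution over (H, W)
        s_in = m_st.permute(0, 2, 1, 3, 4).reshape(u * t, 2, h, w)
        s = self.relu(sum(conv(s_in) for conv in self.spatial))     # (U*T, 1, H, W)
        # two-path dilated 1D convolution over T
        t_in = s.reshape(u, t, h * w).permute(0, 2, 1).reshape(u * h * w, 1, t)
        m = sum(conv(t_in) for conv in self.temporal)               # (U*H*W, 1, T)
        att = torch.sigmoid(m.reshape(u, h, w, t).permute(0, 3, 1, 2)).unsqueeze(1)
        return att * f_c + f_c                                      # eqn. (12)

f_c = torch.randn(8, 32, 8, 14, 14)              # N*U merged into the first dimension
print(SpatioTemporalAttention()(f_c).shape)      # torch.Size([8, 32, 8, 14, 14])
```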
4 EXPERIMENTS
4.1 Datasets
Five benchmark datasets have been used for exper-
imental evaluation of the proposed F4D convolu-
tion block: Something-Something-v1, Something-
Something-v2 (Goyal et al., 2017), Kinetics-400
(Kay et al., 2017), UCF101 (Soomro et al., 2012),
HMDB51 (Kuehne et al., 2011). Something-
Something-v1 is a dataset that contains labeled video
clips of humans performing predefined actions. It
consists of 108,499 videos, with 86,017 in the training
set, 11,522 in the validation set and 10,960 in the test-
ing set, spanning 174 action classes. Something-
Something-v2 is an extension of the first version with
a collection of 220,847 videos incorporating several
enhancements such as higher video resolution, and
reduced label noise.

Figure 4: Training and Testing errors for V4D (left) and F4D (right).

The Kinetics-400 dataset covers 400 action classes with 400 video clips for each action. The video clips are obtained from dif-
ferent YouTube videos with each video clip lasting
10 seconds. The actions are human focused, and
the action classes include a wide range of human-
human and human-object interactions. The UCF101
dataset consists of 13320 video clips with 101 ac-
tion classes. This dataset includes several variations
arising from multi-viewpoints, camera motion, ob-
ject appearance, cluttered background, and illumina-
tion conditions. The HMDB51 dataset has 51 action
classes distributed across 6849 video clips collected
from different sources and public databases such as
YouTube, Google and the Prelinger archive.
4.2 Implementation Details
We perform our initial evaluation on Something-
Something datasets, using the training split for train-
ing and the validation split for testing. To learn the
network parameters, we use the mini batch stochastic
gradient descent (SGD) as the optimization algorithm.
The batch size is set to 128 and the momentum to 0.9.
Initially, the learning rate is set to 0.01, and drops by a
factor of 10 at epochs 20, 40, and 60. Model training
is concluded at 80 epochs. Batch normalization is ap-
plied to all convolutional layers. We follow each F4D
convolutional block with batch normalization, ReLU
activation and a dropout layer. To speed up training,
we utilize the data parallelism strategy implemented
using the torch.nn.DataParallel module in Pytorch to
split the mini-batch of samples into multiple smaller
mini-batches and perform the computation over four
Tesla P100-PCIE-16GB GPUs. Data augmentation
plays an important role in enhancing the performance
of deep learning architectures. During training, we
use random left-right flipping, location jittering, scale
jittering and corner cropping.
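For reference, the training recipe described above corresponds roughly to the following PyTorch setup; only the quoted hyperparameters (SGD with batch size 128, momentum 0.9, initial learning rate 0.01 decayed by 10× at epochs 20, 40 and 60, 80 epochs, DataParallel over multiple GPUs) come from this section, while the placeholder model, the omitted data pipeline, and any unstated settings such as weight decay are our own assumptions.

```python
import torch
import torch.nn as nn

def build_training_setup(model):
    """Optimizer and learning-rate schedule matching the hyperparameters quoted
    in Sec. 4.2; the data pipeline and loss computation are omitted."""
    model = nn.DataParallel(model)               # split each mini-batch across GPUs
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # drop the learning rate by 10x at epochs 20, 40 and 60; train for 80 epochs
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[20, 40, 60], gamma=0.1)
    return model, optimizer, scheduler

model, optimizer, scheduler = build_training_setup(nn.Linear(10, 174))  # placeholder model
for epoch in range(80):
    # ... one pass over the training set with batch size 128 goes here ...
    scheduler.step()
```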
Figure 5: Performance of F4D on Something-Something v1
compared with state-of-the-art approaches.
4.3 Results on Motion-Focused Datasets
In this section, we evaluate our proposed approach
against state-of-the-art approaches on motion-
focused datasets including Something-Something-v1
and Something-Something-v2. Both datasets focus
on modelling motion and temporal information where
the motion of actions is more complicated compared
to that in the Kinetics-400 dataset albeit with a clearer
background. Videos in both datasets contain one con-
tinuous action with clear start and end points along the
temporal dimension. To prepare the videos for train-
ing, we use the segment-based sampling technique ex-
plained in Section 3.1. We segment the holistic du-
ration of a video into U sections of equal durations
in their temporal order and for each section, we ran-
domly select a snippet composed of 32 frames. To
form an action unit, we take each snippet and use the
sampling strategy mentioned in (Feichtenhofer et al.,
2019) to sample 8 frames with a fixed stride of 4.
We also experiment with the number of frames in the
snippet set to 16 with the frame size fixed at 256×256
pixels. After applying the data augmentation tech-
niques mentioned in the previous section, we resize
the cropped region to 224 ×224 pixels. We fix U = 4
in all experiments. For a fair comparison, we use the ResNet50 CNN as the backbone for the proposed F4D
network.
For inference, we follow the approach in (Feicht-
enhofer et al., 2019; Wang et al., 2018c) using fully
spatial convolutional testing. From the entire dura-
tion of a video, we sample 10 action units (U = 10)
of equal duration, scale up the smaller spatial image
dimension to 256 pixels and take 3 crops of 256×256
pixels to spatially cover the entire frame for each ac-
tion unit, and then resize the crops to 224 × 224 pix-
els. The final video-level prediction is then produced via global average pooling over the sequence of all action units.
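A small sketch of this testing protocol is given below; the crop-placement helper and the flat score-tensor layout are our own simplifications of the fully spatial convolutional testing referenced above (Feichtenhofer et al., 2019; Wang et al., 2018c).

```python
import torch

def three_crop_positions(height, width, crop=256):
    """Return the top-left corners of three crop x crop windows that span the
    longer spatial dimension of a frame whose shorter side is `crop` pixels."""
    if width >= height:                     # crops spread along the width
        xs = [0, (width - crop) // 2, width - crop]
        return [(0, x) for x in xs]
    ys = [0, (height - crop) // 2, height - crop]
    return [(y, 0) for y in ys]

def video_level_prediction(scores):
    """scores: (num_units * num_crops, num_classes) clip-level outputs.
    The video-level prediction is their global average."""
    return scores.mean(dim=0)

# Example: 10 action units x 3 crops, 174 classes (Something-Something)
print(three_crop_positions(256, 454))
print(video_level_prediction(torch.randn(30, 174)).shape)   # torch.Size([174])
```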
Figure 4 highlights the training and testing errors for the V4D CNN and the F4D architecture. It illustrates that, for the same network backbone (ResNet-50) and approximately the same number of parameters, the F4D architecture achieves lower training and testing errors. This shows that the factorization of the 4D CNN renders the optimization easier and yields a better resulting representation.

Figure 6: Performance of F4D on Something-Something v2 compared with state-of-the-art approaches.
Figure 5 and Figure 6 show the results of our ap-
proach compared to the state-of-the-art approaches on
the Something-Something datasets. Compared with
the baseline approach that uses a TSN with 8 frames,
the proposed F4D approach with 8 frames achieves
a 35.2% improvement with top-1 accuracy of 54.9 on
the Something-Something-v1 dataset when pretrained
on ImageNet (Deng et al., 2009). When the proposed
F4D model is pretrained on ImageNet and Kinetics-
400, the model achieves 57.5 top-1 accuracy, an im-
provement of 36.8%. On Something-Something-v2,
the F4D model yields 66.3 and 69.8 top-1 accuracy, an improvement of 39.5% when pretrained on ImageNet and of 43% when pretrained on ImageNet and Kinetics-400, respectively.
When the F4D model is trained on ImageNet
and Kinetics-400 using 16 frames on Something-
Something-v1, the F4D model achieves a 58.4 top-
1 accuracy. This shows a 7.7% (50.7 vs 58.4) and
6.1% (52.3 vs 58.4) improvement in accuracy when
compared with STM (Jiang et al., 2019) and TEA (Li
et al., 2020) respectively. The above results show that
the F4D model is capable of learning strong tempo-
ral relationships in the videos in these datasets. When
the F4D model is compared to V4D using 8 frames
on Something-Something-v1, the F4D model shows
a 4.5% (50.4 vs 54.9) and 7.1% (50.4 vs 57.5) im-
provement in top-1 accuracy when pretrained on Im-
ageNet alone and, on ImageNet and Kinetics-400 re-
spectively. This shows that the 4D factorization and
the attention modules added in the residual block of
the F4D model can capture more complex inter-clip
interactions and finer long-range temporal structures
in the underlying video.
Figure 7: Performance of the F4D model on Kinetics-400.
4.4 Results on Scene-Focused Datasets
In this section, we compare the proposed F4D ap-
proach with the state-of-the-art approaches on scene-
focused datasets including Kinetics-400, UCF101 and
HMDB51. The videos representing most actions in
these datasets are short and can be recognized by
static appearance without considering temporal rela-
tionships. Furthermore, the background information
contributes heavily towards deciding the action class
in most of these videos.
Figure 7 shows the results of the F4D model and
other approaches on the Kinetics-400 dataset. When
comparing the F4D model with STM (Jiang et al.,
2019) and TEA (Li et al., 2020), the F4D model shows
a performance improvement of 7.5% and 5.1% re-
spectively. Moreover, it outperforms MSNET (Kwon
et al., 2020) by 4.8% and V4D by 3.8%. Although
the F4D model is designed specifically for temporal
focused action recognition, it shows competitive re-
sults when compared to state-of-the-art methods.
Figure 8 highlights the results on the UCF101
and HMDB51 datasets. We follow (Wang et al.,
2018b) in adopting the three training/testing splits for
evaluation. The F4D model was pretrained on Ima-
geNet and Kinetics-400. In both experiments, we set
U = 4 and use 16 frames during training. Our F4D
model achieves 98.2% and 84.3% accuracy on the UCF101 and HMDB51 datasets, respectively.
4.5 Runtime Analysis
In this section, we compare the proposed F4D ar-
chitecture with the V4D CNN. Our F4D architecture
achieves better results than the V4D CNN on several
benchmark datasets. Table 1 shows the model com-
plexity and accuracy of F4D and V4D on Something-
Something v1 dataset. We follow (Jiang et al., 2019)
to evaluate the FLOPs and speed of our architecture.
We equally sample 8 or 16 frames from a video and
then apply the center crop. Moreover, for speed we
use a batch size of 16. All evaluations are conducted
using two Tesla P100-PCIE-16GB GPUs.

Figure 8: Performance of the F4D model on UCF101 and HMDB51.

As seen in Table 1, F4D improves the accuracy by 7.1% while
achieving 2.3× fewer FLOPs (72G vs. 167G). Moreover, our F4D attains higher accuracy with 1.37× faster speed.
These results demonstrate the effectiveness of the pro-
posed factorization and attention modules in learning
better and refined long-range spatiotemporal repre-
sentation with fewer FLOPs, higher speed, and a very
limited increase in the number of parameters.
4.6 Ablation Study
In this section, we evaluate our F4D model on the
Something-Something datasets given different sce-
narios. All models used in this section are pretrained
on ImageNet and Kinetics-400.
Location of F4D Blocks. In this experiment, we
study the impact of adding the F4D residual block in
different positions within the F4D network. In these
experiments, we fix U = 4 and use 8 frames during
training. As shown in Table 2, adding F4D blocks
at conv2, conv3, conv4 or conv5 layers yields better
top-1 accuracy. Adding an F4D residual block at the
conv1 layer does not have a big impact which means
that the short-long term features need to be refined by
the earlier layers first to yield more meaningful rep-
resentations. We found that adding F4D blocks from
conv2 to conv5 yields the best results.
Number of Action Units U Used for Training. In
this experiment, we observe the change in the value
of U during training and we found that the value of U
have a significant impact on overall performance. Al-
though we anticipated obtaining higher performance
figures, the videos in Something-Something datasets
are relatively short and have one single and contin-
uous action, and the action does not involve many
Table 1: Model complexity of F4D compared to V4D using
single crop.
Approach Frames Top1 FLOPs Speed # of param
V4D [5] 8 50.4 167G 38.1 V/s 36.2M
F4D 8 57.5 72G 52.3 V/s 36.8M
F4D 16 58.4 143G 27.5 V/s 36.8M
Table 2: Location of F4D Residual Blocks.
Location v1 top-1 accuracy v2 top-1 accuracy
conv1 45.3 55.1
conv2 49.9 57.3
conv3 51.6 61.2
conv4 52.8 63.2
conv5 53.2 63.9
conv2-3 54.2 64.3
conv3-4 56.4 67.5
conv2-5 57.5 69.8
Table 3: Impact of the Number of Action Units Used for Training.
U_train v1 top-1 accuracy v2 top-1 accuracy
3 56.8 69.1
4 57.5 69.8
5 57.9 70.3
6 58.3 70.5
7 58.5 70.7
stages. We argue that the effect of higher U values
will be more visible when using longer untrimmed
videos during training.
Impact of Attention Modules. In this experiment,
we study and verify the contributions of each attention
module added in the proposed F4D model. We com-
pare the results of each individual attention module
and the various combinations of these attention mod-
ules. As seen in Table 4, TA+CA+STA achieves the
best top-1 accuracy and outperforms the model that
has no attention by 5.8% on Something-Something v1
and 9.6% on Something-Something v2. By combining
all the attention modules, the F4D model was able to
learn richer short-long term motion and spatiotempo-
ral features.
Comparison with Other Attention Modules. We
compare the proposed TA and STA attention modules
with two state-of-the-art attention modules namely
SE (Hu et al., 2018) and CBAM (Woo et al., 2018).
Both attention modules can improve the performance
by making the network focus on the distinctive ob-
ject features by incorporating finer channel-wise at-
tention, and the spatial module in CBAM can make
the model concentrate on the spatial ROIs. First, we
remove the proposed TA, CA, and STA modules in the
F4D model and insert the SE module in the 3D space
and compute the top-1 accuracy for both Something-
Something datasets. In the second trial, we insert the
CBAM instead and observe the improvement over the
SE module. As illustrated in Table 5, our proposed
combination of TA, CA and STA modules improves
the performance significantly as both proposed atten-
tion modules exploit short-term and long-term temporal relationships, unlike the SE and CBAM modules, which do not take temporal modelling into account.
Table 4: Impact of Attention Modules.
Modules v1 top-1 accuracy v2 top-1 accuracy
No Attention 51.7 60.2
CA 52.5 61.3
STA 53.8 63.2
CA+STA 54.2 65.0
TA 54.0 64.6
TA+CA 55.3 65.4
TA+CA+STA 57.5 69.8
Table 5: Comparison with other Attention Modules.
Modules v1 top-1 accuracy v2 top-1 accuracy
SE [24] 52.1 60.9
CBAM [39] 52.9 62.1
STM Block [34] 53.9 64.8
TEA Block [35] 54.3 65.5
TA+CA+STA 57.5 69.8
5 CONCLUSION
In this paper, we presented an effective yet sim-
ple framework for video-level representation learning, namely F4D, to model both short-range motion and
long-range temporal dependency at a large scale. We
add the F4D residual blocks within the ResNet archi-
tecture to build the F4D pipeline. An F4D residual
block performs the factorized 4D convolutional neu-
ral network which learns complex inter-clip interac-
tions and finer temporal structures. Furthermore, it
applies the two proposed attention modules to the in-
termediate feature maps to learn richer and more refined short- and long-term motion and spatiotemporal features.
Extensive experiments have been conducted to ver-
ify the effectiveness of F4D on five action recogni-
tion benchmark datasets, where our proposed F4D
achieved state-of-the-art results.
REFERENCES
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? a new model and the kinetics dataset.
In proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6299–6308.
Cheema, N., Hosseini, S., Sprenger, J., Herrmann, E., Du,
H., Fischer, K., and Slusallek, P. (2018). Dilated tem-
poral fully-convolutional network for semantic seg-
mentation of motion capture data. arXiv preprint
arXiv:1806.09174.
Chen, Y., Kalantidis, Y., Li, J., Yan, S., and Feng, J. (2018).
Multi-fiber networks for video recognition. In Pro-
ceedings of the european conference on computer vi-
sion (ECCV), pages 352–367.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
Ieee.
Diba, A., Fayyaz, M., Sharma, V., Arzani, M. M., Youse-
fzadeh, R., Gall, J., and Van Gool, L. (2018). Spatio-
temporal channel correlation networks for action clas-
sification. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 284–299.
Dong, M., Fang, Z., Li, Y., Bi, S., and Chen, J. (2021).
Ar3d: attention residual 3d network for human action
recognition. Sensors, 21(5):1656.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas,
C., Golkov, V., Van Der Smagt, P., Cremers, D., and
Brox, T. (2015). Flownet: Learning optical flow with
convolutional networks. In Proceedings of the IEEE
international conference on computer vision, pages
2758–2766.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019).
Slowfast networks for video recognition. In Proceed-
ings of the IEEE/CVF international conference on
computer vision, pages 6202–6211.
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzyn-
ska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I.,
Yianilos, P., Mueller-Freitag, M., et al. (2017). The "something something" video database for learning and
evaluating visual common sense. In Proceedings of
the IEEE international conference on computer vi-
sion, pages 5842–5850.
Hara, K., Kataoka, H., and Satoh, Y. (2018). Can spa-
tiotemporal 3d cnns retrace the history of 2d cnns and
imagenet? In Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition, pages
6546–6555.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-
excitation networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 7132–7141.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. In International conference on ma-
chine learning, pages 448–456. pmlr.
Ji, S., Xu, W., Yang, M., and Yu, K. (2012a). 3d convolu-
tional neural networks for human action recognition.
IEEE transactions on pattern analysis and machine
intelligence, 35(1):221–231.
Ji, S., Xu, W., Yang, M., and Yu, K. (2012b). 3d convolu-
tional neural networks for human action recognition.
IEEE transactions on pattern analysis and machine
intelligence, 35(1):221–231.
Jiang, B., Wang, M., Gan, W., Wu, W., and Yan, J. (2019).
Stm: Spatiotemporal and motion encoding for action
recognition. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision, pages 2000–
2009.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C.,
Vijayanarasimhan, S., Viola, F., Green, T., Back, T.,
Natsev, P., et al. (2017). The kinetics human action
video dataset. arXiv preprint arXiv:1705.06950.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre,
T. (2011). Hmdb: a large video database for human
motion recognition. In 2011 International conference
on computer vision, pages 2556–2563. IEEE.
Kwon, H., Kim, M., Kwak, S., and Cho, M. (2020). Mo-
tionsqueeze: Neural motion feature learning for video
understanding. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–
28, 2020, Proceedings, Part XVI 16, pages 345–362.
Springer.
Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., and Wang, L.
(2020). Tea: Temporal excitation and aggregation for
action recognition. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recogni-
tion, pages 909–918.
Lin, J., Gan, C., and Han, S. (2019). Tsm: Temporal shift
module for efficient video understanding. In Pro-
ceedings of the IEEE/CVF international conference
on computer vision, pages 7083–7093.
Piergiovanni, A. and Ryoo, M. S. (2019). Representation
flow for action recognition. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 9945–9953.
Qiu, Z., Yao, T., and Mei, T. (2017). Learning spatio-
temporal representation with pseudo-3d residual net-
works. In proceedings of the IEEE International Con-
ference on Computer Vision, pages 5533–5541.
Simonyan, K. and Zisserman, A. (2014a). Two-stream con-
volutional networks for action recognition in videos.
Advances in neural information processing systems,
27.
Simonyan, K. and Zisserman, A. (2014b). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101:
A dataset of 101 human actions classes from videos in
the wild. arXiv preprint arXiv:1212.0402.
Sun, S., Kuang, Z., Sheng, L., Ouyang, W., and Zhang, W.
(2018). Optical flow guided feature: A fast and robust
motion representation for video action recognition. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition, pages 1390–1399.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri,
M. (2015). Learning spatiotemporal features with 3d
convolutional networks. In Proceedings of the IEEE
international conference on computer vision, pages
4489–4497.
Tran, D., Ray, J., Shou, Z., Chang, S.-F., and Paluri, M.
(2017). Convnet architecture search for spatiotempo-
ral feature learning. arXiv preprint arXiv:1708.05038.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and
Paluri, M. (2018). A closer look at spatiotemporal
convolutions for action recognition. In Proceedings of
the IEEE conference on Computer Vision and Pattern
Recognition, pages 6450–6459.
Wang, L., Li, W., Li, W., and Van Gool, L. (2018a).
Appearance-and-relation networks for video classifi-
cation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1430–
1439.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang,
X., and Van Gool, L. (2018b). Temporal segment net-
works for action recognition in videos. IEEE trans-
actions on pattern analysis and machine intelligence,
41(11):2740–2755.
Wang, X., Girshick, R., Gupta, A., and He, K. (2018c).
Non-local neural networks. In Proceedings of the
IEEE conference on computer vision and pattern
recognition, pages 7794–7803.
Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). Cbam:
Convolutional block attention module. In Proceed-
ings of the European conference on computer vision
(ECCV), pages 3–19.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017).
Aggregated residual transformations for deep neural
networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1492–
1500.
Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. (2018).
Rethinking spatiotemporal feature learning: Speed-
accuracy trade-offs in video classification. In Pro-
ceedings of the European conference on computer vi-
sion (ECCV), pages 305–321.
Yu, S., Cheng, Y., Xie, L., and Li, S.-Z. (2017). Fully con-
volutional networks for action recognition. IET Com-
puter Vision, 11(8):744–749.
Zhang, B., Wang, L., Wang, Z., Qiao, Y., and Wang, H.
(2016). Real-time action recognition with enhanced
motion vector cnns. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2718–2726.
Zhang, S., Guo, S., Huang, W., Scott, M. R., and Wang,
L. (2020). V4d: 4d convolutional neural networks
for video-level representation learning. arXiv preprint
arXiv:2002.07442.
Zhao, Y., Xiong, Y., and Lin, D. (2018). Trajectory con-
volution for action recognition. Advances in neural
information processing systems, 31.
Zhou, B., Andonian, A., Oliva, A., and Torralba, A.
(2018a). Temporal relational reasoning in videos. In
Proceedings of the European conference on computer
vision (ECCV), pages 803–818.
Zhou, Y., Sun, X., Zha, Z.-J., and Zeng, W. (2018b).
Mict: Mixed 3d/2d convolutional tube for human ac-
tion recognition. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 449–458.
Zolfaghari, M., Singh, K., and Brox, T. (2018). Eco: Ef-
ficient convolutional network for online video under-
standing. In Proceedings of the European conference
on computer vision (ECCV), pages 695–712.