MA-ResNet50: A General Encoder Network for Video Segmentation
Xiaotian Liu, Lei Yang, Xiaoyu Zhang and Xiaohui Duan
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Keywords:
Video Segmentation, Attention Mechanism, Encoder Network.
Abstract:
To improve the performance of segmentation networks on video streams, most researchers currently use optical-flow based methods or non-optical-flow CNN based methods. The former suffers from heavy computational cost and high latency, while the latter suffers from poor applicability and versatility. In this paper, we design a Partial Channel Memory Attention module (PCMA) to store and fuse time-series features from video sequences. Then, we propose a Memory Attention ResNet50 network (MA-ResNet50) by combining the PCMA module with ResNet50, making it the first video-based feature extraction encoder applicable to most of the currently proposed segmentation networks. In our experiments, we combine MA-ResNet50 with four well-established per-frame segmentation networks: DeeplabV3P, PSPNet, SFNet, and DNLNet. The results show that our MA-ResNet50 generally outperforms the original ResNet50 in these 4 networks on VSPW and CamVid. Our method also achieves state-of-the-art accuracy on CamVid. The code is available at https://github.com/xiaotianliu01/MA-Resnet50.
1 INTRODUCTION
As a video scene analysis technology, video segmentation aims to assign pixel-wise labels to voxels (pixels viewed from a spatial-temporal perspective) (Qiu et al., 2018). As video becomes the main medium of information transmission, video segmentation now plays an increasingly important role in many cutting-edge technologies such as autonomous driving (Zhang et al., 2013; Teichmann et al., 2018), augmented reality (Miksik et al., 2015), robotic vision (Vineet et al., 2015), and so on.
In recent years, with the development of deep convolutional neural networks, some non-sequential segmentation networks (Chen et al., 2018; Zhao et al., 2017; Lee et al., 2019; Yin et al., 2020) have achieved relatively high performance on several per-frame annotated datasets. However, directly applying these non-sequential models to video streams brings two problems, i.e., unstable inference results and redundant computation, mainly because they cannot integrate the spatial and temporal relations between consecutive frames.
Researchers on video feature integration have mainly developed three kinds of methods, i.e., optical-flow based methods, spatial-temporal CNN based methods, and non-CNN algorithm based methods. For the first kind, optical flow measures the apparent motion of pixels between consecutive frames, so it is used in keyframe mechanisms to reduce computation (Xu et al., 2018; Zhu et al., 2017; Li et al., 2018) or fed into the CNN as additional information to improve accuracy (Ding et al., 2020; Gadde et al., 2017; Nilsson and Sminchisescu, 2018). However, calculating optical flow (by classical algorithms or by a CNN) can be computationally expensive, which hurts a model's speed and latency. For the second kind, researchers design special spatial-temporal CNN architectures for video sequences and train them end-to-end to extract spatial and temporal features simultaneously (Qiu et al., 2018; Siam et al., 2017; Siam et al., 2016; Hu et al., 2020). Although this kind of method does not introduce extra computation, each proposed structure can only be applied to one specific network, which impairs its applicability and versatility. For the last kind, to achieve feature fusion outside the CNN, researchers propose non-CNN algorithms that do not introduce much computational cost and can be modularized into different models (Lin et al., 2019; Wang et al., 2021a; Wang et al., 2021b). Our proposed method belongs to this third category.
In contrast to the aforementioned methods, our method achieves a trade-off between computational cost and model versatility. Our contributions can be summarized as follows: (1) We design a novel Partial Channel Memory Attention module (PCMA)
based on the attention mechanism to capture relations between frames outside the CNN, which does not introduce as much computational cost as optical-flow based methods. (2) We propose a general Memory Attention ResNet50 (MA-ResNet50) encoder network for video sequence feature extraction, which addresses the limited applicability and versatility of spatial-temporal CNN based methods. (3) We apply our proposed MA-ResNet50 to four well-established per-frame segmentation networks. The experiments show that our MA-ResNet50 is superior to the original ResNet50 in accuracy on two video segmentation datasets, namely CamVid and VSPW, for the video semantic segmentation task and the video object segmentation task. Also, our method achieves state-of-the-art accuracy on CamVid.
2 RELATED WORKS
2.1 Attention Mechanism
Originating from machine translation and natural language processing (Bahdanau et al., 2016), the attention mechanism was designed to adaptively allocate limited computing resources to different parts of the input data according to their contributions to the final result. In the computer vision area, attention mechanisms mainly include channel attention (Hu et al., 2018), spatial attention (Jaderberg et al., 2016), and channel-spatial attention (Woo et al., 2018).
There are two major tasks in the video segmentation area: video semantic segmentation (Garcia-Garcia et al., 2018), which requires identifying the class of every pixel in a video, and video object segmentation (Caelles et al., 2017), which requires separating an object from the background in a video. For the video semantic segmentation task, the memory attention mechanism was first introduced by (Wang et al., 2021b), which uses a key-value memory method to compute an attention matrix with historical features, retrieving information from previous frames to enhance the representation of the current frame. For the video object segmentation task, (Wang et al., 2021a) uses the memory attention mechanism to renew non-redundant information between frames, and (Oh et al., 2019) leverages a similar method to match features between memory and current frames. Also based on the attention mechanism, our PCMA module improves the integrality of the temporal features by exploiting features of different depths and reduces computational cost at the same time.
2.2 ResNet
Winner of the ILSVRC-2015 classification task with 96.4% top-5 accuracy, ResNet (He et al., 2016) is well known for its introduction of residual blocks, which make deep neural networks easier to train by inserting skip connections between layers. Since its emergence, many newly proposed segmentation networks (Chen et al., 2018; Zhao et al., 2017; Lee et al., 2019; Yin et al., 2020) have used it as a well-pretrained encoder network and achieved satisfactory performance on many different datasets by combining it with different proposed decoder networks. Built on its 50-layer version, our MA-ResNet50 enables the encoder network to extract temporal features and achieve better performance on video segmentation tasks.
3 METHODOLOGY
3.1 Overview
An overview of our Memory Attention ResNet50 is illustrated in Figure 1. The whole model mainly contains three elements, i.e., the ResNet50 blocks, the Partial Channel Memory Attention module (PCMA), and the memories $M_n$ where $n \in \{1, 2, 3, 4\}$. ResNet50, with its 4 convolutional blocks, extracts 4 high-dimensional features $C_n$ where $n \in \{1, 2, 3, 4\}$ from the input image $I_i$. The PCMA module obtains long-range temporal context information from $M_n$ and uses it to enhance the representation of the current feature $C_n$, which we explain in more detail in the next part. The memory $M_n$ is key-value structured data, where the key is used to generate the attention matrix by computing the pixel-wise feature correlation between consecutive frames, and the value is used to generate the enhanced features. Specifically, memory $M_n$ contains the feature information of the previous T frames, and both its key $M_n^K$ and value $M_n^V$ are generated by concatenating T enhanced features from the PCMA module along the channel dimension.
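As a reading aid, a minimal sketch of such a key-value memory is given below, assuming a PyTorch implementation; the class name, the fixed capacity T, and the per-frame shapes are illustrative assumptions rather than the authors' released code.

```python
from collections import deque
import torch


class FIFOMemory:
    """Hypothetical key-value memory holding the enhanced features of the last T frames.

    Each stored key corresponds to C_n^K (0.5*r*c channels) and each stored value to
    M_n^enhanced (r*c channels); reading concatenates the T entries along the channel
    dimension, as described for M_n^K and M_n^V in the paper.
    """

    def __init__(self, capacity_T: int = 2):
        self.keys = deque(maxlen=capacity_T)    # oldest entry dropped automatically
        self.values = deque(maxlen=capacity_T)

    def update(self, update_key: torch.Tensor, update_value: torch.Tensor) -> None:
        # First In First Out: appending beyond capacity T discards the oldest frame.
        self.keys.append(update_key.detach())
        self.values.append(update_value.detach())

    def read(self):
        # M_n^K: (T*0.5*r*c, h, w), M_n^V: (T*r*c, h, w) once T frames are stored.
        m_key = torch.cat(list(self.keys), dim=0)
        m_value = torch.cat(list(self.values), dim=0)
        return m_key, m_value
```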
For the whole segmentation process, our MA-ResNet50 works in a cyclic updating manner. Each image fed into the network is first processed by ResNet50, producing four features of different sizes. Along with their corresponding memories, these features are then fed into the PCMA module to generate temporally enhanced features. On one hand, the enhanced features are sent to the segmentation head network to generate the segmentation results. On the other hand, the enhanced features are used to update the memory on a First In First Out (FIFO) basis.
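Continuing the sketch above, the cyclic update could be wired roughly as follows; resnet_blocks, pcma_modules, and memories are hypothetical handles standing in for the four ResNet50 blocks, four PCMA modules, and four FIFO memories, and the segmentation head is left abstract.

```python
def encode_frame(image, resnet_blocks, pcma_modules, memories):
    """One cyclic step: extract C_1..C_4, enhance each with PCMA against its memory,
    then push the new key/value back into the memory (FIFO).

    Cold-start handling (empty memory on the very first frame) is omitted for brevity.
    """
    enhanced_features = []
    x = image
    for block, pcma, memory in zip(resnet_blocks, pcma_modules, memories):
        x = block(x)                              # C_n from the n-th ResNet50 block
        m_key, m_value = memory.read()            # M_n^K, M_n^V
        enhanced, update_key, update_value = pcma(x, m_key, m_value)
        memory.update(update_key, update_value)   # FIFO memory update
        enhanced_features.append(enhanced)        # later consumed by the segmentation head
    return enhanced_features
```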
Figure 1: Overview of our Memory Attention ResNet50. The current input image is first processed by the 4 convolutional blocks of ResNet50. Along with their corresponding memories $M_1$, $M_2$, $M_3$, $M_4$, the output features $C_1$, $C_2$, $C_3$, $C_4$ are then fed into the PCMA module for feature enhancement. The enhanced features output by the PCMA module are then sent to the segmentation head for result generation and to the memory for updating.
3.2 Partial Channel Memory Attention Module

The intuition behind the PCMA module comes from the human visual system. In a continuous observation process, the human visual system forms a short-term memory which contains past semantic information about the observed target. When understanding the current view, the human visual system exploits this memory as a reference to allocate attention to different parts of the view. In the PCMA module, we apply a similar mechanism by using feature matrices to store semantic information from past frames and using correlation calculation to enhance the attention allocation on the current frame.
Figure 2 shows the pipeline of our PCMA module. As mentioned in 3.1, the input of the PCMA module is the current feature $C_n$ and the key-value data $M_n^K$, $M_n^V$ from the memory feature $M_n$. The output is the enhanced feature, namely Enhanced Feat, used for generating results, and the key-value data, namely Update Key and Update Value, used for memory updating.

3.2.1 Partial Channel Key Encoder

To reduce the computational cost of generating the memory attention, we adopt a 2D convolution layer to compress the input feature by halving the number of its channels. Specifically, instead of feeding the whole $C_n \in \mathbb{R}^{c \times h \times w}$ into the 2D convolution layer, we innovatively select only part of the channels of $C_n$ to be calculated and enhanced, namely $C_n^{selected}$, which balances the contributions of historical and current information to the final results and also saves computational resources. To adjust the selection strategy, we introduce an exogenous ratio $r \in [0, 1]$, whose effect is tested in the experiments part. The selection strategy can be illustrated as:

$$C_n^{selected} = \Theta\big(C_n^{(1-r)c}, C_n^{(1-r)c+1}, \ldots, C_n^{c-1}, C_n^{c}\big) \quad (1)$$

Here $C_n^{i}$ denotes the 2D tensor in the $i$-th channel of $C_n$, $\Theta$ denotes the concatenation operation, and $c$ denotes the number of channels of $C_n$. After the key encoder, we flatten the output tensor $C_n^{K} \in \mathbb{R}^{0.5rc \times h \times w}$ to a 2D matrix, namely $C_n^{K'} \in \mathbb{R}^{0.5rc \times hw}$, for later calculation.
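For illustration, the channel selection of Eq. (1) and the channel-halving key encoder could look like the sketch below; the module name, the 1x1 kernel size, and the flattening step are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class PartialChannelKeyEncoder(nn.Module):
    """Sketch of Eq. (1) plus the key encoder: keep the last r*c channels of C_n and
    halve their number with a 2D convolution (the 1x1 kernel is an assumption)."""

    def __init__(self, channels: int, r: float = 0.25):
        super().__init__()
        self.num_selected = int(r * channels)
        self.key_conv = nn.Conv2d(self.num_selected, self.num_selected // 2, kernel_size=1)

    def forward(self, c_n: torch.Tensor):
        # c_n: (B, c, h, w); Eq. (1) keeps channels (1-r)*c .. c, i.e. the last r*c channels.
        c_selected = c_n[:, -self.num_selected:, :, :]
        c_key = self.key_conv(c_selected)          # C_n^K:  (B, 0.5*r*c, h, w)
        c_key_2d = c_key.flatten(2)                # C_n^K': (B, 0.5*r*c, h*w)
        return c_selected, c_key, c_key_2d
```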
3.2.2 Enhanced Feature Generation

After flattening $M_n^{K} \in \mathbb{R}^{0.5Trc \times h \times w}$ to a 2D matrix, namely $M_n^{K'} \in \mathbb{R}^{0.5rc \times Thw}$, we conduct matrix multiplication on $C_n^{K'}$ and $M_n^{K'}$ to generate the memory attention $A_n$, which can be illustrated as:

$$A_n(i, j) = \sum_{l=1}^{0.5rc} M_n^{K'}(l, i)\, C_n^{K'}(l, j) \quad (2)$$

For $A_n \in \mathbb{R}^{Thw \times hw}$, the entry $A_n(i, j)$ denotes the correlation between the $i$-th pixel in $M_n^{K}$ and the $j$-th pixel in $C_n^{K}$.
Figure 2: Pipeline of the Partial Channel Memory Attention module. After the Key Encoder, the current feature $C_n$ is transferred to the 2D matrix $C_n^{K'}$. The key-value memory data $M_n^{K}$, $M_n^{V}$ are also transferred to the 2D matrices $M_n^{K'}$, $M_n^{V'}$. The memory attention matrix $A_n$ is obtained by conducting matrix multiplication on $C_n^{K'}$ and $M_n^{K'}$. Similarly, the enhanced feature $M_n^{enhanced}$ is obtained by conducting matrix multiplication on $A_n$ and $M_n^{V'}$. After feature aggregation and the Judge module, the enhanced feature with full channels is output for result generation. Also, the intermediate results $C_n^{K}$ and $M_n^{enhanced}$ are output for memory updating.
This correlation indicates the pixel-wise matching between historical features and current features. Then we conduct a softmax operation along the first dimension of $A_n$ for normalization and flatten $M_n^{V} \in \mathbb{R}^{Trc \times hw}$ to $M_n^{V'} \in \mathbb{R}^{rc \times Thw}$. Using the normalized $A_n$ as weights, we similarly conduct matrix multiplication on $A_n$ and $M_n^{V'}$, which produces the final temporally enhanced feature, namely $M_n^{enhanced}$, after a reshape operation.
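To make Eq. (2) and the value readout concrete, here is a minimal batched sketch; the function name, the use of torch.einsum and torch.bmm, and the exact tensor layouts are our assumptions, while the softmax over the memory dimension follows the text.

```python
import torch
import torch.nn.functional as F


def memory_attention_enhance(c_key_2d, m_key_2d, m_value_2d, h, w):
    """Eq. (2) plus the weighted value readout.

    c_key_2d:   (B, 0.5*r*c, h*w)    current key  C_n^K'
    m_key_2d:   (B, 0.5*r*c, T*h*w)  memory key   M_n^K'
    m_value_2d: (B, r*c,     T*h*w)  memory value M_n^V'
    Returns M_n^enhanced with shape (B, r*c, h, w).
    """
    # A_n(i, j) = sum_l M_n^K'(l, i) * C_n^K'(l, j)  ->  (B, T*h*w, h*w)
    attention = torch.einsum('bci,bcj->bij', m_key_2d, c_key_2d)
    # Normalize over the memory pixels (the first dimension of A_n).
    attention = F.softmax(attention, dim=1)
    # Weighted readout of the memory values: (B, r*c, h*w), then reshape to (B, r*c, h, w).
    enhanced = torch.bmm(m_value_2d, attention)
    return enhanced.view(enhanced.size(0), -1, h, w)
```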
3.2.3 Output Judge Module

For better feature aggregation, we adopt a mean-scaling add method, which is:

$$F_n = M_n^{enhanced} \cdot \frac{\mathrm{mean}(C_n^{selected})}{\mathrm{mean}(M_n^{enhanced})} + C_n^{selected} \quad (3)$$

Here $F_n$ denotes the fused feature.

In the actual test process, we find that when a temporal sequence has low consistency, mainly because of rapidly changing scenes or a low frame rate of the input video, the relations between frames can be inapparent, which makes $F_n$ tend to be equalized. This happens more often when the number of channels of the enhanced feature increases. Therefore, to prevent an equalized $F_n$ from affecting the final results as noise, we design a judge module to decide the final output feature. The principle of the judge module can be shown as:

$$a = \frac{\mathrm{mean}(F_n) - \min(F_n)}{\max(F_n) - \min(F_n)} \quad (4)$$

$$O_n = \begin{cases} C_n^{selected} & a < 0.1 \text{ or } a > 0.9 \\ F_n & 0.1 < a < 0.9 \end{cases} \quad (5)$$

Here $O_n$ denotes the output embedding feature. Then, we combine $O_n$ with the unselected channels from $C_n$ to output the full-channel enhanced feature, which is used to generate the segmentation results. Additionally, $C_n^{K}$ and $M_n^{enhanced}$ are also output as key-value data to update the memory.
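Eqs. (3)-(5) can be summarized in a few lines; in the sketch below the thresholds 0.1 and 0.9 come from the paper, while the global mean/min/max reading, the epsilon guards, and the function name are our assumptions.

```python
import torch


def fuse_and_judge(c_selected: torch.Tensor, m_enhanced: torch.Tensor,
                   low: float = 0.1, high: float = 0.9) -> torch.Tensor:
    """Mean-scaling fusion (Eq. 3) followed by the judge gate (Eqs. 4-5)."""
    eps = 1e-6  # numerical guard, our addition
    # Rescale the enhanced memory feature to the magnitude of the current feature (Eq. 3).
    f_n = m_enhanced * (c_selected.mean() / (m_enhanced.mean() + eps)) + c_selected
    # Normalized position of the mean within the value range of F_n (Eq. 4).
    a = (f_n.mean() - f_n.min()) / (f_n.max() - f_n.min() + eps)
    # Fall back to the current feature when F_n looks equalized (Eq. 5).
    return c_selected if (a < low or a > high) else f_n
```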
4 EXPERIMENTS
4.1 Datasets
To verify the validity of our MA-ResNet50 on video
semantic segmentation task and video object segmen-
tation task, we conduct experiments on two datasets,
Figure 3: Segmentation results visualization. Specifically, "model" and "model MA" stand for the results of the model based on the original ResNet50 and on our MA-ResNet50, respectively.

Figure 4: 2D depth map visualization. $C_1^{selected}$ stands for the original feature from Block 1 of ResNet50. $M_1^{enhanced}$ stands for the enhanced memory feature of $C_1^{selected}$. $O_1$ stands for the final output enhanced feature, which is $C_1^{selected} + M_1^{enhanced}$ in most cases.
i.e., CamVid (Brostow et al., 2009) and VSPW (Miao et al., 2021). With 11 labeled categories, CamVid is a video semantic segmentation dataset for traffic scenes; it contains 4 videos, each annotated at 1 fps. VSPW contains 3536 long temporal clips of various real-world scenarios with dense pixel-wise annotations at 15 fps. To conduct experiments on the video object segmentation task, instead of using all 124 categories of VSPW, we only select the category with the highest frequency, namely Person, as the detection target. Besides, we adopt mean Intersection-over-Union (mIoU) as our
Table 1: Comparison of the ResNet50 baseline and our MA-ResNet50 on CamVid and VSPW.

Dataset                  Model       Encoder       mIoU%   fps     ΔmIoU%
CamVid (Less Temporal)   DeeplabV3P  ResNet50      73.77   45.78
                                     MA-ResNet50   75.02   38.60   +1.25
                         PSPNet      ResNet50      69.61   46.14
                                     MA-ResNet50   71.39   42.60   +1.78
                         SFNet       ResNet50      75.55   12.99
                                     MA-ResNet50   77.05   11.55   +1.50
VSPW (More Temporal)     DeeplabV3P  ResNet50      81.79   43.45
                                     MA-ResNet50   83.10   37.27   +1.31
                         PSPNet      ResNet50      75.95   45.35
                                     MA-ResNet50   78.52   42.08   +2.57
                         DNLNet      ResNet50      78.94   11.79
                                     MA-ResNet50   80.82   10.88   +1.88
evaluation metric, which can be illustrated as:

$$\mathrm{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (6)$$

Here $k$ denotes the number of segmentation classes and $p_{ij}$ denotes the number of pixels whose ground-truth class is $i$ and predicted class is $j$.
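For reference, Eq. (6) can be computed from a (k+1) x (k+1) confusion matrix as in the generic sketch below; this is not the authors' evaluation code, and the guard against empty classes is our addition.

```python
import numpy as np


def mean_iou(conf: np.ndarray) -> float:
    """Compute Eq. (6) from a (k+1) x (k+1) confusion matrix, where conf[i, j] counts
    pixels with ground-truth class i predicted as class j."""
    intersection = np.diag(conf)                                  # p_ii
    union = conf.sum(axis=1) + conf.sum(axis=0) - intersection    # sum_j p_ij + sum_j p_ji - p_ii
    iou = intersection / np.maximum(union, 1)                     # guard against empty classes
    return float(iou.mean())
```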
4.2 Training
To ensure the rigor of the controlled trial, we train
all the networks with the same setting on Nvidia
TeslaV100 GPU. For loss and optimizer, we use
cross-entropy loss and SGD with its weight decay
set to 4e-5. We employ a poly learning rate policy
to adjust the learning rate every iteration from 0.1 to
0.001. For resolution, images from CamVid are re-
sized to 960×720 while images from VSPW are re-
sized to 640×640. Besides, the total iteration is set
to 60000 for CamVid and 120000 for VSPW and the
batch size is set to 2. For data augmentation, we adopt
resize, random horizontal flip and normalize without
shuffling.
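The poly policy mentioned here is commonly implemented as a polynomial decay of the form (1 - iter/max_iter)^power; the sketch below is one such variant that interpolates between the stated initial and final learning rates, with the power of 0.9 being our assumption rather than a value given in the paper.

```python
def poly_lr(base_lr: float, end_lr: float, cur_iter: int, max_iter: int,
            power: float = 0.9) -> float:
    """Polynomial decay from base_lr towards end_lr over max_iter iterations."""
    factor = (1.0 - cur_iter / max_iter) ** power
    return (base_lr - end_lr) * factor + end_lr


# Example with the paper's CamVid setting: decay from 0.1 to 0.001 over 60000 iterations.
lr_halfway = poly_lr(base_lr=0.1, end_lr=0.001, cur_iter=30000, max_iter=60000)
```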
4.3 Improving ResNet50 Baseline
To prove that our MA-ResNet50 is superior to the original ResNet50 baseline in segmentation accuracy, we employ 4 well-established per-frame segmentation models, i.e., DeeplabV3P (Chen et al., 2018), PSPNet (Zhao et al., 2017), SFNet (Lee et al., 2019) and DNLNet (Yin et al., 2020). We replace the encoder network in each of these 4 models with the original ResNet50 and with our MA-ResNet50, respectively. After training with the same settings illustrated above, the comparison of results is shown in Table 1.

The comparison shows that our MA-ResNet50 generally outperforms the original ResNet50 on these two datasets by 1% - 3% mIoU. In particular, the improvement is more significant on the more temporally dense data from VSPW than on CamVid.
To make the comparison more intuitive, we visualize some of the segmentation results. We also visualize some 2D depth maps from $C_n^{selected}$ and $M_n^{enhanced}$ to show the effect of the enhancement. The visualization results are shown in Figure 3 and Figure 4.
4.4 Ablation Study
For the ablation study, we test the effect of two variables in our model, i.e., T, the number of stored features in the memory, and r, the channel selection ratio. The experimental results of the ablation study are presented in Table 2 and Table 3.
Table 2: Effect of the T variable on VSPW for three models based on MA-ResNet50.

Model        mIoU% (T=1)   mIoU% (T=2)   mIoU% (T=3)
DeeplabV3P   83.10         83.09         83.09
PSPNet       78.47         78.52         78.48
DNLNet       80.81         80.82         80.81
The results imply that the optimal T tends to be 2 for the more temporal data from VSPW, and the optimal r tends to be 0.25 for the less temporal data from CamVid. It should be noted, however, that larger T and r lead to higher computational cost. To illustrate this numerically, we calculate the FLOPS of MA-ResNet50 for different combinations of T and r; the results are shown in Table 4.
Table 3: Effect of the r variable on CamVid for two models based on MA-ResNet50.

Model    mIoU% (r=0.125)   mIoU% (r=0.25)   mIoU% (r=0.5)
PSPNet   71.30             71.39            71.25
SFNet    77.00             77.05            76.27
Table 4: FLOPS for different combinations of r and T. Specifically, the last column stands for FLOPS of MA-ResNet50 / FLOPS of ResNet50 × 100%.

r       T   FLOPS (B)   /ResNet50 (%)
0.125   1   4.11        102.59
0.125   2   6.82        104.43
0.125   3   9.52        106.01
0.25    1   11.07       107.08
0.25    2   16.47       110.38
0.25    3   21.88       113.80
0.5     1   33.47       121.13
0.5     2   44.29       128.03
0.5     3   55.10       134.80
4.5 State-of-the-Art Comparison
Applying MA-ResNet50 to SFNet (Lee et al., 2019), our method achieves better accuracy than other state-of-the-art methods for the video segmentation task on CamVid. The comparison between our method and other optical-flow based and non optical-flow based methods is shown in Table 5.
Table 5: Comparison of our method with other state-of-the-art methods on CamVid.

Method                                   mIoU%
GRFP (Nilsson and Sminchisescu, 2018)    66.10
Netwarp (Gadde et al., 2017)             67.10
TDNet (Hu et al., 2020)                  76.00
TMANet (Wang et al., 2021b)              76.50
Ours                                     77.05
5 CONCLUSIONS
In this paper, we propose a Memory Attention ResNet50 encoder network for video sequence feature extraction. Specifically, we design a Partial Channel Memory Attention module to integrate long-term temporal relations between consecutive frames. Experiments show that our method outperforms the original ResNet50 in 4 per-frame segmentation networks. Our method also achieves state-of-the-art accuracy on CamVid. In future work, we will mainly work on new correlation calculation algorithms to further reduce computational cost and improve the enhancement effectiveness.
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neural ma-
chine translation by jointly learning to align and trans-
late.
Brostow, G. J., Fauqueur, J., and Cipolla, R. (2009). Seman-
tic object classes in video: A high-definition ground
truth database. Pattern Recognit. Lett., 30:88–97.
Caelles, S., Maninis, K.-K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., and Gool, L. V. (2017). One-shot video object segmentation.
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and
Adam, H. (2018). Encoder-decoder with atrous sep-
arable convolution for semantic image segmentation.
In Proceedings of the European Conference on Com-
puter Vision (ECCV).
Ding, M., Wang, Z., Zhou, B., Shi, J., Lu, Z., and Luo,
P. (2020). Every frame counts: Joint learning of
video segmentation and optical flow. Proceedings
of the AAAI Conference on Artificial Intelligence,
34(07):10713–10720.
Gadde, R., Jampani, V., and Gehler, P. V. (2017). Se-
mantic video cnns through representation warping. In
Proceedings of the IEEE International Conference on
Computer Vision (ICCV).
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-
Martinez, V., Martinez-Gonzalez, P., and Garcia-
Rodriguez, J. (2018). A survey on deep learning tech-
niques for image and video semantic segmentation.
Applied Soft Computing, 70:41–65.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-
excitation networks. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Hu, P., Caba, F., Wang, O., Lin, Z., Sclaroff, S., and Per-
azzi, F. (2020). Temporally distributed networks for
fast video semantic segmentation. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR).
Jaderberg, M., Simonyan, K., Zisserman, A., and
Kavukcuoglu, K. (2016). Spatial transformer net-
works.
Lee, J., Kim, D., Ponce, J., and Ham, B. (2019). Sfnet:
Learning object-aware semantic correspondence. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Li, Y., Shi, J., and Lin, D. (2018). Low-latency video
semantic segmentation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Lin, J., Gan, C., and Han, S. (2019). Tsm: Temporal shift
module for efficient video understanding. In Proceed-
ings of the IEEE/CVF International Conference on
Computer Vision (ICCV).
Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., and Yang, Y.
(2021). Vspw: A large-scale dataset for video scene
parsing in the wild. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Miksik, O., Vineet, V., Lidegaard, M., Prasaath, R., Nießner, M., Golodetz, S., Hicks, S. L., Pérez, P., Izadi, S., and Torr, P. H. (2015). The semantic paint-
brush: Interactive 3d mapping and recognition in large
outdoor spaces. In Proceedings of the 33rd Annual
ACM Conference on Human Factors in Computing
Systems, CHI ’15, page 3317–3326, New York, NY,
USA. Association for Computing Machinery.
Nilsson, D. and Sminchisescu, C. (2018). Semantic video
segmentation by gated recurrent flow propagation. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR).
Oh, S. W., Lee, J.-Y., Xu, N., and Kim, S. J. (2019). Video
object segmentation using space-time memory net-
works. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV).
Qiu, Z., Yao, T., and Mei, T. (2018). Learning deep spatio-
temporal dependence for semantic video segmenta-
tion. IEEE Transactions on Multimedia, 20(4):939–
949.
Siam, M., Valipour, S., Jägersand, M., and Ray, N. (2016).
Convolutional gated recurrent networks for video seg-
mentation. CoRR, abs/1611.05435.
Siam, M., Valipour, S., Jagersand, M., and Ray, N. (2017).
Convolutional gated recurrent networks for video seg-
mentation. In 2017 IEEE International Conference on
Image Processing (ICIP), pages 3090–3094.
Teichmann, M., Weber, M., Zoellner, M., Cipolla, R., and
Urtasun, R. (2018). Multinet: Real-time joint seman-
tic reasoning for autonomous driving.
Vineet, V., Miksik, O., Lidegaard, M., Nießner, M., Golodetz, S., Prisacariu, V. A., Kähler, O., Murray, D. W., Izadi, S., Pérez, P., and Torr, P. H. S. (2015).
Incremental dense semantic stereo fusion for large-
scale semantic scene reconstruction. In 2015 IEEE
International Conference on Robotics and Automation
(ICRA), pages 75–82.
Wang, H., Jiang, X., Ren, H., Hu, Y., and Bai, S.
(2021a). Swiftnet: Real-time video object segmen-
tation. CoRR, abs/2102.04604.
Wang, H., Wang, W., and Liu, J. (2021b). Temporal mem-
ory attention for video semantic segmentation. CoRR,
abs/2102.08643.
Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). Cbam:
Convolutional block attention module. In Proceed-
ings of the European Conference on Computer Vision
(ECCV).
Xu, Y.-S., Fu, T.-J., Yang, H.-K., and Lee, C.-Y. (2018). Dy-
namic video segmentation network. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Yin, M., Yao, Z., Cao, Y., Li, X., Zhang, Z., Lin, S., and Hu,
H. (2020). Disentangled non-local neural networks. In
Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.-M.,
editors, Computer Vision - ECCV 2020, pages 191–
207, Cham. Springer International Publishing.
Zhang, H., Geiger, A., and Urtasun, R. (2013). Understand-
ing high-level semantics by modeling traffic patterns.
In Proceedings of the IEEE International Conference
on Computer Vision (ICCV).
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017).
Pyramid scene parsing network. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Zhu, X., Xiong, Y., Dai, J., Yuan, L., and Wei, Y. (2017).
Deep feature flow for video recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).