A New Approach to Moving Object Detection and Segmentation: The XY-shift Frame Differencing
N. Wondimu (1,4,a), U. Visser (3,b) and C. Buche (1,2,c)
1 Lab-STICC, Brest National School of Engineering, 29280 Plouzané, France
2 IRL CROSSING, CNRS, Adelaide, Australia
3 University of Miami, Florida, U.S.A.
4 School of Information Technology and Engineering, Addis Ababa University, Addis Ababa, Ethiopia
a https://orcid.org/0000-0002-0726-9892
b https://orcid.org/0000-0002-1254-2566
c https://orcid.org/0000-0003-0264-2683
Keywords: Moving Object Detection, Frame Differencing, Object Segmentation, XY-shift Frame Differencing, Three-Frame Differencing.
Abstract: Motion outweighs other low-level saliency features in attracting human attention and defining regions of interest. The ability to effectively identify moving objects in a sequence of frames helps to solve important computer vision problems, such as moving object detection and segmentation. In this paper, we propose a novel frame differencing technique along with a simple three-stream encoder-decoder architecture to effectively and efficiently detect and segment moving objects in a sequence of frames. Our frame differencing component incorporates a novel self-differencing technique, which we call XY-shift frame differencing, and an improved three-frame differencing technique. We fuse the feature maps from the raw frame and the two outputs of our frame differencing component, and feed them to our transfer-learning based convolutional base, VGG-16. The result from this sub-component is further deconvolved to produce the desired segmentation map. The effectiveness of our model is evaluated using the relabeled multi-spectral CDNet-2014 dataset for motion segmentation. The qualitative and quantitative results show that our technique achieves effective and efficient moving object detection and segmentation relative to the state-of-the-art methods.
1 INTRODUCTION
Motion is a key social stimulus that engages visual attention and induces autonomic arousal in the viewer (Williams et al., 2019; Mahapatra et al., 2008). Motion detection is extensively used in computer vision applications to facilitate the analysis of real-world video scenes. Video surveillance (Sehairi et al., 2017) is a significant one; traffic and pedestrian monitoring (Zhao et al., 2016), robot control (Motlagh et al., 2014), target detection and counting (García et al., 2012), and human activity detection (Zhang et al., 2012) are some of its other applications in computer vision.
The process of relating moving regions of interest (ROI) with objects is called moving object detection. Moving object detection concerns how to extract moving objects from video frames while removing the background region and noise. Robust moving object
detection enables computationally optimal foreground object detection and saliency prediction (Sengar and Mukhopadhyay, 2017).
The optical flow method uses a sequence of ordered images to estimate the motion of objects as either instantaneous image velocities or discrete image displacements (Beauchemin and Barron, 1995). It relies on the properties of the object's flow vectors over time to detect moving regions. However, optical flow is highly complex and susceptible to noise corruption, fake motion, and illumination variation.
On the other hand, background subtraction comprises two steps: (i) background modeling and (ii) computation of the difference between the current background model and the current video frame. The performance of the background subtraction method is highly dependent on the accuracy of the background model. This method yields outstanding performance whenever the background model is accurate; otherwise, moving objects are detected incorrectly (Garcia-Garcia et al., 2020). The performance of the background subtraction technique degrades when it faces videos
with low frame rates, camera jitter, and significant illumination change.
The other widely used moving object detection technique is frame differencing. It detects moving objects by taking the pixel-by-pixel difference of consecutive frames in a video sequence. Frame differencing is the most common and computationally least complex method for moving object detection in scenarios where the scene is dynamic due to camera movement and object mobility. However, this method fails to detect all relevant pixels of some types of moving objects (the foreground aperture problem) (Xu et al., 2017). It also wrongly detects trailing regions as moving objects (known as ghost regions) when an object moves fast across frames (Fei et al., 2015). Most significantly, frame differencing fails to detect objects with uniform regions.
To this end, we propose a novel XY-shift frame differencing technique along with a three-stream encoder-decoder architecture to address the setbacks of frame differencing based moving object detection and segmentation techniques. The effectiveness of our frame differencing technique is analyzed both as a standalone high-pass filter algorithm and as an input for our improved three-frame differencing and encoder-decoder network. Furthermore, we implement a three-stream encoder-decoder architecture to build a simple but robust moving object detection model.
The rest of the paper is organized as follows. The second part briefly introduces related research works, the third part introduces the proposed method in detail, the fourth part shows the experimental analysis, and finally, a summary of this paper is presented.
2 RELATED WORKS
In the early days of moving object detection, researchers formulated well-established background subtraction techniques for stationary camera settings. These techniques have been extended over many years and are able to successfully detect moving objects as long as the camera is stationary (Stauffer and Grimson, 1999; Barnich and Van Droogenbroeck, 2010; Dou et al., 2017; Cucchiara et al., 2003; Bouwmans, 2012; Sengar and Mukhopadhyay, 2017). However, a relatively long initialization time to model the background and residual image alignment errors in non-stationary camera settings are the main setbacks of this technique (Zhang et al., 2006). These problems are incontestably addressed by optical flow based methods (Narayana et al., 2013; Ochs et al., 2013; Horn and Schunck, 1981). However, optical flow methods depend heavily on the quality of the optical flow vectors, which is crucial for motion segmentation performance. Besides, optical flow is highly complex and, due to its high sensitivity to noise corruption, cannot meet the needs of real-time object detection. Optical flow works exceptionally well on large moving objects but fails to detect smaller objects due to blurry edges and low resolution (Rozantsev et al., 2016).
A variety of frame differencing techniques address the aforementioned problems. For instance, the inter-frame differencing technique generates the difference between two consecutive frames over a period of time to identify background and foreground pixels. The research in (Liang et al., 2010) uses an inter-frame differencing algorithm to detect moving targets in aviation video. Their experiments indicate that the algorithm requires few computations and extracts moving targets with high accuracy. In (Nakashima and Yabuta, 2018), inter-frame differencing and dynamic binarization using discriminant analysis are applied; the positions of the moving object in the image are determined by observing the histograms of each frame.
A slightly different method with results comparable to inter-frame differencing is three-frame differencing. This method takes three adjacent frames as a group, subtracts each pair of adjacent frames, and combines the two differential results with a logical AND operation. This has been the most widely used, traditional three-frame differencing method. In (Yin et al., 2016), the traditional three-frame differencing technique and the W4 algorithm are used to detect foreground objects in infrared video datasets. Another work (Sengar and Mukhopadhyay, 2016) proposes a moving object detection method for static backgrounds; the algorithm uses non-overlapping blocks of the difference frames and calculates the intensity sum and mean of each block.
The inter-frame differencing and three-frame differencing techniques suffer from the foreground aperture and ghosting problems caused by slow-moving and fast-moving foreground objects, respectively. Besides, these methods are known for partial or split detection of objects (Tsai and Yeh, 2013). As a result, frame differencing techniques are prone to false positives and sometimes false negatives.
To this end, combined moving object detection methods have improved the performance of frame differencing techniques. Relatively robust moving object detection methods combine background subtraction with frame differencing (Weng et al., 2010; Xiao et al., 2010; Cheng and Wang, 2014; Fei et al., 2015) or optical flow (Halidou et al., 2014; Zhang et al., 2006; Fernández-Caballero et al., 2010).
A static-background three-frame differencing method combined with background subtraction is proposed in (Weng et al., 2010) and (Cheng and Wang, 2014). A moving object detection method combining optical flow and three-frame differencing is employed in (Halidou et al., 2014); it uses region-of-interest (ROI) and multi-block local binary pattern descriptors. Another frame differencing and optical flow based moving object detection technique is proposed in (Fernández-Caballero et al., 2010), where a thermal infrared camera mounted on an autonomous mobile robot feeds the detection module.
A method for detecting moving people in indoor environments using frame differencing and neural network based classification is proposed in (Foresti et al., 2005). This method reduces false alarms and provides robust classification with the help of a finite state automaton. Similarly, a new approach based on a fuzzy adaptive resonance theory neural network with a forgetting method for foreground detection and background establishment in natural scenes is proposed in (Dou and Li, 2014). On the other hand, frame differencing and the non-pyramidal Lucas-Kanade approach (Bouguet et al., 2001) are used to detect human candidates based on thermal signatures while the robot stops and moves.
In (Xu et al., 2017), an efficient foreground detection method is proposed by combining three-frame differencing with a Gaussian mixture model. Another work (Lee et al., 2013) presents a moving object detection method that combines background subtraction, a separable morphological edge detector, and optical flow.
Recently, more sophisticated and efficient moving object detection methods have been proposed by integrating improved frame differencing techniques with deep neural network technologies. For instance, (Ellenfeld et al., 2021) propose a deep learning based moving object detection method that uses a Deep Convolutional Neural Network (DCNN) for multi-modal motion segmentation. Improved three-frame differencing and the current RGB frame are used to capture temporal information and the appearance of the current scene, respectively. These inputs are later fused in the DCNN component for effective, efficient, and robust motion segmentation. This model improved the performance of three-frame differencing techniques in detecting tiny moving objects.
The work in (Yang et al., 2017) applies a frame differencing technique with a Faster Region-based Convolutional Neural Network (R-CNN) for highly precise detection and tracking. Similarly, (Mohtavipour et al., 2022) propose a multi-stream CNN and frame differencing based moving object detection method for deep violence detection. It uses handcrafted features related to appearance, speed of movement, and a representative image, fed to a convolutional neural network (CNN) as spatial, temporal, and spatiotemporal streams.
Furthermore, (Siam et al., 2018) and (Wang et al., 2018) show promising results using CNNs for moving object detection. They use a two-stream convolutional network to jointly model motion and appearance cues in a single convolutional network. In (Wang et al., 2018), a new framework named the moving-object proposals generation and prediction framework (MPGP) is proposed to reduce the search space and generate accurate proposals at reduced computational cost. In addition, they explore the relation of moving regions in the feature maps of different layers. This method uses spatio-temporal information to strengthen detection scores and further adjust the locations of the bounding boxes.
3 OUR APPROACH
We propose a novel moving object detection and segmentation method using XY-shift and improved three-frame differencing. Furthermore, we extend our method by feeding its outputs to a three-stream encoder-decoder network. In this section, we discuss the details of our proposed technique.
3.1 The Proposed Framework
Fig. 2 shows our proposed framework, which consists of two major components: frame differencing and a three-stream encoder-decoder network. The frame differencing component consists of two frame differencing methods. The first is a novel XY-shift frame differencing technique and the second is an improved three-frame differencing technique. XY-shift frame differencing subtracts a frame from its shifted replica. The effect of this operation is equivalent to that of a high-pass filter, but at a significantly lower computational cost. We mainly use this method to reduce the visibility of irrelevant background objects and expose foreground objects even when they are in a temporarily static position.
The mathematical formalization of XY-shift frame differencing is given in equation 1. Let a be the first channel of image A with shape (h, w, 3). Then, the XY-shift frame differencing of a is calculated as:

$$g(a) = \begin{cases} a(x_i, y_j) - a(x_{i+f}, y_{j+f}), & \text{if } i \le h - f \text{ and } j \le w - f \\ a(x_i, y_j) - a(x_{i-f}, y_{j-f}), & \text{otherwise} \end{cases} \tag{1}$$

where h and w stand for the height and width of the channel and f is a shift-factor.
Figure 1: Tabular representation of XY-shift operands.
Figure 1 depicts a notational representation of XY-shift frame differencing. Assuming a 6×6 pixel raw image as the minuend, the subtrahend of the XY-shift frame differencing technique with a shift-factor of 1 is constructed starting from the second row and column of the minuend.
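The operation reduces to a pair of shifted array subtractions. The following is a minimal NumPy sketch of one plausible reading of equation 1; the border handling (falling back to the backward shift for the last f rows and columns) and the final absolute value are our assumptions, not the authors' reference implementation:

```python
import numpy as np

def xy_shift_diff(channel: np.ndarray, f: int = 1) -> np.ndarray:
    """XY-shift frame differencing of a single channel (a sketch of equation 1)."""
    h, w = channel.shape
    c = channel.astype(np.int16)                   # signed type: differences can be negative
    fwd = np.roll(c, shift=(-f, -f), axis=(0, 1))  # neighbor at (i+f, j+f)
    bwd = np.roll(c, shift=(f, f), axis=(0, 1))    # neighbor at (i-f, j-f)
    out = c - fwd                                  # interior pixels: forward shift
    ii, jj = np.indices((h, w))
    border = (ii >= h - f) | (jj >= w - f)         # last f rows/columns wrap under roll,
    out[border] = (c - bwd)[border]                # so difference with the backward shift
    return np.abs(out).astype(np.uint8)            # magnitude of the high-pass response
```

Applied independently to each of the three channels of a frame, this costs only a few array subtractions per frame, which is where the claimed advantage over an explicit high-pass filter kernel comes from.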
What follows the XY-shift frame differencing is an improved three-frame differencing technique, which uses the output of the XY-shift differencing. It takes three consecutive frames, computes the difference between the current frame and the previous frame and between the current frame and the next frame separately, and extracts the pixel-wise maximum of these two resulting frames. This technique is adapted and enhanced to improve the extraction of temporal features from datasets in the spatio-temporal domain. The improved three-frame differencing method is formalized in equation 2. Let A, B, and C be the first channels of three consecutive XY-shift frame differenced frames with shape (h, w), where B is the first channel of the current frame. Then the improved three-frame differencing, f(A, B, C), is calculated as:
$$f(A, B, C)_{i,j} = \max\left(\,|B_{i,j} - A_{i,j}|,\; |B_{i,j} - C_{i,j}|\,\right) \tag{2}$$

for 0 ≤ i ≤ h and 0 ≤ j ≤ w.
Furthermore, the pixel-wise maximum of two images is computed as shown in equation 3. Let Q1 be the absolute difference of the current frame B and its predecessor frame A, and let Q2 be the absolute difference of B and its successor frame C, both of size (h, w). Then, the pixel-wise maximum, P_max, of these two frames is calculated as:
$$\max(Q1, Q2)_{i,j} = \begin{cases} Q1_{i,j}, & \text{if } Q1_{i,j} > Q2_{i,j} \\ Q2_{i,j}, & \text{otherwise} \end{cases} \tag{3}$$

for 0 ≤ i ≤ h and 0 ≤ j ≤ w.
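Equations 2 and 3 translate directly into a few vectorized operations. Below is a short NumPy sketch under the assumption that the three inputs are single-channel uint8 frames already processed by XY-shift differencing; np.maximum performs exactly the element-wise selection of equation 3:

```python
import numpy as np

def improved_three_frame_diff(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Improved three-frame differencing (equations 2 and 3).

    a, b, c are consecutive XY-shift differenced frames; b is the current one.
    Signed arithmetic avoids uint8 underflow before taking absolute values.
    """
    q1 = np.abs(b.astype(np.int16) - a.astype(np.int16))  # |B - A|
    q2 = np.abs(b.astype(np.int16) - c.astype(np.int16))  # |B - C|
    return np.maximum(q1, q2).astype(np.uint8)            # pixel-wise max (equation 3)
```

Taking the maximum rather than the logical AND of the two differences is what distinguishes this variant from traditional three-frame differencing: motion evidence from either neighboring frame is preserved instead of being suppressed.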
The second component of our model comprises a VGG-16 based encoder and a decoder network. The top five convolutional layers of VGG-16, initialized with ImageNet weights (Simonyan and Zisserman, 2014), are used as the encoder, with the intention of avoiding excessive sparsity of hidden units.
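As a hedged illustration, such an encoder could be carved out of a pretrained Keras VGG-16 as below; the cut point block3_conv1 (the fifth convolutional layer) and the input size are our reading of the description, not the authors' published configuration:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

def build_encoder(input_shape=(240, 320, 3)) -> Model:
    """VGG-16 truncated after its fifth conv layer (an assumed cut point)."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    cut = base.get_layer("block3_conv1").output  # conv layers 1-5 kept
    # weights start from ImageNet and are fine-tuned end-to-end (Section 3.3)
    return Model(base.input, cut, name="vgg16_encoder")
```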
The implementation detail is as follows. Three consecutive raw image frames are passed to the frame differencing component. The XY-shift frame differencing component takes each frame separately and performs XY-shift frame differencing; its result on random images is presented in Fig. 3. Concurrently, the improved three-frame differencing takes all XY-shift frame differenced frames and performs three-frame differencing on each consecutive triple of frames using the pixel-wise maximum function shown in equation 3.

The purpose of the XY-shift frame differencing, as depicted in column two of Fig. 3, is to clear irrelevant background objects. This contributes substantially to reducing textures that negatively affect three-frame differencing. The third column of Fig. 3 clearly shows the contribution of XY-shift frame differencing in enhancing the three-frame differencing technique.

Furthermore, the results of the improved three-frame differencing, the XY-shift frame differencing, and the raw RGB image frames are fused into one feature space using a residual layer and fed to the encoder-decoder network. The VGG-16 based encoder extracts further features and the decoder produces the desired segmentation map, as shown in Fig. 2.
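To make the data flow concrete, the following sketch wires the three streams together in Keras. The per-stream stem widths, the additive residual fusion, the 1×1 projection back to three channels, and the two-step transposed-convolution decoder are all assumptions for illustration; the paper specifies the three streams, a residual fusion, the VGG-16 encoder, and a deconvolutional decoder, but not these exact hyper-parameters:

```python
from tensorflow.keras import Input, Model, layers

def build_three_stream_model(h=240, w=320) -> Model:
    rgb = Input((h, w, 3), name="raw_frame")         # appearance stream
    xy = Input((h, w, 3), name="xy_shift_diff")      # XY-shift stream
    tfd = Input((h, w, 1), name="three_frame_diff")  # temporal stream

    def stem(x, name):  # one conv per stream before fusion (assumed width)
        return layers.Conv2D(32, 3, padding="same", activation="relu", name=name)(x)

    fused = layers.Add(name="residual_fusion")(
        [stem(rgb, "rgb_stem"), stem(xy, "xy_stem"), stem(tfd, "tfd_stem")]
    )
    fused = layers.Conv2D(3, 1, padding="same", name="project")(fused)

    feat = build_encoder((h, w, 3))(fused)  # VGG-16 encoder from the sketch above

    # decoder: deconvolve back to a full-resolution, one-channel mask
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(feat)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    mask = layers.Conv2D(1, 1, activation="sigmoid", name="segmentation_map")(x)
    return Model([rgb, xy, tfd], mask)
```

The two stride-2 transposed convolutions undo the two pooling stages kept from VGG-16 in the encoder sketch, so the output mask matches the input resolution.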
3.2 Loss Function
Moving object detection and segmentation is a binary pixel-wise classification task. Consequently, binary cross-entropy is chosen as the loss function to evaluate the model's performance during training.
The binary cross-entropy loss function calculates the loss as shown in equation 4:

$$L_{BCE} = -\frac{1}{S}\sum_{i=1}^{S}\left[\, y_i \cdot \log \hat{y}_i + (1 - y_i) \cdot \log(1 - \hat{y}_i) \,\right] \tag{4}$$

where ŷ_i is the i-th scalar value in the model output, y_i is the corresponding target value, and S, the output size, is the number of scalar values in the model output.
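For reference, equation 4 transcribes directly into a few lines of NumPy (the clipping epsilon is a standard numerical safeguard we add, not part of the equation):

```python
import numpy as np

def bce_loss(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over all S output pixels (equation 4)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))
```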
3.3 Training Protocol
We use the (Tezcan et al., 2021)’s 4-fold cross vali-
dation strategy to split between training and test data
ICAART 2023 - 15th International Conference on Agents and Artificial Intelligence
312
Figure 2: Moving object detection and segmentation architecture.
of CDNet-2014. All CDNet-2014 sequences are di-
vided equally into four disjoint splits. The model is
trained on three of the splits. The remaining split is
used to evaluate the model’s performance. We used a
built-in python randomize function to select sequence
of frames in each iteration from a varity of videos. A
randomizer function is set to pick n number of frames
in each iteration where n is the batch size. In this
way we enhanced the representatives and complexity
of the data; overcoming the possibility of overfitting
at the same time. The designated training data is fur-
ther divided into a training split (90 %) and valida-
tion split (10 %). Moreover, early stopping is used to
prevent overfitting: if the validation loss does not im-
prove in two consecutive epochs, the training process
is terminated.
We used Adaptive Moment Estimation (Adam)
(Kingma and Ba, 2014) to optimize the network dur-
ing the training process. During training, the learning
rate was set to 0.0001 and was decreased by a fac-
tor of 10 every 2 epochs. The network was trained
for 10 epochs. The whole model is trained in an end-
to-end manner. The entire training procedure takes
about 8 hours using a single NVIDIA Quadro RTX
3000 Max-Q GPU.
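One way to express this protocol in Keras is sketched below; the callback and schedule mirror the stated settings (initial learning rate 1e-4 divided by 10 every 2 epochs, early stopping with patience 2, 10 epochs), while the dataset objects train_ds and val_ds are hypothetical placeholders for the 90%/10% split described above:

```python
import tensorflow as tf

def lr_schedule(epoch: int, lr: float) -> float:
    # start at 1e-4 and divide by 10 every 2 epochs
    return 1e-4 * (0.1 ** (epoch // 2))

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(lr_schedule),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
]

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.BinaryCrossentropy(),
)
model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=callbacks)
```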
4 EXPERIMENTS
4.1 Datasets
We use the relabeled version of the CDNet-2014 dataset (Wang et al., 2014) for training and evaluation. CDNet-2014 includes over 160,000 pixel-wise annotated frames in 53 video sequences, subdivided into 11 categories and two spectra: VIS and thermal IR. The 53 sequences contain a large variety of scenes with varying image quality and resolutions ranging from 320 × 240 to 720 × 480 pixels. Most scenes show an urban environment with persons or cars. The dataset contains both indoor and outdoor scenes and covers many real-world challenges such as shadows, dynamic backgrounds, and camera motion. The ground truth for each image is a gray-scale image that describes four motion classes: static, hard shadow, unknown motion, and motion.
Figure 3: A random presentation of XY-shift frame differenced and three-frame differenced frames (frames not thresholded).
An additional class is used to mark areas that are out-
side the region of interest (non-ROI). Pixels annotated
as non-ROI are discarded during evaluation.
4.2 Evaluation Metrics
We evaluate our model on the testing sets of CDNet-2014, a total of 11 video sequences with nearly 39,820 frames. We emphasize Recall (Re), Precision (Pr), and F1-score from the standard evaluation measures (Wang et al., 2014), mainly because these metrics are sufficient for the problem at hand.
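These pixel-level scores follow directly from the confusion counts. A small sketch, assuming binary prediction and ground-truth masks plus the dataset's ROI mask (the epsilon guards against empty frames):

```python
import numpy as np

def precision_recall_f1(pred: np.ndarray, gt: np.ndarray, roi: np.ndarray):
    """Pixel-level Pr/Re/F1 over ROI pixels, in the spirit of (Wang et al., 2014)."""
    p, g = pred[roi].astype(bool), gt[roi].astype(bool)  # non-ROI pixels discarded
    tp = np.sum(p & g)
    fp = np.sum(p & ~g)
    fn = np.sum(~p & g)
    pr = tp / (tp + fp + 1e-9)
    re = tp / (tp + fn + 1e-9)
    f1 = 2 * pr * re / (pr + re + 1e-9)
    return pr, re, f1
```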
4.3 Frame Differencing Experiments
We undertook an ablation analysis of the different components of our architecture. The significance of the XY-shift frame differencing and the improved three-frame differencing is thoroughly analysed by eliminating each of them and substituting inter-frame differencing (Liang et al., 2010; Nakashima and Yabuta, 2018) and traditional three-frame differencing (Yin et al., 2016; Sengar and Mukhopadhyay, 2016) techniques. Substituting either of our frame differencing methods reduced model performance. In particular, eliminating the XY-shift frame differencing technique and using inter-frame differencing as the input source for our model caused a major performance degradation.
Table 1: Quantitative comparison of results.
Approach Precision Recall F1-Score
(Bosch, 2021) 0.626 0.673 0.553
(Xiao et al., 2010) 0.462 0.513 0.42
STBGS (Bouwmans, 2012) 0.406 0.549 0.401
(Sengar and Mukhopadhyay, 2016) 0.375 0.58 0.389
(Ellenfeld et al., 2021) 0.774 0.751 0.745
Ours 0.801 0.795 0.772
The elimination of our frame differencing techniques significantly reduced our model's performance on most of its salient features, such as robustness against false motion, detection of temporarily at-rest objects, and detection of tiny moving objects.
Using our XY-shift frame differencing technique together with the improved three-frame differencing technique exhibits outstanding performance, mainly owing to the power of XY-shift frame differencing in eliminating dynamic and noisy backgrounds. Moreover, using XY-shift differenced frames as input to the three-frame differencing component further improved the performance of our model, especially on dynamic background scenes. From the qualitative point of view, the absence of XY-shift frame differencing hurt our model's performance on tiny moving objects and temporarily at-rest foreground objects.
4.4 Optical Flow Experiments
Our second phase of ablation analysis concerns the popular optical flow technique. We converted our model into a two-stream architecture and assessed the impact of optical flow, using the optical flow output and the raw image as the input sources of a two-stream encoder-decoder architecture. Our model's handling of false negatives was slightly compromised in this setup. We then extended the model back to a three-stream network by replacing the improved three-frame differencing stream with the optical flow output. This setting slightly improved performance but remained far from a competitive result.
4.5 Comparisons with the
State-of-the-Art
Comparing one's model with the state-of-the-art requires ready-made repositories of the research works that constitute the state-of-the-art code base. It has been difficult to find code bases of moving object detection and segmentation research works. However, we were able to compare our proposed model with five other state-of-the-art methods for motion detection and segmentation, namely (Bosch, 2021), (Xiao et al., 2010), STBGS (Bouwmans, 2012), (Sengar and Mukhopadhyay, 2016), and (Ellenfeld et al., 2021). Table 1 shows the quantitative results in terms of Precision, Recall, and F1-score. Our approach outperforms most of these methods by a large margin and scores a competitive result against the recent work that combines three-frame differencing with a DCNN (Ellenfeld et al., 2021).
The qualitative results are visualized in Fig. 4. The first row depicts the ground truth provided by the CDNet-2014 dataset, and the second row shows the raw appearance images for the corresponding ground truth. The columns showcase the performance of the state-of-the-art methods and our model on these images. Images are selected from different scene videos to show the models' capacity to generalize across environments. The qualitative analysis shows the poor detection and segmentation performance of background subtraction and frame differencing based algorithms in complex scenes, such as images with water in the background and dynamic scenes. As can be seen in Fig. 4, these algorithms are highly prone to false positives and false negatives. Compared to these algorithms, our model detects both moving objects and temporarily at-rest objects effectively and efficiently.
The most robust model that we analysed in this section is (Ellenfeld et al., 2021). As discussed in the previous sections, this model combines an improved three-frame differencing technique with the appearance frame in a two-stream encoder-decoder architecture. Compared to other deep learning based algorithms, this model shows relatively better moving object detection and segmentation performance. Finally, our qualitative and quantitative results show the efficiency and effectiveness of our model, which scores a competitive result compared to the state-of-the-art methods. However, given the error gap shown in Table 1, there is still a need to further enhance these methods. Exploiting robust digital image processing and deep neural network technologies will be the focus of our next phase of research.
Figure 4: Qualitative evaluation of different models against our model: The green, blue, and red shades indicate the correct,
false negative, and false positive classifications.
5 CONCLUSION
We proposed a novel moving object detection and segmentation technique. Our contribution in terms of architectural components is twofold: we propose a novel frame differencing technique, XY-shift frame differencing, and we enhance the traditional three-frame differencing technique. XY-shift frame differencing is mainly used to sharpen the image, transforming it into a high-pass filtered frame at a relatively small computational cost. Images passed through this component are cleared of most types of noise and irrelevant background objects. Feeding the output of XY-shift frame differencing to the traditional three-frame differencing technique overcomes the most common defects of three-frame differencing. The performance of the three-frame differencing technique is improved by feeding it XY-shift differenced frames instead of raw images and computing the pixel-wise maximum of the differences.
We have also introduced an efficient way of fusing raw appearance images with frame differenced images, resulting in a significant improvement for moving object detection and segmentation models. We used a three-stream encoder-decoder deep neural network architecture: the raw appearance image, the XY-shift frame differenced image, and the three-frame differenced image are fed to their corresponding streams, and a feature space that deeply fuses the three streams is produced. This intermediate feature space is fed to the encoder, which consists of the top five layers of VGG-16, and the resulting features are deconvolved to obtain the desired segmentation map. The CDNet-2014 change detection dataset was used to build and analyse our model.
Our experimental analysis covered multiple perspectives of moving object detection and segmentation. We undertook an exhaustive ablation analysis by replacing our proposed frame differencing component with background subtraction, three-frame differencing, and optical flow based moving object detection and segmentation techniques. For the evaluation, we used the precision, recall, and F1-score metrics. Both qualitative and quantitative results show that the proposed approach outperforms the state-of-the-art moving object detection and segmentation methods.
ACKNOWLEDGEMENTS
This work would not have been possible without the financial support of the French Embassy in Ethiopia, the Brittany region administration, and the Ethiopian Ministry of Education (MoE). We are also indebted to the Brest National School of Engineering (ENIB) and specifically Lab-STICC for creating such a conducive research environment.
REFERENCES
Barnich, O. and Van Droogenbroeck, M. (2010). Vibe: A
universal background subtraction algorithm for video
sequences. IEEE Transactions on Image processing,
20(6):1709–1724.
Beauchemin, S. S. and Barron, J. L. (1995). The computa-
tion of optical flow. ACM computing surveys (CSUR),
27(3):433–466.
Bosch, M. (2021). Deep learning for robust motion seg-
mentation with non-static cameras. arXiv preprint
arXiv:2102.10929.
Bouguet, J.-Y. et al. (2001). Pyramidal implementation of
the affine lucas kanade feature tracker description of
the algorithm. Intel corporation, 5(1-10):4.
Bouwmans, T. (2012). Background subtraction for visual
surveillance: A fuzzy approach. Handbook on soft
computing for video surveillance, 5:103–138.
Cheng, Y. H. and Wang, J. (2014). A motion image de-
tection method based on the inter-frame difference
method. In Applied Mechanics and Materials, volume
490, pages 1283–1286. Trans Tech Publ.
Cucchiara, R., Grana, C., Piccardi, M., and Prati, A. (2003).
Detecting moving objects, ghosts, and shadows in
video streams. IEEE transactions on pattern analy-
sis and machine intelligence, 25(10):1337–1342.
Dou, J. and Li, J. (2014). Modeling the background and
detecting moving objects based on sift flow. Optik,
125(1):435–440.
Dou, J., Qin, Q., and Tu, Z. (2017). Background subtraction
based on circulant matrix. Signal, Image and Video
Processing, 11(3):407–414.
Ellenfeld, M., Moosbauer, S., Cardenes, R., Klauck, U., and
Teutsch, M. (2021). Deep fusion of appearance and
frame differencing for motion segmentation. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 4339–4349.
Fei, M., Li, J., and Liu, H. (2015). Visual tracking based on
improved foreground detection and perceptual hash-
ing. Neurocomputing, 152:413–428.
Fernández-Caballero, A., Castillo, J. C., Martínez-Cantos, J., and Martínez-Tomás, R. (2010). Optical flow or image subtraction in human detection from infrared camera on mobile robot. Robotics and Autonomous Systems, 58(12):1273–1281.
Foresti, G. L., Micheloni, C., and Piciarelli, C. (2005).
Detecting moving people in video streams. Pattern
Recognition Letters, 26(14):2232–2243.
García, J., Gardel, A., Bravo, I., Lázaro, J. L., Martínez, M., and Rodríguez, D. (2012). Directional people counter based on head tracking. IEEE Transactions on Industrial Electronics, 60(9):3991–4000.
Garcia-Garcia, B., Bouwmans, T., and Silva, A. J. R.
(2020). Background subtraction in real applica-
tions: Challenges, current models and future direc-
tions. Computer Science Review, 35:100204.
Halidou, A., You, X., Hamidine, M., Etoundi, R. A., Di-
akite, L. H., et al. (2014). Fast pedestrian detection
based on region of interest and multi-block local bi-
nary pattern descriptors. Computers & Electrical En-
gineering, 40(8):375–389.
Horn, B. K. and Schunck, B. G. (1981). Determining optical
flow. Artificial intelligence, 17(1-3):185–203.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Lee, S., Kim, N., Paek, I., Hayes, M. H., and Paik, J.
(2013). Moving object detection using unstable cam-
era for consumer surveillance systems. In 2013 IEEE
International Conference on Consumer Electronics
(ICCE), pages 145–146. IEEE.
Liang, R., Yan, L., Gao, P., Qian, X., Zhang, Z., and Sun,
H. (2010). Aviation video moving-target detection
with inter-frame difference. In 2010 3rd International
Congress on Image and Signal Processing, volume 3,
pages 1494–1497.
Mahapatra, D., Winkler, S., and Yen, S.-C. (2008). Mo-
tion saliency outweighs other low-level features while
watching videos. In Human Vision and Electronic
Imaging XIII, volume 6806, pages 246–255. SPIE.
Mohtavipour, S. M., Saeidi, M., and Arabsorkhi, A. (2022).
A multi-stream cnn for deep violence detection in
video sequences using handcrafted features. The Vi-
sual Computer, 38(6):2057–2072.
Motlagh, O., Nakhaeinia, D., Tang, S. H., Karasfi, B., and
Khaksar, W. (2014). Automatic navigation of mobile
robots in unknown environments. Neural Computing
and Applications, 24(7):1569–1581.
Nakashima, T. and Yabuta, Y. (2018). Object detection
by using interframe difference algorithm. In 2018
12th France-Japan and 10th Europe-Asia Congress
on Mechatronics, pages 98–102. IEEE.
Narayana, M., Hanson, A., and Learned-Miller, E. (2013).
Coherent motion segmentation in moving camera
videos using optical flow orientations. In Proceed-
ings of the IEEE International Conference on Com-
puter Vision, pages 1577–1584.
Ochs, P., Malik, J., and Brox, T. (2013). Segmentation of
moving objects by long term video analysis. IEEE
transactions on pattern analysis and machine intelli-
gence, 36(6):1187–1200.
Rozantsev, A., Lepetit, V., and Fua, P. (2016). Detecting
flying objects using a single moving camera. IEEE
transactions on pattern analysis and machine intelli-
gence, 39(5):879–892.
Sehairi, K., Chouireb, F., and Meunier, J. (2017). Com-
parative study of motion detection methods for video
surveillance systems. Journal of Electronic Imaging,
26(2):023025.
Sengar, S. S. and Mukhopadhyay, S. (2016). A novel
method for moving object detection based on block
based frame differencing. In 2016 3rd International
Conference on Recent Advances in Information Tech-
nology (RAIT), pages 467–472. IEEE.
Sengar, S. S. and Mukhopadhyay, S. (2017). Foreground
detection via background subtraction and improved
three-frame differencing. Arabian Journal for Science
and Engineering, 42(8):3621–3633.
Siam, M., Mahgoub, H., Zahran, M., Yogamani, S., Jager-
sand, M., and El-Sallab, A. (2018). Modnet: Motion
and appearance based moving object detection net-
work for autonomous driving. In 2018 21st Interna-
tional Conference on Intelligent Transportation Sys-
tems (ITSC), pages 2859–2864.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Stauffer, C. and Grimson, W. E. L. (1999). Adaptive back-
ground mixture models for real-time tracking. In
Proceedings. 1999 IEEE computer society conference
on computer vision and pattern recognition (Cat. No
PR00149), volume 2, pages 246–252. IEEE.
Tezcan, M. O., Ishwar, P., and Konrad, J. (2021). Bsuv-net
2.0: Spatio-temporal data augmentations for video-
agnostic supervised background subtraction. IEEE
Access, 9:53849–53860.
Tsai, C.-M. and Yeh, Z.-M. (2013). Intelligent mov-
ing objects detection via adaptive frame differencing
method. In Asian Conference on Intelligent Informa-
tion and Database Systems, pages 1–11. Springer.
Wang, H., Wang, P., and Qian, X. (2018). Mpnet: An
end-to-end deep neural network for object detection
in surveillance video. IEEE Access, 6:30296–30308.
Wang, Y., Jodoin, P.-M., Porikli, F., Konrad, J., Benezeth,
Y., and Ishwar, P. (2014). Cdnet 2014: An expanded
change detection benchmark dataset. In Proceedings
of the IEEE conference on computer vision and pat-
tern recognition workshops, pages 387–394.
Weng, M., Huang, G., and Da, X. (2010). A new interframe
difference algorithm for moving target detection. In
2010 3rd international congress on image and signal
processing, volume 1, pages 285–289. IEEE.
Williams, E. H., Cristino, F., and Cross, E. S. (2019). Hu-
man body motion captures visual attention and elicits
pupillary dilation. Cognition, 193:104029.
Xiao, J., Cheng, H., Sawhney, H., and Han, F. (2010). Vehi-
cle detection and tracking in wide field-of-view aerial
video. In 2010 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pages
679–684. IEEE.
Xu, Z., Zhang, D., and Du, L. (2017). Moving object de-
tection based on improved three frame difference and
background subtraction. In 2017 International Con-
ference on Industrial Informatics-Computing Technol-
ogy, Intelligent Technology, Industrial Information In-
tegration (ICIICII), pages 79–82. IEEE.
Yang, Y., Gong, H., Wang, X., and Sun, P. (2017). Aerial
target tracking algorithm based on faster r-cnn com-
bined with frame differencing. Aerospace, 4(2):32.
Yin, J., Liu, L., Li, H., and Liu, Q. (2016). The infrared
moving object detection and security detection related
algorithms based on w4 and frame difference. Infrared
Physics & Technology, 77:302–315.
Zhang, Y., Kiselewich, S. J., Bauson, W. A., and Ham-
moud, R. (2006). Robust moving object detection at
distance in the visible spectrum and beyond using a
moving camera. In 2006 Conference on Computer Vi-
sion and Pattern Recognition Workshop (CVPRW’06),
pages 131–131. IEEE.
Zhang, Y., Liu, X., Chang, M.-C., Ge, W., and Chen, T.
(2012). Spatio-temporal phrases for activity recogni-
tion. In European Conference on Computer Vision,
pages 707–721. Springer.
Zhao, N., Xia, Y., Xu, C., Shi, X., and Liu, Y. (2016).
Appos: An adaptive partial occlusion segmentation
method for multiple vehicles tracking. Journal of
Visual Communication and Image Representation,
37:25–31.