Time-unfolding Object Existence Detection

in Low-quality Underwater Videos using Convolutional Neural Networks

Helmut T

odtmann

1,4

, Matthias Vahl

, Uwe Freiherr von Lukas

1,2

and Torsten Ullrich

3,4

Fraunhofer Institute for Computer Graphics Research IGD, Rostock, Germany

University of Rostock, Institute for Computer Science, Rostock, Germany

Fraunhofer Austria Research GmbH, Visual Computing Graz, Austria

Institute of Computer Graphics and Knowledge Visualization, Graz University of Technology, Austria

Keywords:

Convolutional Neural Network, Deep Learning, Environmental Monitoring, Implicit Segmentation, Detection.

Abstract:

Monitoring the environment for early recognition of changes is necessary for assessing the success of renat-

uration measures on a facts basis. It is also used in ﬁsheries and livestock production for monitoring and for

quality assurance. The goal of the presented system is to count sea trouts annually over the course of several

months. Sea trouts are detected with underwater camera systems triggered by motion sensors. Such a scenario

generates many videos that have to be evaluated manually. This article describes the techniques used to auto-

mate the image evaluation process. An effective method has been developed to classify videos and determine

the times of occurrence of sea trouts, while signiﬁcantly reducing the annotation effort. A convolutional neural

network has been trained via supervised learning. The underlying images are frame compositions automat-

ically extracted from videos on which sea trouts are to be detected. The accuracy of the resulting detection

system reaches values of up to 97.7 %.

1 INTRODUCTION

It is of great interest to the world to keep na-

ture and environment in balance and to maintain a

healthy state. Based on the Red List of Endan-

gered Species (Winkler et al., 1991) the population

of sea trouts was seriously threatened. As a conse-

quence, in the time from from 1992 to 1999 renat-

uration programs attempted to repopulate these re-

gions as pilot project. The success of these programs

has been documented and guidelines were derived,

which serve as a basis for further population recovery

projects in other areas; the success of this project led

to similar projects in 30 further areas (Schwevers and

Adam, 2020). The measured results have been pub-

lished (Mannerla et al., 2011), but in 2015 difﬁculties

in correctly monitoring the populations have been dis-

covered (Pedersen et al., 2017), because the underly-

ing data was insufﬁcient, erroneous and did not allow

to distinguish between different populations. Conse-

quently, a new method of regular counting was de-

veloped: Underwater video cameras were installed in

selected reference areas in order to precisely count all

incoming and outgoing sea trouts as shown in Fig-

ure 1. Sea trouts are a very special kind of ﬁsh. Once

every year, sea trouts leave the salt water of the Baltic

Sea and battle their way upstream into freshwater,

where they form hollows in sandy riverbeds to lay

their eggs. But at numerous spots along the way, their

path is blocked by dams or weirs. In many rivers, the

sea trouts have already been driven to extinction. Cur-

rent measuring methods are counting the hollows or

ﬁshing a reference part of the water. All these meth-

ods are labor-intensive, locally limited and sometimes

dangerous to the stock. Environmental institutes in

Europe have been commissioned to monitor sea trout

numbers. Along the rivers, members of each institute

are constructing bottle necks that sea trout have to tra-

Figure 1: Camera setups in rivers are used to precisely count

sea trouts. The photo has been taken at a low water level.

370

Tödtmann, H., Vahl, M., von Lukas, U. and Ullrich, T.

Time-unfolding Object Existence Detection in Low-quality Underwater Videos using Convolutional Neural Networks.

DOI: 10.5220/0008962803700377

In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020) - Volume 5: VISAPP, pages

370-377

ISBN: 978-989-758-402-2; ISSN: 2184-4321

verse and are monitoring them with video cameras.

But evaluating the images is still a work-intensive

process: an employee needs 1416 hours only to watch

over 340 000 videos that have been captured at all bot-

tle necks over the course of ﬁve months. An example

of the video frames to evaluate is shown in Figure 2.

It illustrates that even humans need training to detect

Figure 2: Video frames have to be evaluated manually or

automatically. The top row shows no sea trout whereas the

bottom row shows a positive result (sea trout).

and classify sea trout in video images.

This is where this article comes in. The new auto-

matic system described here takes a common personal

computer only ﬁve days to do the same work using a

machine learning approach. Another advantage of de-

ploying artiﬁcial intelligence (AI) is that it makes gen-

uine wide-scale surveillance of sea trout stocks pos-

sible, whereas previously only ﬁve or six rivers could

be monitored. Furthermore this approach allows to re-

place the usual costly video annotation process with a

fast and easy alternative, while obtaining the informa-

tion of temporal occurrence of sea trouts in a video.

2 RELATED WORK

The presented system uses artiﬁcial intelligence tech-

niques and applies them to a non-computer science

domain. In order to address both areas, computer

science and the ﬁeld of application, the related work

from both areas will be examined.

2.1 Convolutional Neural Networks

An extensive survey on deep learning theory and ar-

chitectures is given by (Alom et al., 2019). The

authors motivate and explain the following concepts

which are crucial to understand the details of the re-

lated work and the approach: supervised learning,

convolutional neural network, pooling layer, activa-

tion function, optimization methods, transfer learn-

ing.

When choosing any deep learning approach for

the task of ﬁnding objects in images or videos, it is

important to distinguish between classiﬁcation, detec-

tion and segmentation. Assuming that a video is sim-

ply a sequence of images, classiﬁcation is the process

of deciding whether an image belongs to a particular

category among a set of given categories. Detection

attempts to ﬁnd objects of a particular category within

an image and usually returns a list of bounding boxes.

A segmentation task can be understood as a more so-

phisticated detection on a per-pixel level. It returns a

binary mask, thus providing more precise results.

Krizhevsky, Sutskever, and Hinton introduced the

ﬁrst convolutional neural network for classiﬁcation

that achieved state of the art performance at the Im-

ageNet Large Scale Visual Recognition Challenge in

contrast to former used algorithms (Krizhevsky et al.,

2012). They used ﬁve convolutional layers, of which

some were followed by pooling layers, and addition-

ally three fully connected layers, while making use of

the rectiﬁed linear unit (ReLU) activation function. In

“Going Deeper with Convolutions” (Szegedy et al.,

2014) improved this approach with “GoogLeNet”

(see Figure 3). The designed network consisting of

22 layers, that needed twelve times less learned pa-

rameters, while still achieving a far better result. Con-

cerning detection, (Redmon et al., 2016) presented a

milestone, outperforming the former state of the art

of detection in terms of accuracy and speed, while

only needing one forward pass. They used 24 convo-

lutional layers followed by two fully connected lay-

ers. Instead of inception modules from “GoogLeNet”

they make use of simple (1 × 1) reduction layers and

(3 × 3) convolutional layers. In the ﬁeld of segmen-

tation, (He et al., 2017) developed “Mask R-CNN”

which is based on the classiﬁcation neural network

called “ResNet” presented by (He et al., 2016). The

network uses 101 layers. It consists of three main

components, which compute the bounding boxes, the

class identity, and the segmentation mask.

2.2 Fishes in Computer Vision

Deformable template matching can be used to clas-

sify two species of ﬁsh (Rova et al., 2007). The

authors improved over previous Support Vector Ma-

chine (SVM)-based methods and reached an accuracy

of up to 90%. However, the approach required manu-

ally cropped underwater video images.

Spampinato et al. introduced a framework

for detecting, tracking, and counting ﬁshes in

Time-unfolding Object Existence Detection in Low-quality Underwater Videos using Convolutional Neural Networks

371

Figure 3: Structure of original “GoogLeNet” has been designed by (Szegedy et al., 2014).

videos (Spampinato et al., 2008). A classiﬁcation ac-

curacy of 93% is achieved while the overall count-

ing success rate is about 85%. Their approach only

worked on video data of controlled environments and

required extensive human annotation beforehand.

(Ravanbakhsh et al., 2015) detect ﬁshes in under-

water video with a shape-based approach. A Haar

classiﬁer is used for precise localisation of ﬁsh head

and snout, effectively leading to a sub pixel accuracy

for retaining the shape of the ﬁsh. For a set of 35 sam-

ples, they report an accuracy of up to 100%. However,

this approach seems not to work in uncontrolled envi-

ronments and requires images that contain completely

visible ﬁsh.

An approach using convolutional neural networks

has been presented by (French et al., 2015). They

make us of a segmentation approach based on the

“N

ﬁelds” algorithm and tested a variety of differ-

ent architectures of which the best delivered a count

accuracy of 92,87% (relative error of 7,13%). Unfor-

tunately, this approach requires segmentation annota-

tion for each video frame of the training set.

(Shafait et al., 2016) introduce image set classi-

ﬁcation with a one-nearest-neighbour classiﬁcation.

Given a set of already localised images of a ﬁsh with

unknown species type, it is matched with all exist-

ing training sets; for which multiple exist for each

class. Instead of calculating an average pairwise dis-

tance between all possible image combinations, they

construct synthetic images as linear combination from

previously calculated base images to be as similar as

possible. The label of the training image set with the

smallest distance is applied. They achieve an accu-

racy of 94,6%.

In “Comparison between Deep Learning and

HOG+SVM Methods” (Villon et al., 2016) compared

the combination of histogram of oriented gradients

(HOG) and support-vector machines (SVM) against

a customized GoogleNet architecture for detecting

coral reef ﬁshes. Training was done with labeled

cropped objects from video frames, while testing

was done with a sliding window approach on video

frames. It was found that the deep learning approach

performed better in general.

(Li et al., 2016) use a Fast RCNN approach on

images of ﬁsh to distinguish between 12 classes and

reach a mean average precision of 81.4%. How-

ever, the approach works only on images and requires

bounding box annotation for each video frame of the

training set.

An approach by (Sung et al., 2017) uses the “You

Only Look Once” (YOLO) algorithm (Redmon et al.,

2016) to detect ﬁshes on underwater images. A grid

search on the hyper parameters is performed to im-

prove the performance. Also, their data set is en-

hanced with lots of positive and negative samples to

create a more robust network and reach a classiﬁca-

tion accuracy of 93%.

In “Fish Recognition Using Convolutional Neu-

ral Network” the authors handcrafted several convolu-

tional neural network architectures to perform a clas-

siﬁcation task on four species of ﬁsh (Ding et al.,

2017). The best network achieved a classiﬁcation ac-

curacy of 96,5%.

Further improvements have been obtained

by (Rathi et al., 2018). They also develop a novel

convolutional neural network architecture that trains

on denoised still images from a custom ﬁsh data

set and includes a threshold-based segmentation

as additional fourth image layer. The result is an

classiﬁcation accuracy of 96,29%.

An approach working on an input data set with a

video quality comparable to the presented system has

been presented by (Shevchenko et al., 2018). They

apply three background subtraction methods on un-

derwater videos to detect ﬁshes. The best method

achieves 60% accuracy.

2.3 Summary

While the ﬁeld of deep learning for object recogni-

tion undergoes constant improvements, the domain of

dealing with ﬁshes in underwater images and videos

has not yet been making use of all the newly de-

veloped possibilities to the full extent. Contrary to

our needs, some of these approaches additionally deal

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

372

with tracking, shape retrieval and species classiﬁca-

tion, while most of them are restricted to images or

even selected parts of images. Our use case of binary

video classiﬁcation with obtaining the times of occur-

rence of sea trout on vast amount of video data is not

present in the available literature. To the best of the

authors’ knowledge the following approach has not

been proposed before.

3 APPROACH

The used camera system has been selected in the pre-

vious research project so that no further intervention

was possible. For the sake of completeness, the origi-

nal video material is described here.

3.1 Data Set Description

Figure 4: Selection of videos containing sea trout. The

frame containing the most ﬁsh like features was selected.

The installed camera systems recorded 341 238

videos in total. The videos show eight different rivers

over the course of ﬁve months. Each video has a

resolution of 704 × 576 pixels and a frame rate of

15 frames per second. Figure 4 and Figure 5 show-

case visual variance, low quality and separability dif-

ﬁculty. The ﬁrst 20 lines of each frame are reserved

for overlaid video information and thus removed in

the preparation process, leading to an effective reso-

lution of 704 × 556 pixels. The video length varies

between 5 and 300 seconds, the average duration of a

video is about 15 seconds. Videos are stored as a Ma-

troska Video (MKV) ﬁles, using the h264 codec. The

average size of a video ﬁle is approximately 550 kilo-

bytes. For generating labeled learning material 10 %

of these videos were selected at equal time intervals

and manually classiﬁed by the domain expert Uwe

Friedrich with the tools described in the Section “Im-

plementation”. As a consequence, 307 114 videos re-

mained unlabeled.

Figure 5: Selection of videos containing no sea trout. The

frame containing the most ﬁsh like features was selected.

3.2 Detection by Classiﬁcation

As mentioned in the Section “Related Work”, almost

all approaches of detecting objects in still images and

videos require labeled data, for example bounding

boxes or polygons marking these objects. For videos

this labeling process is very time consuming, since a

sufﬁcient number of frames have to be annotated for

each video. In our use case, any of these approaches

were not usable, since the amount of available video

data and the visual variance was too high to create

labeled data for supervised learning. Instead, we de-

veloped an approach that only requires the informa-

tion of whether a video contains sea trouts (see Fig-

ure 6). This approach can reliably detect the existence

Figure 6: The pipeline starts with a video and returns a bi-

nary classiﬁcation.

of sea trout in previously unseen videos and beyond

that, can deliver information on which frames of a

video contained sea trout, and potentially even where

in these frames the sea trout was located. For each

single video the workﬂow is as follows:

1. removing video header

2. time-unfolding convolution

3. implicit coarse segmentation

4. 2D classiﬁcation

Time-unfolding Object Existence Detection in Low-quality Underwater Videos using Convolutional Neural Networks

373

3.3 Time-unfolding Convolution

In order to enable classiﬁcation on videos, a sufﬁcient

number of individual images from a video sequence

are arranged next to each other in a grid layout. Fig-

ure 7 shows an example with 64 successive images of

a video.

Figure 7: The image sequence of a video is arranged in a

2D layout.

Convolutional neural networks are “a natural way

to embed translation invariance” (Deng et al., 2013),

i.e. for the recognition of learned features it is irrele-

vant where these are located in the image, as long as

they are present. That way, existing pre-trained neural

networks for 2D classiﬁcation can be used with little

modiﬁcations as opposed to developing and training

a 3D classiﬁcation network from scratch. We found

that resulting artiﬁcial edges between images did not

hurt the overall classiﬁcation performance, as these

occur both in image grids containing and not contain-

ing sea trout and thus did not represent an important

visual feature. On the one hand there must be enough

single images to reliably detect even a short appear-

ance of a sea trout in a video. On the other hand not

any arbitrarily high number of single images can be

used for calculation. In detail, small images would

negate possible information retrieval, but big image

grids would represent a limiting factor concerning the

graphics memory and the general performance of the

graphics card. An 8 × 8 grid has proven to be the per-

fect trade-off. In our tests a 4 × 4 grid often didn’t

capture short occurrences of ﬁsh, while a 16 ×16 grid

implicated images which have been too small for de-

tection. A video with width w, height h, and a scale

value s, leads to s

evenly distributed sample frames;

i.e. the three dimensional information is rearranged in

two dimensions as follows:

f (x, y,t) 7→



x + w · (t mod s)

y + h · bt/sc



with

t ∈ [0,s

− 1],x ∈ [0,w − 1], y ∈ [0,h − 1]

Using this scaled grid the resulting image has the

same dimensions as a single frame from the original

video (see Figure 7).

3.4 Implicit Coarse Segmentation

The used network structure is a modiﬁed and pre-

trained “GoogLeNet” (see Figure 3 and (Szegedy

et al., 2014)). We chose this architecture due to its

good performance, which has been conﬁrmed in our

tests, while being fast. The input layer has been

changed to match the dimension of the image grid.

Consequently, the sizes of all following layers are

increased, leading to a ﬁnal inception layer with di-

mension 22 × 17 and the corresponding ﬁnal pool-

ing layer of dimension 16 × 11. In that regard the

step of increasing the size of the input layer to match

the image size is technically equivalent to removing

pooling layers with or without their connected convo-

lutional layers deeper in the network, like shown in

(Zhou et al., 2016). This shifts the category of the

algorithm from simple classiﬁcation to implicit seg-

mentation with very low resolution, neither requiring

segmented training data, nor delivering segmented re-

sults. The output of the ﬁnal inception layer can be

interpreted as set of activation maps, indicating which

parts of the grid image are responsible for the classi-

ﬁcation result. For an activation map A with width

w and height h, the time point t of the highest ac-

tivation can be found as follows: Within the range

R = [0,w − 1] × [0,h − 1] the maximum (x,y) meets

the equation

∃(x,y) ∈ R ∀(u, v) ∈ R : A(u, v) ≤ A(x,y).

Using the scaling factor s mentioned above the time t

can be determined via

t =

y · s

x · s

It is important to note that the highest activation of

the implicit coarse segmentation – the ﬁnal value of t

– is not directly used for the classiﬁcation. It is rather

an evaluation tool to ensure that the neural network

indeed learned a set of useful features to detect sea

trouts in the grid image. For example, Figure 8 is in-

dicating that the most important parts of the grid im-

age from Figure 7 are at t = 7 (ﬁrst row to the right)

and t = 8 (second row to the left), which clearly con-

tain sea trout. The whole second row of the grid im-

age also contains sea trout, but is not highly visually

present in this activation map, which may indicate that

either these frames are not necessary for the network

to detect sea trout, or are present in another activation

map.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

374

Figure 8: This visualisation shows the activation map that

corresponds to Figure 7. The regions with a high activation

(red) contain sea trouts, which can be veriﬁed in the input

images shown in the grid layout.

4 IMPLEMENTATION

In order to realize the new approach several tools have

been developed. The software has been programmed

using the integrated development environment “Qt”

(www.qt.io) in the programming language C++. The

software relies on the NVIDIA fork of the framework

Caffe (Jia et al., 2014) using GPU acceleration. As a

consequence, at least a CUDA 8.0 compatible graph-

ics card has to be present, including the appropriate

drivers and libraries.

4.1 Annotation Tool

The annotation tool loads a custom set of videos, dis-

plays them in a list, while the currently selected video

is played in a loop. By a single key stroke the anno-

tator can decide if that video contains sea trout or not,

after which the next video is selected. With this tool,

an experienced annotator needs just a few seconds to

label a video. Having completed the annotation of the

complete set of videos, they are copied and split to

separate folders according to their class.

4.2 Preparation Tool

From a set of videos, in each video the header is re-

moved to prevent the learning of misleading informa-

tion or introducing biases. Then the time-unfolding

convolution is performed. The resulting grid images

are stored as lossless compressed Portable Network

Graphics (PNG) ﬁles.

4.3 Training the Network

During the training of the network the deep learn-

ing framework DIGITS (developer.nvidia.com/digits)

is used. First, a data set for classiﬁcation is cre-

ated out of the grid images from the preparation tool.

Half of the data is used for training, the other half

is reserved for validation and testing. According to

the number of different scenarios, class balancing

is performed to assure each scenario is represented

with about the same number of grid images. Like-

wise, class balancing is performed again with fo-

cus on the ratio of sea trout videos to non-sea trout

videos. Class balancing is done with enriching the

smaller class/scenario with duplicates of itself. The

Adam optimizer (Kingma and J., 2015) is used with

the learning rate lr = 0.0001 and the exponential de-

cay γ = 0.95. The training of the network took about

75 hours on four simultaneously used GPUs of type

GeForce GTX TITAN X with 12 GB of RAM each.

4.4 Analysis Tool

The trained network is loaded into the Caffe frame-

work for C++. Afterwards, a custom set of unlabeled

videos is loaded. The steps from the preparation tool

are performed. For each resulting grid image, infer-

ence is performed with the convolutional neural net-

work. The corresponding videos are copied and split

into separate folders according to the classiﬁcation re-

sult. Finally, a report is generated, listing all videos

with their assigned class and classiﬁcation conﬁdence

in the form of the softmax function.

5 EVALUATION

The trained network has been evaluated with unla-

beled videos. The data set (see Section “Data Set De-

scription”) comprehends 307 114 unlabeled videos.

The analysis tool has classiﬁed 5 235 of 307 114

videos as “sea trout”. A manual control by the do-

main expert Uwe Friedrich conﬁrmed that 5 098 of

5 235 videos were containing sea trout, resulting in

a precision of 97.38%. Likewise a showcase sam-

ple of 6 000 videos from the 307 114 videos has been

selected and then classiﬁed by both the expert and

the analysis tool. The expert classiﬁed 938 of these

videos as “sea trout”, while the analysis tool classi-

ﬁed 902 videos as “sea trout”, resulting in a precision

of 96.16 %. The results are listed in Table 1. For the

ﬁrst case, no accuracy can be reported due to the lack

of ground truth data, for the second case, an accuracy

Time-unfolding Object Existence Detection in Low-quality Underwater Videos using Convolutional Neural Networks

375

Table 1: Evaluation of the new system using “previously

unseen” data.

Test Positives Sea trout

Size by System by Expert Precision

307 114 5 235 5 098 97.38%

6 000 938 902 96.16%

of 97.70 % is achieved:

accuracy =

(T P + T N)

(T P + FP + FN + T N)

= 97.70 %

The values of true/false positive/negative detections

are listed in Table 2.

Table 2: Confusion matrix for selected set. Abbreviations

are for true/false positive/negative.

Labeled Labeled

6 000 Samples positive negative

Detected true T P = 902 FP = 102

Detected f alse FN = 36 T N = 4 960

6 CONCLUSION

Demonstrated at sea trouts examples, a convolutional

neural network has been trained via supervised learn-

ing. The underlying images are frame compositions

automatically extracted from videos on which sea

trouts are to be detected. The approach detects ob-

jects in underwater videos even in use cases with re-

duced quality: low resolution, minimal contrast, poor

visibility.

6.1 Contribution

It was shown that time-unfolding convolution enables

the use of still image classiﬁcation rather than the

usual detection or segmentation approaches, which in

turn allows much faster annotation of ground truth

data while also achieving very high accuracy. With

the use of implicit segmentation, further statements

can be made about the times of the occurrence of sea

trout in the video. The accuracy of the resulting de-

tection system reaches values of up to 97.7 %.

6.2 Beneﬁt

The manual evaluation of videos does not scale.

Within this project an employee would need 1416

hours only to watch all videos that have been cap-

tured. The automatic solution is not only much faster,

since no interaction is necessary, the time required

even becomes irrelevant. Now, the monitoring pro-

cess scales and is not the limiting factor any more.

6.3 Outlook

In the future, network structure modiﬁcations to in-

crease the implicit segmentation resolution will be ex-

plored, which could allow implicit detection of loca-

tions of sea trout features in a video frame. Likewise,

we will evaluate how low the quality of the videos can

be and what the optimal parameters for the time un-

folding convolution are. Furthermore it is imaginable

to improve on the process of class activation mapping

and develop general purpose approaches for precisely

detecting and segmenting arbitrary objects in videos

with little to no information about position and times

of occurrence.

ACKNOWLEDGEMENTS

The authors would like to thank Uwe Friedrich from

the Institut f

ur Fisch und Umwelt. All the data was

generated and labeled in the project “Saisonale Er-

mittlung des Meerforellenbestandes in den Einzugs-

gebieten der Vorpommerschen Boddengewsser und

der Mecklenburger Bucht mittels videooptischer Er-

fassung aufsteigender Individuen in Verbindung mit

Kartierungsarbeiten und ﬁschereibiologischen Unter-

suchungen”, project number DRM-149, “Landes-

forschungsanstalt fr Landwirtschaft und Fischerei ”.

The authors also thank Tim Dolereit, Tom Krause

and Mohamad Albadawi from MAG of Fraunhofer

Institute for Computer Graphics Research in Rostock

for their valueable input.

Furthermore, the authors acknowledge the gen-

erous support of the Carinthian Government and

the City of Klagenfurt within the innovation center

KI4Life.

REFERENCES

Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S.,

Sidike, P., Nasrin, M. S., Hasan, M., Van Essen, B. C.,

Awwal, A. A. S., and Asari, V. K. (2019). A State-of-

the-Art Survey on Deep Learning Theory and Archi-

tectures. Electronics, 8:292ff.

Deng, L., Abdel-Hamid, O., and Yu, D. (2013). A Deep

Convolutional Neural Network using Heterogeneous

Pooling for Trading Acoustic Invariance with Pho-

netic Confusion. IEEE International Conference on

Acoustics, Speech, and Signal Processing (ICASSP),

38:6669–6673.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

376

Ding, G., Song, Y., Guo, J., Feng, C., Li, G., and He, B.

(2017). Fish Recognition Using Convolutional Neural

Network. Mts, pages 0–3.

French, G., Fisher, M., Mackiewicz, M., and Needle, C.

(2015). Convolutional Neural Networks for Counting

Fish in Fisheries Surveillance Video. Machine Vision

of Animals and their Behaviour Workshop, 7:1–10.

He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2017).

Mask R-CNN. IEEE International Conference on

Computer Vision (ICCV), 15:2980–2988.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep

Residual Learning for Image Recognition. IEEE Con-

ference on Computer Vision and Pattern Recognition

(CVPR), 19:770–778.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,

Girshick, R., Guadarrama, S., and Darrell, T. (2014).

Caffe: Convolutional Architecture for Fast Feature

Embedding. International Conference on Multimedia,

22:675–678.

Kingma, D. P. and J., B. L. (2015). Adam: A Method for

Stochastic Optimization. International Conference on

Learning Representations (ICLR), 3:19ff.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. International Conference on Neural Infor-

mation Processing Systems, 25:1097–1105.

Li, X., Shang, M., Qin, H., and Chen, L. (2016). Fast accu-

rate ﬁsh detection and recognition of underwater im-

ages with Fast R-CNN. OCEANS 2015 - MTS/IEEE

Washington, pages 1–5.

Mannerla, M., Andersson, M., Birzaks, J., Debowski, P.,

Degerman, E., Huhmarniemi, A., H

aaggstr

om, H.,

Ikonen, E., Jokikokko, E., Jutila, E., Kesler, M.,

Kesminas, V., Kontautas, A., Pedersen, S., Persson, J.,

Romakkaniemi, A., Saura, A., Shibaev, S., Titov, S.,

Tuus, H., Tylik, K., and Yrj

a, T. (2011). Salmon and

Sea Trout Populations and Rivers in the Baltic Sea.

Helsinki Commission, 1 edition.

Pedersen, S., Degerman, E., Debowski, P., and Petereit, C.

(2017). Assessment and recruitment status of Baltic

Sea trout populations. Sea Trout: Science & Man-

agement: Proceedings of the International Sea Trout

Symposium, 2:423–441.

Rathi, D., Jain, S., and Indu, S. (2018). Underwater

Fish Species Classiﬁcation using Convolutional Neu-

ral Network and Deep Learning. International Con-

ference on Advances in Pattern Recognition (ICAPR),

9:344–349.

Ravanbakhsh, M., Shortis, M. R., Shafait, F., Mian, A., Har-

vey, E. S., and Seager, J. W. (2015). Automated Fish

Detection in Underwater Images using Shape-based

Level Sets. Photogrammetric Record, 30:46–62.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.

(2016). You Only Look Once: Uniﬁed, Real-Time

Object Detection. IEEE Conference on Computer Vi-

sion and Pattern Recognition (CVPR), 19:779–788.

Rova, A., Mori, G., and Dill, L. M. (2007). One ﬁsh, two

ﬁsh, butterﬁsh, trumpeter: Recognizing ﬁsh in under-

water video. IAPR Conference on Machine Vision Ap-

plications (MVA), 7:404–407.

Schwevers, U. and Adam, B. (2020). Fish Protection Tech-

nologies and Fish Ways for Downstream Migration.

Springer International Publishing, 1 edition.

Shafait, F., Mian, A., Shortis, M., Ghanem, B., Culver-

house, P. F., Edgington, D., Cline, D., Ravanbakhsh,

M., Seager, J., and Harvey, E. S. (2016). Fish Iden-

tiﬁcation from Videos Captured in Uncontrolled Un-

derwater Environments. ICES Journal of Marine Sci-

ence: Journal du Conseil, 73:2737–2746.

Shevchenko, V., Eerola, T., and Kaarna, A. (2018). Fish

Detection from Low Visibility Underwater Videos.

International Conference on Pattern Recognition,

24:1971–1976.

Spampinato, C., Chen-Burger, Y. H., Nadarajan, G., and

Fisher, R. B. (2008). Detecting, Tracking and Count-

ing Fish in Low Quality Unconstrained Underwater

Videos. International Conference on Computer Vision

Theory and Applications (VISAPP), 3:514–519.

Sung, M., Yu, S.-C., and Girdhar, Y. (2017). Vision based

Real-time Fish Detection Using Convolutional Neural

Network. OCEANS, 7:1–6.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,

Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-

novich, A. (2014). Going Deeper with Convolutions.

IEEE Conference on Computer Vision and Pattern

Recognition (CVPR), 17:1–9.

Villon, S., Chaumont, M., Subsol, G., Vill

eger, S., Claverie,

T., and Mouillot, D. (2016). Coral Reef Fish Detec-

tion and Recognition in Underwater Videos by Super-

vised Machine Learning: Comparison between Deep

Learning and HOG+SVM Methods. Lecture Notes in

Computer Science, Artiﬁcial Intelligence, and Bioin-

formatics, 10016:160–171.

Winkler, H. M., Hamann, N., and Waterstraat, A. (1991).

Rote Liste der gef

ahrdeten Rundm

auler, S

ußwasser-

und Wanderﬁscharten Mecklenburg-Vorpommerns.

Umweltministerium Mecklenburg-Vorpommern, 1

edition.

Zhou, B., Khosla, A., A., L., Oliva, A., and Torralba, A.

(2016). Learning Deep Features for Discriminative

Localization. CVPR.

Time-unfolding Object Existence Detection in Low-quality Underwater Videos using Convolutional Neural Networks

377