Ship Detection in Harbour Surveillance
based on Large-Scale Data and CNNs
Matthijs H. Zwemer (1,2), Rob G. J. Wijnhoven (2) and Peter H. N. de With (1)
(1) Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
(2) ViNotion B.V., Eindhoven, The Netherlands

Keywords: Object Detection, Harbour Surveillance, CNN, SSD Detector, Ships, Vessel Tracking System.
Abstract: This paper aims at developing a real-time vessel detection and tracking system using surveillance cameras in harbours, with the purpose of improving the performance of current Vessel Tracking Systems (VTS). To this end, we introduce a novel maritime dataset, containing 70,513 ships in 48,966 images, covering 10 camera viewpoints of real-life ship traffic situations. For detection, a Convolutional Neural Network (CNN) detector is trained, based on the Single Shot Detector (SSD) from literature. This detector is modified and enhanced to support the detection of extreme variations of ship sizes and aspect ratios. The modified SSD detector offers a high detection performance, which is based on explicitly exploiting the aspect-ratio characteristics of the dataset. The performance of the original SSD detector trained on generic object detection datasets (including ships) is significantly lower, showing the added value of a novel surveillance dataset for ships. Due to the robust detection performance of over 90%, the system is able to accurately detect all types of vessels. Hence, the system is considered a suitable complement to conventional radar detection, leading to a better operational picture for the harbour authorities.
1 INTRODUCTION
Maritime traffic management systems in harbours commonly use radar technology and Automatic Identification System (AIS) information to detect and follow moving vessels in a Vessel Tracking System (VTS). Video cameras are mainly used for visual verification by the traffic management operators and not for automatic detection and tracking of vessels.
Radar systems measure the reflections of actively transmitted radio waves while continuously scanning the area by rotating 360 degrees. Although radar technology provides accurate detection results, interference in the radar signal from clutter causes false detections, especially in harbours, where the environment contains other objects such as buildings, bridges and cranes. In these areas, the operators often use the visual information of cameras to support the monitoring process and to verify radar detections. Each (large) ship actively broadcasts its location and identity using Automatic Identification System (AIS) messages. A simple radio receiver suffices to observe all ships within a range of a few kilometers. However, data validity depends on the cooperation of the captain: non-transmitting ships remain unseen by the AIS system.
This work is part of the European Advancing Plug & Play Smart Surveillance (APPS) project¹, which aims to enable smart surveillance by multi-sensor systems (radar, visual, thermal, acoustic and physicochemical). The information from these sensor systems is fused to reduce or alleviate the shortcomings of each individual sensor. We propose to use surveillance cameras for visual localization of ships to address those shortcomings of the AIS and radar systems.
Ship detection using surveillance cameras is attractive due to its low cost and the ease of installation and maintenance. Cameras are nowadays abundantly available in harbour areas to visually support traffic management operators. Detection of vessels in video images is a challenging task due to the high intra-class variance. Examples of intra-class variance are the type of the ship (e.g. cruise ships, sailing ships, cargo ships and speedboats) and the different viewpoints. Additionally, the highly dynamic water regions prevent the use of conventional background subtraction techniques.
¹ APPS Project page: https://itea3.org/project/apps.html
To develop a reliable image-based ship detection and recognition system, a large and realistic database of ship images is required to support the learning of various ship types and to obtain the high robustness needed to complement the existing radar systems.
In this paper, we propose a visual detection algorithm to be used in a VTS. We specifically focus on two aspects: the video-based vessel detection algorithm using Convolutional Neural Networks (CNNs) and the creation of a large image dataset used to train this algorithm. We evaluate the performance of the detection system and provide insight into our dataset containing vessels in surveillance scenarios.
The remainder of this paper is organized as follows. Section 2 presents related work, Section 3 outlines the visual localization system with detection, tracking and global position conversion. Section 4 discusses the constitution of the dataset, while Section 5 presents the experiments. Sections 6 and 7 conclude the paper.
2 RELATED WORK
Vessel detection techniques can be categorized into background modelling and appearance-based methods. In background modelling, a background of the scene is learned over time and used to detect foreground objects (Arshad et al., 2010; Hu et al., 2011). Morphological operations on the foreground objects are applied to remove waves around the vessel detections. Background modelling techniques only work for fixed camera viewpoints or need time to reinitialize the background model. Related to background modelling, context modelling (Bao et al., 2013) can be used to obtain a segmentation map consisting of water, vegetation and 'other objects'. Ships are detected by motion analysis of the 'other objects' that lie in the water region. Static ships are not detected because of the lack of motion.
Appearance-based methods try to model the vessel's appearance. Wijnhoven et al. propose to use Histogram of Oriented Gradients (HOG) features to detect vessels with a sliding-window approach (Wijnhoven et al., 2010). Because the aspect ratio of the detector is fixed, they detect the cabin region of the ship and fail to detect the full extent of the ship.
In the generic field of visual object detection, state-of-the-art performance is obtained by Convolutional Neural Networks (CNNs). These networks originate from image classification and have recently been extended to the localization task. Initially, a separate region-proposal algorithm (Uijlings et al., 2013) finds regions of interest in the image and a subsequent CNN classifies these regions into object/background (Girshick et al., 2014). This has evolved into a single CNN integrating region proposals, classification and bounding-box regression (Girshick, 2015). Other object detection systems skip the region-proposal step altogether and estimate bounding boxes directly from the input image. YOLO (Redmon et al., 2015) uses the topmost feature map to predict bounding boxes directly for each cell in a fixed grid. The Single Shot MultiBox Detector (SSD) (Liu et al., 2016) extends this concept by using multiple default bounding boxes with various aspect ratios at differently scaled versions of the topmost feature map. This makes the SSD detector more robust to large variations in object size.
Although state-of-the-art object detection methods achieve high detection performance, a large and realistic training dataset is required to cover all ship variations and viewpoints. Various public datasets exist, such as PASCAL VOC (Everingham et al., 2007; Everingham et al., 2012), Microsoft COCO (Lin et al., 2014) and ImageNet (Russakovsky et al., 2015), but they are not suitable for our task because of their limited size and because they are unrealistic for surveillance.
We propose to use the state-of-the-art SSD network (Liu et al., 2016) because of its relatively low computational requirements and high accuracy. A suitable ship dataset for training is not available. Therefore, we introduce a novel dataset containing a broad variation of ship types, viewpoints and weather conditions, to support robust detection.
3 PROPOSED VISUAL
LOCALIZATION SYSTEM
We propose to improve the localization accuracy of a Vessel Tracking System (VTS) using a vessel detection and tracking system based on surveillance cameras. Our system uses camera images as input and provides GPS coordinates of vessels to the VTS. Localization is performed by subsequently performing object detection and tracking in image coordinates. The obtained vessel locations are then converted to GPS coordinates using camera calibration. The object detector and GPS conversion are discussed in more detail below.
3.1 Object Detection and Tracking
Vessel detection is carried out using the Single Shot MultiBox Detector (SSD) (Liu et al., 2016). This detector consists of a (pre-trained) CNN to extract image features, extended with convolution layers that estimate bounding-box locations. More specifically, the added CNN layers predict a confidence and bounding-box offsets to a set of default bounding boxes over various scales. These default boxes are called prior-boxes. We specifically adapt the configuration of the added layers to the application of vessel detection.

Figure 1: SSD model with additional feature layers at the end of the base network to predict the offsets and confidences. (Diagram: VGG-16 base through Conv5_3, Conv6/Conv7 (former FC layers) and extra feature layers Conv8_2 up to Conv12_2; 3x3 classifier convolutions predict (c_ship, c_x, c_y, w, h) per prior-box; non-maximum suppression merges the 20,362 prior-box detections.)
The input of the detector is the video stream provided by the surveillance camera. The original video stream has a resolution of 2048 x 1536 pixels and is downscaled to 512 x 512 pixels to match the detector input. Figure 1 shows the complete network. The first part of the network consists of the VGG16 (Simonyan and Zisserman, 2014) network as proposed by Liu et al. and computes the image features. Several CNN layers are attached to this base network, which estimate a confidence c_ship and location offsets (c_x, c_y, w, h). These offsets are measured with respect to the locations of the prior-boxes to form detections. Between each output layer, there are one or two convolutions that downscale the features, such that objects can be localized at several scales. In total, there are 8 layers on which detections are predicted. The detections are merged using non-maximum suppression to obtain the final detections.
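To make this merging step concrete, the sketch below gives a generic greedy non-maximum suppression routine in NumPy. It is an illustration under the common IoU-threshold formulation (the threshold value of 0.45 is an assumption), not the exact implementation used in the SSD framework.

```python
import numpy as np

def non_maximum_suppression(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop strongly overlapping ones.

    boxes  : (N, 4) array of [x1, y1, x2, y2]
    scores : (N,) confidence values (c_ship)
    Returns the indices of the boxes that are kept.
    """
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        # Keep only boxes whose overlap with box i is small enough
        order = order[1:][iou <= iou_threshold]
    return keep
```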
The output layers that estimate the confidences and location offsets use a set of default bounding boxes called prior-boxes. For each position in an output layer, multiple prior-boxes with different aspect ratios a_r = width/height are defined. We propose to use aspect ratios that best match the application of ship detection, and therefore modify the set of aspect ratios compared to the original implementation to a_r = {1, 2, 3, 5, 8}. Note that we do not use a_r < 1, as vessels are not likely to appear vertically in the camera image. Figure 2 shows the employed set of prior-boxes. In the network design (Figure 1), the prior-box aspect ratios for each output layer are given.
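To illustrate the modified prior-box configuration, the sketch below generates prior-box coordinates for a single output layer with the aspect-ratio set a_r = {1, 2, 3, 5, 8}. The per-layer scale and the anchoring of boxes to cell centres are simplified assumptions; the actual SSD implementation additionally varies the scale per output layer.

```python
import numpy as np

def prior_boxes(feature_map_size, image_size=512, scale=0.2,
                aspect_ratios=(1, 2, 3, 5, 8)):
    """Generate prior-boxes (cx, cy, w, h) in pixels for one output layer.

    Each cell of the feature map receives one prior-box per aspect ratio;
    a_r < 1 (tall boxes) is deliberately omitted for ship detection.
    """
    boxes = []
    step = image_size / feature_map_size            # pixel stride per cell
    base = scale * image_size                       # base prior-box size
    for row in range(feature_map_size):
        for col in range(feature_map_size):
            cx, cy = (col + 0.5) * step, (row + 0.5) * step
            for ar in aspect_ratios:
                w, h = base * np.sqrt(ar), base / np.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return np.asarray(boxes)

# Example: an 8x8 output layer yields 8 * 8 * 5 = 320 prior-boxes.
print(prior_boxes(feature_map_size=8).shape)        # (320, 4)
```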
Vessel detection by the SSD network is performed at 5 frames per second. To create trajectories over time, visual tracking is performed by feature-point (Shi and Tomasi, 1994) tracking with optical flow. The detections provided by the detector also invoke (re-)initialization of the tracking algorithm.
Figure 2: Overview of the prior-boxes; the red box shows the best overlap with the ground truth. Note that prior-boxes are located at fixed coordinates and do not necessarily align with the object.
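A minimal sketch of the tracking step described above, using OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade optical flow. The parameter values and the median-displacement update are assumptions for illustration; the paper does not specify the exact tracker configuration.

```python
import cv2
import numpy as np

def track_detection(prev_gray, next_gray, bbox):
    """Propagate a detection bounding box to the next frame.

    prev_gray, next_gray : consecutive grayscale frames
    bbox                 : (x, y, w, h) detection in the previous frame
    Returns the translated bounding box, or None if tracking failed.
    """
    x, y, w, h = bbox
    mask = np.zeros(prev_gray.shape, dtype=np.uint8)
    mask[y:y + h, x:x + w] = 255
    # Shi-Tomasi feature points inside the detection (Shi and Tomasi, 1994)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50,
                                  qualityLevel=0.01, minDistance=5, mask=mask)
    if pts is None:
        return None
    # Pyramidal Lucas-Kanade optical flow towards the next frame
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good_old = pts[status.flatten() == 1]
    good_new = new_pts[status.flatten() == 1]
    if len(good_new) == 0:
        return None
    # Median displacement moves the box; new detections re-initialize the tracker
    dx, dy = np.median(good_new - good_old, axis=0).ravel()
    return (int(x + dx), int(y + dy), w, h)
```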
3.2 GPS Location by Calibration
The trajectory information provided by object detection and tracking is given in pixel locations. To allow integration into a VTS, these locations are converted to GPS coordinates, which requires the camera to be calibrated. The camera calibration is performed based on the horizon line and the vertical vanishing point (Brouwers et al., 2016). In our application, we have defined the horizon line and vertical vanishing point by manually annotating several parallel lines in the camera image. Then, the algorithm proposed by Brouwers et al. computes the camera calibration. The obtained camera calibration converts pixel locations to a local real-world grid (defined with respect to the camera position). Finally, an annotated point correspondence between GPS coordinates and pixel locations is exploited to rotate and translate the local real-world grid to GPS coordinates.
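A minimal sketch of the pixel-to-GPS conversion, assuming the calibration of Brouwers et al. has been condensed into a 3x3 homography H from image pixels to the local metric ground grid, and that a rotation angle and the GPS position of the grid origin have been derived from the annotated point correspondence. The function name and the equirectangular approximation are illustrative assumptions.

```python
import numpy as np

EARTH_RADIUS = 6371000.0  # metres

def pixel_to_gps(px, py, H, origin_lat, origin_lon, rotation_deg):
    """Convert a pixel position to GPS coordinates.

    H              : 3x3 homography from image pixels to the local metric grid
    origin_lat/lon : GPS position of the local grid origin (from the annotated
                     point correspondence)
    rotation_deg   : rotation aligning the local grid with the east/north axes
    """
    # Pixel -> local ground-plane coordinates (metres), homogeneous mapping
    p = H @ np.array([px, py, 1.0])
    x_local, y_local = p[0] / p[2], p[1] / p[2]

    # Rotate the local grid so that x points east and y points north
    a = np.radians(rotation_deg)
    east = np.cos(a) * x_local - np.sin(a) * y_local
    north = np.sin(a) * x_local + np.cos(a) * y_local

    # Small-offset (equirectangular) conversion of metres to degrees
    dlat = np.degrees(north / EARTH_RADIUS)
    dlon = np.degrees(east / (EARTH_RADIUS * np.cos(np.radians(origin_lat))))
    return origin_lat + dlat, origin_lon + dlon
```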
4 PROPOSED DATASET
GENERATION
The CNN network requires extensive training data to cover all the intra-class variations, e.g. various vessel types and viewpoints. To obtain sufficient training data, we recorded 73 days of video during a time-span of 6 months. Manual selection of vessel samples would be an enormous task. Therefore, we present an algorithm to automatically select frames containing ships from many hours of video to construct our dataset. Note that the choice of algorithm is not relevant for the final detection performance. We want to emphasize that the detection algorithm for dataset generation has insufficient accuracy for ship localization and can only be used to detect the presence of a ship. Therefore, ship localization is performed manually for dataset generation.

Figure 3: Overview of the different camera viewpoints in the dataset with the number of vessels in each viewpoint (#1: 17,422; #2: 6,192; #3: 351; #4: 18,253; #5: 6,310; #6: 10,186; #7: 331; #8: 2,156; #9: 1,746; #10: 7,566).
4.1 Video Frame Selection
The aim of our proposed algorithm is to efficiently find interesting samples of vessels in many hours of video. For simplicity, the recordings are split into separate viewpoints and the goal is only to detect whether a vessel occurs in the current frame of the video.
The algorithm downscales the original video input to 512 x 512 pixels and divides each frame into cells of 8 x 8 pixels. For each cell, we subtract the pixel values of the corresponding cell in the previous frame and compute the mean µ and variance σ² of the pixel values in the cell. We decide per cell whether it contains an object, based on a threshold for the mean and the variance. A low difference in mean implies that the cells are close in color intensity, whereas a high difference in mean indicates that the colors have changed significantly. Similarly, a low difference in variance indicates that the structure within the cell is similar, whereas a high difference in variance indicates that the cell contains a significant difference in texture activity compared to the same cell in the previous frame. We have empirically determined the thresholds µ = 0.12 and σ² = 0.003, and decide that a vessel is present in the video frame when more than 12 neighboring cells exceed these thresholds. In this case, we store the frame for further annotation, with a minimum time-span of 5 seconds between stored frames. Although our algorithm for selecting frames containing vessels is not perfect and also responds to (large) bow waves and reflections in the water, it provides a small subset of frames containing objects in a fast and efficient way, in order to prepare a large dataset in limited time. The video frames obtained by the algorithm are manually post-processed to remove frames without vessels. In the resulting frames, we manually annotate ships by drawing a bounding box around the complete vessel. Ships pushing a barge, i.e. towboats, are annotated separately from the barge. When multiple barges are connected, we annotate each barge individually.
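The frame-selection rule can be summarized in a few lines of NumPy. The sketch below follows the description above (8x8-pixel cells, thresholds on the mean and variance of the per-cell frame difference, more than 12 neighbouring active cells), but approximates the neighbourhood criterion by 8-connected components and assumes pixel values scaled to [0, 1]; both are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

MEAN_THR, VAR_THR, MIN_CELLS = 0.12, 0.003, 12   # empirically determined values

def contains_vessel(prev_frame, cur_frame, cell=8):
    """Decide whether a 512x512 frame (values in [0, 1]) likely contains a vessel.

    The frame difference is split into 8x8-pixel cells; a cell is 'active' when
    both the mean and the variance of its difference exceed the thresholds.
    A frame is stored when a connected group of more than MIN_CELLS active
    cells exists (8-connectivity approximates the neighbourhood rule).
    """
    diff = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32))
    h, w = diff.shape
    cells = diff[:h - h % cell, :w - w % cell].reshape(h // cell, cell,
                                                       w // cell, cell)
    means = cells.mean(axis=(1, 3))
    variances = cells.var(axis=(1, 3))
    active = (means > MEAN_THR) & (variances > VAR_THR)

    labels, n = ndimage.label(active, structure=np.ones((3, 3)))
    if n == 0:
        return False
    largest = np.bincount(labels.ravel())[1:].max()
    return largest > MIN_CELLS
```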
4.2 Dataset Statistics
Video was recorded from various camera viewpoints (angles). Therefore, all ships appear only once in each viewpoint dataset. Per viewpoint, multiple images of a ship are selected and annotated, such that there are examples of various orientations and backgrounds. Ships making the same route repetitively will occur more often. The final vessel dataset contains a total of 70,513 vessels in 48,966 images, collected from 10 different camera viewpoints. The viewpoints and the number of samples per viewpoint are shown in Figure 3. Example annotations from our dataset are depicted by the white bounding boxes in Figure 7. The figure also shows difficult situations such as low light, very small ships and clutter from in-harbour structures. Figure 4 visualizes the width and height of all annotations in the dataset. It can be clearly seen that ships typically have a large aspect ratio (long objects). Many truncated annotations (green) occur, due to ships that are not fully visible in the camera view. The dataset also contains some ships that are truncated at both sides of the image. For the selection of the prior-boxes in the SSD detector (see Section 3.1), we commence with selected values of aspect ratios and update the occurrences accordingly during training. The resulting prior-boxes are depicted by yellow circles in Figure 4 and match well with the distribution of the ship annotations.
Figure 4: Visualization of the size (width vs. height) of all annotations. Green data points denote truncated objects, while violet data points are vessels that are completely visible in the camera image. The yellow dots show the prior-box aspect ratios and sizes used in our SSD implementation.
5 EXPERIMENTAL RESULTS
The ship detection performance of the SSD detector has been experimentally validated. Cross-validation is performed over the camera viewpoints. The evaluation criteria are presented first, after which the details of the SSD detector are given. Cross-validation is then used to obtain an objective evaluation of the detection performance, followed by an in-depth discussion in terms of aspect ratio and size.
A. Training: In all experiments, the pre-trained VGG16 model (Simonyan and Zisserman, 2014) has been used, similar to the original SSD network (Liu et al., 2016). The output layers are configured as described in Section 3.1 and are trained from scratch. The model is fine-tuned with an initial learning rate of 10^-5, momentum 0.9, weight decay 0.0005, a multistep learning policy [2k, 5k, 40k, 80k] and batch size 12, for 120k iterations. The other parameters are equal to those of the original SSD implementation. We train the detector using the proposed vessel dataset and add the PASCAL VOC 2007 set to gather hard negatives.
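As an illustration of the multistep learning policy, the sketch below drops the learning rate at each of the listed iteration boundaries. The initial learning rate is reconstructed as 1e-5 (the sign of the exponent is lost in the extracted text) and the decay factor gamma = 0.1 is the common SSD/Caffe default, assumed here.

```python
def multistep_lr(iteration, base_lr=1e-5, gamma=0.1,
                 steps=(2000, 5000, 40000, 80000)):
    """Multistep policy: multiply the learning rate by gamma at each boundary.

    base_lr and the step boundaries follow the text; gamma is an assumption.
    """
    drops = sum(iteration >= s for s in steps)
    return base_lr * (gamma ** drops)
```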
B. Evaluation: This is carried out using the Average Precision (AP) metric as used in the PASCAL VOC challenge (Everingham et al., 2012). This metric summarizes a recall-precision curve by the average interpolated precision value of the positive samples. Recall R(c) denotes the fraction of objects that are detected with a confidence value of at least c. An object is detected if the detected bounding box has a minimum Jaccard index of 0.5 with the ground-truth bounding box; otherwise a detection is considered incorrect. Precision is defined as the fraction of detections that are correct:

P(c) = R(c) · N_j / (R(c) · N_j + F(c)),   (1)

where N_j is the number of ships in the ground-truth set and F(c) denotes the number of incorrect detections with a confidence value of at least c.
Besides the AP metric, the average normalized precision (AP_N), as proposed by Hoiem et al., is used (Hoiem et al., 2012). This normalized metric is required to ensure that subclasses with specific properties (aspect ratio and size) can be evaluated in a reproducible way, which handles cases where properties overlap or vary arbitrarily. To this end, Hoiem et al. introduce a constant N_j = N to compute the normalized precision P_N(c). We choose N = #images for each dataset. The normalized precision P_N(c) values of the positive samples are averaged to obtain the average normalized precision AP_N.
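Both metrics can be computed directly from the matched detections. The sketch below assumes the detections are sorted by descending confidence and already matched to the ground truth (Jaccard index of at least 0.5); passing the number of images as normalize_to replaces N_j by the constant N of Hoiem et al., yielding AP_N instead of AP. It is an approximation of the PASCAL VOC procedure, not the official evaluation code.

```python
import numpy as np

def average_precision(is_correct, num_gt, normalize_to=None):
    """AP (or AP_N) from detections sorted by descending confidence.

    is_correct   : boolean array, True where a detection matches a ground-truth
                   box with Jaccard index >= 0.5
    num_gt       : number of ground-truth ships N_j
    normalize_to : if given (e.g. the number of images), N_j is replaced by
                   this constant, yielding the normalized precision P_N
    """
    is_correct = np.asarray(is_correct, dtype=bool)
    tp = np.cumsum(is_correct)
    fp = np.cumsum(~is_correct)
    recall = tp / num_gt
    n = normalize_to if normalize_to is not None else num_gt
    precision = (recall * n) / (recall * n + fp)           # Eq. (1)
    # Interpolated precision: monotonically non-increasing envelope
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Average the interpolated precision values at the positive detections
    return precision[is_correct].mean()
```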
5.1 Cross-validation on Viewpoints
In the first experiment, we perform a cross-validation on the 10 viewpoints to gain more insight into the influence of scene context on the detection performance. Each detector excludes one viewpoint from the training set and uses that viewpoint for evaluation; for example, Detector 1 is trained on all viewpoints except Viewpoint 1. Additionally, the performance of the original SSD network, without our modifications to the aspect ratios, is also evaluated. This original SSD configuration is trained on two different datasets. First, the "SSD512 Trained" network is based on our proposed dataset. Second, the "SSD512 Original" network has been trained on the PASCAL VOC07+12 and the Microsoft COCO datasets from the original SSD paper. The trained detectors are evaluated on all viewpoints by measuring the average precision. Results of this evaluation are shown in Table 1.
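The cross-validation protocol amounts to a leave-one-viewpoint-out split. A minimal sketch, assuming the annotated samples are grouped per viewpoint in a dictionary (the data layout is an assumption).

```python
def leave_one_viewpoint_out(annotations_per_viewpoint):
    """Yield (held-out viewpoint, train set, test set) for the 10-fold split.

    annotations_per_viewpoint : dict mapping viewpoint id -> list of samples.
    Detector k is trained on all viewpoints except viewpoint k and is
    evaluated on viewpoint k.
    """
    for held_out in sorted(annotations_per_viewpoint):
        train = [s for vp, samples in annotations_per_viewpoint.items()
                 if vp != held_out for s in samples]
        test = annotations_per_viewpoint[held_out]
        yield held_out, train, test
```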
We can observe that the average precision is around 0.90 for most viewpoints in the cross-validation (bold diagonal values in the table). The performance does not decrease, or decreases only marginally, when the viewpoint is not used for training. We can conclude that the combination of images from all viewpoints is sufficient to train a well-performing detector. However, there are some exceptions, which we discuss in more detail. Viewpoints 2 and 4 show a lower detection performance, even for detectors that include these viewpoints in their training data (Columns 2 and 4). We will evaluate this further in Section 5.3. Viewpoint 8 shows a lower performance for Detector 8, meaning that Viewpoint 8 contains specific information not occurring in other viewpoints. A large bridge structure in the camera view causes missed and incorrect detections.
Table 1: Cross-validation of the ship detectors (vertical) per viewpoint (horizontal) using the average precision. Each row shows the results of a detector, each column shows the results on that camera viewpoint. "SSD512 Original" denotes the SSD512 detector trained on VOC07+12+COCO and "SSD512 Trained" is the original SSD configuration trained on our dataset. "SSD512 Proposed" is our configuration trained on our dataset.

Viewpoint:       1    2    3    4    5    6    7    8    9    10   Avg
Detector 1       0.89 0.82 0.90 0.78 0.90 0.91 0.91 0.91 0.91 0.90 0.88
Detector 2       0.90 0.69 0.90 0.76 0.90 0.91 0.91 0.91 0.91 0.90 0.87
Detector 3       0.90 0.79 0.90 0.76 0.90 0.91 0.91 0.91 0.91 0.90 0.88
Detector 4       0.90 0.78 0.90 0.59 0.90 0.91 0.91 0.91 0.90 0.90 0.86
Detector 5       0.90 0.80 0.90 0.77 0.90 0.91 0.91 0.91 0.91 0.90 0.88
Detector 6       0.90 0.81 0.90 0.77 0.90 0.90 0.91 0.91 0.90 0.90 0.88
Detector 7       0.90 0.80 0.90 0.77 0.90 0.91 0.91 0.91 0.90 0.90 0.88
Detector 8       0.90 0.81 0.90 0.77 0.90 0.91 0.91 0.80 0.90 0.90 0.87
Detector 9       0.90 0.80 0.90 0.77 0.90 0.91 0.91 0.91 0.90 0.90 0.88
Detector 10      0.90 0.80 0.90 0.77 0.90 0.91 0.91 0.91 0.90 0.88 0.88
SSD512 Original  0.48 0.28 0.50 0.30 0.66 0.46 0.16 0.18 0.19 0.62 0.41
SSD512 Trained   0.90 0.78 0.90 0.76 0.90 0.90 0.91 0.90 0.90 0.90 0.88
SSD512 Proposed  0.90 0.78 0.90 0.76 0.90 0.91 0.91 0.91 0.91 0.90 0.88

The original SSD network ("SSD512 Original") results in a low detection performance on our dataset. We conclude that the PASCAL VOC07+12 and COCO dataset statistics are not sufficient to train a good detector for typical surveillance applications. Furthermore, the original SSD512 network trained on our dataset ("SSD512 Trained") obtains only slightly lower performance than our modified configuration of the SSD network (Viewpoints 6, 8 and 9). It is therefore concluded that the proposed fine-tuning yields only a marginal improvement in detection performance. Apparently, both systems (with different prior-box configurations) are able to accurately localize ships.
5.2 Aspect Ratio and Size
In this section, the influence of aspect ratio and size is measured by evaluating the AP_N. For both features, individual categories are defined. Bounding boxes are assigned to a size category, based on their percentile size (Hoiem et al., 2012). The adopted size categories are: Extra Small (XS, bottom 10%), Small (S, next 20%), Medium (M, next 40%), Large (L, next 20%) and Extra Large (XL, last 10%). For the aspect-ratio categories, we summarize all tall objects (T, a_r < 1.0) in a single category, since very few tall objects occur in our dataset. For all other aspect ratios, we define the following categories based on the object's percentile aspect ratio: Medium (M, bottom 10%), Medium Wide (MW, next 20%), Wide (W, next 40%), Extra Wide (XW, next 20%) and Ultra Wide (XXW, last 10%).
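The percentile-based assignment can be written compactly with NumPy; the sketch below bins annotations into the five size categories using the 10/30/70/90 area percentiles, which correspond to the 10/20/40/20/10% split above. Aspect-ratio categories follow the same scheme after separating all objects with a_r < 1 into the Tall category.

```python
import numpy as np

def size_categories(widths, heights):
    """Assign each annotation to XS/S/M/L/XL based on its percentile area.

    Bins follow the text: bottom 10% = XS, next 20% = S, next 40% = M,
    next 20% = L, last 10% = XL.
    """
    areas = np.asarray(widths) * np.asarray(heights)
    edges = np.percentile(areas, [10, 30, 70, 90])
    labels = np.array(['XS', 'S', 'M', 'L', 'XL'])
    return labels[np.digitize(areas, edges)]
```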
The categories and corresponding bounding-box sizes and aspect ratios are shown in Table 2. Note that the size difference between the XS and XL categories is very large, so the detector performance is evaluated over a large scale range. Also note that the largest aspect ratio corresponds to a ship of only 18.4 pixels in height covering the entire input image width (512 pixels).
Table 2: Size and aspect-ratio categories with the corresponding maximum actual size and aspect ratio.
Size categories (max. area in pixels): XS: 131; S: 1,901; M: 12,038; L: 24,749; XL: 114,688.
Aspect-ratio categories (max. a_r): T: 1.0; M: 1.7; MW: 2.8; W: 5.6; XW: 8.2; XXW: 27.8.

The AP_N values for the size categories for each viewpoint are shown in Figure 5. In all viewpoints, the detectors have a high performance on Medium to Extra Large ships. Extra Small ships are poorly detected. In Viewpoints 2 and 4, many small ships appear, which causes a low average precision (denoted by the dotted line in the figure). The detection performance improves for Small ships.
Evaluation on the aspect ratios (Figure 5) shows that the detection performance on tall objects is poor, while the detection performance for Wide (W), Extra Wide (XW) and Ultra Wide (XXW) objects is good. Objects in the Tall and Medium aspect-ratio categories are typically heavily truncated and represent only the front or back sides of ships. Frontal and rear views of ships dominantly occur far away from the camera (in Viewpoints 2 and 4), leading to a low performance due to the small size of the ships. In several viewpoints, the XXW performance is lower, which is mainly caused by ships that are only partly visible in the camera view or even extend beyond the camera view on both sides.
Figure 5: Average normalized precision for ship sizes (top) and aspect ratio (bottom) shown for all viewpoints.

5.3 Visual Inspection

The performance of Detectors 2 and 8 is further investigated by creating heat maps of the locations of missed ground-truth objects and incorrect detections. For each heat map, the number of bounding boxes per pixel location is counted and normalized. Note that the heat maps are scaled independently from each other, so there is no relation to the number of objects.
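A minimal sketch of how such a heat map can be accumulated; the bounding-box format and the image size are assumptions for illustration.

```python
import numpy as np

def heat_map(boxes, image_shape=(1536, 2048)):
    """Accumulate bounding boxes into a normalized per-pixel heat map.

    boxes       : iterable of (x1, y1, x2, y2) missed or false detections
    image_shape : (height, width) of the camera image
    Returns a float map in [0, 1]; each heat map is normalized on its own,
    so values of different heat maps are not directly comparable.
    """
    heat = np.zeros(image_shape, dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        heat[int(y1):int(y2), int(x1):int(x2)] += 1.0
    if heat.max() > 0:
        heat /= heat.max()
    return heat
```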
Figure 6 shows the heat maps created for Viewpoints 2 and 8. It can be observed that missed detections for Viewpoint 2 mainly occur at locations in the scene background. This confirms our earlier finding that very small ships are not detected. False detections mainly occur at land areas surrounded by water and around the bridge structure farther away in Viewpoint 2; this also holds for the similar Viewpoint 4.
The heat maps for Viewpoint 8 show that missed detections mainly occur when ships move under the bridge, where incorrect detections also occur. This points to localization errors (bridge structure) and changing conditions (shadows). Ships are accurately detected when they are in open space.

Figure 6: Heat maps for missed objects and incorrect detections. Note that each heat map is individually scaled, so colors are not directly comparable. The two cases are taken from critical camera viewpoints, pointing to difficult conditions.
Figure 7 visualizes several detection examples and shows some typical cases of incorrect and missed detections. Localization errors mainly occur due to low visibility of ship parts, for example caused by strong shadows under the bridge. Other typical localization errors are single detections covering both a tow/tug boat and the connected barge. Sailing ships are sometimes not detected due to their high mast extending above the body of the ship. The correct detections indicate that the detector can handle the large range of intra-class variation present in our dataset.

Figure 7: Examples of correct detections (top rows) and incorrect/missed detections (bottom rows). Correct detections are shown in yellow, incorrect detections in red and ground-truth annotations in white.
6 DISCUSSION
Although the original SSD network is not specifically designed for objects with large aspect ratios such as ships, it obtains only marginally lower detection performance than our proposed SSD network. This shows that the original configuration can also handle the high aspect ratios of vessels. Hence, it indicates that the SSD network is able to exploit all relevant information from the training dataset, without the need for manual fine-tuning.
False detections mostly occur due to localization errors and viewing distance, and not due to ship scale. One possible solution to detect small (distant) ships is to apply the SSD network at a higher resolution. A more elegant approach would be to incorporate the perspective of the scene into the network.
Our dataset contains many different ship types, but
is limited in the number of viewpoints and scenes.
Therefore, the dataset should be extended with more
diversity in the scenes.
7 CONCLUSIONS
We have presented the application of the SSD object detector in the field of vessel surveillance and introduced a novel dataset containing 70,513 ships in 48,966 images, covering 10 camera viewpoints. Cross-validation over the viewpoints shows that the SSD detector obtains an average precision of over 90%, which results in accurate detection of ships. The system detects vessels over a large range of variations in aspect ratio and size. The trained network can accurately detect various types of ships, such as tow boats, sailing vessels and barges.
An in-depth evaluation of the influence of aspect ratio and size as specific features was performed. It was found that the detector can handle extreme variations in aspect ratio, while variations in size are also well handled by the detector. However, very small ships are poorly detected. Other failure cases mainly originate from heavily truncated vessels at image boundaries, incorrect merging of multiple ships and the influence of surrounding infrastructure such as bridges.
The SSD detector trained on the proposed surveillance dataset significantly outperforms the detector trained on the PASCAL and COCO datasets. This shows that the dataset statistics of the commonly used generic object detection datasets are quite different from our real-life surveillance dataset, specifically dedicated to harbours and ships.
The obtained performance and robustness of the developed ship detector prove to be valuable for surveillance of harbour infrastructure, where radar is already used. The locations of the detected vessels complement the positioning information of the radar system, leading to a higher accuracy of the Vessel Tracking System (VTS). Moreover, the use of a camera enables visual feedback on details of the ships and provides the operator with a visual cue about the considered vessels.
REFERENCES
Arshad, N., Moon, K.-S., and Kim, J.-N. (2010). Multiple ship detection and tracking using background registration and morphological operations. In Signal Processing and Multimedia, pages 121-126. Springer.
Bao, X., Javanbakhti, S., Zinger, S., Wijnhoven, R., and de With, P. H. N. (2013). Context modeling combined with motion analysis for moving ship detection in port surveillance. Journal of Electronic Imaging, 22(4).
Brouwers, G. M. Y. E., Zwemer, M. H., Wijnhoven, R. G. J., and de With, P. H. N. (2016). Automatic calibration of stationary surveillance cameras in the wild. In Proc. ECCV 2016 Workshops, pages 743-759. Springer.
Everingham, M. et al. (2007). The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.
Everingham, M. et al. (2012). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
Girshick, R. (2015). Fast R-CNN. In ICCV.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE CVPR.
Hoiem, D., Chodpathumwan, Y., and Dai, Q. (2012). Diagnosing Error in Object Detectors, pages 340-353.
Hu, W.-C., Yang, C.-Y., and Huang, D.-Y. (2011). Robust real-time ship detection and tracking for visual surveillance of cage aquaculture. Journal of Visual Comm. and Image Representation, 22(6):543-556.
Lin, T. et al. (2014). Microsoft COCO: common objects in context. CoRR, abs/1405.0312.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In ECCV, pages 21-37. Springer.
Redmon, J., Divvala, S. K., Girshick, R. B., and Farhadi, A. (2015). You only look once: Unified, real-time object detection. CoRR, abs/1506.02640.
Russakovsky, O. et al. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211-252.
Shi, J. and Tomasi, C. (1994). Good features to track. In 1994 Proceedings of IEEE CVPR, pages 593-600.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition.
Uijlings, J. R., Van De Sande, K. E., Gevers, T., and Smeulders, A. W. (2013). Selective search for object recognition. Int. journal of comp. vision, 104(2):154-171.
Wijnhoven, R., Van Rens, K., Jaspers, E. G., and de With, P. H. (2010). Online learning for ship detection in maritime surveillance. In Proc. of 31st Symposium on Information Theory in the Benelux, pages 73-80.