Deep Neural Networks with Intersection over Union Loss for Binary
Image Segmentation
Floris van Beers, Arvid Lindström, Emmanuel Okafor and Marco A. Wiering
Bernoulli Institute, Department of Artificial Intelligence, University of Groningen,
Nijenborgh 9, Groningen, The Netherlands
Keywords:
Deep Learning, Image Segmentation, Loss Function, Intersection over Union, Jaccard Index.
Abstract:
In semantic segmentation tasks the Jaccard Index, or Intersection over Union (IoU), is often used as a measure
of success. While this measure is more representative than per-pixel accuracy, state-of-the-art deep neural
networks are still trained on accuracy by using Binary Cross Entropy loss. In this research, an alternative is
used where deep neural networks are trained for a segmentation task of human faces by optimizing directly
an approximation of IoU. When using this approximation, IoU becomes differentiable and can be used as
a loss function. The comparison between IoU loss and Binary Cross Entropy loss is made by testing two
deep neural network models on multiple datasets and data splits. The results show that training directly on
IoU significantly increases performance for both models compared to training on conventional Binary Cross
Entropy loss.
1 INTRODUCTION
Semantic segmentation aims to map each pixel in an
image to its associated label such as car, building
and pedestrian. In the field of image segmentation
with deep neural networks, increasingly complex sys-
tems are created to improve on the semantic segmen-
tation task. All these systems compete in complex-
ity and state-of-the-art performance. A commonality
between early works described in (Shelhamer et al.,
2017), more recent works such as SegNet (Badri-
narayanan et al., 2017) and comparative studies (Siam
et al., 2018) is that they use Intersection over Union
(IoU), also known as the Jaccard Index, as a measure
of success on test images. IoU is much more indica-
tive of success for segmentation tasks, compared to
pixel-wise accuracy, especially when positive labels are sparse in the data. When labels used for training
consist of 80-90% background, and only a small per-
centage of positive labels, a naive measure such as ac-
curacy can score up to 80-90% by labeling everything
as background. Because IoU does not concern it-
self with true negatives, this naive solution will never
occur if IoU is used as the loss function, even with
highly sparse data. While this research is focused on
binary segmentation, the naive solution problem is amplified in segmentation over multiple classes, as the ratio of background to any specific class becomes worse with more classes.
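As a brief numeric illustration of this effect (a toy example constructed here, not taken from the datasets used later), the following sketch compares pixel-wise accuracy and IoU when a 90%-background mask is answered with an all-background prediction:

```python
import numpy as np

# Hypothetical 10x10 label mask: 10 face pixels (1), 90 background pixels (0).
label = np.zeros((10, 10), dtype=int)
label[4:6, 3:8] = 1  # a small 2x5 "face" region

# Naive prediction: everything is background.
pred = np.zeros_like(label)

accuracy = (pred == label).mean()                  # 0.90
intersection = np.logical_and(pred, label).sum()   # 0
union = np.logical_or(pred, label).sum()           # 10
iou = intersection / union if union > 0 else 1.0   # 0.0

print(f"accuracy = {accuracy:.2f}, IoU = {iou:.2f}")
```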
Under the assumption that IoU is the more informative measure of success, this research attempts to use a loss
function based on IoU to train a segmentation model
directly. While research has been done extensively on
which loss functions are best for which task (Janocha
and Czarnecki, 2017; Zhao et al., 2017), the default
used for image segmentation is still usually Cross En-
tropy (Shelhamer et al., 2017; Badrinarayanan et al.,
2017; Noh et al., 2015). Cross Entropy is a loss func-
tion that is, mathematically, much more closely re-
lated to accuracy than IoU, even though the final per-
formance of semantic segmentation models is mea-
sured using IoU. By defining a loss function more
closely related to IoU the training process could be
improved. As such the question that needs to be
answered is as follows: Can a model trained on an
IoU loss function perform better than a model trained
on Binary Cross Entropy (BCE) loss? Previous re-
search on optimizing IoU (Rahman and Wang, 2016)
has proposed a loss function based directly on IoU.
This loss function is tested on a single model, which
is trained on the separate classes of several semantic
segmentation datasets, and significant improvements were found in that setting.
In this research, since binary segmentation is consid-
ered, the comparison will be between this loss func-
tion based on IoU, detailed in section 2, and BCE. Op-
timization of the IoU has also been used in other work
on semantic segmentation, as shown in work with the
Conditional Random Field (Ahmed et al., 2015) and
work on probabilistic models (Nowozin, 2014). In
another work a different loss function is proposed,
which is also based on IoU (Yuan et al., 2017). This
implementation strays further from IoU by applying a
quadratic component. As such, in this work, the loss
function without this quadratic component is used as
an implementation of IoU loss. Another loss function
proposed to optimize IoU is the Lovász-Softmax loss
(Berman and Blaschko, 2017). By similar reasoning, this loss function was not considered in this work, as it also strays further from a purely IoU-based approach.
This research extends previous work on the topic
(Rahman and Wang, 2016), and evaluates the perfor-
mance differences between loss functions on multiple
models. In addition, to focus solely on the difference
in performance based on the loss functions, the mod-
els will use the same structure and parameters. A base model has been created in which only the loss function is changed, keeping all other parameters constant. To further emphasize the difference in perfor-
mance based purely on the use of a new loss func-
tion a sufficiently dense dataset has been chosen. This
avoids skewing the results in favor of IoU loss, which
theoretically performs better with a sparser dataset. In
previous research (Rahman and Wang, 2016), the data
is much sparser due to the use of separate classes in
multi-class segmentation datasets, such as PASCAL
VOC2011 (Everingham et al., 2011).
In section 2, the models, datasets, and the loss
functions will be elaborated upon. Some mathemat-
ical adaptation of IoU will be used in order to make
the calculation differentiable. In section 3 the experi-
mental setup will be detailed, both in terms of hyper-
parameters and dataset splits. In section 4 the out-
come of the experiments will be presented. These re-
sults will be put into context in section 5 and evalu-
ated for statistical relevance. Furthermore, this sec-
tion shows some example output. Final conclusions
and suggestions for future work will be conveyed in
section 6.
2 METHODS
2.1 Models
In this research, two different encoders are extended
into fully convolutional networks (FCNs) for their use
as a semantic segmentation model. These encoders
are the convolutional parts of VGG-16 (Simonyan and
Zisserman, 2014) and ResNet-50 (He et al., 2015).
We have trained our own custom neural network
systems using pretrained weights from the earlier
mentioned neural network systems. These were pre-
viously trained on the ImageNet dataset (Deng et al.,
2009). The training of the custom systems employs
face images from two separate face datasets used as
input to the proposed methods. All experiments were
carried out with the aid of the Keras deep learning
framework (Chollet et al., 2015), because it contains
rich libraries for computer vision tasks.
2.1.1 VGG-16/BFCN-32s
The VGG-16 network (Simonyan and Zisserman,
2014) is a well-established deep neural network
(DNN) used for classification. It is a convolutional network with 13 convolutional layers followed by several fully connected layers, for 16 weight layers in total. Finally,
a softmax layer determines the classification out-
come. It has been trained extensively on the ImageNet
dataset (Deng et al., 2009). VGG-16 has previously
been extended to a fully convolutional network (Shel-
hamer et al., 2017). For this research the same exten-
sion from VGG-16 to FCN-32s was used as described
in previous work (Shelhamer et al., 2017). A notable
difference is the number of output classes. Where the
original FCN-32s is modelled to segment 21 classes,
i.e. the 20 classes of the VOC-2011 dataset (Evering-
ham et al., 2011) and background, the adaptation used
here considers only 2 classes. These classes are face
and background.
The main steps in adapting VGG-16 to the seg-
mentation task remain the same as described in pre-
vious work (Shelhamer et al., 2017). The fully con-
nected layers of VGG-16 are replaced with fully con-
nected convolutional layers. The first of these convo-
lutional layers has 4096 feature maps, a kernel size
of 7 × 7 and a stride of 1. Since this kernel covers the entire 7 × 7 feature map, the layer is essentially fully connected. The second
fully connected layer is replaced by a similar layer,
but with a kernel size of 1 × 1. These layers use a
rectified linear unit activation. Finally, the fully con-
nected layer that is the size of the label space is re-
placed with a fully convolutional layer with the same
purpose. This layer has 21 feature maps, one for each
class, in the original creation of FCN-32s (Shelhamer
et al., 2017) and has 1 feature map in our adaptation,
named BFCN-32s (Binary FCN-32s). This final layer
goes from feature space to label space and uses a lin-
ear activation to create the feature map.
To make the model fully convolutional, the result-
ing 7 × 7 feature maps have been upsampled by a
trainable deconvolution layer. This layer uses a stride
of 32 to regain the original image size, counteracting the size reduction performed by the max-pooling layer in each convolutional block of the encoder.
Since the number of classes is reduced to 2, a soft-
max layer as used in the original model (Shelhamer
et al., 2017) is no longer necessary. Instead, the up-
sampling layer uses a sigmoid activation to map the
output pixels to values between 0 and 1. These steps
create a model where the input image is the same size
as the output image, allowing the model to be trained
pixel-wise end-to-end.
As with the implementation of FCN-32s (Shel-
hamer et al., 2017), all layers of the VGG encoder
were frozen before training on any new data, with the
exception of the last convolutional layers positioned
in convolutional block 5. This is done to speed up the
training process. In addition, the datasets used here are too small to meaningfully retrain the earlier layers, which are already sufficiently trained on ImageNet (Deng et al., 2009).
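The following minimal sketch, written against the TensorFlow Keras API for concreteness, illustrates how such a BFCN-32s head could be assembled from the description above. The layer-name prefix used for freezing, the 64 × 64 deconvolution kernel and the use of 'same' padding are our assumptions for illustration; the original implementation may differ in these details.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bfcn32s(input_shape=(224, 224, 3)):
    # VGG-16 encoder with ImageNet weights and without the dense top.
    encoder = keras.applications.VGG16(weights="imagenet",
                                       include_top=False,
                                       input_shape=input_shape)

    # Freeze all encoder layers except the block 5 convolutions.
    for layer in encoder.layers:
        layer.trainable = layer.name.startswith("block5_conv")

    x = encoder.output                                   # 7 x 7 x 512 feature maps
    # Fully connected layers replaced by convolutions (4096 maps, 7x7 then 1x1).
    x = layers.Conv2D(4096, 7, padding="same", activation="relu")(x)
    x = layers.Conv2D(4096, 1, activation="relu")(x)
    # From feature space to label space: a single feature map, linear activation.
    x = layers.Conv2D(1, 1, activation="linear")(x)
    # Trainable deconvolution with stride 32 and a sigmoid output in [0, 1].
    # Kernel size 64 is an assumption; the stride of 32 restores the input size.
    out = layers.Conv2DTranspose(1, kernel_size=64, strides=32,
                                 padding="same", activation="sigmoid")(x)
    return keras.Model(encoder.input, out, name="bfcn32s")
```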
2.1.2 ResNet/FCResNet
Another well-performing classification network is
ResNet-50 (He et al., 2015). This model uses resid-
ual learning in a 50-layer DNN to classify images. It
has been trained on the ImageNet dataset (Deng et al.,
2009). The same steps as described in section 2.1.1
were taken to make a fully convolutional version from
ResNet, named FCResNet in this research. Since ResNet does not have fully connected layers, only a softmax layer the size of the label space, a fully connected convolutional layer was added. After this, a convolutional layer was used to replace the softmax layer and perform the same conversion from feature space to label space as mentioned in section 2.1.1.
The fully connected convolutional layer, as described
in section 2.1.1, is used to facilitate the transition be-
tween the feature extraction part of the model and the
reduction to label space. Without this layer, the re-
duction to label space is too aggressive and therefore
loses a lot of information. This fully connected con-
volutional layer is similar to the one used in BFCN-
32s and uses 4096 feature maps, a kernel size of 7
× 7 and a stride of 1. The convolution layer replac-
ing the softmax layer reduces the feature maps from
4096 to 1 through linear activations. Finally, a similar
up-sampling layer was added to obtain an output im-
age with the same size as the input image, resulting in
a pixel-wise end-to-end trainable version of ResNet.
This up-sampling layer uses a sigmoid activation, as
explained in section 2.1.1.
As with BFCN-32s, some layers of FCResNet
were frozen. Using the same argumentation as in sec-
tion 2.1.1, all layers of the encoder were frozen except
the last convolutional block, named res5.
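A corresponding sketch for FCResNet is given below, under the same assumptions as the BFCN-32s sketch; the prefix used to select the last residual block ('conv5_' in recent Keras versions, 'res5'/'bn5' in older ones) is likewise an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fcresnet(input_shape=(224, 224, 3)):
    # ResNet-50 encoder with ImageNet weights and without the classification top.
    encoder = keras.applications.ResNet50(weights="imagenet",
                                          include_top=False,
                                          input_shape=input_shape)

    # Freeze all encoder layers except the last residual block (res5).
    for layer in encoder.layers:
        layer.trainable = layer.name.startswith(("conv5_", "res5", "bn5"))

    x = encoder.output                                   # 7 x 7 x 2048 feature maps
    # Single fully connected convolutional layer easing the transition
    # from feature space towards label space.
    x = layers.Conv2D(4096, 7, padding="same", activation="relu")(x)
    # Reduction from 4096 feature maps to 1 with a linear activation.
    x = layers.Conv2D(1, 1, activation="linear")(x)
    # Up-sampling by a factor of 32 with a sigmoid activation.
    out = layers.Conv2DTranspose(1, kernel_size=64, strides=32,
                                 padding="same", activation="sigmoid")(x)
    return keras.Model(encoder.input, out, name="fcresnet")
```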
2.2 Datasets
To explore the differences in performance between these models, two pixel-wise labeled datasets have been
used. These are the Labeled Faces in the Wild: Part
Labels dataset (LFW) (Kae et al., 2013) and the HE-
LEN dataset (Le et al., 2012). Both sets have been
chosen for their relatively dense appearance of posi-
tive labels. The objective is to use densely distributed
data instead of sparse, because the latter might skew
the results in favor of an IoU approach, as explained
in section 1.
2.2.1 LFW
Labeled Faces in the Wild: Part Labels (Kae et al.,
2013) is a dataset containing 2927 images of faces, an
example of which is shown in figure 1. These images
are labeled in two ways, but for this research, only the
pixel-wise labeling is considered, as seen in figure 2.
This labeling is in three classes, namely face, hair and
background. Since the task considered is binary seg-
mentation, some preprocessing had to be performed.
For different experimental settings, we consider the
hair to be either part of the face, as shown in figure
3 or not, as shown in figure 4. This results in a bi-
nary labeling. The data was split by using 292 im-
ages (10%) for testing. The number of training and validation images depends on the experimental settings, as detailed in section 3.
Figure 1: Example Input Image LFW.
2.2.2 HELEN
The HELEN dataset (Le et al., 2012) is also a dataset
of faces, containing 2330 images. It is labeled in mul-
tiple classes for separate parts of the face, such as
mouth and hair.

Figure 2: Example 3-class Label LFW.

Figure 3: Example LFW label with hair.

As with LFW this dataset has to be preprocessed in such a way that it can be used for bi-
nary image segmentation. As such we construct new
labels out of the provided labels that either label all
parts of the face as face and the rest as background,
or label the hair as background as well. This results
in the same data structure as with the LFW dataset.
With this dataset, 233 images (10%) are used for test-
ing. The number of images for training and validation depends on the experimental setup.
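For illustration, a minimal sketch of this binary label construction is given below; the integer class encoding (0 background, 1 face, 2 hair) is hypothetical and only stands in for whatever encoding the actual label files use.

```python
import numpy as np

# Hypothetical integer encoding of the provided label maps; the actual
# LFW Part Labels and HELEN label formats may encode classes differently.
BACKGROUND, FACE, HAIR = 0, 1, 2

def to_binary_mask(label_map, include_hair=True):
    """Collapse a multi-class label map into a binary face/background mask."""
    if include_hair:
        return np.isin(label_map, (FACE, HAIR)).astype(np.float32)
    return (label_map == FACE).astype(np.float32)

# The same label map under both experimental settings (hair yes/no).
label_map = np.array([[0, 2, 2],
                      [0, 1, 1],
                      [0, 1, 0]])
mask_with_hair = to_binary_mask(label_map, include_hair=True)
mask_without_hair = to_binary_mask(label_map, include_hair=False)
```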
2.3 Loss Functions
In assessing the effectiveness of an IoU loss function,
a baseline has to be established. Binary Cross Entropy
(BCE) loss, or log loss, is such a baseline in that it is
used by default in a wide range of recent classifica-
tion and segmentation works (Shelhamer et al., 2017;
Badrinarayanan et al., 2017; Siam et al., 2018; Noh
et al., 2015).
Figure 4: Example LFW label without hair.

The formula for binary cross entropy loss can be seen in equation 1. In this equation T refers to the true label image, T_x refers to a single element of that label, P refers to the prediction of the output image and P_x to a single element of that prediction.

$$L_{BCE} = -\sum_x \big( T_x \log P_x + (1 - T_x) \log(1 - P_x) \big) \quad (1)$$
The cases of P_x = 1 and P_x = 0 would lead to log(0), which is undefined. To prevent this, the values from P are clipped to the range [ε, 1 − ε]. This is done by the Keras framework, where ε is set to 1 × 10⁻⁷.
In equation 1, it can be seen that BCE, while weighting errors by their predicted probability through the log term, rewards both true positives and true negatives, while penalizing false positives and false negatives. Referring back to the problem described
in section 1, this can lead to simplistic solutions to
segmentation when the data is significantly sparse, by
labeling all output as background.
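For reference, a plain NumPy sketch of equation 1 with the ε-clipping described above is shown below; Keras applies equivalent clipping internally, so this is only an illustration of the computation, not the framework's own code.

```python
import numpy as np

EPSILON = 1e-7  # same order of magnitude as the Keras default

def binary_cross_entropy(T, P):
    """Equation 1: binary cross entropy summed over all pixels."""
    P = np.clip(P, EPSILON, 1.0 - EPSILON)   # avoid log(0)
    return -np.sum(T * np.log(P) + (1.0 - T) * np.log(1.0 - P))

# Example usage with a small toy label/prediction pair.
T = np.array([1.0, 1.0, 0.0, 0.0])
P = np.array([0.9, 0.6, 0.2, 0.1])
print(binary_cross_entropy(T, P))   # ~ 0.94
```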
The other loss function we use is one that directly
incorporates the value for Intersection over Union.
This loss function is proposed in previous work (Rah-
man and Wang, 2016), and this previous work de-
scribes the mathematical aspects, which will be elab-
orated upon here as well. The original equation for
IoU can be given as:
$$IoU = \frac{|T \cap P|}{|T \cup P|} \quad (2)$$
As before, in equation 2, T stands for the true la-
bel image, P for the prediction of the output image
and the symbols are taken from set theory. This IoU
is then taken as the average over the entire set of pix-
els producing an IoU value between 0 and 1. These
set symbols are, however, not differentiable. To apply
these set symbols in their true form, the numbers in T
and P need to be absolute 1’s and 0’s. However, while
the label image T contains these values, the output P
contains values between 1 and 0 due to the sigmoid
activation in the final up-sampling layer of the net-
work. To solve this, an approximation of IoU can be
made using probabilities. This gives the equation for
this approximation IoU′:

$$IoU' = \frac{|T \cdot P|}{|T + P - (T \cdot P)|} = \frac{I}{U} \quad (3)$$
Here T and P remain the same, but T · P is the element-wise multiplication of T and P. In the numerator, this gives an approximation of Intersection by giving the probability of P_x when T_x is 1 and giving 0 otherwise. As such, the intersection is highest when P_x is 1 wherever T_x is 1, exactly as is to be expected. The denominator is an addition of T and P with a deduction of the Intersection, just as in a regular calculation for the union, to mitigate the effect of counting the intersection area twice.
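Using the same toy label and prediction vectors as in the BCE sketch above (values chosen arbitrarily for illustration), equation 3 evaluates as follows:

```python
import numpy as np

T = np.array([1.0, 1.0, 0.0, 0.0])   # toy label vector
P = np.array([0.9, 0.6, 0.2, 0.1])   # toy sigmoid outputs

I = np.sum(T * P)                    # intersection term: 1.5
U = np.sum(T + P - T * P)            # union term: 2.0 + 1.8 - 1.5 = 2.3
print(I / U)                         # IoU' ~ 0.65
```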
After reducing IoU from set operations to arithmetic operations, producing IoU′, the formula is differentiable. As a loss function, the error needs to approach 0 when results become better. To achieve this the loss function is defined in terms of IoU′ as such:

$$L_{IoU} = 1 - IoU' \quad (4)$$
This loss L_IoU is applied to each element in a batch and summed, producing a value between 0 and the batch size as the IoU of each sample approaches 1 or 0, respectively. This is a loss function that can, again, be minimized. To achieve this it needs to be differentiated, which is done in the following way:
$$\frac{\partial L_{IoU}}{\partial P_x} = \frac{-U \cdot \frac{\partial I}{\partial P_x} + I \cdot \frac{\partial U}{\partial P_x}}{U^2} \quad (5)$$

$$= \frac{-U \cdot T_x + I \cdot (1 - T_x)}{U^2} \quad (6)$$
These derivatives, and the backpropagation that
follows, are computed by the Keras framework dur-
ing training when the equation used to calculate IoU
is replaced with the differentiable approximation.
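A minimal Keras-compatible sketch of this loss, written against TensorFlow's Keras API, is shown below. The per-sample flattening, the small constant guarding against an empty union and the summation over the batch are our assumptions based on the description above, not necessarily the choices of the original implementation.

```python
import tensorflow as tf

def iou_loss(y_true, y_pred):
    """L_IoU = 1 - IoU' per sample (equations 3 and 4), summed over the batch."""
    y_true = tf.cast(y_true, y_pred.dtype)
    # Flatten each sample to a vector of pixel probabilities.
    y_true = tf.reshape(y_true, (tf.shape(y_true)[0], -1))
    y_pred = tf.reshape(y_pred, (tf.shape(y_pred)[0], -1))

    intersection = tf.reduce_sum(y_true * y_pred, axis=1)               # I
    union = tf.reduce_sum(y_true + y_pred - y_true * y_pred, axis=1)    # U
    iou = intersection / (union + 1e-7)   # small constant in case U is 0
    # Gradients (equations 5 and 6) are obtained automatically by Keras.
    return tf.reduce_sum(1.0 - iou)
```

Defined this way, the loss can be passed to model.compile in the same way as binary cross entropy.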
3 EXPERIMENTS
Training the models using the datasets and loss func-
tions described in section 2 is done according to cer-
tain design choices, such as hyper-parameters of the
models and dataset usage. Table 1 shows the different
splits of the datasets such that 8 distinct experimen-
tal setups are created. Applying these 8 data splits to
each combination of BFCN32-s and FCResNet, either
with BCE or IoU loss, results in 32 distinct setups, the
results of which are presented in section 4.
To ensure no other factors would influence the
outcome of these experiments, any other hyper-
parameters have been kept constant. The relevant
hyper-parameters can be seen in table 2. While
most parameters are found through a parameter sweep
based on the results from previous work (Shelhamer
et al., 2017), the patience parameter is less self-
explanatory. It determines the number of epochs with-
out improvements after which the training stops. This
is our point of convergence. As such this value is
more important for total learning time than the num-
ber of epochs, the maximum of which was never
reached.
Table 1: Different data uses: Number of images used for
training and validation is given by # training and # valida-
tion respectively. The Hair? column indicates whether
hair is included in the face (yes) or in the background (no).
Dataset # Training # Validation Hair?
LFW 2342 292 yes
LFW 1000 100 yes
LFW 2342 292 no
LFW 1000 100 no
HELEN 1864 233 yes
HELEN 1000 100 yes
HELEN 1864 233 no
HELEN 1000 100 no
Table 2: Experimental parameters.
Parameter Value
Epochs 1000
Batch-size 100
Patience 20
Learning Rate 0.0001
Optimizer RMSprop
Using these data splits, models and parameters,
each of the models is trained until convergence as de-
termined by the patience parameter. After training
the model was tested on previously unseen data of
the dataset it was trained on. For both datasets this
was the last 10%, i.e. 292 images for LFW and 233
images for HELEN, as mentioned in section 2.2.
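The sketch below ties the hyper-parameters of table 2 together in a Keras training call. Here build_bfcn32s and iou_loss refer to the illustrative sketches in section 2, x_train, y_train, x_val and y_val are placeholders for the prepared image and mask arrays of one split, and the quantity monitored for early stopping (validation loss) is our assumption.

```python
from tensorflow import keras

# build_bfcn32s and iou_loss are the illustrative sketches from section 2;
# the data arrays are placeholders for one of the splits in table 1.
model = build_bfcn32s()
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-4),
              loss=iou_loss,                 # or "binary_crossentropy" for the baseline
              metrics=["binary_accuracy"])

# Patience of 20 epochs without improvement defines convergence (table 2).
early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=1000,
          batch_size=100,
          callbacks=[early_stopping])
```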
4 RESULTS
The results of the experiments described in section
3 can be seen in tables 3, 4, 5 and 6. These ta-
bles are each structured similarly. Each table displays
the results of a different combination of model and
dataset and presents each of the 4 data splits within
that dataset. Results are evaluated on three metrics:
original binary accuracy, Intersection over Union and
epochs before convergence. For each metric, the re-
sults from the models trained on IoU loss are in the
L_IoU column and the results from binary cross entropy loss are in the L_BCE column. Better performances are marked in bold.
Table 3: Results from BFCN-32s trained and tested on LFW: The name describes whether that setting used all of the training and validation images (big) or only 1000 training images and 100 validation images (small) and whether hair is included as part of the face (hair) or not (no hair). Accuracy refers to the pixel-wise accuracy score for that setting. Time needed for convergence is measured in epochs. Columns marked L_IoU show results for IoU loss. Columns marked L_BCE show results for binary cross entropy loss.

Name             Accuracy          IoU-Score         Convergence
                 L_IoU    L_BCE    L_IoU    L_BCE    L_IoU    L_BCE
big, hair        0.974    0.978    0.921    0.906    91       146
small, hair      0.962    0.963    0.887    0.857    116      98
big, no hair     0.977    0.976    0.896    0.852    89       83
small, no hair   0.970    0.967    0.872    0.837    155      117
Table 4: Results from BFCN-32s trained and tested on HELEN: see table 3 for clarification.

Name             Accuracy          IoU-Score         Convergence
                 L_IoU    L_BCE    L_IoU    L_BCE    L_IoU    L_BCE
big, hair        0.961    0.961    0.877    0.838    125      130
small, hair      0.943    0.940    0.820    0.779    118      131
big, no hair     0.983    0.984    0.905    0.877    83       125
small, no hair   0.978    0.977    0.871    0.837    93       129
Table 5: Results from FCResNet trained and tested on LFW: see table 3 for clarification.

Name             Accuracy          IoU-Score         Convergence
                 L_IoU    L_BCE    L_IoU    L_BCE    L_IoU    L_BCE
big, hair        0.942    0.947    0.829    0.826    75       64
small, hair      0.922    0.942    0.777    0.818    44       113
big, no hair     0.959    0.943    0.821    0.730    78       59
small, no hair   0.951    0.949    0.798    0.757    72       67
Table 6: Results from FCResNet trained and tested on HELEN: see table 3 for clarification.

Name             Accuracy          IoU-Score         Convergence
                 L_IoU    L_BCE    L_IoU    L_BCE    L_IoU    L_BCE
big, hair        0.926    0.905    0.768    0.673    62       50
small, hair      0.916    0.917    0.747    0.723    62       97
big, no hair     0.967    0.962    0.816    0.775    68       72
small, no hair   0.963    0.960    0.796    0.766    88       80
These tables clearly show that while accuracy and
convergence favor either IoU or BCE seemingly at
random, the IoU-score is distinctly higher for the new
IoU loss function in almost all experimental settings.
The exact nature and significance of this improvement
will be detailed in section 5.
5 DISCUSSION
The experiments performed and the results reported
in section 4 show the patterns for the three metrics
used to compare the two loss functions. Each of these
patterns will be discussed briefly. While binary accu-
racy has been established as being less relevant as a
measure of success for a segmentation task, it is nonetheless worth noting that it does not decrease with the use of the IoU loss function.
Binary Accuracy. As can be observed in tables 3, 4,
5 and 6, IoU-loss and BCE-loss score higher on accu-
racy in 56.25% and 37.5% of the cases, respectively.
However, whether IoU-loss or BCE-loss results in a
higher binary accuracy, the results are within 2.5%
in every comparison. A paired t-test on these values
remains inconclusive, with a p-value of 0.52, imply-
ing no significant difference between the mean per-
formance of these loss functions based on binary ac-
curacy.
Intersection over Union. Contrary to binary accu-
racy, the results for Intersection over Union show con-
sistent and significant improvements in almost all ex-
perimental settings when using IoU-loss. A paired t-
test shows that IoU-loss scores better with a p-value
of 4.7 × 10⁻⁴. This is undoubtedly significant. This
p-value is not surprising as IoU performs better in 15
out of 16 test cases. With the exception of one test
case, improvements of multiple percentage points can
be seen across the board. Averaged over all cases, the error margin (1 − IoU) is reduced by 17.5%.
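For reference, the paired t-test and the error-margin reduction reported here can be recomputed from the IoU scores in tables 3 to 6, as sketched below; the use of SciPy's two-sided ttest_rel is our assumption about how the test was applied.

```python
import numpy as np
from scipy.stats import ttest_rel

# IoU scores copied from tables 3-6, one pair per experimental setting.
iou_loss_scores = np.array([0.921, 0.887, 0.896, 0.872,   # BFCN-32s on LFW
                            0.877, 0.820, 0.905, 0.871,   # BFCN-32s on HELEN
                            0.829, 0.777, 0.821, 0.798,   # FCResNet on LFW
                            0.768, 0.747, 0.816, 0.796])  # FCResNet on HELEN
bce_loss_scores = np.array([0.906, 0.857, 0.852, 0.837,
                            0.838, 0.779, 0.877, 0.837,
                            0.826, 0.818, 0.730, 0.757,
                            0.673, 0.723, 0.775, 0.766])

t_stat, p_value = ttest_rel(iou_loss_scores, bce_loss_scores)
print(t_stat, p_value)   # p on the order of 10^-4, in line with the reported 4.7e-4

# Error margin = 1 - IoU; relative reduction of the mean error margin (~17.5%).
reduction = 1.0 - (1.0 - iou_loss_scores.mean()) / (1.0 - bce_loss_scores.mean())
print(f"mean error margin reduced by {reduction:.1%}")
```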
Convergence. While the focus of this research was
on a comparison of performance of the two loss func-
tions, an improvement in training time would also be
relevant. From the results, however, it can be seen that
these training times vary wildly. A paired t-test shows
that the average training time for IoU is lower with a
p-value of 0.256. This shows that no informed con-
clusion can be drawn about either loss function con-
verging faster.
Models. When we compare the results of BFCN-
32s and the results of FCResNet, we observe that the
BFCN-32s model performs better. Although the accu-
racies obtained with the different models do not differ
very much, the IoU scores of BFCN-32s are in gen-
eral much higher. While the encoder part of FCRes-
Net is more developed than the encoder of BFCN-
32s, the performance of the decoder is worse. This is
due to the choice of only applying a single fully con-
volutional layer in FCResNet compared to two fully
convolutional layers in BFCN-32s. This choice was based on the number of trainable parameters that adding these layers would introduce. The output of FCResNet's encoder has considerably more feature maps than that of BFCN-32s, so connecting it to a fully convolutional layer amplifies this disparity in trainable parameters. A second convolutional layer in FCResNet was,
therefore, not feasible and thus the results for this
model are worse than the results for BFCN-32s.
Example Output Images. In figures 5 and 6 example
outputs are shown. These images are from the exper-
imental setting using the BFCN-32s model and the
LFW dataset. They correspond to the example input
and label shown in figure 1 and figure 4, respectively.
While neither segmentation is perfect, the output for
IoU in figure 5 clearly shows the contours of the
hairless face better than the output for BCE in figure
6. Unfortunately, for both images, some up-sampling
artifacts remain. This is most likely due to the coarse
deconvolution layer, which upsamples by a factor of
32. These up-sampling artifacts are much more pro-
nounced upon initialization and are never fully re-
moved during training.
Dataset Size. In section 3 multiple dataset splits
were described. The distinction between large and
small was made to evaluate the effect of the size of
the dataset on the effectiveness of either loss func-
tion. With the results presented in section 4 it can be
seen that whenever a smaller dataset is used, perfor-
mance suffers slightly, as is to be expected. However,
whether IoU or BCE performs better does not change based on the size of the dataset.

Figure 5: Example Output: BFCN-32s with IoU loss on LFW without hair.

Figure 6: Example Output: BFCN-32s with BCE loss on LFW without hair.
6 CONCLUSION
Taking into account both the statistical analysis of the
results and the output images presented, it is clear that
training a segmentation model directly on the Inter-
section over Union objective can lead to significant
improvements. The statistical analysis shows a significant improvement across all model and dataset combinations when using IoU loss instead of BCE loss, when the common measure of success, IoU, is used. As such IoU loss is definitely
an option worth considering when attempting to im-
prove a segmentation model for state-of-the-art per-
formance. This improvement on segmentation is in
accordance with the results from previous work (Rah-
man and Wang, 2016). This research shows that these improvements also occur on densely labeled data and are thus not merely an artifact of data sparsity.
What is also shown by this research is that in
an ever-continuing attempt to improve state-of-the-
art Deep Neural Nets, the area of loss functions has
not been fully explored. This is in agreement with
conclusions from other work (Janocha and Czarnecki,
2017), which state that while cross entropy has been
an unquestionable favourite, adopting one of the vari-
ous other losses can be equally, if not more, effective.
These conclusions, together with the findings of this and other research on the effectiveness of IoU loss, point in the same direction: much research effort goes into architectures, creating deeper or different convolutional networks, while a significant improvement can already be made by choosing a different loss function.
While this research shows that performance im-
proves significantly for the models that were used,
this cannot be claimed for every model. As such,
research on IoU loss with other models, such as
the well-established SegNet (Badrinarayanan et al.,
2017), may support our hypothesis that IoU loss per-
forms better in general for semantic segmentation
tasks.
In section 1 an explanation is given why the loss
function by (Rahman and Wang, 2016) is preferred
to the loss functions proposed by (Yuan et al., 2017;
Berman and Blaschko, 2017). In future research, it would also be interesting to compare these IoU-based loss functions directly.
Finally, the claim has been made that the bene-
fit from training on IoU directly will only magnify
when a model is presented with sparse data. This has
not been evaluated in this research and can be done
by expanding the models presented here to perform
a segmentation task on multiple classes. This would
significantly reduce the proportion of positive labels in
a dataset and thus be a way to explore the hypothe-
sis that an IoU loss function outperforms binary cross
entropy on sparser data. In previous work (Rahman
and Wang, 2016) sparse data is already used. However, in that work the sparsity of the data is taken as it is and
not isolated to determine its effect on the performance
of the IoU loss function. As such future work could
focus specifically on certain datasets, comparing per-
formance on sparse and dense data.
REFERENCES
Ahmed, F., Tarlow, D., and Batra, D. (2015). Optimiz-
ing expected intersection-over-union with candidate-
constrained CRFs. In 2015 IEEE International Con-
ference on Computer Vision (ICCV), pages 1850–
1858.
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2017).
Segnet: A deep convolutional encoder-decoder ar-
chitecture for image segmentation. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
39(12):2481–2495.
Berman, M. and Blaschko, M. B. (2017). Optimization
of the jaccard index for image segmentation with the
Lovász Hinge. CoRR, abs/1705.08790.
Chollet, F. et al. (2015). Keras. https://keras.io.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-
Fei, L. (2009). ImageNet: A large-scale hierarchical
image database.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn,
J., and Zisserman, A. (2011). The PASCAL Visual
Object Classes Challenge 2011 (VOC2011) Results.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep
residual learning for image recognition. CoRR,
abs/1512.03385.
Janocha, K. and Czarnecki, W. M. (2017). On loss func-
tions for deep neural networks in classification. CoRR,
abs/1702.05659.
Kae, A., Sohn, K., Lee, H., and Learned-Miller, E. (2013).
Augmenting CRFs with Boltzmann machine shape
priors for image labeling. In the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Le, V., Brandt, J., Lin, Z., Bourdev, L., and Huang, T. S.
(2012). Interactive facial feature localization. In Pro-
ceedings of the 12th European Conference on Com-
puter Vision - Volume Part III, ECCV’12, pages 679–
692, Berlin, Heidelberg. Springer-Verlag.
Noh, H., Hong, S., and Han, B. (2015). Learning de-
convolution network for semantic segmentation. In
Proceedings of the IEEE International Conference on
Computer Vision, pages 1520–1528.
Nowozin, S. (2014). Optimal decisions from probabilistic
models: The intersection-over-union case. In 2014
IEEE Conference on Computer Vision and Pattern
Recognition, pages 548–555.
Rahman, M. A. and Wang, Y. (2016). Optimizing
intersection-over-union in deep neural networks for
image segmentation. In International Symposium on
Visual Computing.
Shelhamer, E., Long, J., and Darrell, T. (2017). Fully con-
volutional networks for semantic segmentation. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 39(4):640–651.
Siam, M., Gamal, M., Abdel-Razek, M., Yogamani, S.,
and Jägersand, M. (2018). RTSeg: Real-time se-
mantic segmentation comparative study. CoRR,
abs/1803.02758.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Yuan, Y., Chao, M., and Lo, Y. C. (2017). Automatic
skin lesion segmentation using deep fully convolu-
tional networks with Jaccard distance. IEEE Trans-
actions on Medical Imaging, 36(9):1876–1886.
Zhao, H., Gallo, O., Frosio, I., and Kautz, J. (2017).
Loss functions for image restoration with neural net-
works. IEEE Transactions On Computational Imag-
ing, 3(1):47–57.