Statistical Inference of the Inter-sample Dice Distribution for Discriminative CNN Brain Lesion Segmentation Models

Kevin Raina

a

Department of Mathematics and Statistics, University of Ottawa, Ontario, Canada

Keywords:

Uncertainty, Brain Lesions, MRI, Segmentation Sampling, Convolutional Neural Network, Discriminative.

Abstract:

Discriminative convolutional neural networks (CNNs), for which a voxel-wise conditional Multinoulli distribution is assumed, have performed well in many brain lesion segmentation tasks. For a trained discriminative CNN to be used in clinical practice, the patient's radiological features are input into the model, which then produces a conditional distribution of segmentations. Capturing the uncertainty of the predictions can be useful in deciding whether to abandon a model or to choose among competing models. In practice, however, the ground truth segmentation is never known, and therefore neither is the true model variance. In this work, segmentation sampling on discriminative CNNs is used to assess a trained model's robustness by analyzing the inter-sample Dice distribution on a new patient based solely on their magnetic resonance (MR) images. Furthermore, by demonstrating that the inter-sample Dice observations are independent and identically distributed with a finite mean and variance under certain conditions, a rigorous confidence-based decision rule is proposed to decide whether to reject or accept a CNN model for a particular patient. Applied to the ISLES 2015 (SISS) dataset, the decision rule identified 7 predictions as non-robust, and the average Dice coefficient calculated on the remaining brains improved by 12 percentage points (from 0.62 to 0.74).

1 INTRODUCTION

Discriminative CNNs, such as those constructed by (Kamnitsas et al., 2017; Havaei et al., 2017; Ronneberger et al., 2015), have consistently ranked at the top of the leaderboard in many brain lesion segmentation challenges (Maier et al., 2017; Winzeck et al., 2018; Bakas et al., 2018). Commonly, they are formulated by assuming a voxel-wise Multinoulli distribution, conditional on the MRI intensities of neighboring voxels, whose parameters are estimated with a CNN. For the case of 2 labels, if $x_j$ denotes the 3D MRI intensities used in the prediction of voxel $j$ over a discrete domain $\Omega \subset \mathbb{R}^3$, $Y_j$ represents the random label at voxel $j$, and $\Theta$ are the parameters of a CNN architecture, then the model is formulated as

$$Y_j \mid x_j \sim \mathrm{Bernoulli}\big(\pi_j = \mathrm{CNN}(\Theta, x_j)\big). \tag{1}$$

At prediction time, the translational invariance property of the CNN allows for a fast way to estimate all the voxel-wise conditional distributions, without having to separately feed their respective covariates. When a model is fully trained, the parameters in (1) are replaced with an estimate, $\hat{\Theta}$. In this way, discriminative CNNs are capable of sampling segmentations by sampling the associated label at each voxel, conditionally independent of each other. Despite this source of variability, a decision rule of selecting the label with the highest probability is often applied, rendering the segmentation deterministic.

a https://orcid.org/0000-0002-6240-9675

Raina, K.
Statistical Inference of the Inter-sample Dice Distribution for Discriminative CNN Brain Lesion Segmentation Models.
DOI: 10.5220/0010286201680173
In Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021) - Volume 2: BIOIMAGING, pages 168-173
ISBN: 978-989-758-490-9
Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

Measuring the performance variation of trained discriminative CNNs in brain lesion segmentation on a patient-by-patient basis can help instill in clinical practitioners a degree of confidence in the use of automated segmentation methods. One metric of model performance is the Dice coefficient (Dice, 1945; Sørensen, 1948), which measures the similarity between two 3D voxel-wise labeled images, $S_1$ and $S_2$. The Dice coefficient is a real number in $[0, 1]$ given by

$$\mathrm{Dice}(S_1, S_2) = \frac{2TP}{2TP + FN + FP}, \tag{2}$$

where a value of 0 indicates no overlap and a value of 1 indicates perfect similarity. TP, FP and FN refer to the number of true positive, false positive, and false negative voxels in the medical image, respectively. Formulating this incentive mathematically, if $D$ denotes the Dice coefficient random variable, which incorporates randomness across patients and across model segmentations, and $x$ collects all the covariates in (1) across all voxels in the image for the patient to form a high-dimensional vector, then the interest lies in measuring the model variance

$$\mathrm{Var}(D \mid x, \hat{\Theta}). \tag{3}$$

Here the Dice random variable is theoretically measured between the true segmentation, which is never known, and the random segmentations produced by a trained discriminative CNN model on the particular patient. Intuitively, observations of $D$ can be generated by observing a new random patient whose radiological features are drawn independently of other patients, randomly generating a segmentation from the trained discriminative CNN given their radiological features, and calculating the Dice coefficient against the ground truth. In (3), the first source of variation is eliminated by conditioning on the patient. Estimating (3) can also help suggest the complete abandonment of an algorithm in favor of another, as some CNNs may be better tuned to deal with specific brain lesion characteristics such as small lesion volume.
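To make the Dice coefficient in (2) concrete, the following minimal sketch computes it for two binary label vectors. The function name `dice`, the flat-list representation, and the toy vectors are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def dice(s1, s2):
    """Dice coefficient between two binary label sequences of equal length.

    s2 plays the role of the reference segmentation. Returns None when both
    inputs are all zero, because 2TP + FN + FP = 0 leaves the ratio undefined.
    """
    if len(s1) != len(s2):
        raise ValueError("segmentations must have the same number of voxels")
    tp = sum(1 for a, b in zip(s1, s2) if a == 1 and b == 1)  # true positives
    fp = sum(1 for a, b in zip(s1, s2) if a == 1 and b == 0)  # false positives
    fn = sum(1 for a, b in zip(s1, s2) if a == 0 and b == 1)  # false negatives
    denom = 2 * tp + fn + fp
    return None if denom == 0 else 2 * tp / denom


# Toy example (hypothetical labels): TP = 2, FP = 1, FN = 1.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
score = dice(pred, truth)  # 2*2 / (2*2 + 1 + 1) = 2/3
```

Swapping the two arguments exchanges FP and FN, so the value is unchanged; this symmetry is what allows either of two sampled segmentations to be treated as the reference.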

Despite these advantages, some brain lesion segmentation challenges instead compute variability in the Dice metric across patient predictions (Maier et al., 2017). Moreover, for a particular patient, only a single deterministic segmentation is produced by some decision rule. That is, $\mathrm{Var}(D \mid \hat{\Theta})$ becomes the measure to be estimated, but this estimate does not solely account for model variability. It is important to note that this measure is unconditional on the patient's radiological features, and thus inherently incorporates additional variability. The merit of this metric is in its applicability, as not all segmentation algorithms are capable of sampling segmentations. However, in the case of discriminative CNNs, this metric by its very nature can only quantify variation in the Dice coefficient across patients, and could lead to completely discarding a model that may in fact perform well on certain kinds of patients. Another disadvantage is that this metric is obtained from the training set, and then must be generalized to arbitrary cases, which may present significant differences. Analysis of model performance variance on a single patient (3), conditional on their radiological features, eliminates these extra sources of variation, and can be more useful in clinical practice by supporting specialized patient care.

One step in this direction was taken in the BraTS 2019 challenge, which is an extension of (Bakas et al., 2018). In particular, voxel-wise measures of segmentation uncertainty ranging from 0 to 100 were calculated from 3D MRIs for all patients individually. Then, at specified thresholds, uncertain voxels were filtered out in the calculation of the Dice coefficient. Depending on the structure of the uncertainties, this method can reward or penalize the Dice score, but this can never be confirmed in practice since the ground truth segmentation is never known.

Segmentation sampling and consequent analysis of the inter-sample Dice distribution has been undertaken for generative models, for instance, by Lê et al. (Lê et al., 2016). In their work, they use a Gaussian process to produce segmentation samples based on a single expert manual segmentation of grade 4 gliomas. The mean of the inter-sample Dice distribution and the variability of the segmentations can be controlled by a single model parameter. The samples can then be used in radiotherapy planning by delivering radiation to certain voxels and avoiding dose to uncertain voxels, where perhaps there are more sensitive tissues.

Their contribution mentions the applicability of the method in evaluating uncertainty in the performance of segmentation algorithms, by repeatedly sampling segmentations off the ground truth and calculating the variability against a deterministic predicted segmentation. Though this method can be applied to arbitrary segmentation algorithms, the source of variation is not produced by the model predictions, as it is in the case of discriminative CNNs. As a consequence, the method is useful for assessing the quality of a particular segmentation produced by a model, but not the model itself. Another related work by Roy et al. (Roy et al., 2018) shows that the inter-sample Dice coefficient correlates with Dice performance using a discriminative CNN for the segmentation of brain scans from children with psychiatric disorders (CANDI-13 dataset), and suggests it as a measure for quantifying uncertainty.

In this work, the inter-sample Dice observations are shown to be independent and identically distributed samples from a distribution with finite mean and variance, under certain conditions. Segmentations are sampled directly from a discriminative CNN. The mean of the inter-sample Dice distribution, with a confidence interval, is then estimated using the central limit theorem and used in place of (3) to decide whether to reject or accept the CNN model on patients with ischemic stroke. This paper is organized as follows: Section 2 describes the methods, Section 3 presents the results, and a discussion follows in Section 4.


2 METHODS

2.1 Architecture

The CNN architecture considered is Wider2DSeg, which was originally constructed by Kamnitsas et al. (Kamnitsas et al., 2017) and is a two-dimensional variant of their 3D DeepMedic architecture. Raina et al. (Raina et al., 2020) tuned this architecture in conjunction with additional symmetry covariates obtained from a reflective registration step to yield a 0.62 Dice coefficient over a 7-fold cross-validation on the ISLES 2015 SISS dataset. All the implementation details in this work are exactly as in the Wider2DSeg implementation of Raina et al. (Raina et al., 2020) with nonlinear symmetry covariates.

2.2 Segmentation Sampling

Referencing equation (1), suppose there are $V$ voxels in an image. Attention is restricted to the case of 2 voxel labels, though this analysis can extend to an arbitrary finite number of labels. Let $S \in \{0, 1\}^V$ represent the segmentation random vector of a trained discriminative CNN model, where each element $S_k = Y_j$ for a unique $j \in \Omega$. The segmentation vector is unconditional and incorporates randomness across patients. Although in what follows it does not matter which permutation in the assignment of voxels to array elements is used, the assignment must be chosen and fixed. Then, denote $x$ as the patient's radiological features as formulated in equation (3), and $\hat{\Theta}$ as the estimated parameters of the CNN. Segmentations can be sampled conditional on a patient's radiological features if conditional independence is assumed across voxels. In this case, the distribution is completely known:

$$S \mid x \sim p(s; x, \hat{\Theta}) = \prod_{j=1}^{V} \pi_j(x_j, \hat{\Theta})^{s_j} \big(1 - \pi_j(x_j, \hat{\Theta})\big)^{1 - s_j}. \tag{4}$$
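The distribution in (4) can be illustrated with a small sketch that draws a segmentation voxel by voxel, evaluates its probability, and contrasts it with the deterministic highest-probability rule. The function names and the three-voxel probability vector are hypothetical, and a flat Python list stands in for the probability map that a trained CNN would output.

```python
import random


def sample_segmentation(pi, rng):
    """Draw one segmentation from (4): voxel j gets label 1 with probability pi[j]."""
    return [1 if rng.random() < p else 0 for p in pi]


def segmentation_probability(s, pi):
    """Evaluate the product in (4) for a given label vector s."""
    prob = 1.0
    for s_j, pi_j in zip(s, pi):
        prob *= pi_j if s_j == 1 else 1.0 - pi_j
    return prob


def argmax_segmentation(pi):
    """Deterministic rule: choose the more probable label (ties go to label 0)."""
    return [1 if p > 0.5 else 0 for p in pi]


pi = [0.9, 0.2, 0.5]               # hypothetical lesion probabilities for 3 voxels
rng = random.Random(0)             # seeded for reproducibility
s = sample_segmentation(pi, rng)   # a random element of {0, 1}^3
p_example = segmentation_probability([1, 0, 1], pi)  # 0.9 * 0.8 * 0.5 = 0.36
hard = argmax_segmentation(pi)     # [1, 0, 0]
```

Repeated calls to `sample_segmentation` give different label vectors whenever some probability lies strictly between 0 and 1, whereas `argmax_segmentation` always returns the same one.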

2.3 Inter-sample Dice Distribution

Let $S^*_1$ and $S^*_2$ represent two iid model segmentation random vectors for a given patient, each distributed according to (4). By independence, the joint distribution of these samples is given by

$$(S^*_1, S^*_2) \sim p(s^*_1; x, \hat{\Theta})\, p(s^*_2; x, \hat{\Theta}). \tag{5}$$

In this way, the inter-sample Dice random variable is just a function of the random vectors. In particular, if $S^*_2$ is seen as the ground truth, then the Dice coefficient of $S^*_1$ can be computed against it, and vice versa. Define $\Gamma$ to be the inter-sample Dice random variable. Then

$$\Gamma = \mathrm{Dice}(S^*_1, S^*_2). \tag{6}$$

In fact, the complete distribution of $\Gamma$ can be obtained, but doing so can be computationally expensive as the number of voxels is often large. That is, if $\mathcal{D}_{(\cdot)}$ represents the support of a random variable, then $|\mathcal{D}_{S^*_1}| \le 2^V$, where the upper bound is attained when $0 < \pi_j < 1$ for all voxels in the image. By independence of $S^*_1$ and $S^*_2$, this further implies that $|\mathcal{D}_{(S^*_1, S^*_2)}| \le 2^{2V}$. Subsequently, after applying the Dice transformation (6), which may at most associate a unique Dice value with each element of $\mathcal{D}_{(S^*_1, S^*_2)}$, the inequality $|\mathcal{D}_\Gamma| \le 2^{2V}$ is obtained. Hence, computing the exact Dice distribution may require the calculation of up to $2^{2V}$ probabilities, where $V$ tends to be in the millions. Nevertheless, since $|\mathcal{D}_\Gamma|$ is bounded by $2^{2V}$, $\Gamma$ has a discrete distribution with finite support.
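For a very small image the exact distribution can be enumerated directly, which makes the $2^{2V}$ counting argument tangible. The sketch below is illustrative only: the helper names and the three-voxel probabilities are assumptions, exact rational arithmetic is used so that distinct Dice values are counted without rounding, and the all-zero pair (whose Dice ratio has a zero denominator) is stored under the key `None`.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def seg_prob(s, pi):
    """Probability of label vector s under independent Bernoulli voxels."""
    prob = Fraction(1)
    for s_j, pi_j in zip(s, pi):
        prob *= pi_j if s_j == 1 else 1 - pi_j
    return prob


def dice(s1, s2):
    """Exact Dice value; uses 2TP + FP + FN = |s1| + |s2|. None if undefined."""
    tp = sum(a & b for a, b in zip(s1, s2))
    total = sum(s1) + sum(s2)
    return Fraction(2 * tp, total) if total else None


def exact_isd_distribution(pi):
    """Enumerate all 2^(2V) pairs and accumulate probability per Dice value."""
    dist = defaultdict(Fraction)
    for s1 in product((0, 1), repeat=len(pi)):
        for s2 in product((0, 1), repeat=len(pi)):
            p = seg_prob(s1, pi) * seg_prob(s2, pi)
            if p > 0:
                dist[dice(s1, s2)] += p
    return dict(dist)


pi = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 4)]   # hypothetical, V = 3
dist = exact_isd_distribution(pi)
undefined_mass = dist.pop(None, Fraction(0))            # P(both samples all zero)
# Renormalise the remaining mass to condition on a defined Dice value.
conditional = {g: p / (1 - undefined_mass) for g, p in dist.items()}
exact_mean = sum(g * p for g, p in conditional.items())
```

With $V = 3$ there are only 64 pairs to visit; each additional voxel multiplies that count by four, which is why enumeration is impractical for real images and sampling is used instead.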

In order to conduct statistical inference on $E[\Gamma]$, it must be well defined and finite; however, this is not the case. Notice that if $(S^*_1, S^*_2) = (0, 0)$, $\Gamma$ is undefined, and this outcome has a non-zero probability whenever $\pi_j < 1$ for every voxel. In fact, this is the only way for $\Gamma$ to be undefined. To correct this, two necessary modifications are introduced: 1) there exists some $j$ with $\pi_j \neq 0$, and 2) the inter-sample Dice random variable is redefined as

$$\Gamma^* = \Gamma \mid (S^*_1, S^*_2) \neq (0, 0). \tag{7}$$

The first condition ensures that $(0, 0)$ can never be the only sample generated, thereby permitting its removal while still retaining a distribution through the redefinition in the second condition. The second condition redefines the inter-sample Dice random variable to be conditional on all samples drawn from (5) except $(0, 0)$. Now, $0 \le \gamma \le 1$ for all $\gamma \in \mathcal{D}_{\Gamma^*}$, where $\mathcal{D}_{\Gamma^*} \neq \emptyset$. Then

$$E[(\Gamma^*)^2] < \infty. \tag{8}$$
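The step from boundedness to (8) can be spelled out in one line. The following is a short worked bound added as a sketch; the symbol $\mu$ is introduced here only for convenience. Because $0 \le \Gamma^* \le 1$ implies $(\Gamma^*)^2 \le \Gamma^*$, both the second moment and the variance are bounded explicitly:

```latex
\begin{aligned}
E\big[(\Gamma^*)^2\big] &\le E[\Gamma^*] =: \mu \le 1, \\
\mathrm{Var}(\Gamma^*) &= E\big[(\Gamma^*)^2\big] - \mu^2 \le \mu(1 - \mu) \le \tfrac{1}{4}.
\end{aligned}
```

So the variance of the redefined inter-sample Dice variable never exceeds $1/4$, whatever the voxel probabilities are.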

As a consequence, the mean of this distribution can be estimated from sampled observations. Independent and identically distributed pairs of segmentations can be sampled from (5) to generate an iid sample of inter-sample Dice observations. To adjust for the second modification, any $(0, 0)$ sample, or equivalently any undefined Dice score, is removed in the sampling phase. Moreover, since the inter-sample Dice distribution has a finite mean and variance, the central limit theorem can be applied, and the associated confidence intervals can be constructed.

BIOIMAGING 2021 - 8th International Conference on Bioimaging


Computationally, this method can be undertaken by first producing the estimated probability tensor. Each voxel of the probability tensor holds the estimated label probabilities obtained from the features and CNN parameter estimates. Once this tensor is calculated, segmentations are sampled by drawing from the associated Multinoulli distribution at each voxel, iterating over all voxels. Since the probability tensor has the same layout as the actual image, each draw can be viewed directly as a segmentation of that image.
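The sampling procedure just described can be sketched as follows, assuming the probability tensor has already been flattened into a list of lesion probabilities. All names and the toy probability map are hypothetical; a practical implementation would more likely operate on array-valued tensors, and the choice to redraw until `n` defined observations are collected is one reading of how discarded pairs are handled.

```python
import random


def sample_segmentation(pi, rng):
    """Draw a binary label for every voxel from its estimated probability."""
    return [1 if rng.random() < p else 0 for p in pi]


def dice(s1, s2):
    """Dice coefficient of two binary vectors; None when both are all zero."""
    tp = sum(a & b for a, b in zip(s1, s2))
    total = sum(s1) + sum(s2)      # equals 2TP + FP + FN
    return 2 * tp / total if total else None


def sample_isd(pi, n, rng):
    """Collect n inter-sample Dice observations, discarding all-zero pairs."""
    if not any(p > 0 for p in pi):
        # Without some pi_j > 0 every pair is (0, 0) and the loop never ends.
        raise ValueError("need pi_j > 0 for at least one voxel")
    observations = []
    while len(observations) < n:
        s1 = sample_segmentation(pi, rng)   # the two draws are independent
        s2 = sample_segmentation(pi, rng)
        gamma = dice(s1, s2)
        if gamma is not None:               # drop undefined Dice scores
            observations.append(gamma)
    return observations


rng = random.Random(2021)
# Toy probability map: a confident lesion core, an uncertain rim, background.
pi = [0.95] * 40 + [0.5] * 10 + [0.02] * 200
isd = sample_isd(pi, n=30, rng=rng)
```

Each observation uses a fresh pair of segmentations, so the 30 values are independent draws from the same distribution.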

2.4 Decision Rule

Let $\gamma_1, \ldots, \gamma_n$ be an iid sample of realized inter-sample Dice coefficients for a given patient, and construct an approximate $100(1 - \alpha)\%$ confidence interval for the mean based on the central limit theorem. For a specified threshold, reject the use of a discriminative CNN model on a patient if the confidence interval lies entirely below the threshold. In this manner, the clinician can justify with $100(1 - \alpha)\%$ confidence that the true mean of the inter-sample Dice distribution is below the given threshold, and appropriate actions to reject the model in favor of another can be undertaken.
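The rule can be sketched in a few lines. The interval below uses the standard normal quantile and the sample standard deviation, which is one common way to build a central limit theorem interval and is an assumption here, since the text does not specify the exact construction. The function names and the two small samples are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def clt_interval(samples, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for the mean."""
    n = len(samples)
    center = mean(samples)
    z = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.96 for alpha = 0.05
    half_width = z * stdev(samples) / sqrt(n)      # stdev is the sample (n - 1) version
    return center - half_width, center + half_width


def reject_model(samples, threshold, alpha=0.05):
    """Reject when the whole interval lies below the threshold."""
    _, upper = clt_interval(samples, alpha)
    return upper < threshold


# Hypothetical inter-sample Dice observations for two patients.
unstable = [0.58, 0.61, 0.57, 0.60, 0.59, 0.62]
stable = [0.94, 0.95, 0.93, 0.96, 0.95, 0.94]

reject_unstable = reject_model(unstable, threshold=0.85)   # True
reject_stable = reject_model(stable, threshold=0.85)       # False
```

Only six observations per patient are used here to keep the example short; the normal approximation is more defensible with larger samples such as the 30 used in the experiments.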

3 EXPERIMENTS AND RESULTS

3.1 Dataset

The discriminative CNN model is trained and evaluated on the ISLES 2015 (SISS) training data, which consists of 28 patients with sub-acute ischemic stroke. The radiological features $x$ for a patient are 4 MR sequences: FLAIR, DWI, T1 and T1-contrast. Each image has a total of 230 × 230 × 154 voxels. At each voxel, there are only two possible label classes: lesion or non-lesion.

3.2 Efficacy of Decision Rule

The proposed segmentation sampling based decision rule is applied to the results of Raina et al. (Raina et al., 2020), which yielded an average Dice coefficient (predicted against ground truth) of 0.62 over a 7-fold cross-validation from single deterministic segmentations obtained by selecting the label with the highest probability. For each fold in the cross-validation and for each brain in the validation fold, 30 inter-sample Dice (ISD) observations are sampled, and the central limit theorem confidence interval is computed. Table 1 displays the patient-by-patient results. The Pearson correlation between mean ISD and Dice over the 28 patients was computed to be r = 0.81. Figure 1 plots the mean ISD against Dice score for each patient, and depicts the fitted regression line. The threshold was set to 0.85 with reference to the regression line in Figure 1, and corresponds to a Dice score of approximately 0.60. In addition, the confidence level is 95%. By removing the 7 rejected predictions, the average Dice coefficient increased from 0.62 to 0.74.

Table 1: Case-by-case hypothesis testing for Wider2DSeg on ISLES 2015 SISS. The inter-sample Dice confidence interval is computed using the central limit theorem with 30 samples. Rows marked with an asterisk indicate rejecting the use of the CNN on the patient as per the decision rule.

Patient No.   ISD confidence interval   Dice
1             0.940339 ± 0.000116       0.866798
2             0.955828 ± 0.000243       0.815299
3             0.897922 ± 0.001002       0.736231
4             0.946641 ± 0.000122       0.797607
5             0.944956 ± 0.000133       0.857821
6             0.965364 ± 0.000146       0.905630
7             0.943856 ± 0.000178       0.822416
8             0.909659 ± 0.000349       0.702697
9             0.969197 ± 0.000076       0.854602
10            0.875187 ± 0.000340       0.592438
11            0.933384 ± 0.000399       0.775896
12            0.888211 ± 0.000810       0.517170
13            0.850324 ± 0.001449       0.277828
14            0.980536 ± 0.000095       0.815283
15            0.974488 ± 0.000179       0.887611
16*           0.580227 ± 0.004708       0.009584
17*           0.508344 ± 0.005390       0.164190
18            0.949571 ± 0.000435       0.730066
19*           0.730039 ± 0.003967       0.487031
20            0.950651 ± 0.000441       0.795393
21*           0.687765 ± 0.003033       0.434674
22            0.863127 ± 0.000823       0.685439
23*           0.754482 ± 0.002522       0.634320
24            0.861986 ± 0.001278       0.507946
25            0.916227 ± 0.001130       0.761670
26*           0.598502 ± 0.006355       0.191637
27*           0.824157 ± 0.000633       0.000156
28            0.887825 ± 0.001183       0.732581
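The reported figures can be re-derived from Table 1 by applying the decision rule to each row. The sketch below copies the table values and treats the quantity after "±" as the half-width of the 95% interval, which is an interpretation of the table rather than something stated explicitly; the variable names are hypothetical.

```python
# (patient, mean ISD, interval half-width, Dice against ground truth)
TABLE_1 = [
    (1, 0.940339, 0.000116, 0.866798), (2, 0.955828, 0.000243, 0.815299),
    (3, 0.897922, 0.001002, 0.736231), (4, 0.946641, 0.000122, 0.797607),
    (5, 0.944956, 0.000133, 0.857821), (6, 0.965364, 0.000146, 0.905630),
    (7, 0.943856, 0.000178, 0.822416), (8, 0.909659, 0.000349, 0.702697),
    (9, 0.969197, 0.000076, 0.854602), (10, 0.875187, 0.000340, 0.592438),
    (11, 0.933384, 0.000399, 0.775896), (12, 0.888211, 0.000810, 0.517170),
    (13, 0.850324, 0.001449, 0.277828), (14, 0.980536, 0.000095, 0.815283),
    (15, 0.974488, 0.000179, 0.887611), (16, 0.580227, 0.004708, 0.009584),
    (17, 0.508344, 0.005390, 0.164190), (18, 0.949571, 0.000435, 0.730066),
    (19, 0.730039, 0.003967, 0.487031), (20, 0.950651, 0.000441, 0.795393),
    (21, 0.687765, 0.003033, 0.434674), (22, 0.863127, 0.000823, 0.685439),
    (23, 0.754482, 0.002522, 0.634320), (24, 0.861986, 0.001278, 0.507946),
    (25, 0.916227, 0.001130, 0.761670), (26, 0.598502, 0.006355, 0.191637),
    (27, 0.824157, 0.000633, 0.000156), (28, 0.887825, 0.001183, 0.732581),
]
THRESHOLD = 0.85

# Reject when the upper end of the interval is still below the threshold.
rejected = [pid for pid, isd, half, _ in TABLE_1 if isd + half < THRESHOLD]
kept_dice = [d for _, isd, half, d in TABLE_1 if isd + half >= THRESHOLD]

mean_all = sum(d for _, _, _, d in TABLE_1) / len(TABLE_1)
mean_kept = sum(kept_dice) / len(kept_dice)

print(rejected)                                  # [16, 17, 19, 21, 23, 26, 27]
print(round(mean_all, 2), round(mean_kept, 2))   # 0.62 0.74
```

Patient 13 is a borderline case: its mean ISD of 0.850324 sits just above the threshold, so the interval is not entirely below 0.85 and the prediction is kept despite a Dice score of 0.277828.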


Figure 1: Plot of mean ISD against Dice score with regression line $y = 1.66x - 0.82$, where $y$ is Dice score and $x$ is mean ISD. The F-test yielded a p-value of $1.4 \times 10^{-7}$.

4 DISCUSSION

The reason why ISD is highly correlated with Dice performance for discriminative CNNs, and can be used to detect weak segmentations, is not entirely clear. As a counter-example, consider a model that always predicts P(lesion) = 0 for all but one voxel, which instead has P(lesion) = 1. Then the ISD would always be 1, since the model produces exactly the same segmentation with probability 1, regardless of the input. However, it would be unexpected to see a high Dice score for this model. The coupling of CNN outputs and ISD for detecting uncertain segmentations requires further investigation for a deeper understanding of its performance.
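The counter-example can be checked numerically with a toy setup. Everything in the sketch is hypothetical: a 100-voxel image, a degenerate probability map with a single voxel at probability 1, and a made-up ground truth lesion that does not contain that voxel.

```python
import random


def sample_segmentation(pi, rng):
    """Draw a binary label for every voxel from its probability."""
    return [1 if rng.random() < p else 0 for p in pi]


def dice(s1, s2):
    """Dice coefficient of two binary vectors; None when both are all zero."""
    tp = sum(a & b for a, b in zip(s1, s2))
    total = sum(s1) + sum(s2)
    return 2 * tp / total if total else None


rng = random.Random(0)
pi = [0.0] * 99 + [1.0]          # degenerate model: one certain voxel
truth = [1] * 20 + [0] * 80      # hypothetical lesion elsewhere in the image

# Every draw is identical, so each inter-sample Dice observation equals 1.
isd = [dice(sample_segmentation(pi, rng), sample_segmentation(pi, rng))
       for _ in range(10)]

# The same segmentation has no overlap with the assumed ground truth.
dice_vs_truth = dice(sample_segmentation(pi, rng), truth)
```

Here the mean ISD is 1 while the Dice score against the assumed ground truth is 0, so a high ISD alone does not guarantee an accurate segmentation.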

One important remark is that the Dice metric can be substituted by other metrics such as sensitivity, specificity, mean squared error, or precision, and the preceding analysis would also follow for these distributions, thereby permitting hypothesis testing. Moreover, the computations considered in this work were over the entire brain, but could also be calculated on specific regions of interest (ROIs). Deciding which metrics to use, and applying them to more detailed brain sub-regions, could improve the decision-making potential, and is a possible area of development.

Another remark is that the proposed method can be used to rigorously test competing discriminative models based on their respective inter-sample mean Dice confidence intervals, and to select the most robust one on an individualized patient basis. Applying this unifying technique to all competing CNNs in a brain lesion challenge may yield the best possible performance, without any reference to the ground truth. Segmentation challenges have recently begun to incorporate uncertainty analysis, but further work is required to apply these techniques to various types of brain lesion structures.

REFERENCES

Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R. T., Berger, C., Ha, S. M., Rozycki, M., et al. (2018). Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629.

Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3):297–302.

Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P.-M., and Larochelle, H. (2017). Brain tumor segmentation with deep neural networks. Medical image analysis, 35:18–31.

Kamnitsas, K., Ledig, C., Newcombe, V. F., Simpson, J. P., Kane, A. D., Menon, D. K., Rueckert, D., and Glocker, B. (2017). Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical image analysis, 36:61–78.

Lê, M., Unkelbach, J., Ayache, N., and Delingette, H. (2016). Sampling image segmentations for uncertainty quantification. Medical image analysis, 34:42–51.

Maier, O., Menze, B. H., von der Gablentz, J., Häni, L., Heinrich, M. P., Liebrand, M., Winzeck, S., Basit, A., Bentley, P., Chen, L., et al. (2017). ISLES 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Medical image analysis, 35:250–269.

Raina, K., Yahorau, U., and Schmah, T. (2020). Exploiting bilateral symmetry in brain lesion segmentation with reflective registration. In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: BIOIMAGING, pages 116–122. INSTICC, SciTePress.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer.

Roy, A. G., Conjeti, S., Navab, N., and Wachinger, C. (2018). Inherent brain segmentation quality control from fully convnet Monte Carlo sampling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 664–672. Springer.

Sørensen, T. J. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. I kommission hos E. Munksgaard.

Winzeck, S., Hakim, A., McKinley, R., Pinto, J. A., Alves, V., Silva, C., Pisov, M., Krivov, E., Belyaev, M., Monteiro, M., et al. (2018). ISLES 2016 and 2017-benchmarking ischemic stroke lesion outcome prediction based on multispectral MRI. Frontiers in neurology, 9.
