Residual Convolutional Neural Networks for Breast Density
Classification
Francesca Lizzi 1,2,4, Stefano Atzori 3, Giacomo Aringhieri 3, Paolo Bosco 1, Carolina Marini 3,
Alessandra Retico 1, Antonio C. Traino 3, Davide Caramella 2,3 and M. Evelina Fantacci 1,2
1 Istituto Nazionale di Fisica Nucleare (INFN), Pisa, Italy
2 University of Pisa, Pisa, Italy
3 Azienda Ospedaliero-Universitaria Pisana (AOUP), Pisa, Italy
4 Scuola Normale Superiore, Pisa, Italy
Keywords:
Convolutional Neural Networks, Breast Density, BI-RADS, Residual Neural Networks.
Abstract:
In this paper, we propose a data-driven method to classify mammograms according to breast density in the BI-RADS standard. About 2000 mammographic exams have been collected from the “Azienda Ospedaliero-Universitaria Pisana” (AOUP, Pisa, IT) and labeled by a radiologist according to breast density in the BI-RADS standard. We then built a Residual Neural Network to classify breast density in two ways. First, we classified mammograms into two “super-classes”, dense and non-dense breast. Second, we trained the residual neural network to classify mammograms according to the four classes of the BI-RADS standard. We evaluated the performance in terms of accuracy and obtained very good results compared to other works on similar classification tasks. In the near future, we plan to improve the results by increasing the computing power, improving the quality of the ground truth and increasing the number of samples in the dataset.
1 INTRODUCTION
Breast cancer is one of the most frequently diagnosed and most fatal cancers worldwide (International Agency for Research on Cancer, 2018). The strongest weapons we have against it are prevention and early diagnosis. It has been estimated that one woman in eight will develop breast cancer in her lifetime (Loberg et al., 2015). It is also widely accepted that early diagnosis is one of the most powerful instruments we have in fighting this cancer (Loberg et al., 2015).
Full Field Digital Mammography (FFDM) is a non-invasive, highly sensitive method for early-stage breast cancer detection and diagnosis, and represents the reference imaging technique to explore the breast in a complete way (D. R. Dance et al., 2014). Since mammography is a 2D x-ray imaging technique, it suffers from some intrinsic problems: a) breast structures overlap, b) malignant masses absorb x-rays similarly to benign ones and c) the detection sensitivity is lower for masses or microcalcification clusters in denser breasts.
Breast density is the amount of fibroglandular tissue with respect to fat tissue as seen on a mammographic exam (Krishnan et al., 2017). A mammogram with a very high percentage of fibroglandular tissue is less readable, because dense tissue has an x-ray absorption coefficient similar to that of cancer. Furthermore, to achieve sufficient sensitivity in dense breasts, a higher dose has to be delivered to the subject (Miglioretti et al., 2016). Moreover, breast density is an intrinsic risk factor for developing cancer (McCormack, 2006). In order to achieve early diagnosis, screening programs are performed on asymptomatic women at risk, in an age range between 45 and 74 years. Since many healthy women are exposed to ionizing radiation, the delivered dose should be carefully controlled and personalized with respect to the imaging system, measurement conditions and breast structure. Furthermore, the European Directive 59/2013/EURATOM (Euratom, 2013) states that subjects have to be well informed about the amount of radiation dose received. For these reasons, the RADIOMA project (“RADiazioni IOnizzanti in MAmmografia”, funded by “Fondazione Pisa”; partners: “Dipartimento di Fisica” of University of Pisa, “Istituto Nazionale di Fisica Nucleare” (INFN), “Fisica sanitaria” of “Azienda Ospedaliero-Universitaria Pisana” (AOUP) and “Dipartimento di
Ricerca Traslazionale e delle Nuove Tecnologie in
Medicina e Chirurgia” of University of Pisa) was born
with the aim of developing a personalized and reliable
dosimetric quantitative index for mammographic ex-
amination (Traino et al., 2017) (Sottocornola et al.,
2018). Since dense breast tissue is radio-sensitive, a new personalized dosimetric index should take breast density into account. For all these reasons, we decided to build a breast density classifier based on a residual convolutional neural network. The breast density standard we chose is reported in the Fifth Edition of the BI-RADS Atlas (Breast Imaging-Reporting And Data System) (Sickles et al., 2013). The BI-RADS standard consists of four qualitative classes, defined by textual descriptions (Figure 1): almost entirely fatty (“A”), scattered areas of fibroglandular density (“B”), heterogeneously dense (“C”) and extremely dense (“D”).

Figure 1: Top left: almost entirely fatty breast (“A”). Top right: breast with scattered areas of fibroglandular density (“B”). Lower left: heterogeneously dense breast (“C”). Lower right: extremely dense breast (“D”).
The assessment of breast density is a very important issue, since a woman with a dense breast should be directed towards more in-depth screening paths. As radiologists' breast density assessment suffers from a non-negligible intra- and inter-observer variability (Ciatto et al., 2005), computer methods have been developed. The best known is Cumulus (Alonzo-Proulx et al., 2015), a software tool that works with manual input from the radiologist and allows the fibroglandular tissue to be segmented. In recent years, fully automated methods have been developed to reduce the variability of breast density assessment as much as possible (Alonzo-Proulx et al., 2015). Other works have applied deep learning techniques to this kind of problem. Wu et al. (Wu et al., 2017) trained a deep convolutional neural network to produce both the BI-RADS and the two-super-class classification. Fonseca et al. (Fonseca et al., 2017) used an HT-L3 network to extract features to be fed to a Support Vector Machine. In this paper, we propose a residual convolutional neural network to perform BI-RADS classification.
2 MATERIALS AND METHODS
2.1 Data Collection
In order to have a sufficient number of digital mammographic exams, the “Azienda Ospedaliero-Universitaria Pisana” collected 1962 mammographic exams (7848 images/single projections) from the Senology Department. The dataset has been collected and classified by a radiologist specialized in mammography, with the support of a radiology technician. The chosen selection criteria are:
- All exam reports were negative. Where possible, later mammographic exams in the medical records have been examined to verify the current health state of the woman.
- Badly exposed x-ray mammograms have not been collected.
- Only women with all four projections usually acquired in mammography (craniocaudal and mediolateral oblique of left and right breast) have been chosen.
Moreover, the mammographic imaging systems used were GIOTTO IMAGE SDL, SELENIA DIMENSIONS, GE Senograph DS VERSION ADS 54.11 and GE Senograph DS VERSION ADS 53.40 (Table 1).
Table 1: Mammographic imaging systems as reported in
DICOM files.
IMAGING SYSTEM EXAMS
Giotto Image SDL 230
Selenia Dimensions 50
GE Senograph ADS 54.11 121
GE Senograph ADS 53.40 1561
TOTAL 1962
The mammographic exams were provided in DI-
COM image format. Each exam includes the four
standard mammographic projections.
2.2 Network Model
In order to train, fit and evaluate the CNNs, Keras (Chollet, 2018) has been used. Keras is an API written in Python with TensorFlow as backend. In order to
make these exams readable by Keras, they have been converted to the Portable Network Graphics (PNG) format at 8 bits, maintaining the original size. Even though the exams have been acquired at 12 bits, they had to be converted to 8 bits because Keras does not support 12- or 16-bit images. All the PNG images have been checked one by one and automatically sorted according to density class and mammographic projection.
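As a minimal sketch of this conversion step (assuming pydicom and Pillow as tooling, and a simple min-max rescaling, since the exact intensity mapping is not detailed here):

```python
# Minimal sketch of the 12-bit DICOM to 8-bit PNG conversion described above.
# Assumptions: pydicom and Pillow are available, and min-max normalization is
# an acceptable stand-in for the (unspecified) intensity rescaling.
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png8(dicom_path, png_path):
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)    # 12-bit pixel values
    pixels -= pixels.min()                        # shift minimum to zero
    pixels *= 255.0 / max(pixels.max(), 1.0)      # rescale to [0, 255]
    Image.fromarray(pixels.astype(np.uint8)).save(png_path)  # original size kept
```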
We present a model based on a very deep residual convolutional neural network (He et al., 2015). The architecture is the same for both the two-super-class and the BI-RADS classification. It is made of 41 convolutional layers, organized in residual blocks, and has about 2 million learnable parameters. The input block consists of a convolutional layer, a batch normalization layer (Ioffe and Szegedy, 2015), a leakyReLU as activation function and a 2D max pooling. The output of this block has been fed into a series of four blocks, each made of 3 residual modules. In Figure 2, the architecture of one of the four blocks is shown.
The input of each of the four blocks is shared by two branches: in the first, it passes through several convolutional, batch normalization, activation and max pooling layers, while in the other branch it passes through a convolutional layer and a batch normalization only. The outputs of these two branches are then added together to constitute the residual block previously proposed by He et al. (He et al., 2015). The sum goes through a non-linear activation function and the result passes through two identical modules. The architecture of the left branch of these last modules is the same as that of the first one. In the right branch, instead, no operation is performed. At the exit of the module, the two branches are summed together. At the end of the network, the output of the last block is fed to a global average pooling and to a fully-connected layer with a softmax activation function.
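A minimal Keras sketch of one such residual module is shown below; the 3x3 kernels and the filter count are illustrative assumptions, and the max pooling/downsampling details are omitted for brevity.

```python
# Sketch of one residual module as described above: a convolutional branch
# (Conv -> BN, with leakyReLU in between) summed with a shortcut branch that
# is either a projection (Conv -> BN) or the identity. Kernel sizes and
# filter counts are assumptions; pooling layers are omitted for brevity.
from tensorflow.keras import layers

def residual_module(x, filters, project_shortcut=False):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(alpha=0.2)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if project_shortcut:
        # first module of a block: convolution + batch normalization branch
        shortcut = layers.Conv2D(filters, 1, padding="same")(x)
        shortcut = layers.BatchNormalization()(shortcut)
    out = layers.Add()([y, shortcut])       # sum of the two branches
    return layers.LeakyReLU(alpha=0.2)(out)  # non-linearity after the sum
```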
For both problems, the optimizer is Stochastic Gradient Descent (SGD), all the activation functions are leakyReLU (α = 0.2), the loss function is the categorical cross-entropy and the performance measure is the accuracy. The accuracy measures the capability of the network to predict the right label on test mammograms and is defined as:
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. The training has been performed in mini-batches of 8 images. In Table 2, the optimized hyperparameters, which are the same for all the networks, are reported. The CNN has been trained for 100 epochs and the reported results refer to
the epoch with the best validation accuracy. In order to consider all four projections related to a subject, four CNNs have been trained separately. The final breast density assessment has been produced by an overall evaluation of the four mammographic projections related to a subject. The number of samples per class in the dataset has been rescaled in order to respect the distribution of classes reported in the BI-RADS Atlas (Figure 3) (Sickles et al., 2013).

Figure 2: One of the four blocks, made of 3 residual modules.
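As an illustrative sketch, the hyperparameters of Table 2 could be wired into a Keras training run as follows; the function name, data arguments and the legacy SGD decay argument are assumptions, not the authors' exact code.

```python
# Sketch of how the hyperparameters of Table 2 might map onto a Keras
# training call. Names are illustrative; the legacy "decay" argument is an
# assumption about the learning-rate schedule actually used.
from tensorflow.keras.optimizers import SGD

def compile_and_train(model, x_train, y_train, x_val, y_val):
    opt = SGD(learning_rate=0.1,   # Learning Rate (LR)
              decay=0.1,           # LR decay (legacy Keras argument)
              momentum=0.9,        # SGD momentum
              nesterov=True)       # Nesterov momentum enabled
    model.compile(optimizer=opt,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # mini-batches of 8 images, 100 epochs; the best epoch is then
    # selected on validation accuracy
    return model.fit(x_train, y_train, batch_size=8, epochs=100,
                     validation_data=(x_val, y_val))
```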
2.2.1 Two Super-classes Classification
Table 2: Chosen Hyperparameters.
HYPERPARAMETER VALUE
Batch size 8
LeakyReLU alpha 0.2
Learning Rate (LR) 0.1
LR decay 0.1
SGD momentum 0.9
Nesterov True
Figure 3: BI-RADS density class distribution of 3,865,070 screening mammography examinations over 13 years (1996-2008).
In the BI-RADS standard, discriminating between dense and non-dense breasts means classifying two “super-classes”: one made of mammograms belonging to the A and B classes, and the other made of the C and D classes. This problem has clinical relevance, since a woman with a dense breast should be examined more carefully. The AOUP dataset has been randomly divided into a training set (1356 exams), a validation set (120 exams) and a test set (120 exams). Four CNNs have been trained on the four different mammographic projections. The classification scores of the last layer of each CNN have been averaged in order to produce a label that takes into account all the images related to a single subject, as shown in the sketch below. Furthermore, different input image sizes have been explored in order to understand whether the accuracy depends on the input size: seven different CNNs per projection have been trained, with image dimensions ranging from 250x250 to 850x850 pixels.
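A minimal sketch of this score-averaging step (variable names are illustrative):

```python
# Sketch of the subject-level label: the softmax scores from the four
# per-projection CNNs are averaged and the argmax is taken.
import numpy as np

def subject_label(scores_lcc, scores_lmlo, scores_rcc, scores_rmlo):
    # each argument is the softmax output vector of one projection's CNN
    mean_scores = np.mean([scores_lcc, scores_lmlo, scores_rcc, scores_rmlo],
                          axis=0)
    return int(np.argmax(mean_scores))  # 0 = non-dense, 1 = dense
```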
2.2.2 BI-RADS Classification
The dataset has been randomly divided into a training set (1170 exams), a validation set (150 exams) and a test set (150 exams). Since breast density is an overall evaluation of the projections, if a density asymmetry occurs between the left and right breast, the radiologist assigns the higher class to that subject. To reproduce this behaviour, the classification scores have been averaged separately for the right and left breast and, if an asymmetry occurred, the higher class has been assigned to the woman, as sketched below.
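A minimal sketch of this per-side rule (names are illustrative):

```python
# Sketch of the asymmetry rule: average the softmax scores separately for
# each breast side (its CC and MLO views), then assign the subject the
# higher of the two predicted BI-RADS classes.
import numpy as np

def birads_label(left_view_scores, right_view_scores):
    # each argument: list of softmax vectors, one per view of that side
    left_class = int(np.argmax(np.mean(left_view_scores, axis=0)))
    right_class = int(np.argmax(np.mean(right_view_scores, axis=0)))
    return max(left_class, right_class)  # e.g. 0=A, 1=B, 2=C, 3=D
```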
2.3 Computing Power
The hardware has been made available by INFN and consists of:
- CPUs: 2x 10-core Intel Xeon E5-2640v4 @ 2.40 GHz;
- RAM: 64 GB;
- GPUs: 4x nVidia Tesla K80, each with 2x Tesla GK210 GPUs, 24 GB RAM and 2496 CUDA cores.
3 RESULTS
The results for the CNNs trained on the dense/non-dense problem are reported in Table 3. The best test accuracy over all four projections is reached with 650x650 pixel images and is equal to 89.4% (the chance level for a two-class classification problem is 50%). Furthermore, there is no evidence of a remarkable accuracy trend over the input image size.
In Table 4, the results of the training on the four BI-RADS classes are reported. The accuracy values refer to the label assigned with the rule explained above. The maximum accuracy is obtained for images of 650x650 pixels and is equal to 78.0% (the chance level for a four-class classification problem is 25%). As above, there is no clear trend of the accuracy over the input image size.
4 DISCUSSION
Regarding the dense/non-dense problem, the convolutional neural network trained on 650x650 pixel images predicts the right label with an accuracy of 89.4%, which is, to our knowledge, the best test accuracy obtained on this task. Compared to the previous work of Wu et al. (Wu et al., 2017), the performance on the two-“super-class” problem is comparable. In fact, Wu et al. reached a test accuracy of 86.5% with their whole dataset, which consisted of about 200000 exams. Since Wu et al. (Wu et al., 2017) studied how the accuracy changed with the number of samples in the training set, we can compare our results with theirs obtained on 1% of their dataset. In that case they obtained a test accuracy of 84.9%, which is lower than the one reached in this work. Regarding the BI-RADS classification, we obtained a test accuracy on 650x650 pixel images equal to 78.0%. This result is comparable to those achieved by previous works. Fonseca et al. (Fonseca et al., 2017) reached an accuracy of
Table 3: Accuracy means over different projections for dense/non-dense problem. BV = mean calculated using classification
scores at the epoch of Best Validation accuracy.
Input size 250x250 350x350 450x450 550x550 650x650 750x750 850x850
Right breast (BV) 86.3% 90.6% 84.4% 86.9% 88.8% 85.6% 86.3%
Left breast (BV) 86.9% 85.6% 85.0% 85.0% 85.0% 85.0% 86.9%
All proj (BV) 86.3% 86.9% 84.4% 85.6% 89.4% 87.5% 86.3%
Table 4: Accuracy means over different projections for BI-RADS problem. BV = mean calculated using classification scores
at the epoch of Best Validation accuracy.
Input size 250x250 350x350 450x450 550x550 650x650 750x750 850x850
Right breast (BV) 74.7% 76.7% 74.7% 72.7% 77.3% 76.0% 72.7%
Left breast (BV) 72.7% 70.7% 72.0% 68.7% 74.7% 72.7% 72.0%
All proj (BV) 76.0% 76.7% 74.0% 73.3% 78.0% 75.3% 72.0%
76% by training their HT-L3 network on about 1000 exams. Wu et al. (Wu et al., 2017) reached an accuracy of 76.7% using their whole dataset. We are aware that a correct comparison can only be made using the same dataset; however, a validated and shared mammographic dataset is not available yet. The test accuracy of our approach can be further increased by implementing some technological and methodological improvements. First, the considered ground truth is represented by the density assessment made by one radiologist only. Since the intra-observer and inter-observer variabilities are quite high in BI-RADS classification (Ekpo et al., 2016), we could produce a ground truth using the maximum agreement among more than one radiologist. In fact, especially for mammograms belonging to the B and C classes, the assessment produced by only one physician can be considered a confounding factor. Second, we are going to increase the size of our dataset by collecting a large number of screening mammographic exams from the “Azienda USL Nord-Ovest Toscana” (ATNO).
Third, we are going to use more powerful GPUs, which will allow us to increase the size of the images used as input to the CNNs and to study whether and how the accuracy changes. Furthermore, we are aware that relevant information may be lost in the conversion from 12 to 8 bits; for this reason, we are going to work directly with TensorFlow and use images at full bit depth. Moreover, a way to improve accuracy may be to build a model able to take as input the four mammographic projections related to one subject, merged together within the CNN architecture. Finally, a cross-validation process could be performed to validate this classifier and to estimate the performance variability and the stability of the parameters.
5 CONCLUSIONS
In this paper, a residual convolutional neural network to classify mammograms according to breast density has been presented. First, the AOUP collected a dataset of 1962 mammographic exams from the Senology Department. Then, a CNN has been trained to discriminate between non-dense and dense breasts, represented respectively by exams belonging to the A and B classes and exams belonging to the C and D classes. The highest test accuracy is 89.4%. This result is very good compared to the one achieved by Wu et al. (Wu et al., 2017). Finally, a residual convolutional neural network has been trained to classify mammograms into the four BI-RADS classes. The best test accuracy is 78.0%, which is comparable to those achieved by previous works. This work demonstrates that breast density can be successfully analyzed with residual convolutional neural networks and opens several perspectives in this research field. Indeed, new image processing techniques can be explored in order to obtain higher accuracy and to include more samples in the dataset. Furthermore, this approach can help in the evaluation of biomarkers to predict breast cancer, being able to analyze the huge amount of data that can be collected from screening programs. In fact, we are going to collect a large number of mammographic exams from the Tuscany screening program, along with information gathered through a questionnaire on known risk factors for breast cancer.
ACKNOWLEDGEMENTS
This work has been partially supported by the RA-
DIOMA Project, funded by Fondazione Pisa, Tech-
nological and Scientific Research Sector, Via Pietro
Toselli 29, Pisa. We would like to thank Giulia Feriani and Sharon Gruttadauria for their contribution to the realization of the dataset.
REFERENCES
Alonzo-Proulx, O., Mawdsley, G. E., Patrie, J. T., Yaffe,
M. J., and Harvey, J. A. (2015). Reliability of au-
tomated breast density measurements. Radiology,
275(2):366–376.
Chollet, F. (2018). Keras documentation.
Ciatto, S., Houssami, N., Apruzzese, A., Bassetti, E., Brancato, B., Carozzi, F., Catarzi, S., Lamberini, M., Marcelli, G., Pellizzoni, R., Pesce, B., Risso, G., Russo, F., and Scorsolini, A. (2005). Categorizing breast mammographic density: intra- and interobserver reproducibility of BI-RADS density categories. The Breast, 14(4):269–275.
Dance, D. R., Christofides, S., McLean, I. D., Maidment, A. D. A., and Ng, K. H. (2014). Diagnostic Radiology Physics: A Handbook for Teachers and Students. IAEA, Vienna.
Ekpo, E. U., Ujong, U. P., Mello-Thoms, C., and McEn-
tee, M. F. (2016). Assessment of Interradiologist
Agreement Regarding Mammographic Breast Den-
sity Classification Using the Fifth Edition of the BI-
RADS Atlas. American Journal of Roentgenology,
206(5):1119–1123.
Euratom (2013). Council Directive 2013/59/Euratom of 5 December 2013 laying down basic safety standards for protection against the dangers arising from exposure to ionising radiation, and repealing Directives 89/618/Euratom, 90/641/Euratom, 96/29/Euratom, 97/43/Euratom and 2003/122/Euratom. page 73.
Fonseca, P., Castañeda, B., Valenzuela, R., and Wainer, J. (2017). Breast density classification with convolutional neural networks. In Beltrán-Castañón, C., Nyström, I., and Famili, F., editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, volume 10125, pages 101–108. Springer International Publishing.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep resid-
ual learning for image recognition. Conference on
Computer Vision and Pattern Recognition (CVPR).
International Agency for Research on Cancer (2018).
http://gco.iarc.fr/today/home.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. arXiv:1502.03167.
Krishnan, K., Baglietto, L., Stone, J., Simpson, J. A., Severi, G., Evans, C. F., MacInnis, R. J., Giles, G. G., Apicella, C., and Hopper, J. L. (2017). Longitudinal study of mammographic density measures that predict breast cancer risk. Cancer Epidemiology, Biomarkers & Prevention, 26(4):651–660.
Loberg, M., Lousdal, M. L., Bretthauer, M., and Kalager, M. (2015). Benefits and harms of mammography screening. Breast Cancer Research, 17(1):63.
McCormack, V. A. (2006). Breast density and parenchy-
mal patterns as markers of breast cancer risk: A meta-
analysis. Cancer Epidemiology Biomarkers & Pre-
vention, 15(6):1159–1169.
Miglioretti, D. L., Lange, J., van den Broek, J. J., Lee,
C. I., van Ravesteyn, N. T., Ritley, D., Kerlikowske,
K., Fenton, J. J., Melnikow, J., de Koning, H. J., and
Hubbard, R. A. (2016). Radiation-induced breast can-
cer incidence and mortality from digital mammogra-
phy screening: A modeling study. Annals of Internal
Medicine, 164(4):205.
Sickles, E., D'Orsi, C., Bassett, L., et al. (2013). ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System.
Sottocornola, C., Traino, A., Barca, P., Aringhieri, G., Marini, C., Retico, A., Caramella, D., and Fantacci, M. E. (2018). Evaluation of dosimetric properties in full field digital mammography (FFDM) - development of a new dose index. In Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 1: BIODEVICES, pages 212–217. INSTICC, SciTePress.
Traino, A. C., Sottocornola, C., Barca, P., Marini, C., Aringhieri, G., Caramella, D., and Fantacci, M. E. (2017). Average absorbed breast dose in mammography: a new possible dose index matching the requirements of the European Directive 2013/59/EURATOM. European Radiology Experimental, 1(1).
Wu, N., Geras, K. J., Shen, Y., Su, J., Kim, S. G., Kim,
E., Wolfson, S., Moy, L., and Cho, K. (2017). Breast
density classification with deep convolutional neural
networks. arXiv:1711.03674.