Making Data Big for a Deep-learning Analysis: Aggregation of Public COVID-19 Datasets of Lung Computed Tomography Scans

Francesca Lizzi (1,2), Francesca Brero (3,4), Raffaella Fiamma Cabini (3,5), Maria Evelina Fantacci (2,6), Stefano Piffer (7,8), Ian Postuma, Lisa Rinaldi (3,4) and Alessandra Retico

(1) Scuola Normale Superiore, Pisa, Italy

(2) National Institute for Nuclear Physics (INFN), Pisa Division, Pisa, Italy

(3) INFN, Pavia Division, Pavia, Italy

(4) Department of Physics, University of Pavia, Pavia, Italy

(5) Department of Mathematics, University of Pavia, Pavia, Italy

(6) Department of Physics, University of Pisa, Pisa, Italy

(7) Department of Biomedical Experimental Clinical Science "M. Serio", University of Florence, Florence, Italy

(8) INFN, Florence Division, Florence, Italy

Keywords:

COVID-19, Lung CT, U-net, Data Aggregation, Image Segmentation.

Abstract:

Lung Computed Tomography (CT) is an imaging technique useful to assess the severity of COVID-19 infection in symptomatic patients and to monitor its evolution over time. Lung CT can be analysed with the support of deep learning methods for both of these tasks. We have developed a U-net based algorithm to segment COVID-19 lesions. Unfortunately, public datasets containing a large number of labelled CT scans of patients affected by COVID-19 are not available. In this work, we first review all the currently available public datasets of COVID-19 CT scans, presenting an extensive description of their characteristics. Then, we describe the design of the U-net we developed for the automated identification of COVID-19 lung lesions. Finally, we discuss the results obtained by using the different publicly available datasets. In particular, we trained the U-net on the dataset made available within the COVID-19 Lung CT Lesion Segmentation Challenge 2020, and we tested it on data from the MosMed and the COVID-19-CT-Seg datasets to explore the transferability of the model and to assess whether the image annotation process affects the detection performance. We evaluated the performance of the system in lesion segmentation in terms of the Dice index, which measures the overlap between the ground truth and the predicted masks. The proposed U-net segmentation model reaches a Dice index of 0.67, 0.42 and 0.58 on the independent validation set of the COVID-19 Lung CT Lesion Segmentation Challenge 2020, on the MosMed dataset and on the COVID-19-CT-Seg dataset, respectively. This work on lesion segmentation is a preliminary step towards a more accurate analysis of COVID-19 lesions, based for example on the extraction and analysis of radiomic features.

1 INTRODUCTION

Lung Computed Tomography (CT) is a very sensitive medical imaging technique for detecting lung lesions due to COVID-19. It can be used for diagnosis, for prognosis and for monitoring the evolution of the disease over time. Although the use of CT for diagnosis is not recommended by the World Health Organization (World Health Organization, 2020), lung CT analysis can be very informative regarding the severity of the disease and its time evolution (Fang et al., 2021). The use of CT in clinical practice for COVID-19 diagnosis in symptomatic patients has been explored. Since the unexpected outbreak of the pandemic, physicians have tried to use CT imaging of the chest to diagnose COVID-19. The first publication describing the radiological findings of CT in detail was published on 24 January 2020 (Huang et al., 2020). It reports the findings observed in the majority of the hospitalized COVID-19 patients in that study, such as bilateral multiple lobular and subsegmental areas of consolidation and bilateral ground-glass opacity. Afterwards, several studies were published describing the radiological findings of COVID-19 chest CT (Carotti et al., 2020). A summary of all possible findings and their incidence is reported in Table 1.

Lizzi, F., Brero, F., Cabini, R., Fantacci, M., Piffer, S., Postuma, I., Rinaldi, L. and Retico, A.

Making Data Big for a Deep-learning Analysis: Aggregation of Public COVID-19 Datasets of Lung Computed Tomography Scans.

DOI: 10.5220/0010584403160321

In Proceedings of the 10th International Conference on Data Science, Technology and Applications (DATA 2021), pages 316-321

ISBN: 978-989-758-521-0

Table 1: Summary of COVID-19 chest CT findings and their incidence in the population. The normal chest CT findings are also associated with symptomaticity (Huang et al., 2020).

Findings                                          Incidence

Normal chest CT findings                          10.6% (95% CI: 7.6%, 13.7%)

Ground-glass opacity, lower lobe involvement,     High incidence
bilateral abnormalities, vascular enlargement,    (more than 70%)
posterior predilection

Consolidation, linear opacity, septal             Intermediate incidence
thickening and/or reticulation, crazy-paving      (between 10% and 70%)
pattern, air bronchogram, pleural thickening,
halo sign, bronchiectasis, nodules, bronchial
wall thickening, reversed halo sign

Pleural effusion, lymphadenopathy,                Low incidence
tree-in-bud sign, central lesion distribution,    (less than 10%)
pericardial effusion, cavitating lung lesions

We underline that the dataset used by Huang et al. (2020) contains a very limited number of CT scans (41 patients) and is private. Most of the chest CT findings cannot be attributed exclusively to COVID-19, because they are nonspecific signs of disease and are strongly related to the stage of the disease. This means that other forms of pneumonia, such as those caused by SARS-CoV-1 and MERS-CoV, may show the same signs. For this reason, the World Health Organization (WHO) defined as a “confirmed case” a patient who has tested positive for COVID-19 by RT-PCR, irrespective of clinical signs and symptoms (World Health Organization, 2020). Furthermore, it is necessary to differentiate COVID-19 infections not only from other viral pneumonias but also from bacterial pneumonias, such as mycoplasma pneumonia (Ishiguro et al., 2019). The use of chest CT to diagnose COVID-19 is, hence, under discussion, since it implies the use of ionizing radiation (Scientiae et al., 2020), while monitoring the progression of the disease seems to be a promising use of lung CT (Adams et al., 2020). Artificial Intelligence (AI) is a powerful instrument for analysing large quantities of data, such as CT scans, and hence it can be used to monitor and study COVID-19 CT signs (Gülbay et al., 2021). Unfortunately, AI implementations require a great amount of data, which may not be easily available. This is especially true when deep-learning methods are used. Since the beginning of the pandemic, some lung CT scans of COVID-19 patients have been released by different institutions, following different guidelines for both image acquisition and annotation (ground truth). In this work, we review all the public lung COVID-19 CT datasets that, to the best of our knowledge, are suitable for training AI-based systems. We present an extensive description of these datasets, which have different characteristics, and discuss the segmentation results obtained by using them. In particular, we trained a U-net on the dataset released within the COVID-19 Lung CT Lesion Segmentation Challenge 2020 (An et al., 2020) and tested it on data from the MosMed (Morozov et al., 2020) and COVID-19-CT-Seg (Ma et al., 2020) datasets. Finally, the limits and the advantages of aggregating this kind of data are discussed.

2 AI AND MEDICAL IMAGE DATASET ISSUES

AI has been used to analyse and process CT scans in order to diagnose COVID-19, to segment lesions inside the lungs and, in longitudinal studies, to track the evolution of the disease (Ma et al., 2020). AI-based methods, especially deep-learning ones, need a huge amount of labelled data, which are not easy to collect and share. As already described in the introduction, some studies use private datasets, which do not allow a fair comparison with other AI-based systems. Furthermore, the characteristics of CT images depend on the scanner, on the acquisition and reconstruction protocols and on other information which may not be available. This can also be due to the anonymization process needed to preserve the subjects' privacy, or to the use of an image format different from DICOM (Standard DICOM, 2021), such as the NIfTI format. DICOM is the most widely used format for medical images and it stores many metadata in its header. Some of this information is Protected Health Information (PHI); the header may also hold private keys that are inserted and encoded by the manufacturer and may contain PHI as well. On the other hand, some metadata, such as anode characteristics or X-ray parameters, do not contain PHI and can be useful in analysing images. For all these reasons, anonymizing a DICOM file is not a trivial problem, and datasets may include images in a different format such as NIfTI (Moore et al., 2015). Deep-learning-based methods often require each image to be associated with a label that depends on the task to be solved. Many approaches are based on supervised learning and, hence, image annotation plays a crucial role. Usually, medical image labels are given by one or more radiologists with experience in the specific field, and image annotation is a very time-consuming task. This is the reason why there is a general lack of public labelled datasets of medical images. In order to save time, the labelling may be produced with the support of an automatic tool, and the labels are then adjusted manually by one or more physicians.
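The distinction between header fields that may carry PHI and acquisition parameters that are worth keeping can be illustrated with a minimal sketch. It assumes that the header has already been read into a plain dictionary keyed by standard DICOM attribute keywords; the allow-list and the example values are hypothetical choices made for illustration and do not constitute a complete de-identification profile.

```python
# Hypothetical allow-list of acquisition-related DICOM keywords that are
# assumed, for this illustration only, not to carry PHI.
SAFE_KEYWORDS = {
    "Modality",
    "KVP",
    "XRayTubeCurrent",
    "SliceThickness",
    "PixelSpacing",
    "ConvolutionKernel",
}


def strip_header(header):
    """Return a copy of `header` that keeps only allow-listed keywords.

    Everything that is not explicitly allow-listed (patient identifiers,
    dates, manufacturer private tags, ...) is dropped, because private
    tags may contain PHI as well.
    """
    return {key: value for key, value in header.items() if key in SAFE_KEYWORDS}


# Made-up header used only to exercise the function.
example_header = {
    "PatientName": "DOE^JOHN",
    "PatientID": "12345",
    "StudyDate": "20200324",
    "Modality": "CT",
    "KVP": 120.0,
    "SliceThickness": 1.25,
    "PixelSpacing": [0.7, 0.7],
    "(0029,1010)": b"vendor private blob",
}

anonymized = strip_header(example_header)
```

An allow-list is used here instead of a block-list because unknown or private tags are then removed by default, at the cost of losing metadata that might have been harmless.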

3 LUNG CT DATASETS

In this section, the currently available datasets of lung CT scans and their annotation processes are reported. The datasets are: the COVID-19 Lung CT Lesion Segmentation Challenge 2020 dataset, the MosMed dataset, the COVID-19-CT-Seg dataset and TCIA-COVID-19-AR.

3.1 COVID-19 Lung CT Lesion Segmentation Challenge 2020 Dataset

The COVID-19 Lung CT Lesion Segmentation Challenge 2020 dataset (Challenge dataset) is a public dataset of 199 unenhanced chest CT scans of patients with a positive RT-PCR test for SARS-CoV-2 (An et al., 2020), published as the training set of the COVID GrandChallenge (https://covid-segmentation.grand-challenge.org/). Each CT scan is annotated voxel-wise, with all the COVID-19 lesions indicated in a single mask. The data were provided in NIfTI format by the Multi-national NIH Consortium for CT AI in COVID-19 via the NCI TCIA public website. The annotations were produced using a COVID-19 segmentation model provided by NVIDIA, which takes a full chest CT volume and produces pixel-wise segmentation masks of COVID-19 lesions. These segmentation masks were then adjusted manually by a board of certified radiologists in order to give 3D consistency to the lesion masks. The annotations of the training set were published in the context of the challenge, while the system performance was evaluated by the challenge organizers on an independent validation set of 50 CT scans, for which the lesion annotations were not publicly released. A third set, an independent test set consisting of 46 CT scans, was used to define the final ranking among the participants; also in this case, the lesion segmentation annotations were not publicly released.

3.2 MosMed Dataset

MosMed (Morozov et al., 2020) is a dataset of COVID-19 chest CT scans collected by the Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department. It includes 1110 CT studies taken from 1110 patients and is provided with a labelling that consists of 5 classes, based on the percentage of involved lung parenchyma. A small subset of class CT-1 cases (50 patients) was annotated by expert radiologists with the support of the MedSeg software (2020 Artificial Intelligence AS). The image annotations consist of binary masks in which white voxels represent both ground-glass opacities and consolidation. Both CT scans and annotations are provided in NIfTI format. During the DICOM-to-NIfTI conversion, only every 10th image was preserved (MosMed, 2020).

3.3 COVID-19-CT-Seg Dataset

The COVID-19-CT-Seg dataset is a collection of CT scans made available by the Coronacases Initiative and Radiopaedia (Ma et al., 2020). It contains 20 CT scans of patients who tested positive for COVID-19 by RT-PCR. It is a public dataset with annotations of both the lungs and the infection. The ground truth was produced in three steps: first, junior radiologists (1-5 years of experience) delineated the lungs and the infections; then, two radiologists (5-10 years of experience) refined the labels; finally, the annotations were verified and optimized by a senior radiologist (more than 10 years of experience in chest radiology). The annotations were produced with the ITK-SNAP software. Ten cases of this dataset are provided with an 8-bit depth, which is not commonly used in clinical practice.

3.4 TCIA-COVID-19-AR

TCIA-COVID-19-AR (Desai et al., 2020) is a dataset of COVID-19 cases taken from a rural population, which is often underrepresented in public datasets. It contains 24 CT scans, covering both patients with lung lesions due to COVID-19 and control cases. Each patient is described by a set of clinical correlates that includes key radiology findings. Moreover, information about Intensive Care Unit (ICU) admission is included for each patient, while image annotations are not included in this dataset.

DATA 2021 - 10th International Conference on Data Science, Technology and Applications

Figure 1: U-net summary: the U-shaped neural network has 5 levels of depth. In the left path (compression), the input is processed through convolutions, activation layers (ReLU) and instance normalization layers, while in the right path (decompression) 3D transposed convolution (de-convolution) layers are introduced in addition to those already mentioned. Each block (green) is made of 3 convolutional layers.

4 COVID-19 LESION SEGMENTATION

We developed an automatic system which can segment COVID-19 lesions, based on a U-net (Ronneberger et al., 2015), in the framework of the COVID-19 Lung CT Lesion Segmentation Challenge 2020 (GrandChallenge, 2020). First, a bounding box containing the lungs was built for each CT scan, to remove as much background as possible from the images. An in-house lung segmentation algorithm based on active contours was developed for this purpose and implemented in MATLAB (The MathWorks, Inc.). This algorithm, which accurately segments the lung parenchyma in the absence of lesions, has very limited performance on CT scans of subjects with COVID-19 lesions. The CT images were cropped to the bounding boxes and resized to a matrix of 200x150x100 voxels, and a CT window in the range [-1000, 300] Hounsfield Units was applied to them to enhance the COVID-19 lesions. A schematic representation of the U-net is reported in Figure 1.
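The cropping and windowing steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the lung mask is taken as given (the active-contour segmentation is not reproduced), the resizing to 200x150x100 voxels is omitted, and the linear rescaling of the windowed values to [0, 1] is an assumption of this example rather than a documented step of the pipeline. All function names are hypothetical.

```python
def bounding_box(mask):
    """Return ((z0, z1), (y0, y1), (x0, x1)) enclosing all non-zero voxels.

    `mask` is a nested list indexed as mask[z][y][x]; upper bounds are exclusive.
    """
    coords = [
        (z, y, x)
        for z, plane in enumerate(mask)
        for y, row in enumerate(plane)
        for x, value in enumerate(row)
        if value
    ]
    if not coords:
        raise ValueError("mask is empty: no bounding box can be defined")
    return tuple(
        (min(c[axis] for c in coords), max(c[axis] for c in coords) + 1)
        for axis in range(3)
    )


def crop(volume, box):
    """Crop a nested-list volume to the given bounding box."""
    (z0, z1), (y0, y1), (x0, x1) = box
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]


def window(volume, low=-1000.0, high=300.0):
    """Clip Hounsfield Units to [low, high] and rescale linearly to [0, 1]."""
    scale = high - low
    return [
        [[(min(max(v, low), high) - low) / scale for v in row] for row in plane]
        for plane in volume
    ]


# Tiny made-up volume (2 x 2 x 2 voxels, values in HU) and lung mask.
ct = [
    [[-1500, -1000], [-350, 40]],
    [[-800, 300], [1200, -1000]],
]
lung_mask = [
    [[0, 0], [1, 1]],
    [[0, 0], [1, 0]],
]

box = bounding_box(lung_mask)
preprocessed = window(crop(ct, box))
```

In this toy case the bounding box keeps only the second row of each slice, and values outside the window (such as 1200 HU) saturate at the window limits.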

We trained the network on the Challenge training dataset of 199 CT scans, using a weighted cross-entropy as the loss function, and we tested it on the Challenge validation set (independent of the training set). In order to have a sufficient number of samples, we applied data augmentation with rotations, zooming and elastic transformations to the training set.
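As an illustration of how a weighted cross-entropy can counteract the imbalance between lesion and background voxels, a minimal sketch is given here. The binary per-voxel form and the class weights are assumptions made for this example; they are not the settings used for training, which are not reported in this text.

```python
import math


def weighted_cross_entropy(probabilities, labels, weight_lesion=10.0, weight_background=1.0):
    """Weighted binary cross-entropy averaged over voxels.

    probabilities: predicted lesion probabilities, one per voxel.
    labels: ground-truth labels (1 = lesion, 0 = background), one per voxel.
    The default weights are hypothetical placeholders.
    """
    eps = 1e-7  # avoids log(0) for saturated predictions
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total -= weight_lesion * math.log(p)
        else:
            total -= weight_background * math.log(1.0 - p)
    return total / len(labels)


# A missed lesion voxel costs more than an equally confident false positive.
missed_lesion = weighted_cross_entropy([0.1], [1])
false_positive = weighted_cross_entropy([0.9], [0])
```

With a larger weight on the lesion class, errors on the comparatively rare lesion voxels contribute more to the loss than errors on the abundant background.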

We also tested the network on the 50 annotated cases of MosMed and on the 10 annotated cases of the COVID-19-CT-Seg dataset. The MosMed dataset contains images and labels obtained in a very different way from those of the Challenge dataset. The COVID-19-CT-Seg dataset was built in a way more similar to the Challenge one with respect to both data characteristics, such as slice thickness, and the labelling process. We evaluated the segmentation performance of the trained network model in terms of the Dice index (Equation 1), defined as:

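\[
\mathrm{Dice}_{\mathrm{metric}} = \frac{2 \cdot |M_{\mathrm{true}} \cap M_{\mathrm{pred}}|}{|M_{\mathrm{true}}| + |M_{\mathrm{pred}}|} \tag{1}
\]

where $M_{\mathrm{true}}$ is the ground truth mask and $M_{\mathrm{pred}}$ is the predicted one.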
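The definition in Equation 1 translates directly into code. The following minimal sketch assumes that the two masks are given as flat sequences of 0/1 values of equal length; the function name and the convention adopted for two empty masks are choices made for this example.

```python
def dice_index(mask_true, mask_pred):
    """Dice index between two binary masks given as flat sequences of 0/1."""
    if len(mask_true) != len(mask_pred):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for t, p in zip(mask_true, mask_pred) if t and p)
    size_true = sum(1 for t in mask_true if t)
    size_pred = sum(1 for p in mask_pred if p)
    if size_true + size_pred == 0:
        # Convention chosen here (an assumption): two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / (size_true + size_pred)


# Made-up masks: 3 true voxels, 3 predicted voxels, 2 in common.
m_true = [0, 1, 1, 1, 0, 0]
m_pred = [0, 0, 1, 1, 1, 0]
score = dice_index(m_true, m_pred)  # 2 * 2 / (3 + 3)
```

The index ranges from 0 (no overlap) to 1 (perfect overlap), so the value of about 0.67 in this toy example corresponds to two thirds of the combined mask size being shared.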

We participated in the challenge, obtaining a Dice index of 0.67 on the Challenge validation set (GrandChallenge, 2020). Then, we computed the segmentation performance of the trained model on the MosMed dataset, obtaining a Dice index of 0.42, and on the COVID-19-CT-Seg dataset, obtaining a Dice index of 0.58. Figure 2 shows a visual comparison between the reference COVID-19 lesion masks and the ones predicted by the trained U-net for a representative CT scan of the MosMed dataset and of the COVID-19-CT-Seg dataset.

5 DISCUSSION AND CONCLUSIONS

We obtained good results in terms of the Dice index for the segmentation of the lung lesions related to COVID-19 infection on the Challenge dataset, compared to the literature (Ma et al., 2020). The results obtained on the other two datasets are not as good as those on the first one. We underline that on the COVID-19-CT-Seg dataset, which is more similar to the Challenge one, we obtained better results than on MosMed. As expected, we conclude that aggregating data from different sources can be difficult if labelling has been performed using different guidelines. In fact, medical images have many parameters to be considered, such as the pixel resolution and the size of the Field Of View (FOV). These parameters can be studied in order to attempt a standardization of images from different datasets. By contrast, different annotation styles cannot be easily standardized. Since CT image characteristics can be variable, deep learning is a useful method to analyse them and their aggregation. Moreover, U-nets allow a quantification of the volumes of both COVID-19 lesions and lungs. On the other hand, the use of deep-learning-based methods requires a huge amount of homogeneous or harmonized data, both to carry out an optimal training process and to ensure a fair representation of the population to be studied.

Figure 2: Visual comparison between the reference COVID-19 lesion masks (green) and the ones predicted (red) by the trained U-net for a representative CT scan of the MosMed (first row, study-0255.nii) and of the COVID-19-CT-Seg (second row, coronacases-001.nii) datasets. The original CT scans are shown on the left as a reference.

This preliminary study has been useful to understand which parameters should be considered the most critical ones in training a neural network model. Because of the lack of large public datasets, lesion labelling and data selection criteria are crucial for this kind of segmentation problem and have a relevant impact on performance. In conclusion, we reviewed all the publicly available datasets (to the best of our knowledge, as of April 2021), i.e. the COVID-19 Lung CT Lesion Segmentation Challenge 2020, MosMed, COVID-19-CT-Seg and TCIA-COVID-19-AR datasets. We used the Challenge data to train and evaluate a U-net for COVID-19 lung lesion segmentation, and we carried out an independent test on the MosMed and the COVID-19-CT-Seg datasets, obtaining good performance compared to other results available in the literature (Ma et al., 2020).

We are going to improve our system by adding a module for lung segmentation, which could help in quantifying the percentage of lung tissue affected by COVID-19 lesions. We also plan to let radiologists evaluate the application of this algorithm to a subset of the public CT datasets that lack annotations. Furthermore, the segmentation of COVID-19 lesions is a starting point for an accurate radiomic analysis aimed at predicting, on the basis of radiological signs, the clinical outcome of patients affected by COVID-19 pneumonia.
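The quantification mentioned above, i.e. the volumes of lesions and lungs and the percentage of affected lung tissue, can be sketched as follows once both a lesion mask and a lung mask are available. The masks are assumed to be flat 0/1 sequences defined on the same voxel grid; the function names, the voxel size and the example values are hypothetical.

```python
def mask_volume_ml(mask, voxel_size_mm):
    """Volume in millilitres of the non-zero voxels of a flat binary mask."""
    dx, dy, dz = voxel_size_mm
    voxel_volume_mm3 = dx * dy * dz
    return sum(1 for v in mask if v) * voxel_volume_mm3 / 1000.0  # 1 ml = 1000 mm^3


def involvement_percentage(lesion_mask, lung_mask):
    """Percentage of lung voxels that are also labelled as lesion."""
    lung_voxels = sum(1 for v in lung_mask if v)
    if lung_voxels == 0:
        raise ValueError("lung mask is empty")
    affected = sum(1 for les, lung in zip(lesion_mask, lung_mask) if les and lung)
    return 100.0 * affected / lung_voxels


# Made-up example: 8 lung voxels, 2 of which are lesion, voxel size in mm.
lung_mask = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
lesion_mask = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
voxel_size = (0.8, 0.8, 5.0)

lesion_volume = mask_volume_ml(lesion_mask, voxel_size)
involvement = involvement_percentage(lesion_mask, lung_mask)
```

Counting only the lesion voxels that fall inside the lung mask keeps the percentage bounded by 100 even when the two segmentations are produced by different tools.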

ACKNOWLEDGEMENTS

This work has been carried out within the Artificial Intelligence in Medicine (AIM) project funded by INFN (CSN5, 2019-2021), https://www.pi.infn.it/aim. We are grateful to the staff of the Data Center of the INFN Division of Pisa. We thank the CINECA Italian computing center for making available part of the computing resources used in this paper; in particular, Dr. Tommaso Boccali (INFN, Pisa) as PI of PRACE Project Access #2018194658 and a 2021 ISCRA-C grant. Moreover, we thank the EOS cluster of the Department of Mathematics "F. Casorati" (Pavia) for computing resources.

REFERENCES

Adams, H. J., Kwee, T. C., Yakar, D., Hope, M. D., and Kwee, R. M. (2020). Chest CT Imaging Signature of Coronavirus Disease 2019 Infection: In Pursuit of the Scientific Evidence. Chest, 158(5):1885–1895.

An, P., Xu, S., Harmon, S. A., Turkbey, E. B., Sanford, T. H., Amalou, A., Kassin, M., Varble, N., Blain, M., Anderson, V., Patella, F., Carrafiello, G., Turkbey, B. T., and Wood, B. J. (2020). CT Images in COVID-19.

Carotti, M., Salaffi, F., Sarzi-Puttini, P., Agostini, A., Borgheresi, A., Minorati, D., Galli, M., Marotto, D., and Giovagnoni, A. (2020). Chest CT features of coronavirus disease 2019 (COVID-19) pneumonia: key points for radiologists. Radiologia Medica, 125(7):636–646.

Desai, S., Baghal, A., Wongsurawat, T., Al-Shukri, S., Gates, K., Farmer, P., Rutherford, M., Blake, G., Nolan, T., Powell, T., Sexton, K., Bennett, W., and Prior, F. (2020). Data from Chest Imaging with Clinical and Genomic Correlates Representing a Rural COVID-19 Positive Population [Data set].

Fang, X., Kruger, U., Homayounieh, F., Chao, H., Zhang, J., Digumarthy, S. R., Arru, C. D., Kalra, M. K., and Yan, P. (2021). Association of AI quantified COVID-19 chest CT and patient outcome. International Journal of Computer Assisted Radiology and Surgery.

GrandChallenge (2020). COVID-19 Lung CT Lesion Segmentation Challenge - 2020, https://covid-segmentation.grand-challenge.org/COVID-19-20/.

Gülbay, M., Özbay, B. O., Mendi, B. A. R., Baştuğ, A., and Bodur, H. (2021). A CT radiomics analysis of COVID-19-related ground-glass opacities and consolidation: Is it valuable in a differential diagnosis with other atypical pneumonias? PLoS ONE, 16(3):e0246582.

Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J., Gu, X., Cheng, Z., Yu, T., Xia, J., Wei, Y., Wu, W., Xie, X., Yin, W., Li, H., Liu, M., Xiao, Y., Gao, H., Guo, L., Xie, J., Wang, G., Jiang, R., Gao, Z., Jin, Q., Wang, J., and Cao, B. (2020). Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet, 395(10223):497–506.

Ishiguro, T., Kobayashi, Y., Uozumi, R., Takata, N., Takaku, Y., Kagiyama, N., Kanauchi, T., Shimizu, Y., and Takayanagi, N. (2019). Viral pneumonia requiring differentiation from acute and progressive diffuse interstitial lung diseases. Internal Medicine, 58(24):3509–3519.

Ma, J., Wang, Y., An, X., Ge, C., Yu, Z., Chen, J., Zhu, Q., Dong, G., He, J., He, Z., Nie, Z., and Yang, X. (2020). Towards Efficient COVID-19 CT Annotation: A Benchmark for Lung and Infection Segmentation. pages 1–7.

Moore, S. M., Maffitt, D. R., Smith, K. E., Kirby, J. S., Clark, K. W., Freymann, J. B., Vendt, B. A., Tarbox, L. R., and Prior, F. W. (2015). De-identification of medical images with retention of scientific research value. Radiographics, 35(3):727–735.

Morozov, S. P., Andreychenko, A. E., Pavlov, N. A., Vladzymyrskyy, A. V., Ledikhova, N. V., Gombolevskiy, V. A., Blokhin, I. A., Gelezhe, P. B., Gonchar, A. V., and Chernina, V. (2020). MosMedData: Chest CT Scans with COVID-19 Related Findings Dataset. medRxiv, page 2020.05.20.20100362.

MosMed (2020). MosMed dataset website, https://mosmed.ai/en/.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9351:234–241.

Scientiae, D. C.-S., Adams, H. J. A., Kwee, T. C., Hope, M. D., Kwee, R. M., Hja, A., Tc, K., and Yakar, D. (2020). Systematic Review and Meta- in the Diagnosis of Coronavirus. (December):1342–1350.

Standard DICOM (2021). DICOM standard.

World Health Organization (2020). WHO Interim guidance 20 March 2020 - Global Surveillance for COVID-19 disease caused by human infection with novel coronavirus (COVID-19). Who, (January):1–4.
