Classification of Respiratory Sounds with Convolutional Neural Network
A. A. Saraiva (3,7), D. B. S. Santos (2), A. A. Francisco (2), Jose Vigno Moura Sousa (1,2),
N. M. Fonseca Ferreira (4,5), Salviano Soares (3) and Antonio Valente (3,6)
1 University Brazil, Sao Paulo, Brazil
2 UESPI - University of State Piaui, Piripiri, Brazil
3 University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
4 Coimbra Polytechnic - ISEC, Coimbra, Portugal
5 Knowledge Engineering and Decision-Support Research Center (GECAD) of the Institute of Engineering, Polytechnic Institute of Porto, Porto, Portugal
6 INESC TEC - Technology and Science, Porto, Portugal
7 University of Sao Paulo, Sao Carlos, Brazil
ORCIDs: A. A. Saraiva https://orcid.org/0000-0002-3960-697X; D. B. S. Santos https://orcid.org/0000-0003-4018-242X; A. A. Francisco https://orcid.org/0000-0002-0714-3333; J. V. M. Sousa https://orcid.org/0000-0002-5164-360X; N. M. F. Ferreira https://orcid.org/0000-0002-2204-6339; S. Soares https://orcid.org/0000-0001-5862-5706; A. Valente https://orcid.org/0000-0002-5798-1298
Keywords: CNN, Sounds, Breath, MFCC.
Abstract: Recent advances in the field of image classification have shown that convolutional neural networks (CNNs) can classify images with high precision. This paper proposes a method for classifying breathing sounds using a CNN, which is trained and tested for this task. To do so, a visual representation of each audio sample was produced, allowing classification features to be identified with the same techniques used to classify images with high precision. For this we used the technique known as Mel Frequency Cepstral Coefficients (MFCCs): for each audio file in the dataset, features were extracted with MFCC, which means we obtain an image representation of each audio sample. The method proposed in this article obtained results above 74% in the classification of respiratory sounds across the four classes available in the database used (normal, crackles, wheezes, both).
1 INTRODUCTION
Automatic analysis of respiratory sounds has been a field of great research interest in recent decades. Automated classification of respiratory sounds has the potential to detect abnormalities in the early stages of respiratory dysfunction and thus increase the effectiveness of decision making (Pasterkamp et al., 1997; Morillo et al., 2013).
Respiratory sounds are important indicators of respiratory health and respiratory disorders. The sound emitted when a person breathes is directly related to air movement, changes in lung tissue and the position of lung secretions. Wheezing, for example, is a common sign that a patient has an obstructive airway disease such as asthma or chronic obstructive pulmonary disease (Moussavi, 2006). These sounds can be recorded using digital stethoscopes and other recording techniques. This digital data opens the possibility of using machine learning to automatically diagnose respiratory disorders such as asthma, pneumonia and bronchiolitis, among others (Naydenova, 2018).
When performed by advanced computational methods, in-depth analysis of these sounds may be of great support to the physician and may result in improved detection of respiratory diseases. In this context, machine learning techniques have been shown to provide an invaluable computational tool for detecting disease-related anomalies in the early stages of respiratory dysfunction (Perna and Tagarelli, 2019).
Based on this information, this article describes a method capable of classifying four types of breathing sounds (normal, crackles, wheezes, both), using the ICBHI 2017 Challenge dataset (Rocha et al., 2018a). The method chosen and implemented consists of the construction of a CNN, but a method for extracting features from breath sounds, used in the classification task, is also implemented. The article is divided into six sections: Section 2 covers related work, Section 3 describes the dataset, Section 4 presents the methodology, Section 5 presents the results and discussion, and Section 6 concludes.
2 RELATED WORK
Respiratory diseases are currently among the most common causes of serious illness and death worldwide. Prevention and early diagnosis are essential in all diseases to limit or even reverse the tendency that characterizes their spread. The development of advanced computational tools for the analysis of respiratory auscultation sounds can become a turning point in detecting disease or disease-related anomalies (Perna and Tagarelli, 2019).
For the diagnosis of respiratory diseases, it is extremely important to listen to the sounds generated during the patient's breathing, which is usually done by a specialist with the help of a stethoscope (Kandaswamy et al., 2004). Conditions such as asthma, pneumonia and COPD, among others, are anomalies that may cause unusual sounds. Consequently, much research has been done to automate the detection and classification of respiratory sounds for the diagnosis of diseases (Pramono et al., 2017).
Automated classification of respiratory sounds has been studied by several researchers in recent years; automated respiratory analysis has the potential to detect patient breathing anomalies and thereby significantly increase the effectiveness of decision making (Rocha et al., 2018b, 2019).
Convolutional neural networks (CNNs) have proven to be not only very effective in image classification but also usable for classifying soundtracks, with various CNN architectures, as demonstrated by Hershey et al. (2017).
3 DATASET DESCRIPTION
In this paper, we used the dataset of the ICBHI 2017 Challenge (Rocha et al., 2018a). This database of breathing sounds was originally compiled to support the scientific challenge organized at the International Conference on Biomedical and Health Informatics (ICBHI 2017). The database was created by two research teams in Portugal and Greece and includes 920 recordings acquired from 126 individuals, with a total duration of 5.5 hours. A total of 6898 respiratory cycles were recorded, of which 3642 have no anomaly, 1864 contain crackles, 886 contain wheezes and 506 contain both crackles and wheezes.
Recordings were collected with heterogeneous equipment and their duration ranged from 10 to 90 seconds. The chest locations from which the recordings were acquired were also provided. The data include clean breathing sounds as well as noisy recordings that simulate real-life conditions, with sounds collected from seven chest locations. Patients with lower and upper respiratory tract infections, COPD, asthma and bronchiectasis were included. Sounds were collected in clinical and non-clinical environments (patients' homes). Patients cover all age groups: children, adults and the elderly.
The respiratory sound events in the database were annotated by three experienced physicians: two specialized pulmonologists and one cardiologist (Rocha et al., 2018b, 2019).
4 MATERIALS AND METHODS
In this section we present the methods used in this article and describe the metrics used to evaluate the performance of the implemented neural network. The diagram in Figure 1 shows the main constituent parts of the method. It is composed of three main modules: data pre-processing, training and testing of the CNN, and finally analysis of its performance with the metrics chosen for the present work.
Figure 1: Structure of the system.
4.1 Data Pre-processing
During pre-processing, the audio recordings are windowed into 5-second clips, i.e. the audio is cut into segments (Tzanetakis and Cook, 2002); when necessary, segments are padded with zeros so that all segments have the same size. With this method, the number of samples of each class available for CNN training can be increased. The resulting quantity of each class was as follows: crackles with 6415 samples, wheezes with 7488 samples, the both (crackles and wheezes) class with 732, and finally the class with no respiratory abnormality with 6850 samples.
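As an illustration, the windowing step can be sketched as follows. This is a minimal sketch, not the authors' exact implementation: it assumes librosa for audio loading, a fixed sampling rate, and hypothetical function names.

import numpy as np
import librosa

SAMPLE_RATE = 22050            # assumed sampling rate
CLIP_SECONDS = 5
CLIP_LEN = SAMPLE_RATE * CLIP_SECONDS

def window_audio(path):
    """Cut a recording into 5-second clips, zero-padding the last one."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE)
    clips = []
    for start in range(0, len(audio), CLIP_LEN):
        clip = audio[start:start + CLIP_LEN]
        if len(clip) < CLIP_LEN:               # pad the final segment with zeros
            clip = np.pad(clip, (0, CLIP_LEN - len(clip)))
        clips.append(clip)
    return clips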
4.1.1 Feature Extraction
The next step is to extract the features used to train our model. To do this, a visual representation of each audio sample is produced so that classification features can be identified using the same techniques used to classify images with high precision (Perna, 2018). For this we used the technique known as Mel Frequency Cepstral Coefficients (MFCCs) (Shirali-Shahreza and Shirali-Shahreza, 2010). For each audio file in the dataset, features were extracted with MFCC, which means we obtain an image representation of each audio sample. In this way, the classifier can be trained on these images.
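A minimal sketch of this extraction step is shown below, assuming librosa's MFCC implementation. The number of coefficients and the resizing step are assumptions, since the paper only states the final image size of 224x140.

import cv2   # hypothetical choice of library for resizing
import librosa
import numpy as np

def extract_mfcc_image(clip, sr=22050, n_mfcc=40):
    """Compute an MFCC matrix for one clip and resize it to the paper's image size."""
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc)
    # cv2.resize takes (width, height); 224x140 as reported in the paper.
    return cv2.resize(mfcc, (224, 140)).astype(np.float32)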
Spectrograms are a useful technique for visualizing the frequency spectrum of a sound and how it varies over very short periods of time (Jeffery et al., 2018). The main difference is that a spectrogram uses a linearly spaced frequency scale (so that each frequency bin spans an equal number of Hertz), while an MFCC uses a quasi-logarithmically spaced frequency scale, which is more similar to the way the human auditory system processes sounds (Shirali-Shahreza and Shirali-Shahreza, 2010).
Figure 2: Time-domain representation of a sound sample (class wheezes).
In Figure 2 it is possible to visualize one of the dataset's sound samples in its raw state, represented in the time domain as amplitude over time. Figures 3, 4, 5 and 6, in turn, show audio samples after the feature-extraction technique, the MFCC, which is similar to a spectrogram but with more distinct details; in accordance with the image classification setting, the size used after the MFCC was 224x140.
Figure 3: MFCC of the none class.
Figure 4: MFCC of the crackles class.
Figure 5: MFCC of the wheezes class.
Figure 6: MFCC of the both class.
4.2 Evaluation Metrics
The final precision of the model is estimated by Equation 1, where $Ac_f$ is the sum of the differences between the actual value $y_i$ and the expected value $\hat{y}_i$; with this it is possible to infer the generalization of the network.

$$Ac_f = \sum_{i=1}^{k} \left( y_i - \hat{y}_i \right) \qquad (1)$$
As a statistical tool, we have the confusion matrix, which provides the basis for describing classification accuracy and characterizing errors, helping to refine accuracy. The confusion matrix is a square matrix of numbers arranged in rows and columns that express the number of sample units assigned to a given category, inferred by a decision rule, compared with the actual category (Saraiva et al., 2018).
The measurements derived from the confusion matrix are: total accuracy, which was chosen for the present work, individual class precision, producer precision, user precision and the Kappa index, among others.
Total accuracy is calculated by dividing the sum of the main diagonal of the error matrix, $x_{ii}$, by the total number of samples collected, $n$, according to Equation 2.

$$T = \frac{\sum_{i=1}^{a} x_{ii}}{n} \qquad (2)$$
Precision and recall, represented by Equations 3 and 4, are also used as statistical tools to evaluate model performance.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

The F1 score is a simple metric that takes both precision and recall into account: it is simply the harmonic mean of precision and recall (Suominen et al., 2008).

$$F1\,Score = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$
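For illustration, a minimal sketch of how these metrics (Equations 2-5) can be computed from a confusion matrix with NumPy; the function name is hypothetical and not part of the paper's implementation.

import numpy as np

def metrics_from_confusion(cm):
    """Per-class precision, recall, F1 and total accuracy from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j.
    """
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)                       # TP / (TP + FP), Eq. 3
    recall = tp / cm.sum(axis=1)                          # TP / (TP + FN), Eq. 4
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean, Eq. 5
    total_accuracy = tp.sum() / cm.sum()                  # diagonal over total, Eq. 2
    return precision, recall, f1, total_accuracy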
4.3 Neural Network Training and Architecture
According to Shahin et al. (2004), to train a machine learning model it is necessary to divide the data into two sets (training and testing). The training set is the data sample used to fit the model, i.e. the data the model sees and learns from (Krawczyk, 2016). The test set, in turn, is the data sample used to provide an unbiased evaluation of the model fitted on the training set, after the model hyperparameters have been adjusted (Krawczyk, 2016).
To perform the neural network training, the dataset was divided into training and testing sets, with 70% of each class used for training and 30% used for testing. Thus, there were 15,039 training samples and 6,445 test samples.
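A minimal sketch of such a per-class (stratified) 70/30 split, assuming scikit-learn and hypothetical array names x (MFCC images) and y (class labels):

from sklearn.model_selection import train_test_split

# stratify=y keeps the 70/30 proportion within each of the four classes
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, stratify=y, random_state=42)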
Figure 7: Neural network architecture.
Figure 7 illustrates the proposed network architecture for the sound classification task. All convolution layers apply 2D convolution, each with 32 kernels of size 5. Max pooling with size 5 and stride 2 is also used in all pooling layers. The predictor network consists of 13 residual blocks followed by four fully connected layers with 1024, 512, 256 and 4 neurons, respectively, and a softmax layer to predict the output class (Saraiva et al., 2019a,b).
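A minimal TensorFlow/Keras sketch of an architecture in this spirit is given below; the internal composition of each residual block and the exact placement of the pooling layers are assumptions, as the paper does not fully specify them.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4  # normal, crackles, wheezes, both

def residual_block(x):
    """One residual block; two 5x5 convolutions with 32 kernels (assumed layout)."""
    shortcut = x
    y = layers.Conv2D(32, 5, padding="same", activation="relu")(x)
    y = layers.Conv2D(32, 5, padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_model(input_shape=(224, 140, 1)):
    inputs = layers.Input(shape=input_shape)
    # Entry convolution to reach 32 channels for the residual shortcuts.
    x = layers.Conv2D(32, 5, padding="same", activation="relu")(inputs)
    for _ in range(13):  # 13 residual blocks, as stated in the paper
        x = residual_block(x)
        # Pooling size 5, stride 2; placing one after each block is an assumption.
        x = layers.MaxPooling2D(pool_size=5, strides=2, padding="same")(x)
    x = layers.Flatten()(x)
    for units in (1024, 512, 256):  # fully connected head
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)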
For comparison purposes, the neural network was implemented in two ways, i.e. two tests with different hyperparameters were performed (Table 1). As the loss function, the cross-entropy loss on the softmax output is used. To train the model, the Adam optimization method (Kingma and Ba, 2014) is used, with a learning rate of 0.0001 for test 1 and 0.001 for test 2.
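The corresponding training setup might look as follows; this is a sketch for test 1, assuming the build_model function above and the hypothetical array names from the split step.

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # test 1
    loss="categorical_crossentropy",  # cross entropy on the softmax output
    metrics=["accuracy"],
)
# Labels assumed one-hot encoded; use sparse_categorical_crossentropy
# instead if y_train holds integer class indices.
model.fit(x_train, y_train, batch_size=128, epochs=100,
          validation_data=(x_test, y_test))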
For the implementation of the neural network, the TensorFlow library (Abadi et al., 2016) is used. Processing was performed using a GeForce GTX 1060 graphics card with 1280 CUDA cores, 6 GB of dedicated memory, 12 GB of RAM and a fourth-generation Core i5 processor.
5 RESULTS AND DISCUSSION
This section presents and discusses the results obtained at each stage of the development of this article. A comparison between the neural network implementation tests is provided in Table 1, and a comparison is also made with the works of Shirali-Shahreza and Shirali-Shahreza (2010) and Ntalampiras and Potamitis (2019). The performance of the proposed method is reported using the metrics of Section 4.2.
In Shirali-Shahreza and Shirali-Shahreza (2010), a sound classification method using the same database as this article is presented, in which a CNN architecture is implemented; however, only binary classification is performed, which eases the task and yields an average accuracy of 79%.
Table 1: Neural network training hyperparameters and accuracy.
Test Learning Rate Optimiser Batch Size Epochs Training time Accuracy
1 0.0001 Adam 128 100 160 min 74.3%
2 0.001 Adam 200 200 330 min 72.0%
With only binary classification, however, it is not possible to exploit all the features of the database. In Ntalampiras and Potamitis (2019), in turn, a sound classification method based on hidden Markov models is developed; the same database was used, and the four types of sounds present in the database (normal, crackles, wheezes, and both crackles and wheezes) were classified, but the results were not satisfactory, reaching on average only 64%.
Table 2: Results of the metrics used to evaluate the performance of neural network test 1.
Class Recall Precision F1 Score Samples
None 90.0% 74.3% 81.1% 757
Crackles 61.2% 76.5% 67.6% 375
Wheezes 55.6% 71.7% 62.4% 184
Both 39.4% 72.2% 50.5% 109
Table 3: Results of the metrics used to evaluate the performance of neural network test 2.
Class Recall Precision F1 Score Samples
None 81.0% 78.1% 80.3% 757
Crackles 57.4% 64.3% 60.5% 375
Wheezes 60.3% 56.2% 58.7% 184
Both 53.7% 51.4% 52.8% 109
As mentioned in Section 4.3, two implementation tests were performed for the neural network. The changes in each test can be analyzed in Table 1, while Figures 8 and 9 show the confusion matrices of each test. With this it is possible to identify which implementation performed best for sound classification; the training history of test 1 can also be analyzed in Figure 10.
In Tables 2 and 3, the metric results for each class can be analyzed; as can be seen, the model performed well even with the unbalanced dataset. In Table 1, the accuracy values for testing can be analyzed, with test 1 performing better with respect to this metric. The results were satisfactory compared with Shirali-Shahreza and Shirali-Shahreza (2010) and Ntalampiras and Potamitis (2019). The method proposed in this article obtained results above 74% in the classification of respiratory sounds across the four classes available in the database used.
Figure 8: Confusion matrix test 1.
Figure 9: Confusion matrix test 2.
Figure 10: Training progression test 1.
6 CONCLUSION
Based on the methodology of this paper, a convolutional neural network with a deep learning framework was developed that integrates pre-processing based on 5-second windowed audio clips for better classification of breathing sounds: normal, wheezing, crackling and both (wheezing and crackling).
The article was divided into three main parts: we described the network architecture as well as the crucial phases of pre-processing and classification. The performance results obtained suggest that CNNs are a viable tool for detecting specific characteristics in respiratory data and are capable of accurately classifying respiratory sounds inside and outside of laboratory environments. This article is expected to inspire and enable further research in the analysis of respiratory sounds.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., and Wilson, K. (2017). CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135.
Jeffery, T., Cunningham, S., and Whiteside, S. P. (2018). Analyses of sustained vowels in Down syndrome (DS): a case study using spectrograms and perturbation data to investigate voice quality in four adults with DS. Journal of Voice, 32(5):644.e11.
Kandaswamy, A., Kumar, C. S., Ramanathan, R. P., Jayaraman, S., and Malmurugan, N. (2004). Neural classification of lung sounds using wavelet coefficients. Computers in Biology and Medicine, 34(6):523–537.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krawczyk, B. (2016). Learning from imbalanced data: open
challenges and future directions. Progress in Artificial
Intelligence, 5(4):221–232.
Morillo, D. S., Moreno, S. A., Granero, M. Á. F., and Jiménez, A. L. (2013). Computerized analysis of respiratory sounds during COPD exacerbations. Computers in Biology and Medicine, 43(7):914–921.
Moussavi, Z. (2006). Fundamentals of respiratory sounds and analysis. Synthesis Lectures on Biomedical Engineering, 1(1):1–68.
Naydenova, E. (2018). Machine learning for childhood pneumonia diagnosis. PhD thesis, University of Oxford.
Ntalampiras, S. and Potamitis, I. (2019). Classification of sounds indicative of respiratory diseases. In Engineering Applications of Neural Networks (EANN 2019), pages 93–103. Springer.
Pasterkamp, H., Kraman, S. S., and Wodicka, G. R. (1997). Respiratory sounds: advances beyond the stethoscope. American Journal of Respiratory and Critical Care Medicine, 156(3):974–987.
Perna, D. (2018). Convolutional neural networks learning from respiratory data. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2109–2113. IEEE.
Perna, D. and Tagarelli, A. (2019). Deep auscultation: Predicting respiratory anomalies and diseases via recurrent neural networks. In 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), pages 50–55. IEEE.
Pramono, R. X. A., Bowyer, S., and Rodriguez-Villegas, E. (2017). Automatic adventitious respiratory sound analysis: A systematic review. PLOS ONE, 12(5):1–43.
Rocha, B., Filos, D., Mendes, L., Vogiatzis, I., Perantoni, E., Kaimakamis, E., Natsiavas, P., Oliveira, A., Jácome, C., Marques, A., et al. (2018a). A respiratory sound database for the development of automated classification. In Precision Medicine Powered by pHealth and Connected Health, pages 33–37. Springer.
Rocha, B. M., Filos, D., Mendes, L., Serbes, G., Ulukaya, S., Kahya, Y. P., Jakovljevic, N., Turukalo, T. L., Vogiatzis, I. M., Perantoni, E., Kaimakamis, E., Natsiavas, P., Oliveira, A., Jácome, C., Marques, A., Maglaveras, N., Paiva, R. P., Chouvarda, I., and de Carvalho, P. (2019). An open access database for the evaluation of respiratory sound classification algorithms. Physiological Measurement, 40(3):035001.
Rocha, B. M., Filos, D., Mendes, L., Vogiatzis, I., Perantoni, E., Kaimakamis, E., Natsiavas, P., Oliveira, A., Jácome, C., Marques, A., Paiva, R. P., Chouvarda, I., Carvalho, P., and Maglaveras, N. (2018b). A respiratory sound database for the development of automated classification. In Precision Medicine Powered by pHealth and Connected Health, pages 33–37. Springer.
Saraiva, A., Ferreira, N., Sousa, L., Carvalho da Costa, N., Sousa, J., Santos, D., and Soares, S. (2019a). Classification of images of childhood pneumonia using convolutional neural networks. In 6th International Conference on Bioimaging, pages 112–119.
Saraiva, A., Melo, R., Filipe, V., Sousa, J., Ferreira, N. F., and Valente, A. (2018). Mobile multirobot manipulation by image recognition.
Saraiva, A. A., Santos, D. B. S., Costa, N. J. C., Sousa, J. V. M., Ferreira, N. M. F., Valente, A., and Soares, S. F. S. P. (2019b). Models of learning to classify X-ray images for the detection of pneumonia using neural networks. In BIOIMAGING.
Shahin, M. A., Maier, H. R., and Jaksa, M. B. (2004).
Data division for developing neural networks applied
to geotechnical engineering. Journal of Computing in
Civil Engineering, 18(2):105–114.
Shirali-Shahreza, M. H. and Shirali-Shahreza, S. (2010). Effect of MFCC normalization on vector quantization based speaker identification. In IEEE International Symposium on Signal Processing and Information Technology, pages 250–253. IEEE.
Suominen, H., Ginter, F., Pyysalo, S., Airola, A., Pahikkala, T., Salanterä, S., and Salakoski, T. (2008). Machine learning to automate the assignment of diagnosis codes to free-text radiology reports: a method description. In Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications.
Tzanetakis, G. and Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302.