Automatic Acoustic Diagnosis of Heartbeats
Simone Mastrangelo and Stavros Ntalampiras
Department of Computer Science, University of Milan, via Celoria 18, Milan, Italy
Keywords:
Heartbeat Classification, Heartbeat Features Extraction, Heart Sounds, Machine Learning.
Abstract:
Automatic identification of heart irregularities based on the respective acoustic emissions is a relevant research field which has received ever-increasing attention in recent years. Devices such as digital stethoscopes and smartphones can record heartbeat sounds and are easily accessible, making this method particularly appealing. This paper presents different automatic procedures to classify heartbeat sounds coming from such devices into five different labels: normal, murmur, extra heart sound, extrasystole and artifact, so that even people without medical knowledge can detect heart irregularities. The data used in this paper come from two different datasets.
The first dataset is collected through an iPhone application whereas the second one is collected from a digital
stethoscope. To be able to classify heartbeat sounds, time and frequency domain features are extracted and
modeled by different machine learning algorithms, i.e. k-NN, random forest, SVM and ANNs. We report the
achieved performances and a thorough comparison.
1 INTRODUCTION
The leading cause of global death is represented by cardiovascular diseases (CVDs), which are estimated to cause 17.9 million deaths every year. One third of these deaths occur prematurely in people under 70 years old¹. It is of fundamental importance to be able
to promptly identify the symptoms of these diseases
in order to ensure the patient the most suitable med-
ical care and avoid possible premature death. How-
ever, medical experts and physicians may not always
be available to provide an accurate diagnosis (Schnei-
derman, 2001; Roy et al., 2002). At the same time,
tools such as smartphones and digital stethoscopes are
easily accessible and can quickly provide a first evaluation for detecting possible CVDs, even when used by people without specific medical knowledge; moreover, as in the case of digital stethoscopes, they can support medical staff in facilitating the diagnosis. The aim of this
study is to create machine learning models that can
autonomously identify a CVD using heartbeat sounds
from applications and digital stethoscopes. The data
come from a challenge (Bentley et al., 2011) in which
there are two distinct datasets: the first contains data
collected by iStethoscope Pro, an iPhone app that al-
lows the user to record the sound of their heartbeat,
while the second contains recordings of heartbeat
¹ World Health Organization, Cardiovascular Diseases, https://www.who.int/health-topics/cardiovascular-diseases
sounds coming from DigiScope, a digital stethoscope.
These data are divided into 5 categories: normal, murmur, extra heart sound, artifact and extrasystole.
From an audio analysis point of view, a heartbeat
includes two sound events: the first (S1 or lub) marks
the beginning of a systole (the contraction movement
of the myocardium), while the second (S2 or dub)
marks the end of systole and the beginning of dias-
tole (the relaxation phase after contraction). The beat
of a healthy heart is formed by the succession of S1
and S2 sounds, i.e. the succession of systole and diastole. There may be two other sounds, S3 and S4, called extra heart sounds, which are not part of the normal heart sound. They can be found either individually or together and are typically located between S2 and S1. They are not necessarily a sign of a specific disease; nonetheless, they can reveal different clinical conditions². Murmurs are
other types of heartbeat sounds that may appear dur-
ing auscultation; they arise from the flow of blood
within the heart or large vessels and can be caused
by structural abnormalities of the heart or by an in-
crease in blood flow. They are classified according
to their occurrence within the normal cardiac cycle,
so they can be systolic, diastolic or continuous. Sys-
tolic murmurs are not necessarily a sign of disease
² University of Washington School of Medicine, Technique: Heart Sounds & Murmurs, https://depts.washington.edu/physdx/heart/tech.html
Figure 1: Representative Mel-spectrograms and constant-Q transforms extracted from the normal, murmur, extra heart sound, extrasystole and artifact classes existing in datasets A and B.
and are often perceived in patients with normal heart
structure. The diastolic and continuous murmurs, on
the other hand, always indicate a structural disease
(Davey et al., 2018b). An extrasystole is a premature heart impulse that is not part of the normal cardiac cycle. Most frequently, extrasystoles originate from the ventricles, taking the name of ventricular extrasystoles or premature ventricular complexes; less frequently, they originate from the atria, the atrioventricular junction or, rarely, from the sinus node. Extrasystoles can appear after every second or third beat (Davey et al.,
2018a). The following section describes the existing
works in the area of automatic classification of heart-
beat sounds.
2 RELATED WORK
Several studies have been conducted designing methods able to correctly identify cardiovascular diseases; many of them have used data from phonocar-
diograms (PCGs), i.e. a plot of the heartbeat sound
generated by a phonocardiograph which is accurate
but, at the same time, also expensive (Ntalampiras,
2020). In (Oh et al., 2020), this type of data is used with satisfactory results: PCG signals are employed to train a deep WaveNet model, an artificial neural network composed of several layers of neurons, obtaining a training accuracy of 97%.
There is a series of studies using data coming from
recordings made from smartphones or digital stetho-
scopes, where various audio feature extraction tech-
niques have been used towards training different ma-
chine learning methods. In (Chao et al., 2018) the
following classifiers were used: Naive Bayes, support
vector machine (SVM), Decision Trees, AdaBoost,
Random Forest and Gradient Boosting, obtaining an f1-score of 71.37% with the SVM on data from smartphone recordings and an f1-score of 71.26% with a random forest on data from digital stethoscope recordings. Further improve-
ments were achieved for heartbeat sound recordings
from the digital stethoscope thanks to the use of recur-
rent neural networks reaching an accuracy of 80.8%
(Raza et al., 2019).
3 AUDIO FEATURES
From these two datasets several audio features are extracted: energy, zero-crossing rate, spectral rolloff, spectral centroid, chroma vector, mel-frequency cepstral coefficients (MFCCs), mel spectrogram and constant-Q
transform. These features are then used for training
different classifiers, and finally, the results are com-
pared in terms of precision, recall and f1-score eval-
uation metrics. It should be mentioned that for the
extraction we used Librosa, a Python package for pro-
cessing audio and music signals (McFee et al., 2015).
During the feature extraction process, the signal
is divided into equal-sized frames of 25ms with an
overlap of 50%. The extraction can take place in the
time domain or in the frequency domain; in the latter
case, the Discrete Fourier Transform (DFT) is used to
generate the spectrogram of the signal.
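To make the framing concrete, the following is a minimal sketch of how such a 25 ms / 50%-overlap analysis could be set up with Librosa; the file path, the 4000 Hz sampling rate and the variable names are illustrative assumptions rather than the authors' exact code.

```python
import numpy as np
import librosa

# Illustrative framing set-up (not the authors' exact code).
y, sr = librosa.load("heartbeat.wav", sr=4000)   # hypothetical file; 4000 Hz as used later in the paper
frame_length = int(0.025 * sr)                   # 25 ms frames -> 100 samples at 4000 Hz
hop_length = frame_length // 2                   # 50% overlap between consecutive frames

# Frequency-domain features start from the magnitude spectrogram, i.e. the DFT of each frame.
spectrogram = np.abs(librosa.stft(y, n_fft=frame_length, hop_length=hop_length))
```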
Signal energy is calculated for each frame using
the root mean square. Zero crossing rate indicates
the number of times the signal changes sign i.e. goes
from positive to negative or negative to positive, di-
vided by the length of the frame. Spectral rolloff is
the frequency below which a certain percentage of the
magnitude distribution is concentrated. Spectral cen-
troid is the center of gravity of the signal spectrum.
Chroma vector is a twelve-element representation of
the spectral energy and is calculated by grouping the
DFT coefficients of a short-term window into twelve
bins. Each bin represents one of the twelve tones of
Western music. MFCCs comprise a short-term power
spectrum signal representation, where the frequency
bands are distributed according to a Mel-scale instead
of the linearly-spaced approach. This type of feature
is widely used in the field of audio analysis due to its
discriminating power (Ntalampiras, 2016). The Mel-
scale filter bank maps the powers of the spectrum us-
ing triangular overlapping windows. The Mel-scale warps the frequency axis in order to conform to the human auditory system, which more easily distinguishes differences in the low-frequency region (Giannakopoulos and Pikrakis, 2014). Representative Mel-spectrograms and constant-Q transforms characterizing the employed datasets are illustrated in Fig. 1. Last but not least, the calculation of the constant-Q transform is similar to that of the Fourier transform, with the difference that it maintains a constant ratio between center frequency and frequency resolution. As such, the harmonics of a sound form a constant pattern along the (logarithmically spaced) frequency axis, unlike the standard DFT, where the spacing between analysis frequencies is constant (Brown, 1991).
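As an illustration of how these descriptors could be obtained with Librosa, the sketch below computes the listed features under stated assumptions: the frame settings follow the 25 ms / 50% overlap described above, while the number of mel bands, the number of MFCCs and the CQT range are our own illustrative choices, not details reported in the paper.

```python
import numpy as np
import librosa

def compute_features(y, sr=4000, frame=100, hop=50):
    """Illustrative feature computation; parameter values are assumptions, not the authors' settings."""
    return {
        "rms_energy":        librosa.feature.rms(y=y, frame_length=frame, hop_length=hop),
        "zero_crossing":     librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop),
        "spectral_rolloff":  librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=frame, hop_length=hop),
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame, hop_length=hop),
        "chroma":            librosa.feature.chroma_stft(y=y, sr=sr, n_fft=frame, hop_length=hop),
        "mfcc":              librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame,
                                                  hop_length=hop, n_mels=32),
        # The two representations below are fed only to the CNNs (see Section 4).
        "log_mel":           librosa.power_to_db(
                                 librosa.feature.melspectrogram(y=y, sr=sr, n_fft=frame,
                                                                hop_length=hop, n_mels=32)),
        "cqt":               np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=48)),
    }
```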
4 CLASSIFICATION
TECHNIQUES
Aiming at evaluating the performance of diverse clas-
sifiers on the present problem, we used the follow-
ing five techniques: k-nearest neighbors (k-NN), ran-
dom forest (RF), support-vector machines (SVM),
and artificial neural networks (ANNs), including convolutional neural networks (CNNs).
k-NN. This classifier has been widely applied to audio classification problems, such as the discrimination of an audio stream into speech, music, ambient sound and silence (Lu et al., 2001). The k-NN
implementation used for this project is that of scikit-
learn (Pedregosa et al., 2011); the algorithm gener-
ates several k-NN instances in order to search for the
optimal number of neighbors and ultimately, the one
offering the highest f1-score is chosen.
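A minimal sketch of this neighbor-size search is given below, assuming scikit-learn and hypothetical arrays X_train/y_train and X_val/y_val holding the extracted features and labels; the candidate range of k and the f1 averaging mode are our assumptions.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# X_train/y_train and X_val/y_val are hypothetical names for the extracted features and labels.
best_k, best_f1, best_knn = None, -1.0, None
for k in range(1, len(X_train) + 1):             # all admissible neighborhood sizes
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    f1 = f1_score(y_val, knn.predict(X_val), average="weighted")  # averaging mode is an assumption
    if f1 > best_f1:
        best_k, best_f1, best_knn = k, f1, knn   # keep the instance with the highest validation f1-score
```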
Random Forest. This is a popular classifier which consists of an ensemble of decision trees whose outcomes are averaged in order to improve the overall
accuracy. In this work, the number of used decision
trees is equal to 200.
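For reference, a random forest with 200 trees can be instantiated in scikit-learn as follows; apart from the number of trees stated above, the remaining hyperparameters are assumed to stay at their defaults.

```python
from sklearn.ensemble import RandomForestClassifier

# 200 trees as stated above; all other hyperparameters are left at scikit-learn defaults (an assumption).
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
```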
Support Vector Machine. This is another popular
classifier used in audio analysis (Lu et al., 2003);
SVM aims at discovering the optimal hyperplane sep-
arator that minimizes the classification error on a val-
idation set of data. The radial basis function was
employed as a kernel during the SVM learning (Zhu
et al., 2007).
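A corresponding scikit-learn sketch with an RBF kernel is shown below; the regularization parameter C and the kernel width gamma are left at their defaults, which is an assumption on our part.

```python
from sklearn.svm import SVC

# RBF kernel as stated; C and gamma are left at their scikit-learn defaults (an assumption).
svm = SVC(kernel="rbf")
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
```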
Artificial Neural Network. ANNs are structures
inspired by the animal and human brain and they have
been quite successful in diverse applications such as
commerce, industry and finance (Kruse et al., 2016).
ANNs have been shown to produce good results in
the classification of abnormal heart sounds using data coming from phonocardiograms (Ari and Saha, 2009).
The model used in this project is a multilayer percep-
tron network with three hidden layers trained with the
standard version of the back-propagation algorithm
(Rojas, 1996).
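A possible realization with scikit-learn's MLPClassifier is sketched below; the three hidden-layer widths, the plain SGD solver (standing in for standard back-propagation) and the iteration budget are illustrative assumptions.

```python
from sklearn.neural_network import MLPClassifier

# Three hidden layers as described; widths, solver settings and epochs are illustrative assumptions.
ann = MLPClassifier(hidden_layer_sizes=(128, 64, 32), solver="sgd",
                    learning_rate_init=0.01, max_iter=500, random_state=0)
ann.fit(X_train, y_train)
y_pred = ann.predict(X_test)
```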
Convolutional Neural Network. CNNs are employed following their recent success in audio pattern recognition applications (Purwins et al., 2019; Ntalampiras, 2020). Here, two different methods have been used: the first model uses the same features as the ANN. The
net is composed of three convolutional layers acti-
vated by a rectified linear unit (ReLU). Each convo-
lutional layer is followed by one max-pooling layer
and one dropout layer. The last two hidden layers are
standard fully-connected ones, while the output layer
employed softmax as the activation function. For the
second method, the already computed features log-
mel spectrogram and constant-Q transform have been
used as inputs; these were downsampled to 177×44 pixels and converted to RGB. It should be mentioned that the two features are evaluated separately. For each model, hyperparameter tuning is based on ran-
dom search, while maximizing the obtained accuracy
comprises the overall objective. Finally, the CNN's structure is similar to that of the first method, with the only differences being the input shape, which fits the input image, and the hyperparameters.
It should be mentioned that log-mel spectrogram
and constant-Q transform were employed only dur-
ing CNN training as dictated by the related literature
(Purwins et al., 2019).
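The following Keras sketch illustrates the described topology for the spectrogram-image variant (three convolution/ReLU blocks, each followed by max-pooling and dropout, two fully-connected hidden layers and a softmax output); the filter counts, kernel sizes, dropout rates, dense widths and optimizer are illustrative placeholders rather than the values found by the random search.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(177, 44, 3), n_classes=4):
    """Spectrogram-image CNN sketch (4 classes in dataset A, 3 in dataset B); values are assumptions."""
    model = models.Sequential([
        # three convolutional blocks: Conv2D + ReLU -> max-pooling -> dropout
        layers.Conv2D(16, (3, 3), padding="same", activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),   # two fully-connected hidden layers
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```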
5 EXPERIMENTAL SET-UP AND
RESULTS
5.1 Data
The present set of data originates from a challenge
where the purpose is to classify heartbeat sounds in
Table 1: The obtained results for all considered classifiers when applied on Dataset A. The highest rate per figure of merit is
emboldened.
Classifier Precision Recall f1-score
k-NN 0.86 0.80 0.82
Random forest 0.81 0.75 0.73
SVM 0.77 0.75 0.73
ANN 0.79 0.78 0.76
CNN+MFCC 0.78 0.67 0.69
CNN+log-mel spectrogram 0.79 0.67 0.65
CNN+constant-Q transform 0.57 0.58 0.55
order to promptly help diagnose CVDs (Bentley et al.,
2011). Two different datasets are available: in the first
(dataset A) the data were gathered using iStethoscope
Pro, an iPhone app, while in the second (dataset B)
the data were gathered in a hospital setting during a
clinical trial using a digital stethoscope called DigiS-
cope. Dataset A contains 124 audio files sampled at
44100 Hz while dataset B contains 312 audio files
sampled at 4000 Hz. The files are labeled with the
normal and murmur classes for both datasets, while
dataset A additionally contains the artifact (various
types of sounds unrelated to the heartbeat) and extra heart sound classes. Interestingly, dataset B con-
tains the extrasystole class as well. It should be noted
that all classifiers operated on identical test, valida-
tion, and train sets of data so as to obtain a reliable
comparison. The division was 70% for training, 10%
for validation and 20% for testing.
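Such a split can be reproduced, for instance, with scikit-learn as sketched below; stratification by class and the fixed random seed are our own assumptions.

```python
from sklearn.model_selection import train_test_split

# 70/10/20 split as stated; stratification by class and the fixed seed are assumptions.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20,
                                                stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.125,  # 12.5% of 80% = 10% overall
                                                  stratify=y_tmp, random_state=0)
```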
5.2 Preprocessing
This phase precedes the feature extraction phase and
serves to prepare the dataset so that the data are pro-
cessed effectively by the various machine learning
models. Following the challenge’s guidelines, audio
files lasting less than two seconds have been elim-
inated as they could not represent a full heartbeat cycle. Subsequently, files coming from dataset A were downsampled to 4000 Hz, while the files belonging to both datasets were trimmed to the same
length. It should be mentioned that standard normal-
ization techniques including mean removal and vari-
ance scaling (z = (x − µ)/σ, where µ is the mean and σ is the standard deviation of the training samples) have been applied on the extracted
features before being used by the classifiers (Ntalam-
piras, 2021).
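A condensed sketch of this preprocessing pipeline is given below, assuming Librosa and scikit-learn; the common clip length and the variable names are illustrative, since the exact trimming length is not reported.

```python
import librosa
from sklearn.preprocessing import StandardScaler

TARGET_SR = 4000                             # dataset A is downsampled to dataset B's rate
MIN_DURATION = 2.0                           # clips shorter than two seconds are discarded
TARGET_LEN = int(MIN_DURATION * TARGET_SR)   # common length in samples (illustrative choice)

def load_clip(path):
    y, _ = librosa.load(path, sr=TARGET_SR)  # resampling happens on load
    if len(y) < MIN_DURATION * TARGET_SR:
        return None                          # too short to contain a full heartbeat cycle
    return y[:TARGET_LEN]                    # trim every clip to the same length

# Mean removal and variance scaling fitted on the training portion only;
# X_train, X_val, X_test are hypothetical names for the extracted feature matrices.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```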
5.3 Results
To thoroughly measure the classifiers' performances,
we employed standardized figures of merit, i.e. pre-
cision, recall and f1-score. Table 1 includes the re-
sults with respect to dataset A. In this case, the best-
performing classifier is k-NN which offers the high-
est scores for precision, recall and f1-score (k = 120).
More in detail, the artifact and murmur classes are identi-
fied with significantly higher rates with respect to the
rest of classes, having an f1-score of 0.91 and 0.90
respectively. Interestingly, the present approach out-
performs the state of the art reported in (Chao et al.,
2018), where the obtained f1-score is equal to 0.71³.
The remaining classifiers offer similar performances,
managing to better classify data associated with ar-
tifact and murmur. Unfortunately, the performances associated with the deep CNNs are worse than those of the shallow approaches, although they remain better for the artifact and murmur classes. This could be an indication that
more data is needed in order to allow the deep net-
work to learn the distributions associated with the spe-
cific classes of heartbeat sounds.
Table 2 illustrates the results for every classifier
when applied on dataset B. In general, we see that the classifiers performed more poorly in this case, while there is
no winning classifier for all figures of merit. The ran-
dom forest approach performed better in recall and
f1-score, while k-NN achieved the highest precision
rate (k = 264). Overall, every classifier demonstrates
similar results in terms of correct and incorrect clas-
sifications. It should be noted that for Dataset B,
the presented rates are lower with respect to the ones
shown in (Chao et al., 2018), where f1-score is 0.71.
Moreover, extrasystole and murmur samples are often misclassified as normal. At the same time, normal samples are correctly classified, with random forest reaching an f1-score of 0.82 in this case. However, the CNN fed with the log-mel spectrogram offered the highest f1-score
(0.85) for the normal class. We conclude that the
task represented by Dataset B is more challenging due
to noises/interferences associated with sounds emit-
ted from other human organs. Similarly to dataset A,
increasing the data quantity could be particularly useful for the deep learning based solutions.
³ https://github.com/lindawangg/Classifying-Heartbeats
Table 2: The obtained results for all considered classifiers when applied on Dataset B. The highest rate per figure of merit is
emboldened.
Classifier Precision Recall f1-score
k-NN 0.64 0.69 0.60
Random forest 0.62 0.73 0.66
SVM 0.61 0.69 0.60
ANN 0.59 0.61 0.60
CNN+MFCC 0.59 0.72 0.64
CNN+log-mel spectrogram 0.58 0.72 0.64
CNN+constant-Q transform 0.60 0.68 0.57
6 CONCLUSIONS
In this paper, a great variety of machine learning
methods has been extensively evaluated on heartbeat
sound classification, with the aim being the detec-
tion of abnormalities such as extrasystole, extra heart
sound and murmurs. To this end, several tempo-
ral and spectral audio features have been exploited.
Such an automatic framework aims at supporting de-
cisions made by healthcare professionals, as well as
early diagnosis, e.g. using a smartphone, so as to
quickly check for any existing heartbeat abnormali-
ties and contact an expert physician. It was shown
that the recognition rates reached by such audio pat-
tern recognition methods differ significantly between
dataset A (smartphone) and dataset B (stethoscope).
In the first dataset, the methods achieved quite good results in distinguishing artifacts and murmurs, while in the second the results were worse, especially for the extrasystole class, which no model was able to classify correctly.
The results of the present experiments could be
primarily improved by expanding the datasets. More
specifically, it would be especially useful to have available more heartbeat samples representing the abnormal classes, i.e. extra heart sound, extrasystole and murmur. Further improvements could be obtained by correctly extracting and labeling the S1 and S2 sounds of the heartbeat and using them as additional input features for the different classifiers. From a machine
learning perspective, it would be interesting to exper-
iment with a) data augmentation methods, including
transfer learning (Ntalampiras and Potamitis, 2018),
b) modeling temporal properties of heartbeats, us-
ing e.g. temporal convolutional networks (Yan et al.,
2020), and c) employing one-shot learning techniques
(Lake et al., 2015) accommodating scarce data avail-
ability.
REFERENCES
Ari, S. and Saha, G. (2009). In search of an optimiza-
tion technique for artificial neural network to clas-
sify abnormal heart sounds. Applied Soft Computing,
9(1):330–340.
Bentley, P., Nordehn, G., Coimbra, M., and Mannor, S.
(2011). The PASCAL Classifying Heart Sounds Chal-
lenge 2011 (CHSC2011) Results. http://www.peterj
bentley.com/heartchallenge/index.html.
Brown, J. C. (1991). Calculation of a constant q spectral
transform. The Journal of the Acoustical Society of
America, 89(1):425–434.
Chao, A., Ng, S., and Wang, L. (2018). Listen to your heart:
Feature extraction and classification methods for heart
sounds.
Davey, P., Sprigings, D., El-Kadri, M., and Hart, G.
(2018a). Extrasystoles. Oxford University Press, Ox-
ford, UK.
Davey, P., Sprigings, D., Timperley, J., and Hothi, S.
(2018b). Murmur. Oxford University Press, Oxford,
UK.
Giannakopoulos, T. and Pikrakis, A. (2014). Chapter 4 -
audio features. In Giannakopoulos, T. and Pikrakis,
A., editors, Introduction to Audio Analysis, pages 59–103. Academic Press, Oxford.
Kruse, R., Borgelt, C., Braune, C., Mostaghim, S., Stein-
brecher, M., Klawonn, F., and Moewes, C. (2016).
Computational Intelligence: A Methodological Intro-
duction. Springer Publishing Company, Incorporated,
2nd edition.
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
(2015). Human-level concept learning through proba-
bilistic program induction. Science, 350(6266):1332–
1338.
Lu, L., Jiang, H., and Zhang, H. (2001). A robust audio clas-
sification and segmentation method. In Proceedings of
the ninth ACM international conference on Multime-
dia, pages 203–211.
Lu, L., Zhang, H.-J., and Li, S. Z. (2003). Content-based au-
dio classification and segmentation by using support
vector machines. Multimedia systems, 8(6):482–492.
McFee, B., Raffel, C., Liang, D., Ellis, D., Mcvicar, M.,
Battenberg, E., and Nieto, O. (2015). librosa: Audio
and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, pages 18–24.
Ntalampiras, S. (2016). Automatic analysis of audiostreams
in the concept drift environment. In 2016 IEEE 26th
International Workshop on Machine Learning for Sig-
nal Processing (MLSP), pages 1–6.
Ntalampiras, S. (2020). Deep learning of attitude in chil-
dren’s emotional speech. In 2020 IEEE International
Conference on Computational Intelligence and Vir-
tual Environments for Measurement Systems and Ap-
plications (CIVEMSA), pages 1–5.
Ntalampiras, S. (2020). Identification of anomalous phono-
cardiograms based on universal probabilistic model-
ing. IEEE Letters of the Computer Society, (01):1–1.
Ntalampiras, S. (2021). Speech emotion recognition
via learning analogies. Pattern Recognition Letters,
144:21–26.
Ntalampiras, S. and Potamitis, I. (2018). Transfer learning
for improved audio-based human activity recognition.
Biosensors, 8(3):60.
Oh, S. L., Jahmunah, V., Ooi, C. P., Tan, R.-S., Ciaccio,
E. J., Yamakawa, T., Tanabe, M., Kobayashi, M., and
Acharya, U. R. (2020). Classification of heart sound
signals using a novel deep wavenet model. Computer
Methods and Programs in Biomedicine, page 105604.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., et al. (2011). Scikit-learn:
Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.,
and Sainath, T. (2019). Deep learning for audio signal
processing. IEEE Journal of Selected Topics in Signal
Processing, 13(2):206–219.
Raza, A., Mehmood, A., Ullah, S., Ahmad, M., Choi, G. S.,
and On, B.-W. (2019). Heartbeat sound signal classi-
fication using deep learning. Sensors, 19(21):4819.
Rojas, R. (1996). Neural Networks. Springer Berlin Hei-
delberg.
Roy, D., Sargeant, J., Gray, J., Hoyt, B., Allen, M., and
Fleming, M. (2002). Helping family physicians im-
prove their cardiac auscultation skills with an interac-
tive CD-ROM. Journal of Continuing Education in
the Health Professions, 22(3):152–159.
Schneiderman, H. (2001). Cardiac auscultation and teach-
ing rounds: how can cardiac auscultation be re-
suscitated? The American Journal of Medicine,
110(3):233–235.
Yan, J., Mu, L., Wang, L., Ranjan, R., and Zomaya, A. Y.
(2020). Temporal convolutional networks for the ad-
vance prediction of ENSO. Scientific Reports, 10(1).
Zhu, Y., Ming, Z., and Huang, Q. (2007). SVM-based audio
classification for content-based multimedia retrieval.
In Multimedia Content Analysis and Mining, pages
474–482. Springer Berlin Heidelberg.