Lightweight Audio-Based Human Activity Classification Using Transfer Learning
Marco Nicolini, Federico Simonetta (https://orcid.org/0000-0002-5928-9836) and Stavros Ntalampiras (https://orcid.org/0000-0003-3482-9215)
LIM Music Informatics Laboratory, Computer Science Department, University of Milan, Milan, Italy
Keywords:
Audio Pattern Recognition, Machine Learning, Transfer Learning, Convolutional Neural Network, YAMNet,
Human Activity Recognition.
Abstract:
This paper employs the acoustic modality to address the human activity recognition (HAR) problem. The
cornerstone of the proposed solution is the YAMNet deep neural network, the embeddings of which comprise
the input to a fully-connected linear layer trained for HAR. Importantly, the dataset is publicly available and
includes the following human activities: preparing coffee, frying egg, no activity, showering, using microwave,
washing dishes, washing hands, and washing teeth. The specific set of activities is representative of a standard
home environment, facilitating a wide range of applications. The performance offered by the proposed transfer
learning-based framework surpasses the state of the art, while remaining lightweight enough to run on mobile
devices such as smartphones and tablets. In fact, the obtained model has been exported and thoroughly tested for
real-time HAR on a smartphone, with the input being the audio captured by its microphone.
1 INTRODUCTION
Human Activity Recognition (HAR) is the process
of automatic detection and identification of physical
human activities (Ramanujam et al., 2021). Its applications range from health care systems offering real-time
remote tracking of patients (e.g., medical diagnosis and monitoring of elderly people) to smart-home
and safe-traveling systems, including the recognition
of criminal human activity (Ntalampiras and Roveri,
2016) and activities in natural environments (Ntalam-
piras et al., 2012).
The main issue in HAR is to leverage motion sig-
nals to classify the type of action that is ongoing. The
literature mainly focuses on motion and wearable sen-
sors (Ramanujam et al., 2021) or vision sensors (Bed-
diar et al., 2020) and has recently embraced the deep-
learning world (Chen et al., 2021). For instance,
CNNs have been applied to inertial sensor data, such as accelerometer and gyroscope readings, capturing the acceleration and the angular velocity of a body (Bevilacqua et al., 2018). In such a context, real-time HAR
has been achieved using deep learning models fed with information coming from sensors typically available
in smartphones (Ronao and Cho, 2016; Wan et al.,
2020). However, real-time audio-based HAR is still
an open subject, and this work fills exactly that gap.
While multimodality is a key aspect in the HAR
field (Chen et al., 2021) and every commercial smart-
phone device is equipped with one or more micro-
phones, this modality has not been thoroughly explored, neither in a stand-alone nor in a multimodal setting where information from multiple modalities is exploited. This work explores how existing audio-
recognition models tailored to real-time classification
can be applied to HAR.
A few previous works presented HAR models
based on the corresponding acoustic emissions. One
work focused on the feature extraction stage, consist-
ing of a selection analysis via genetic search; the pro-
posed method is tailored to low-power consumption
devices, such as smartphones, and employs Random
Forest (RF) and Neural Network (NN) models (Ri-
boni et al., 2016). Another work focuses on trans-
fer learning for data augmentation to reduce the im-
balance bias and improve generalization (Ntalampi-
ras and Potamitis, 2018). Finally, the audio chan-
nel has been used in multimodal online HAR sys-
tems (Chahuara et al., 2016).
Regarding approaches based on traditional ma-
chine learning, i.e. not based on deep neural net-
works, non-Markovian ensemble voting has been
used for robotic applications (Stork et al., 2012).
Social network analysis based on graph statistics
has been applied by constructing networks between
windows of the audio fragments and comparing
the graphs related to different activities (García-Hernández et al., 2017).
It should be mentioned that the spread of audio de-
vices listening to users’ speech, activities, etc., without transparently disclosing it, has created seri-
ous privacy concerns (Lau et al., 2018). An existing
work analyzes the impact of audio deterioration on
speech intelligibility with the hope of finding privacy-
friendly methods for HAR (Liang et al., 2020). To
the best of our knowledge, this latter path is yet to be
explored, requiring more attention from the scientific
community.
With the rise and ever-increasing adoption of
Deep Neural Networks, the problem of data avail-
ability became of primary importance. While some
datasets for audio-based HAR are available (see Sec. 2), various works have shown that using pre-trained models improves performance and reduces training costs. As such, we adopted a Transfer Learning-based strategy, which has been shown to improve the generalization abilities of the resulting mod-
els (Pan and Yang, 2010; Zhuang et al., 2021).
The contributions of this work are:
1. a proof-of-concept application for audio-based
HAR operating in real-time on commercially
available smartphones;
2. an exploration of the effectiveness of generic au-
dio classification models for HAR-specific tasks;
3. a novel extension of an existing dataset for Audio-
based HAR.
The rest of the paper is organized as follows:
the following section describes the employed dataset
and section 3 presents the proposed method. Sub-
sequently, section 4 explains the experimental set-up
and analyses the obtained results. Finally, section 5
demonstrates the developed prototype application and
section 6 provides our conclusions and directions for
future work.
2 DATASET
There are few datasets facilitating HAR based on audio data. In this work, we used an existing
dataset (Riboni et al., 2016) composed of eight classes
taken in various indoor environments, so that differ-
ent background noises and acoustic conditions are
well-represented. Specifically, the classes available
are: brewing coffee, cooking, using the microwave
oven, taking a shower, dish washing, hand washing,
teeth brushing and no activity. Since this dataset suf-
fers from imbalance issues, it was expanded to compensate for the less represented classes.
Figure 1: Total duration per each activity class in the used dataset, after expansion and balancing. Time is in seconds.
Namely, we
manually selected smartphone and low-quality microphone recordings from Freesound (https://freesound.org). The down-
loaded files were annotated using the existing direc-
tory structure of the dataset.
Since there were various file types with various
sample rates (from 8 to 64 kHz) and channels (mono or stereo), we also converted all audio files to wave
encoding format (.wav) with a mono channel and a 16 kHz sample rate. To improve the learning procedure,
silence was removed from the files using Reaper, a professional Digital Audio Workstation (https://www.reaper.fm/).
Specifically, we used the dynamic split items tool with the gate threshold set at -24 dB; hence, audio below that
threshold is considered silence and removed, except for the “No activity” class. Moreover, the standard
deviation (21 m 25 s) and average duration (24 m 45
s) of the initial dataset clearly indicated a highly im-
balanced situation.
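The conversion step described above can be sketched as follows. This is a minimal illustration assuming librosa and soundfile as tools and a hypothetical directory layout; the released code may rely on different utilities.

```python
# Sketch of the conversion to 16 kHz mono WAV (directory names are assumptions).
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 16000  # 16 kHz mono, as required by YAMNet

def convert_to_wav(src: Path, dst: Path) -> None:
    # librosa resamples to TARGET_SR and downmixes to mono on load
    audio, sr = librosa.load(str(src), sr=TARGET_SR, mono=True)
    sf.write(str(dst), audio, sr, subtype="PCM_16")

if __name__ == "__main__":
    for src in Path("raw_dataset").rglob("*"):
        if src.suffix.lower() in {".wav", ".mp3", ".flac", ".ogg"}:
            dst = Path("dataset_16k") / src.relative_to("raw_dataset").with_suffix(".wav")
            dst.parent.mkdir(parents=True, exist_ok=True)
            convert_to_wav(src, dst)
```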
Consequently, we reduced the cardinality of the
classes having total duration above the average by it-
eratively removing one random audio file at a time
from the dataset, until the total duration of the class
was less than 24 minutes. In the resulting dataset the
total average duration across classes is 20 m 44 s and
the standard deviation of the class total durations is 1 m 57 s (see Figure 1).
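The balancing procedure can be sketched as below, assuming one directory per class containing the 16 kHz WAV files (directory names are hypothetical); classes already below the 24-minute cap are left untouched, and the random removal makes the result non-deterministic, as in the description above.

```python
# Sketch of the duration-based class balancing described above.
import random
from pathlib import Path

import soundfile as sf

MAX_DURATION = 24 * 60  # 24 minutes, expressed in seconds

def class_durations(class_dir: Path) -> dict:
    return {f: sf.info(str(f)).duration for f in class_dir.glob("*.wav")}

def balance_class(class_dir: Path) -> None:
    durations = class_durations(class_dir)
    total = sum(durations.values())
    files = list(durations)
    # iteratively drop one random file until the class total falls below the cap
    while total > MAX_DURATION and files:
        victim = random.choice(files)
        files.remove(victim)
        total -= durations[victim]
        victim.unlink()  # remove the file from the dataset

for class_dir in Path("dataset_16k").iterdir():
    if class_dir.is_dir():
        balance_class(class_dir)
```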
3 THE PROPOSED METHOD
The proposed method is based on YAMNet (https://github.com/tensorflow/models/tree/master/research/audioset/yamnet), a Neural
Network for audio classification. YAMNet is fed with log-Mel spectrograms and outputs one tag among 521
classes from the AudioSet-YouTube corpus (https://research.google.com/audioset/). YAM-
Net consists of 86 layers based on the MobileNet-v1
architecture (Howard et al., 2017), which is created
Figure 2: Mel-scaled spectrograms (log-Mel bin vs. time) of representative segments of the considered human activities.
Figure 3: Block diagram illustrating the pipeline of the proposed method: data segmentation, YAMNet (Google) embeddings, dense layer, and per-class scores.
using depth-wise convolutions that reduce the computational complexity of the network. Among those, only
28 layers have learnable weights, with 27 convolu-
tional layers, and one fully connected layer.
With the purpose of transferring knowledge learnt
by YAMNet, we discard the last fully connected layer
and substitute it with a new fully connected layer de-
signed for our custom classification task. Overall, the
portion of YAMNet we use has 3,195,456 trained parameters.
For creating the log-Mel spectrograms used by
YAMNet, it is fundamental to understand the roles of frame and hop size. Indeed, long-time dependencies
may be captured more easily with long frame sizes. At the same time, long frames may negatively impact
the training of the network for short-time dependencies, because the network could be saturated with
information and could struggle to identify the most discriminatory features. Moreover, long frames
reduce the total number of frames. We empirically found an optimal frame size of 6 × 15600 samples
(approximately 6 seconds) with a 50% overlap. This value was chosen because 15600 samples
(0.975 seconds) is the exact segment size required by YAMNet. Figure 2 illustrates
log-mel spectrograms of representative segments of
each class existing in the dataset.
After segmenting the windows, in order to improve the generalization abilities of the model, we
added Gaussian noise with mean µ = 0 and standard deviation σ = 0.2. This technique has been shown to make
the model robust against realistic noise sources; e.g., previous work employs such a method to make the
model ignore noise over time (Kahl et al., 2017).
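The segmentation and augmentation steps can be sketched as follows; the 6 × 15600-sample frame, the 50% hop, and σ = 0.2 come from the text, while the exact array handling is an assumption.

```python
# Sketch of framing and Gaussian-noise augmentation (values from the paper).
import numpy as np

SEGMENT = 15600          # 0.975 s at 16 kHz, the segment size required by YAMNet
FRAME = 6 * SEGMENT      # one frame spans 6 YAMNet segments (~5.85 s)
HOP = FRAME // 2         # 50% overlap between consecutive frames

def frame_signal(audio: np.ndarray) -> np.ndarray:
    if len(audio) < FRAME:
        return np.empty((0, FRAME), dtype=audio.dtype)
    starts = range(0, len(audio) - FRAME + 1, HOP)
    return np.stack([audio[s:s + FRAME] for s in starts])

def add_noise(frames: np.ndarray, sigma: float = 0.2) -> np.ndarray:
    # additive Gaussian noise with mean 0 and standard deviation sigma
    return frames + np.random.normal(0.0, sigma, size=frames.shape)
```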
The frames are then fed into the pre-trained YAM-
Net and the produced embeddings, after a ReLU ac-
tivation function, are passed to a fully-connected lin-
ear layer trained from scratch on the new dataset
see Figure 3. The output is then processed with
SoftMax activation during training, while a simple
argmax function can be used at inference time.
For training, we used the Adam (Kingma and Ba, 2014) update algorithm and Categorical Cross-Entropy as the multi-class loss function. The batch size was
set to 10 and training was performed for 100 epochs.
Both epochs and batch size values were chosen after
preliminary exploration using cross-validation.
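A minimal sketch of the overall pipeline is given below, assuming the public TensorFlow Hub release of YAMNet; how the per-segment embeddings within a frame are pooled (here, mean pooling) is an assumption, and the released repository may differ in the details.

```python
# Sketch of the transfer-learning pipeline: frozen YAMNet embeddings feed a
# small trainable head. Hyper-parameters (Adam, categorical cross-entropy,
# batch size 10, 100 epochs) follow the text; mean pooling is an assumption.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 8
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")  # pre-trained, kept frozen

def frame_embedding(frame: np.ndarray) -> np.ndarray:
    # YAMNet returns (scores, embeddings, log_mel_spectrogram) for a waveform
    _, embeddings, _ = yamnet(frame.astype(np.float32))
    return tf.reduce_mean(embeddings, axis=0).numpy()  # one 1024-d vector per frame

classifier = tf.keras.Sequential([
    tf.keras.layers.ReLU(),                                # ReLU on the embeddings
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
classifier.compile(optimizer=tf.keras.optimizers.Adam(),
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])

# x_train: (n_frames, 1024) embedding matrix; y_train: one-hot labels (n_frames, 8)
# classifier.fit(x_train, y_train, batch_size=10, epochs=100)
```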
3.1 Comparative Analysis with k-NN
The k-Nearest Neighbor (k-NN) classifier, despite its simplicity, is a suitable approach for multi-class problems (Hota and Pathak, 2018). The standard version of the k-NN classifier has been used with the Euclidean distance as similarity metric.
Feature Extraction. The short-term features feed-
ing the k-NN model are the following: a) zero cross-
ing rate, b) energy, c) energy’s entropy, d) spectral
centroid and spread, e) spectral entropy, f) spectral
flux, g) spectral rolloff, h) MFCCs, i) harmonic ra-
tio, j) fundamental frequency, and k) chroma vec-
tors. We opted for the mid-term feature extraction
process, meaning that mean and standard deviation statistics of these short-term features are calculated
over mid-term segments. More information on the
adopted feature extraction method can be found in
(Giannakopoulos and Pikrakis, 2014).
Parameterization. Short- and mid-term window and hop sizes have been determined after a series of early experiments on the various datasets. The
configuration offering the highest recognition accu-
racy is the following: 0.05, 0.025 seconds for short-
term window and hop size; and 1.0, 0.5 seconds for
mid-term window and hop size respectively. Overall,
Table 1: Standard deviation of the obtained recognition ac-
curacy per class and data division scheme.
Class 10-folds 3-folds
doing coffee 0.03188 0.01067
frying egg 0.03580 0.03124
no activity 0.01921 0.00790
showering 0.01381 0.00391
microwave 0.02805 0.00839
washing dishes 0.04032 0.01001
washing hands 0.01779 0.01384
washing teeth 0.02321 0.01221
Table 2: Average and Standard Deviation of Balanced Ac-
curacy; * refers to the baseline model described in (Riboni
et al., 2016); ** refers to the k-NN model.
Folds Avg
10 0.8617
10** 0.8091
3 0.8820
3* 0.8560
3** 0.8098
both feature extraction levels include a 50% over-
lap between subsequent windows.
Moreover, parameter k has been chosen using test
results based on the ten-fold cross validation scheme;
depending on the considered data population, the ob-
tained optimal values range in [3, 21]. The best k
parameter obtained is k=3.
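A simplified sketch of this baseline follows; it computes only a subset of the listed short-term features (via librosa) together with the mid-term mean/std statistics, whereas the actual experiments rely on the full feature set of (Giannakopoulos and Pikrakis, 2014).

```python
# Simplified sketch of the k-NN baseline: short-term features, mid-term
# mean/std statistics, and a k=3 Euclidean k-NN (feature subset only).
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

SR = 16000
ST_WIN, ST_HOP = int(0.05 * SR), int(0.025 * SR)   # 0.05 s / 0.025 s
MT_WIN, MT_HOP = int(1.0 * SR), int(0.5 * SR)      # 1.0 s / 0.5 s

def short_term_features(segment: np.ndarray) -> np.ndarray:
    zcr = librosa.feature.zero_crossing_rate(segment, frame_length=ST_WIN, hop_length=ST_HOP)
    centroid = librosa.feature.spectral_centroid(y=segment, sr=SR, n_fft=ST_WIN, hop_length=ST_HOP)
    mfcc = librosa.feature.mfcc(y=segment, sr=SR, n_mfcc=13, n_fft=ST_WIN, hop_length=ST_HOP)
    return np.vstack([zcr, centroid, mfcc])            # (n_features, n_short_frames)

def mid_term_features(audio: np.ndarray) -> np.ndarray:
    vectors = []
    for start in range(0, max(len(audio) - MT_WIN + 1, 1), MT_HOP):
        st = short_term_features(audio[start:start + MT_WIN])
        vectors.append(np.concatenate([st.mean(axis=1), st.std(axis=1)]))
    return np.stack(vectors)

# x: stacked mid-term vectors for all recordings, y: the corresponding labels
# knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(x, y)
```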
4 EXPERIMENTAL SET-UP AND
RESULTS
For evaluating the proposed system, we designed
two experiments, namely a 3-fold and a 10-fold cross-validation.
The metrics used to evaluate the trained model are
summarized in the following list:
Balanced accuracy per fold: computes the mean
of the true positive rate obtained on each class of
a fold; it corresponds to the following formula:
\[ \frac{1}{M} \sum_{i=0}^{M-1} \frac{tp_i}{tp_i + fn_i} \]
where M is the number of classes, and tp_i and fn_i are
the number of true positives and false negatives
of the i-th class;
Average balanced accuracy: computes the mean
of the balanced accuracy across folds;
Standard deviation of balanced accuracy: computes the standard deviation of the per-fold balanced accuracies;
Per-class standard deviation: for each class, the true-positive rate is computed in each fold; then, the standard deviation across folds is computed;
Normalized confusion matrix: all the confusion matrices of all folds are summed; then, they are normalized so that each row sums to one (a minimal computation sketch is given after this list).
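These metrics can be computed, for instance, as in the following sketch, which assumes that per-fold ground-truth labels and predictions are already available.

```python
# Sketch of the evaluation metrics: per-fold balanced accuracy, its mean and
# standard deviation, and the row-normalized confusion matrix summed over folds.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

def evaluate_folds(fold_results, num_classes=8):
    # fold_results: list of (y_true, y_pred) pairs, one per fold
    bal_accs, summed_cm = [], np.zeros((num_classes, num_classes))
    for y_true, y_pred in fold_results:
        bal_accs.append(balanced_accuracy_score(y_true, y_pred))
        summed_cm += confusion_matrix(y_true, y_pred, labels=range(num_classes))
    norm_cm = summed_cm / summed_cm.sum(axis=1, keepdims=True)  # rows sum to 1
    return np.mean(bal_accs), np.std(bal_accs), norm_cm
```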
Comparing 10-fold and 3-fold cross-validation (see Table 2), the model seems to suffer from the increase of data in the training set, leading to a decrease of performance in the 10-fold cross-validation. Ta-
ble 1 shows the standard deviation across folds of the true positive rates for each class.
Figure 4: Confusion Matrix of 3-fold (top) and 10-fold (bottom) evaluation tests. Matrices are created by summing the confusion matrices obtained in each fold and then normalizing so that each row sums to 1.
Since the ef-
fect is noticeable in all classes, we assume that it is
not connected with specific samples. Therefore, the
increased dataset size should be thoroughly assessed
for possible negative effects, which may impact the
learning ability of the model.
Overall, as shown in Figure 4, all classes are well-
predicted. Misclassifications are mainly concentrated in the differentiation among “washing dishes”,
“washing hands”, and “washing teeth”, probably because of the water sound in the background. Other
misclassifications are between “frying egg” and
“doing coffee”, probably because they include metal
sounds (of pots and pans) and some parts with very
low intensity sounds.
We compared the best-performing model to a
baseline work (Riboni et al., 2016) on the dataset, without our expansion or balancing strategy. For
the comparison, we used 3-fold cross-validation. The
specific work focuses on the usage of low-power con-
sumption devices. Even though the overall accuracy
does not differ significantly from the baseline model,
it is interesting to note that there are relevant dif-
ferences in the way misclassifications are distributed
across classes (see Figure 5):
“doing coffee”: the proposed model produces fewer misclassifications with “frying eggs” and “microwave”;
“no activity”: fewer misclassifications with “washing teeth” using the proposed method;
“using microwave”: the baseline model confused
this class with “doing coffee”, while the proposed
method does not;
“washing hands”: similar, but our model performed 0.04 worse than the baseline in the 10-fold cross-validation and 0.08 worse in the 3-fold cross-validation; the misclassifications of our classifier fall in “washing dishes” and “washing teeth” (this could be because of the similar water sounds);
“washing teeth”: similar, but our model scored
higher by about 0.03.
In addition, we contrasted the best-performing model with the k-NN classifier described in subsection 3.1, trained on the expanded dataset. As shown in Table 2, k-NN is outperformed both by the baseline model of Riboni et al. and by the classifier implemented with YAMNet. Figure 6 shows the confusion matrix of the 10-fold cross-validation of the k-NN model.
The implementation of the proposed classifier
along with the presented experiments is available
at https://github.com/LIMUNIMI/HAR-YAMNet, ensuring full reproducibility of the obtained results.
5 ANDROID APPLICATION
For experimental purposes, we have also built an An-
droid application which uses the proposed model to
classify real-life sounds. The application runs in real-time and performs a new inference every 500 milliseconds on the previous 500 ms of recorded audio, visualizing the prediction on the screen. An example of the application running on the Android mobile operating system is
shown in Figure 7.
To improve the accuracy of the app, two filters
were implemented. The first filter leverages the original YAMNet predictor (not the one we trained) to filter out silence: if the original YAMNet predicts silence with a score >0.89, no prediction is performed by our model and a corresponding message is visualized on the screen (“No activity from YAMNet”).
Figure 5: Confusion Matrix of the best model obtained with the baseline model (Riboni et al., 2016).
Figure 6: Confusion matrix of the best k-NN configuration, where k=3.
The second filter is a threshold of 0.3 on each class score: it filters out classifications with low probability scores, so that if the winning score is lower than the threshold the classification is disregarded and no tag is visualized on the screen.
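The decision logic of the two filters can be sketched in Python as follows; the actual app runs on Android with an exported model, and the label ordering and the index of YAMNet's silence class are assumptions here.

```python
# Sketch of the two inference-time filters (thresholds from the text; the
# Android implementation uses the exported model rather than this Python code).
import numpy as np

SILENCE_THRESHOLD = 0.89   # on the original YAMNet silence score
CLASS_THRESHOLD = 0.3      # on our per-class scores
ACTIVITY_LABELS = ["doing coffee", "frying egg", "no activity", "showering",
                   "using microwave", "washing dishes", "washing hands", "washing teeth"]

def classify(yamnet_scores: np.ndarray, har_scores: np.ndarray,
             silence_index: int) -> str:
    # Filter 1: trust the original YAMNet head to detect silence
    if yamnet_scores[silence_index] > SILENCE_THRESHOLD:
        return "No activity from YAMNet"
    # Filter 2: discard low-confidence predictions of our classifier
    best = int(np.argmax(har_scores))
    if har_scores[best] < CLASS_THRESHOLD:
        return ""  # no tag is visualized
    return ACTIVITY_LABELS[best]
```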
6 CONCLUSION AND FUTURE
DEVELOPMENTS
This article proposes a Deep Neural Network frame-
work with transfer learning from a CNN (YAMNet)
Figure 7: Screenshot of the developed prototype application running on the Android mobile operating system. The winning class along with the associated probability is displayed to the user.
to classify human activities using a reasonably-sized
dataset. The obtained results demonstrate the supe-
riority of the proposed system over the state of the art based on supervised feature learning. Weaknesses of
the model could emerge when scaling the number of classes with a proportional number of instances:
indeed, the model would need more data to learn a more complex problem, for which the current neural
architecture may not be accurate enough.
Future work includes the use of artificial data augmentation to enlarge the dataset. Possibly, YAMNet
hyper-parameters could be fine-tuned if the dataset
is sufficiently large. Moreover, the effectiveness of
the smartphone application should be assessed thor-
oughly in terms of complexity along with the required
resources. Finally, the developed application could be
employed to enhance the capabilities of a wide range
of systems including smart-home assistants, such as
Amazon Alexa, Google Home, etc.
REFERENCES
Beddiar, D. R., Nini, B., Sabokrou, M., and Hadid,
A. (2020). Vision-based human activity recogni-
tion: A survey. Multimedia Tools and Applications,
79(41):30509–30555.
Bevilacqua, A., MacDonald, K., Rangarej, A., Widjaya, V.,
Caulfield, B., and Kechadi, T. (2018). Human activ-
ity recognition with convolutional neural networks. In
Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, pages 541–552.
Springer.
Chahuara, P., Fleury, A., Portet, F., and Vacher, M. (2016).
On-line human activity recognition from audio and
home automation sensors: Comparison of sequential
and non-sequential models in realistic smart homes.
Journal of Ambient Intelligence and Smart Environ-
ments, 8(4):399–422.
Chen, K., Zhang, D., Yao, L., Guo, B., Yu, Z., and Liu, Y.
(2021). Deep learning for sensor-based human activ-
ity recognition: Overview, challenges, and opportuni-
ties. ACM Computing Surveys, 54(4):77:1–77:40.
García-Hernández, A., Galván-Tejada, C., Galván-Tejada, J., Celaya-Padilla, J., Gamboa-Rosales, H., Velasco-Elizondo, P., and Cárdenas-Vargas, R. (2017). A similarity analysis of audio signal to develop a human activity recognition using similarity networks. Sensors, 17(11):2688.
Giannakopoulos, T. and Pikrakis, A. (2014). Introduction
to Audio Analysis: A MATLAB Approach. Academic
Press, Inc., USA, 1st edition.
Hota, S. and Pathak, S. (2018). KNN classifier based ap-
proach for multi-class sentiment analysis of twitter
data. International Journal of Engineering and Tech-
nology, 7(3):1372.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D.,
Wang, W., Weyand, T., Andreetto, M., and Adam,
H. (2017). Mobilenets: Efficient convolutional neu-
ral networks for mobile vision applications. CoRR,
abs/1704.04861.
Kahl, S., Wilhelm-Stein, T., Hussein, H., Klinck, H., Kow-
erko, D., Ritter, M., and Eibl, M. (2017). Large-scale
bird sound classification using convolutional neural
networks. In CLEF (working notes), volume 1866.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization.
Lau, J., Zimmerman, B., and Schaub, F. (2018). Alexa, are
you listening? Proceedings of the ACM on Human-
Computer Interaction, 2(CSCW):1–31.
Liang, D., Song, W., and Thomaz, E. (2020). Characteriz-
ing the effect of audio degradation on privacy percep-
tion and inference performance in audio-based human
activity recognition. In 22nd International Conference
on Human-Computer Interaction with Mobile Devices
and Services, MobileHCI ’20, pages 1–10, New York,
NY, USA. Association for Computing Machinery.
Ntalampiras, S. and Potamitis, I. (2018). Transfer learning
for improved audio-based human activity recognition.
Biosensors, 8(3):60.
Ntalampiras, S., Potamitis, I., and Fakotakis, N. (2012).
Acoustic detection of human activities in natural en-
vironments. AES: Journal of the Audio Engineering
Society, 60(9):686–695.
Ntalampiras, S. and Roveri, M. (2016). An incremental
learning mechanism for human activity recognition.
In 2016 IEEE Symposium Series on Computational
Intelligence (SSCI), pages 1–6.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learn-
ing. IEEE Transactions on Knowledge and Data En-
gineering, 22(10):1345–1359.
Ramanujam, E., Perumal, T., and Padmavathi, S. (2021).
Human activity recognition with smartphone and
wearable sensors using deep learning techniques: A
review. IEEE Sensors Journal, 21(12):13029–13040.
Riboni, D., Galván-Tejada, C. E., Galván-Tejada, J. I., Celaya-Padilla, J., Delgado-Contreras, J. R., Magallanes-Quintanar, R., Martinez-Fierro, M. L., Garza-Veloz, I., López-Hernández, Y., and Gamboa-
Rosales, H. (2016). An analysis of audio features to
develop a human activity recognition model using
genetic algorithms, random forests, and neural net-
works. Mobile Information Systems, 2016:1784101.
Ronao, C. A. and Cho, S.-B. (2016). Human activity recog-
nition with smartphone sensors using deep learning
neural networks. Expert systems with applications,
59:235–244.
Stork, J. A., Spinello, L., Silva, J., and Arras, K. O. (2012).
Audio-based human activity recognition using non-
markovian ensemble voting. In 2012 IEEE RO-MAN:
The 21st IEEE International Symposium on Robot and
Human Interactive Communication, pages 509–514.
IEEE.
Wan, S., Qi, L., Xu, X., Tong, C., and Gu, Z. (2020). Deep
learning models for real-time human activity recogni-
tion with smartphones. Mobile Networks and Appli-
cations, 25(2):743–755.
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H.,
Xiong, H., and He, Q. (2021). A comprehensive sur-
vey on transfer learning. Proceedings of the IEEE,
109(1):43–76.