homeSound: A High Performance Platform for Massive Data Acquisition
and Processing in Ambient Assisted Living Environments
Marcos Hervás¹, Rosa Ma Alsina-Pagès¹ and Joan Navarro²
¹GTM - Grup de Recerca en Tecnologies Mèdia, La Salle - Universitat Ramon Llull, Barcelona, Spain
²GRITS - Grup de Recerca en Internet Technologies & Storage, La Salle - Universitat Ramon Llull, Barcelona, Spain
Keywords: Ambient Assisted Living, Sensor Network, Machine Hearing, Audio Feature Extraction, Machine Learning, Graphics Processor Unit.
Abstract: Human life expectancy has steadily grown over the last century, which has driven governments and institutions to increase their efforts in caring for the oldest segment of the population. The first answer to this growing need was the construction of hospitals and retirement homes, but these facilities have rapidly filled up and their associated maintenance costs are becoming prohibitive. Therefore, modern trends attempt to take advantage of the latest advances in technology and communications to remotely monitor people with special needs at their own homes, increasing their quality of life with far less impact on their social lives. Nonetheless, this approach still requires a considerable number of qualified medical personnel to track every patient at all times. The purpose of this paper is to present an acoustic event detection platform for assisted living that tracks patients' status by automatically identifying and analyzing the acoustic events happening in a house. Specifically, we have taken advantage of the capabilities of a Jetson TK1, with its NVIDIA Graphics Processing Unit, to collect the data in the house and process it to identify a closed set of events, which could guide doctors or care assistants in real time by tracking the patient at home. This is a proof of concept conducted with data from only one acoustic sensor; in the future we plan to extract information from a sensor network placed at several points in the house.
1 INTRODUCTION
Human life expectancy is increasing in modern
society (Suzman and Beard, 2015). This drives our
society to face new challenges in terms of health care,
because the number of ageing people who need support,
and hence the number of patients to attend, keeps
growing (National Institute on Aging, 2007; Chatterji
et al., 2008; Lafortune and Balestat, 2007). Nowadays,
public and private health services try to avoid long-term
hospitalizations and, instead, encourage the elderly
to remain at home for two reasons: on the one hand,
it is better for their health to keep them—while not
suffering from severe deterioration—in their own en-
vironment and, on the other hand, it is much cheaper
for health services. However, there is still a quality
gap between the service provided at medical facilities
and the service provided at patients’ home.
Technology is a powerful tool that can contribute
to address this problem by enabling medical staff to
monitor and attend patients while they are at home.
Ambient Assisted Living (AAL) (Vacher et al., 2010)
reduces the personnel costs in health assistance. AAL
consists of monitoring the preferred living environ-
ment of the patients with intelligent devices that can
track their status and improve their life quality. To
address this hot research topic, several engineering
projects have been proposed to discuss the feasibility
of deploying smart robots at the homes of the elderly,
not only to cover routine tasks, but also to remind them
to take their medication or to interact with them through
serious games (Morsi and Shukla, 2015). One of the
main challenges that these proposals open is the huge
amount of data that these robots have to collect in
order to provide a meaningful response for patients.
Typically, these robots have limited computing capa-
bilities and, thus, are able to process data from a re-
duced number of sensors.
This paper explains the proof of concept of a software
and hardware platform designed to recognize a
set of predefined events from the environmental
sound in a house. This information can later be used
to infer the in-home context and detect some situations
of risk. To process data from several sources
(e.g., microphones) and conduct the computations
associated with audio event identification in parallel,
the system implements a recognition scheme using an
NVIDIA Jetson TK1 (NVIDIA Corp., 2016) Graphics
Processing Unit (GPU). This platform can reach
different decisions depending on the situation and the
home, and the final outcome can be activating some
kind of alarm or simply tracking the patient's behaviour
for health purposes. Overall, the purpose of this work
is to present a first approach to the implementation
of an acoustic event recognition platform and the
results obtained when classifying a limited corpus of
events.
The remainder of this paper is organized as follows.
Section 2 reviews the related work on environmental
sound recognition, with a special focus on ambient
assisted living environments. Section 3 gives details
about the selected platform and the features that make
it convenient for processing audio data. Section 4
elaborates on the technical details of the proposed
algorithm, which corresponds to a basic implementation.
Section 5 describes the algorithm used to classify the
events and shows the results obtained when running it
on the chosen platform. Finally, Section 6 details the
conclusions and future work of this project.
2 RELATED WORK
There are several approaches in the literature that aim
to extract features from sound. From these features,
it is possible to create a corpus of a closed universe
of different sounds and train a machine learning
system to classify the source of the sound. Consequently,
environmental sound recognition has emerged as a hot
research topic, which has led to some interesting
applications (Chachada and Kuo, 2014), from animal
recognition to surveillance, including ambient assisted
living use cases.
Interest in detecting in-home sounds dates back to
the beginnings of this technology in 2005, when Chen
et al. (2005) monitored bathroom activity using only
sound information. Afterwards, and building on research
not detailed in this work, robust environmental sound
recognition engines were designed in 2008 (Wang et al.,
2008). One of the most challenging problems to be
solved in this field is coping with the varying acoustic
background, i.e., the noise sources. In this regard, the
SonicSentinel project (Hollosi et al., 2011) uses
noise-robust model-based algorithms to evaluate the
noise sources. Evolving this technology, Valero and
Alías (2012) succeeded in classifying audio scenes.
Additionally, several works can be found on audio
analysis in a smart home to help doctors with the early
diagnosis of dementia in the elderly (Guyot et al.,
2013). It is also worth mentioning that conditional
random fields have been used to build an event detection
framework in a real-world environment of eight
households (Matern et al., 2013), although the system
proved to be sometimes unreliable.
From the applications point of view, one of the
most popular use cases of audio event recognition
nowadays is the smart home (Chan et al., 2008),
especially when conceiving systems to meet the needs
of elderly people. The constraints around the design
of a smart home for health care (Vacher et al., 2010)
based on audio event classification are as follows:
i) the degree of dependency of the disabled person,
ii) the quality of life to be improved by means of
automating processes, and iii) the recognition of
distress situations and the activation of preassigned
protocols, including the reduction of false alarms
(Goetze et al., 2012).
Despite the fact that there are several solutions
in the literature (Vacher et al., 2013) that consider
these three constraints, the primary goal of the
platform presented in this paper is to accurately
address the third one. Additionally, our proposal aims
to meet the needs of ambient assisted living, which are
the following (van Hengel and Anemüller, 2009):
i) increasing the comfort of living at home,
ii) increasing safety by detecting dangerous events,
and iii) supporting health care by professionals
through the detection of emergencies and the monitoring
of vital signs.
3 SYSTEM DESCRIPTION
The diagram of the proposed system to monitor audio
events in ambient assisted living environments is
shown in Figure 1.
As far as the proof of concept herein presented is
concerned, the system relies on a network of micro-
phones consistently deployed around the house (see
Figure 2). The microphones are installed in such
a way that they provide the maximum entropy for a
given event (i.e., it is not necessary to analyze
different audio sources together).
The microphones used in this application to sense
the environmental sound should present a good trade-off
between frequency response and cost. For this reason,
tests are being conducted with the low-priced electret
condenser microphone CMA-4544PF-W (CUI inc., 2003)
from the manufacturer CUI inc.
Figure 1: Block diagram of the network elements of this system: audio sensors with wireless connectivity inside the house, a RAW data concentrator with GPU and remote access, and a remote Ambient Assisted Living server.

In this way, each microphone transmits sounds to
this device, which acts as a concentrator and is the
core element of our proposal. As a matter of fact, this
concentrator i) collects all the audio sounds of the
house, ii) processes them in order to extract their
features, iii) infers the source of the audio event,
and iv) sends this information to a remote server that
monitors the needs of the people living in the house.
Figure 2: Example of the proposed audio sensor network deployed in a house.
The concentrator platform used in this work is
the NVIDIA Jetson TK1 developer kit. This platform
is based on the Tegra K1 SoC, which is composed
of i) an NVIDIA Kepler GPU with 192 CUDA cores and
ii) a quad-core ARM Cortex-A15 CPU. The Tegra family
is NVIDIA's proposal for mobile processors that need
GPU-accelerated performance with low power consumption.
This GPU is able to process up to 192 threads in
parallel. The Kepler architecture offers up to 3 times
the performance of the previous generation, Fermi
(NVIDIA Corp., 2014). This level of concurrency allows
us to process audio events from several sources in
real time.
Therefore, to exploit the parallel capabilities of the
concentrator, the system opens one thread per audio
source to process its stream and infer the event that
generated every sound, as sketched below.
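As an illustration of this design, the following minimal Python sketch (our own, not the authors' code) opens one worker thread per microphone stream on the concentrator; `extract_mfcc`, `classify`, `microphone_streams` and the stub bodies are hypothetical placeholders for the stages detailed in Section 4.

```python
import threading
import queue

results = queue.Queue()  # classified events, to be forwarded to the remote server

def extract_mfcc(frame):
    """Placeholder for the MFCC feature extraction of Section 4.1."""
    return frame

def classify(features):
    """Placeholder for the bag-of-words + SVM classifier of Section 4.2."""
    return "unknown event"

def process_source(source_id, audio_stream):
    """Consume the audio blocks of one microphone stream and classify its events."""
    for frame in audio_stream:
        event = classify(extract_mfcc(frame))
        results.put((source_id, event))

# Hypothetical input: one iterable of audio blocks per microphone.
microphone_streams = [[b"block-0"], [b"block-1"]]

threads = [threading.Thread(target=process_source, args=(i, s), daemon=True)
           for i, s in enumerate(microphone_streams)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```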
4 MACHINE HEARING
Endowing machines with the ability to hear the
acoustic environment and to detect and recognize events
as humans do is known as machine hearing. The
algorithm used in this work is based on i) feature
extraction using mel-frequency cepstral coefficients
(MFCC) (Mermelstein, 1976) and ii) pattern recognition
using the k-Nearest Neighbors (kNN) classifier
(Cover and Hart, 1967); see Figure 3.
Figure 3: Block diagram of a machine hearing algorithm: audio input, feature extraction (MFCC), audio analysis, and output.
4.1 Feature Extraction
Feature extraction aims to obtain a representation
of audio events whose dimensionality is much lower
than that of the original samples (Alías et al., 2016).
This parametrization will be the input data of the
classifier. The parametrization used in this work,
MFCC (Mermelstein, 1976), follows a perceptually
motivated frequency analysis based on the Mel scale
(Liang and Fan, 2014).
Figure 4: Block diagram of the feature extraction based on the Mel coefficients used in this work: the input x[n] is windowed by W[n-rP] into frames x_r[n], transformed by the DFT into X[k], passed through the Mel filter bank and log(|·|²) to obtain the log filter-bank energies LE(i), and finally transformed by the DCT into the MFCC.
The incoming audio stream is divided into blocks
of 30 ms with a sliding window. These frames are
transformed into the frequency domain using the DFT to
measure the power of different bands of the spectrum.
The power measures are conducted with a bank of 48
filters on the Mel scale (see Figure 5). The MFCC
coefficients are obtained from the Discrete Cosine
Transform (DCT) of the logarithm of these 48 values.
The higher-order coefficients of the DCT are discarded
to obtain a reduced-dimensionality characterization of
the sound event; this compression is possible because
the main information lies in the low-frequency
components of the signal's spectral envelope. The final
number of MFCC coefficients is 13. A compact numerical
sketch of this pipeline is given below.
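To make the pipeline concrete, the following short Python sketch (illustrative only, not the authors' implementation) reproduces the steps above: 30 ms frames with 50% overlap, Hamming windowing, DFT power spectrum, a 48-filter Mel bank, logarithm and DCT, keeping 13 coefficients. The sampling rate, the Mel mapping 2595·log10(1 + f/700) and all function names are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the Mel scale (cf. Figure 5)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = np.floor((n_fft + 1) * inv_mel(np.linspace(mel(0.0), mel(fs / 2.0),
                                                       n_filters + 2)) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

def mfcc(x, fs, frame_ms=30, n_filters=48, n_coeffs=13):
    """One 13-dimensional MFCC vector per 30 ms frame, with 50% overlap."""
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len // 2                                # 50% overlap
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, frame_len, fs)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(x[start:start + frame_len] * window)) ** 2
        log_energies = np.log(fbank @ spectrum + 1e-10)  # 48 log filter-bank energies
        frames.append(dct(log_energies, norm='ortho')[:n_coeffs])  # keep the 13 lowest
    return np.array(frames)

# Example: coeffs = mfcc(np.random.randn(48000), fs=48000) -> shape (n_frames, 13)
```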
Figure 5: Example of a Mel-scale filter bank with 20 triangular filters (amplitude vs. frequency, 0-8000 Hz).
Window lengths between 10 and 50 ms are usually
used to detect transient audio events (Fu et al.,
2011). A Hamming window is also applied to each frame
of samples to improve the spectral representation
obtained with the Discrete Fourier Transform (DFT),
as can be seen by comparing the rectangular and Hamming
windows in Figure 6. Consecutive sliding blocks overlap
by 50% of their samples to compensate for the power
reduction at the edges of the Hamming window (see
Figure 6). A short numerical sketch of this trade-off
follows.
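As a quick numerical check of what Figure 6 conveys, the Hamming window lowers the spectral side lobes dramatically at the cost of a somewhat wider main lobe. The sketch below is our own illustration; the 48 kHz rate and the zero-padding factor are arbitrary assumptions.

```python
import numpy as np

def peak_sidelobe_db(window, pad=16):
    """Peak side-lobe level of a window, relative to its main-lobe peak."""
    s = np.abs(np.fft.rfft(window, pad * len(window)))   # zero-padded spectrum
    s_db = 20 * np.log10(s / s.max() + 1e-12)
    k = 0
    while k + 1 < len(s_db) and s_db[k + 1] <= s_db[k]:
        k += 1                      # walk down the main lobe to its first null
    return s_db[k:].max()           # highest remaining lobe

N = 1440                            # 30 ms frame at 48 kHz (assumed rate)
print(peak_sidelobe_db(np.ones(N)))     # rectangular window: about -13 dB
print(peak_sidelobe_db(np.hamming(N)))  # Hamming window: about -43 dB
```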
The Mel scale is a perceptual scale that aims
to emulate the behaviour of human hearing. As
we can observe in Figure 5, the Mel filter bank is a
bank of triangular filters.
4.2 Automatic Audio Classification
Machine learning algorithms are widely used in the
speech technologies literature to automatically
classify audio samples. In fact, most audio
recognition systems adopt the MFCC coefficients as the
baseline for feature extraction (Alías et al., 2016).
Then, once the signal is processed and the features are
extracted, a k-Nearest Neighbors (kNN) (Cover and Hart,
1967) system can be run (Zhang and Zhou, 2005).
Hence, we have followed this approach and trained a
kNN-based classifier as follows. We have built a
training data set composed of 2850 audio samples
belonging to 14 in-home event classes and lasting a
total of 20 hours. We have split every sample into
several sub-samples as detailed above and, for every
sub-sample, we have computed the MFCC coefficients.
Figure 6: Comparison between rectangular and Hamming windows in the time and frequency domains: a) rectangular window, b) spectrum of the rectangular window, c) Hamming window, and d) spectrum of the Hamming window.
This results in a vector of 13 components (each one
corresponding to its associated MFCC) for every sound
sub-sample. As a result, a sound sample is characterized
by a set of such vectors. The number of vectors
that characterize a given sound depends on the length
of the training sound, which would limit the classifier
accuracy (i.e., shorter sounds of the same sound
type would probably be misclassified). Therefore, to
address this issue, we have built a bag of words with
all the vectors belonging to the same sample using the
aforementioned kNN. The resulting vector has a fixed
length of K components and indicates how many portions
of the training sound belong to each centroid of the
kNN, which at the same time removes the temporal
dimension of the sound event. Next, we normalize all
these resulting vectors to make them suitable for a
fair comparison. With this fixed-size set of normalized
vectors, we finally train a Support Vector Machine
(running on the concentrator platform), as sketched
below.
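The following sketch shows one plausible realization of this training stage with scikit-learn. The paper does not state how the centroids are obtained, so k-means is our assumption here, and the codebook size `K`, the synthetic corpus and all variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

K = 64  # codebook size (hypothetical; the paper only calls it K)

# Hypothetical training data: one (n_subsamples, 13) MFCC matrix per sound,
# plus its event label. Real data would come from the 2850-sample corpus.
rng = np.random.default_rng(0)
train_mfccs = [rng.normal(size=(rng.integers(20, 200), 13)) for _ in range(100)]
train_labels = rng.integers(0, 14, size=100)

# Codebook of centroids; k-means is one common choice for this step.
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(train_mfccs))

def bag_of_words(mfcc_matrix):
    """Fixed-length, normalized histogram of nearest-centroid assignments."""
    hist = np.bincount(codebook.predict(mfcc_matrix), minlength=K).astype(float)
    return hist / max(hist.sum(), 1.0)       # normalization for fair comparison

X = np.array([bag_of_words(m) for m in train_mfccs])
svm = SVC().fit(X, train_labels)

# Exploitation mode: classify a new sound from its MFCC sub-samples.
new_sound = rng.normal(size=(50, 13))
predicted_event = svm.predict(bag_of_words(new_sound).reshape(1, -1))
```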
Finally, when our system is in exploitation mode,
the concentrator platform extracts the audio
sub-samples and builds the fixed-size vector
accordingly. Then, this vector is delivered to the
Support Vector Machine (SVM) to predict the event.
5 RESULTS
We have conducted our experimentation with the data
set described in the previous section. Specifically, we
have used 60% of the instances to train the SVM and the
remaining 40% to test it. To obtain statistically
significant results, we have performed a 10-fold
cross-validation.
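Assuming the bag-of-words matrix `X` and the labels `train_labels` from the previous sketch are available, the 10-fold protocol could be run as follows; `cross_val_score` is one standard way to do this and not necessarily what the authors used.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 10-fold cross-validation over the fixed-size bag-of-words vectors.
scores = cross_val_score(SVC(), X, train_labels, cv=10)
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```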
We have found that this system is able to recog-
nize the following events with an overall accuracy of
73%: someone falling down, slice, screaming, rain,
printer, people talking, frying food, filling water, door
knocking, dog bark, car horn, glass breaking, baby
crying, water boiling.
Figure 7 shows the confusion matrix.
Figure 7: Confusion Matrix. Events are ordered from
left to right as follows: falling down, slice, screaming,
rain, printer, people talking, frying food, filling water, door
knocking, dog bark, car horn, glass breaking, baby crying,
water boiling.
In this confusion matrix we can see how often the
SVM misclassifies a given class and, thus, assigns a
wrong event to an audio sample. It shows that, in
general, the best results for each sample are obtained
when testing the sound event against itself. It also
depicts the ability of the classifier to distinguish
one audio event from the others. The optimal confusion
matrix would be a diagonal matrix with the value 100
all along its diagonal (i.e., 100% of the samples of
every class correctly classified).
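For reference, this is one plausible way to compute a row-normalized confusion matrix like the one in Figure 7, again reusing the illustrative `X` and `train_labels` from Section 4.2; the split and variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 60/40 split as described in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, train_labels,
                                          train_size=0.6, random_state=0)
cm = confusion_matrix(y_te, SVC().fit(X_tr, y_tr).predict(X_te))
cm_percent = 100.0 * cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)  # rows sum to 100
# A perfect classifier would put 100 on the diagonal and 0 elsewhere.
```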
Although the classifier performs reasonably well,
it gets confused by some sound events whose MFCC
coefficients are very similar. For instance, on row 6
of Figure 7, door knocking, people talking and frying
food have similar MFCC vector patterns and, thus, the
SVM achieves low accuracy in these specific situations.
To address this concern, we plan to (1) complement the
training vector set with features other than MFCCs,
and (2) use a more sophisticated classifier such as a
deep neural network.
6 CONCLUSIONS
The preliminary results of this work encourage us to
keep working on the analysis of the events happening in
the house. We will work on improving the feature
extraction with other methods, and we will test more
machine learning algorithms to increase the accuracy of
the system with just one acoustic measurement. The next
steps after this proof of concept using the Jetson TK1
involve expanding the platform by means of a wider
sensor network, where several autonomous acoustic
sensors send data to the GPU for processing. In this
stage, an important part of the work will be focused on
optimizing the acoustic event detection algorithm to
take advantage of the parallelization capabilities of
the GPU.
ACKNOWLEDGEMENTS
The authors would like to thank the Secretaria
d'Universitats i Recerca del Departament d'Economia i
Coneixement (Generalitat de Catalunya) under grant
ref. 2014-SGR-0590.
REFERENCES
Alías, F., Socoró, J. C., and Sevillano, X. (2016). A review
of physical and perceptual feature extraction techniques
for speech, music and environmental sounds.
Applied Sciences, 6(5):143.
Chachada, S. and Kuo, J. (2014). Environmental sound
recognition: A survey. APSIPA Transactions on Sig-
nal and Information Processing.
Chan, M., Estève, D., Escriba, C., and Campo, E. (2008).
A review of smart homes: present state and future
challenges. Computer Methods and Programs in
Biomedicine, 91(1):55–81.
Chatterji, S., Kowal, P., Mathers, C., Naidoo, N., Verdes,
E., Smith, J. P., and Suzman, R. (2008). The health of
aging populations in China and India. Health Affairs,
27(4):1052–1063.
Chen, J., Kam, A. H., Zhang, J., Liu, N., and Shue, L.
(2005). Bathroom activity monitoring based on sound.
In International Conference on Pervasive Computing,
pages 47–61. Springer.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern
classification. IEEE Transactions on Information Theory,
13(1):21–27.
CUI inc. (2003). CMA-4544PF-W. [Online; accessed 10-
Dec-2016].
Fu, Z., Lu, G., Ting, K. M., and Zhang, D. (2011). A sur-
vey of audio-based music classification and annota-
tion. IEEE Transactions on Multimedia, 13(2):303–
319.
Goetze, S., Schröder, J., Gerlach, S., Hollosi, D., Appell,
J.-E., and Wallhoff, F. (2012). Acoustic monitoring
and localization for social care. Journal of Computing
Science and Engineering, 6(1):40–50.
Guyot, P., Pinquier, J., Valero, X., and Alías, F. (2013).
Two-step detection of water sound events for the
diagnostic and monitoring of dementia. In 2013
IEEE International Conference on Multimedia and
Expo (ICME), pages 1–6, San Jose, California (USA).
IEEE.
Hollosi, D., Goetze, S., Appell, J., and Wallhoff, F. (2011).
Acoustic applications and technologies for ambient
assisted living scenarios. In Proceedings of the AAL
Forum, Lecce (Italy).
Lafortune, G. and Balestat, G. (2007). Trends in severe dis-
ability among elderly people. OECD Health Working
Papers.
Liang, S. and Fan, X. (2014). Audio content classification
method research based on two-step strategy. Int. J.
Adv. Comput. Sci. Appl. (IJACSA), 5:57–62.
Matern, D., Condurache, A., and Mertins, A. (2013). Adap-
tive and automated ambiance surveillance and event
detection for ambient assisted living. In 2013 35th An-
nual International Conference of the IEEE Engineer-
ing in Medicine and Biology Society (EMBC), pages
7318–7321. IEEE.
Mermelstein, P. (1976). Distance measures for speech
recognition, psychological and instrumental. Pattern
Recognition and Artificial Intelligence.
Morsi, Y. S. and Shukla, A. (2015). Optimizing Assistive
Technologies for Aging Populations. IGI Global.
National Institute on Aging (2007). Growing Older in
America: The Health and Retirement Study. U.S. De-
partment of Health and Human Services, Washington,
DC.
NVIDIA Corp. (2014). NVIDIA's next generation CUDA
compute architecture: Kepler GK110/210. Available
online (last accessed 20th Oct 2016).
NVIDIA Corp. (2016). JETSON TK1. Unlock the power of
the GPU for embedded systems applications.
Suzman, R. and Beard, J. (2015). Global health and aging:
Living longer. National Institute on Aging.
Vacher, M., Portet, F., Fleury, A., and Noury, N. (2010).
Challenges in the processing of audio channels for
ambient assisted living. In e-Health Networking Ap-
plications and Services (Healthcom), 2010 12th IEEE
International Conference on, pages 330–337. IEEE.
Vacher, M., Portet, F., Fleury, A., and Noury, N. (2013).
Development of audio sensing technology for ambient
assisted living: Applications and challenges. Digital
Advances in Medicine, E-Health, and Communication
Technologies, page 148.
Valero, X. and Alías, F. (2012). Classification of audio
scenes using narrow-band autocorrelation features. In
Signal Processing Conference (EUSIPCO), 2012 Proceedings
of the 20th European, pages 2076–1465. IEEE.
van Hengel, P. and Anemüller, J. (2009). Audio event
detection for in-home care. In Int. Conf. on Acoustics
(NAG/DAGA).
Wang, J.-C., Lee, H.-P., Wang, J.-F., and Lin, C.-B. (2008).
Robust environmental sound recognition for home au-
tomation. IEEE transactions on automation science
and engineering, 5(1):25–31.
Zhang, M.-L. and Zhou, Z.-H. (2005). A k-nearest neigh-
bor based algorithm for multi-label classification. In
2005 IEEE international conference on granular com-
puting, volume 2, pages 718–721. IEEE.