Bone Conduction Eating Activity Detection based on YAMNet Transfer Learning and LSTM Networks

Wei Chen (1,a), Haruka Kamachi (1,b), Anna Yokokubo (2,c) and Guillaume Lopez (2,d)

(1) Graduate School of Science and Engineering, Aoyama Gakuin University, Sagamihara, Japan

(2) Department of Integrated Information Technology, Aoyama Gakuin University, Sagamihara, Japan

Keywords:

Bone Conduction, Transfer Learning, Long-short Term Memory (LSTM), Eating Behavioural Activity.

Abstract:

Everyday eating behaviors affect our health and can lead to obesity and other health problems. We propose an automatic human eating behavior estimation system that performs real-time inference using a sound event detection (SED) deep learning model. We customized YAMNet, a deep neural network from TensorFlow that is pre-trained on 521 audio event classes and based on the MobileNet v1 depthwise-separable convolution architecture. Through transfer learning, we used YAMNet as a feature extractor for acoustic signals and applied an LSTM network as a classification model that can effectively handle time-series environmental acoustic signals. Dietary events including chewing, swallowing, talking, and other (silence and noises) were collected from 14 subjects. The classification results show that the proposed method can reliably perform semantic analysis of the acoustic signals of eating behavior. The overall accuracy and the overall F1-score were both 93.3% at the frame level. The classifier established in this study provides a foundation for a healthier eating behavior monitoring system that helps prevent fast eating.

1 INTRODUCTION

In modern life, data on human habits are increasingly being digitized to support a healthier lifestyle. Automatic detection of dietary habits is one of the challenges of this digitization. This paper explores a method for automated eating activity detection using a commercially available bone conduction microphone. Compared to conventional methods for automatic detection in eating activity analysis, this paper focuses on improving the identification accuracy of each individual eating activity, under the basic premise of using acoustic signals recorded in a natural environment.

According to the 2016 global obesity population distribution by WHO, approximately 39% of the world's population is overweight, of which 13% is obese (NCD Risk Factor Collaboration, 2016). In addition, surveys of obese people have shown that many of them are "fast eaters" who chew less and eat for a shorter time (Yamaji et al., 2018). To prevent obesity, automatic detection of eating behavior using wearable devices has been progressing over the past decade (Selamat and Ali, 2020). Wearable sensors that have been proposed for automatic detection of eating behavior include in-ear microphones (Amft, 2010; Shuzo et al., 2010), neck-worn sensors (Chun et al., 2018), strain sensors (Yang et al., 2019), electromyography sensors (Huang et al., 2017), and wrist-worn sensors (Shen et al., 2016).

https://orcid.org/0000-0001-7951-137X

https://orcid.org/0000-0002-9269-1026

https://orcid.org/0000-0003-2657-4961

https://orcid.org/0000-0002-9144-3688

As the most widely used method, acoustic sensing is one of the earliest modalities studied, with advantages such as ease of wear and precise identification of chewing. Striking a balance between high-quality signal acquisition and user comfort is the main challenge of acoustic eating activity sensing. Kamachi et al. proposed a classification method for eating behavior that uses a bone conduction microphone to capture both chewing and swallowing sounds (Kamachi et al., 2021). Päßler et al. used their proposed design to analyze acoustic signal energies and to detect chewing based on the magnitude squared coherence (MSC) function (Päßler and Fischer, 2011). When teeth tap, slide, or grind against each other, they produce vibrations that travel through the jaw and skull bones as surface vibrations and easily reach the outer ear (Prakash et al., 2020), which leads to a high accuracy rate of chewing identification. However, some issues remain, such as difficulty in recognizing swallowing and susceptibility to background noise (Kamachi et al., 2020).

Chen, W., Kamachi, H., Yokokubo, A. and Lopez, G.

Bone Conduction Eating Activity Detection based on YAMNet Transfer Learning and LSTM Networks.

DOI: 10.5220/0010903700003123

In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 4: BIOSIGNALS, pages 74-84

ISBN: 978-989-758-552-4; ISSN: 2184-4305

Most of the eating activity detection methods in

the previous papers are based on manually extracted

features from speech signals and well-researched

classiﬁcation algorithms such as support vector ma-

chines (Zhang et al., 2011; Nkurikiyeyezu et al.,

2021). In recent years, with the advancement of

deep neural networks, classiﬁcation techniques us-

ing deep learning such as Convolutional neural net-

works (CNN) and Recurrent neural networks (RNN)

are also suitable for acoustic signals (Bae et al., 2016;

Xu et al., 2018). An emerging approach is to apply

transfer learning to the recognition of acoustic signals

(Ntalampiras et al., 2021).

This work uses acoustic signals recorded from a bone conduction microphone as input and trains a model that combines transfer learning and deep learning for automatic eating activity detection in natural environments. The experimental results show that our model achieves a higher overall F1-score (93.28%) than the existing approaches compared in Section 4.

This paper is organized as follows. Section 2

presents the processing pipeline focusing on the archi-

tecture, detail of the datasets, and the primary method

used in this study. In Section 3, we describe the classi-

ﬁcation method, evaluation methods, and experimen-

tal results. In section 4, we discuss our proposed

method compared with the previous ones and present

future work. Finally, we conclude this paper in Sec-

tion 5.

2 MATERIAL AND METHOD

This section describes the sample data utilized in this study and the methods used to process the audio signal data, the feature extractor, and the classifier model parameters.

2.1 Eating Behavior Signal Data Collection

Figure 1: Data collection environment and devices.

To evaluate the method for detecting eating activities using bone conduction sound, we collected meal sound data in a natural meal environment. We used a bone conduction microphone connected wirelessly to a smartphone via Bluetooth to collect the sounds of dietary activities. The smartphone was a Google Pixel 3, and the bone conduction microphone was a Motorola Finiti HZ800 Bluetooth Headset. The sound signal from the microphone was sampled at 44100 Hz. After collection, we transferred the data to a computer for labeling and analysis at 16000 Hz. Because the data were collected in a free environment, labeling had to be performed afterward.

We collected data from 14 participants. All subjects were between 11 and 32 years old. As shown in Figure 1, which reproduces the data collection conditions, subjects wore the bone conduction microphone on one ear. We also recorded a video focused on the mouth and throat of each subject to assist the subsequent labeling of the audio sections corresponding to chewing, swallowing, talking, and other sounds (such as noise). The participants were asked to say a certain word at the beginning of the experiment, which was used to synchronize the video and audio data.

To provide a natural environment, we collected data in everyday situations, including loud surrounding sounds and eating while having a conversation. Participants were asked to have a usual, everyday meal, for example at a standard household dining table with other family members or at the university cafeteria with friends, which we assume represent different noise conditions. The meal content was also unrestricted: participants ate whatever they wanted, as in daily life, so various food types were mixed unpredictably within the same meal. The recording duration also varied from case to case, starting either at the beginning or in the middle of the meal. In addition, we collected extra swallowing sound data, because our previous works pointed out that swallowing data were scarce compared to the other classes. Swallowing audio data were collected in the same way from 8 men and women aged 22 to 42, who were asked to have a couple of drinks.

2.2 Architecture of the Proposed Method

Our proposed architecture is shown in Figure 2. The first step is to label the acquired data so that it can be used as reference training data for predicting eating behavior. Segments of the acoustic eating behavior data recorded with the bone conduction microphone were manually selected and labeled as chewing, swallowing, speech, or other sound events (including noise). In the next stage, the labeled sound episodes are preprocessed by window segmentation and acoustic enhancement, and features are extracted by transfer learning using YAMNet (Tensorflow, 2020). Finally, the resulting embedding is classified using a Long Short-Term Memory (LSTM) deep learning network (Hochreiter and Schmidhuber, 1997).

Figure 2: Block Diagram overview in this study.

2.3 Sound Episode Segmentation

To segment eating behavior down to individual bites, the acoustic signal of a single bite has to be labeled accurately. For the labeling process, we used speech intensity to segment the signal more easily (Clark et al., 2014). Because the purpose of this study is to estimate human eating behavior automatically and to classify the audio signal in real time, we define four eating behavior events: chewing, swallow, talk, and other. Chewing refers to one vertical opening and closing of the upper and lower teeth, that is, a single chew. Swallowing refers to a single swallow of food; a drink taken by the subject is also counted as a single swallow. Talk refers to any vocalization by the subject. Other refers to events that are none of the above, such as silence or noise. The number of labeled sound events of each type is shown in Table 1. The data are imbalanced because the number of swallows is low. This is consistent with the recommendation to chew as many times as possible before each swallow.

We deﬁne one acoustic eating episode label by

comparing the raw acoustic data with the sound in-

tensity and referring to the beginning and end points

of the waves that are visually obvious as shown in

Figure 3. The amplitude of swallow episode in the

acoustic data acquired from the bone conduction mi-

crophone tends to be smaller than that of chewing

episode. Therefore, the acoustic data was recorded

and synchronized with video to label speciﬁc swal-

lows.

BIOSIGNALS 2022 - 15th International Conference on Bio-inspired Systems and Signal Processing

Figure 3: Part of continuous chewing data labeling scene.

Table 1: The number of datasets collected.

| Sound episode categories | Data number |
|---|---|
| Chewing data | 3395 |
| Swallow data | 334 |
| Talk data | 491 |

2.4 Feature Extraction and Signal Processing

The most commonly used features in acoustic signal processing are Mel-Frequency Cepstral Coefficients (MFCC) (Ittichaichareon et al., 2012). To compute MFCCs, the magnitude spectra projected onto the reduced frequency bands are converted to logarithmic magnitudes, which are then approximately whitened and compressed by the discrete cosine transform (DCT). However, because the DCT removes information and destroys spatial relations, most deep learning methods for audio signal processing use log-mel spectrograms for feature extraction instead (Purwins et al., 2019; Lee et al., 2017; Zheng and Yan, 2019).

Feature extraction methods based on log-mel spectrograms using transfer learning have also emerged. An evaluation on the Non-Semantic Speech benchmark (NOSS), which assesses the general usefulness of speech representations on "non-semantic" tasks, shows that the intermediate-layer outputs of YAMNet and TRILL both reach good accuracy (Shor et al., 2020). In the same acoustic signal classification domain, the YAMNet model has been used as a feature extractor through transfer learning, outputting an intermediate-layer embedding for COVID-19 cough classification (Elizalde and Tompkins, 2021).

2.4.1 Pre-emphasis

Pre-emphasis is a widely used method in audio signal processing that emphasizes the high-frequency components of the audio waveform (Dong et al., 2020). In this study, pre-emphasis is applied to the raw signal before further processing to compensate for the high-frequency portion of the acoustic signal. The filter used is given in Equation (1):

$$Au'(n) = Au(n) - \alpha \, Au(n-1) \qquad (1)$$

where $Au$ and $Au'$ are the raw audio signal before and after the pre-emphasis operation, $n$ is the index of each sample in the raw audio signal, and $\alpha$ is a parameter that is normally assigned a value in the range [0.9, 1].
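The pre-emphasis filter of Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `pre_emphasis`, the default `alpha = 0.97` (the paper only states the range [0.9, 1]), and the choice to leave the first sample unchanged are assumptions.

```python
import numpy as np

def pre_emphasis(au, alpha=0.97):
    """Apply Au'(n) = Au(n) - alpha * Au(n - 1) to a 1-D signal.

    alpha = 0.97 is an assumed default inside the stated range [0.9, 1].
    The first sample has no predecessor and is kept unchanged (assumption).
    """
    au = np.asarray(au, dtype=np.float64)
    if au.size == 0:
        return au.copy()
    return np.concatenate(([au[0]], au[1:] - alpha * au[:-1]))
```

A constant signal is strongly attenuated after the first sample, which is the expected behaviour of a filter that suppresses low frequencies relative to high ones.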

2.4.2 Framing and Windowing

After pre-emphasis, the raw audio signal is split into sliding windows, from which log-mel spectrograms are generated. Statistics of the collected data are shown in Table 2. The average duration of one chew is around 315 ms, and the labeled data also contain very short sound episodes. If the window is longer than a labeled episode, the label assigned to the window becomes inaccurate. Therefore, to classify more sound episode clips effectively, we set the window size to 250 ms, which is below the mean duration of a chewing event, and the hop size to 93 ms, which is less than one-half and greater than one-third of the window size.
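The sliding-window split described above can be sketched as follows. The sample counts (4000 samples per window and 1500 samples per hop at 16 kHz) are the values reported later in the paper; the function name `frame_signal` and the decision to drop trailing samples that do not fill a whole window are assumptions made for this illustration.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, analysis rate reported in the paper
WINDOW = 4000        # 250 ms at 16 kHz
HOP = 1500           # roughly 93 ms at 16 kHz

def frame_signal(signal, window=WINDOW, hop=HOP):
    """Cut a 1-D signal into overlapping windows of `window` samples.

    Trailing samples that do not fill a complete window are dropped
    (an assumption; padding would be an equally valid choice).
    """
    signal = np.asarray(signal)
    if signal.size < window:
        return np.empty((0, window), dtype=signal.dtype)
    n_frames = 1 + (signal.size - window) // hop
    # Row i holds samples [i * hop, i * hop + window)
    idx = np.arange(window)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]
```

With these settings, one second of audio yields nine windows, each overlapping its neighbour by 2500 samples.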

Table 2: Statistics of each label timing.

| Categories | Mean (ms) | Min (ms) | Max (ms) | STD |
|---|---|---|---|---|
| Chewing | 314 | 28 | 1042 | 0.1572 |
| Swallow | 405 | 54 | 1541 | 0.2353 |
| Talk | 859 | 92 | 5067 | 0.6714 |

2.4.3 Log-mel Spectrogram

Like YAMNet, many other acoustic deep learning methods use mel spectrograms as the input representation for audio signals (Zeng et al., 2019). Following the YAMNet specification, we generate the mel-scaled spectrograms with a triangular filterbank of 64 log-energies. The relationship between the mel scale and the linear frequency is given in Equation (2):

$$f_{mel} = 2595 \cdot \lg\left(1 + \frac{f}{700\,\mathrm{Hz}}\right) \qquad (2)$$

where $f_{mel}$ is the mel frequency and $f$ is the linear frequency.
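As a quick check of Equation (2), the conversion and its inverse can be written directly. The function names `hz_to_mel` and `mel_to_hz` are chosen for this sketch; the inverse is obtained by solving Equation (2) for f and is not stated in the paper.

```python
import math

def hz_to_mel(f_hz):
    """Equation (2): f_mel = 2595 * log10(1 + f / 700 Hz)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of Equation (2), derived by solving for f."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

The scale is nearly linear at low frequencies and compresses high frequencies; by construction, 1000 Hz maps to approximately 1000 mel.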

We then apply the short-time Fourier transform (STFT) to obtain the frequency content of shorter intervals. We set the STFT window size to 25 ms and the STFT hop size to 10 ms. The triangular filterbank $H_m(k)$ applied to the signal is given in Equation (3):

$$H_m(k) = \begin{cases} \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[2ex] \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\[2ex] 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $f(m)$ is the centre frequency of the $m$-th filter and $H_m(k)$ is the resulting filterbank matrix.

The result of applying the filterbank to the STFT energy spectrum of the raw audio signal is given in Equation (4):

$$\mathrm{LogMelSpec}(m) = \sum_{k=f(m-1)}^{f(m+1)} \log\left(H_m(k) \cdot |X(k)|^2\right) \qquad (4)$$

where $|X(k)|^2$ is the energy spectrum, $m$ is the filter index, and $k$ is the FFT bin index.
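Equations (3) and (4) can be combined into a small NumPy sketch that builds the triangular filterbank and applies it to short STFT frames. Several choices here are assumptions rather than details given in the paper: the FFT length of 512, the 125 Hz to 7500 Hz band, the Hann window, and the small offset `eps`. The sketch also takes the logarithm after the filter-weighted sum, which is the common practical convention and an assumed reading of Equation (4).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=64, n_fft=512, sr=16000, f_min=125.0, f_max=7500.0):
    """Triangular filters of Equation (3), shape (n_mels, n_fft // 2 + 1)."""
    # n_mels + 2 points equally spaced in mel give f(0) .. f(n_mels + 1)
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    f = mel_to_hz(mel_points) * n_fft / sr  # centre frequencies in FFT-bin units
    k = np.arange(n_fft // 2 + 1, dtype=np.float64)
    H = np.zeros((n_mels, k.size))
    for m in range(1, n_mels + 1):
        rising = (k - f[m - 1]) / (f[m] - f[m - 1])
        falling = (f[m + 1] - k) / (f[m + 1] - f[m])
        H[m - 1] = np.maximum(0.0, np.minimum(rising, falling))
    return H

def log_mel_spectrogram(frames, H, n_fft=512, eps=1e-6):
    """frames: (n_frames, frame_len) STFT frames, e.g. 400 samples for 25 ms."""
    window = np.hanning(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1)) ** 2
    return np.log(power @ H.T + eps)  # eps avoids log(0) on silent bands
```

Each 25 ms frame (400 samples at 16 kHz) produces one row of 64 log-energies.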

2.4.4 YAMNet Embedding

YAMNet is a pre-trained deep network that predicts 521 audio event classes based on the AudioSet-YouTube corpus (Gemmeke et al., 2017) and employs the MobileNet v1 (depthwise-separable convolution) architecture (Howard et al., 2017). The AudioSet data used to train the YAMNet model contains more than 632 audio event classes, sampled as 10-second clips from YouTube videos that have been played more than 1000 times. Owing to the properties of deep learning and of YAMNet, a feature extraction layer is built into the model. Therefore, the log-mel spectrogram of the speech signal directly becomes the input to MobileNet v1.

The network operates on the input mel spectrogram of size (48, 32, 32) obtained in the previous section. The structure of YAMNet is shown in Table 3. The input signal is first processed by a convolution layer with a kernel size of 3 × 3 and then passes through depthwise-separable convolution layers whose number of filters grows from 64 to 1024. A global average pooling (AP) layer follows, which prevents potential over-fitting by reducing the total number of parameters of the model. Finally, the network ends with two fully connected (FC) layers of size 1024 and 64, followed by a softmax layer that assigns the log-mel spectrogram to one of the 521 sound classes. The most important part of YAMNet for this study is the second-to-last fully connected layer, which has 1024 values. We customize the YAMNet network by keeping its structure up to this second-to-last fully connected layer and use the resulting 1024-value embedding as the feature extractor, treating YAMNet as a transfer learning method. The advantage of this method is that it provides sufficient acoustic features even when the amount of data for the classification problem is not large, and it offers a good trade-off between performance and computational cost.

Table 3: YAMNet body architecture.

| Type | Filter shape | Input size |
|---|---|---|
| Conv | 3 × 3 × 3 | 48 × 32 × 32 |
| Conv dw | 3 × 3 × 3 dw | 48 × 32 × 32 |
| Conv pw | 1 × 1 × 32 × 64 | 48 × 32 × 64 |
| Conv dw | 3 × 3 × 64 dw | 24 × 16 × 64 |
| Conv pw | 3 × 3 × 128 | 24 × 16 × 128 |
| Conv dw | 3 × 3 × 128 dw | 24 × 16 × 128 |
| Conv pw | 1 × 1 × 128 × 128 | 24 × 16 × 128 |
| Conv dw | 3 × 3 × 128 dw | 12 × 8 × 128 |
| Conv pw | 1 × 1 × 128 × 256 | 12 × 8 × 256 |
| Conv dw | 3 × 3 × 256 dw | 12 × 8 × 256 |
| Conv pw | 1 × 1 × 256 × 256 | 12 × 8 × 256 |
| Conv dw | 3 × 3 × 256 dw | 6 × 4 × 512 |
| Conv pw | 1 × 1 × 256 × 512 | 6 × 4 × 512 |
| Conv dw | 3 × 3 × 512 | 6 × 4 × 512 |
| −Conv | | |
| Conv dw | 3 × 3 × 512 | 3 × 2 × 1024 |
| Conv pw | 1 × 1 × 512 × 1024 | 3 × 2 × 1024 |
| Conv dw | 3 × 3 × 1024 dw | 3 × 2 × 1024 |
| Conv pw | 1 × 1 × 1024 × 1024 | 3 × 2 × 1024 |
| AP&Pool | 1 × 1 × 1024 | 3 × 2 |
| FC | 1024 × 512 | 1 × 1 × 512 |
| Softmax | Classifier | 1 × 1 × 512 |
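The computational benefit of the depthwise-separable layers listed in Table 3 can be illustrated by counting weights. The sketch below compares a standard k × k convolution with a depthwise (dw) plus pointwise (pw) pair for the same input and output channels. The function names are hypothetical, biases and batch normalization parameters are ignored, and the 512 to 1024 channel example is picked from the table purely for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

standard = standard_conv_params(3, 512, 1024)         # 4718592
separable = depthwise_separable_params(3, 512, 1024)  # 528896
reduction = standard / separable                      # roughly 8.9x fewer weights
```

For this layer size the separable form needs roughly one ninth of the weights, which is the source of the favourable trade-off between performance and computational cost mentioned above.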

2.5 Classifier Training and Evaluating

2.5.1 LSTM Model

Recurrent Neural Networks (RNNs) are very effective for analyzing sequences such as text, acoustic signals, and video (Zhang and Man, 1998). The input signal can be held persistently by looping it within the network. The main feature of an RNN is that it remembers the previous state and uses that information to determine the next state. Therefore, RNN-based models are well suited to analyzing time series. However, the gradient of a traditional RNN depends not only on the present error but also on past errors, so the back-propagated gradients tend to grow enormously or vanish over time.

LSTM is a type of RNN that has a built-in mechanism to determine which information to store and which to delete, and it can therefore retain long-term dependencies without accumulating errors (Navarro et al., 2020). The LSTM module has three internal gates, termed the input, forget, and output gates, as shown in Figure 4. The input gate controls when new information is written into memory. The forget gate lets the cell state distinguish important from unwanted data, discarding information that is no longer needed and leaving space for new data. The output gate controls how much of the memory stored in the cell state is output. The cell state has a weighting optimization mechanism that controls each gate based on the output error of the network and sends the prediction to the next LSTM module.

Figure 4: General structure of a Long Short-Term Memory (LSTM) neural network.
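The role of the three gates can be made concrete with a single LSTM time step written in NumPy. This follows the widely used textbook formulation, not code from the paper; the function name `lstm_step` and the stacked weight layout (input, forget, candidate, output) are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    stacked in the assumed order input, forget, candidate, output.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old memory to keep
    g = np.tanh(z[2 * H:3 * H])  # candidate values to write
    o = sigmoid(z[3 * H:4 * H])  # output gate: how much memory to expose
    c_t = f * c_prev + i * g     # updated cell state
    h_t = o * np.tanh(c_t)       # hidden state passed to the next step
    return h_t, c_t
```

Because the cell state is updated additively (`f * c_prev + i * g`), gradients can flow through many time steps without the repeated squashing that affects a plain RNN.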

2.5.2 Evaluation of Classifier

We use recall, specificity, precision, accuracy, and F1-score to evaluate the performance of the classifier. These values are computed from the four possible outcomes of a classification task. A sample whose true label is positive is counted as a true positive (TP) if it is classified as positive and as a false negative (FN) if it is classified as negative. A sample whose true label is negative is counted as a true negative (TN) if it is classified as negative and as a false positive (FP) if it is classified as positive. Based on these counts, recall, specificity, precision, accuracy, and F1-score are defined by the following equations:

$$Recall = \frac{TP}{TP + FN} \qquad (5)$$

$$Specificity = \frac{TN}{FP + TN} \qquad (6)$$

$$Precision = \frac{TP}{TP + FP} \qquad (7)$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$

$$F1_{score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} \qquad (9)$$
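For a multi-class problem, Equations (5) to (9) are usually applied per class in a one-vs-rest fashion from the confusion matrix. The following sketch assumes rows are true labels and columns are predicted labels, and that every class occurs at least once so that no denominator is zero; the function name `per_class_metrics` is hypothetical.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest metrics from a confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=np.float64)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # true class c, predicted as something else
    fp = cm.sum(axis=0) - tp   # predicted class c, true label something else
    tn = total - tp - fn - fp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "specificity": tn / (fp + tn),
        "precision": precision,
        "accuracy": (tp + tn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

For example, the toy matrix `[[8, 2], [1, 9]]` gives a recall of 0.8 and a specificity of 0.9 for the first class.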

3 RESULT

3.1 YAMNet Feature Extraction

The acoustic dietary data collected in this study were segmented and features were extracted as described in Section 2.4, as shown in Figure 5. All WAV files of the recorded audio signals are resampled to 16 kHz, pre-emphasized, and then cut into sliding windows with a window size of 250 ms (4000 sample points) and a hop size of 93 ms (1500 sample points). Each sliding window is further split into short frames, to which the STFT is applied to generate a spectrogram. Finally, a mel filterbank with 64 log-energies is applied to output the log-mel spectrogram that is fed to the YAMNet transfer learning model.

Figure 5: Sound clip splitting and generating the embedding

layer as feature extractor using YAMNet.


3.2 LSTM Model Implementation

The sound clips, after feature extraction, are divided into training, validation, and test data. The number of clips used for classification is shown in Table 4. We set the ratio of training, validation, and test data to 75%-10%-15% (Vasudevan et al., 2020). As input to the model, three consecutive sound clips are combined into a single continuous sequence in order to widen the time-domain context available to the LSTM gates, and consecutive sequences overlap one another. Based on the shape of our input data, we set the batch size, time steps, and feature size to 16, 3, and 1024, respectively.
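The grouping of three consecutive clips into one LSTM input can be sketched as below. The paper states that the sequences overlap but not by how much; the stride of one clip used here, like the function name `make_sequences`, is an assumption.

```python
import numpy as np

def make_sequences(embeddings, time_steps=3):
    """Stack runs of `time_steps` consecutive embeddings into LSTM inputs.

    embeddings: (n_clips, 1024) YAMNet embeddings in temporal order.
    Returns an array of shape (n_clips - time_steps + 1, time_steps, 1024);
    neighbouring sequences share time_steps - 1 clips (stride 1, assumed).
    """
    embeddings = np.asarray(embeddings)
    n = embeddings.shape[0] - time_steps + 1
    if n <= 0:
        return np.empty((0, time_steps) + embeddings.shape[1:], dtype=embeddings.dtype)
    return np.stack([embeddings[i:i + time_steps] for i in range(n)])
```

Five clips therefore yield three sequences of shape (3, 1024), which matches the reported time steps and feature size.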

Table 4: Number of the sound clip for each label.

| Category | Train | Validation | Test | Total |
|---|---|---|---|---|
| Chewing | 8283 | 1155 | 1672 | 11304 |
| Swallow | 874 | 179 | 158 | 1129 |
| Talk | 3357 | 470 | 724 | 4504 |
| Other | 20913 | 2676 | 4134 | 27658 |
| Total | 33427 | 4480 | 6688 | 44595 |

The parameters of each layer of the LSTM model are listed in Table 5. The numbers of hidden units of the two LSTM layers are set to 256 and 128, respectively (Altché and de La Fortelle, 2017). To prevent overfitting of the LSTM model, we added a dropout layer with a rate of 0.25, which can effectively prevent memory loss in the model (Semeniuta et al., 2016). Finally, the softmax layer outputs a vector with 4 elements as the result of the LSTM model, whose elements are the probabilities of other, chewing, swallow, and talk data, in that order. When the LSTM model is compiled, we train it using the Adam optimizer. Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data (Kingma and Ba, 2014). The learning rate is set to 0.001, and sparse categorical cross entropy is selected as the loss function. Sparse categorical cross entropy is used when the classes are mutually exclusive, that is, when each sample belongs to exactly one class (Totakura et al., 2020); it is expressed in Equation (10), where $N$ is the number of categories, which is 4, and $y_i$ and $\log \hat{y}_i$ are the label value and its log probability, respectively. The model loss on the training and validation datasets is shown in Figure 6. The plotted data show no gradient explosion. The training loss drops below 0.05 from 80 epochs and stays at about 0.02 from 200 epochs. The validation loss fluctuates and stabilizes at about 0.5 from 120 epochs. We therefore trained the model for 200 epochs, where the loss is stable.

$$Loss = -\sum_{i=1}^{N} y_i \cdot \log \hat{y}_i \qquad (10)$$
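With one-hot labels, the sum in Equation (10) reduces to the negative log probability of the true class, which is what the "sparse" variant computes from integer labels. A minimal NumPy sketch follows; averaging over the batch and clipping with `eps` are assumptions added for numerical convenience, and the function name is hypothetical.

```python
import numpy as np

def sparse_categorical_cross_entropy(labels, probs, eps=1e-12):
    """Mean of -log(probability assigned to the true class).

    labels: (batch,) integer class indices; probs: (batch, N) softmax outputs.
    """
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=np.float64)
    picked = probs[np.arange(labels.size), labels]  # probability of the true class
    return float(np.mean(-np.log(np.clip(picked, eps, 1.0))))
```

A confident correct prediction contributes a loss near zero, while a uniform prediction over the four classes contributes log(4), about 1.386.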

Table 5: LSTM body architecture for classification.

| Type | Filter shape | Input size |
|---|---|---|
| LSTM1 | kernel size 256 | 1 × 3 × 256 |
| LSTM2 | kernel size 128 | 1 × 128 |
| Dropout | drop rate 0.25 | 1 × 128 |
| FC | 128 × 4 | 1 × 128 |
| Softmax | Classifier | 1 × 4 |
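To give a sense of the size of the classifier in Table 5, the number of trainable weights can be estimated with the standard LSTM parameter formula. This is an estimate under stated assumptions, not a figure from the paper: it assumes a 1024-dimensional input feature, 256 and 128 units in the two LSTM layers, bias terms in every layer, and the common four-gate LSTM parameterization.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Fully connected layer with a bias term."""
    return input_dim * units + units

lstm1 = lstm_params(1024, 256)  # 1311744
lstm2 = lstm_params(256, 128)   # 197120
fc = dense_params(128, 4)       # 516
total = lstm1 + lstm2 + fc      # 1509380
```

Under these assumptions the classifier has about 1.5 million weights, most of them in the first LSTM layer because of the 1024-dimensional embedding input.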

Figure 6: The model loss on training and validation

datasets.

3.3 Classiﬁcation Performances

As described in Section 2.5.2, a confusion matrix is used to evaluate the performance of the classifier. The 6688 sound clips of the test dataset described in Section 3.2 are used for the evaluation, and the result is shown in Figure 7.

Figure 7: Confusion matrix of YAMNet as feature extractor.

According to the confusion matrix, the sound clips of Chewing, Swallow, Talk, and Other, whose features were extracted by YAMNet, are classified with accuracy rates of 91.54%, 73.64%, 90.73%, and 95.13%, respectively. The weighted average accuracy reaches 93.30%, and the overall F1-score reaches 93.28%. To further examine the performance of the classifier, recall, specificity, precision, accuracy, and F1-score were calculated with Equations (5)-(9) in Section 2.5.2 for the four categories Chewing, Swallow, Talk, and Other, as shown in Table 6.

Table 6: Evaluation result of the trained LSTM classifier for each category.

| Evaluation metric | Chewing | Swallow | Talk | Other | Average | Weighted Average |
|---|---|---|---|---|---|---|
| Recall | 91.55% | 73.65% | 90.74% | 95.13% | 87.77% | 93.3% |
| Specificity | 96.86% | 99.83% | 98.91% | 91.46% | 96.77% | 93.78% |
| Precision | 90.74% | 90.83% | 90.61% | 94.86% | 91.76% | 93.29% |
| Accuracy | 91.54% | 73.64% | 90.73% | 95.13% | 87.76% | 93.30% |
| F1-Score | 91.14% | 81.34% | 90.67% | 94.99% | 89.54% | 93.28% |

The accuracy rate for Swallow is the lowest at 73.64%, and the rate of clips with the true label Chewing that were predicted as Other was 21.62%, which shows that the model still has difficulty classifying swallowing data. Several reasons lead to the low accuracy on swallow data compared to the other sound events. The first reason is that swallowing is the least represented class in this dataset. In the recorded data, approximately 15 or more chewing events correspond to one swallow event. As a result, the total number of swallow sound clips is only 1129, which is 2.5% of the total number of sound clips (44595) and 10% of the number of chewing sound clips (11304). This results in imbalanced data. To account for this, we incorporated class weighting into the LSTM model. However, the balance of the data needs to be improved further.
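The paper does not state how the weighting was computed. One common scheme, shown here purely as an assumed illustration, is inverse-frequency class weighting, where each class weight is the total number of training clips divided by the number of classes times the class count. The counts below are the training counts from Table 4; the function name is hypothetical.

```python
def inverse_frequency_weights(counts):
    """weight_c = total / (n_classes * count_c), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Training clip counts from Table 4
train_counts = {"other": 20913, "chewing": 8283, "swallow": 874, "talk": 3357}
weights = inverse_frequency_weights(train_counts)
```

Under this scheme a misclassified swallow clip contributes far more to the loss than a misclassified clip of the majority class, which partly compensates for the imbalance without changing the data itself.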

4 DISCUSSION

4.1 Performance Comparison with

Previous Research

Several classification methods for the detection of eating behavior have been developed in the last decade. Although dietary monitoring technology is evolving rapidly with the advancement of sensors, automatic real-time monitoring of comprehensive dietary intake remains a major challenge worth studying. There are two ways to classify eating behavior: one is to classify whether or not a given time falls within a meal period, and the other is to classify chewing, swallowing, and speech in real time, as in this study.

A large number of meal-period classifiers already exist, and they achieve sufficiently high accuracy. Gao et al. developed a practical solution for automatic detection of eating episodes (Gao et al., 2016), achieving a detection accuracy of over 94% using the LOSO (Leave One Sample Out) method based on deep learning. Bi et al. proposed a wearable system for eating detection in free-living scenarios that achieves an accuracy of over 90%, but it also classifies by dietary period (Bi et al., 2017). Despite its high accuracy, this kind of classification cannot determine how many times the food was chewed during the meal period. Zhang et al. followed the same policy as this study to achieve real-time classification of eating episodes (Zhang et al., 2020). They presented the design, implementation, and evaluation of a necklace for the detection and validation of chewing sequences and eating episodes in free-living conditions. They achieved an F1-score of 76.2% at the per-second level and 81.6% at the per-episode level, and an overall F1-score of 73.7% in detecting chewing sequences in free-living conditions. Kamachi et al. suggested an automatic segmentation method to detect eating activity using bone conduction sound, as in this study, and proposed a segmentation method based on a chewing model (Kamachi et al., 2021). A comparison with the studies above and with related studies on chewing sound detection is shown in Table 7. In this comparison, our study achieved a better overall F1-score across sound episodes. In addition, the data in this study were also collected in free-living conditions, including talking and noise.

Table 7: Comparison between the developed YAMNet featured LSTM classifier and previous classifiers.

| Author | Recorder | Events | F1Score_comp |
|---|---|---|---|
| (Bedri et al., 2017) | | Ch, Dr, Ta, Sw, Wa | 80% |
| (Gao et al., 2016) | AR | Ea | Ac 94% |
| (Zhang et al., 2020) | IMU | Ea | 81.6% |
| (Diou et al., 2017) | | | 88.3% |
| (Kamachi et al., 2021) | | Ch, Sw, Ta, O, Ea | Pre 88.1% |
| Our method | | Ch, Sw, Ta, O | 93.28% |

Note: F1Score_comp means comprehensive F1-score; Mu means multimodal sensor; AR means audio record; Ch means chewing event; Dr means drinking event; Ta means talking event; Wa means walking event; Ea means eating episode; O means other (silent) data; Sw means swallow event; Ac and Pre mean overall accuracy and precision, respectively, when the F1-score is not provided.


4.2 Future Work for Classiﬁcation

As indicated by the results of this study, the accuracy rate for swallow data is low. Adding more sound episodes will be essential to improve the classifier. It is also possible that the volume of swallowing sounds recorded with bone conduction microphones varies greatly and that swallowing episodes that were too quiet were not adequately captured in the labeling phase of this study. Our future plans are as follows:

• Eliminate human error during labeling by employing throat microphones in addition to bone conduction microphones and video during dataset collection.

• Classify eating behavior from simultaneous input of the throat microphone and the bone conduction microphone when a better swallowing sound estimation result is needed.

5 CONCLUSIONS

In this study, we developed a classification system for human eating behavior in a natural environment using YAMNet transfer learning and LSTM networks. The data for the classification of eating behavior consist of chewing, swallowing, speech, and other signals including noise. In particular, we found that transfer learning with YAMNet can enhance the features of speech signals and improve the accuracy of both machine learning and deep learning classification models. Using an LSTM, we built a classifier that takes the output of the YAMNet embedding layer as its input feature. The F1-score and accuracy over all classified data reached 93.28% and 93.3%, respectively. Using the classification predictions from this research, we can estimate the number of chews and swallows in real time, and we expect to build a smarter eating environment by digitizing eating behavior and preventing fast eating.
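As a minimal sketch of how chewing and swallowing counts could be derived from frame-level predictions, the following Python function counts each contiguous run of identical labels as one event. The run-based counting rule, the function name count_events, the min_frames parameter, and the label strings are all illustrative assumptions, not the procedure used in this study.

```python
from itertools import groupby


def count_events(frame_labels, targets=("chewing", "swallowing"), min_frames=1):
    """Count contiguous runs of each target label in a frame-label sequence.

    A run shorter than `min_frames` is ignored, which is one simple way
    to suppress isolated, possibly spurious, frame predictions.
    """
    counts = {t: 0 for t in targets}
    for label, run in groupby(frame_labels):
        run_length = sum(1 for _ in run)
        if label in counts and run_length >= min_frames:
            counts[label] += 1
    return counts


# Hypothetical classifier output, one label per frame.
frames = [
    "other", "chewing", "chewing", "other", "chewing",
    "swallowing", "swallowing", "talking", "chewing",
]

print(count_events(frames))                # {'chewing': 3, 'swallowing': 1}
print(count_events(frames, min_frames=2))  # {'chewing': 1, 'swallowing': 1}
```

Because the function only looks at the labels received so far, it can be re-run on a growing buffer of predictions. A streaming version would need to handle a run that is still open at the end of the buffer.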

ACKNOWLEDGEMENTS

This research was supported by the Lotte Research Promotion Grant. All experimental protocols were approved by the ethics committee of Aoyama Gakuin University (H18-003).

REFERENCES

Altché, F. and de La Fortelle, A. (2017). An lstm network for highway trajectory prediction. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 353–359. IEEE.

Amft, O. (2010). A wearable earpad sensor for chewing

monitoring. In SENSORS, 2010 IEEE, pages 222–

227. IEEE.

Bae, S. H., Choi, I., and Kim, N. S. (2016). Acoustic scene classification using parallel combination of lstm and cnn. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pages 11–15.

Bedri, A., Li, R., Haynes, M., Kosaraju, R. P., Grover, I.,

Prioleau, T., Beh, M. Y., Goel, M., Starner, T., and

Abowd, G. (2017). Earbit: using wearable sensors to

detect eating episodes in unconstrained environments.

Proceedings of the ACM on interactive, mobile, wear-

able and ubiquitous technologies, 1(3):1–20.

Bi, S., Wang, T., Davenport, E., Peterson, R., Halter, R.,

Sorber, J., and Kotz, D. (2017). Toward a wearable

sensor for eating detection. In Proceedings of the

2017 Workshop on Wearable Systems and Applica-

tions, pages 17–22.

Chun, K. S., Bhattacharya, S., and Thomaz, E. (2018). De-

tecting eating episodes by tracking jawbone move-

ments with a non-contact wearable sensor. Proceed-

ings of the ACM on Interactive, Mobile, Wearable and

Ubiquitous Technologies, 2(1):1–21.

Clark, J. P., Adams, S. G., Dykstra, A. D., Moodie, S., and

Jog, M. (2014). Loudness perception and speech in-

tensity control in parkinson’s disease. Journal of com-

munication disorders, 51:1–12.

Diou, C., Papapanagiotou, V., and Delopoulos, A. (2017).

Chewing detection from an in-ear microphone using

convolutional neural networks. In 2017 39th Annual

International Conference of the IEEE Engineering in

Medicine and Biology Society (EMBC), pages 1258–

1261. IEEE.

Dong, X., Yin, B., Cong, Y., Du, Z., and Huang, X. (2020). Environment sound event classification with a two-stream convolutional neural network. IEEE Access, 8:125714–125721.

Elizalde, B. and Tompkins, D. (2021). Covid-19 detection

using recorded coughs in the 2021 dicova challenge.

arXiv preprint arXiv:2105.10619.

Gao, Y., Zhang, N., Wang, H., Ding, X., Ye, X., Chen,

G., and Cao, Y. (2016). ihear food: eating detection

using commodity bluetooth headsets. In 2016 IEEE

First International Conference on Connected Health:

Applications, Systems and Engineering Technologies

(CHASE), pages 163–172. IEEE.

Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A.,

Lawrence, W., Moore, R. C., Plakal, M., and Rit-

ter, M. (2017). Audio set: An ontology and human-

labeled dataset for audio events. In 2017 IEEE Inter-

national Conference on Acoustics, Speech and Signal

Processing (ICASSP), pages 776–780. IEEE.

Hochreiter, S. and Schmidhuber, J. (1997). Lstm can solve

hard long time lag problems. Advances in neural in-

formation processing systems, pages 473–479.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

Huang, Q., Wang, W., and Zhang, Q. (2017). Your glasses

know your diet: Dietary monitoring using electromyo-

graphy sensors. IEEE Internet of Things Journal,

4(3):705–712.

Ittichaichareon, C., Suksri, S., and Yingthawornsuk, T.

(2012). Speech recognition using mfcc. In Interna-

tional conference on computer graphics, simulation

and modeling, pages 135–138.

Kamachi, H., Kondo, T., Hossain, T., Yokokubo, A., and

Lopez, G. (2021). Automatic segmentation method of

bone conduction sound for eating activity detailed de-

tection. In Adjunct Proceedings of the 2021 ACM In-

ternational Joint Conference on Pervasive and Ubiq-

uitous Computing and Proceedings of the 2021 ACM

International Symposium on Wearable Computers,

pages 310–315.

Kamachi, H., Kondo, T., Yokokubo, A., and Lopez, G. (2020). Classification method of eating behavior by dietary sound collected in natural meal environment. In Activity and Behavior Computing, ABC 2020. Smart Innovation, Systems and Technologies, volume 204, pages 135–152. Springer, Singapore.

Kingma, D. P. and Ba, J. (2014). Adam: A

method for stochastic optimization. arXiv preprint

arXiv:1412.6980.

Lee, J., Park, J., Kim, K. L., and Nam, J. (2017). Sample-

level deep convolutional neural networks for music

auto-tagging using raw waveforms. arXiv preprint

arXiv:1703.01789.

Navarro, J. M., Martínez-España, R., Bueno-Crespo, A., Martínez, R., and Cecilia, J. M. (2020). Sound levels forecasting in an acoustic sensor network using a deep neural network. Sensors, 20(3):903.

NCD Risk Factor Collaboration (2016). Trends in adult body-mass index in 200 countries from 1975 to 2014: a pooled analysis of 1698 population-based measurement studies with 19·2 million participants. The lancet, 387(10026):1377–1396.

Nkurikiyeyezu, K., Kamachi, H., Kondo, T., Jain, A., Yokokubo, A., and Lopez, G. (2021). Classification of eating behaviors in unconstrained environments. Biomedical Engineering Systems and Technologies, BIOSTEC 2020. Communications in Computer and Information Science, 1400:592–609.

Ntalampiras, S., Kosmin, D., and Sanchez, J. (2021). Acoustic classification of individual cat vocalizations in evolving environments. In 2021 44th International Conference on Telecommunications and Signal Processing (TSP), pages 254–258. IEEE.

Päßler, S. and Fischer, W.-J. (2011). Acoustical method for objective food intake monitoring using a wearable sensor system. In 2011 5th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops, pages 266–269. IEEE.

Prakash, J., Yang, Z., Wei, Y.-L., Hassanieh, H., and Choud-

hury, R. R. (2020). Earsense: earphones as a teeth

activity sensor. In Proceedings of the 26th Annual

International Conference on Mobile Computing and

Networking, pages 1–13.

Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., and Sainath, T. (2019). Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219.

Selamat, N. A. and Ali, S. H. M. (2020). Automatic food in-

take monitoring based on chewing activity: A survey.

IEEE Access, 8:48846–48869.

Semeniuta, S., Severyn, A., and Barth, E. (2016). Recur-

rent dropout without memory loss. arXiv preprint

arXiv:1603.05118.

Shen, Y., Salley, J., Muth, E., and Hoover, A. (2016).

Assessing the accuracy of a wrist motion tracking

method for counting bites across demographic and

food variables. IEEE journal of biomedical and health

informatics, 21(3):599–606.

Shor, J., Jansen, A., Maor, R., Lang, O., Tuval, O., Quitry,

F. d. C., Tagliasacchi, M., Shavitt, I., Emanuel, D.,

and Haviv, Y. (2020). Towards learning a universal

non-semantic representation of speech. arXiv preprint

arXiv:2002.12764.

Shuzo, M., Komori, S., Takashima, T., Lopez, G., Tatsuta, S., Yanagimoto, S., Warisawa, S., Delaunay, J.-J., and Yamada, I. (2010). Wearable eating habit sensing system using internal body sound. Journal of Advanced Mechanical Design, Systems, and Manufacturing, 4(1):158–166.

Tensorflow (2020). Sound classification with yamnet. https://github.com/tensorflow/models/tree/master/research/audioset/yamnet/ (Accessed: 20 December 2021).

Totakura, V., Janmanchi, M. K., Rajesh, D., and Hussan, M. T. (2020). Prediction of animal vocal emotions using convolutional neural network. International Journal of Scientific & Technology Research, 9(2):6007–6011.

Vasudevan, H., Michalas, A., Shekokar, N., and Narvekar,

M. (2020). Advanced Computing Technologies and

Applications: Proceedings of 2nd International Con-

ference on Advanced Computing Technologies and

Applications—ICACTA 2020. Springer Nature.

Xu, Y., Kong, Q., Wang, W., and Plumbley, M. D. (2018). Large-scale weakly supervised audio classification using gated convolutional neural network. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 121–125. IEEE.

Yamaji, T., Mikami, S., Kobatake, H., Kobayashi, K.,

Tanaka, H., and Tanaka, K. (2018). Does eating fast

cause obesity and metabolic syndrome? Journal of

the American College of Cardiology, 71(11S):A1846–

A1846.

Yang, X., Doulah, A., Farooq, M., Parton, J., McCrory, M. A., Higgins, J. A., and Sazonov, E. (2019). Statistical models for meal-level estimation of mass and energy intake using features derived from video observation and a chewing sensor. Scientific reports, 9(1):1–10.

Zeng, Y., Mao, H., Peng, D., and Yi, Z. (2019). Spectro-

gram based multi-task audio classiﬁcation. Multime-

dia Tools and Applications, 78(3):3705–3722.

Zhang, H., Lopez, G., Shuzo, M., Delaunay, J.-J., and Ya-

mada, I. (2011). Analysis of eating habits using sound

information from a bone-conduction sensor. In Proc.

of the IADIS e-Health Conference (EH 2011).

Zhang, J. and Man, K.-F. (1998). Time series prediction

using rnn in multi-dimension embedding phase space.

In SMC’98 Conference Proceedings. 1998 IEEE In-

ternational Conference on Systems, Man, and Cyber-

netics (Cat. No. 98CH36218), volume 2, pages 1868–

1873. IEEE.

Zhang, S., Zhao, Y., Nguyen, D. T., Xu, R., Sen, S., Hester,

J., and Alshurafa, N. (2020). Necksense: A multi-

sensor necklace for detecting eating activities in free-

living conditions. Proceedings of the ACM on Interac-

tive, Mobile, Wearable and Ubiquitous Technologies,

4(2):1–26.

Zheng, X. and Yan, J. (2019). Acoustic scene classification combining log-mel cnn model and end-to-end model. DCASE2019 Challenge.

BIOSIGNALS 2022 - 15th International Conference on Bio-inspired Systems and Signal Processing