Prediction of Sleep Stages Based on Wearable Signals Using Machine

Learning Techniques

Rodrigo Duarte Braga

, Daniel Osório

1,2

and Hugo Gamboa

1,2

Department of Physics, Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa, Monte da Caparica,

2892-516, Caparica, Portugal

Plux-Wireless Biosignals S.A, Avenida 5 de Outubro 70, 1050-59, Lisboa, Portugal

Keywords: Deep Learning, Machine Learning, Wearable, Photoplethysmography, Sleep Stages, Heart Rate Variation.

Abstract: Sleep’s impact on mood and health is widely recognized by medical researchers with such understanding

disseminating among average people in recent years. The main objective of this work was the development

of machine learning algorithms for automatic sleep cycles detection. The features were selected based on

the AASM manual, which is considered the gold standard for human technicians. For training the models

we used MESA, a database containing 2056 full overnight unattended polysomnographies. With the goal of

developing an algorithm that would only require a photoplethysmography (PPG) device to be able to

accurately predict sleep stages and quality, the main channels used from this dataset were peripheral oxygen

saturation and PPG. Testing the performance of Random forest, Gradient Boosting, Gaussian Naive-bayes,

K-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron classifiers, and using features

extracted from the dataset, we achieved 80.50 % accuracy, 0.7586 Cohen’s kappa, and 77.38% F1-score, for

five sleep stages, using a Multilayer Perceptron. To assess its performance in a real-world scenario we

acquired sleep data and compared the classifications attributed by a popular sleep stage classification

android app and our algorithm, resulting in a strong level of agreement (90.96% agreement, 0.8663 Cohen’s

kappa), for four sleep stages.

1 INTRODUCTION

Sleep’s impact on mood and health is widely

recognized by medical researchers with such

understanding disseminating among average people

in recent years. While newer studies strengthen the

suspected link between inadequate sleep and a wide

range of infirmities (Minkel et al., 2012), the general

population is not very conscious of their sleep

quality. As a result, there is much interest in having

proper means of studying sleep, given its importance

and how difficult it is to accurately diagnose sleep

disorders, considering how individuals are affected

by sleep loss, and their ability to recover from said

sleep loss, varies significantly (Tkachenko and

Dinges, 2017). The discovery of the brain’s

electrical activity was the main contributor

responsible for the development of the field of sleep

medicine in the second half of the 20th century. The

examination of the electroencephalogram (EEG)

patterns that occur during sleep lead to the current

division of the sleep period into different stages, thus

creating the basis of sleep medicine and the study of

human sleep (Worley, 2018). One of the major

discoveries was that sleep is much more restorative

to both waking cognition and health when it occurs

and goes through the appropriate physiological

sequences. This is to say that, due to the way that

sleep is structured into distinct stages, where each

one has a certain set of characteristics and its own

physiological role, the exclusive measurement of the

amount of time slept is not enough for the quality of

sleep to be determined. As such, sleep quality

depends not only on total time slept but on many

other factors such as fragmentation, amount of time

spent in each sleep stage, and how the sleep cycles

are structured.

Currently, polysomnography (PSG) is the most

common technique used to study sleep disorders,

being able to record multiple biosignals

simultaneously (Rundo and Downey, 2019; Karlen

et al., 2009). It has, however, the issue of being

expensive and inconvenient (Kelly et al., 2012), with

the fact that this type of exam is normally performed

Braga, R., Osório, D. and Gamboa, H.

Prediction of Sleep Stages Based on Wearable Signals Using Machine Learning Techniques.

DOI: 10.5220/0011655200003414

In Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2023) - Volume 4: BIOSIGNALS, pages 195-202

ISBN: 978-989-758-631-6; ISSN: 2184-4305

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

195

in a clinic additionally raising the issue of negative

bias, as people may behave differently than normal

when they know they are being monitored.

Furthermore, there is the matter of longitudinal data.

Laboratory PSGs are usually a single-night snapshot,

whereas sleep is a dynamic process that is affected

by the existence and intensity of many other factors

that vary from day to day.

Wearable sleep-trackers, on the other hand, are

low-cost devices capable of measuring biosignals

and, from this data, inferring information about

certain behaviours, like sleep. They present many

advantages over PSG, such as their convenience,

ease of use, affordability and data accessibility, with

the possibility of using some sort of cloud-based

platform for storage of data, thus allowing the

acquisition of an unprecedented amount of

information about sleep and other behaviours or

health parameters.

Because of the costs and time expenditures

associated with PSG, there is much interest in the

development of algorithms that can be deployed in a

wearable device, being able to automatically and

accurately classify sleep stages with a similar degree

of accuracy as the current gold-standard. Ideally,

such an algorithm should also strive to be as simple

as possible (both in terms of signals used and model

complexity).

1.1 State-of-the-Art

As sleep disorders are common in modern society,

with the main difficulty of their treatment being

detection and diagnosis (Pavlova and Latreille,

2019), and with the recent increase in popularity of

using wearable devices in medicine (Akkaş et al.,

2020), some studies have already been developed on

the performance of models for sleep stage prediction

using different biosignals, features or classifiers.

In the literature about the performance of these

kinds of wearables, it was possible to find

information about Fitbit Charge 2, which records

wrist activity through accelerometers and pulses

through photoplethysmography (PPG). In Stucky et

al. (2021), the authors compared this device against

portable home PSG, displaying reasonably accurate

mean values of sleep and heart rate (HR) estimates,

should it follow careful data processing. One other

device is the Heally Recording System which,

through the combination of embedded sensors and

electrodes in a shirt that measures respiratory and

cardiac physiology, monitors sleep based on

autonomic signals. It exhibited accuracy at

approximately 80% agreement with manual scoring,

which is similar to accuracies obtained through

actigraphy, considered an appropriate method for the

assessment of sleep in patients with certain sleep

disorders (McCall and McCall, 2012).

Other studies, relying on ML, have successfully

developed algorithms for sleep stage prediction. For

example, Tsinalis et al. (2020) managed to obtain

sleep stage-specific characteristics with an average

accuracy of 86% based on EEG data, while Yildirim

et al. (2019), developed and applied a 19-layer 1D

convolutional neural network model to EEG and

EOG signals, achieved the highest classification

accuracies for 5 of its 6 sleep classes as over 91%.

More specifically for studies using the same

dataset (that will be described in the next section)

that was chosen for this work, we have Kudo et al.

(2022) that, using PPGs and accelerometers’

information extracted from the public datasets Apple

watch Sleep (Walch et al., 2019), and Multi-Ethnic

Study of Atherosclerosis (MESA) (Zhang et al.,

2018; Chen et al., 2015), achieved a macro F1 score

of 0.655 and Cohen’s kappa score of 0.527, using a

recurrent neural network. Another similar study,

published by Sridhar et al. (2020), using the ECG

signal of both the Sleep Heart Health Study (Quan et

al., 1997) and the MESA dataset for training,

validation and testing of the developed algorithm,

obtained an 47 overall performance of 77% accuracy

and 0.66 Cohen’s kappa against the reference stages

on a held-out portion of the datasets used for

training.

All these studies suggest that the development of

similar fully automatic recognition systems could

serve as a suitable replacement for manual

inspection of PSG signals, particularly for large-

scale studies.

2 DATASET DESCRIPTION

Initially, the algorithm was trained through the use

of a publicly available online database, selected

from others such as the NCH Sleep DataBank (Lee

et al., 2021), or the Sleep Heart Health Study. After

a comparison between several of these databases,

one was selected based on its size, sensor quality

and quantity, detail of the scoring, and how recently

collected was the data.

The set of PSG recordings used for this work

was obtained from MESA. This dataset included a

sleep exam with 2237 participants, consisting of full

overnight unattended polysomnographies that were

conducted between 2010-2012, and had the

following demographics described in Table 1.

BIOSIGNALS 2023 - 16th International Conference on Bio-inspired Systems and Signal Processing

196

Table 1: Dataset demographics of the MESA database

(adapted from (NSRR team, 2022)).

Characteristics Value

Number of PSGs 2056

Number of Patients 2237

Age (Years)

Mean 69.6

Median 69.0

Standard deviation ± 9.2

Minimum 54.0

Maximum 95.0

Gender

Female 1198

Male 1039

Race/ethnicity

White, Caucasian 830

Chinese American 265

Black, African-American 616

Hispanic 526

The information pertaining to the PSG studies in

the MESA dataset are contained in two separate file

formats for each study. The EDF files store 27

biosignal channels, from which we used only three,

HR information, the PPG recording, and oxygen

saturation. With the exception of the PPG signal that

was sampled at 256 Hz, the channels were sampled

at 1 Hz. On the other hand, the XML files contain

annotations corresponding to the PSG recordings,

such as sleep stages and their duration.

For the real-world validation of the developed

models, 14 nights of sleep were acquired using a

biosignalsPlux device on the posterior side of the

left wrist (Plux Wireless Biosignals, 2022) that

recorded both PPG and accelerometer data, and a

smartwatch (Ticwatch E2) on the right wrist.

3 METHODOLOGY

In order to build the code developed in this work to

analyse and process the dataset, as well as build the

ML models, Python was used through the code

editor Spyder. Several different libraries were used,

including BeautifulSoup4, Pandas, NumPy, scikit-

learn, Tensorflow, and hrvanalysis. After the

development of the algorithms, to test them in a real-

world scenario, they were used to classify the sleep

stages of the acquired 14 nights of sleep. These

classifications were then compared to the results

obtained from “Sleep as Android“ (Chaudhry,

2017), which is one of the most reviewed android

sleep analysis smartphone applications, using the

measurements taken using the smartwatch.

In the next Sub-Sections, the algorithms used to

perform the classification of the data, the metrics

upon which they are evaluated, and both which

features and how they were extracted are described.

3.1 Data Pre-Processing and Feature

Extraction

The first step of the extraction of the data from the

PPG records was the standardization of the signal,

achieved through the subtraction of the signal’s

mean followed by its division by its standard

deviation. After that, the signal was segmented in

short windows (half second interval) and the mean

of each of these intervals was subtracted to minimize

baseline drift. Subsequently a 4th order Chebyshev

II bandpass filter (sampling frequency of 256 Hz and

cut-off frequencies of 0.05 and 30 Hz) was used.

At this stage we segmented the signal according

to the sleep stage annotations and began extracting

features. These features, a total of 30, range from the

maximum, mean and minimum values of oxygen

saturation and HR, in this case also including its

standard deviation, to features related to heart rate

variation (HRV). The resulting analysis of HRV is

grouped under time-domain and frequency-domain.

In the time-domain, 12 features were used, such

as root mean square of successive differences

between N-N intervals (RMSSD), standard deviation

of these differences (SDSD), number of pairs of

successive N-N intervals that differ by more than 50

ms and 20 ms (NN50 and NN20), total proportion of

NN50 and NN20 in relation to the total number of N-

N intervals, standard deviation of all N-N intervals

(calculated over each 30 second interval), mean and

median of the N-N intervals (Mean_nni and

Median_nni), coefficient of variation (SDNN divided

by Mean_nni), coefficient of variation of successive

differences (RMSSD divided by Mean_nni) and,

finally, the difference between the longest and

shortest N-N interval.

As for the frequency-domain, seven features

were used, including total power spectral density

(Golgouneh and Tarvirdizadeh, 2020), power in the

very low (vlf), low (lf), and high (hf) frequency

bands (Salahuddin et al., 2007), normalised lf and hf

power, and the ratio between these two powers.

Two additional features related to the PPG

signal’s entropy (more specifically fuzzy (Chen et

al., 2007) and dispersion entropy (Rostaghi and

Prediction of Sleep Stages Based on Wearable Signals Using Machine Learning Techniques

197

Azami, 2016)) were extracted after averaging its

value in windows of 32 samples, to minimize time

spent for this step and the information loss resulting

from the averaging.

Finally, despite only classifying sleep in 30

second intervals, the two preceding stage

classifications were also used as features, so as to

take into account the continuity of sleep.

3.2 Classification Models

To classify the sleep stages, we used both machine

learning models (such as Random Forest, Gradient

Boosting, Gaussian Naïve-Bayes, K-Nearest

Neighbours, and Support Vector Machine) and

artificial neural networks (Multilayer Perceptrons).

The choice of these algorithms was based on

both literature reviews done for other sleep stage

classification studies, and trial and error.

3.2.1 Random Forest

Random Forest is an ensemble learning method that

constructs and uses numerous decision trees. Due to

random variable selection and bootstrap aggregation

leading to lower correlation across trees, the

ensemble prediction is generally more accurate than

any of its decision trees individual predictions.

3.2.2 Gradient Boosting

Gradient Boosting is an ensemble learning method

of weak prediction models, usually decision trees.

With careful tuning of its parameters, it may result

in better performance than Random Forest models.

3.2.3 Gaussian Naive-Bayes

Gaussian Naive Bayes classifiers are based on

applying Bayes’ theorem with a strong

independence assumption to classify the data.

3.2.4 K-Nearest Neighbours

K-Nearest Neighbours (KNN) classifiers utilise

proximity to make predictions. For classification

problems, a class label is assigned to a data element

based on the vote of the K number of its nearest

neighbours. It is possible to construct a weighted

version using the distance between data points.

3.2.5 Support Vector Machine

Support-Vector Machine (SVM) algorithms attribute

classifications by finding a hyperplane in an N-

dimensional space that is able to separately classify

the data points.

3.2.6 Multilayer Perceptron

Multilayer Perceptrons are a fully connected class of

feedforward artificial NNs, consisting of at least

three layers of nodes. With the exception of the

nodes in the input layer, each node is a neuron that

uses a nonlinear activation function.

3.3 Model’s Hyperparameters

The characteristics of machine learning algorithms

are strongly tied to their hyperparameters, with their

optimization and tuning being pivotal to a model’s

performance (Feurer and Hutter, 2019). For the non-

neural network models, the chosen method to tune

these hyperparameters was grid search, which is a

tuning technique that computes their optimum

values through an exhaustive search in a manually

introduced subset of values. Scikit-learn library’s

implementation of this function was used, with the

hyperparameters’ values being presented in table 2.

Table 2: Values for the different parameters to be

optimized when utilizing scikit-learn’s grid search.

Parameters Values

Random

Forest

n_estimators

10-100, 100-1000 (increasing

y 10 and 100, respectively)

Criterion gini, entropy

max_depth

None, 10-100 (increasing by

)

Gradient

Boosting

n_estimators

10-100, 100-1000 (increasing

y 10 and 100, respectively)

Criterion

friedman_mse, squared_error,

mse

max_depth

1,3,5, 10-100(increasing by

)

KNN

n_neighbours 1, 3, 5, 7, 9, 11

Weights uniform, distance

Metric manhattan, Euclidean

SVM

0.0001, 0.01, 0.05, 0.1, 0.5,

1.0, 5, 10

Kernel linear, poly, rbf

Gamma

scale, 0.0001, 0.01, 0.05, 0.1,

0.5, 1.0, 5, 10

For the neural network models, their

hyperparameters were chosen to be tuned through

trial and error due to their increased complexity.

BIOSIGNALS 2023 - 16th International Conference on Bio-inspired Systems and Signal Processing

198

3.4 Model Evaluation

After training the algorithms, it is necessary to

evaluate their performance. To do this, the MESA

dataset was first split into testing and training sets,

so as to permit an assessment and minimization of

the impact of the model’s overfitting to the data,

which would otherwise lead to an imprecise

estimation of the model’s capabilities. Additionally,

as sleep stages’ distribution is naturally unbalanced

(Worley, 2018), to promote a more even learning

process, these sets were balanced.

Finally, some metrics were calculated to evaluate

their performance. These metrics were accuracy,

Cohen’s kappa and macro average F1-score.

4 RESULTS

For the non-neural network models, after optimizing

their hyperparameters through grid search, the

evaluated metrics for the best performing models of

each type obtained are presented in Table 3.

Table 3: Values of the chosen metrics for the highest

performance non-neural network models of each type.

Accurac

(

)

Cohen's ka

Random Forest 79.30 0.7412

Gradient Boosting 82.34 0.7792

Gaussian Naive-

Bayes

68.41 0.6052

KNN 21.99 0.0249

SVM 25.03 0.0628

As can be observed in Table 3, the best

performing models are Random Forest and Gradient

Boosting, with this last model presenting an overall

more balanced performance for all of its

classifications when compared to the other models

and presenting the highest accuracy and Cohen’s

kappa for the balanced test dataset.

For the neural network models, the first step of

tuning its architecture was selecting the number of

layers and neurons per layer. Accordingly, starting

by the hidden layer number, it was discovered that

models with three layers are optimal (Figure 1).

Figure 1: Model accuracy per number of layers.

Following this, the optimal neuron count per

layer for this 3-layered model was found (Figure 2).

Figure 2: Model accuracy per number of neurons in each

layer.

Finally, to reduce overfitting the influence of

several regularization methods such as L1, L2 and

dropout were tested, with only L2 regularization

having a positive influence in the performance of the

developed models (Figure 3).

Figure 3: Correlation between maximum model accuracy

and L2 regularisation rate.

Prediction of Sleep Stages Based on Wearable Signals Using Machine Learning Techniques

199

In this figure, it is possible to observe that the

best result is achieved when a L2 regularisation rate

of 0.4 is used, with this model presenting an

accuracy of 80.50%, Cohen’s kappa of 0.7563 and

F1 score of 77.38 % (in the unbalanced test set). An

example of the resulting classification can be seen in

figure 4, where the orange lines are the true sleep

stages and the blue lines are the model’s predictions,

with the blue lines disappearing when they match.

Figure 4: Sleep stages for each interval of a randomly

chosen MESA file, classified by the developed algorithm.

Using this final model to classify the sleep night

data that was recorded during this work, an

agreement of 90.96%, Cohen’s kappa of 0.8663, and

macro average F1-score of 90.52% was achieved

when compared to the classifications attributed by

the Android app, with these results being displayed

in figure 5.

Figure 5: Normalized confusion matrix of the results

obtained from the classification of real-world data.

5 DISCUSSION

First it is important to notice that, usually, models

are stochastically trained, meaning that two models

with the matching architecture being trained with

identical data in the same manner, might perform

differently after training, which may further

complicate the study and understanding of the

training process. To solve this issue several identical

models with the same characteristics were

developed, at which point their average performance

was evaluated, and then compared with the average

performance of other models with different

architectures.

As mentioned previously, some commonly used

regularisation methods, such as L1, L2, or dropout,

were tested. In the case of the latter, despite usually

being described as improving model performance

(Baldi and Sadowski, 2014; Srivastava et al., 2014),

it failed to do so in this case, instead leading to a

decrease in performance (even only 5% dropout

lowers average accuracy to 41.84%). This decrease

seems tied to dropout probability, where the higher

the probability, the worse the performance is, until a

plateau is reached at approximately 37.35%

accuracy. The addition of L1 regularisation also

seems to be detrimental to model development, with

the higher the rate, the worse its impact on the

model’s accuracy. On the other hand, L2

regularisation seems to improve the effectiveness of

the models, and, while we found a regularisation rate

of 0.4 to be optimal, there seems to be a wide range

of values (from 0.1 to 2) where the model still

benefits from its addition.

With this said, for the NNs, the best performing

model presented 80.50% accuracy, 0.7563 Cohen’s

kappa on the balanced test set, and a macro average

F1-Score of 77.38% on the complete, unbalanced

test set. After an extensive search for the optimal

configuration of hyperparameters, we found that the

model consistently performed better in a 3 hidden

layer, 32 neurons per layer, structure, with all hidden

layers having a L2 regularisation rate of 0.4. Overall,

we found that performance tends to be highest for

models with 3 or 7 layers, with it dropping sharply

outside these limits. Similarly for neuron count,

accuracy dropped to around 20% for any number of

neurons per layer outside of the interval between 16

and 64, whereas it seems mostly stable at around

80% accuracy and optimal at 32 neurons per layer.

The results obtained are promising as, while

some models are able to achieve higher accuracy

(Tsinalis et al., 2016; Yildirim et al., 2019), they do

so while using more signals (usually EEG, EOG, or

ECG), which significantly restricts their usability for

everyday applications. Conversely, we reached

better performance than many other models,

including recently published studies that make use of

BIOSIGNALS 2023 - 16th International Conference on Bio-inspired Systems and Signal Processing

200

more signals or features (Sun et al., 2020), or

employ the same dataset (Kudo et al., 2022; Sridhar

et al., 2020).

For the classification of the real-world data that

was acquired, a neural network was chosen over the

other Gradient Boosting model, as even though its

performance on the balanced test dataset was

slightly inferior to the best performing non-neural

network model, its performance on stage 2

classification (one of the most common stages for

naturally-occurring unbalanced sleep) is

substantially improved, which leads to this model

being superior for real-world stage classifications

(0.7586 Cohen’s kappa in the complete, unbalanced

test dataset, in contrast to 0.6967) without being as

deleterious to lowest class accuracy (52.95%

compared to 56.32% accuracy). Additionally, the

increase in misclassifications by this model tends to

be between physiologically similar stages (such as

between stage 1 and stage 2, which are both usually

considered light sleep, for example), which lowers

the importance of such errors. This neural network

being the most accurate algorithm developed is in

line with the current state-of-the-art, as the model’s

increased complexity theoretically allows it to more

accurately classify the different sleep stages.

After this selection, the device’s data was scored

by our algorithm, and then compared with the

classifications by the Android application, at which

point a strong level of agreement (McHugh, 2012)

(90.96% accuracy, 0.8663 Cohen’s kappa and a

macro average F1-Score of 90.52%) was observed.

6 CONCLUSION

This work’s main objective was the development of

a ML algorithm that detects and classifies sleep

cycles. For this end, both NN and non-NN models

were developed.

The performance achieved for the final NN

model was higher than many other studies, despite

generally using a lesser amount of features or signals

and the same or similar datasets.

Another goal of this work was to test the

developed model’s performance in a real-world

scenario. To achieve this, we simultaneously

recorded 14 nights of sleep using a biosignalsPlux

device with PPG and accelerometer sensors and a

widely used Android sleep scoring application

paired with a commercially available wearable

device. After comparing the resulting classifications

we obtained a strong level of agreement. This leads

us to believe in the potential of the developed

algorithm to be used in real-world scenarios.

While the main goals of this work were fulfilled,

it still presents some limitations that could be

improved, namely in terms of feature acquisition and

extraction.

Future studies should attempt to integrate these

algorithms into devices. This way, not only is it

possible to increase the similarity between the

devices and algorithms being compared, but it

should also be easier to acquire a larger amount of

data, ideally, from a larger set of individuals as well.

The recording of more data itself would also

likely lead to improvements in the determination of

the real-world performance of the models, besides

the potential use of this data for model training. In

this regard, the recording and comparison of results

with a PSG study would be optimal.

Additionally, during feature extraction, we chose

to reduce the number and quality of the entropies

used as features, due to time and computation

constraints. As, even after this, these were some of

the most relevant features, the extraction and use of

them without averaging the signal beforehand could

lead to some performance improvements.

Finally, throughout this work several models

were created, some of them having similar levels of

accuracy and other selected metrics to the final

model developed. Due to this, a complementary

study that could be done is the creation of another

ensemble model that utilises the output of these

models as inputs, as these types of models tend to

have a better performance than the sum of their parts

(Zhang and Ma, 2012).

REFERENCES

Akkaş, M. A., Sokullu, R., & Ertürk Çetin, H. (2020).

Healthcare and patient monitoring using IoT. Internet

of Things, 11, 100173.

Baldi, P., & Sadowski, P. (2014). The Dropout Learning

Algorithm. Artificial Intelligence, 210(1), 78–122.

Chaudhry, B. M. (2017). Sleeping with an Android.

MHealth, 3, 7–7.

Chen, W., Wang, Z., Xie, H., & Yu, W. (2007).

Characterization of surface EMG signal based on

fuzzy entropy. IEEE Transactions on Neural Systems

and Rehabilitation Engineering, 15(2), 266–272.

Chen, X., Wang, R., Zee, P., Lutsey, P. L., Javaheri, S.,

Alcántara, C., Jackson, C. L., Williams, M. A., &

Redline, S. (2015). Racial/ethnic differences in sleep

disturbances: The Multi-Ethnic Study of

Atherosclerosis (MESA). Sleep, 38(6), 877–888.

Feurer, M., & Hutter, F. (2019). Hyperparameter

Optimization. In: Hutter, F., Kotthoff, L., Vanschoren,

Prediction of Sleep Stages Based on Wearable Signals Using Machine Learning Techniques

201

J. (eds) Automated Machine Learning. The Springer

Series on Challenges in Machine Learning. Springer,

Cham. 3–33.

Fonseca, P., Weysen, T., Goelema, M. S., Møst, E. I. S.,

Radha, M., Lunsingh Scheurleer, C., van den Heuvel,

L., & Aarts, R. M. (2017). Validation of

Photoplethysmography-Based Sleep Staging

Compared With Polysomnography in Healthy Middle-

Aged Adults. Sleep, 40(7).

Golgouneh, A., & Tarvirdizadeh, B. (2020). Fabrication of

a portable device for stress monitoring using wearable

sensors and soft computing algorithms. Neural

Computing and Applications, 32(11), 7515–7537.

K. Pavlova, M., & Latreille, V. (2019). Sleep Disorders.

American Journal of Medicine, 132(3), 292–299.

Kelly, J. M., Strecker, R. E., & Bianchi, M. T. (2012).

Recent Developments in Home Sleep-Monitoring

Devices. ISRN Neurology, 2012, 1–10.

Kudo, S., Chen, Z., Ono, N., Altaf-Ul-Amin, M. D.,

Kanaya, S., & Huang, M. (2022). Deep Learning-

Based Sleep Staging with Acceleration and Heart Rate

Data of a Consumer Wearable Device. LifeTech 2022 -

2022 IEEE 4th Glob. Conf. Life Sci. Tech., 305–307.

Lee, H., Li, B., DeForte, S., Splaingard, M., Huang, Y.,

Chi, Y., & Lin, S. (2021). NCH Sleep DataBank: A

Large Collection of Real-world Pediatric Sleep

Studies.

McCall, C., & McCall, W. V. (2012). Comparison of

actigraphy with polysomnography and sleep logs in

depressed insomniacs. Journal of Sleep Research.,

21(1), 122–127.

McHugh, M. L. (2012). Interrater reliability: the kappa

statistic. Biochemia Medica, 22(3), 276.

Minkel, J. D., Banks, S., Htaik, O., Moreta, M. C., Jones,

C. W., McGlinchey, E. L., Simpson, N. S., & Dinges,

D. F. (2012). Sleep deprivation and stressors:

Evidence for elevated negative affect in response to

mild stressors when sleep deprived. Emotion, 12(5),

1015–1020.

NSRR team, (2022). Administrative - MESA Variables -

Sleep Data - National Sleep Research Resource -

NSRR. (n.d.). Retrieved June 24, 2022, from

https://sleepdata.org/datasets/mesa/variables?folder=A

dministrative

PLUX Biosignals | Professional Kit. (n.d.). Retrieved

August 5, 2022, from https://www.pluxbiosignals.com

/collections/biosignalsplux/products/professional-kit

Quan, S. F., Howard, B. V., Iber, C., Kiley, J. P., Nieto, F.

J., O’Connor, G. T., Rapoport, D. M., Redline, S.,

Robbins, J., Samet, J. M., & Wahl, P. W. (1997). The

Sleep Heart Health Study: Design, rationale, and

methods. Sleep, 20(12), 1077–1085.

Rostaghi, M., & Azami, H. (2016). Dispersion Entropy: A

Measure for Time-Series Analysis. IEEE Signal

Processing Letters, 23(5), 610–614.

Rundo, J. V., & Downey, R. (2019). Polysomnography.

Handbook of Clinical Neurology, 160, 381–392.

Salahuddin, L., Cho, J., Jeong, M. G., & Kim, D. (2007).

Ultra short term analysis of heart rate variability for

monitoring mental stress in mobile settings. Annual

International Conference of the IEEE Engineering in

Medicine and Biology Society. IEEE Engineering in

Medicine and Biology Society. Annual International

Conference, 2007, 4656–4659.

Sridhar, N., Shoeb, A., Stephens, P., Kharbouch, A.,

Shimol, D. Ben, Burkart, J., Ghoreyshi, A., & Myers,

L. (2020). Deep learning for automated sleep staging

using instantaneous heart rate. Npj Digit. Med., 3(1).

Srivastava, N., Hinton, G., Krizhevsky, A., &

Salakhutdinov, R. (2014). Dropout: A Simple Way to

Prevent Neural Networks from Overfitting. Journal of

Machine Learning Research, 15, 1929–1958.

Stucky, B., Clark, I., Azza, Y., Karlen, W., Achermann,

P., Kleim, B., & Landolt, H. P. (2021). Validation of

fitbit charge 2 sleep and heart rate estimates against

polysomnographic measures in shift workers:

naturalistic study. Journal of Medical Internet

Research, 23(10), 1–20.

Sun, H., Ganglberger, W., Panneerselvam, E., Leone, M.

J., Quadri, S. A., Goparaju, B., Tesh, R. A., Akeju, O.,

Thomas, R. J., & Westover, M. B. (2020). Sleep

staging from electrocardiography and respiration with

deep learning. Sleep, 43(7).

Tkachenko, O., & Dinges, D. F. (2018). Interindividual

variability in neurobehavioral response to sleep loss: A

comprehensive review. Neuro. Biobe. Rev., 89, 29–48.

Tsinalis, O., Matthews, P. M., & Guo, Y. (2016).

Automatic Sleep Stage Scoring Using Time-

Frequency Analysis and Stacked Sparse Autoencoders.

Annals of Biomedical Engineering, 44(5), 1587–1597.

Walch, O., Huang, Y., Forger, D., & Goldstein, C. (2019).

Sleep stage prediction with raw acceleration and

photoplethysmography heart rate data derived from a

consumer wearable device. Sleep, 42(12).

Worley, S. L. (2018). The extraordinary importance of

sleep: The detrimental effects of inadequate sleep on

health and public safety drive an explosion of sleep

research. P and T, 43(12), 758–763.

Yildirim, O., Baloglu, U. B., & Acharya, U. R. (2019). A

deep learning model for automated sleep stages

classification using PSG signals. International Journal

of Environmental Research and Public Health, 16(4).

Zhang, C., & Ma, Y. (2012).

Ensemble Machine Learning:

Methods and Applications. Springer Publishing

Company, Incorporated.

Zhang, G. Q., Cui, L., Mueller, R., Tao, S., Kim, M.,

Rueschman, M., Mariani, S., Mobley, D., & Redline,

S. (2018). The National Sleep Research Resource:

Towards a sleep data commons. Journal of the

American Medical Informatics Association, 25(10),

Zhao, X., & Sun, G. (2021). A Multi-Class Automatic

Sleep Staging Method Based on

Photoplethysmography Signals. Entropy, 23(1), 1–12.

BIOSIGNALS 2023 - 16th International Conference on Bio-inspired Systems and Signal Processing

202