Facial Point Graphs for Amyotrophic Lateral Sclerosis Identification

Nicolas Barbosa Gomes¹, Arissa Yoshida¹, Mateus Roder¹, Guilherme Camargo de Oliveira¹,² and João Paulo Papa¹

¹Department of Computing, São Paulo State University (UNESP), Brazil
²School of Engineering, Royal Melbourne Institute of Technology (RMIT), Australia

Keywords: Neurodegenerative Disease, ALS, Graph Neural Networks, Facial Point Graph.
Abstract: Identifying Amyotrophic Lateral Sclerosis (ALS) in its early stages is essential for establishing the beginning of treatment, improving the outlook, and enhancing the overall well-being of affected individuals. However, early diagnosis and the detection of the disease's signs are not straightforward. A simpler and cheaper alternative arises from analyzing the patient's facial expressions through computational methods. When a patient with ALS engages in specific actions, e.g., opening their mouth, the movement of specific facial muscles differs from that observed in a healthy individual. This paper proposes Facial Point Graphs to learn information from the geometry of facial images to identify ALS automatically. The experimental outcomes on the Toronto NeuroFace dataset show the proposed approach outperformed state-of-the-art results, fostering promising developments in the area.
1 INTRODUCTION
A gradual decline in the structure and functioning
of the central nervous system marks Neurodegenera-
tive Diseases (NDDs). The incidence and prevalence of these diseases increase sharply with age; since life expectancy continues to rise in many parts of the world, the number of cases is projected to grow in the future (Checkoway et al., 2011). Despite the availability of certain treatments that can relieve the physical or mental symptoms linked to NDDs, there is currently no known method to slow down their progression or achieve a complete cure.
Amyotrophic Lateral Sclerosis (ALS) is an NDD
that causes the gradual deterioration of motor func-
tions of the nervous system. Worldwide, the annual
incidence of ALS is about 1.9 per 100,000 inhabi-
tants (Arthur et al., 2016). Although there are doc-
uments, such as El Escorial, published by the World
Federation of Neurology (Brooks et al., 2000), ad-
dressing essential criteria regarding the diagnosis of
ALS, current concepts and definitions of ALS have
not yet been unified or standardized in clinical prac-
tice and are occasionally imprecise, causing difficul-
ties for neurologists in the clinical treatment of ALS.
In addition, patients face a delay in disease diagnosis
by approximately 18 months (Bandini et al., 2018a)
and an average survival of 3 to 5 years after diagno-
sis (Xu and Yuan, 2021). Since effective treatments
are currently unavailable, early and precise diagnosis
is crucial in maintaining patients’ quality of life as it
leads to earlier intervention and recruitment for clini-
cal trials.
Evaluating a person's facial expressions is one effective way to diagnose neurological diseases, since the subject may lose a significant amount of verbal communication ability (Yolcu et al., 2019). Sev-
eral studies have explored the use of computer vision
techniques in analyzing human faces for clinical pur-
poses, further emphasizing its significance (Bevilac-
qua et al., 2011; Bandini et al., 2018a; Jin et al.,
2020; Gomes et al., 2023; Oliveira et al., 2023).
An important point to mention is that many types of
NDDs affect the oro-facial musculature¹ with signif-
icant impairments in speech, swallowing, and oro-
motor skills, as well as emotion expression (Bandini
et al., 2020). Therefore, analyzing a patient’s facial
expression in an image or video can be crucial for di-
agnosing ALS.
The geometry-based characteristics derived from
an individual’s face describe the shape of its compo-
nents, such as the eyes or mouth, which are essential
¹Musculature related to communication and critical to functions such as chewing, swallowing, and breathing.
for facial analysis (Wu and Ji, 2019). Based on these
landmarks, Bandini et al. (Bandini et al., 2018a) pro-
posed an approach that predicts the patient’s healthy
state based on features representing motion, asymme-
try, and face shape through video analysis. Such an
inference was accomplished using well-known ma-
chine learning techniques, i.e., Support Vector Ma-
chines (SVM) (Cortes and Vapnik, 1995) and Logis-
tic Regression (Cox, 1958). Although reasonable re-
sults have been reported, there is still the need to deal
with the limited representation power and biases as-
sociated with handcrafted features. Our work circum-
vents such a shortcoming by introducing Facial Point
Graphs (FPGs) to automatically learn motion infor-
mation from facial expressions. Our model is based on Graph Neural Networks (GNNs): it first constructs a graph with the facial points most relevant to ALS diagnosis and is then trained on these graphs. Later, each frame is classified as positive or negative for the disease, and majority voting assigns the final label to the individual.
To the best of our knowledge, no method employs
Facial Point Graphs for ALS identification. We firmly
believe that the landmarks extracted from frames can
be better encoded in a non-Euclidean space, enabling
the precise definition and representation of their dis-
tinct features. Therefore, the main contributions of
this paper are twofold:
• To introduce Facial Point Graphs to identify ALS.
• To employ a deep learning approach in the same context, thus not requiring handcrafted features.
The remainder of this paper is structured as fol-
lows: Sections 2 and 3 present the literature re-
view and theoretical background, respectively. Section 4 describes the employed dataset, the models used to crop images and extract facial features, the proposed approach, and the classification method. Finally, Section 5 presents the experimental results, and Section 6 discusses the results, conclusions, and future work.
2 RELATED WORKS
Facial expression is a significant part of human nonverbal interaction: it is more effective than words in face-to-face communication (Mehrabian, 1968) and serves as a distinctive, universal means of transmission. Very often, impaired facial expressions manifest as symptomatic indications across numerous medical conditions (Yolcu et al., 2019).
Bandini et al. (Bandini et al., 2018a) introduced
a novel approach for automatically detecting bulbar
ALS. Their method involves analyzing facial move-
ment features extracted from video recordings. The
dataset comprises ten ALS patients (six male and four
female) and eight age-matched healthy control sub-
jects (six male and two female), who were asked to perform specific actions during recordings. Initially,
each individual was recorded at rest (REST) with a
neutral facial expression for 20 seconds. An impor-
tant point to mention is that this task was not used
for analysis but only as a reference for extracting the
geometric characteristics during the tasks.
Next, each participant was asked to perform the
following actions: open their jaw to the maximum ex-
tent, repeated five times (OPEN); lip puckering (as if
kissing a baby) a total of four times (KISS); pretend
to blow out a candle, five times (BLOW); smile with
closed lips, five times (SPREAD); repeat the syllable
/pa/ in a single breath as fast as possible (PA); repeat
the word /pataka/ as quickly as possible (PATAKA);
repeat the sentence “Buy Bobby a puppy” (BBP) ten
times in their usual tone and speaking speed.
Furthermore, the image pre-processing step
was performed using the supervised descent
method (Xiong and De la Torre, 2013), which ex-
tracts corresponding facial landmarks for eyebrows,
eyelids, and nose, as well as the outer and inner lip contours for each frame. Also, a third coordinate
was estimated for these landmarks based on intrinsic
camera parameters. In this regard, feature extraction
was carried out considering the points in the mouth
region, as they demonstrated greater sensitivity to
ALS. Considering aspects of lip movement such as
range and speed of motion, symmetry, and shape,
two different algorithms were used for classification:
SVM and Logistic Regression. Last but not least,
the best classification result was achieved in the BBP
task, with an accuracy of 88.9%.
Xu et al. (Xu et al., 2020) conducted a study
on classifying expressions using facial landmarks.
Their approach used a Graph Convolutional Network
(GCN) (Kipf and Welling, 2016) to classify facial
expressions in images. They employed the Dlib-ml
machine learning algorithm (King, 2009) to estimate
the positions of 64 facial landmarks, which are em-
ployed to construct a graph along with their two-
dimensional coordinates. The training process in-
corporated three different databases: JAFFE (Lyons
et al., 1998), FER2013 (Goodfellow et al., 2013), and
CK+ (Lucey et al., 2010). The classes considered
in this study included Anger, Disgust, Fear, Happi-
ness, Sadness, and Surprise, achieving an accuracy of
95.85%.
3 THEORETICAL BACKGROUND
Graph Neural Networks bring the problem of learning
patterns in a dataset to the graph domain. Formally, a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is defined by a set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}$ between them, also known as the adjacency relation. During the iteration process, each node (re-
ceiver) receives a set of aggregated messages from its
neighbors, applying an aggregation function and an
update function. Each node forwards information to
its neighbors before its features are updated. In the
next iteration, it forwards the new information (mes-
sage) to its neighbors once more, as illustrated in Fig-
ure 1.
For each iteration $k$, a hidden vector $\mathbf{h}^{(k)}_u \in \mathbb{R}^{n}$ incorporates the features of node $u \in \mathcal{V}$, where $n$ stands for the number of input features. It is important to highlight that the hidden vector $\mathbf{h}^{(0)}_u$ encodes the features before training, i.e., at the initial stage. Firstly, a node-order invariant function aggregates features from the neighborhood $N(u)$ of node $u$. Secondly, the aggregated features are used to update the node information, described as follows:
$$\mathbf{h}^{(k+1)}_u = U^{(k)}\!\left(\mathbf{h}^{(k)}_u,\; A^{(k)}\!\left(\{\mathbf{h}^{(k)}_v,\ \forall v \in N(u)\}\right)\right), \qquad (1)$$
where $U^{(k)}$ and $A^{(k)}$ stand for the updating and aggregating functions, respectively. One can use distinct models for these functions, but this paper employs a formulation based on an attention mechanism, described further.
Figure 1: Aggregation of messages in a bidirectional graph: (a) input graph and (b) an example of the GNN working mechanism (for the sake of simplification, the second iteration considers node 'A' only).
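To make the message-passing scheme of Equation (1) concrete, the following minimal NumPy sketch runs one iteration over a toy graph. The mean aggregator, the tanh update, and the matrices W_self and W_agg are illustrative assumptions standing in for the generic functions A and U; they are not the formulation used later in this paper.

```python
import numpy as np

def message_passing_step(h, neighbors, W_self, W_agg):
    """One GNN iteration in the spirit of Eq. (1): aggregate neighbor features
    with a node-order invariant function (here, the mean) and update each
    node's hidden vector. W_self and W_agg are hypothetical trainable
    matrices standing in for the update function U."""
    h_new = np.zeros_like(h)
    for u in range(h.shape[0]):
        # Aggregate: mean of the neighbors' current hidden vectors
        agg = np.mean(h[neighbors[u]], axis=0)
        # Update: combine the node's own features with the aggregate
        h_new[u] = np.tanh(h[u] @ W_self + agg @ W_agg)
    return h_new

# Toy example: 3 nodes with 2 features each on a small triangle graph
h0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
rng = np.random.default_rng(0)
h1 = message_passing_step(h0, neighbors, rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))
print(h1.shape)  # (3, 2)
```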
3.1 Graph Attention Networks
Graph Attention Networks (GATs) are a strategy for
improving the aggregation function. In this network, the aggregation assigns different priorities to the information coming from the neighborhood. The first application of
this concept in a model was described by Veličković et al. (Veličković et al., 2017) and crafted as follows:
$$\mathbf{h}^{(k+1)}_u = \sigma\!\left(\sum_{v \in N(u)} \alpha^{(k)}_{vu}\, \mathbf{W}^{(k)} \mathbf{h}^{(k)}_v\right), \qquad (2)$$
where $\mathbf{W} \in \mathbb{R}^{n' \times n}$ is a trainable parameter known as the weight matrix, and $n'$ and $\sigma$ stand for the number of output features and the sigmoid function, respectively. In addition, $\alpha_{vu} \in \mathbb{R}$ indicates the attention given from $v$ to the node $u$, i.e., the degree of influence $v$ has on updating the features of node $u$. A higher value of $\alpha_{vu}$ implies a stronger impact of $v$ on the feature update process of $u$. Formally, its definition is represented as follows:
$$\alpha^{(k)}_{vu} = \frac{\exp\!\left(\lambda\!\left(\left[\mathbf{a}^{(k)}_u\right]^{T}\left[\mathbf{W}^{(k)}\mathbf{h}^{(k)}_u \,\Vert\, \mathbf{W}^{(k)}\mathbf{h}^{(k)}_v\right]\right)\right)}{\sum_{v' \in N(u)} \exp\!\left(\lambda\!\left(\left[\mathbf{a}^{(k)}_u\right]^{T}\left[\mathbf{W}^{(k)}\mathbf{h}^{(k)}_u \,\Vert\, \mathbf{W}^{(k)}\mathbf{h}^{(k)}_{v'}\right]\right)\right)}, \qquad (3)$$
where $\mathbf{a}_u \in \mathbb{R}^{2n'}$ defines a trainable parameter known as the attention vector. The symbol $\Vert$ denotes the concatenation operator, and $\lambda$ represents the LeakyReLU non-linearity function (with negative input slope $\beta = 0.2$).
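For illustration, the sketch below evaluates Equations (2) and (3) with NumPy on a toy graph. The weight matrix W, attention vector a, and the graph itself are random placeholders; a real GAT layer would learn these parameters by backpropagation.

```python
import numpy as np

def leaky_relu(x, beta=0.2):
    return np.where(x > 0, x, beta * x)

def gat_layer(h, neighbors, W, a):
    """Single GAT update following Eqs. (2)-(3). W (n' x n) and a (2n')
    are hypothetical trainable parameters; sigma is the sigmoid."""
    n_nodes, n_out = h.shape[0], W.shape[0]
    h_new = np.zeros((n_nodes, n_out))
    Wh = h @ W.T                        # project every node: (n_nodes, n')
    for u in range(n_nodes):
        nbrs = neighbors[u]
        # Unnormalized attention logits lambda(a^T [W h_u || W h_v])
        logits = np.array([leaky_relu(a @ np.concatenate([Wh[u], Wh[v]]))
                           for v in nbrs])
        alpha = np.exp(logits) / np.exp(logits).sum()   # softmax over N(u)
        # Attention-weighted aggregation followed by the sigmoid
        h_new[u] = 1.0 / (1.0 + np.exp(-(alpha[:, None] * Wh[nbrs]).sum(axis=0)))
    return h_new

# Toy usage: 3 nodes, 2 input features, 4 output features
rng = np.random.default_rng(1)
h = rng.normal(size=(3, 2))
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
out = gat_layer(h, neighbors, rng.normal(size=(4, 2)), rng.normal(size=(8,)))
print(out.shape)  # (3, 4)
```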
In addition, this particular GNN proved effective at accurately identifying the health state of patients by analyzing the facial landmarks extracted from their expressions during task performance.
4 EXPERIMENTAL
METHODOLOGY
4.1 Dataset
Established by Bandini et al. (Bandini et al., 2020),
Toronto NeuroFace is the first public dataset with
videos of oro-facial gestures performed by individu-
als with oro-facial impairments, including post-stroke
(PS), ALS, and healthy control (HC). The dataset con-
sists of 261 colored (RGB) videos featuring thirty-six
participants: 11 patients with ALS, 14 patients with
PS, and 11 HC. This work focuses on the distinction between ALS and healthy individuals, as the primary interest lies in the former. Consequently, only the subset that exclusively includes the ALS and HC groups was used. Each video captures a participant per-
forming one of the subtasks from a set of speech and
non-speech tasks commonly used during the clinical
oro-facial examination. Following the manual seg-
mentation of the videos, the dataset was partitioned
into 921 videos of repetitions. Table 1 presents the
distribution of the number of repetitions for each sub-
task used in the experiments.
Table 1: Number of repetitions for each subtask.

Subtask   Description                                                                    ALS   HC
SPREAD    Pretending to smile with tight lips                                             55    59
KISS      Pretend to kiss a baby                                                          59    57
OPEN      Maximum opening of the jaw                                                      54    55
BLOW      Pretend to blow a candle                                                        31    39
BBP       Repetitions of the sentence "Buy Bobby a Puppy"                                 95   111
PA        Repetitions of the syllables /pa/ as fast as possible in a single breath       100   110
PATAKA    Repetitions of the syllables /pataka/ as fast as possible in a single breath    88   108
4.2 Pre-Processing
To eliminate visual elements outside the subject's face and ensure consistency in the dataset, the OpenFace 2.0 tool (Baltrusaitis et al., 2018) is employed during the preprocessing stage. This tool first detects the main face, then performs a transformation based on head pose estimation and a crop operation on all frames. The resulting output consists of 200 × 200 grayscale images centered on the facial region, as illustrated in Figure 2.
Figure 2: Illustration of OpenFace 2.0 for: (i) main face
detection, (ii) transformation based on head pose estima-
tion, and (iii) face cropping on an image from the Flickr30k
dataset (Young et al., 2014).
4.3 Feature Extraction
In this work, the Facial Alignment Network
(FAN) (Bulat and Tzimiropoulos, 2017), a deep learn-
ing model, was utilized to identify the frame-by-
frame face geometric structure of each video in the
dataset. As a state-of-the-art approach, FAN employs
heatmap regression to accurately detect facial land-
mark points following the MULTI-PIE 2D 68-point
format (Gross et al., 2010), enabling alignment in
two and three dimensions. Since the dataset contains
videos recorded with the frontal face position, align-
ment was considered in two dimensions only.
Previous studies show that patients with ALS ex-
hibit significant sensitivity in lip and jaw movements
(Langmore and Lehman, 1994; Bandini et al., 2018b).
Therefore, 26 points were selected from the land-
marks extracted by FAN, representing the relevant
regions (Figure 3a). To establish connections be-
tween these landmark nodes, the Delaunay triangula-
tion (Delaunay et al., 1934) was employed, involving
the creation of a triangular mesh by connecting the
specific landmarks (Figure 3b).
To enhance information communication among
graph nodes during the learning process, a strategic
choice is made to use point 31 (according to the 68-
point format) (Gross et al., 2010), corresponding to
the nose tip, as a central node. This key node serves
as a hub, connecting all other nodes independently
of the Delaunay triangulation calculation (Figure 3c).
Lastly, as the final step of the feature extraction pro-
cess, the edge’s weight is set as the Euclidean distance
between its corresponding nodes.
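A rough sketch of this graph-building step is given below, assuming SciPy's Delaunay implementation as a stand-in for the triangulation and random coordinates in place of the FAN landmarks. The list of 26 landmark indices is illustrative only; the exact subset used in this work follows the lip and jaw regions mentioned above.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_facial_point_graph(landmarks, selected_idx, hub_idx=30):
    """Builds the node/edge structure described in Section 4.3.
    landmarks: (68, 2) array of FAN landmarks for one frame.
    selected_idx: 26 landmark indices (illustrative subset here).
    hub_idx: nose tip, point 31 in the 68-point format (0-based index 30)."""
    pts = landmarks[selected_idx]                  # (26, 2) node features (x, y)
    tri = Delaunay(pts)                            # triangular mesh over the points
    edges = set()
    for simplex in tri.simplices:                  # each triangle contributes 3 edges
        for i in range(3):
            a, b = sorted((int(simplex[i]), int(simplex[(i + 1) % 3])))
            edges.add((a, b))
    hub = selected_idx.index(hub_idx)              # connect the nose tip to every node,
    for v in range(len(selected_idx)):             # regardless of the triangulation
        if v != hub:
            edges.add(tuple(sorted((hub, v))))
    # Edge weight: Euclidean distance between the two endpoint landmarks
    weights = {e: float(np.linalg.norm(pts[e[0]] - pts[e[1]])) for e in edges}
    return pts, sorted(edges), weights

# Example with random stand-in landmarks (FAN would supply the real ones)
rng = np.random.default_rng(0)
fake_landmarks = rng.uniform(0, 200, size=(68, 2))
selected = list(range(3, 14)) + [30] + list(range(48, 62))   # 26 illustrative indices
nodes, edges, weights = build_facial_point_graph(fake_landmarks, selected)
print(len(nodes), len(edges))
```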
Figure 3: Representation of the feature extraction process.
4.4 Classification and Evaluation
The classification performance was evaluated using
a leave-one-subject-out cross-validation (LOSO-CV)
approach, following the method proposed by Bandini
et al. (Bandini et al., 2018a). Furthermore, to en-
hance the reliability of predictions in real-world sce-
narios and mitigate issues like overfitting or memoriz-
ing training data, separate sets for training, validation, and testing are employed in each iteration. Regard-
ing the validation sets, two subjects are randomly se-
lected, one categorized as HC and the other as ALS,
ensuring a balanced representation of both classes in
this stage.
The evaluation was conducted in two modes, i.e.,
repetition- and subject-based classification:
4.4.1 Repetition Classification
For each iteration of the LOSO-CV, the repetitions
produced by one participant were treated as individual samples in the test set, while the remaining data was split into validation and training sets.
This approach ensures that every participant, both HC
and those with ALS, and their respective repetitions
were considered in separate test sets. During this
trial, individuals’ speech and non-speech repetitions
were classified as belonging to the HC or ALS group.
Figure 4 illustrates the process mentioned above for a
given individual².
Figure 4: Overview of the repetition-based evaluation step.
4.4.2 Subject Classification
At each iteration of the LOSO-CV, each subject was
treated as a test case and classified as either HC or
ALS. The classification was determined through a
majority vote among its predicted repetitions; in tie
cases, the subject was considered HC to generate a
more conservative prediction according to Bandini
et al. (Bandini et al., 2018a). Figure 5 depicts an
overview of the subject classification process.
Figure 5: Overview of the subject-based evaluation step.
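The subject-level decision rule described above can be summarized by the short sketch below; the label names are placeholders, and ties default to HC as in Bandini et al. (2018a).

```python
from collections import Counter

def classify_subject(repetition_labels):
    """Majority vote over a subject's predicted repetitions ('ALS' or 'HC').
    Ties default to 'HC', the more conservative prediction."""
    counts = Counter(repetition_labels)
    if counts["ALS"] > counts["HC"]:
        return "ALS"
    return "HC"

print(classify_subject(["ALS", "HC", "ALS"]))   # ALS
print(classify_subject(["ALS", "HC"]))          # HC (tie)
```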
In both repetition- and subject-based classifica-
tion, the validation set was used to prevent bias in the
model’s hyperparameters and to facilitate the imple-
mentation of the early stopping technique. The num-
ber of epochs for training was determined by moni-
toring the learning progress on the validation set³.
²This experiment involves tallying the instances of hits and misses for each repetition (per individual) to construct the ultimate classification accuracy. Essentially, the labeling is applied to repetitions rather than individuals.

³The maximum number of epochs is set to 100, the batch size comprises 64 samples, and the learning rates are set to 10⁻⁴ and 10⁻⁵ for the GAT and linear layers, respectively. The number of hidden layers was set to 17. These values were empirically chosen based on the results over the validation set.

Considering Bandini et al. (Bandini et al., 2018a) as the benchmark for our work, the experiments were also performed with two other classification models for comparison purposes: SVM with linear and radial basis function (RBF) kernels, and Logistic Regression. Both models use 11 geometric and kinematic features extracted from speech and non-speech tasks. A grid search was used to find proper values for the SVM parameters, i.e., the regularization value C and the RBF kernel scale parameter γ.
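As a hedged illustration of how such a baseline could be reproduced, the snippet below runs a scikit-learn grid search over C and γ for an RBF SVM. The feature matrix, labels, and parameter ranges are placeholders, not the values used by Bandini et al. (2018a) or in our experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical feature matrix: one row of 11 geometric/kinematic features
# per repetition (the exact features follow Bandini et al., 2018a).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 11))
y = rng.integers(0, 2, size=100)          # 0 = HC, 1 = ALS (toy labels)

# Grid over the regularization value C and the RBF kernel scale gamma;
# the ranges here are illustrative, not the ones used in the paper.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```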
4.5 Proposed Model
Initially, the proposed model uses fifteen frames for
each repetition performed by the patient. Although
the entire video of each repetition contains approxi-
mately thirty frames, the model showed better results
using just half of them at equally spaced intervals.
FPG receives a graph of twenty-six nodes represent-
ing the face landmarks, where each node encodes a
feature vector with the x and y coordinates of its re-
lated landmark. In addition, each graph edge stores its
length determined by the Euclidean distance between
its two corresponding nodes.
Further, each frame proceeds through six GAT and
two linear layers. Before the information is forwarded
to the linear layers, an average pooling is performed
using the nodes, i.e., all information encoded in the
graph is mapped into a single vector. Figure 6 illus-
trates such a process.
The result obtained after pooling goes through
two linear layers, which generate the model’s out-
put. Note that the error is calculated based on the frame label, and the mode among the frames represents the outcome of the repetition experiment.
In other words, classifying an individual’s repetition
is based on the majority consensus among the frames.
Likewise, when classifying the subject, the majority
mode derived from the classifications of each repeti-
tion determines whether the patient has ALS or not.
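A minimal sketch of this architecture, written with PyTorch Geometric, is shown below. The hidden width, activation functions, and the way edge lengths are injected (via GATConv's edge_dim argument) are assumptions for illustration; only the overall layout of six GAT layers, node-wise mean pooling, and two linear layers follows the description above.

```python
import torch
from torch import nn
from torch_geometric.nn import GATConv, global_mean_pool

class FPG(nn.Module):
    """Minimal sketch of the FPG classifier: six GAT layers over the
    26-node facial graph, mean pooling over nodes, and two linear layers.
    Hidden sizes and activations are assumptions, not the paper's exact values."""
    def __init__(self, in_feats=2, hidden=32, n_classes=2):
        super().__init__()
        self.gats = nn.ModuleList(
            [GATConv(in_feats if i == 0 else hidden, hidden, edge_dim=1)
             for i in range(6)]
        )
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, edge_attr, batch):
        # x: (num_nodes, 2) landmark coordinates; edge_attr: Euclidean lengths
        for gat in self.gats:
            x = torch.relu(gat(x, edge_index, edge_attr))
        x = global_mean_pool(x, batch)     # one vector per frame graph
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                 # per-frame logits (HC vs. ALS)

# Toy forward pass on a single 26-node graph
x = torch.rand(26, 2)
edge_index = torch.randint(0, 26, (2, 60))
edge_attr = torch.rand(60, 1)
batch = torch.zeros(26, dtype=torch.long)
print(FPG()(x, edge_index, edge_attr, batch).shape)   # torch.Size([1, 2])
```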
5 RESULTS
The experimental results were obtained for each sub-
task separately. For a thorough assessment of the pro-
posed approach, three evaluation measures were con-
sidered: accuracy, sensitivity, and specificity. Table 2
presents the results for each subtask accordingly.
According to previous studies, the SPREAD sub-
task also appears to be the most discriminative one,
with an accuracy of 80.7% during repetition-based
classification and 81.8% for subject-based classification with our approach. As described
in Section 4, the experiments were conducted by first
splitting the dataset into training and test folds. The
former was partitioned into a smaller training set to
generate a validation fold, whose size was limited by
Figure 6: FPG model.
Table 2: FPG results for each subtask in speech and non-speech data.

Group        Task      Classification   Accuracy   Specificity   Sensitivity
Non-speech   SPREAD    Repetition         80.7%       79.6%         81.8%
                       Subject            81.8%       81.8%         81.8%
             KISS      Repetition         68.1%       80.7%         55.9%
                       Subject            68.1%       81.8%         54.5%
             OPEN      Repetition         77.0%       78.1%         75.9%
                       Subject            81.8%       81.8%         81.8%
             BLOW      Repetition         37.1%       51.2%         19.3%
                       Subject            38.4%       57.1%         16.6%
Speech       BBP       Repetition         49.0%       63.0%         32.6%
                       Subject            50.0%       63.6%         33.3%
             PA        Repetition         64.2%       64.5%         64.0%
                       Subject            57.1%       54.5%         60.0%
             PATAKA    Repetition         67.3%       65.7%         69.3%
                       Subject            66.6%       63.6%         70.0%
the data available for training.
An important point to anticipate is that the re-
sults obtained using SVM and Logistic Regression
may differ significantly from the findings presented
by Bandini et al. (Bandini et al., 2018a), for they
employed a slightly different approach. Although
Toronto Neuroface contains the same speech and non-
speech tasks as those in the study conducted by Ban-
dini et al. (Bandini et al., 2018a), our approach has
several differences. Initially, the participants in the publicly accessible dataset differ from those in the original study in both identity and number. Videos with repetitions were manually cropped
to streamline the content since only the complete
video, including all repetitions, was provided. Unfor-
tunately, access to videos featuring samples from the
REST subtask, essential for normalization in SVM
and Regression models, was not granted. Further-
more, our videos only included color information and
did not incorporate three-dimensional depth features.
Figure 7 compares FPG against the baselines in-
spired in Bandini et al. (Bandini et al., 2018a) work.
One can observe that our model consistently outper-
forms others in the majority of tasks, e.g., SPREAD,
KISS, PA, and PATAKA. However, SVM-RBF stands out as the top-
performing model in the BLOW subtask, which was
the most challenging, as also observed by Bandini et
al. (Bandini et al., 2018a).
6 DISCUSSIONS AND
CONCLUSIONS
To the best of our knowledge, the current study is
the first to evaluate Graph Neural Networks for ALS
identification based on facial expression. As the main
finding, state-of-the-art results were demonstrated in
all subtasks of the Toronto Neuroface dataset except
for one.
The two highest accuracies are observed in
SPREAD and OPEN subtasks, achieving results
above 80%. Similar values for specificity and sensi-
tivity are observable in both subtasks, demonstrating
the model’s robustness in distinguishing ALS patients
from healthy ones.
The high accuracy in the SPREAD task is at-
tributed to the pure lip movement, not involving the
jaw muscles (Bandini et al., 2018a), allowing the de-
tection of the loss of lip muscle extension exhibited by
bulbar ALS patients. Additionally, as shown in pre-
vious studies, the jaw muscles decline in bulbar ALS
patients (Bandini et al., 2018a). Consequently, the ex-
tension of this movement was distinguished with high
accuracy by the model during the OPEN task. OPEN
considers the greatest extent of jaw muscle movement
among all other tasks, justifying the model’s accuracy.
The exchange of information among the graph
nodes during the learning iterations allowed for bet-
ter differentiation of facial points between individuals
with ALS or HC. It is also noteworthy that, except for
PA and PATAKA tasks, the model showed inferior or
equal results in repetition classification compared to subject-based classification, indicating that most repetitions were correctly classified, as the mode of the labeled repetitions resulted in the correct classification of the subject.
One of the major limitations and challenges in
training deep learning models is the limited number of
videos available in the dataset. Deep models typically
require a substantial amount of data to learn effec-
tively. However, FPG showcased exceptional perfor-
mance despite being a deep approach. Remarkably,
it achieved high accuracy without data augmentation during training.

Figure 7: Comparison between FPG and baselines.
Such outcomes highlight the effectiveness of
GNN models, showcasing their inherent structural
characteristics and information propagation capabil-
ities. GNNs demonstrate their ability to capture com-
plex patterns and relationships within the data, even
when dealing with a limited dataset, underscoring
GNNs as a powerful approach in this particular do-
main.
This study did not consider the order of repeti-
tions. Therefore, exploring temporal information in
the Facial Point Graph as a future work would be in-
teresting, particularly the changes observed in facial
movements in the presence of neurodegenerative dis-
eases. Further investigation into the impact of fatigue
found in ALS patients during speech tasks such as PA and PATAKA will be conducted, with the po-
tential aim of improving the model’s performance in
capturing these variations.
The problem addressed by this work is highly complex due to the heterogeneity and limited size of the dataset. Despite that, the proposed approach achieved significant results when compared to sim-
ilar works, introducing the Facial Point Graph for
ALS diagnosis. In addition, the results were achieved
without handcrafted features and with a lightweight
model, enabling the development of affordable sys-
tems capable of supporting clinicians in automatic
ALS diagnosis.
ACKNOWLEDGMENT
The research in this paper uses the Toronto Neuro-
Face Dataset collected by Dr. Yana Yunusova and
the Vocal Tract Visualization and Bulbar Function
Lab teams at UHN-Toronto Rehabilitation Institute
and Sunnybrook Research Institute respectively, fi-
nancially supported by the Michael J. Fox Founda-
tion, NIH-NIDCD, Natural Sciences and Engineer-
ing Research Council, Heart and Stroke Foundation
Canadian Partnership for Stroke Recovery and AGE-
WELL NCE.
The authors would like to thank FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo), process numbers #2022/13156-8 and #2022/16090-8.
REFERENCES
Arthur, K. C., Calvo, A., Price, T. R., Geiger, J. T., Chio, A.,
and Traynor, B. J. (2016). Projected increase in amy-
otrophic lateral sclerosis from 2015 to 2040. Nature
communications, 7(1):12408.
Baltrusaitis, T., Zadeh, A., Lim, Y. C., and Morency, L.-
P. (2018). Openface 2.0: Facial behavior analysis
toolkit. In 2018 13th IEEE international conference
on automatic face & gesture recognition (FG 2018),
pages 59–66. IEEE.
Bandini, A., Green, J. R., Taati, B., Orlandi, S., Zinman,
L., and Yunusova, Y. (2018a). Automatic detection of
amyotrophic lateral sclerosis (als) from video-based
analysis of facial movements: speech and non-speech
tasks. In 2018 13th IEEE International Conference
on Automatic Face & Gesture Recognition (FG 2018),
pages 150–157. IEEE.
Bandini, A., Green, J. R., Wang, J., Campbell, T. F., Zin-
man, L., and Yunusova, Y. (2018b). Kinematic fea-
tures of jaw and lips distinguish symptomatic from
presymptomatic stages of bulbar decline in amy-
otrophic lateral sclerosis. Journal of Speech, Lan-
guage, and Hearing Research, 61(5):1118–1129.
Bandini, A., Rezaei, S., Guarin, D. L., Kulkarni, M., Lim,
D., Boulos, M. I., Zinman, L., Yunusova, Y., and Taati,
B. (2020). A new dataset for facial motion analysis
in individuals with neurological disorders. IEEE jour-
nal of biomedical and health informatics, 25(4):1111–
1119.
Bevilacqua, V., D’Ambruoso, D., Mandolino, G., and
Suma, M. (2011). A new tool to support diagnosis of
neurological disorders by means of facial expressions.
In 2011 IEEE International Symposium on Medical
Measurements and Applications, pages 544–549.
Brooks, B. R., Miller, R. G., Swash, M., Munsat, T. L., and the World Federation of Neurology Research Group on Motor Neuron Diseases (2000). El Escorial revisited: revised criteria for the diagnosis of amyotrophic lateral sclerosis. Amyotrophic Lateral Sclerosis and Other Motor Neuron Disorders, 1(5):293–299.
Bulat, A. and Tzimiropoulos, G. (2017). How far are we
from solving the 2d & 3d face alignment problem?
(and a dataset of 230,000 3d facial landmarks). In
International Conference on Computer Vision.
Checkoway, H., Lundin, J. I., and Kelada, S. N. (2011).
Neurodegenerative diseases. IARC scientific publica-
tions, (163):407–419.
Cortes, C. and Vapnik, V. (1995). Support-vector networks.
Machine Learning, 20:273–297.
Cox, D. R. (1958). The regression analysis of binary se-
quences. Journal of the Royal Statistical Society: Se-
ries B (Methodological), 20(2):215–232.
Delaunay, B. et al. (1934). Sur la sphere vide. Izv. Akad.
Nauk SSSR, Otdelenie Matematicheskii i Estestven-
nyka Nauk, 7(793-800):1–2.
Gomes, N. B., Yoshida, A., de Oliveira, G. C., Roder, M.,
and Papa, J. P. (2023). Facial point graphs for stroke
identification. In Iberoamerican Congress on Pattern
Recognition, pages 685–699. Springer.
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A.,
Mirza, M., Hamner, B., Cukierski, W., Tang, Y.,
Thaler, D., Lee, D.-H., et al. (2013). Challenges in
representation learning: A report on three machine
learning contests. In Neural Information Processing:
20th International Conference, ICONIP 2013, Daegu,
Korea, November 3-7, 2013. Proceedings, Part III 20,
pages 117–124. Springer.
Gross, R., Matthews, I., Cohn, J., Kanade, T., and Baker,
S. (2010). Multi-pie. Image and vision computing,
28(5):807–813.
Jin, B., Qu, Y., Zhang, L., and Gao, Z. (2020). Diagnosing
parkinson disease through facial expression recogni-
tion: video analysis. Journal of medical Internet re-
search, 22(7):e18697.
King, D. E. (2009). Dlib-ml: A machine learning toolkit.
The Journal of Machine Learning Research, 10:1755–
1758.
Kipf, T. N. and Welling, M. (2016). Semi-supervised clas-
sification with graph convolutional networks. arXiv
preprint arXiv:1609.02907.
Langmore, S. E. and Lehman, M. E. (1994). Physiologic
deficits in the orofacial system underlying dysarthria
in amyotrophic lateral sclerosis. Journal of Speech,
Language, and Hearing Research, 37(1):28–37.
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar,
Z., and Matthews, I. (2010). The extended cohn-
kanade dataset (ck+): A complete dataset for action
unit and emotion-specified expression. In 2010 ieee
computer society conference on computer vision and
pattern recognition-workshops, pages 94–101. IEEE.
Lyons, M., Akamatsu, S., Kamachi, M., and Gyoba,
J. (1998). Coding facial expressions with gabor
wavelets. In Proceedings Third IEEE international
conference on automatic face and gesture recognition,
pages 200–205. IEEE.
Mehrabian, A. (1968). Some referents and measures of non-
verbal behavior. Behavior Research Methods & In-
strumentation, 1(6):203–207.
Oliveira, G. C., Ngo, Q. C., Passos, L. A., Papa, J. P.,
Jodas, D. S., and Kumar, D. (2023). Tabular data
augmentation for video-based detection of hypomimia
in parkinson’s disease. Computer Methods and Pro-
grams in Biomedicine, 240:107713.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Wu, Y. and Ji, Q. (2019). Facial landmark detection: A
literature survey. International Journal of Computer
Vision, 127:115–142.
Xiong, X. and De la Torre, F. (2013). Supervised descent
method and its applications to face alignment. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 532–539.
Xu, R.-S. and Yuan, M. (2021). Considerations on the con-
cept, definition, and diagnosis of amyotrophic lateral
sclerosis. Neural Regeneration Research, 16(9):1723.
Xu, X., Ruan, Z., and Yang, L. (2020). Facial expression
recognition based on graph neural network. In 2020
IEEE 5th International Conference on Image, Vision
and Computing (ICIVC), pages 211–214. IEEE.
Yolcu, G., Oztel, I., Kazan, S., Oz, C., Palaniappan, K.,
Lever, T. E., and Bunyak, F. (2019). Facial expres-
sion recognition for monitoring neurological disorders
based on convolutional neural network. Multimedia
Tools and Applications, 78:31581–31603.
Young, P., Lai, A., Hodosh, M., and Hockenmaier, J.
(2014). From image descriptions to visual denota-
tions: New similarity metrics for semantic inference
over event descriptions. Transactions of the Associa-
tion for Computational Linguistics, 2:67–78.