An Attention-based Architecture for EEG Classification
Italo Zoppis (1), Alessio Zanga (1), Sara Manzoni (1), Giulia Cisotto (2,3), Angela Morreale (4), Fabio Stella (1) and Giancarlo Mauri (1)

(1) Department of Computer Science, University of Milano-Bicocca, Milano, Italy
(2) Department of Information Engineering, University of Padova, Italy
(3) Integrative Brain Imaging Center, National Center of Neurology and Psychiatry, Tokyo, Japan
(4) Behavioral Neurology, Montecatone Rehabilitation Institute, Imola, Italy
Keywords:
Attentional Mechanism, Graph Attention Network, Brain Network, EEG.
Abstract:
Emerging studies in the deep learning community focus on techniques aimed at identifying which parts of a graph are most suitable for making better decisions and best contribute to an accurate inference. This line of research (i.e., "attentional mechanisms" for graphs) can be applied effectively in all those situations in which it is not trivial to capture the dependencies between the involved entities while discarding useless information. This is the case, e.g., of functional connectivity in the human brain, where rapid physiological changes, artifacts and high inter-subject variability usually require highly trained clinical expertise. In order to evaluate the effectiveness of the attentional mechanism in such a critical situation, we consider the task of normal vs. abnormal EEG classification using a brain network representation of the corresponding recorded EEG signals.
1 INTRODUCTION
Networks today not only pervade the social and technological aspects of our lives but are also considered fundamental tools for studying many natural phenomena and conceptual problems. In particular, we have recently witnessed a significant growth of neuroscience studies that use networks as a new paradigm to better understand cognition (Varela et al., 2001), brain cell organization (Rubinov and Sporns, 2010), and functional connectivity (Towlson et al., 2013; Shih et al., 2015; van den Heuvel et al., 2012). Moreover, recent advances in deep learning approaches have provided the opportunity to dig into the understanding of brain diseases and to develop effective neuro-markers for diagnosis and prognosis (Durstewitz et al., 2019; Corchs et al., 2019). Similarly, there have been several attempts in the literature to extend deep learning techniques to deal with network data. Some initial work in this context used recursive networks to process structured data such as directed acyclic graphs (Frasconi et al., 1998; Sperduti and Starita, 1997).
More recently, Graph Neural Networks (GNNs) have
been introduced (Gori et al., 2005; Scarselli et al.,
2008) as a generalization of recursive networks ca-
pable of handling more general classes of graphs.
Despite the excellent performance and robustness of deep learning for network data, current induction has to deal with large multivariate and noisy data sets, thus posing critical issues for effective mining and inference. This is the case of EEG signals, in which rapid physiological changes, artifacts and high inter-subject variability require highly trained (human) clinical expertise. In this regard, emerging research on deep architectures focuses on how to bring out the relevant parts of a network to provide better decisions (Veličković et al., 2017) and knowledge representation. Technically, this approach is known as the "attentional mechanism". Introduced into the deep learning community in order to access important parts of the data (Bahdanau et al., 2014), the attention mechanism has recently proved successful in a series of tasks (Lee et al., 2018).
The key ideas of our study are that: 1) interactions between brain regions can be used to extract useful features in order to classify anomalies, and 2) the features of pairs of brain regions are related to each other. Using a correlation matrix, we are able to express the strength of the interaction between pairs of electrodes, which can be directly mapped to a graph representation: each node is an electrode and each edge
is added if the correlation is strong enough. This is motivated by the spatial positioning of the electrodes and by the biological mechanisms that actually engage more than one brain region at a time during everyday tasks. Furthermore, by construction, an edge is added to the graph if and only if it is a valid representation of the interaction between a pair of nodes (i.e., a pair of brain regions), so for each node the attention is performed on a well-structured and physiologically motivated neighborhood, as sketched below.
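As an illustration, the following is a minimal Python sketch of this construction, assuming a window of multichannel EEG of shape (n_samples, n_channels); the threshold value is illustrative, as the paper does not fix one.

import numpy as np
from scipy.stats import spearmanr

def correlation_graph(window, threshold=0.5):
    # Spearman correlation between every pair of channels (columns).
    corr, _ = spearmanr(window, axis=0)
    # Keep an edge only where the interaction is strong enough.
    adjacency = (np.abs(corr) >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # no self-loops
    return adjacency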
In this paper, building on this line of research, we investigate the performance of the graph attentional mechanism in providing case/control (i.e., abnormal vs. normal EEG) classification of functional brain networks obtained from recorded EEG signals.
In Sec. 2, we highlight some critical issues that could affect the inference when blindly applying brain networks as a tool for the analysis of functional connectivity. In Sec. 3, we give the main definitions and concepts. In Sec. 4, we describe the Graph Attention Network (GAT) mechanism to which we apply our inference problem. In Sec. 4.1, we conveniently adapt and extend this mechanism for EEG signal classification. In Sec. 5, we describe the experimental setting. We conclude the paper by reporting and discussing the results in Sec. 6 and Sec. 7.
2 EEG SIGNALS: CRITICAL
ASPECTS FOR NETWORK
BASED INFERENCE
Although the network representation of brain signals
has had an evident impact on the scientific commu-
nity, it cannot be uncritically applied to inference and
data mining. In fact, to perform a pertinent analysis
and properly extract brain functional network proper-
ties it is important to know the neural phenomenon
under study.
Different pathologies, such as stroke, are usually associated with lesions in different brain regions. This can hinder accurate inference, as the location and shape of these lesions can largely differ from individual to individual. Clearly, this has an impact on the definition of the network, on its nodes, and even on the correspondences that these elements find across different subjects.
Moreover, because of rapid physiological changes, artifacts, and high inter-subject variability, EEG data are non-stationary multivariate time series that are difficult to summarize with broad network statistics, and the corresponding inductive tasks could generalize poorly, or even be unable to capture specific extreme situations. This is the case, for example, of epileptic seizures, where abnormal neuronal activities lead to convulsions and/or mild loss of awareness. In such cases, most seizures (e.g., temporal lobe epilepsy) begin as focal and generalize rapidly within several seconds. If we were interested in identifying epileptic foci during a generalized attack, "graph-based" inference should therefore be applied carefully in order to provide a proper identification.
In order to address, at least in part, some of the issues described above for brain-network-based inference, in the following sections we evaluate an attentional (graph-based) architecture for selecting the relevant network topology, discarding useless information, and, at the same time, capturing the temporal functional dependence of the recorded EEG traces.
3 MAIN CONCEPTS AND
DEFINITIONS
From a theoretical perspective, networks can be modeled through graphs, i.e., abstract objects representing collections of "entities", $V$ (vertices or nodes), and the relationships between them, i.e., the edges, $E$. In this paper, we use attributed graphs, $G = (V, E)$, where each vertex $v \in V$ is labeled with a vector of attribute values. Moreover, given a vertex $v \in V$, we denote by $\mathcal{N}(v) = \{u : \{v, u\} \in E\}$ the neighborhood of the vertex $v$.
In order to summarize the relationships between vertices and capture the relevant information in a graph, embedding (i.e., the transformation of objects into lower-dimensional spaces) is typically applied (Goyal and Ferrara, 2018). This approach allows the use of a rich set of analytical methods, offering deep models the capability of providing different levels of representation. Embedding can be performed at the node level, at the graph level, or through different mathematical strategies, and it is typically realized by fitting the (deep) network's parameters using standard gradient-based optimization. In particular, the following definitions are useful (Lee et al., 2018).
Definition 3.1. Given a graph $G = (V, E)$ with $V$ as the set of vertices and $E$ the set of edges, the objective of node embedding is to learn a function $f : V \to \mathbb{R}^k$ such that each vertex $i \in V$ is mapped to a $k$-dimensional vector $\vec{h}$.

Definition 3.2. Given a set of graphs, $\mathcal{G}$, the objective of graph embedding is to learn a function $f : \mathcal{G} \to \mathbb{R}^k$ that maps an input graph $G \in \mathcal{G}$ to a low-dimensional embedding vector $\vec{h}$.
Figure 1: System Architecture. (c) The adjacency matrix is computed for each window; (b) from the adjacency matrix to the GAT (for each window); (a) the LSTM processes the sequence of GAT embedded vectors.
4 GAT MODELS
In this paper, we apply the attention-based node embedding recently proposed in (Veličković et al., 2017), introducing a stacked architecture for case/control classification of recorded EEG traces. For a general, yet formal, definition of the notion of "attention", we conveniently adapt the one reported in (Lee et al., 2018).
Definition 4.1. An attentional mechanism is a function $a : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ which computes coefficients $e_{i,j} = a\big(\vec{h}_i^{(l)}, \vec{h}_j^{(l)}\big)$ across pairs of vertices $i, j$, based on their feature representations $\vec{h}_i^{(l)}, \vec{h}_j^{(l)}$ at level $l$.
The coefficients $e_{i,j}$ can be interpreted as the relevance of vertex $j$'s features to vertex $i$. Following (Veličković et al., 2017), let $a$ be a single-layer feed-forward neural network parametrized by a weight vector $\vec{a}$ with a nonlinear LeakyReLU activation. In this case we have
$$e_{i,j}^{(l)} = \mathrm{LeakyReLU}\Big(\vec{a}^{(l)T} \big[\, W^{(l)} \vec{h}_i^{(l)} \,\|\, W^{(l)} \vec{h}_j^{(l)} \,\big]\Big),$$
where $W^{(l)}$ is a learnable parameter matrix and $W^{(l)} \vec{h}_i^{(l)} \| W^{(l)} \vec{h}_j^{(l)}$ is the concatenation of the embedded representations of the vertices $i, j$. The coefficients $e_{i,j}$ are generally normalized using, e.g., a softmax function,

$$\alpha_{i,j}^{(l)} = \frac{\exp\big(e_{i,j}^{(l)}\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(e_{i,k}^{(l)}\big)}.$$
Notice that the mechanism's parameters $\vec{a}$ are trained jointly with the other network parameters by standard optimization. Finally, the normalized (attention) coefficients $\alpha_{i,j}$ are applied to compute a linear combination of the features "around" $i$ (i.e., the features of the vertices in $\mathcal{N}(i)$). In this way, the next-level feature vector for $i$ is obtained, i.e.,
$$\vec{h}_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{i,j}^{(l)} W^{(l)} \vec{h}_j^{(l)}\Big),$$
where $\sigma$ is a nonlinear vector-valued function (in our case, the sigmoid). In this way, the embeddings from the neighbors are aggregated together, scaled by the attention scores.
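To make the above concrete, here is a minimal NumPy sketch of a single-head GAT layer implementing these equations; the masking constant and the nested loops are simplifications for readability, not the implementation used in our experiments.

import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    # H: (N, F) node features; A: (N, N) binary adjacency;
    # W: (F, Fp) weight matrix; a: (2 * Fp,) attention vector.
    Z = H @ W                                    # W h_i for every vertex
    N = Z.shape[0]
    e = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # e_ij = LeakyReLU(a^T [W h_i || W h_j])
            e[i, j] = leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
    # restrict attention to the neighborhood N(i), then softmax-normalize
    e = np.where(A > 0, e, -1e9)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # h_i^(l+1) = sigmoid(sum_{j in N(i)} alpha_ij W h_j)
    return 1.0 / (1.0 + np.exp(-(alpha @ Z)))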
4.1 A Stacked GAT-LSTM for EEG Traces

Long Short-Term Memory (LSTM) networks have successfully contributed to modeling temporal sequences with long-lag time dependencies. Furthermore, thanks to their forget gates, LSTMs are able to filter out irrelevant data from their "memory" (Gers et al., 1999). On the basis of these arguments, here we apply a stacked LSTM layer built on top of the level reported above. In this way, we try to capture both the relevant topology of the corresponding network and the temporal dependency of the responses, while discarding ineffective data from the LSTM's "memory". The LSTM layer is composed of 32 units. The resulting architecture is reported in Fig. 1a.
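As a hedged sketch, the temporal part of such a stack could be wired in Keras as follows, assuming each recording has already been turned into a sequence of per-window GAT embeddings (e.g., by the layer sketched above); the sequence length and embedding size are illustrative assumptions, only the 32 LSTM units come from our setting.

from tensorflow import keras

N_WINDOWS, EMBED_DIM = 30, 64      # illustrative values (assumptions)

model = keras.Sequential([
    keras.layers.Input(shape=(N_WINDOWS, EMBED_DIM)),
    keras.layers.LSTM(32),                          # 32-unit LSTM layer
    keras.layers.Dense(1, activation="sigmoid"),    # normal vs. abnormal
])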
5 EXPERIMENTAL SETTING
In order to capture temporal information from the recorded EEG traces, we apply a sliding window approach (see the sketch after this list). Specifically:

- The whole multivariate data set is framed into different overlapping windows in the temporal domain, and each corresponding sub-sampled cross-section series is used to obtain a cross-correlation matrix built from the Spearman correlation values between every pair of recorded channels (Fig. 1c). In this way, each window is associated with a graph adjacency matrix using a threshold-based approach.
- For each graph, a GAT network is obtained as reported in Sec. 4 (Fig. 1b).
- The GAT embedding from the j-th GAT network (which characterizes the j-th graph's embedded representation) is aligned with the outputs of the other GATs to obtain a sequence of GAT embedded vectors. This sequence is processed as input by the stacked LSTM layer (Fig. 1a).
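A minimal sketch of the windowing step, under assumed window-length and overlap values (the paper does not report them); each window can then be mapped to an adjacency matrix as in the correlation_graph sketch of Sec. 1.

import numpy as np

def sliding_windows(signal, win_len, step):
    # signal: (n_samples, n_channels) multivariate EEG recording
    starts = range(0, signal.shape[0] - win_len + 1, step)
    return [signal[s:s + win_len] for s in starts]

# e.g., 1-second windows with 50% overlap at an assumed 250 Hz rate:
# windows = sliding_windows(eeg, win_len=250, step=125)
# graphs = [correlation_graph(w) for w in windows]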
The input set of node features $\vec{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$ is composed of feature vectors $\vec{h}_i \in \mathbb{R}^F$, with $F$ the number of features for each node. In this paper, a single feature vector is made of five features and is calculated by extracting the average power of five well-established frequency bands, namely delta (0.5-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), beta (12-30 Hz) and gamma (30-100 Hz), in the corresponding window.
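One plausible way to extract these five band powers per channel is via Welch's method; the sampling rate and Welch parameters below are assumptions for the example, not the paper's exact pipeline.

import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 100)}

def band_power_features(window, fs=250):
    # window: (n_samples, n_channels); returns (n_channels, 5) band powers
    freqs, psd = welch(window, fs=fs, nperseg=min(256, window.shape[0]), axis=0)
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[mask].mean(axis=0))   # average power in the band
    return np.stack(feats, axis=1)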
5.1 Dataset
The dataset used in this paper is the "TUH Abnormal EEG Corpus", a large corpus of data derived from the EEG Data Corpus of Temple University Hospital of Philadelphia, Pennsylvania (Obeid and Picone, 2016). This dataset was previously used in other publications (Lopez et al., 2015; Schirrmeister et al., 2017; Özal Yıldırım et al., 2017). It contains up to 2993 EDF files, divided into 1472 abnormal EEGs and 1521 normal EEGs, for a total of approximately 1142 hours of recording. For each record there is a plain-text report of the session describing the patient: clinical history, medications, first impression of the EEG record, and clinical correlations. Each EEG record contains 22 channels in a 10/20 configuration.
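For reference, a record can be loaded and band-pass filtered (as done for the experiments of Tab. 2, 0.1 to 47 Hz) along these lines; MNE is used here purely as an assumed illustration, while the paper's actual preprocessing relies on PyEEGLab (Zanga, 2019), and "record.edf" is a hypothetical file path.

import mne

raw = mne.io.read_raw_edf("record.edf", preload=True)
raw.filter(0.1, 47.0)            # band-pass used for the Tab. 2 experiments
eeg = raw.get_data().T           # (n_samples, n_channels) array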
6 RESULTS
The objective of our experiments was to evaluate the accuracy of the attention-based architecture in classifying the normal and abnormal signals of the data reported in Sec. 5.1.
As a reference for our comparisons, we used Convolutional Neural Networks (CNNs). In order to design homogeneous comparisons, the CNNs are equipped with dense (feed-forward) layers that (similarly to the architecture based on "attention") allow us to obtain, for each window, an embedded vector, which in turn represents the corresponding graph. The embedding sequence can then be processed as input by the stacked LSTM.
It is worth noting that in our experiments we also evaluated a "CNN + Dense" architecture. In this case, a CNN supplies the graph embedding for every window. The sequence of all the embeddings is then passed as input to the dense layer.
For each neural architecture, the number of epochs is fixed to 100 and the loss function is the cross entropy. The selected optimizer is Adam, with a learning rate of $10^{-5}$. To obtain a more robust error estimation, we applied a standard 10-fold cross-validation for each classifier. The resulting performances are averaged over the number of folds.
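A hedged sketch of this protocol follows; build_model stands in for any of the compared stacks (here the GAT-LSTM tail of Sec. 4.1), and the arrays X (sequences of window embeddings) and y (binary labels) are assumed to be precomputed.

import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def build_model(n_windows, embed_dim):
    # placeholder for any of the compared architectures (cf. Sec. 4.1)
    return keras.Sequential([
        keras.layers.Input(shape=(n_windows, embed_dim)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

def cross_validate(X, y, n_splits=10, epochs=100):
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
        model = build_model(X.shape[1], X.shape[2])
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
                      loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))   # performance averaged over the folds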
The results are reported in Tab. 1 and Tab. 2. The results in Tab. 2 share the same experimental settings as those in Tab. 1, with the only difference that, here, the signals were previously band-pass filtered (from 0.1 to 47 Hz). In both cases, the GAT-based architecture outperforms the CNN-based architectures.
The architecture described in this paper was implemented in Python using the Keras library (Chollet et al., 2015) and the Spektral library (Grattarola, 2019). The dataset preprocessing library is PyEEGLab (Zanga, 2019). Numerical evaluations were executed on Ubuntu 18.04.2 LTS; Processor: AMD® Threadripper™ 1900X CPU @ 3.89 GHz, 4.20 GHz, 8 cores, 16 logical processors; GPU: NVIDIA® GeForce RTX™ 2070 with 8 GB GDDR6; installed physical memory (RAM): 32.00 GB ECC.
Table 1: Classification Performances [%].

Architecture    Accuracy  Sensitivity  Specificity  Precision  F1 Score
CNNs + Dense    67.89     68.67        67.11        67.76      68.21
CNNs + LSTM     68.56     67.26        70.23        74.34      70.63
GATs + LSTM     81.27     77.27        86.99        89.47      82.93

Table 2: Classification Performances [%] with band-pass filtered signals.

Architecture    Accuracy  Sensitivity  Specificity  Precision  F1 Score
CNNs + Dense    69.90     71.23        68.63        68.42      69.80
CNNs + LSTM     69.97     72.59        67.86        64.47      68.29
GATs + LSTM     76.92     77.85        76.00        76.32      77.08

7 CONCLUSIONS

EEG-based brain networks are rather complex, yet promising, tools, which typically require highly trained knowledge of the underlying neurophysiological processes to provide accurate inference and modeling. It turns out that the development of methods to properly measure brain functional connectivity at different time steps is fundamental for classification and, more generally, for induction.
The work presented here has focused on the recent formulation of the "attentional mechanism" for graphs (Veličković et al., 2017). In particular, we introduced a stacked GAT-LSTM architecture aimed at classifying abnormal vs. normal EEG signals. The proposed architecture intends to benefit, on the one side, from the potential capability of the LSTM to model long-lag time dependencies while discarding information and, on the other, from the ability to exploit the "attentional mechanism" to capture the most task-relevant information from the brain network's complex dynamics.
Although the reported results are encouraging for this purpose, outperforming a typical CNN application, a larger dataset has to be investigated to further support the impact of the newly proposed GAT-based approach for physiological signals. This, in turn, reflects the need to focus on specific pathologies, as highlighted in this paper. Our research will follow this target by specializing the analysis to clinically oriented studies for more complete modeling and interpretation. Further experiments will be performed to describe more extensively the effects of applying a band-pass filter.
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural ma-
chine translation by jointly learning to align and trans-
late. arXiv preprint arXiv:1409.0473.
Chollet, F. et al. (2015). Keras. https://keras.io.
Corchs, S., Chioma, G., Dondi, R., Gasparini, F., Manzoni, S., Markowska-Kaczmar, U., Mauri, G., Zoppis, I., and Morreale, A. (2019). Computational methods for resting-state eeg of patients with disorders of consciousness. Frontiers in neuroscience, 13.
Durstewitz, D., Koppe, G., and Meyer-Lindenberg, A.
(2019). Deep neural networks in psychiatry. Molecu-
lar psychiatry, page 1.
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general
framework for adaptive processing of data structures.
IEEE transactions on Neural Networks, 9(5):768–
786.
Gers, F. A., Schmidhuber, J., and Cummins, F. (1999).
Learning to forget: Continual prediction with lstm.
Gori, M., Monfardini, G., and Scarselli, F. (2005). A new
model for learning in graph domains. In Proceedings.
2005 IEEE International Joint Conference on Neural
Networks, 2005., volume 2, pages 729–734. IEEE.
Goyal, P. and Ferrara, E. (2018). Graph embedding tech-
niques, applications, and performance: A survey.
Knowledge-Based Systems, 151:78–94.
Grattarola, D. (2019). Spektral. https://github.com/danielegrattarola/spektral.
Lee, J. B., Rossi, R. A., Kim, S., Ahmed, N. K., and Koh, E.
(2018). Attention models in graphs: A survey. arXiv
preprint arXiv:1807.07984.
Lopez, S., Suarez, G., Jungreis, D., Obeid, I., and Picone,
J. (2015). Automated identification of abnormal adult
eegs.
Obeid, I. and Picone, J. (2016). The temple university
hospital eeg data corpus. Frontiers in neuroscience,
10:196.
Rubinov, M. and Sporns, O. (2010). Complex network mea-
sures of brain connectivity: uses and interpretations.
Neuroimage, 52(3):1059–1069.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M.,
and Monfardini, G. (2008). The graph neural net-
work model. IEEE Transactions on Neural Networks,
20(1):61–80.
Schirrmeister, R. T., Gemein, L., Eggensperger, K., Hutter,
F., and Ball, T. (2017). Deep learning with convolu-
tional neural networks for decoding and visualization
of eeg pathology.
Shih, C.-T., Sporns, O., Yuan, S.-L., Su, T.-S., Lin, Y.-J.,
Chuang, C.-C., Wang, T.-Y., Lo, C.-C., Greenspan,
R. J., and Chiang, A.-S. (2015). Connectomics-based
analysis of information flow in the drosophila brain.
Current Biology, 25(10):1249–1258.
Sperduti, A. and Starita, A. (1997). Supervised neural net-
works for the classification of structures. IEEE Trans-
actions on Neural Networks, 8(3):714–735.
Towlson, E. K., Vértes, P. E., Ahnert, S. E., Schafer, W. R., and Bullmore, E. T. (2013). The rich club of the C. elegans neuronal connectome. Journal of Neuroscience, 33(15):6380–6387.
van den Heuvel, M. P., Kahn, R. S., Goñi, J., and Sporns, O. (2012). High-cost, high-capacity backbone for global brain communication. Proceedings of the National Academy of Sciences, 109(28):11372–11377.
Varela, F., Lachaux, J.-P., Rodriguez, E., and Martinerie,
J. (2001). The brainweb: phase synchronization and
large-scale integration. Nature reviews neuroscience,
2(4):229.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Özal Yıldırım, Baloglu, U. B., and Acharya, U. R. (2017). A deep convolutional neural network model for automated identification of abnormal eeg signals.
Zanga, A. (2019). Pyeeglab: a simple tool for eeg manipu-
lation. https://github.com/AlessioZanga/PyEEGLab.