MAGNET: Multi-Label Text Classification using Attention-based Graph
Neural Network
Ankit Pal, Muru Selvakumar and Malaikannan Sankarasubbu
Saama AI Research, Chennai, India
Keywords:
Multi-label Text Classification, Graph Neural Networks, Attention Networks, Deep Learning, Natural
Language Processing, Supervised Learning.
Abstract:
In Multi-Label Text Classification (MLTC), one sample can belong to more than one class. It is observed that in most MLTC tasks there are dependencies or correlations among labels, yet existing methods tend to ignore these relationships. In this paper, a graph attention network-based model is proposed to capture the attentive dependency structure among the labels. The graph attention network uses a feature matrix and a correlation matrix to capture and explore the crucial dependencies between the labels and to generate classifiers for the task. The generated classifiers are applied to sentence feature vectors obtained from the text feature extraction network (BiLSTM) to enable end-to-end training. Attention allows the system to assign different weights to neighbor nodes per label, thus allowing it to learn the dependencies among labels implicitly. The results of the proposed model are validated on five real-world MLTC datasets. The proposed model achieves similar or better performance compared to the previous state-of-the-art models.
1 INTRODUCTION
Multi-Label Text Classification (MLTC) is the task of assigning one or more labels to each input sample in the corpus. This makes it both a challenging and essential task in Natural Language Processing (NLP). We have a set of labelled training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{D}$ are the input features with dimension $D$ for each data instance and $y_i \in \{0, 1\}^{L}$ are the target vectors, with $L$ the number of labels. The vector $y_i$ has a one in the $j$th coordinate if the $i$th data point belongs to the $j$th class. We need to learn a mapping (prediction rule) between the features and the labels, such that we can predict the class label vector $y$ of a new data point $x$ correctly.
MLTC has many real-world applications, such as
text categorization (Schapire and Singer, 2000), tag
recommendation (Katakis et al., 2008), information
retrieval (Gopal and Yang, 2010), and so on. Before deep learning, solutions to the MLTC task relied on traditional machine learning algorithms. Different techniques have been proposed in the literature for treating multi-label classification problems. In some of them, multiple single-label classifiers are combined to emulate the MLTC problem. Other techniques modify single-label classifiers by changing their algorithms to allow their use in multi-label problems.
The most popular traditional method for solving MLTC is Binary Relevance (BR) (Zhang et al., 2018). BR transforms the MLTC task into multiple independent binary classification problems. However, it ignores the correlations or dependencies among labels (Luaces et al., 2012). Binary Relevance has stimulated research on approaches to capture and explore label correlations in various ways. Some methods, including Deep Neural Network (DNN)-based and probabilistic models, have been introduced to model dependencies among labels, such as hierarchical text classification (Sun and Lim, 2001), (Xue et al., 2008), (Gopal et al., 2012) and (Peng et al., 2019). Recently, Graph-based
Neural Networks (Wu et al., 2019) e.g. Graph Convo-
lution Network (Kipf and Welling, 2016), Graph At-
tention Networks (Velickovic et al., 2018) and Graph
Embeddings (Cai et al., 2017) have received consid-
erable research attention. This is due to the fact that
many real-world problems in complex systems, such
as recommendation systems (Ying et al., 2018), so-
cial networks and biological networks (Fout et al.,
2017) etc, can be modelled as machine learning tasks
over large networks. Graph Convolutional Network
(GCN) was proposed to deal with graph structures.
The GCN benefits from the advantages of the Convolutional Neural Network (CNN) architecture: it per-
forms predictions with high accuracy at a relatively low computational cost by utilizing fewer parameters compared to a fully connected multi-layer perceptron (MLP) model. It can also capture essential sentence features that determine node properties by analyzing relations between neighboring nodes. Despite the advantages mentioned above, we suspect that the GCN is still missing an essential structural feature to better capture correlations or dependencies between nodes.
One possible way to improve GCN performance is to add adaptive attention weights, which depend on the feature matrix, to the graph convolutions.
To capture the correlation between the labels bet-
ter, we propose a novel deep learning architecture
based on graph attention networks. The proposed
model with graph attention allows us to capture the
dependency structure among labels for MLTC tasks.
As a result, the correlation between labels can be au-
tomatically learned based on the feature matrix. We
propose to learn inter-dependent sentence classifiers
from prior label representations (e.g. word embed-
dings) via an attention-based function. We name the
proposed method Multi-label Text classification us-
ing Attention based Graph Neural NETwork (MAG-
NET). It uses a multi-head attention mechanism to
extract the correlation between labels for the MLTC
task. Specifically, the contributions are as follows:
The drawbacks of current models for the MLTC
task are analyzed.
A novel end-to-end trainable deep network is pro-
posed for MLTC. The model employs Graph At-
tention Network (GAT) to find the correlation be-
tween labels.
We show that the proposed method achieves similar or better performance compared to previous state-of-the-art (SoTA) models across two MLTC metrics and five MLTC datasets.
2 RELATED WORK
The MLTC task can be modeled as finding an optimal label sequence $y$ that maximizes the conditional probability $p(y \mid x)$, which is calculated as follows:

$p(y \mid x) = \prod_{i=1}^{n} p(y_i \mid y_1, y_2, \ldots, y_{i-1}, x)$ (1)
There are mainly three types of methods to solve
the MLTC task:
Problem transformation methods
Algorithm adaptation methods
Neural network models
2.1 Problem Transformation Methods
Problem transformation methods transform the multi-
label classification problem either into one or more
single-label classification or regression problems.
The most popular problem transformation method is Binary Relevance (BR) (Boutell et al., 2004). BR learns a separate classifier for each label and combines the results of all classifiers into a multi-label prediction, ignoring the correlations between labels. Label Powerset (LP) treats a multi-label problem as a multi-class problem by training a multi-class classifier on all unique combinations of labels in the dataset. Classifier Chains (CC) transform the multi-label text classification problem into a Bayesian conditioned chain of binary text classification problems. However, problem transformation methods require a lot of time and space when the dataset and label set are large.
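As an illustration of the problem transformation family, the following minimal sketch implements binary relevance with scikit-learn; the TF-IDF features, the logistic-regression base classifier, and the toy data are illustrative assumptions rather than the setup evaluated in this paper.

```python
# Hypothetical sketch of Binary Relevance (BR): one independent binary
# classifier per label, with label correlations deliberately ignored.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

texts = ["oil prices rise sharply", "new GPU architecture announced"]  # toy corpus
labels = [[1, 0, 1], [0, 1, 0]]            # binary indicator matrix (samples x labels)

features = TfidfVectorizer().fit_transform(texts)

# OneVsRestClassifier fits one LogisticRegression per label column.
br = OneVsRestClassifier(LogisticRegression()).fit(features, labels)
predictions = br.predict(features)         # 0/1 prediction per label
```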
2.2 Algorithm Adaptation Methods
Algorithm adaptation, on the other hand, adapts the algorithms to handle multi-label data directly, instead of transforming the data. Clare and King (2001) construct a decision tree by modifying the C4.5 algorithm (Quinlan, 1993) and develop resampling strategies. Elisseeff and Weston (2002) propose Rank-SVM by amending a Support Vector Machine (SVM). Zhang and Zhou (2007) propose a multi-label lazy learning approach (ML-KNN); ML-KNN exploits correlations between different labels by adapting the traditional K-nearest neighbor (KNN) algorithm. However, algorithm adaptation methods are limited to utilizing only first- or second-order label correlations.
2.3 Neural Network Models
In recent years, various neural network-based models have been used for the MLTC task. For example, Yang et al. (2016) propose Hierarchical Attention Networks (HAN), which use a GRU gating mechanism with hierarchical attention for document classification. Zhang and Zhou (2006) propose a framework called Backpropagation for Multilabel Learning (BP-MLL) that learns ranking errors in neural networks via backpropagation. However, these types of neural networks do not perform well on high-dimensional and large-scale data.
Many CNN-based models, such as RCNN (Lai et al., 2015), the ensemble method of CNN and RNN by Chen et al. (2017), XML-CNN (Liu et al., 2017), CNN
(Kim, 2014a) and TEXTCNN (Kim, 2014a) have
been proposed to solve the MLTC task. However,
they neglect the correlations between labels.
To utilize the relations between labels, several hierarchical text classification models have been proposed. The transfer learning idea proposed by Xiao et al. (2011) uses a hierarchical Support Vector Machine (SVM); (Gopal et al., 2012) and (Gopal and Yang, 2015) use hierarchical and graphical dependencies between class labels; and (Peng et al., 2018) utilize graph operations on the graph of words. However, these methods are limited because they consider only pair-wise relations due to computational constraints.
Recently, the BERT language model (Devlin et al., 2019b) has achieved state-of-the-art performance on many NLP tasks.
3 MAGNET ARCHITECTURE
3.1 Graph Representation of Labels
A graph $G$ consists of a feature description $M \in \mathbb{R}^{n \times d}$ and the corresponding adjacency matrix $A \in \mathbb{R}^{n \times n}$, where $n$ denotes the number of labels and $d$ denotes the number of dimensions.
The GAT network takes as inputs the node features and the adjacency matrix that represents the graph data. The adjacency matrix is usually constructed from the samples; in our case, we do not have a graph dataset. Instead, we learn the adjacency matrix, hoping that the model will determine the graph and thereby learn the correlations among the labels.
Our intuition is that by modeling the correlation
among labels as a weighted graph, we force the GAT
network to learn such that the adjacency matrix and
the attention weights together represent the correla-
tion. We use three methods to initialize the weights of
the adjacency matrix. Section 3.5 explains the initial-
ization methods in detail.
In the context of our model, the embedding vectors of the labels act as the node features, and the adjacency matrix is a learnable parameter.
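A minimal TensorFlow sketch of this setup, under the assumption that label embeddings are available as node features: the matrix M is fixed, while the adjacency matrix A is declared as a trainable variable so the model can adjust the label graph during training (the shapes and initializer are illustrative; Section 3.5 discusses the initialization choices).

```python
import tensorflow as tf

num_labels = 90    # e.g. Reuters-21578
embed_dim = 768    # BERT label-embedding size

# Node features: one embedding vector per label. Random stand-ins here for
# the BERT label embeddings described in Section 4.2.
M = tf.random.normal([num_labels, embed_dim])

# Learnable adjacency matrix: together with the attention weights it is
# expected to encode the correlations among labels.
A = tf.Variable(
    tf.keras.initializers.GlorotUniform()([num_labels, num_labels]),
    name="label_adjacency",
)
```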
3.2 Node Updating Mechanism in
Graph Convolution
In a Graph Convolution Network, nodes can be updated by different types of node updating mechanisms. The basic version of GCN updates each node $i$ of the $\ell$-th layer, $H^{(\ell+1)}_{i}$, as follows:

$H^{(\ell+1)} = \sigma\left(A H^{(\ell)} W^{(\ell)}\right)$ (2)
where $\sigma(\cdot)$ denotes an activation function, $A$ is the adjacency matrix, and $W^{(\ell)}$ is the convolution weight matrix of the $\ell$-th layer. We represent each node of the graph structure as a label; at each layer, a label's features are aggregated from its neighbors to form the label features of the next layer. In this way, features become increasingly more abstract at each consecutive layer. For example, suppose label 2 has three adjacent labels 1, 3 and 4. In this case, another way to write equation (2) is

$H^{(\ell+1)}_{2} = \sigma\left(H^{(\ell)}_{2} W^{(\ell)} + H^{(\ell)}_{1} W^{(\ell)} + H^{(\ell)}_{3} W^{(\ell)} + H^{(\ell)}_{4} W^{(\ell)}\right)$ (3)
So, in this case, the graph convolution network sums up all label features with the same convolution weights, and the result is then passed through an activation function to produce the updated node feature output.
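The basic GCN update of equation (2) is just two matrix multiplications followed by a nonlinearity; a NumPy sketch with made-up shapes (ReLU standing in for the generic activation σ):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN node update, eq. (2): H_next = sigma(A @ H @ W).

    A: (n, n) adjacency matrix over the n labels
    H: (n, d_in) label features of the current layer
    W: (d_in, d_out) convolution weights shared by all nodes
    """
    return np.maximum(A @ H @ W, 0.0)   # ReLU as the activation

# Toy example: 4 labels, 8-dim input features, 6-dim output features.
A = np.eye(4) + 0.1 * np.random.rand(4, 4)
H = np.random.randn(4, 8)
W = np.random.randn(8, 6)
H_next = gcn_layer(A, H, W)             # shape (4, 6)
```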
3.3 Graph Attention Networks for
Multi-Label Classification
In GCNs, the neighborhoods of nodes are combined with equal or pre-defined weights. However, the influence of neighbors can vary greatly, and an attention mechanism can identify label importance in the correlation graph by considering the importance of neighboring labels.
The node updating mechanism, equation (3), can
be written as a linear combination of neighboring la-
bels with attention coefficients.
$H^{(\ell+1)}_{2} = \mathrm{ReLU}\left(\alpha^{(\ell)}_{22} H^{(\ell)}_{2} W^{(\ell)} + \alpha^{(\ell)}_{21} H^{(\ell)}_{1} W^{(\ell)} + \alpha^{(\ell)}_{23} H^{(\ell)}_{3} W^{(\ell)} + \alpha^{(\ell)}_{24} H^{(\ell)}_{4} W^{(\ell)}\right)$ (4)
where $\alpha^{(\ell)}_{ij}$ is an attention coefficient which measures the importance of the $j$th node in updating the $i$th node of the $\ell$-th hidden layer. The basic expression of the attention coefficient can be written as

$\alpha^{(\ell)}_{ij} = f\left(H^{(\ell)}_{i} W^{(\ell)},\ H^{(\ell)}_{j} W^{(\ell)}\right)$ (5)
The attention coefficient can typically be obtained by i) a similarity-based approach, ii) concatenating features, or iii) coupling all features. We evaluate the attention coefficient by concatenating features:

$\alpha_{ij} = \mathrm{ReLU}\left(\left(H_{i} W\right) \,\Vert\, \left(H_{j} W\right)^{T}\right)$ (6)
Figure 1: Illustration of the overall structure of the MAGNET model with a single graph attention layer for multi-label text classification. $(x^{(n)}, y^{(n)}),\ n = 1, 2, \ldots, N$ is the input to the BiLSTM, which generates the feature vectors; $x^{(n)}$ is encoded using BERT embeddings. The input to the graph attention network is the adjacency matrix $A \in \mathbb{R}^{n \times n}$ and the label vectors $M \in \mathbb{R}^{n \times d}$. The GAT output is the attended label features, which are applied to the feature vectors obtained from the BiLSTM.

In our experiments, we use multi-head attention (Vaswani et al., 2017), which utilizes $K$ different heads to describe the relationships between labels. The operations of the layer are independently replicated $K$ times (each replication with different parameters), and the outputs are aggregated feature-wise (typically by concatenating or adding):
$H^{(\ell+1)}_{i} = \mathrm{Tanh}\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N(i)} \alpha^{(\ell)}_{ij,k} H^{(\ell)}_{j} W^{(\ell)}\right)$ (7)
where $\alpha_{ij}$ is the attention coefficient of label $j$ to label $i$, and $N(i)$ represents the neighborhood of label $i$ in the graph. We use a cascade of GAT layers. For the first layer, the input is the label embedding matrix $M \in \mathbb{R}^{n \times d}$:
$H^{(1)}_{i} = \mathrm{Tanh}\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N(i)} \alpha^{(0)}_{ij,k} M W^{(0)}\right)$ (8)
The output from the previous GAT layer is fed into the successive GAT layer, similar to an RNN, but the GAT layer weights are not shared:

$\underbrace{H^{(\ell+1)}_{i} = \mathrm{Tanh}\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N(i)} \alpha^{(\ell)}_{ij,k} H^{(\ell)}_{j} W^{(\ell)}\right)}_{\text{attended label features}}$ (9)
The output from the last layer is the attended label features $H_{gat} \in \mathbb{R}^{c \times d}$, where $c$ denotes the number of labels and $d$ denotes the dimension of the attended label features; these are applied to the textual features obtained from the BiLSTM.
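The sketch below is one possible reading of equations (6)-(9) as a multi-head GAT layer in TensorFlow: attention scores are computed from concatenated projected features, the K heads are averaged, and the result passes through tanh. It is not the authors' released implementation; in particular, how the learnable adjacency enters the attention (here it scales the raw scores before the softmax normalization) is an assumption.

```python
import tensorflow as tf

class GATLayer(tf.keras.layers.Layer):
    """One multi-head graph attention layer over the label graph (eqs. 6-9)."""

    def __init__(self, out_dim, num_heads):
        super().__init__()
        self.out_dim, self.num_heads = out_dim, num_heads

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        # One projection matrix W and one attention vector per head.
        self.W = self.add_weight(name="W", shape=(self.num_heads, in_dim, self.out_dim))
        self.a = self.add_weight(name="a", shape=(self.num_heads, 2 * self.out_dim, 1))

    def call(self, H, A):
        # H: (n, in_dim) label features, A: (n, n) learnable adjacency matrix.
        head_outputs = []
        for k in range(self.num_heads):
            HW = H @ self.W[k]                        # (n, out_dim) projected features
            # Concatenation-based scores, eq. (6): the two halves of `a`
            # score the projected features of node i and node j.
            f_i = HW @ self.a[k][: self.out_dim]      # (n, 1)
            f_j = HW @ self.a[k][self.out_dim:]       # (n, 1)
            e = tf.nn.relu(f_i + tf.transpose(f_j))   # (n, n) pairwise scores
            # Combine with the adjacency and normalize (assumption).
            alpha = tf.nn.softmax(e * A, axis=-1)     # attention coefficients
            head_outputs.append(alpha @ HW)           # neighborhood aggregation
        # Average the K heads and apply tanh, eqs. (7)-(9).
        return tf.tanh(tf.add_n(head_outputs) / self.num_heads)

# Usage sketch: two cascaded GAT layers produce the attended label features.
# H_att = GATLayer(768, num_heads=4)(M, A); H_gat = GATLayer(768, num_heads=4)(H_att, A)
```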
3.4 Feature Vector Generation
We use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to obtain the feature vectors. We use BERT to embed the words and then feed the embeddings to the BiLSTM, which is fine-tuned for the domain-specific task. The BiLSTM reads the text sequence $x$ from both directions and computes the hidden states for each word:
$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{LSTM}}\left(\overrightarrow{h}_{i-1}, x_{i}\right), \qquad \overleftarrow{h}_{i} = \overleftarrow{\mathrm{LSTM}}\left(\overleftarrow{h}_{i+1}, x_{i}\right)$ (10)
We obtain the final hidden representation of the i-
th word by concatenating the hidden states from both
directions,
$h_{i} = \left[\overrightarrow{h}_{i}\, ;\, \overleftarrow{h}_{i}\right]$ (11)
$F = f_{rnn}\left(f_{BERT}\left(s; \theta_{BERT}\right); \theta_{rnn}\right) \in \mathbb{R}^{D}$ (12)
where $s$ is the sentence, $\theta_{rnn}$ are the RNN parameters, $\theta_{BERT}$ are the BERT parameters, and $D$ is the hidden size of the BiLSTM. We then multiply the feature vectors with the attended label features to get the final prediction scores:

$\hat{y} = F H_{gat}$ (13)

where $H_{gat} \in \mathbb{R}^{c \times d}$ and $F$ is the feature vector obtained from the BiLSTM model. Figure 1 shows the overall structure of the proposed model.
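A schematic forward pass under the shapes used above, written eagerly with Keras layers: BERT token embeddings feed a BiLSTM whose concatenated final states give the sentence feature vector F (eqs. 10-12), which is multiplied with the attended label features H_gat to produce one score per label (eq. 13). The dense projection to dimension d (so the product is defined) and all shapes are illustrative assumptions.

```python
import tensorflow as tf

batch, seq_len, bert_dim = 2, 128, 768   # assumed input shapes
hidden_size = 250                        # BiLSTM hidden size (Table 1)
num_labels, label_dim = 90, 768          # c labels, attended-feature dimension d

# Stand-ins for the BERT token embeddings and for the GAT output H_gat (c x d).
bert_tokens = tf.random.normal([batch, seq_len, bert_dim])
H_gat = tf.random.normal([num_labels, label_dim])

# BiLSTM sentence encoder: final hidden states from both directions are
# concatenated (eqs. 10-11), then projected to dimension d so the product
# with H_gat is well defined (a practical assumption around eq. 12).
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_size))
project = tf.keras.layers.Dense(label_dim)

F = project(bilstm(bert_tokens))                  # (batch, d) feature vectors

# Final prediction scores y_hat = F . H_gat^T, eq. (13): one score per label.
logits = tf.matmul(F, H_gat, transpose_b=True)    # (batch, c)
probs = tf.sigmoid(logits)                        # per-label probabilities
```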
3.5 Adjacency Matrix Generation
In this section, we explain how we initialize the adjacency matrix for the GAT network. We use three different methods to initialize the weights.
Identity Matrix
We use the identity matrix as the adjacency matrix: ones on the main diagonal and zeros elsewhere, i.e., we start from zero correlation between labels.
Xavier Initialization
We use Xavier initialization (Glorot and Bengio, 2010) to initialize the weights of the adjacency matrix:

$\pm \sqrt{\dfrac{6}{n_{i} + n_{i+1}}}$ (14)

where $n_{i}$ is the number of incoming network connections.
Correlation Matrix
The correlation matrix is constructed by counting the pairwise co-occurrences of labels. The frequency vector is a vector $F \in \mathbb{R}^{n}$, where $n$ is the number of labels and $F_{i}$ is the frequency of label $i$ in the overall training set. The co-occurrence matrix is then normalized by the frequency vector:

$A = M / F$ (15)

where $M \in \mathbb{R}^{n \times n}$ is the co-occurrence matrix and $F \in \mathbb{R}^{n}$ is the frequency vector of the individual labels. This is similar to how the correlation matrix is built in (Chen et al., 2019), except that we do not employ binarization.
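A NumPy sketch of the three initializations, assuming the training labels are available as a binary indicator matrix Y of shape (samples, labels); the square-root bound in the Xavier case and the row-wise normalization of the co-occurrence counts follow our reading of equations (14) and (15).

```python
import numpy as np

def identity_adjacency(n):
    """Identity matrix: ones on the diagonal, i.e. start from zero correlation."""
    return np.eye(n)

def xavier_adjacency(n, seed=0):
    """Uniform Xavier/Glorot initialization in [-limit, +limit], eq. (14)."""
    limit = np.sqrt(6.0 / (n + n))          # fan_in = fan_out = n for an n x n matrix
    return np.random.default_rng(seed).uniform(-limit, limit, size=(n, n))

def cooccurrence_adjacency(Y):
    """A = M / F, eq. (15): pairwise co-occurrence counts normalized by the
    per-label frequencies. Y is a (samples, labels) 0/1 indicator matrix."""
    M = Y.T @ Y                              # (labels, labels) co-occurrence counts
    F = Y.sum(axis=0)                        # frequency of each label in training set
    return M / np.maximum(F, 1)[:, None]     # row-wise normalization (assumption)

# Toy usage: 4 training samples, 3 labels.
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]])
A = cooccurrence_adjacency(Y)
```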
3.6 Loss Function
We use cross-entropy as the loss function. If the ground-truth label vector of a data point is $y \in \mathbb{R}^{C}$, where $y_{c} \in \{0, 1\}$, then

$L = -\sum_{c=1}^{C} \left[\, y_{c} \log\left(\sigma(\hat{y}_{c})\right) + \left(1 - y_{c}\right) \log\left(1 - \sigma(\hat{y}_{c})\right) \right]$ (16)

where $\sigma$ is the sigmoid activation function.
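Equation (16) is the standard multi-label binary cross-entropy applied independently to each of the C label scores. A minimal sketch using TensorFlow's built-in loss (the averaging over labels and examples is an assumption about the reduction):

```python
import tensorflow as tf

# Ground-truth 0/1 label vectors and raw scores y_hat from eq. (13).
y_true = tf.constant([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
y_logits = tf.constant([[2.3, -1.1, 0.4], [-0.7, 1.9, -2.2]])

# from_logits=True applies the sigmoid inside the loss, matching eq. (16).
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
loss = bce(y_true, y_logits)   # scalar: mean over labels and examples
```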
4 EXPERIMENT
In this section, we introduce the datasets, experiment details, and baseline results. Subsequently, we compare the proposed method with the baselines.
4.1 Datasets
In this section, we describe the datasets used in the experiments and their sources. Table 2 shows the statistics of all datasets.
Reuters-21578 is a collection of documents collected from the Reuters newswire in 1987. The Reuters-21578 test collection, together with its earlier variants, has long been a standard benchmark for text categorization (TC) (Debole and Sebastiani, 2005). It contains 10,788 documents, split into 8,630 documents for training and 2,158 for testing, with a total of 90 categories.
RCV1-V2, provided by Lewis et al. (2004), consists of categorized newswire stories made available by Reuters Ltd. Each newswire story can have multiple topics assigned to it, with 103 topics in total. RCV1-V2 contains 804,414 documents, which are divided into 643,531 documents for training and 160,883 for testing.
Arxiv Academic Paper Dataset (AAPD) is provided by Yang et al. (2018). The dataset consists of the abstracts and corresponding subjects of 55,840 academic papers, where each paper can have multiple subjects. The target is to predict the subjects of an academic paper from the content of its abstract. The AAPD dataset is divided into 44,672 documents for training and 11,168 for testing, with a total of 54 classes.
Slashdot dataset was collected from the Slashdot
website and consists of article blurbs labeled with the
subject categories. Slashdot contains 19,258 samples
for training and 4,814 samples for testing with a total
of 291 classes.
Toxic Comment Dataset is taken from Kaggle. It contains a large number of comments from Wikipedia talk page edits, which human raters have labeled for toxic behavior.
4.2 Experiment Details
We implement our experiments in TensorFlow on an NVIDIA 1080Ti GPU. Our model consists of two GAT layers with multi-head attention. Table 1 shows the hyper-parameters of the model on the five datasets. For label representations, we adopt 768-dimensional BERT embeddings trained on Wikipedia and BookCorpus.
For categories whose names contain multiple words, we obtain the label representation as the average of the embeddings of all the words. For all datasets, the batch size is set to 250, and out-of-vocabulary (OOV) words are replaced with unk. We use BERT embeddings to encode the sentences. We use the Adam optimizer to minimize the final objective function. The learning rate is initialized to 0.001, and we make use of dropout 0.5 (Srivastava et al. 2014) to avoid over-fitting and clip the gradients (Pascanu, Mikolov, and Bengio 2013) to a maximum norm of 10.
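For reference, the optimization settings above map onto a Keras optimizer roughly as follows; the global-norm clipping argument depends on the TensorFlow release, so this is a sketch rather than the exact training code.

```python
import tensorflow as tf

# Adam with the learning rate used in the paper; gradients clipped to a
# maximum (global) norm of 10. In older TF versions the clipping would be
# done manually with tf.clip_by_global_norm inside the training step.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, global_clipnorm=10.0)

# Dropout of 0.5 applied to the recurrent features to reduce over-fitting.
dropout = tf.keras.layers.Dropout(0.5)
```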
4.3 Performance Evaluation
miF1: In the micro-averaging method, the individual true positives, false positives, and false negatives of the system are summed over all labels and used to obtain the micro-averaged F-score:

$\mathrm{F1}_{micro} = \dfrac{\sum_{j=1}^{L} 2\, tp_{j}}{\sum_{j=1}^{L}\left(2\, tp_{j} + fp_{j} + fn_{j}\right)}, \quad \mathrm{Precision}_{micro} = \dfrac{\sum_{j=1}^{L} tp_{j}}{\sum_{j=1}^{L}\left(tp_{j} + fp_{j}\right)}, \quad \mathrm{Recall}_{micro} = \dfrac{\sum_{j=1}^{L} tp_{j}}{\sum_{j=1}^{L}\left(tp_{j} + fn_{j}\right)}$ (17)
Hamming Loss (HL): Hamming loss is the fraction of labels that are incorrectly predicted (Destercke, 2014). It therefore takes into account both the prediction of an incorrect label and the omission of a correct label, normalized over the total number of classes and the total number of examples:

$HL = \dfrac{1}{|N| \cdot |L|} \sum_{i=1}^{|N|} \sum_{j=1}^{|L|} \mathrm{xor}\left(y_{i,j}, z_{i,j}\right)$ (18)

where $y_{i,j}$ is the target and $z_{i,j}$ is the prediction. Ideally, we would expect a Hamming loss of $HL = 0$, which would imply no error; in practice, the smaller the value of the Hamming loss, the better the performance of the learning algorithm.
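Both metrics can be computed directly from binary target and prediction matrices; a NumPy sketch following equations (17) and (18), where thresholding the model scores at 0.5 is an assumption:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1, eq. (17): tp, fp and fn are summed over all labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn)

def hamming_loss(y_true, y_pred):
    """Hamming loss, eq. (18): fraction of label entries predicted incorrectly."""
    return np.mean(y_true != y_pred)

# Toy usage: scores thresholded at 0.5 to obtain 0/1 predictions.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.6]])
y_pred = (scores >= 0.5).astype(int)
print(micro_f1(y_true, y_pred), hamming_loss(y_true, y_pred))
```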
4.4 Comparison of Methods
We compare the performance of 27 algorithms, including state-of-the-art models. Furthermore, we compare against the latest state-of-the-art models on the RCV1-V2 dataset. The compared algorithms can be categorized into three groups, as described below:
Flat Baselines
Flat baseline models transform the documents and extract features using TF-IDF (Ramos), and later use those features as input to Logistic Regression (LR) (Allison, 1999), Support Vector Machine (SVM) (Hearst, 1998), Hierarchical Support Vector Machine (HSVM) (Vural and Dy, 2004), Binary Relevance (BR) (Boutell et al., 2004), and Classifier Chains (CC) (Read et al., 2011). Flat baseline methods ignore the relations between words and the dependencies between labels.
Sequence, Graph and N-gram based Models
These types of models first transform the text dataset into sequences of words, graphs of words, or N-gram features, and later apply different types of deep learning models to those features, including CNN (Kim, 2014b), CNN-RNN (Chen et al., 2017), RCNN (Lai et al., 2015), DCNN (Schwenk et al., 2017), XML-CNN (Liu et al., 2017), HR-DGCNN (Peng et al., 2018), Hierarchical LSTM (HLSTM) (Chen et al., 2016), the multi-label classification approach based on a conditional cyclic directed graphical model (CDN-SVM) (Guo and Gu, 2011), the Hierarchical Attention Network (HAN) (Yang et al., 2016), and the Bi-directional Block Self-Attention Network (Bi-BloSAN) (Shen et al., 2018). For example, the Hierarchical Attention Network (HAN) uses a GRU gating mechanism to encode the sequences and applies word- and sentence-level attention on those sequences for document classification. The Bi-directional Block Self-Attention Network (Bi-BloSAN) uses intra-block and inter-block self-attention to capture both local and long-range context dependencies by splitting the sequences into several blocks.
Recent State-of-the-Art Models
We compare our model with different state-of-the-art models for the multi-label classification task, including BP-MLL-RAD (Nam et al., 2014), Input Encoding with Feature Message Passing (FMP) (Lanchantin et al., 2019), TEXT-CNN (Kim, 2014a), the hierarchical taxonomy-aware and attentional graph capsule recurrent CNNs framework (HE-AGCRCNN) (Peng et al., 2019), BOW-CNN (Johnson and Zhang, 2014), Capsule-B networks (Zhao et al., 2018), Hierarchical Text Classification with Reinforced Label Assignment (HiLAP) (Mao et al., 2019), Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN (HR-DGCNN) (Peng et al., 2018), the Hierarchical Transfer Learning-based Strategy (HTrans) (Banerjee et al., 2019), BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019b), and BERT-SGM (Yarullin and Serdyukov, 2019).
Table 1: Main experimental hyper-parameters.

Dataset        | Vocab Size | Embed Size | Hidden Size | Attention Heads
Reuters-21578  | 20,000     | 768        | 250         | 4
RCV1-V2        | 50,000     | 768        | 250         | 8
AAPD           | 30,000     | 768        | 250         | 8
Slashdot       | 30,000     | 768        | 300         | 4
Toxic Comment  | 50,000     | 768        | 200         | 8
Table 2: Statistics of the datasets.

Dataset        | Domain | #Train  | #Test   | Labels
Reuters-21578  | Text   | 8,630   | 2,158   | 90
RCV1-V2        | Text   | 643,531 | 160,883 | 103
AAPD           | Text   | 44,672  | 11,168  | 54
Slashdot       | Text   | 19,258  | 4,814   | 291
Toxic Comment  | Text   | 126,856 | 31,714  | 7
For example, FMP + LaMP is a variant of the LaMP model that uses Input Encoding with Feature Message Passing (FMP); it achieves state-of-the-art accuracy across five metrics and seven datasets. HE-AGCRCNN is another recent state-of-the-art model, which has shown outstanding performance in large-scale multi-label text classification; it uses a hierarchical taxonomy embedding method to learn the hierarchical relations among the labels. BERT (Bidirectional Encoder Representations from Transformers) is a recent pre-trained language model that has shown groundbreaking results in many NLP tasks. BERT uses an attention mechanism (the Transformer) to learn contextual relations between the words in a text.
5 PERFORMANCE ANALYSIS
In this section, we compare our proposed method with the baselines on the test sets. Table 4 shows a detailed comparison of micro-F1 scores for various state-of-the-art models.
5.1 Comparisons with State-of-the-Art
First, we compare the results of the traditional machine learning algorithms. Among LR, SVM, and HSVM, HSVM performs better than the other two; HSVM uses an SVM at each node of a decision tree. Next, we compare the results of hierarchical, CNN-based, and graph-based deep learning models. Among the hierarchical models HLSTM, HAN, HE-AGCRCNN, HR-DGCNN, HiLAP, and HTrans, HE-AGCRCNN performs best. HAN and HLSTM are based on recurrent neural networks. When comparing the recurrent models with the flat baseline models, the recurrent neural networks perform worse than HSVM even though the baselines ignore label dependency; RNNs face the problem of vanishing and exploding gradients when the sequences are too long. The graph-based model, HR-DGCNN, performs better than the recurrent and baseline models. Comparing the CNN-based models RCNN, XML-CNN, DCNN, TEXTCNN, CNN, and CNN-RNN, TEXTCNN performs best among them while RCNN performs worst.
The sequence generator models treat the multi-label classification task as sequence generation. Comparing the sequence generator models SGM-GE and seq2seq, SGM performs better than the seq2seq network; SGM utilizes the correlations between labels by using a sequence generator model with a novel decoder structure.
Comparing the proposed MAGNET against the state-of-the-art models, MAGNET significantly improves on previous state-of-the-art results; we see a ~20% improvement in miF1 compared to the HSVM model. When comparing with the best hierarchical text classification models, we observe ~11%, ~19%, ~5% and ~8% accuracy improvements compared to HE-AGCRCNN, HAN, HiLAP, and HTrans, respectively. The proposed model produces a ~16% improvement in miF1 over the popular Bi-directional Block Self-Attention Network (Bi-BloSAN).
Compared with the CNN group of models, the proposed model improves performance by ~12% and ~6% accuracy over the TEXTCNN and BOW-CNN methods, respectively. MAGNET achieves a ~2% improvement over the state-of-the-art BERT model.
5.2 Evaluation on Other Datasets
We also evaluate our proposed model on four datasets other than RCV1 to observe its performance on datasets that vary in the number of samples and the number of labels. Table 3 shows the miF1 scores for the different datasets, and we also report the Hamming loss in Table 5. The evaluation results show that the proposed method achieves the best performance in the primary evaluation metrics. We observe 3% and 4% miF1 improvements on the AAPD and Slashdot datasets, respectively, compared to the CNN-RNN method.
5.3 Analysis and Discussion
Here we discuss further analysis of the model and experimental results. We report the evaluation results in terms of Hamming loss and micro-F1 score. We use a moving average with a window size of 3 to draw the plots, to make them easier to read.
5.3.1 Impact of Initialization of the Adjacency
Matrix
We initialized the adjacency matrix in three different ways: random, identity, and co-occurrence matrix. We hypothesized that the co-occurrence matrix would perform best, since the model is fed with richer prior information than with the identity matrix (where the correlation is zero) or a random matrix. To our surprise, random initialization performed the best at 0.887 and the identity matrix performed the worst at 0.865, whereas the co-occurrence matrix achieved a micro-F1 score of 0.878. Even though the Xavier initializer performed the best, all the other random initializers also performed better than the co-occurrence and identity matrices. This shows that the textual information from the samples contains richer information than the label co-occurrence matrix with which we initialize the adjacency, and that both the co-occurrence and identity matrices trap the model in a local minimum.
5.3.2 Results on Different Types of Word
Embeddings
In this section, we investigate the impact of four different word embeddings on our proposed architecture, namely Word2Vec embeddings (Mikolov et al., 2013), GloVe embeddings (Pennington et al., 2014), random embeddings, and BERT embeddings (Devlin et al., 2019a). Figure 2 and Figure 3 show the F1 scores of all four word embeddings on the (unseen) test dataset of Reuters-21578.
Figure 2: Performance of different types of word embeddings with MAGNET. The x-axis refers to the type of word embedding and the y-axis to the accuracy (F1-score).
Accordingly, we make the following observations:
Glove and word2vec embeddings produce similar results.
Random embeddings perform worse than other
embeddings. Pre-trained word embeddings have
proven to be highly useful in our proposed archi-
tecture compared to the random embeddings.
BERT embeddings outperform the other embeddings in this experiment. Therefore, using BERT feature embeddings increases the accuracy and performance of our architecture.
Our proposed model uses BERT embeddings for en-
coding the sentences.
Figure 3: Performance of the proposed model with different types of word embeddings. The x-axis is the number of epochs and the y-axis refers to the micro-F1 score.
Table 3: Comparisons of micro F1-score (F1-accuracy) for various models on four benchmark datasets.

Methods     | Reuters-21578 | AAPD  | Slashdot | Toxic
BR          | 0.878         | 0.648 | 0.486    | 0.853
BR-support  | 0.872         | 0.682 | 0.516    | 0.874
CC          | 0.879         | 0.654 | 0.480    | 0.893
CNN         | 0.863         | 0.664 | 0.512    | 0.775
CNN-RNN     | 0.855         | 0.669 | 0.530    | 0.904
MAGNET      | 0.899         | 0.696 | 0.568    | 0.930
Table 4: Comparisons of micro F1-score for various state-of-the-art models on the RCV1-V2 dataset.

Method         | Accuracy
LR             | 0.692
SVM            | 0.691
HSVM           | 0.693
HLSTM          | 0.673
RCNN           | 0.686
XML-CNN        | 0.695
HAN            | 0.696
Bi-BloSAN      | 0.72
DCNN           | 0.732
SGM+GE         | 0.719
CAPSULE-B      | 0.739
CDN-SVM        | 0.738
HR-DGCNN       | 0.761
TEXTCNN        | 0.766
HE-AGCRCNN     | 0.778
BP-MLL-RAD     | 0.780
HTrans         | 0.805
BOW-CNN        | 0.827
HiLAP          | 0.833
BERT           | 0.864
BERT + SGM     | 0.846
FMP + LaMP_pr  | 0.877
MAGNET         | 0.885
5.3.3 Comparison between Two Different Graph
Neural Networks
In this section, we compare the performance of the GAT and GCN networks. The critical difference between GAT and GCN is how the information is aggregated from the neighborhood. GAT computes the hidden states of each node by attending over its neighbors, following a self-attention strategy, whereas GCN produces the normalized sum of the node features of the neighbors.
GAT improves the average miF1 score by 4% over the GCN model. This shows that the GAT model captures label correlations better than GCN. The attention mechanism can identify label importance in the correlation graph by considering the significance of neighboring labels.
Figure 4: Performance of GAT vs GCN. x-axis is number
of epochs and y-axis is micro-F1 score.
Figure 4 shows the accuracy of both neural networks on the Reuters-21578 dataset.
6 CONCLUSION
The proposed approach can improve the accuracy and efficiency of models and can work across a wide range of data types and applications. To model and capture the correlation between labels, we proposed a GAT-based model for multi-label text classification.
We evaluated the proposed model on various datasets and presented the results. The combination of GAT with a bidirectional LSTM achieves consistently higher accuracy than that obtained by conventional approaches.
Table 5: Comparisons of Hamming loss for various models on five benchmark datasets. The smaller the value, the better.

Methods  | RCV1-V2 | AAPD   | Reuters-21578 | Slashdot | Toxic
BR       | 0.0093  | 0.0316 | 0.0032        | 0.052    | 0.034
CC       | 0.0089  | 0.0306 | 0.0031        | 0.057    | 0.030
CNN      | 0.0084  | 0.0287 | 0.0033        | 0.049    | 0.039
CNN-RNN  | 0.0086  | 0.0282 | 0.0037        | 0.046    | 0.025
MAGNET   | 0.0079  | 0.0252 | 0.0029        | 0.039    | 0.022
Even though our proposed model performs very well, there are still some limitations. When the dataset contains a large number of labels, the correlation matrix becomes very large, and training the model can be difficult. Our work alleviates this problem to some extent, but we still think the exploration of more effective solutions is vital in the future.
REFERENCES
Allison, P. (1999). Logistic Regression Using Sas®: Theory
and Application. SAS Publishing, first edition.
Banerjee, S., Akkaya, C., Perez-Sorrosal, F., and Tsiout-
siouliklis, K. (2019). Hierarchical transfer learning
for multi-label text classification. In Proceedings of
the 57th Conference of the Association for Computa-
tional Linguistics, ACL 2019, Florence, Italy, July 28-
August 2, 2019, Volume 1: Long Papers, pages 6295–
6300.
Boutell, M. R., Luo, J., Shen, X., and Brown, C. M.
(2004). Learning multi-label scene classification. Pat-
tern Recognition, 37(9):1757 – 1771.
Cai, H., Zheng, V. W., and Chang, K. C. (2017). A compre-
hensive survey of graph embedding: Problems, tech-
niques and applications. CoRR, abs/1709.07604.
Chen, G., Ye, D., Xing, Z., Chen, J., and Cambria, E.
(2017). Ensemble application of convolutional and
recurrent neural networks for multi-label text catego-
rization. In 2017 International Joint Conference on
Neural Networks (IJCNN), pages 2377–2383.
Chen, H., Sun, M., Tu, C., Lin, Y., and Liu, Z. (2016).
Neural sentiment classification with user and prod-
uct attention. In Proceedings of the 2016 Conference
on Empirical Methods in Natural Language Process-
ing, EMNLP 2016, Austin, Texas, USA, November 1-4,
2016, pages 1650–1659.
Chen, Z.-M., Wei, X.-S., Wang, P., and Guo, Y. (2019).
Multi-Label Image Recognition with Graph Convolu-
tional Networks. In The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Debole, F. and Sebastiani, F. (2005). An analysis of the
relative hardness of reuters-21578 subsets: Research
articles. J. Am. Soc. Inf. Sci. Technol., 56(6):584–596.
Destercke, S. (2014). Multilabel prediction with probability
sets: The hamming loss case. In Information Process-
ing and Management of Uncertainty in Knowledge-
Based Systems - 15th International Conference, IPMU
2014, Montpellier, France, July 15-19, 2014, Proceed-
ings, Part II, pages 496–505.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019a).
BERT: pre-training of deep bidirectional transform-
ers for language understanding. In Proceedings of
the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2019,
Minneapolis, MN, USA, June 2-7, 2019, Volume 1
(Long and Short Papers), pages 4171–4186.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019b). Bert: Pre-training of deep bidirectional
transformers for language understanding. In NAACL-
HLT.
Fout, A., Byrd, J., Shariat, B., and Ben-Hur, A. (2017).
Protein interface prediction using graph convolutional
networks. In Proceedings of the 31st International
Conference on Neural Information Processing Sys-
tems, NIPS’17, pages 6533–6542, USA. Curran As-
sociates Inc.
Glorot, X. and Bengio, Y. (2010). Understanding the dif-
ficulty of training deep feedforward neural networks.
In Proceedings of the Thirteenth International Confer-
ence on Artificial Intelligence and Statistics, AISTATS
2010, Chia Laguna Resort, Sardinia, Italy, May 13-
15, 2010, pages 249–256.
Gopal, S. and Yang, Y. (2010). Multilabel classification
with meta-level features. pages 315–322.
Gopal, S. and Yang, Y. (2015). Hierarchical bayesian infer-
ence and recursive regularization for large-scale clas-
sification. TKDD, 9(3):18:1–18:23.
Gopal, S., Yang, Y., Bai, B., and Niculescu-Mizil, A.
(2012). Bayesian models for large-scale hierarchi-
cal classification. In Advances in Neural Informa-
tion Processing Systems 25: 26th Annual Conference
on Neural Information Processing Systems 2012. Pro-
ceedings of a meeting held December 3-6, 2012, Lake
Tahoe, Nevada, United States, pages 2420–2428.
Guo, Y. and Gu, S. (2011). Multi-label classification us-
ing conditional dependency networks. In IJCAI 2011,
Proceedings of the 22nd International Joint Confer-
ence on Artificial Intelligence, Barcelona, Catalonia,
Spain, July 16-22, 2011, pages 1300–1305.
Hearst, M. A. (1998). Support vector machines. IEEE In-
telligent Systems, 13(4):18–28.
Johnson, R. and Zhang, T. (2014). Effective use of word
order for text categorization with convolutional neural
networks. CoRR, abs/1412.1058.
Katakis, I., Tsoumakas, G., and Vlahavas, I. (2008). Multi-
label text classification for automated tag suggestion.
In Proceedings of the ECML/PKDD 2008 Discovery
Challenge.
Kim, Y. (2014a). Convolutional neural networks for sen-
tence classification. In Proceedings of the 2014 Con-
ference on Empirical Methods in Natural Language
Processing, EMNLP 2014, October 25-29, 2014,
Doha, Qatar, A meeting of SIGDAT, a Special Inter-
est Group of the ACL, pages 1746–1751.
Kim, Y. (2014b). Convolutional neural networks for sen-
tence classification. CoRR, abs/1408.5882.
Kipf, T. N. and Welling, M. (2016). Semi-supervised clas-
sification with graph convolutional networks. CoRR,
abs/1609.02907.
Lai, S., Xu, L., Liu, K., and Zhao, J. (2015). Recurrent con-
volutional neural networks for text classification. In
Proceedings of the Twenty-Ninth AAAI Conference on
Artificial Intelligence, January 25-30, 2015, Austin,
Texas, USA, pages 2267–2273.
Lanchantin, J., Sekhon, A., and Qi, Y. (2019). Neural
message passing for multi-label classification. CoRR,
abs/1904.08049.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). Rcv1:
A new benchmark collection for text categorization
research. J. Mach. Learn. Res., 5:361–397.
Liu, J., Chang, W., Wu, Y., and Yang, Y. (2017). Deep learn-
ing for extreme multi-label text classification. In Pro-
ceedings of the 40th International ACM SIGIR Con-
ference on Research and Development in Information
Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017,
pages 115–124.
Luaces, O., Díez, J., Barranquero, J., del Coz, J. J., and Bahamonde, A. (2012). Binary relevance efficacy for multilabel classification. Progress in Artificial Intelligence, 1(4):303–313.
Mao, Y., Tian, J., Han, J., and Ren, X. (2019). Hierar-
chical text classification with reinforced label assign-
ment. CoRR, abs/1908.10419.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances
in Neural Information Processing Systems 26: 27th
Annual Conference on Neural Information Processing
Systems 2013. Proceedings of a meeting held Decem-
ber 5-8, 2013, Lake Tahoe, Nevada, United States.,
pages 3111–3119.
Nam, J., Kim, J., Loza Mencía, E., Gurevych, I., and Fürnkranz, J. (2014). Large-scale multi-label text classification - revisiting neural networks. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part II, pages 437–452.
Peng, H., Li, J., Gong, Q., Wang, S., He, L., Li, B., Wang,
L., and Yu, P. S. (2019). Hierarchical taxonomy-aware
and attentional graph capsule rcnns for large-scale
multi-label text classification. CoRR, abs/1906.04898.
Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song,
Y., and Yang, Q. (2018). Large-scale hierarchical text
classification with recursively regularized deep graph-
cnn. In Proceedings of the 2018 World Wide Web
Conference on World Wide Web, WWW 2018, Lyon,
France, April 23-27, 2018, pages 1063–1072.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing, EMNLP
2014, October 25-29, 2014, Doha, Qatar, A meeting
of SIGDAT, a Special Interest Group of the ACL, pages
1532–1543.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learn-
ing. Morgan Kaufmann Publishers Inc., San Fran-
cisco, CA, USA.
Ramos, J. Using tf-idf to determine word relevance in doc-
ument queries.
Read, J., Pfahringer, B., Holmes, G., and Frank, E. (2011).
Classifier chains for multi-label classification. Ma-
chine Learning, 85(3):333–359.
Schapire, R. E. and Singer, Y. (2000). Boostexter: A
boosting-based system for text categorization. Ma-
chine Learning, 39(2/3):135–168.
Schwenk, H., Barrault, L., Conneau, A., and LeCun, Y.
(2017). Very deep convolutional networks for text
classification. In Proceedings of the 15th Conference
of the European Chapter of the Association for Com-
putational Linguistics, EACL 2017, Valencia, Spain,
April 3-7, 2017, Volume 1: Long Papers, pages 1107–
1116.
Shen, T., Zhou, T., Long, G., Jiang, J., and Zhang, C.
(2018). Bi-directional block self-attention for fast
and memory-efficient sequence modeling. CoRR,
abs/1804.00857.
Sun, A. and Lim, E. (2001). Hierarchical text classification
and evaluation. In Proceedings of the 2001 IEEE In-
ternational Conference on Data Mining, 29 November
- 2 December 2001, San Jose, California, USA, pages
521–528.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
Jones, L., Gomez, A. N., Kaiser, L., and Polo-
sukhin, I. (2017). Attention is all you need. CoRR,
abs/1706.03762.
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
Vural, V. and Dy, J. G. (2004). A hierarchical method for
multi-class support vector machines. In Proceedings
of the Twenty-first International Conference on Ma-
chine Learning, ICML ’04, pages 105–, New York,
NY, USA. ACM.
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu,
P. S. (2019). A comprehensive survey on graph neural
networks. CoRR, abs/1901.00596.
Xiao, L., Zhou, D., and Wu, M. (2011). Hierarchical clas-
sification via orthogonal transfer. In Proceedings of
the 28th International Conference on Machine Learn-
ing, ICML 2011, Bellevue, Washington, USA, June 28
- July 2, 2011, pages 801–808.
Xue, G., Xing, D., Yang, Q., and Yu, Y. (2008). Deep
classification in large-scale text hierarchies. In Pro-
ceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Infor-
mation Retrieval, SIGIR 2008, Singapore, July 20-24,
2008, pages 619–626.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., and
Hovy, E. H. (2016). Hierarchical attention networks
for document classification. In NAACL HLT 2016,
The 2016 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, San Diego Califor-
nia, USA, June 12-17, 2016, pages 1480–1489.
Yarullin, R. and Serdyukov, P. (2019). Bert for sequence-
to-sequence multi-label text classification.
Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton,
W. L., and Leskovec, J. (2018). Graph convolutional
neural networks for web-scale recommender systems.
CoRR, abs/1806.01973.
Zhang, M.-L., Li, Y.-K., Liu, X.-Y., and Geng, X.
(2018). Binary relevance for multi-label learning: an
overview. Frontiers of Computer Science, 12(2):191–
202.
Zhao, W., Ye, J., Yang, M., Lei, Z., Zhang, S., and Zhao, Z.
(2018). Investigating capsule networks with dynamic
routing for text classification. CoRR, abs/1804.00538.