A Comparative Study of Deep Learning Methods for the Detection and
Classification of Natural Disasters from Social Media
Spyros Fontalis, Alexandros Zamichos, Maria Tsourma, Anastasis Drosou and Dimitrios Tzovaras
Information Technologies Institute, Centre for Research and Technology Hellas (CERTH)
Keywords:
Disaster Management, Twitter, Preprocessing, Bias Mitigation, Deep Learning.
Abstract:
Disaster Management, defined as a coordinated social effort to successfully prepare for and respond to disasters, can benefit greatly as an industrial process from modern Deep Learning methods. Disaster prevention organizations in particular can benefit from the processing of disaster response data. In an attempt to detect and subsequently categorise disaster-related information from tweets via tweet text analysis, a Feedforward Neural Network (FNN), a Convolutional Neural Network (CNN), a Bi-directional Long Short-Term Memory (BLSTM) network, as well as several Transformer-based network architectures, namely BERT, DistilBERT, ALBERT, RoBERTa and DeBERTa, are employed. The two main tasks of the work presented in this paper are: (1) distinguishing tweets into disaster-related and non-relevant ones, and (2) categorising already labelled disaster tweets into eight predefined natural disaster categories. The supported types of natural disasters are earthquakes, floods, hurricanes, wildfires, tornadoes, explosions, volcano eruptions and general disasters. To this end, several accessible related datasets are collected and combined to suit the two tasks. In addition, the combination of preprocessing tasks that is most beneficial for inference is investigated. Finally, experiments are conducted using bias mitigation techniques.
1 INTRODUCTION
Over the last decade, social media networks have entered people's everyday lives, allowing them to post and share any information that they consider important. Daily, the number of people using social media platforms such as Twitter grows, leading to an increased flow of information (Chaffey, 2016). The importance of this information lies in the fact that users can post it from anywhere, instantly and without any barrier. This allows third parties, such as developers and research scientists, to collect and analyse this information in order to extract general features, such as public opinion on a topic (Neri et al., 2012), or to create an evacuation plan in the case of an ongoing natural disaster.
On this basis, two types of text classifiers based on Deep Learning models have been developed. The first classifier acts as a real natural disaster detector, whose goal is the binary classification of tweets into those that relate to a real natural disaster and those that are irrelevant. The second classifier assigns tweets that are already known to refer to natural disasters to predefined natural disaster types.
Several previous works evaluate methods for classifying disaster tweets, such as an evaluation of machine learning techniques (Kumar et al., 2019) and an evaluation of BERT-like models (Zhou et al., 2022) on this task. However, none of these approaches combine a clear evaluation of a wide range of machine and deep learning models on a large and diverse dataset.
Our contributions in this paper are:
• Evaluating and comparing several Deep Learning classifiers for two separate disaster tweet classification tasks.
• Experimenting with the combination of preprocessing steps that (1) maximises the efficiency of the two classifiers on the above downstream tasks and (2) mitigates the pre-existing bias of our collected training datasets.
2 RELATED WORK
There are a number of notable previous attempts to classify disaster tweets using a variety of methods, including deep learning techniques such as convolutional neural networks
(Nguyen et al., 2017) and recurrent neural networks (Nikolov and Radivchev, 2019), as well as conventional machine learning algorithms (Huang and Xiao, 2015). Of particular note is the introduction of a robust transformer for crisis classification and contextual crisis embedding (Liu et al., 2021). An interesting domain adaptation technique is also used by (Li et al., 2018), which learns classifiers from unlabelled target data in addition to labelled source data.
Concerning the latest advances in the field of Natural Language Processing, the current state-of-the-art architecture is the Transformer (Vaswani et al., 2017). One of the first models to apply this architecture to language modelling is BERT (Devlin et al., 2018), which pre-trains the encoder portion of the Transformer on large unlabelled corpora. BERT has achieved state-of-the-art results on a large number of NLP benchmark downstream tasks. The effectiveness of the architecture proposed by BERT has paved the way for many similar attempts, providing the inspiration for various kinds of modifications and improvements.
An important category of BERT variations deals with size reduction. During pre-training, BERT must adjust hundreds of millions of parameters. This can make the process prohibitively expensive for many researchers or small companies (Schwartz et al., 2020). An important aspect of this characteristic is the environmental impact that the training process entails (Strubell et al., 2019). In this context, DistilBERT (Sanh et al., 2019) is a successful knowledge distillation effort, which reduces the size of the original BERT model by 40% while retaining 97% of its language understanding capabilities. Similarly, ALBERT (Lan et al., 2019) shrinks the base BERT model through parameter-reduction techniques, such as cross-layer parameter sharing and a factorised embedding parameterisation.
One of the most successful BERT variants is the RoBERTa model (Liu et al., 2019), which exposed several of BERT's main weaknesses (Cortiz, 2021) and showed that the original model was significantly undertrained relative to its full potential. XLM-RoBERTa (Conneau et al., 2019) is trained on one hundred languages and demonstrates that multilingual language modelling is not necessarily associated with performance degradation. Finally, the DeBERTa architecture (He et al., 2020) improves on the BERT and RoBERTa models using two novel techniques: disentangled attention and an enhanced mask decoder.
3 DATASETS - PREPROCESSING
This section presents the datasets used for training, along with their detailed composition. In total, four final datasets are used for the experiments in this paper: the Kaggle dataset, first analysed in Section 3.1; the Synthetic binary dataset, analysed in Section 3.2; the Multi-class binary classification dataset, analysed in Section 5.4; and finally the Synthetic multi-class dataset, analysed in Section 3.3. The first three are part of the first text classification task and the last one is part of the second. All the above datasets are divided into 80% train, 10% validation and 10% test sets.
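As an illustration, the 80/10/10 split can be reproduced with two chained calls to scikit-learn's train_test_split. This is a minimal sketch, assuming the tweets and labels have already been loaded; the placeholder data below is purely illustrative.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the collected tweets and their labels.
texts = [f"tweet {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First split off 20%, then halve it into validation and test (80/10/10).
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```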
3.1 Data Sources
Listed here are all the independent sources we have combined to create our datasets. These are the following:
1. CrisisLex: Crisis-Related Social Media Data and
Tools (Olteanu et al., 2014).
2. HumAID: Human-Annotated Disaster Incidents
Data from Twitter by CRISISNLP (Alam et al.,
2021).
3. Disaster Eyewitness Tweets (Zahra et al., 2020).
4. Kaggle (www.kaggle.com). This dataset is provided by the relevant Kaggle competition "Natural Language Processing with Disaster Tweets".
5. Volcano Eruption Tweets. This dataset contains 2516 tweets collected using the Twitter API, referring to two volcano eruptions (i.e. the Hunga Tonga and La Palma volcanoes).
3.2 Binary Classification of Tweets into
“disaster” and “non relevant”
Three distinct datasets are used for this task:
1. For the binary classification task of separating the disaster tweets from the non-disaster ones, the Kaggle dataset is primarily used.
2. To achieve a more diverse representation of the non-relevant class, 5000 random disaster-unrelated tweets were extracted from the CrisisLex dataset and combined with the Kaggle dataset (see the sketch after this list). The result is a merged dataset with a total of 12373 tweets, of which 4535 refer to disasters and 7838 are non-relevant. This dataset is henceforth referred to as the Synthetic binary dataset.
3. The Multi-class binary classification dataset comprises 46672 tweets that may or may not refer to disasters. Its structure is analysed in Section 5.4.
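A hedged sketch of how the Synthetic binary dataset could be assembled with pandas is given below; the file names, column names and label values are assumptions for illustration, not the authors' actual code.

```python
import pandas as pd

# Hypothetical input files: the Kaggle competition data (text, target)
# and CrisisLex tweets with a relatedness label.
kaggle = pd.read_csv("kaggle_disaster_tweets.csv")
crisislex = pd.read_csv("crisislex_tweets.csv")

# Sample 5000 random disaster-unrelated tweets and mark them as class 0.
non_relevant = (crisislex[crisislex["label"] == "not_related"]
                .sample(n=5000, random_state=0)
                .assign(target=0)[["text", "target"]])

# Merge with the Kaggle dataset to form the Synthetic binary dataset.
synthetic_binary = pd.concat(
    [kaggle[["text", "target"]], non_relevant], ignore_index=True)
print(synthetic_binary["target"].value_counts())
```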
Table 1: Detailed synthesis of the Synthetic multi-class dataset (counts per source dataset).

Disaster category    Disaster eyewitness   HumAID   CrisisLex   Volcano er.   Kaggle   Total
earthquake                          3980     2015           -             -        -    5980
flood                               3980     1302           -             -        -    5282
hurricane                           3940     1654           -             -        -    5594
wildfire                            1964     3757           -             -        -    5721
tornado                                -        -        4172             -        -    4172
explosion                              -        -        4239             -        -    4239
volcano eruption                       -        -           -          2516        -    2516
general disasters                      -        -           -             -     3271    3271
Total                              13864     8728        8411          2516     3271   36775
3.3 Multi-class Classification of Tweets into Predefined Disaster Categories
All the data sources that were combined to create the new ensemble dataset for this task are listed in Section 3.1. The final dataset contains 36775 tweets and is henceforth referred to as the Synthetic multi-class dataset. The full synthesis of this dataset is given in Table 1 and visualised in Figure 1.
Figure 1: Visual synthesis of the Synthetic multi-class dataset.
Some tweet categories refer to specific natural disaster cases. An interesting addition is the general disaster class, which was filled with the tweets from the Kaggle dataset that are labelled as disasters. This choice was made because most of these tweets do not correspond to the other predefined categories. Those that do refer to predefined disaster categories, or that refer to non-natural disasters such as shootings, are regarded as noise that can help with model generalisation.
3.4 Preprocessing
Most textual data extracted from social media is unstructured, as it typically contains colloquialisms, HTML tags, emojis, scripts, hashtags, links and advertisements, which makes it difficult to separate the main text from all this peripheral noise (Baldwin et al., 2013). Generally, allowing these non-word entities to remain in the text increases the dimensionality of the unseen vocabulary. This leads to an excessive complication of text classification, as each non-word entity is treated as an individual dimension by the machine (Kumar and Dhinesh Babu, 2019). A typical list of preprocessing tasks therefore includes the following: HTML tag removal, URL and link removal, emoji removal, mention removal, named entity removal, stop word removal, lemmatization, stemming, lowercasing and punctuation removal (Anandarajan et al., 2019). Due to the idiosyncratic textual nature of tweets, the pipeline used here also includes hashtag removal.
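For concreteness, a minimal sketch of such a cleaning step is shown below; the regular expressions are illustrative assumptions, not the exact patterns used in the experiments.

```python
import re
import string

# Illustrative patterns for the removal steps listed above.
URL_RE     = re.compile(r"https?://\S+|www\.\S+")
HTML_RE    = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
EMOJI_RE   = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_tweet(text: str) -> str:
    """Remove URLs, HTML tags, mentions, hashtags and emojis,
    then lowercase, strip punctuation and collapse whitespace."""
    for pattern in (URL_RE, HTML_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(clean_tweet("Huge #earthquake hits! Details: https://t.co/xyz @user"))
# -> "huge hits details"
```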
However, it has been argued (Uysal and Gunal, 2014) that carefully choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, can boost the effectiveness of classification depending on the domain and language. For this reason, some preliminary experiments were conducted (see Section 5) to capture the combination of preprocessing tasks that improves the results of the classification tasks the most.
3.5 Bias Mitigation
Due to the nature of the information collected from social media platforms and the scope of the task, the collected data includes positional biases, as it contains information on the location of the disastrous event. With this in mind, and knowing that models trained on such a dataset internalise biases with respect to certain case-specific words or expressions (Garrido-Muñoz et al., 2021), an additional experiment was conducted concerning the application of bias mitigation methods on the synthetic dataset, in order to evaluate their effect on the multi-class classification task.
Positional bias concerns the inclusion of location information within a text. For example, all the tweets referring to the tornado predefined disaster category correspond to a single tornado case, that of the 2013 Oklahoma tornado. To avoid this potential bias problem, case-specific terms need to be disassociated from the target disaster classes. Inspired by previous attempts at bias mitigation (Dixon et al., 2018; Murayama et al., 2021), a relevant technique is applied to the input data before feeding it into the models.
The most important named entities are recognised and replaced by special code tokens within the text with the use of spaCy. The entity types that are replaced are the following: people, nationalities, buildings and facilities, companies, agencies, institutions and locations.
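A minimal sketch of this substitution step using spaCy follows, assuming the small English model is installed; the label set and the bracketed token format are illustrative assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

# spaCy labels roughly covering the entity types listed above:
# people, nationalities, buildings/facilities, organisations, locations.
REPLACED_LABELS = {"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC"}

def mask_entities(text: str) -> str:
    """Replace selected named entities with special code tokens."""
    doc = nlp(text)
    # Substitute from the end so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in REPLACED_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

print(mask_entities("Oklahoma will stand strong after the explosion"))
# e.g. -> "[GPE] will stand strong after the explosion"
```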
4 METHODS
This section presents in detail the methodology behind all the experiments performed.
Logistic Regression is used to establish a reasonable machine learning baseline against the deeper methods. The TF-IDF method is employed for token vectorization.
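A minimal sketch of this baseline with scikit-learn is shown below; the vectorizer settings and toy data are illustrative assumptions, not the exact configuration used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF token vectorization feeding a Logistic Regression classifier.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))

# Toy training data: 1 = disaster, 0 = non relevant.
texts = ["forest fire near la ronge sask canada",
         "earthquake shakes the city centre",
         "i love fruits and sunny days",
         "my week has been so busy"]
labels = [1, 1, 0, 0]

baseline.fit(texts, labels)
print(baseline.predict(["huge fire spreading near the city"]))  # -> [1] (likely)
```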
Next, the three custom shallow networks evaluated are presented: a feedforward neural network, a convolutional neural network and a Bi-directional Long Short-Term Memory network. First, as input to the networks, the tweet word sequences are tokenized by a Keras text vectorization function which creates a vocabulary of 12000 words from the dataset. For each network, a broad hyperparameter grid search is performed by trying a variety of hyperparameter configurations and recording the best final result. The variables of the hyperparameter search grid are the input embedding dimensions (50-100-200), the total number of layers (2-3-4), the layer dimensions (16-32-64 for LSTM layers, 10-50-100 for dense layers) and the dropout rate (0-0.1-0.2). The invariant hyperparameters for all the training instances are: a maximum sequence length of 100, the Adam optimiser, and categorical or binary cross entropy as the loss function.
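The sweep described above can be expressed as a product over the grid axes; in this sketch, train_and_score is a hypothetical stand-in for building, training and validating one configuration.

```python
import itertools

def train_and_score(emb_dim, n_layers, layer_dim, dropout):
    # Hypothetical stand-in: in the actual experiments this would build
    # the network, train it and return the validation accuracy.
    return 0.0

grid = itertools.product(
    [50, 100, 200],   # input embedding dimensions
    [2, 3, 4],        # total number of layers
    [16, 32, 64],     # layer dimensions (LSTM; 10/50/100 for dense layers)
    [0.0, 0.1, 0.2])  # dropout rates

best_cfg = max(grid, key=lambda cfg: train_and_score(*cfg))
print(best_cfg)
```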
First, for the Feedforward Neural Network, the whole embedding input is passed through either a flattening layer or a global max pooling operation, and then through one or two standard dense layers with dropout. Following the same workflow, a custom network whose structural core is based on convolutional layer(s) is tried. The structure of the network after the embedding input is completed with a sequence of one or two convolutional and global max pooling layers, followed by a dense layer between the output of the last max pooling layer and the final output layer. The convolution dimensions are 128 5×5 filters with stride 2 in width and height. Additionally, a custom network whose structural core is based on a Recurrent Neural Network, and more specifically a long short-term memory layer, is tried. The structure of the network after the embedding input is completed with one or two BiLSTM layers with dropout, followed by a dense layer between the output of the last BiLSTM layer and the final output layer.
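A hedged Keras sketch of the BiLSTM variant is given below, using the invariant hyperparameters from the text (12000-word vocabulary, maximum sequence length 100, Adam, binary cross entropy); the specific dimensions shown correspond to one grid point, not necessarily the best one.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_bilstm(vocab_size=12000, max_len=100,
                 emb_dim=100, lstm_units=32, dropout=0.1):
    # Integer token ids produced by the Keras text vectorization step.
    inputs = layers.Input(shape=(max_len,), dtype="int64")
    x = layers.Embedding(vocab_size, emb_dim)(inputs)
    # One or two BiLSTM blocks with dropout, per the grid; two shown here.
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Bidirectional(layers.LSTM(lstm_units))(x)
    # Dense layer between the recurrent stack and the output layer.
    x = layers.Dense(50, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

build_bilstm().summary()
```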
For the Transformer models, a variety of architectures and variations are tried, all of which are presented in Section 2. All the Transformer models were trained for 5 epochs with a 1e-5 learning rate and Adam as the optimiser, with an early stopping strategy.
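A sketch of this fine-tuning setup with the Hugging Face Trainer could look as follows; train_ds and val_ds are stand-ins for tokenised dataset objects, the checkpoint name is one of the models listed in Section 2, and the Trainer applies AdamW by default.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# 5 epochs, 1e-5 learning rate, early stopping on the validation loss.
args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=5,
    learning_rate=1e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by the early stopping callback
)

trainer = Trainer(
    model=model, args=args,
    train_dataset=train_ds,  # stand-in: tokenised training split
    eval_dataset=val_ds,     # stand-in: tokenised validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```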
5 EXPERIMENTAL RESULTS
In this section, all our experimental results are presented, along with short comments and analysis.
5.1 Preprocessing
The aim of these experiments is to find out which preprocessing tasks perform best for each model category. To this end, three models were trained on data preprocessed in different ways, and the average accuracy over 3 runs was measured. The results can be seen in Table 2. The standard pipeline refers to all the steps mentioned in Section 3.4.
It appears that the set of preprocessing tasks that performs best differs for each model. Therefore, the following experiments all adhere to the preprocessing procedure most appropriate for the trained model. More specifically, the input data of the custom CNN network is preprocessed by the same pipeline indicated for the custom RNN. In a similar manner, the input data of all the Transformer models is preprocessed by the pipeline indicated for BERT base.
5.2 Bias Mitigation
Some examples of the output of a reference Transformer model (in this case BERT base) with and without the use of bias mitigation are shown in Table 3. Each sentence includes a specific case-specific bias, while semantically referring to a disaster type that does not correspond to this bias.
Table 2: Effect of applying different combinations of preprocessing steps on the results (mean accuracy from 3 identical experiments) of the binary classification task.

Preprocessing tasks                            Log. Regression   Custom RNN   BERT base
standard pipeline                                       0.7946       0.7812      0.8293
no stemming                                             0.7855       0.7931      0.8328
no lemmatization/stemming                               0.7839       0.8084      0.8263
no stopwords removal                                    0.7841       0.7788      0.8216
no lemmatization/stemming/stopwords removal             0.7783       0.7876      0.8208
Table 3: Effect of bias mitigation on disaster type classification prediction using the BERT model; the last two columns show the predicted class (normalized logit).

Test sentence                                                   Active bias       Ground truth   No bias mitigation         Bias mitigation
Pity such beautiful nature was destroyed by the fire #LaPalma   volcano eruption  wildfire       volcano eruption (0.379)   wildfire (0.345)
Oklahoma will stand strong after the explosion                  tornado           explosion      tornado (0.648)            explosion (0.255)
Huge earthquake in Alberta! our homes are destroyed             wildfire          earthquake     earthquake (0.268)         earthquake (0.292)
Greeces tourism at an all time high despite tragic earthquake   wildfire          earthquake     earthquake (0.213)         earthquake (0.241)
For example, the second sentence has an active bias towards the tornado class, because in the input dataset the word "Oklahoma" is encountered only in tweets referring to tornado cases. The biased word thus points to a class different from the ground truth, namely the tornado class, which poses a challenge to the model.
The model trained without bias mitigation is heavily affected by the biased words. Performing bias mitigation leads to correct predictions in the first two cases, while it raises the probability of the correct class in the last two. This bias mitigation technique therefore seems to steer the model in the right direction by dampening the effect of problematic biases in the dataset. For this reason, all the following experiments are performed on datasets that have been processed with this technique.
5.3 Classification Tasks
The classification results from all the methods tried for the binary classification of tweets into disaster and non relevant can be seen in Table 4 for all related datasets. The rows listing the Multi-class binary classification dataset represent a series of experiments that is explained in Section 5.4 below. The classification results from all the methods tried on the classification of tweets into predefined disaster categories can be seen in Table 5. In order to mitigate statistical randomness, all the experiments were run 3 separate times and their average results are reported.
5.4 Merging the Classification Tasks
Given the superior results of the Transformer models on the downstream task of classification into multiple classes, it is interesting to find out whether these models can also perform binary classification into disaster and non relevant tweets accurately, even though they have been trained for another task. This way, a conclusion can be drawn as to whether it is worth merging the two tasks and solving both without having to train a model for each task separately.

More specifically, a new class of 9897 tweets non relevant to disasters is added to the Synthetic multi-class dataset under the new class non relevant. The pre-existing eight disaster type classes are then mapped to the disaster class, as sketched below. Thus, along with the disaster type classification results, we examine whether this class correspondence can accurately yield a simultaneous binary classification of tweets into the disaster and non relevant classes. The corresponding binary classification scores are in the last rows of Table 4.
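The correspondence between the nine-way output and the binary task reduces to a simple mapping; the label strings below are illustrative.

```python
# The eight disaster type classes all collapse to "disaster".
DISASTER_CLASSES = {"earthquake", "flood", "hurricane", "wildfire",
                    "tornado", "explosion", "volcano eruption",
                    "general disasters"}

def to_binary(predicted_class: str) -> str:
    return "disaster" if predicted_class in DISASTER_CLASSES else "non relevant"

print(to_binary("wildfire"))      # -> disaster
print(to_binary("non relevant"))  # -> non relevant
```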
6 DISCUSSION
In this section, observations and conclusions from the
two tables of results are commented on and analysed.
Table 4: Results from the binary classification of tweets into 'disaster' and 'non relevant'.

Model/Method      Dataset              Acc.     Precision   Recall   F1 micro   F1 macro   F1 weighted
Log. regression   Kaggle               0.7946      0.8220   0.6662     0.7946     0.7839        0.7909
Custom FFN        Kaggle               0.7832      0.7734   0.7893     0.7832     0.7757        0.7793
Custom CNN        Kaggle               0.7975      0.7785   0.7791     0.7975     0.7784        0.7841
Custom RNN        Kaggle               0.8084      0.7955   0.7936     0.8084     0.7848        0.7953
BERT base         Kaggle               0.8328      0.8036   0.8066     0.8328     0.8294        0.8329
Albert            Kaggle               0.8334      0.8131   0.7931     0.8334     0.8293        0.8331
DistilBERT        Kaggle               0.8143      0.8567   0.6822     0.8143     0.8042        0.8104
RoBERTa base      Kaggle               0.8355      0.8711   0.7688     0.8355     0.8299        0.8351
RoBERTa large     Kaggle               0.8374      0.8591   0.7764     0.8374     0.8283        0.8363
XLM-RoBERTa       Kaggle               0.8292      0.8236   0.7647     0.8292     0.8238        0.8282
DeBERTa base      Kaggle               0.8341      0.8156   0.7921     0.8341     0.8231        0.8277
BERT base         Synthetic binary     0.8293      0.8099   0.8036     0.8293     0.8270        0.8272
Albert            Synthetic binary     0.8353      0.8183   0.7977     0.8353     0.8250        0.8295
DistilBERT        Synthetic binary     0.8341      0.9165   0.7133     0.8341     0.8256        0.8287
RoBERTa base      Synthetic binary     0.8373      0.8731   0.7964     0.8373     0.8367        0.8397
RoBERTa large     Synthetic binary     0.8399      0.8678   0.8049     0.8399     0.8359        0.8381
DeBERTa base      Synthetic binary     0.8404      0.8532   0.8093     0.8404     0.8326        0.8308
XLM-RoBERTa       Synthetic binary     0.8232      0.8256   0.7122     0.8232     0.8194        0.8235
BERT base         Multi-class binary   0.7033      0.6999   0.7012     0.7033     0.7021        0.7029
Albert            Multi-class binary   0.6911      0.6910   0.6921     0.6911     0.6896        0.6908
DistilBERT        Multi-class binary   0.6989      0.6943   0.6948     0.6989     0.6965        0.6971
RoBERTa base      Multi-class binary   0.7067      0.7048   0.7055     0.7067     0.7056        0.7060
RoBERTa large     Multi-class binary   0.7061      0.7056   0.7051     0.7061     0.7046        0.7062
XLM-RoBERTa       Multi-class binary   0.7053      0.7031   0.7022     0.7053     0.7038        0.7041
Table 5: Results from classification of tweets into predefined disaster categories on the Synthetic multi-class dataset.

Model/Method    Accuracy   Precision   Recall   F1 micro   F1 macro   F1 weighted
Custom FFN        0.9061      0.8883   0.8912     0.9061     0.8903        0.8908
Custom CNN        0.9039      0.9045   0.9039     0.9039     0.9013        0.9033
Custom RNN        0.9074      0.9054   0.9088     0.9074     0.9066        0.9057
BERT base         0.9222      0.9234   0.9215     0.9222     0.9205        0.9210
Albert            0.9191      0.9195   0.9188     0.9191     0.9175        0.9178
DistilBERT        0.9176      0.9199   0.9167     0.9176     0.9184        0.9180
RoBERTa base      0.9271      0.9274   0.9267     0.9271     0.9273        0.9269
XLM-RoBERTa       0.9252      0.9237   0.9283     0.9252     0.9250        0.9251
DeBERTa base      0.9243      0.9174   0.9212     0.9243     0.9251        0.9254
6.1 Binary Classification of Tweets Into
’disaster’ and ’non relevant’
Several interesting observations can be made from Table 4. For the machine learning method, Logistic Regression, the recall value of 0.66 is poor compared to the others, meaning that this method often classifies disaster tweets as non-relevant. The rest of its scores are generally lower than those of the other methods, although comparable. The overall decent performance of such a shallow method suggests that the semantic and grammatical language features that distinguish disaster tweets from non-relevant ones are salient enough to be captured reasonably well by simpler methods. The custom networks all perform slightly worse than the Transformer models. As for the Transformer models, the qualitative difference in their results compared to the other methods is significant, while the performance of all of them is comparable. The DeBERTa model appears to perform better than the rest overall.
Finally, the binary classification results when including the non-disaster class in the multi-class setting show a significant drop in performance in comparison with the previous methods. According to post-experiment analysis, all the models generally have difficulty distinguishing between non-relevant and general disaster tweets. For reference, of the approximately 36% of non-relevant tweets that are misclassified by BERT base, 54% are assigned to the general disaster class. Likewise, of the approximately 35% of general disaster tweets that are misclassified by BERT base, 43% are classified as non-relevant.
6.2 Classification of Tweets Into
Predefined Disaster Categories
The results of the custom networks differ only slightly from those of the deeper Transformer-based methods; nevertheless, the Transformers outperform the previously tested custom neural networks, as expected. The best Transformer model in terms of results is the RoBERTa base model.
7 CONCLUSION
In this paper we perform an evaluation of Deep Learning methods on two text classification tasks. The first task classifies tweets into disaster-related and non-relevant classes, and the second classifies disaster tweets into predefined disaster categories. The combination of preprocessing steps that enables each model to learn best is identified, showing that the omission of certain typical preprocessing steps can sometimes lead to better downstream classification. It is also shown that mitigating bias through named entity substitution in the input datasets is an effective strategy when data sources are limited. The three shallow custom neural networks (feedforward, convolutional and recurrent) perform well on both tasks. As expected, the Transformer models outperform the previous methods by a considerable margin. The best overall results are achieved by the DeBERTa model for the first task and RoBERTa base for the second. Finally, merging the two tasks is not fruitful, as it gives markedly poorer results.
The results obtained from the experiments have the potential to be used in practice, showing a capacity to effectively perform automatic disaster detection from social media in the service of disaster relief organizations. Future work could include the acquisition of more diverse datasets through manual annotation, and the application of more sophisticated bias mitigation and measurement techniques, as demonstrated in (Dixon et al., 2018).
ACKNOWLEDGEMENTS
This research was supported by grants from Horizon 2020, the European Union's Programme for Research and Innovation, under grant agreement No. 870373 - SnapEarth, grant agreement No. 101004594 - ETAPAS and grant agreement No. 101037648 - SOCIOBEE. This paper reflects only the authors' view and the Commission is not responsible for any use that may be made of the information it contains.
REFERENCES
Alam, F., Qazi, U., Imran, M., and Ofli, F. (2021). Hu-
maid: Human-annotated disaster incidents data from
twitter with deep learning benchmarks. In ICWSM,
pages 933–942.
Anandarajan, M., Hill, C., and Nolan, T. (2019). Text pre-
processing. In Practical Text Analytics, pages 45–59.
Springer.
Baldwin, T., Cook, P., Lui, M., MacKinlay, A., and Wang,
L. (2013). How noisy social media text, how diffrnt
social media sources? In Proceedings of the Sixth
International Joint Conference on Natural Language
Processing, pages 356–364.
Chaffey, D. (2016). Global social media research summary
2016. Smart Insights: Social Media Marketing.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Cortiz, D. (2021). Exploring transformers in emotion recog-
nition: a comparison of bert, distillbert, roberta, xlnet
and electra. arXiv preprint arXiv:2104.02041.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman,
L. (2018). Measuring and mitigating unintended bias
in text classification. In Proceedings of the 2018
AAAI/ACM Conference on AI, Ethics, and Society,
pages 67–73.
Garrido-Muñoz, I., Montejo-Ráez, A., Martínez-Santiago, F., and Ureña-López, L. A. (2021). A survey on bias in deep nlp. Applied Sciences, 11(7):3184.
He, P., Liu, X., Gao, J., and Chen, W. (2020). Deberta:
Decoding-enhanced bert with disentangled attention.
arXiv preprint arXiv:2006.03654.
Huang, Q. and Xiao, Y. (2015). Geographic situational
awareness: mining tweets for disaster preparedness,
emergency response, impact, and recovery. ISPRS
International Journal of Geo-Information, 4(3):1549–
1568.
Kumar, A., Singh, J. P., and Saumya, S. (2019). A com-
parative analysis of machine learning techniques for
disaster-related tweet classification. In 2019 IEEE
R10 Humanitarian Technology Conference (R10-
HTC)(47129), pages 222–227. IEEE.
Kumar, P. and Dhinesh Babu, L. (2019). Novel text prepro-
cessing framework for sentiment analysis. In Smart
intelligent computing and applications, pages 309–
317. Springer.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma,
P., and Soricut, R. (2019). Albert: A lite bert for
self-supervised learning of language representations.
arXiv preprint arXiv:1909.11942.
Li, H., Caragea, D., Caragea, C., and Herndon, N. (2018).
Disaster response aided by tweet classification with a
domain adaptation approach. Journal of Contingen-
cies and Crisis Management, 26(1):16–27.
Liu, J., Singhal, T., Blessing, L. T., Wood, K. L., and Lim,
K. H. (2021). Crisisbert: a robust transformer for cri-
sis classification and contextual crisis embedding. In
Proceedings of the 32nd ACM Conference on Hyper-
text and Social Media, pages 133–141.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach. arXiv preprint arXiv:1907.11692.
Murayama, T., Wakamiya, S., and Aramaki, E. (2021).
Mitigation of diachronic bias in fake news detection
dataset. arXiv preprint arXiv:2108.12601.
Neri, F., Aliprandi, C., Capeci, F., Cuadros, M., and By, T.
(2012). Sentiment analysis on social media. In 2012
IEEE/ACM international conference on advances in
social networks analysis and mining, pages 919–926.
IEEE.
Nguyen, D. T., Al Mannai, K. A., Joty, S., Sajjad, H., Im-
ran, M., and Mitra, P. (2017). Robust classification
of crisis-related data on social networks using con-
volutional neural networks. In Eleventh international
AAAI conference on web and social media.
Nikolov, A. and Radivchev, V. (2019). Nikolov-radivchev
at semeval-2019 task 6: Offensive tweet classification
with bert and ensembles. In Proceedings of the 13th
international workshop on semantic evaluation, pages
691–695.
Olteanu, A., Castillo, C., Diaz, F., and Vieweg, S. (2014).
Crisislex: A lexicon for collecting and filtering mi-
croblogged communications in crises. In Eighth in-
ternational AAAI conference on weblogs and social
media.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019).
Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter. arXiv preprint arXiv:1910.01108.
Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O.
(2020). Green ai. Communications of the ACM,
63(12):54–63.
Strubell, E., Ganesh, A., and McCallum, A. (2019). En-
ergy and policy considerations for deep learning in
nlp. arXiv preprint arXiv:1906.02243.
Uysal, A. K. and Gunal, S. (2014). The impact of prepro-
cessing on text classification. Information processing
& management, 50(1):104–112.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Zahra, K., Imran, M., and Ostermann, F. O. (2020). Auto-
matic identification of eyewitness messages on twitter
during disasters. Information processing & manage-
ment, 57(1):102107.
Zhou, B., Zou, L., Mostafavi, A., Lin, B., Yang, M.,
Gharaibeh, N., Cai, H., Abedin, J., and Mandal, D.
(2022). Victimfinder: Harvesting rescue requests in
disaster response from social media with bert. Com-
puters, Environment and Urban Systems, 95:101824.