Language Identification for Short Medical Texts
Erick Velazquez Godinez, Zoltán Szlávik, Selene Baez Santamaría and Robert-Jan Sips
Artificial Intelligence Department, myTomorrows, Anthony Fokkerweg 61 1059CP, Amsterdam, The Netherlands
Keywords:
Language Detection, Sentence Embedding, Graphotactics, Linguistic Knowledge.
Abstract:
Language identification remains a challenge for short texts originating from social media. Moreover, domain-specific terminology, which is frequent in the medical domain, often does not change cross-linguistically, making language identification even more difficult. We conducted language identification on four datasets, two containing general language and two containing medical language. We evaluated the impact of two embedding representations and of a set of linguistic features based on graphotactics. The proposed linguistic features reflect the graphotactics of the languages included in the test data. For classification, we implemented two algorithms: random forest and SVM. Our findings show that, when classifying general language, the linguistic features perform close to the fastText and BERT embedding representations. However, when classifying text with technical terms, the linguistic features outperform the embedding representations. Combining embeddings with linguistic features had a positive impact on the classification task in both settings. Our results therefore suggest that these linguistic features can be applied to both large and small datasets while maintaining good performance on general as well as medical language. As future work, we want to test the linguistic features on a larger set of languages.
1 INTRODUCTION
Language identification (LI) is the task of determining the language in which a piece of text is written (Wehrmann et al., 2018). Often, Natural Language Processing (NLP) tools and techniques assume that the language of a text is already known, since their performance is language-dependent (Jauhiainen et al., 2019). Therefore, automatic language identification¹ is the first step in any NLP pipeline to ensure that the appropriate language model is used.
¹ The terms Language Identification (LI), Automatic Language Identification (ALI), and Language Detection are used to describe the same task. In this paper, we use the term LI, implying that the task is conducted by an artificially intelligent agent.
Language identification is considered a solved problem, as stated in (Jauhiainen et al., 2019); however, the authors also highlight existing challenges. One of them is related to short texts, as these may not contain enough information to determine the language. Moreover, LI is complicated when dealing with closely related languages (Molina et al., 2016), e.g. Portuguese-Spanish-Italian or Danish-Norwegian-Swedish. Finally, LI can be complicated in the case of domain-specific language: words in medicine, for example, are composed of roots from Latin and Greek, which have the advantage of being precise and unchanging. These characteristics make them hold the same meaning across different languages (Holt et al., 1998).
Nowadays, social media platforms are widely
used to share, discuss, and seek health information
(Pershad et al., 2018). For instance, data from Twitter
may be used to detect adverse drug reactions, which
a medical practitioner may want to know, or be obli-
gated to report, with the purpose of preventing unnec-
essary risks (Manousogiannis et al., 2019). Further-
more, multilingualism, code-switching, dialects, dif-
ferences between patient and health care professional
(HCP) language, and privacy considerations accentu-
ate the need for effective methods that detect the lan-
guage in short medical texts.
In this paper, we address LI of tweets for nine Eu-
ropean languages, using four datasets covering both
the general and medical domains. We compare established LI methods with novel ones based on multilingual embeddings, as well as on linguistic features derived from graphotactics. We apply two different classifiers on top of these features.
2 RELATED WORK
Since 2012, langid.py (Lui and Baldwin, 2012) has been established as a standard baseline for language identification. It is a multinomial Naive Bayes classifier based on an n-gram character model. More recently, language identification has been conducted with Recurrent Neural Networks (Wehrmann et al., 2018). Language identification is also available as a dedicated module in fastText, a commonly used tool for text classification with word embeddings (Joulin et al., 2016).
Recently, word embeddings have been successfully applied in different NLP tasks (text classification (Joulin et al., 2016), next sentence prediction (Devlin et al., 2018), and sentiment analysis (Wehrmann et al., 2018; Yin and Jin, 2015)). Several multilingual word embeddings have been proposed via cross-lingual transfer; see (Ruder et al., 2019) for a complete survey. Two widely used multilingual word embeddings are fastText (Joulin et al., 2016) and BERT (Devlin et al., 2018). Adaptations of BERT for the medical domain have been proposed recently in (Lee et al., 2019) and (Alsentzer et al., 2019), but none of these are multilingual.
Before word embedding models, traditional n-gram character models were used for language identification at the word level (Barman et al., 2014), and n-gram character models are also at the core of word embedding creation (Joulin et al., 2016; Devlin et al., 2018). Concerning n-grams, it has been argued in (Bender, 2009) that n-gram-based models are, in fact, language-dependent, because they perform well for languages that share similar typological properties. N-gram strategies process text as a sequence of symbols and can reveal structure through the distribution of words in various contexts. In English-like languages, the effectiveness of n-gram-driven models rests mostly on two properties: relatively low inflectional morphology and relatively fixed word order. Conversely, languages with more complex morphology present more sparsity (more inflected forms per lemma), which limits the ability of n-gram models to capture dependencies between open and closed word classes. Thus, n-gram models would not work well for language identification when confronting similar languages and dialects.
In their survey of language identification, (Jauhiainen et al., 2019) report SVMs and Conditional Random Fields as the most used algorithms for LI between 2016 and 2017. In (Mandal et al., 2018), the authors compared the performance of an SVM model and an LSTM model for language identification in tweets that present code-switching. They presented two neural models based on character and phonetic representations that were combined using stacking and thresholding techniques. However, in order to create the phonetic representation of a text, the language must already be known, since a character can encode different sounds cross-linguistically (De Saussure, 1989). For example, the letter v represents the sound [b] in standard Spanish, [v] in English, and [f] in German. The phonetic characteristics of each language seem to be a good way to tackle language identification, because every language has its unique set of sounds and rules to combine them. The way a language combines sounds is called phonotactics. A sequence of sounds may be licit in one language while being illicit in another; phonotactic restrictions vary from language to language (Zsiga, 2012). Since it is not possible to access the phonetic representation of a text without first knowing its language, we turn to the concept of graphotactics. Like its phonetic counterpart, graphotactics implies that every language has its own way of restricting the combinations of characters (Coulmas, 2003).
In this paper, we propose a set of patterns that aim to capture the graphotactic restrictions of languages. The principal advantage of the graphotactic patterns is that they are an aggregated version of n-gram models: a single graphotactic pattern can match strings of varying length, whereas an n-gram model needs to vary the value of n to capture the same patterns. Some related works attempted to use grapheme-phoneme information for language detection at the word level. For example, in (Giwa and Davel, 2014) the authors replaced each character of a word with a language identifier, and this information was then used to predict the language of origin of a new word. For example, the word "#queen#" would be represented as E E E E E, where # denotes the start and the end of a word and E is the language identifier for English. Also, in (Nguyen and Cornips, 2016) the authors used subword information (morphs or morpheme-like units) to detect code-switching within words of Limburgish, a variant of Dutch. Morphemes are units that convey meaning; they can be a word, a root, a prefix, a suffix, etc. (Zsiga, 2012).
To the best of our knowledge, LI research has not yet focused specifically on medical-domain texts, and thus we need to rely on other domains for inspiration to find out what works for medical LI.
3 METHODOLOGY
Our methodology consists of the following components: i) obtaining or creating datasets for both the general and the medical domains, ii) extracting features for modelling, iii) training classifiers using these features, and iv) comparing models built using combinations of dataset, feature, and classifier choices.
3.1 Datasets
For our analysis, we used four datasets: the first two
datasets contain tweets in general language; the third
one contains tweets with medical terms, and it was au-
tomatically annotated. We created the fourth dataset
from the tweets where the automatic annotation was
uncertain. This last dataset was manually annotated
by two experts. We focused on the identification of
nine European languages: Danish, Swedish, English,
Dutch, German, Portuguese, Spanish, French, and
Italian. Table 1 summarizes the distribution of languages in the four datasets. As shown in Table 1, the prevalence of the languages varies in each dataset.
The General Topic Dataset (GTD): The first dataset is on general topics, and it was built by the Twitter team to test language identification algorithms. Twitter provides three datasets for language identification evaluation (https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html): the recall dataset (RD), the precision dataset (PD), and the uniformly sampled dataset (UD). The datasets consist of tweet IDs with tags encoding their corresponding language. When collecting the complete tweets, we found that a considerable number of them were no longer available. We merged the RD and PD datasets and filtered the tweets to keep only those in the languages that we targeted. We were able to retrieve 10,420 tweets for the nine languages.
Table 1: Language distribution for the four datasets.
Language GTD TPD MTD SCD
Danish 391 0 106 0
Swedish 420 0 595 0
English 2560 1505 5058 153
Dutch 939 1430 680 0
German 1141 1479 870 0
Portuguese 1101 0 1556 13
Spanish 1727 1562 2298 32
French 1038 1551 1638 22
Italian 1103 1539 557 6
Total 10420 9066 13358 226
The (Tromp and Pechenizkiy, 2011) Dataset (TPD): This balanced dataset is composed of six languages: German, Dutch, English, Spanish, French, and Italian. While creating the dataset, (Tromp and Pechenizkiy, 2011) removed the messages containing multiple languages or bilingual terminology. User names and hashtags were also removed. The authors explain in the distribution file that several tweets belong to the same user, but the tweets were anonymized. While the decision to remove multilingual tweets may have helped to improve language detection in short monolingual text, it does not address the problem of language detection in multilingual tweets. This is important, since around 20% of the tweets are multilingual. Our interest in using this corpus is to test our proposed features in an ideal scenario and then see how they perform on noisier datasets. We therefore expect the best performance on this dataset compared to the rest.
The Medical Topic Dataset (MTD): To collect more specialized language on Twitter, we implemented the following strategy. One of our in-house medical experts gave us a list of the ten most common diseases in colloquial and medical language. The list was provided originally in English. For this list, we extracted the UMLS (Unified Medical Language System) Concept Unique Identifiers (CUI) and used them to find corresponding terms in the remaining languages using Wikidata (https://www.wikidata.org), as sketched below. We were able to expand the original list of English terms to include terms in all nine languages; however, the distribution of these terms is unbalanced, so the dataset resulting from retrieving tweets with this list was unbalanced as well. The columns MTD and SCD in Table 1 show the distribution of tweets in each language. We collected the tweets in two periods: the first includes tweets from the last two weeks of August 2019, and the second includes tweets from the first two weeks of September 2019. We excluded retweets to avoid classifying the same tweet twice. Originally, we collected 18483 tweets; after eliminating duplicates and retweets, the final dataset includes 137984 tweets.
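A minimal sketch of how the term expansion step could be queried, assuming the public Wikidata SPARQL endpoint and its UMLS CUI property (P2892); the example CUI (intended to correspond to asthma), the User-Agent string, and the function name are illustrative choices, not the authors' exact implementation.

import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
LANGS = ["da", "sv", "en", "nl", "de", "pt", "es", "fr", "it"]

def terms_for_cui(cui: str):
    # Ask Wikidata for the labels, in the nine target languages, of the item carrying this UMLS CUI
    lang_filter = ", ".join(f'"{l}"' for l in LANGS)
    query = f"""
    SELECT ?lang ?label WHERE {{
      ?item wdt:P2892 "{cui}" .
      ?item rdfs:label ?label .
      BIND(LANG(?label) AS ?lang)
      FILTER(?lang IN ({lang_filter}))
    }}"""
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "li-medical-texts-sketch/0.1"})
    resp.raise_for_status()
    return {row["lang"]["value"]: row["label"]["value"]
            for row in resp.json()["results"]["bindings"]}

print(terms_for_cui("C0004096"))   # e.g. {'en': 'asthma', 'es': 'asma', ...}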
Special Cases Dataset (SCD): To label the language of each tweet, we first used the language tag provided by Twitter as a candidate language. We then ran the Amazon Comprehend language identifier to detect the language. If the two tags agreed for a given tweet, we kept this as the final tag; when they disagreed, we selected the tweet for manual language annotation. The kappa coefficient between the two automatic annotators was 0.9648 (see the sketch at the end of this section). From the original number of tweets, we obtained 440 tweets to be manually annotated.
The annotation task consisted of 1) identifying the language of each tweet, and 2) reporting whether a tweet is written in more than one language. The final dataset contains the 231 tweets on which the annotators agreed. We discuss the annotation process in more detail in Section 4.
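A minimal sketch of the agreement computation between the two automatic annotators (Twitter tag vs. Amazon Comprehend), assuming scikit-learn; the label lists shown are illustrative, not the actual data.

from sklearn.metrics import cohen_kappa_score

twitter_tags    = ["en", "es", "fr", "en", "nl"]
comprehend_tags = ["en", "es", "fr", "pt", "nl"]

# Cohen's kappa over the two label sequences; the paper reports 0.9648 on the full MTD
kappa = cohen_kappa_score(twitter_tags, comprehend_tags)
print(f"kappa = {kappa:.4f}")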
3.2 Feature Creation
We used two different kinds of features: first, sentence embeddings, and second, linguistic features derived from the text. We used the embeddings and the linguistic features as inputs to two different classifiers in different configurations.
Embeddings: Embeddings have obtained good results for language identification and code-switching (Wehrmann et al., 2018; Xia, 2016). We used two pre-trained models: fastText (Joulin et al., 2016) and BERT (Devlin et al., 2018). We based our selection of tools on their ability to generate embeddings for multilingual texts.
FastText generates embeddings at the word and sentence level in a 16-dimensional space. We decided to use only the sentence embeddings, since we are interested in detecting the language at the tweet level. We then used two BERT models: BERT-Base multilingual cased and BERT-Base multilingual uncased. At the moment, only a base version of the multilingual model exists and no large version, as is the case for the monolingual versions of BERT. To create the sentence embedding, we used the concatenation of the last four layers of BERT, because performance has been reported to be better with this configuration (Devlin et al., 2018).
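A minimal sketch of how the two sentence representations could be built. Assumptions not stated in the text: the fastText LID model "lid.176.bin" (which matches the 16-dimensional space mentioned above) and the HuggingFace "bert-base-multilingual-cased" checkpoint are used, and BERT token vectors are mean-pooled into a sentence vector.

import fasttext
import torch
from transformers import AutoTokenizer, AutoModel

ft_model = fasttext.load_model("lid.176.bin")              # 16-dim fastText LID embeddings

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                 output_hidden_states=True)

def fasttext_sentence_vector(text: str):
    # fastText averages its subword/word vectors into a single sentence vector
    return ft_model.get_sentence_vector(text.replace("\n", " "))

def bert_sentence_vector(text: str):
    # Concatenate the last four hidden layers, then mean-pool over tokens
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = bert(**inputs).hidden_states        # tuple of 13 layers for BERT-Base
    last_four = torch.cat(hidden_states[-4:], dim=-1)       # (1, seq_len, 4*768)
    return last_four.mean(dim=1).squeeze(0)                 # (3072,)

print(fasttext_sentence_vector("La fiebre no baja").shape)  # (16,)
print(bert_sentence_vector("La fiebre no baja").shape)      # torch.Size([3072])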
Linguistic Features: As stated in (Bender, 2009), n-gram models seem to be language-dependent. We wanted to create a language-independent strategy that is still able to capture the graphotactics of each language. We created regular expressions that capture combinations of characters under certain restrictions. These restrictions aim to capture groups of consonants, groups of vowels, groups of consonants followed by a vowel, groups of vowels followed by a consonant, etc., and can easily reflect the graphotactics of a language. While the patterns themselves are language-independent, they can be applied to any text to retrieve the combinations of graphemes used by each language. The complete list of these features and their descriptions is given in Table 2. We use a TF-IDF strategy to count the incidence of the patterns, as illustrated in the sketch below. We kept all diacritics and upper-case characters of the tweets. These patterns can occur at any position within a token.
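A minimal sketch of the graphotactic features, assuming the patterns of Table 2 are implemented as regular expressions over consonant (C) and vowel (V) character classes; the exact character classes and the TF-IDF settings are illustrative choices, not the authors' exact implementation.

import re
from sklearn.feature_extraction.text import TfidfVectorizer

VOWELS = "aeiouyàáâäãåèéêëìíîïòóôöõùúûüœæø"
CONSONANTS = "bcdfghjklmnpqrstvwxzçñß"
C = f"[{CONSONANTS}{CONSONANTS.upper()}]"
V = f"[{VOWELS}{VOWELS.upper()}]"

PATTERNS = {                       # the eight patterns of Table 2
    "C{2,}":      rf"{C}{{2,}}",
    "C{2,}V":     rf"{C}{{2,}}{V}",
    "VC{2,}":     rf"{V}{C}{{2,}}",
    "V{2,}":      rf"{V}{{2,}}",
    "V{2,}C":     rf"{V}{{2,}}{C}",
    "CV{2,}":     rf"{C}{V}{{2,}}",
    "V{2,}C{2,}": rf"{V}{{2,}}{C}{{2,}}",
    "C{2,}V{2,}": rf"{C}{{2,}}{V}{{2,}}",
}

def graphotactic_tokens(text: str):
    # Each match (e.g. "eeu", "rso") becomes one feature token; case and diacritics are preserved
    return [m for pat in PATTERNS.values() for m in re.findall(pat, text)]

vectorizer = TfidfVectorizer(analyzer=graphotactic_tokens)
X = vectorizer.fit_transform(["Tot ziens, tot de volgende keer!",
                              "A espera do resultado do exame"])
print(X.shape)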
Since the regular expressions generate many string patterns, the number of features was high. We used four different methods for feature selection (sketched after this list):
Variance-based: We fixed a threshold to remove features with low variance. The threshold was set to 0.001.
Univariate-based: The selection is based on univariate statistical tests. Since we are dealing with a very sparse matrix, we selected the χ² test, as suggested in the user guide of the scikit-learn library.
Linear model-based (L1): The selection is based on an L1-regularized regression method: the Least Absolute Shrinkage and Selection Operator (lasso).
Tree estimators: Tree-based methods are commonly used to compute feature importances.
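A minimal sketch of the four feature-selection strategies with scikit-learn. Note that for the L1 step we use an L1-penalized linear classifier (LinearSVC) as a stand-in for the lasso-style selection described above, and the k value for the univariate step is an illustrative choice.

from sklearn.feature_selection import (VarianceThreshold, SelectKBest, chi2,
                                       SelectFromModel)
from sklearn.svm import LinearSVC
from sklearn.ensemble import ExtraTreesClassifier

def select_features(X, y, strategy="tree"):
    if strategy == "variance":                      # remove low-variance features
        return VarianceThreshold(threshold=0.001).fit_transform(X)
    if strategy == "univariate":                    # chi-squared test, works on sparse TF-IDF
        return SelectKBest(chi2, k=1000).fit_transform(X, y)
    if strategy == "l1":                            # L1-regularized linear model
        model = LinearSVC(penalty="l1", dual=False, C=0.1)
        return SelectFromModel(model).fit_transform(X, y)
    if strategy == "tree":                          # importances from a tree ensemble
        model = ExtraTreesClassifier(n_estimators=100)
        return SelectFromModel(model).fit_transform(X, y)
    raise ValueError(f"unknown strategy: {strategy}")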
3.3 Classifiers for Language Identification
As baselines, we selected langid.py (Lui and Baldwin, 2012) and the language classifier module of fastText (Joulin et al., 2016). Both tools have recently been used in the context of LI for short texts (Wehrmann et al., 2018).
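A minimal sketch of how the two baselines can be queried, assuming langid.py with normalized probabilities and the fastText LID model "lid.176.bin" downloaded locally; both return a language code plus a confidence score.

import fasttext
from langid.langid import LanguageIdentifier, model

langid_identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
ft_lid = fasttext.load_model("lid.176.bin")

def baseline_predictions(text: str):
    # langid.py: (language, probability in [0, 1])
    lang1, conf1 = langid_identifier.classify(text)
    # fastText LID: top label and its probability
    labels, probs = ft_lid.predict(text.replace("\n", " "))
    lang2, conf2 = labels[0].replace("__label__", ""), float(probs[0])
    return (lang1, conf1), (lang2, conf2)

print(baseline_predictions("Hoofdpijn en koorts sinds gisteren"))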
We also used two different classifiers to test the impact of the features on language identification. As they are widely used in LI (Jauhiainen et al., 2019), we used the following two classifiers (see the sketch after this list):
A random forest with 500 trees and a maximum depth of 16,
An SVM multi-class classifier with a linear kernel.
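A minimal sketch of the two classifiers; the hyperparameters follow the text (500 trees, maximum depth 16, linear kernel), while the choice of scikit-learn's SVC (rather than another SVM implementation) and all remaining defaults are our assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

random_forest = RandomForestClassifier(n_estimators=500, max_depth=16)
svm = SVC(kernel="linear")     # multi-class handled internally (one-vs-one)

# X_train: feature matrix (embeddings, TLF, or their concatenation); y_train: language labels
# random_forest.fit(X_train, y_train); svm.fit(X_train, y_train)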
4 RESULTS
In this section, we provide an analysis of the performance of the three sets of features, i.e. fastText embeddings, BERT embeddings, and linguistic features, that we used for language identification on the four datasets, i.e. the GTD, the TPD, the MTD, and the SCD. In each case, we compare the performance of two classifiers, random forest and SVM. We decided to use precision, recall, and F-measure instead of accuracy, as accuracy cannot express where the classification fails. Table 3 shows the results for the two baselines. We use this table as a reference against which to compare the performance on each dataset with the features that we tested.
Next, we performed language identification on the general Twitter dataset (GTD), the Tromp and Pechenizkiy dataset (TPD) (Tromp and Pechenizkiy, 2011), the medical topic dataset (MTD), and the special cases dataset (SCD).
Table 2: Patterns created to capture the graphotactics of languages.
Pattern Description Example
C{2,} Consonant cluster stru- ggl -es
C{2,}V Consonant cluster followed by a vowel dive- rso -s
VC{2,} Consonant cluster preceded by a vowel g- olp -e
V{2,} Vowel cluster eeu -wen
V{2,}C Vowel cluster followed by a consonant famil- ial -es
CV{2,} Vowel cluster preceded by a consonant re- voi -r
V{2,}C{2,} Vowel cluster followed by a consonant cluster or- ient -e
C{2,}V{2,} Consonant cluster followed by a vowel cluster re- spei -to
Table 3: Performance of the baseline tools on the four datasets.
Dataset Langid.py fastText classifier
P R F1 P R F1
GTD 82.24 71.67 76.67 88.06 85.55 86.75
TPD 99.0 97.57 98.26 99.35 99.19 99.27
MTD 97.43 93.34 95.32 96.38 80.54 87.71
SCD 83.34 55.45 66.46 80.89 72.73 76.26
We evaluated the task using a weighted average of precision, recall, and F-measure; the weighting considers the number of true instances of each class (see the sketch below). We also decided to show only the results obtained with the tree-based selection strategy for the linguistic features, since classifier performance was best with this strategy in all cases. We organised the results in two tables: Table 4 for the random forest model and Table 5 for the SVM model.
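A minimal sketch of the weighted evaluation, assuming scikit-learn; the label sequences are illustrative.

from sklearn.metrics import precision_recall_fscore_support

y_true = ["en", "es", "es", "nl", "fr", "en"]
y_pred = ["en", "es", "pt", "nl", "fr", "es"]

# average="weighted": each class contributes in proportion to its number of true instances
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")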
4.1 The General Twitter Dataset: GTD
In Table 4, column GTD, we observe that the random forest classifier achieved the highest performance with the fastText embeddings (fT), with 84.4% F1. For the two other independent sets of features, BERT and the linguistic features with tree-based selection (TLF), the performance only reached around 53% F1. We observed that combining the BERT embeddings with either of the other two sets of features (fastText and TLF) had a positive impact on the performance of the classification; however, the combination of fastText and the TLF is slightly lower than fastText alone. We also noticed that none of the independent sets of features outperformed the baseline results. These results indicate that the fastText embedding representation encodes the language signal better than the BERT embeddings.
In Table 5, column GTD, we observe the results for the SVM classifier. As with the random forest classifier, the TLF outperformed the BERT embeddings; on this occasion, the increment was 7.42% in precision. We also conducted a statistical significance test based on approximate randomization (Noreen, 1989), sketched below. We tested whether the classification with the TLF performed better, and found that it is significantly better than with the BERT embeddings (p=0.004). The combination of the fastText embeddings (fT) and the TLF outperformed both baselines. Combining the embeddings with the TLF yields a larger increase in performance than combining the two embeddings (fT and BERT); we also tested this hypothesis, and the resulting p-value is 0.001. Our results are consistent with previous work in which the SVM classifier outperformed other classifiers for language identification (Eldesouki et al., 2016).
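A minimal sketch of the approximate randomization test (Noreen, 1989) underlying the significance claims above; the number of shuffles and the choice of weighted F1 as the test statistic are illustrative.

import random
from sklearn.metrics import f1_score

def approximate_randomization(y_true, pred_a, pred_b, trials=10000, seed=13):
    """p-value for H0: systems A and B perform equally well (weighted F1)."""
    observed = abs(f1_score(y_true, pred_a, average="weighted")
                   - f1_score(y_true, pred_b, average="weighted"))
    rng, hits = random.Random(seed), 0
    for _ in range(trials):
        a, b = list(pred_a), list(pred_b)
        for i in range(len(y_true)):            # randomly swap the two systems' outputs per item
            if rng.random() < 0.5:
                a[i], b[i] = b[i], a[i]
        diff = abs(f1_score(y_true, a, average="weighted")
                   - f1_score(y_true, b, average="weighted"))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)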
If we compare the performance of both algorithms on the GTD and the TPD datasets (see columns GTD and TPD in Tables 4 and 5), we see that the F1 measure is lower on the GTD. Our explanation is that the GTD contains multilingual tweets, while this is not the case for the TPD. When identifying the language, both baselines, langid.py and the fastText module, compute a confidence for each tweet, ranging from 0.0 (no confidence) to 1.0 (full confidence). We filtered the tweets in each dataset to see where the baseline tools had a confidence lower than 0.5 (see the sketch below). For the GTD, 20.6% of the tweets have a confidence below that threshold; for the TPD, the baselines found only 0.96% of the tweets with low confidence. This reflects the fact that the TPD is composed only of monolingual tweets. Thus, we can infer that the GTD contains tweets with code-switching, which makes the performance of the classifiers drop.
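A minimal sketch of the low-confidence filter; the 0.5 threshold follows the text, while applying it whenever either baseline is uncertain (rather than per tool) is our assumption. The predict_fn argument stands for a helper such as baseline_predictions from the Section 3.3 sketch.

def low_confidence_share(tweets, predict_fn, threshold=0.5):
    """Percentage of tweets for which at least one baseline confidence falls below the threshold.

    predict_fn(text) -> ((lang, conf), (lang, conf)), one pair per baseline tool.
    """
    flagged = sum(1 for tweet in tweets
                  if min(conf for _, conf in predict_fn(tweet)) < threshold)
    return 100.0 * flagged / len(tweets)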
Table 4: Performance with the random forest algorithm.
GTD TPD MTD SCD
Features P R F1 P R F1 P R F1 P R F1
fT 86.2 84.3 84.4 99.6 99.66 99.6 97.9 97.8 98.7 81.9 78.9 74.7
TLF 75.3 55.4 53.3 91.6 91.0 91.1 81.9 76.6 73.7 65.5 76.9 69.4
fT+TLF 86.2 81.9 82.1 99.7 99.7 99.7 95.5 96.1 95.7 76.1 82.4 78.2
BERT 73.3 55.5 53.2 96.9 96.8 96.8 74.3 75.0 69.7 48.7 69.76 57.3
fT+BERT 86.4 81.2 81.6 97.2 99.7 99.7 95.3 95.9 95.5 61.5 74.7 65.9
BERT+TLF 85.1 76.0 76.3 99.7 99.7 99.7 93.9 94.1 93.4 60.1 74.2 64.9
Table 5: Performance with the SVM algorithm.
GTD TPD MTD SCD
Features P R F1 P R F1 P R F1 P R F1
fT 87.6 86.3 86.4 99.7 99.7 99.7 98.7 98.7 98.7 74.7 80.9 76.3
TLF 83.0 81.1 80.7 98.2 98.1 98.1 97.1 97.0 97.0 76.2 81.5 77.0
fT+TLF 90.2 89.4 89.4 99.8 99.8 99.8 99.02 99.0 99.0 78.6 83.81 79.9
BERT 75.6 75.5 75.3 98.7 98.7 98.7 97.4 97.4 97.4 69.3 75.5 71.5
fT+BERT 79.4 79.2 79.1 99.0 99.0 99.0 97.9 97.8 97.8 72.2 76.8 73.3
BERT+TLF 81.7 81.5 81.4 99.1 99.1 99.1 97.9 97.9 97.9 72.7 77.3 73.8
4.2 The Tromp and Pechenizkiy Dataset: TPD
The good balance between the languages is reflected in the performance of the algorithms. Both algorithms learn and classify tweets with better performance than on any of the other datasets. Our results show that, no matter which features are used, the performance of both classifiers reaches at least 91% F1. The lower performances were obtained with the random forest classifier (Table 4), while the SVM classifier (Table 5) reaches higher performance in all cases. From our results, we can infer that a balanced class distribution in a dataset helps the learning process of the algorithm.
Regarding the performance with the SVM classifier (see Table 5, column TPD), the TLF and the fastText and BERT embeddings all reached values above 98% F1. The TLF reached a performance very close to that of the fastText and BERT embeddings. However, when comparing with the baselines, the TLF did not pass the statistical test showing that classification was better with them. For the fastText and BERT embeddings, the improvement over the baselines is statistically significant (p=0.001). We conclude that the good performance is due to the composition of the Tromp corpus: it does not contain tweets with multilingual or bilingual expressions, and all links and emoticons have been removed. The texts are clean, without the noise that could have played a role in classification. Nevertheless, the corpus provides us with an opportunity to check the robustness of the sets of features, particularly the linguistic ones.
4.3 The Medical Topic Dataset: MTD
Regarding the random forest classifier (see Table 4, column MTD), the fastText embeddings (fT) achieved the best performance. In this case, the combination of the TLF and fastText embeddings had no positive impact. However, the combination of the TLF and BERT embeddings resulted in an increment of 23.7% in F1 over the BERT embeddings alone. The performance of the TLF was still better than that of the BERT embeddings; the statistical test confirmed this (p = 0.01). In the case of the SVM classifier (see Table 5, column MTD), the results were above the baseline performance. As with the previous dataset, the performance of the TLF was very close to that obtained with the fastText and BERT embeddings. We still observed that combining the TLF with the embeddings increased the performance of the classifiers in terms of F1. This observation supports our hypothesis that combining linguistics-based features with embeddings improves language identification performance on short medical texts.
4.4 Special Cases Dataset: SCD
In this case, both classifiers outperformed the baselines with all features. Regarding the random forest classifier (see Table 4, column SCD), the best performance was reached by the fastText embeddings with 74.7% F1. Concerning the combination of feature sets, the fastText embeddings (fT) and the TLF reached 78.2%. For this specific combination, we observed a slight reduction of the F1 measure, 0.73%; the precision dropped 2.68%, while the recall increased by 0.51%.
The performance of the BERT embeddings is the lowest among the sets of features that we tested, for both classifiers. For the SVM classifier (see Table 5, column SCD), we observed that the TLF achieved the best performance, with 77% F1. When combining features, we observed again that the classification performed best with the fastText embeddings (fT) and the TLF, reaching 79.9% F1. As can be seen in Table 5, column SCD, the performance of the classifiers is similar to the performance on the GTD dataset (see its corresponding column). In fact, when identifying the language with the baseline tools, the F1 values are low for both the GTD and the SCD. The GTD has 20.96% of tweets with low confidence for language identification; in the SCD, 29.99% of the tweets also have low confidence. Another factor that can explain the low performance of the algorithms, and even the low confidence of the baseline tools, is that 41.55% of the tweets contain at least two languages. Since both datasets share similar characteristics in terms of language identification confidence, we can infer that the GTD includes a similar number of multilingual tweets.
4.5 The Manual Annotation of the SCD
Regarding our manual annotation for language identification, we observed a kappa coefficient of 0.6727. In (Pustejovsky and Stubbs, 2012), the authors point out that this value is interpreted as substantial agreement, and they highlight the fact that having more than two classes to annotate can decrease inter-annotator agreement. In our annotation process, we asked the annotators to identify nine different languages. For the composition of the final dataset, we first kept the tweets on which the two annotators agreed. We also included the tweets where one of the annotators' labels coincided with one of the two automatic labels. This strategy reduced the number of tweets from 440 to 231.
The annotation was challenging because tweets can contain code-switching, i.e., a tweet is written in more than one language. In fact, 44.55% of the final tweets are written in more than one language. Thus, when it comes to identifying the principal language, the annotators' linguistic knowledge may have played a role. Our annotators had different linguistic backgrounds: for annotator one, the language preference order was Spanish, French, English, and German; for annotator two, it was Dutch, English, German, French, and Spanish. Thus, for a bilingual tweet, say half Spanish and half English, the annotator with Spanish as their first language would assign Spanish as the principal language, as opposed to English for the other annotator. The decision on the principal language of a code-switched tweet may therefore have been subjective.
Regarding the annotation of layman language, medical topic, and medical language, we observed a kappa coefficient of 0.4759: 71.86% of the tweets were tagged as layman language, 9.52% as medical topic, and 18.18% as medical language. 11.68% of the tweets contain only one word, which can have an impact on language identification.
Our results on the general and medical language datasets show that they are indeed inherently different, and they indicate that the approach that works best for general language is not the optimal one when dealing with short medical texts. To process medical tweets, we recommend that embeddings, i.e., "big data"-based learned features, be combined with expert knowledge-based features, in our case based on linguistics. This is in line with recent examples in which various sub-domains within AI are brought together to produce better, more complete solutions for the medical domain (Van den Bercken et al., 2019; Manousogiannis et al., 2019). Finally, our general recommendation is to bet on hybrid AI methods that combine the versatility of statistical approaches with the strength of symbolic approaches.
5 CONCLUSIONS AND FUTURE WORK
In this paper, we analyzed the impact of two types of
features and their combination for language identifi-
cation, with two classifiers and four different datasets.
We found that combining embeddings and linguistic features offered a substantial gain in F1-score compared to using only one of these feature types. Yet, language identification is still a challenge for short medical texts. Since medical language relies on Latin and Greek roots, it has different graphotactics from layman language. This difference needs to be considered when designing NLP systems for the medical domain.
As for future work, we plan to continue building
a dataset of tweets with medical terminology, relying
on expert knowledge for labelling. We plan to release
the first version of our dataset with the publication of
this paper. Regarding the language identification task,
we plan to move from sentence level to word-level
language identification to address the code-switching
that is present in social media data.
REFERENCES
Alsentzer, E., Murphy, J. R., Boag, W., Weng, W.-H.,
Jin, D., Naumann, T., and McDermott, M. (2019).
Publicly available clinical bert embeddings. arXiv
preprint arXiv:1904.03323.
Barman, U., Das, A., Wagner, J., and Foster, J. (2014).
Code mixing: A challenge for language identification
in the language of social media. In Proceedings of the
first workshop on computational approaches to code
switching, pages 13–23.
Bender, E. M. (2009). Linguistically naïve != language independent: Why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?, pages 26–32, Athens, Greece. ACL.
Coulmas, F. (2003). Writing systems: An introduction to
their linguistic analysis. Cambridge University Press.
De Saussure, F. (1989). Cours de linguistique générale: Édition critique, volume 1. Otto Harrassowitz Verlag.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Eldesouki, M., Dalvi, F., Sajjad, H., and Darwish, K.
(2016). QCRI @ DSL 2016: Spoken Arabic dialect
identification using textual features. In Proceedings
of the Third Workshop on NLP for Similar Languages,
Varieties and Dialects (VarDial3), pages 221–226,
Osaka, Japan. The COLING 2016 Organizing Com-
mittee.
Giwa, O. and Davel, M. H. (2014). Language identification
of individual words with joint sequence models. In
Interspeech 2014.
Holt, R. J., Stanaszek, M. J., and Stanaszek, W. F. (1998).
Understanding Medical Terms: A Guide for Phar-
macy Practice. CRC Press.
Jauhiainen, T. S., Lui, M., Zampieri, M., Baldwin, T., and Lindén, K. (2019). Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.
Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., and Mikolov, T. (2016). FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H.,
and Kang, J. (2019). Biobert: pre-trained biomedi-
cal language representation model for biomedical text
mining. arXiv preprint arXiv:1901.08746.
Lui, M. and Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30. ACL.
Mandal, S., Das, S. D., and Das, D. (2018). Language
identification of bengali-english code-mixed data us-
ing character & phonetic based lstm models. arXiv
preprint arXiv:1803.03859.
Manousogiannis, E., Mesbah, S., Bozzon, A., Baez, S., and
Sips, R. J. (2019). Give it a shot: Few-shot learning
to normalize adr mentions in social media posts. In
Proceedings of the Fourth Social Media Mining for
Health Applications (# SMM4H), pages 114–116.
Molina, G., AlGhamdi, F., Ghoneim, M., Hawwari, A.,
Rey-Villamizar, N., Diab, M., and Solorio, T. (2016).
Overview for the second shared task on language iden-
tification in code-switched data. In Proceedings of
the Second Workshop on Computational Approaches
to Code Switching, pages 40–49. ACL.
Nguyen, D. and Cornips, L. (2016). Automatic detec-
tion of intra-word code-switching. In Proceedings of
the 14th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology,
pages 82–86.
Noreen, E. W. (1989). Computer-intensive methods for test-
ing hypotheses. Wiley New York.
Pershad, Y., Hangge, P. T., Albadawi, H., and Oklu, R.
(2018). Social medicine: Twitter in healthcare. Jour-
nal of clinical medicine, 7(6):121.
Pustejovsky, J. and Stubbs, A. (2012). Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O'Reilly Media, Inc.
Ruder, S., Vulić, I., and Søgaard, A. (2019). A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631.
Tromp, E. and Pechenizkiy, M. (2011). Graph-based n-
gram language identification on short texts. In Proc.
20th Machine Learning conference of Belgium and
The Netherlands, pages 27–34.
Van den Bercken, L., Sips, R.-J., and Lofi, C. (2019). Eval-
uating neural text simplification in the medical do-
main. In The World Wide Web Conference, pages
3286–3292. ACM.
Wehrmann, J., Becker, W. E., and Barros, R. C. (2018).
A multi-task neural network for multilingual senti-
ment classification and language detection on twitter.
In Proceedings of the 33rd Annual ACM Symposium
on Applied Computing - SAC 18, pages 1805–1812.
ACM Press.
Xia, M. X. (2016). Codeswitching language identification
using subword information enriched word vectors. In
Proceedings of The Second Workshop on Computa-
tional Approaches to Code Switching, pages 132–136.
Yin, Y. and Jin, Z. (2015). Document sentiment classifi-
cation based on the word embedding. In 2015 4th
International Conference on Mechatronics, Materials,
Chemistry and Computer Engineering. Atlantis Press.
Zsiga, E. C. (2012). The sounds of language: An introduc-
tion to phonetics and phonology. John Wiley & Sons.