Extracting Contextonyms from Twitter for Stance Detection
Guillaume Gadek¹,², Josefin Betsholtz¹, Alexandre Pauchet², Stéphan Brunessaux¹, Nicolas Malandain² and Laurent Vercouter²
¹ Airbus DS, 78990 Elancourt, France
² Normandie Univ, INSA Rouen, LITIS, 76000 Rouen, France
Keywords:
Opinion Mining, Context, Contextonyms, Sentiment Analysis, Social Media Data, User Generated Text.
Abstract:
Opinion mining on tweets is a challenge: short texts, implicit topics, inventive spellings and new vocabulary
are the rule. We aim at efficiently determining the stance of tweets towards a given target. We propose a
method using the concept of contextonyms and contextosets in order to disambiguate implicit content and
improve a given stance classifier. Contextonymy is extracted from a word co-occurrence graph, and makes it possible to
grasp the sense of a word according to its surrounding words. We evaluate our method on a freely available
annotated tweet corpus, used to benchmark stance detection on tweets during SemEval2016.
1 INTRODUCTION
The large volume, easy access and rapid propaga-
tion of online information make the Internet a perfect
medium for opinion mining. In particular, Twitter has
emerged as a micro-blogging community with over
500 million public messages per day.
Similar to SMS, tweets are short, contain inventive
spelling and their meanings are often implicit. How-
ever, they also differ (Gotti et al., 2013): tweets can
be public whereas SMS are strictly private. Twitter
also contains a greater extent of invented words and
typing errors (Maynard et al., 2012). Furthermore,
tweets contain entities such as hashtags¹ or user mentions². Both hashtags and user mentions often appear
as labels, without any syntactic role.
Nevertheless, most algorithms adapted to tweets
consider words as atoms, i.e. without considering
any relation to the surrounding words, which gen-
erates ambiguity as most words can have more than
one meaning. The problem of discovering the “real”
sense of words in a text is commonly referred to as
text disambiguation. We propose to use contextonyms
to solve this issue, following the simple idea that the
surrounding words can be exploited to determine the
context of usage of a word, and therefore to deter-
mine its sense. Two words are contextonyms if they
¹ Hashtags are single words or phrases preceded by a # and whose meaning may be maintained in the sentence. For instance, “#voteforyou” can replace “vote for you”.
² User mentions use @name in place of a named entity.
are commonly used in a same context and a group of
words frequently co-occurring is called a contextoset.
Twitter is commonly used to express views on var-
ious topics, including product reviews and political
opinions. In this domain, an opinion is represented
by five elements: its author, time of utterance, target
(e.g. a phone), aspect of the target (e.g. the screen
of the phone), and polarity/sentiment (Pang and Lee,
2008).
Stance detection is a similar idea: it focuses on
determining the polarity (in favor, against or none) of
an opinion. Texts are gathered by topic and have to
be analyzed with regard to a given target. To clar-
ify the difference, if the target is Hillary Clinton, the
topic could be U.S. Election Candidates or U.S. Fe-
male Politicians.
This task³ requires an excellent knowledge of the
topic as implicit statements have to be identified: the
corpus includes utterances towards different entities,
related to the topic but sometimes different from the
target. As an example, among the SemEval task was a
“Hillary Clinton” corpus, which included tweets tar-
geting Donald Trump, her rival in the 2016 US elec-
tion.
In this article, we propose to use contextonymy,
and more precisely contextosets, to disambiguate
tweets. We outline a new method to extract the con-
textosets and show that this process improves the
³ SemEval proposed a stance detection task on tweets for the first time in 2015 (linked to the 2016 NAACL conference): http://alt.qcri.org/semeval2016/task6/.
stance detection on tweets. To benchmark our approach, we compare our results on the SemEval stance detection task.
Section 2 introduces others’ work on stance detec-
tion, word sense disambiguation, and contextonyms.
Then, Section 3 explains how we extract contextosets
and determine stance. Section 4 presents the imple-
mentation and experiments. Finally, Section 5 shows
the results obtained by our method and highlights
some elements to discuss.
2 RELATED WORK
Previous studies on opinion mining and sentiment
analysis have already proposed excellent methods, but only a few of them are applicable to stance detection.
2.1 From Opinion Mining to Stance
Detection
Stance detection techniques can be divided into three
main approaches: first, the use of sentiment dictio-
naries and specific linguistic rules. Second, machine
learning for text categorization, using a training cor-
pus. Third, hybrid methods that combine the two first
approaches. The three techniques are described in the
following sections.
Resource-based Approaches for Sentiment Analy-
sis (SA). This family of methods associates polari-
ties with words, i.e. each word gets a score reflecting
its degree of “positivity” and a score reflecting its de-
gree of “negativity”. A weighted sum, usually called
valence, produces the overall sentiment of a text.
Vader⁴ (Hutto and Gilbert, 2014) is a rule-based sentiment analyzer. Inspired by LIWC⁵, ANEW⁶ (Bradley and Lang, 1999) and SentiWordNet⁷ (Baccianella et al., 2010), the authors compiled a short list of 7,500 word-valence pairs, and benchmarked it successfully.
Some systems predict sentiment quite accurately, but this indicator is not satisfactory for stance detection in politics. (Tsytsarau and Palpanas, 2012) propose to detect contradiction: they focus on the signs of opposition or agreement between two successive posts. A study on UK politics (Maynard and Funk, 2011) uses GATE (Cunningham et al., 2011), a rule-based tool, and applies it to political opinion detection. A set of rules states whether a sentence is for or against one of the three big parties in the UK. Similar work focuses on online debates (Anand et al., 2011), distinguishing messages in terms of opposition to entities: many posts are exclusively negative, and therefore a “positive/negative” vocabulary approach would not be useful.

⁴ Valence Aware Dictionary for sEntiment Reasoning.
⁵ Linguistic Inquiry and Word Count: in a text, it counts the percentage of words that reflect different emotions, thinking styles, social concerns, and parts of speech.
⁶ Affective Norms for English Words: a set of normative emotional ratings for a large number of words in the English language.
⁷ A lexical resource for opinion mining. It assigns sentiment scores to synsets from WordNet.
Learning-based Approaches. A second family of methods, text classification, relies on a human-annotated training corpus. This approach stems from topic categorization (Pang and Lee, 2008), whose objective is to label a document with predefined categories. The choice of features has a strong impact, as (Tan et al., 2002) showed by comparing word unigrams and/or bigrams.
Social, user-generated content is very specific to a given platform. The increase of inventive spellings on the Internet led (Pak and Paroubek, 2010) to gather a corpus for sentiment analysis, classifying tweets as objective (thus, neutral) or subjective (and then, relying on laughing and crying emoticons, as positive or negative). In 2013, on a contextual polarity task, the best results used SentiWordNet or MPQA⁸ (Wilson et al., 2005) and achieved an F₁-measure of 0.68 (Nakov et al., 2013): the task is difficult and no good system has emerged yet.
Hybrid Approaches. (Andreevskaia and Bergler, 2008) combined the two preceding approaches, lexicon-based and learning-based. In their experiments, they proposed a corpus consisting of debate forum posts, such as “Firefox VS Internet Explorer”. A post in a forum thread is usually part of a discussion, which helps to put the elements in relation.
(Hasan and Ng, 2013) reached an accuracy of 0.75 on stance classification of political forum posts. They used multiple approaches: the presence of word unigrams and bigrams, sentiment features using the LIWC resource (Pennebaker et al., 2001), as well as task-dependent features: they assume that an author keeps the same stance throughout a debate (ideological constraint), and that two successive messages are likely to oppose each other (user-interaction constraint). These types of features are not applicable to the SemEval corpus, as it only contains the text of tweets.

⁸ Multi-Perspective Question Answering: this corpus contains news articles from a wide variety of news sources, manually annotated for opinions and other private states.
To benefit from the strengths of sentiment resources, (Khan et al., 2014) combined three methods: an emoticon classifier, an enhanced emotion classifier and a SentiWordNet classifier. Results show improvements compared to each method taken individually.
2.2 From Words to Relations
The previous section briefly reviewed the tools to de-
tect the stance of a text. We believe that the scores
can be improved if we tackle one of the biggest chal-
lenges: the ambiguity of the texts. In the following
part, we review some techniques for word sense dis-
ambiguation.
Word sense disambiguation consists in choosing
between senses when the meaning of a word is not
obvious. Most words can bear more than one mean-
ing and their senses can be detected from the context
in which the words appear. (Wiebe and Mihalcea,
2006) show an improvement of their sentence classi-
fier when considering the feature “subjectivity infor-
mation” (syntactic rules to determine if the sentence
is subjective or objective).
The exploration of various links between words is an active field: (Rei and Briscoe, 2014) look for hyponyms in a word embedding space. Hyponym generation allows more relations per word than synonyms or antonyms. (Perez-Tellez et al., 2010) attack the homonymy problem⁹. They claim to be able to distinguish between “orange”, the color, and “orange”, the fruit. To achieve this, they convert each tweet to a tf.idf¹⁰ vector and then apply clustering (K-means) on the whole corpus: tweets in a same cluster tend to use the words in the same sense.
(Fernando and Stevenson, 2012) aimed to asso-
ciate a Wikipedia article to each of the WordNet
synsets: first by matching the title to generate a can-
didate list, then refining this selection by considering
the whole article as well as the glossary and descrip-
tion fields of the synsets.
Previous work (Zesch et al., 2008) proposed a semantic relatedness score using concept vectors to map documents, comparing various resources such as Wiktionary, Wikipedia articles, WordNet and GermaNet (a German counterpart of WordNet).
More recently, (Feng et al., 2015) claim that the usual methods for semantic relatedness of words, using WordNet (Miller, 1995) or Wikipedia, give poor results on Twitter content, because of the different sentence structure as well as the presence of new vocabulary.

⁹ When two words that have different meanings are either spelled in the same way (e.g. “match” that you light a fire with and “match”, a sports game) or pronounced in a similar way (e.g. “to” and “too”).
¹⁰ Term frequency · inverse document frequency: a numerical statistic reflecting how important a word is to a document in a corpus.
These techniques have been used on many different types of corpora, but our work focuses on social media data. User-generated texts are very different from other corpora; nevertheless, contextonyms may help in our stance detection task.
2.3 Contextonyms and Contextosets
The concept of Contextonyms was first introduced by
(Hyungsuk et al., 2003), noting that “contextually
related words are meaningful indicators of a target
word’s semantic value in a given context”. In this
study, contextonyms are defined as “relevant contex-
tually related words for a target word”. In turn, “con-
text” is defined as a certain number of surrounding
words. Contextonymy is a relation between words,
as is synonymy: two words are contextonyms if they
frequently occur together (and thus, describe the same
context).
(Hyungsuk et al., 2003) also obtained cliques (complete subgraphs) from the contextonyms, which, according to them, represent the minimal senses of words. We consider that a single target word, as used by (Hyungsuk et al., 2003), is not enough to characterize a context. Still, a contextoset regroups a number of words corresponding to a given context.
Works using Contextonyms. (Ploux and Ji, 2003) and (Wang et al., 2016) propose a Statistical Machine Translation method, where the unit is not a word but a “minimal semantic unit”, represented by a clique (following the contextonym extraction of (Hyungsuk et al., 2003)). (Şerban, 2013) extracted contextonyms from movie subtitles, to correct SentiWordNet by propagating sentiment valences along these relations.
Comparison of Contextosets, Word Embeddings
and Synsets. To assess the semantic relatedness
proposed by contextosets, we compared the results
with the outcome of two other methods: Word2Vec
and WordNet. We used the same 70-million-word corpus, entirely composed of tweets, in all methods.
We obtained word embeddings by training a
Word2Vec model (Mikolov et al., 2013) using “bag-
of-words” and a vector size of 100. We then ex-
tracted contextonyms, using the procedure outlined
in this paper. Finally, focusing on the word “sup-
port”, we acquired the synsets from WordNet. Table 1
shows an excerpt of neighboring words, contextosets and synsets for the word “support”.
From the Word2Vec word embeddings, only some words seem related to our target, and the results largely depend on the training corpus. Grammatical categories are not distinguished, and we have no insight into the relations between the other words (“respect” is not included in the set of the closest words to “organize”, whereas they both appear as close words to “support”).
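For illustration, a minimal sketch of the Word2Vec training described above, assuming the gensim library (version 4.x); the toy token lists are placeholders for the 70-million-word tweet corpus.

```python
# Sketch of the Word2Vec comparison run: CBOW ("bag-of-words") training
# with a vector size of 100, as described above. Toy data only.
from gensim.models import Word2Vec

tokenized_tweets = [
    ["support", "hillary", "clinton"],
    ["vote", "support", "trump2016"],
]  # placeholder for the preprocessed tweet corpus

model = Word2Vec(
    sentences=tokenized_tweets,
    vector_size=100,  # vector size of 100
    sg=0,             # CBOW training mode
    min_count=1,
)

# Nearest neighbours of "support" in the embedding space
print(model.wv.most_similar("support", topn=10))
```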
WordNet synsets are sometimes numerous, but do
not exist for Twitter-specific vocabulary. Even the
very popular “LOL” is excluded from this dictionary:
WordNet is legitimate, but it is not adapted for social
media.
Contextonyms, like Word2Vec, make no distinc-
tion between grammatical categories and may include
any word that has been used in the corpus. Further-
more, they quite efficiently convey the meaning of a
word, and the contextosets evidently match the origi-
nal corpus topics.
2.4 Discussion
During the SemEval2013 opinion mining task, the best system achieved an F₁-measure of 0.68 (Nakov et al., 2013). (Andreevskaia and Bergler, 2008) are a reference in stance detection: on the topic “Firefox VS Internet Explorer”, their hybrid classifier achieved F₁ = 0.66. On another dataset, (Hasan and Ng, 2013) reached an accuracy of 0.75. These results are good, but not excellent: the stance detection task is indeed very difficult. Moreover, during annotation, the inter-annotator agreement¹¹ is often quite low, which means that even if the stance detection procedure performs well on the testing data, it is still debatable whether the results truly represent the actual stances.
We believe that the process of stance detection on tweets could benefit from disambiguation techniques. However, due to the difficulties of social media content, we cannot rely only on well-established resources.
Disambiguation itself also presents great challenges: (Perez-Tellez et al., 2010) reached an F₁-measure of 0.74 on some ambiguous company names. Nevertheless, we believe that word sense disambiguation has great potential and that including such a step in stance detection can have a positive impact. Contextonymy appears to have many advantages that make it suitable for disambiguation. Hence, the aim of this study is to develop a method to extract a database of contextosets from tweets, and then exploit this database to disambiguate the senses of tweets in order to improve stance detection.

¹¹ On an annotation task, each sample is labeled by different persons to check if they agree; measures such as Cohen's kappa enable evaluation of their agreement.
3 EXTRACTION AND USE OF
CONTEXTOSETS
3.1 Contextoset Extraction
Required Resources. Contextosets are extracted
from a corpus of documents. The contents of these
documents should be representative of the topics of
interest. Here we consider a corpus composed of tweets only. In order to obtain contextosets that are as meaningful as possible, a thematic corpus should be used.
Preprocessing. In this step, all tweets are lower-
cased, and user mentions, special symbols, and stop
words are removed. Common contractions are expanded (e.g. “I'm” becomes “I am”). The tweets are then tokenized on their white
spaces. We considered using a lemmatizer and POS-
tagger, however, our tests with TweetNLP (Owoputi
et al., 2013) were not convincing, as the model lacks
information about the specific topics in our corpus.
For instance, the word ISIS was “lemmatized” to
IS, which completely eliminates the meaning of the
word.
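A minimal sketch of this preprocessing pipeline follows; the stop-word list and contraction table are illustrative placeholders, much smaller than the actual resources.

```python
import re

# Illustrative placeholders; the actual lists are larger.
STOP_WORDS = {"the", "a", "of", "in", "to"}
CONTRACTIONS = {"i'm": "i am", "don't": "do not"}

def preprocess(tweet: str) -> list[str]:
    """Lowercase, expand contractions, drop mentions, symbols, stop words."""
    text = tweet.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"@\w+", " ", text)      # remove user mentions
    text = re.sub(r"[^\w#' ]", " ", text)  # remove special symbols, keep hashtags
    tokens = text.split()                  # tokenize on white space
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("I'm voting @SomeUser! #voteforyou in the election"))
# -> ['i', 'am', 'voting', '#voteforyou', 'election']
```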
Constructing a Co-occurrence Graph.

Definition 3.1. Tweet. A tweet $t$ is a set of words $\{n_i, n_j, \dots\}$ obtained from preprocessing a real tweet.

Definition 3.2. Co-occurrence. Words $n_1, n_2$ are said to occur together if they are in the same tweet $t$ and they are separated by less than $WindowSize/2 - 1$ words. Stated differently, a word co-occurs with the $WindowSize/2$ words before it and the $WindowSize/2$ words after it, in the same tweet $t$. $WindowSize$ is a parameter that can be set to any even, positive integer.
Using a corpus of preprocessed tweets as in Definition 3.1, we constructed a co-occurrence graph $G = (V, E)$. The set of nodes $\{V\}$ consists of the complete vocabulary of the preprocessed corpus, and the set of edges $\{E\}$ represents the undirected, valued links between all pairs of co-occurring words (see Definition 3.2). The weight $w_e$ of any individual edge $e$ is the number of co-occurrences of the words linked by $e$ in the entire preprocessed corpus.
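A minimal sketch of this construction with NetworkX, the package used later for contextoset extraction; the window size and toy tweets are illustrative.

```python
import networkx as nx

WINDOW_SIZE = 6  # illustrative; must be an even, positive integer

def cooccurrence_graph(tweets):
    """Build the weighted co-occurrence graph G of Definition 3.2.

    `tweets` is a list of preprocessed, tokenized tweets. Each word
    co-occurs with the WINDOW_SIZE/2 words before and after it.
    """
    G = nx.Graph()
    half = WINDOW_SIZE // 2
    for tokens in tweets:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + half]:
                if v == w:
                    continue
                # increment the edge weight w_e for each co-occurrence
                if G.has_edge(w, v):
                    G[w][v]["weight"] += 1
                else:
                    G.add_edge(w, v, weight=1)
    return G

G = cooccurrence_graph([["dogs", "like", "swim"], ["dogs", "run", "fast"]])
print(G["dogs"]["like"]["weight"])  # 1
```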
Table 1: Word Embeddings, contextosets and WordNet synsets for the nearest words of support.
Method: Word Embeddings
supporting, supported, supports, respect, vote, encourage, voting, voted, organize, helping
Method: Contextosets
(support, continued, foolery),
(climate, support, advocacy, preventing, change),
(support, bae, naten, kanta),
(support, tennessee, thank, trump2016)
Method: Synsets
(documentation, support)
(support, keep, livelihood, living, bread and butter, sustenance)
(support, supporting)
(accompaniment, musical accompaniment, backup, support)
(support, financial support, funding, backing, financial backing)
(support, back up)
(back, endorse, indorse, plump for, plunk for, support)
(hold, support, sustain, hold up)
(confirm, corroborate, sustain, substantiate, support, affirm)
(subscribe, support)
(corroborate, underpin, bear out, support)
(defend, support, fend for)
(patronize, patronise, patronage, support, keep going)
(digest, endure, stick out, stomach, bear, stand, tolerate, support, brook, abide, suffer, put up)
Filtering Words. As previously mentioned, one of
the great challenges of interpreting tweets is that they
do not necessarily comply with established rules of
grammar and spelling. Moreover, many new words
have emerged specifically on social media, and they
often convey important clues about the content and/or
stance of the tweet. For instance, the hashtag “#de-
mexit” implies US democrats leaving the democratic
party, and has been frequently used in discussions
about the 2016 U.S. election on Twitter. However,
this word, like many others, is absent in the Oxford
Dictionary. Other words, such as “laaazzzyymoon-
nddayyy” do not belong to “established” Twitter-
vocabulary, but are purposely misspelled words used
by one person alone. When creating contextonyms, it is important to separate the important social vocabulary from the nonsense words. Furthermore, if a tweet containing a nonsense word is retweeted many times, conventional filtering methods, which remove low-frequency words, will make this word appear important relative to an actual word which perhaps only occurs once or a few times in the corpus. Therefore, we have developed an innovative method to filter out the non-usable words.
Definition 3.3. Degree. The degree of a word $n$ in a co-occurrence graph $G$ is the number of other words directly connected to $n$. We denote this by $d(n)_G$.
We consider a word to be legitimate if it is used in many different kinds of contexts, i.e. surrounded by different words. This simply indicates that the word is present in more than one tweet and/or is used by more than one person. One way to assess this is to look at the degree of a node. However, since we rely on the WindowSize parameter to assign neighbors to a word, its degree also depends on its position in any given tweet. For instance, “dogs like to swim in the summers” and “dogs usually run very fast” would give “dogs”, “swim”, “to”, “in”, “run” degrees of 4, even if “dogs” is the only word that appears in both sentences. Therefore, we normalize the degree of a node by its average degree due to its position in a tweet to get a ratio, α, that represents the actual variety of contexts that a word appears in.
Let $g_t = (V_t, E_t)$ be the co-occurrence graph for a single tweet $t$. For a given word $n$, let the tokenized tweets containing $n$ be denoted by $1, \dots, K$. Then, the average degree $\varphi$ of a word $n$, due to its position, is given by

$$\varphi(n) = \frac{1}{K} \sum_{j=1}^{K} d(n)_{g_j} \qquad (1)$$

We can then find $\alpha(n)$, the ratio of the degree in $G$ to the average positional degree for word $n$:

$$\alpha(n) = \frac{d(n)_G}{\varphi(n)} \qquad (2)$$
A large score implies that word $n$ occurs in a great variety of contexts. Hence, the words in a tweet such as “dizz movi ezz hoorrble”, if only appearing in this tweet, would all get a score of 1, even if the tweet is retweeted 50 times. A word $n$ would then be removed if $\alpha(n) < \alpha_{threshold}$.
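A sketch of this α filter, reusing the cooccurrence_graph() helper from the sketch above; the toy `tweets` list is a placeholder for the preprocessed corpus.

```python
# Alpha filter (Equations 1 and 2) on toy data.
tweets = [["dogs", "like", "swim"], ["dogs", "run", "fast"]]
G = cooccurrence_graph(tweets)
tweet_graphs = [cooccurrence_graph([t]) for t in tweets]  # g_1 .. g_K

def alpha(word):
    """Degree of `word` in G, normalised by its average per-tweet degree."""
    graphs_with_word = [g for g in tweet_graphs if word in g]
    phi = sum(g.degree(word) for g in graphs_with_word) / len(graphs_with_word)  # Eq. 1
    return G.degree(word) / phi  # Eq. 2

ALPHA_THRESHOLD = 10  # value used in Section 4.3
kept_words = [n for n in G.nodes if alpha(n) >= ALPHA_THRESHOLD]
```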
The second part of the filtering process concerns the edges. Again, the conventional filtering method, which filters by edge weight, would remove important contexts that are less represented in the corpus, perhaps because of topic bias, and favor co-occurrences that appear frequently, even if only by retweet. In an attempt to address this issue, we introduce the metric β, which consists of two weight-to-word-count ratios:

$$\beta(e) = \frac{w_e}{c_{n_1,e}} + \frac{w_e}{c_{n_2,e}} \qquad (3)$$
where $w_e$ is the weight of edge $e = (n_1, n_2)$, and $c_{n_1,e}$ and $c_{n_2,e}$ are the word counts for the two words $n_1$ and $n_2$ connected by $e$. Since $\beta(e) \in \,]0, 2]$, a value approaching 2 implies that this association is very important for both words, whereas a value approaching 0 implies that the association is relatively unimportant for both words. By filtering away the edges that have small values, i.e. whenever $\beta(e) < \beta_{threshold}$, we get rid of the unimportant associations and only retain what is important for our contextosets.
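A sketch of the β edge filter on the graph from the sketches above; reading $c_{n_1,e}$ and $c_{n_2,e}$ as the corpus counts of the two words is an interpretation, and `word_counts` is a placeholder name.

```python
# Beta edge filter (Equation 3) on the graph G built above.
from collections import Counter

word_counts = Counter(w for t in tweets for w in t)  # assumed reading of c_{n,e}

def beta(e):
    n1, n2 = e
    w_e = G[n1][n2]["weight"]
    return w_e / word_counts[n1] + w_e / word_counts[n2]  # Eq. 3

BETA_THRESHOLD = 0.06  # value used in Section 4.3
G.remove_edges_from([e for e in list(G.edges) if beta(e) < BETA_THRESHOLD])
```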
Contextoset Extraction. We chose to extract the contextosets using a method proposed by (Palla et al., 2005). They outline a way to obtain k-clique communities, i.e. communities derived from overlapping cliques. This suits our problem particularly well, as maximal cliques provide inadequate contextosets because of their tendency to form many, almost identical sets of words that specify the same context. k-clique communities improve the contextosets as they merge cliques that share many of the same words. We used the k_clique_communities implementation of the (Palla et al., 2005) method in the NetworkX (Hagberg et al., 2008) Python package.
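A usage sketch of that NetworkX implementation, applied to the filtered graph G from the sketches above; the value of k is an illustrative assumption.

```python
# Extract contextosets as k-clique communities (Palla et al., 2005).
from networkx.algorithms.community import k_clique_communities

contextosets = [set(c) for c in k_clique_communities(G, 3)]  # k = 3 is illustrative
print(contextosets)
```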
3.2 Determining Stance
As described in Section 2, there are two main ap-
proaches for stance detection. The first is based
on sentiment analysis, guided by the intuition that
positive-sentiment tweets have a supportive stance
towards their target. The second one is based on
text categorization, where conditional probabilities of
word co-occurrences help to statistically determine
the class to which a tweet belongs. For each of these
approaches, we propose a baseline and a method us-
ing contextosets. We do not claim to have the best
classifier, but we aim to show that contextosets can
substantially improve stance detection.
3.2.1 Sentiment: Resource-based Approaches
Baseline, SENT-BASE. We propose a baseline,
SENT-BASE, using the well-known resource Senti-
WordNet 3.0 (Baccianella et al., 2010) to predict the
stance. We assume that positively (negatively) valued
tweets have the stance FAVOR (AGAINST).
In SentiWordNet, each word $n$ may be present in different synsets. A synset is a set of one or more synonyms that are interchangeable in a given context. Let $S(n)$ be the set of synsets $s_i$ containing the word $n$. Each synset has a positive and a negative valence $s_i^+, s_i^-$.

Let $S_t$ be the set of all the $N$ synsets taken into account for the whole tweet. We therefore define the valence $v(t)$:

$$v(t) = \frac{1}{N} \sum_{s_i \in S_t} \left( s_i^+ + s_i^- \right) \qquad (4)$$

If $v(t)$ is positive (negative), we assume the tweet is supportive (opposed), thus having the stance FAVOR (AGAINST).
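A sketch of this baseline, assuming NLTK's SentiWordNet corpus reader; note that NLTK exposes both scores as positive numbers, so the negative valence $s_i^-$ is treated here as a negative quantity, i.e. neg_score() is subtracted.

```python
# Sketch of the SENT-BASE valence (Equation 4) via NLTK's SentiWordNet.
# Requires: nltk.download("sentiwordnet"); nltk.download("wordnet")
from nltk.corpus import sentiwordnet as swn

def valence(tokens):
    scores = []
    for word in tokens:
        for syn in swn.senti_synsets(word):
            # neg_score() subtracted: the negative valence acts as a negative number
            scores.append(syn.pos_score() - syn.neg_score())
    return sum(scores) / len(scores) if scores else 0.0

def stance(tokens):
    v = valence(tokens)
    return "FAVOR" if v > 0 else "AGAINST" if v < 0 else "NONE"

print(stance(["great", "support"]))
```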
Enhancing Sentiment Analysis with Contextosets: SENT-CTXT. Sentiment prediction can be improved by considering contextonyms when selecting synsets. We obtain a list of the best contextoset(s) matching a tweet, by the criterion of the greatest number of words shared by the contextoset(s) and the tweet. It is possible to have more than one contextoset sharing the same number of words with the tweet.

Let $C$ be the set of contextosets $c$ generated from the corpus. Then, for a tweet $t$, the set of best contextosets $B_t$ is given by

$$B_t = \{\, c \mid \max(|\{n\}_c \cap \{n\}_t|),\ c \in C \,\} \qquad (5)$$
Then, we propose a function that takes the tokenized tweet and its contextoset(s) as inputs. Using SentiWordNet, it selects only one synset, based on the number of words shared between the contextonyms and the synset. If two synsets are competing, the function relies on the “gloss” field (glossary, a short unstructured description) to count the number of shared words and, finally, select the best synset.

Finally, the valence is computed as in SENT-BASE and allows us to predict a stance label.
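A sketch of Equation 5, assuming `contextosets` is a list of word sets (as produced by the k-clique sketch above).

```python
# Best contextosets B_t of Equation 5: those sharing the largest
# number of words with the tweet.
def best_contextosets(tweet_tokens, contextosets):
    t = set(tweet_tokens)
    overlaps = [len(c & t) for c in contextosets]
    best = max(overlaps)
    return [c for c, o in zip(contextosets, overlaps) if o == best]
```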
3.2.2 Statistical Approaches
Baseline: SVM-UNIG. We propose a baseline, SVM-UNIG, using an SVM on word unigrams. More specifically, we compared different algorithms and parameter settings, and finally selected and trained an SVM with an RBF kernel ($C = 100.0$, $\gamma = 0.01$, chosen after a 3-fold stratified cross-validation). The feature vector is composed of Boolean indicators of unigram presence. The vocabulary size is fixed at 10,000, which limits the feature vector length.
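A sketch of this baseline, assuming scikit-learn; the toy texts and labels are placeholders for the SemEval training data.

```python
# SVM-UNIG sketch: Boolean unigram presence features, RBF-kernel SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy placeholders for the 2,914 SemEval training tweets.
train_texts = ["i support hillary", "hillary must go", "vote trump2016"]
train_labels = ["FAVOR", "AGAINST", "AGAINST"]

vectorizer = CountVectorizer(binary=True, max_features=10_000)  # unigram presence
X_train = vectorizer.fit_transform(train_texts)

clf = SVC(kernel="rbf", C=100.0, gamma=0.01)  # parameters from cross-validation
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["i will vote hillary"])))
```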
Using Contextosets as Features: SVM-CTXT. SVM-CTXT follows a simple intuition: perhaps the contextosets are good indicators of the stance of a tweet. Thus, SVM-CTXT is the same classifier as SVM-UNIG, but the feature vector is here composed of Boolean indicators of the presence of a contextoset. However, we believe this method is sensitive to the size of the training set: given the limited size of our training and test sets (1,250 tweets in the test set), and the 6,278 contextosets we obtained, many contextosets are likely to occur only rarely.
Using Contextosets to Expand Tweets: SVM-EXP. This method addresses the shortness of the tweets by completing them with the best contextoset(s). It first transforms the tokenized tweet by adding all words of the best matching contextoset(s). Then, it uses SVM-UNIG to determine its stance. Of course, this process has to be applied both on the training set and on the testing set.

We use Equation 5 to find the best contextosets $B_t$ associated with the tweet $t$, and then obtain the “expanded tweet” $E_t$ as follows:

$$E_t = \{n\}_{B_t} \cup \{n\}_t \qquad (6)$$
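A sketch of this expansion, reusing best_contextosets() from the sketch above.

```python
# Tweet expansion of Equation 6: union of the tweet's words with the
# words of its best-matching contextoset(s), before the unigram SVM.
def expand(tweet_tokens, contextosets):
    expanded = set(tweet_tokens)
    for c in best_contextosets(tweet_tokens, contextosets):
        expanded |= c
    return expanded
```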
3.3 Evaluation
For each possible label s, the classifier can send a
positive or a negative signal in response to a query
sample. The response signal can be either true if it
matches the ground truth, or false in case of error.
To compute metrics to evaluate our results, a set
of samples called the test set is needed.
Thus, we have four possible outcomes for each
label s: TP stands for the number of true positives,
meaning that the classifier correctly determined TP
samples to belong to s; FP stands for the number of
false positives, meaning that the classifier incorrectly
determined the label of FP samples to be s; TN repre-
sents the number of true negatives, meaning that the
classifier correctly determined TN samples not to be-
long to s; and FN stands for the number of false neg-
atives, meaning that the classifier incorrectly deter-
mined FN samples not to belong to s.
The first metric, Precision, is defined in Equation 7: it represents the fraction of samples classified as belonging to $s$ whose true label is indeed $s$. Another metric, Recall (Equation 8), is the fraction of samples whose true label is $s$ that were classified as belonging to $s$. Together, they are combined into the F₁-measure (Equation 9), which is commonly used to assess the quality of the prediction for each label $s$.
$$P_s = \frac{TP_s}{TP_s + FP_s} \qquad (7)$$

$$R_s = \frac{TP_s}{TP_s + FN_s} \qquad (8)$$

$$F_1(s) = 2\,\frac{P_s \cdot R_s}{P_s + R_s} \qquad (9)$$
To benchmark our results, we use the Official Score metric proposed in the SemEval task (Equation 10). It is the average of the F₁-measures of the positive (F, for favor) and negative (A, for against) stances, and does not include the neutral-stance prediction. It is not directly comparable to the F₁-measures given beforehand.

$$Score = \frac{1}{2}\left( F_1(F) + F_1(A) \right) \qquad (10)$$
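A sketch of this metric, assuming scikit-learn for the per-class F₁-measures.

```python
# Official Score (Equation 10): mean of the FAVOR and AGAINST
# F1-measures; the NONE class is ignored.
from sklearn.metrics import f1_score

def official_score(y_true, y_pred):
    f1_favor, f1_against = f1_score(
        y_true, y_pred, labels=["FAVOR", "AGAINST"], average=None
    )
    return (f1_favor + f1_against) / 2

print(official_score(["FAVOR", "AGAINST", "NONE"],
                     ["FAVOR", "AGAINST", "FAVOR"]))
```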
4 IMPLEMENTATION AND
EXPERIMENTS
4.1 SemEval Task Description
Our stance classifiers are evaluated on the SemEval2016-task6 corpus¹². We focus only on subtask A, which includes a training set and a test set on five topics (“Atheism”, “Climate Change is a Real Concern”, “Feminist Movement”, “Hillary Clinton”, “Legalization of Abortion”). Subtask B focused on “Donald Trump” in an unsupervised way, with only a test set available. Overall, the subtask-A corpus is divided into a training part (2,914 tweets), which is used to find the best parameters and train our supervised model, and a test part (1,250 tweets), which is used for evaluation. To compare the results, the organizers use the Official Score (see Equation 10).
4.2 Input Corpora for Contextosets
Extraction
We collected a corpus of English-language tweets, GenTweets, using the Twitter Stream API. This API allows anyone to gather public tweets that contain one or more keywords (words or hashtags), as soon as they are published. However, due to constraints imposed by Twitter, we only receive a random sample of the set of tweets containing these keywords. GenTweets consists of 7,773,089 tweets gathered between November 20th and December 1st, 2015, on a broad range of topics, including Clinton, the abortion debate, religion, and miscellaneous subjects.

¹² SemEval2016-task6 is a stance detection task applied on an annotated tweet corpus freely available at http://alt.qcri.org/semeval2016/task6/
4.3 Parameters
Using the GenTweets corpus, we obtained a vocabulary of more than 250,000 words; we reduced it to 50,000 by setting $\alpha_{threshold} = 10$. This size is more reasonable to handle and allows for easier processing of the co-occurrence graph, yet includes enough variety to grasp a wide range of expressions.

We chose $\beta_{threshold} = 0.06$, as this value implies that the link between two words is of relatively small importance; it also limits the number of edges to 300,000, again allowing for reasonable processing times.
5 RESULTS AND DISCUSSION
Table 2 contains the results of our various experiments. P stands for the average precision over the three stances, for each target with a specified algorithm. R stands for recall, and F₁ is the average F₁-measure.
Table 2: Comparison between the proposed algorithms on SemEval Task A.

Algorithm       | P    | R    | F₁
Sent: SENT-BASE | 0.41 | 0.30 | 0.31
Sent: SENT-CTXT | 0.43 | 0.35 | 0.37
Stat: SVM-UNIG  | 0.63 | 0.62 | 0.62
Stat: SVM-CTXT  | 0.58 | 0.61 | 0.58
Stat: SVM-EXP   | 0.69 | 0.64 | 0.66
SENT-BASE turns out to be a rather unsatisfactory baseline, as it achieves F₁ = 0.31. However, SENT-CTXT improves the stance detection to F₁ = 0.37. The low scores are due to the weak assumption that sentiment predicts stance. Moreover, the test corpus includes tweets targeting entities other than the target: the set whose target is Hillary Clinton also includes positive mentions of other candidates, which imply an AGAINST stance towards Hillary Clinton. A sentiment-based approach does not handle this well. Hence, knowing only the topic and the sentiment is insufficient to determine the stance: the target of the sentiment needs to be considered as well. In other words, the target of a sentiment in a tweet can differ from the target given by the stance detection task.
SVM-UNIG is a better baseline because it draws upon the training sets (as opposed to SENT-BASE), and reaches F₁ = 0.62. One can note that the best official score (0.68) on this task was also reached by an SVM (which also included character n-grams in its feature vector), proposed as a baseline by the SemEval organizers (Mohammad et al., 2016).
SVM-CTXT performed rather unsatisfactorily. This is mainly due to the small size of the training corpus: the training set is too small to cover the full range of contextosets, so SVM-CTXT often makes predictions on contextosets it has never seen before.
Finally, SVM-EXP shows an improvement, reaching F₁ = 0.66 (Official Score = 0.65). While it does not beat the results of the top competitors, it is comparable to them.
Table 3: Comparison with SemEval competitors, using the SemEval official score.

Algorithm | SVM-EXP | A#1   | A#2   | A#3
Score     | 0.650   | 0.678 | 0.673 | 0.668
In Table 3, we compare our best results to the three best scores obtained by the competitors during the evaluation. Our algorithm SVM-EXP would have been ranked 6th among the 19 competitors: we obtained good results on the Official Score benchmark. However, as for the other teams, the scores are not very high and the accuracy of the predictions is too low to be useful. (Mohammad et al., 2016) propose an analysis of the results. The top-ranked algorithm (A#1, MITRE) used two recurrent neural networks (RNNs): the first RNN chose the best hashtags on an unlabeled tweet set, and the second one estimated the stance accordingly. The runner-up (A#2, pkudlab) used both a deep convolutional neural network and a set of rules, and only used the training data. We are unaware of the technique used by (A#3, TakeLab).
6 CONCLUSION
Stance detection on tweets is a challenging task, because of their shortness, innovative spelling and usage of words. Themes are often implicit and the targets of opinions are not always explicitly mentioned.

In the field of semantic and lexical relatedness, contextonyms and contextosets help to address some of these issues by attempting to disambiguate the words in the tweets. Furthermore, it is possible to produce contextosets from any kind of dialect or language, requiring only basic adaptations (e.g. an adapted tokenizer) as well as large amounts of text.
To show the usefulness of contextosets, we proposed to measure their effect on the SemEval stance detection task. We introduced two baselines: a sentiment analyzer, based on SentiWordNet, and a text classifier, based on an SVM whose feature vector is constituted of Boolean indicators of unigram presence. In both cases, contextosets increase the global F₁-measure, even though “sentiment” does not seem to be the best approach for this task.
We believe contextosets have great potential, and we will continue to explore the possibilities along both the sentiment-based and statistical approaches. Even if our sentiment analyzer failed to predict the stance of positive tweets that are nonetheless against the target, we believe it has the potential to tackle this task. For instance, results may be improved by enabling it to consider the subject of the tweet, to grasp not only the sentiment polarity but also its target. The learning approach may be improved by using contextosets to disambiguate ambiguous tweets only, rather than all of them.
REFERENCES
Anand, P., Walker, M., Abbott, R., Tree, J. E. F., Bowmani,
R., and Minor, M. (2011). Cats rule and dogs drool!:
Classifying stance in online debate. In Proceedings
of the 2nd workshop on computational approaches to
subjectivity and sentiment analysis, pages 1–9. Asso-
ciation for Computational Linguistics.
Andreevskaia, A. and Bergler, S. (2008). When specialists
and generalists work together: Overcoming domain
dependence in sentiment tagging. In ACL, pages 290–
298.
Baccianella, S., Esuli, A., and Sebastiani, F. (2010). Sen-
tiwordnet 3.0: An enhanced lexical resource for sen-
timent analysis and opinion mining. In LREC, vol-
ume 10, pages 2200–2204.
Bradley, M. M. and Lang, P. J. (1999). Affective norms for
english words (anew): Instruction manual and affec-
tive ratings. Technical report.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan,
V., Aswani, N., Roberts, I., Gorrell, G., Funk, A.,
Roberts, A., Damljanovic, D., Heitz, T., Greenwood,
M. A., Saggion, H., Petrak, J., Li, Y., and Peters, W.
(2011). Text Processing with GATE (Version 6).
Feng, Y., Fani, H., Bagheri, E., and Jovanovic, J. (2015).
Lexical semantic relatedness for twitter analytics. In
Tools with Artificial Intelligence (ICTAI), 2015 IEEE
27th International Conference on, pages 202–209.
IEEE.
Fernando, S. and Stevenson, M. (2012). Mapping wordnet
synsets to wikipedia articles. In LREC, pages 590–
596.
Gotti, F., Langlais, P., and Farzindar, A. (2013). Translating
government agencies tweet feeds: Specificities, prob-
lems and (a few) solutions. NAACL 2013, page 80.
Hagberg, A. A., Schult, D. A., and Swart, P. J. (2008).
Exploring network structure, dynamics, and function
using NetworkX. In Proceedings of the 7th Python
in Science Conference (SciPy2008), pages 11–15,
Pasadena, CA USA.
Hasan, K. S. and Ng, V. (2013). Extra-linguistic constraints
on stance recognition in ideological debates. In ACL
(2), pages 816–821.
Hutto, C. J. and Gilbert, E. (2014). Vader: A parsimonious
rule-based model for sentiment analysis of social me-
dia text. In Eighth International AAAI Conference on
Weblogs and Social Media.
Hyungsuk, J., Ploux, S., and Wehrli, E. (2003). Lexical
knowledge representation with contexonyms. In 9th
MT summit Machine Translation, pages 194–201.
Khan, F. H., Bashir, S., and Qamar, U. (2014). Tom: Twitter
opinion mining framework using hybrid classification
scheme. Decision Support Systems, 57:245–257.
Maynard, D., Bontcheva, K., and Rout, D. (2012). Challenges in developing opinion mining tools for social media. In Proceedings of the @NLP can u tag #usergeneratedcontent workshop, pages 15–22.
Maynard, D. and Funk, A. (2011). Automatic detection
of political opinions in tweets. In The semantic web:
ESWC 2011 workshops, pages 88–99. Springer.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–
3119.
Miller, G. A. (1995). Wordnet: a lexical database for en-
glish. Communications of the ACM, 38(11):39–41.
Mohammad, S. M., Kiritchenko, S., Sobhani, P., Zhu, X.,
and Cherry, C. (2016). Semeval-2016 task 6: De-
tecting stance in tweets. In Proceedings of the Inter-
national Workshop on Semantic Evaluation, SemEval,
volume 16.
Nakov, P., Kozareva, Z., Ritter, A., Rosenthal, S., Stoyanov,
V., and Wilson, T. (2013). Semeval-2013 task 2: Sen-
timent analysis in twitter.
Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schnei-
der, N., and Smith, N. A. (2013). Improved part-
of-speech tagging for online conversational text with
word clusters. Association for Computational Lin-
guistics.
Pak, A. and Paroubek, P. (2010). Twitter as a corpus for
sentiment analysis and opinion mining. In LREC, vol-
ume 10, pages 1320–1326.
Palla, G., Derényi, I., Farkas, I., and Vicsek, T. (2005).
Uncovering the overlapping community structure of
complex networks in nature and society. Nature,
435:814–818.
Pang, B. and Lee, L. (2008). Opinion mining and senti-
ment analysis. Foundations and trends in information
retrieval, 2(1-2):1–135.
Pennebaker, J. W., Francis, M. E., and Booth, R. J. (2001).
Linguistic inquiry and word count: Liwc 2001. Mah-
way: Lawrence Erlbaum Associates, 71:2001.
Perez-Tellez, F., Pinto, D., Cardiff, J., and Rosso, P. (2010).
On the difficulty of clustering company tweets. In
Proceedings of the 2nd international workshop on
Search and mining user-generated contents, pages
95–102. ACM.
Ploux, S. and Ji, H. (2003). A model for matching se-
mantic maps between languages (french/english, en-
glish/french). Computational linguistics, 29(2):155–
178.
Rei, M. and Briscoe, T. (2014). Looking for hyponyms in
vector space. In Proceedings of the 18th Conference
on Computational Natural Language Learning, pages
68–77.
Şerban, O. (2013). Detection and integration of affective
feedback into distributed interactive systems. PhD the-
sis, Citeseer.
Tan, C.-M., Wang, Y.-F., and Lee, C.-D. (2002). The use of
bigrams to enhance text categorization. Information
processing & management, 38(4):529–546.
Tsytsarau, M. and Palpanas, T. (2012). Survey on mining
subjective data on the web. Data Mining and Knowl-
edge Discovery, 24(3):478–514.
Wang, R., Zhao, H., Ploux, S., Lu, B.-L., and Utiyama, M.
(2016). A bilingual graph-based semantic model for
statistical machine translation. In International Joint
Conference on Artificial Intelligence.
Wiebe, J. and Mihalcea, R. (2006). Word sense and subjec-
tivity. In Proceedings of the 21st International Con-
ference on Computational Linguistics and the 44th
annual meeting of the Association for Computational
Linguistics, pages 1065–1072. Association for Com-
putational Linguistics.
Wilson, T., Wiebe, J., and Hoffmann, P. (2005). Recogniz-
ing contextual polarity in phrase-level sentiment anal-
ysis. In Proceedings of the conference on human lan-
guage technology and empirical methods in natural
language processing, pages 347–354. Association for
Computational Linguistics.
Zesch, T., Müller, C., and Gurevych, I. (2008). Using wiktionary for computing semantic relatedness. In AAAI, volume 8, pages 861–866.