A Challenging Data Set for Evaluating Part-of-Speech Taggers
Mattias Wahde (https://orcid.org/0000-0001-6679-637X), Minerva Suvanto (https://orcid.org/0009-0003-1751-151X) and Marco L. Della Vedova (https://orcid.org/0000-0002-4703-7500)
Chalmers University of Technology, SE-412 96 Gothenburg, Sweden
Keywords:
Part-of-Speech Tagging, Natural Language Processing, Sequence Labeling.
Abstract:
We introduce a novel, challenging test set for part-of-speech (POS) tagging, consisting of sentences in which
only one word is POS-tagged. First derived from Wiktionary, and then manually curated, it is intended as
an out-of-sample test set for POS taggers trained over larger data sets. Sentences were selected such that
at least one of four standard benchmark taggers would incorrectly tag the word under consideration for a
given sentence, thus identifying challenging instances of POS tagging. Somewhat surprisingly, we find that
the benchmark taggers often fail on rather straightforward instances of POS tagging, and we analyze these
failures in some detail. We also compute the performance of a state-of-the-art DNN-based POS tagger over
our set, obtaining an accuracy of around 0.87 for this out-of-sample test, far below its reported performance in
the literature. Also for this tagger, we find instances of failure even in rather simple cases.
1 INTRODUCTION
Part-of-speech (POS) tagging is an important pre-
processing step in many natural language processing
(NLP) tasks (Chiche and Yitagesu, 2022). In POS
tagging, the aim is to assign class labels (POS tags) to
the words in a sentence, determining, for each word,
whether it is a noun, a verb, an adjective, and so on.
For many words, this is simple, as they are always
associated with a single POS tag. For example, the
word organization is always a (common) noun. On
the other hand, there are many words for which sev-
eral different POS tags are possible, so that the correct
tag in a given situation must be inferred from con-
text, i.e., using information about surrounding words.
For example, the word present can be either a noun
(meaning a gift), a verb (as in present a paper at a
conference), or an adjective (as in being present at a
meeting). The aim of a POS tagger is to resolve such
ambiguities and thus to assign the correct POS tag to
each word in a sentence.
POS tagging has been extensively studied in NLP,
resulting in a set of high-performance taggers, such as
Brill (Brill, 1992), Hunpos (Halácsy et al., 2007), Per-
ceptron (Bird et al., 2009), Stanford (Toutanova et al.,
2003), and, more recently, a variety of taggers using
deep neural networks (DNNs), see, e.g., (Akbik et al.,
2018). Achieving decent POS tagging performance is
not a very difficult task. In fact, when applied to a
diverse set of sentences, a tagger that simply selects
the most common tag for any given word typically obtains an accuracy of at least 0.85–0.90.
More sophisticated taggers, as exemplified above,
generally obtain even better results, with reported ac-
curacy in the range 0.92–0.97. Recent DNN-based
taggers typically exhibit high accuracy, but other tag-
gers, some of which are listed above, are not far be-
hind. Indeed, the reported accuracy of the Stanford
tagger (over its test set) is 0.972 (Toutanova et al.,
2003), whereas the performance of a state-of-the-art
DNN-based tagger (Akbik et al., 2018) is 0.978 over
the same set. Here, one should also bear in mind that
the ground truth labels involve some error (or, at least
ambiguity), where different human evaluators may as-
sign different POS tags, typically affecting around 3%
of the words (the topic of POS ambiguity is further discussed in Section 2 below). Thus, one can hardly demand an accu-
racy of better than around 0.97 of any automated tag-
ger. For this reason, given the accuracies mentioned
above, one may perhaps view POS tagging as a solved
problem and thus simply apply any of the standard
POS taggers included in commonly used software li-
braries.
However, we argue that such a conclusion may be
premature. First of all, most (English) POS taggers
have been trained on one of two specific data sets,
namely the Brown corpus (Francis and Kucera, 1979)
and the Penn Treebank (Marcus et al., 1999). Even
though those are excellent data sets, there is still a
risk of overstating the out-of-sample performance: If
a sufficient number of different POS taggers are gen-
erated, some will naturally perform better than oth-
ers on the parts of the data set used for testing, but
those results are not necessarily replicated on com-
pletely new data.
Second, the sentences in the two data sets are now
around 30 years old (Penn Treebank) and more than
50 years old (Brown). Over such a long time span,
any human language will undergo changes that, in
turn, may lead to reduced POS accuracy. Third, even
though the average performance of POS taggers is
very good, there are many cases (as will be discussed
in Section 6 below) where existing POS taggers fail,
not only in complicated cases involving tags that are
hard to assign even for a human, but also, in fact, in
simple cases.
For these reasons, we here introduce a new test set
for POS tagging, which has been semi-automatically
generated using Wiktionary, followed by thorough
manual inspection, as discussed below. This data set
specifically targets word usages for which a set of
representative POS taggers struggle. It should only
be seen as a test set since, as a result of its method
of construction, we only tag one word per sentence.
Nevertheless, the new data set offers an opportunity
to test a POS tagger (trained on any other data set) in
an out-of-sample fashion.
The paper is structured as follows: First, in Sec-
tion 2 we begin with some general observations re-
lated to POS tagging. Next, in Section 3 we introduce
and describe the new POS tagging test set. Then, in
Section 4, we list and describe the taggers used here.
In Section 5 we present the results obtained for the
various taggers, both over standard benchmark data
sets and our data set. Then, in Section 6, we analyze
different instances of misclassification by the various
taggers. We also briefly present a rule-based approach
(in development) that aims to improve tagger perfor-
mance by focusing on those words or phrases where
existing taggers struggle. Conclusions are given in
Section 7.
2 POS TAGGING:
OBSERVATIONS
As mentioned above, POS tagging is the problem of
assigning a class label (POS tag) to each word in a
sentence. In other words, the process maps an ordered
sequence of words {w_1, w_2, ..., w_n} to a sequence of tuples (w_i, p_i), where p_i are the POS tags. Note that
there are related procedures, such as chunking (Wu
et al., 2023; Ramshaw and Marcus, 1999), which seeks
to divide a text into units that may contain more than
one word, for example noun phrases, verb phrases,
compound nouns, and so on. In this paper, however,
we will consider only POS tagging, as defined above.
Given a tag set, i.e., the set of possible labels, POS
tagging is straightforward in many cases. Consider,
as an example, the simple sentence she slowly opened
the heavy door. In this sentence, she is a personal
pronoun (denoted PRP in the Penn Treebank tag set), slowly is an adverb (RB), opened is a verb in past tense
(VBD), the is a determiner (DT), heavy is an adjective
(JJ), and door is a noun in singular form (NN), result-
ing in the following sequence (she, PRP), (slowly, RB),
(opened, VBD), (the, DT), (heavy, JJ), (door, NN).
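As a minimal illustration (not part of the authors' experiments), the corresponding Penn Treebank-style tags can be reproduced with NLTK's default tagger; the exact output may vary slightly with the NLTK version and model:

```python
# Minimal sketch: tagging the example sentence with NLTK's default
# (Perceptron-based) tagger, which returns Penn Treebank-style tags.
# Resource names below correspond to NLTK 3.8.1.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("She slowly opened the heavy door")
print(nltk.pos_tag(tokens))
# Expected (approximately): [('She', 'PRP'), ('slowly', 'RB'), ('opened', 'VBD'),
#                            ('the', 'DT'), ('heavy', 'JJ'), ('door', 'NN')]
```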
However, in many other cases, the process is less
straightforward. First of all, there may be errors in
the data sets used for training POS taggers. As men-
tioned above, it is typically stated that the accuracy
of ground truth tags is around 97% but, as already
noted by (Manning, 2011), this number may be an
overestimate. To some degree, the ground truth POS
tags could perhaps be improved via a time-consuming
process of manual correction, though. Manning pro-
ceeds to break down POS tagging errors (for the Stan-
ford tagger) into several categories, for example lexi-
con gap where a word appears in the training set, but
never with the tag relevant in the test sentences, diffi-
cult linguistics where assigning a correct tag depends
on long-range context unavailable to a tagger based
on local features, and underspecified, where the tag is
simply unclear or ambiguous.
There are indeed cases of POS ambiguity. Con-
sider, for instance, the sentence she was surprised
when she came home. In this case, the word surprised
can be taken as a verb in passive form, describing an
action or event (as in: As she entered her house there
was something that surprised her), but it could also
be an adjectival form, describing a state (as in: some-
thing surprised her earlier in the day, and she then
remained in a surprised state when coming home). In
such cases, one simply cannot assign a unique, cor-
rect tag without additional context.
Compound nouns are another example: While in
some languages, e.g., Swedish, Finnish, German, and
so on, compound nouns are normally closed, in En-
glish many compound nouns are open, such as full
moon, fire truck, waiting room, high school, hot dog,
free trade and so on. In the last example, hot dog is
normally a compound noun, describing a type of food
rather than the state of a dog. Yet, considered sepa-
rately, hot is an adjective here. Should it, in POS
tagging, be marked as such (as indeed the Stanford
tagger does) or should it be marked as a noun, noting
that it is a part of a compound? One can make a sim-
ilar observation for the case of free trade, where the
Stanford tagger marks free as an adjective.
In the cases described above, it is not surprising
that a POS tagger may assign a tag that differs from
the (ambiguous) ground truth. However, as will be il-
lustrated below, there are also many cases which are
quite straightforward where even very good taggers
make rather elementary and surprising errors, indicat-
ing that POS tagging is still far from solved.
Moreover, it is possible that the performance of a
given POS tagger will degrade with time, as a result of
changes in language. Such temporal drift effects have
been found in the related case of named-entity recog-
nition (NER) (Liu and Ritter, 2022). With the rapid
development of new technology, vocabulary changes
in several different ways: Completely new words are
introduced (for example the verb to google), whereas
other words change their meaning, for example the
verb tweet that, today, more commonly means posting
something on (the now renamed) Twitter, rather than
a bird making a sound. Another problem is that new
POS tags may come to be used for a given word. For
example, especially in words related to new technol-
ogy, it is not uncommon to use as a verb a word that
was previously almost exclusively used as a noun, for
example the word text. This poses a problem espe-
cially for POS taggers whose training is data-driven:
For those taggers, it is sometimes difficult to correctly tag words that are used in a manner not seen in
the training set. Next, we will introduce our data set,
which highlights some of the problems listed above.
3 A NEW TEST SET FOR POS
TAGGING
The purpose of the new POS test set (available at https://doi.org/10.5281/zenodo.10299108) is that it should act as a challenging test for existing POS taggers. The
data set was generated as follows: First, an offline
copy of Wiktionary was parsed to extract sample sen-
tences for the many words contained in the dictionary.
Now, Wiktionary is structured such that, for any given
word, the different usages (as a verb, noun, adjective,
and so on) are organized into separate sections, and
are typically associated with one or several examples,
making it possible to automatically extract a large
number of sentences, for which one word has a spec-
ified POS tag. The structure of the (HTML) pages
is not entirely consistent over the entire Wiktionary,
Table 1: Some basic statistics for the new POS data set.
Here, n denotes the number of instances for the tag in ques-
tion. Note that NUM, DET, PART, and X do not appear
among the tagged words (one per sentence) in our data set.
In total there were 2,227 sentences and, therefore, 2,227
POS tags.
POS tag n Fraction
VERB 977 0.4387
NOUN 578 0.2595
ADJ 524 0.2353
ADV 105 0.0471
ADP 16 0.0072
CONJ 16 0.0072
PRON 11 0.0049
somewhat complicating the automated extraction of
such sentences. Nevertheless, our extraction process,
which will not be described in detail here, resulted in
roughly 67,000 sentences.
Using only the Wiktionary data, there is no sim-
ple way to assign fine-grained POS tags, e.g., distin-
guishing between different verb forms, since the ex-
amples are all grouped in a broader category (e.g.,
verb, in that case). Thus, for our data set, we have
chosen to use the universal POS tags (Petrov et al.,
2012), namely NOUN (common noun), VERB, ADJ
(adjective), ADV (adverb), ADP (adposition, mean-
ing preposition or postposition), CONJ (conjunction),
and PRON (pronoun). There are additional universal
POS tags, namely NUM (number), PART (particle),
X (unknown), DET (determiner), but they did not fea-
ture in our data, and neither did PUNCT (punctua-
tion).
This set was then manually curated, removing any
sentences involving a POS tag that was either deemed
ambiguous (e.g., cases involving passive verbs that
could equally well be taken as adjectival forms) or
simply incorrectly assigned in Wiktionary. From the
remaining sentences, a drastic reduction was carried
out: Sentences were kept only if at least one of the
four benchmark taggers (listed in Section 4.1 below)
assigned an incorrect POS tag, resulting in a set with
2,227 sentences: Even though the four benchmark
POS taggers were correct in most cases (roughly in
accordance with their general performance estimates;
see Table 3), here we are specifically interested in
cases where those taggers fail. Some basic statistics
for our data set are shown in Table 1, whereas some
examples of sentences are shown in Table 2.
As noted above, for every sentence in our data set,
only a single word is POS tagged. Clearly, as the
benchmark taggers make use of surrounding words
when assigning POS tags, our data set cannot be used
for training a POS tagger, but it can be used for test-
ing an existing tagger. The procedure is as follows:
Table 2: Some examples of sentences from our data set. The POS-tagged word is shown underlined, in bold. The ground truth
tag is given in the second column, and the tags assigned by the four benchmark taggers are given in the remaining columns.
Note that many sentences in the data set are considerably longer; short sentences were chosen here simply to fit in the table.
The DNN-based tagger (see Section 4.2) correctly tagged the first three sentences, but failed on the last two.
Sentence Ground truth Brill Hunpos Stanford Perceptron
How does this bear on the question ? VERB NOUN NOUN VERB NOUN
Seaweed clung to the anchor . VERB VERB VERB NOUN NOUN
That doctor is nothing but a lousy quack ! NOUN NOUN PART X NOUN
When you quiet , we can start talking . VERB ADJ ADJ VERB VERB
He bought a used car . ADJ VERB ADJ VERB ADJ
For every sentence, the tagger is applied to the entire
sentence, assigning tags to each word. Next, the as-
signed tag for the single POS-tagged word (for a given
sentence) is mapped to the universal tag set, and can
then be compared to the ground truth tag, from our
data set. Thus, even though the set is rather small, it
is large enough to provide interesting insight into the
reasons for failures of various taggers.
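A minimal sketch of this evaluation loop is given below; the (tokens, target index, gold tag) triple format and the use of NLTK's default tagger are illustrative assumptions, not the authors' actual tooling.

```python
# Sketch of the evaluation procedure described above: tag the whole sentence,
# extract the tag of the single annotated word, map it to the universal tag set,
# and compare it with the ground truth. The data format shown is hypothetical.
import nltk
from nltk.tag import map_tag

nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("universal_tagset", quiet=True)

def evaluate(examples, tagger=nltk.pos_tag):
    """examples: iterable of (tokens, target_index, gold_universal_tag)."""
    correct = total = 0
    for tokens, idx, gold in examples:
        tagged = tagger(tokens)                          # tag the entire sentence
        penn_tag = tagged[idx][1]                        # tag of the annotated word
        pred = map_tag("en-ptb", "universal", penn_tag)  # map to universal tags
        correct += int(pred == gold)
        total += 1
    return correct / total

# Hypothetical usage with one sentence from Table 2:
examples = [(["Seaweed", "clung", "to", "the", "anchor", "."], 1, "VERB")]
print(evaluate(examples))
```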
4 POS TAGGERS
In Section 4.1 we list and describe four commonly
used POS taggers that, as mentioned above, were used
when defining our POS data set: For every sentence
in our data set, at least one of the four taggers failed
to assign a correct POS tag to the word under consid-
eration.
Using the procedure just described, one obtains a
data set for which the absolute performance (of the
four benchmark taggers) does not convey much infor-
mation, since the sentences were indeed chosen delib-
erately to make those taggers fail. Thus, in addition,
in Section 4.2 we consider also a recent, state-of-the-
art tagger (Akbik et al., 2018) based on Bi-LSTMs.
4.1 Benchmark Taggers
The benchmark taggers that we have used for defin-
ing our data set are the Brill, Hunpos, Stanford, and
Perceptron taggers. We chose these four high-quality,
frequently used POS taggers for comparison because
they represent a variety of different approaches to
POS tagging. In all cases except Brill, a standard pre-
trained version is available and free to use. We also
included the Brill tagger despite the absence of a pre-
trained version, since it is an influential and much-
used POS tagger, and because it is rule-based, while
the others are primarily statistical. In what follows we
provide a brief description of each of these taggers.
4.1.1 Brill Tagger
The Brill tagger (Brill, 1992), is one of the pioneer-
ing POS taggers in the field of computational linguis-
tics. It is based on transformation-based learning, a
rule-based approach that iteratively refines tag assign-
ments by applying a set of transformation rules to the
initial tag sequence. These rules are learned from an-
notated training data and aim to correct tagging er-
rors progressively. The Brill tagger achieved state-
of-the-art accuracy during its time and served as a
foundation for subsequent tagger development. Since
there is no standard pre-trained version of the tagger,
we trained the model similarly to the original version
over a random sample (50%) of the Brown corpus.
In particular, we used a unigram tagger with a simple
regular-expression-based backoff as initial tagger and
the original 24 templates defined in (Brill, 1992).
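The sketch below reproduces this kind of setup with NLTK's transformation-based learning tools; the specific regular expressions, random seed, and rule limit are illustrative choices, since the exact values used here are not listed.

```python
# Sketch of a Brill-style training setup: a regular-expression backoff tagger,
# a unigram tagger trained on a 50% random sample of Brown, and the 24 classic
# templates (available in NLTK as brill24()). Parameters are illustrative.
import random
import nltk
from nltk.corpus import brown
from nltk.tag import RegexpTagger, UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

nltk.download("brown", quiet=True)

sents = list(brown.tagged_sents())
random.seed(0)
random.shuffle(sents)
train = sents[: len(sents) // 2]            # 50% random sample of the Brown corpus

regexp = RegexpTagger([
    (r"\d+(\.\d+)?$", "CD"),                # numbers
    (r".*ing$", "VBG"),                     # gerunds
    (r".*ed$", "VBD"),                      # past-tense verbs
    (r".*", "NN"),                          # default: noun
])
initial = UnigramTagger(train, backoff=regexp)

trainer = BrillTaggerTrainer(initial, brill24(), trace=0)
brill_tagger = trainer.train(train, max_rules=200)
print(brill_tagger.tag("She slowly opened the heavy door".split()))
```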
4.1.2 Hunpos Tagger
The Hunpos tagger is a statistical POS tagging tool
that employs hidden Markov models (HMMs) to pre-
dict the most likely tag sequence for a given input
sentence (Halácsy et al., 2007). It is an open source
reimplementation of the Trigrams’n’Tags (TnT) tag-
ger (Brants, 2000). The tagger leverages both word-
level and contextual information, taking into account
the probabilities of transitions between POS tags and
the emission probabilities of words given their tags.
We considered the current pre-trained version (v.1.0-en_wsj, available at https://code.google.com/archive/p/hunpos/downloads)
, which has been trained on the Wall Street
Journal (WSJ) section of the Penn Treebank dataset,
with the usual division into training, validation, and
test sets (Collins, 2002).
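If one wishes to reproduce this setup, the model can be invoked through NLTK's Hunpos wrapper; the paths below are placeholders, and the model and binary must be downloaded separately (this usage sketch is not taken from the paper).

```python
# Sketch: calling a pre-trained Hunpos model through NLTK's wrapper. The paths
# are placeholders; depending on the NLTK version, tags may be returned as bytes.
from nltk.tag.hunpos import HunposTagger

ht = HunposTagger(
    path_to_model="/path/to/en_wsj.model",   # placeholder: pre-trained WSJ model
    path_to_bin="/path/to/hunpos-tag",       # placeholder: hunpos-tag executable
)
print(ht.tag("Seaweed clung to the anchor .".split()))
ht.close()                                    # terminate the external process
```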
4.1.3 Stanford Tagger
The Stanford POS tagger, developed by the Stan-
ford NLP Group, is a widely used POS tagging tool
that combines both rule-based and probabilistic ap-
proaches (Toutanova et al., 2003). It employs a maxi-
mum entropy model to assign tags to words based on
features such as word identity, word shape, and sur-
rounding words. We used the current pre-trained ver-
sion (v4.2.0, english-bidirectional-distsim model, available at https://nlp.stanford.edu/software/tagger.shtml)
. Like the Hunpos tagger, the Stanford
tagger was trained on the Penn Treebank dataset.
4.1.4 Perceptron Tagger
The Greedy Averaged Perceptron tagger written
by Honnibal is currently the standard tagger in
the Python NLTK (Natural Language Toolkit) li-
brary (Bird et al., 2009; https://www.nltk.org, version 3.8.1) and, as such, is frequently
used as a default option in POS tagging. The per-
ceptron algorithm is a linear classification algorithm
that iteratively updates weights to train a discrimina-
tive model for assigning tags to words. We considered
the current pre-trained version (v.3.8.1) present in the
library, which, like the previous two taggers, has been
trained on the Penn Treebank corpus.
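A minimal usage sketch follows (this tagger is also what nltk.pos_tag invokes by default):

```python
# Sketch: loading NLTK's pre-trained Greedy Averaged Perceptron tagger directly.
import nltk
from nltk.tag.perceptron import PerceptronTagger

nltk.download("averaged_perceptron_tagger", quiet=True)  # resource name for NLTK 3.8.1

tagger = PerceptronTagger()                   # loads the pre-trained English model
print(tagger.tag(["He", "bought", "a", "used", "car", "."]))
```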
4.2 DNN-based Tagger
Given that our data set was defined by deliberately
choosing sentences where at least one of the four
benchmark taggers failed to tag the selected word cor-
rectly, the performance obtained for those four tag-
gers will, of course, be artificially reduced. Thus, in
order to make a true out-of-sample test of POS tag-
ger performance over our set, we also considered a
tagger based on deep neural networks (DNNs), more
precisely the Bi-LSTM tagger presented by (Akbik
et al., 2018), which has a reported accuracy of 0.978
over the Penn Treebank test set, using the data divi-
sion described in (Collins, 2002). There are many
other DNN-based taggers, see, for example, (Ya-
sunaga et al., 2017; Ling et al., 2015), but their re-
ported performance is not much different from that
reported in (Akbik et al., 2018). Thus, for our pur-
poses, it is sufficient to consider only this tagger.
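This tagger is distributed through the Flair library; a minimal usage sketch is shown below (the model identifier "pos" and the label-access call are assumptions that may vary slightly between Flair versions).

```python
# Sketch: tagging a sentence with the Flair sequence tagger of Akbik et al. (2018).
# The model name "pos" and the label API are assumptions for recent Flair versions.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("pos")           # downloads the pre-trained English model
sentence = Sentence("When you quiet , we can start talking .")
tagger.predict(sentence)
for token in sentence:
    print(token.text, token.get_label("pos").value)
```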
5 RESULTS
In this section we present the POS tagging perfor-
mance of the four benchmark taggers and the DNN-
based tagger, over different sets, namely a group of
randomly selected subsets of Brown and Penn Tree-
bank, as well as our set.
Table 3: The performance (measured in terms of accuracy)
of the four benchmark taggers over subsets of two standard
POS data sets, mapped to the universal tag set. The ranges
shown are the 95% confidence intervals, assuming a Gaus-
sian distribution.
Tagger Data Set Accuracy
Brill Brown 0.966 ± 0.001
Brill Penn Treebank 0.903 ± 0.001
Hunpos Brown 0.919 ± 0.001
Hunpos Penn Treebank 0.987 ± 0.000
Stanford Brown 0.940 ± 0.002
Stanford Penn Treebank 0.967 ± 0.000
Perceptron Brown 0.917 ± 0.001
Perceptron Penn Treebank 0.974 ± 0.000
Table 4: The performance of the four benchmark taggers
over the new POS data set, measured in terms of accuracy.
Tagger Accuracy
Brill 0.473
Hunpos 0.482
Stanford 0.673
Perceptron 0.561
5.1 Benchmark Tagger Performance
The benchmark taggers were typically trained on one
of the two standard POS tagging data sets mentioned
in Section 1, using holdout validation with a train-
ing set, a validation (development) set, and a final test
set. The reported results generally refer to the perfor-
mance of each tagger over the test set. Since the tag-
gers were trained on different data sets (with varying
training-validation-test splits) and using different tag
sets, it is difficult to make an exact, direct compari-
son. Thus, here, we chose to generate five subsets,
each with 2,000 randomly selected sentences, from
either the Brown data set or the Penn Treebank data
set, thus giving a total of 10 subsets. We then com-
puted the performance of our four benchmark taggers
over those subsets, using their original tag sets during
tagging, and then mapping the results to the universal
POS tags (Petrov et al., 2012). The results are shown
in Table 3. As can be seen, the performance is gen-
erally quite good, with accuracies in the range 0.90 -
0.99, and in all cases with very small variation over
the subsets.
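One way to obtain intervals of this kind, consistent with the Gaussian assumption stated in the caption of Table 3 (the exact procedure is not spelled out in the text), is to compute the mean accuracy over the subsets together with a 95% interval based on the standard error:

```python
# Sketch: mean accuracy and a 95% confidence interval over the random subsets,
# under a Gaussian assumption. Illustrative only; the per-subset accuracies
# below are made up.
import math

def mean_and_ci95(accuracies):
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)   # sample variance
    return mean, 1.96 * math.sqrt(var / n)                     # 1.96: 95% Gaussian quantile

print(mean_and_ci95([0.965, 0.966, 0.967, 0.966, 0.965]))
```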
Looking at our POS test set, where the accuracy
is measured based on the ability of the taggers to tag
the selected word (see also Table 2), the situation is
quite different, as can be seen in Table 4. Now, it
is not surprising that the performance is much lower
than that shown in Table 3, since our data set was
defined by deliberately selecting sentences where at
least one of the benchmark POS taggers fails. Thus,
while the absolute performance values in Table 4 are
Table 5: The performance of the DNN-based tagger over
the same subsets (first and second row) used in Table 3, and
over our POS data set (last row). In all cases, the results
refer to the universal tag set.
Data set Accuracy
Brown 0.949 ± 0.001
Treebank 0.972 ± 0.000
Our data set 0.868
not so relevant, we note the rather large relative per-
formance differences of these four taggers, which can
be compared with their about-equal performance in
Table 3. Perhaps more importantly, and as analyzed
in Section 6 below, they fail in many cases where the
POS tag assignment is, in fact, quite straightforward.
5.2 DNN-based Tagger Performance
Turning now to the DNN-based tagger (Akbik et al.,
2018), Table 5 shows the results obtained over the
same 10 subsets used for the four benchmark taggers
(Table 3), using the Penn tag set and mapping the final
results to the universal POS tags (Petrov et al., 2012).
As can be seen in the table, the performance over the
10 subsets fell roughly in the range 0.95–0.97. Thus,
in most cases (though not all), the DNN-based tagger
does a bit better than the four benchmark taggers (see
Table 3). By contrast, for our data set, the accuracy
was 0.868, indicating that our data set is indeed chal-
lenging for the tagger.
6 DISCUSSION
The main aim of this paper has been to generate a
novel challenging data set for POS tagging. Over
such a data set, one would expect to find worse perfor-
mance (lower accuracy) of even a high-quality DNN-
based tagger, such as the one introduced by (Akbik
et al., 2018). This is indeed what we find: Over our
set, this tagger achieves an accuracy (using the univer-
sal POS tag set) of 0.868, quite a bit below its accu-
racy of roughly 0.95–0.97 over the older, commonly
used data sets, such as Brown and Penn Treebank.
One can also measure the performance of the very
simplest POS tagger, namely the unigram tagger that,
for every word, simply assigns the most frequent POS
tag for the word in question. Using the vocabulary
from the (entire) Brown data set to define the unigram
tagger, it achieves an accuracy of only around 0.31
over our set, a number that can be compared with ac-
curacies of 0.96 and 0.85 obtained when applying the
same tagger over the Brown set and (a subset of) the
Penn Treebank set, respectively. This dramatic drop
in performance is partly due to the fact that the vo-
cabulary for the unigram tagger was sourced from the
Brown set, a set developed more than 50 years ago
that contains no instances of many of the words in
our data set. However, even if all such words are dis-
regarded in the performance computation, the accu-
racy of the unigram tagger is only around 0.40 over
our set, again illustrating that our data set is indeed
challenging.
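A sketch of such a unigram baseline, built with NLTK's UnigramTagger on the Brown corpus (the authors' exact baseline setup may differ), is shown below.

```python
# Sketch: a unigram baseline that assigns each word its most frequent tag in the
# Brown corpus, with Brown tags mapped to the universal tag set. Unknown words
# receive the tag None unless a backoff tagger is supplied.
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)

train = brown.tagged_sents(tagset="universal")
unigram = UnigramTagger(train)

print(unigram.tag("The glue sets in five minutes".split()))
```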
Besides these overall measures, it is interesting
to make a more detailed analysis regarding failures
of the various taggers over our data set, a topic that
we will consider next. In that analysis, in addition
to the DNN-based tagger, it is interesting to consider
also the four benchmark taggers (especially Stanford)
whose reported performance over their test sets is not
far behind that of the DNN-based tagger (see Sec-
tion 1), even though their performance over our set
is of course lower, bearing in mind how our set was
generated.
6.1 Analysis of Tagger Failure
The first thing to note is that, somewhat surprisingly,
the taggers occasionally fail even in cases where the
tag assignment is rather straightforward, as evidenced
by the entries in Table 2. This applies also to the DNN
tagger: While it correctly tagged the first three exam-
ples in that table, it failed on the last two.
Proceeding with a more detailed analysis, we note
that some errors can occur due to similar morpho-
logical features shared between tags. For example,
misclassifications between nouns and verbs happen in
words ending in the suffix -s (faces, casts, bears) and
-ing (shipping, googling, ironing). Similarly, the -y
suffix is shared between nouns and adjectives (acces-
sory, cheesy, tidy), and between adjectives and verbs
we see the suffix -ing (annoying, darling) and -ed
(cursed, dated, requested). The number of misclas-
sified words with a given suffix is rather balanced be-
tween these tag pairs, with the exception of verbs and
nouns, where the -s suffixed words are more often
misclassified as nouns rather than vice versa.
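A small helper of the kind used for this analysis could look as follows; the (word, gold tag, predicted tag) error format is hypothetical.

```python
# Sketch: counting tag confusions by word suffix over a list of tagging errors.
# The error triples (word, gold_tag, predicted_tag) are a hypothetical format.
from collections import Counter

SUFFIXES = ("s", "ing", "ed", "y")

def confusions_by_suffix(errors):
    counts = Counter()
    for word, gold, pred in errors:
        for suffix in SUFFIXES:
            if word.lower().endswith(suffix):
                counts[(suffix, gold, pred)] += 1
    return counts

errors = [("bears", "VERB", "NOUN"), ("shipping", "VERB", "NOUN"), ("cheesy", "ADJ", "NOUN")]
print(confusions_by_suffix(errors))
```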
Besides inspecting the spelling of words, taggers
typically use surrounding words and their predicted
tags (when available) to resolve ambiguities. The
spelling of surrounding context words can also be
similar between two classes, making it difficult for the
taggers to choose the correct tag based on this infor-
mation. For verbs and nouns, the preceding word to
appears to be problematic. Here, the four benchmark
taggers often fail to make the distinction between us-
ing to as a preposition combined with a noun, e.g., I
go to church every day, and the to-infinitive, e.g., It
is time to board the aircraft. In these examples, it is
clear that church is a noun and board is a verb, yet the
taggers struggle to distinguish between the two tags
in these types of contexts. The DNN-based tagger
performs better in these cases, but is not completely
error-free either. For example, it tags the verb bus in
the sentence He was hired to bus tables . . . as a noun.
Analyzing tags of surrounding words, we see that
for adjectives mistagged as nouns the target word is
often preceded by a determiner and followed by a
noun. This sequence of tags (DET NOUN NOUN)
is of course valid and it appears in the training sets,
for example, for compound nouns (a campaign co-
ordinator), two-part named entities (the Ivory Coast)
and possessives (her mother’s assistant). However, in
the following tagged examples, taken from our data
set, it is obvious that the mistagged word (shown here with its incorrectly assigned tag) does not belong to any of those cat-
egories. For example, in the sentence (That, DET),
(was, VERB), (a, DET), (classy, NOUN), (response,
NOUN), the word classy is clearly an adjective, even
though some of the four benchmark taggers incor-
rectly tag it as a noun. Another example is the sen-
tence (He, PRON), (walked, VERB), (down, ADP),
(the, DET), (lit, NOUN), (corridor, NOUN), where
lit is incorrectly tagged as a noun by the DNN tag-
ger. It is not surprising that simple taggers that do not
consider labels of next tokens (unidirectional taggers)
would fail in such cases, but with the taggers evalu-
ated here, it should not be a problem. We also see
cases where some of the taggers do tag examples of
this kind correctly, indicating that there are surprising
inconsistencies. For example, if we replace the word
lit in the previous sentence with dark, the DNN tagger
classifies the word correctly as an adjective.
Incorrect labels might also be assigned when other
tokens in the sequence have been misclassified first.
In the following sentence, the Perceptron tags the
word neat as a noun, and upon manual inspection, it
seems to be because it follows the word whisky, which
has been resolved as an adjective: (I, PRON), (like,
VERB), (my, PRON), (whisky, ADJ), (neat, NOUN).
It is clear that whisky is a noun in this context. We
also note that the inspected token neat is correctly
tagged by the DNN tagger, even though the previ-
ous token whisky is not: (I, PRON), (like, VERB),
(my, PRON), (whisky, ADJ), (neat, ADJ). This type
of sequence may be challenging due to the fact that,
in English, adjectives are typically placed before the
noun (attributive adjectives), but in this sentence the
adjective neat appears after the noun (predicative and
postpositive adjectives).
Many of the words that the taggers have failed to
tag correctly include modern vocabulary related to,
for example, computers, programming, finance and
so on, indicating that there are some temporal drift ef-
fects causing misclassifications as well. The taggers
are sometimes not able to generalize so as to handle
either newly derived words (e.g., hyperlink, botting)
or existing words adapted to a new meaning (e.g., the
verb text as in please text me when you are done.).
Regarding smaller word classes included in our
data set, only one pronoun is tagged correctly (by
Hunpos and the DNN, in that case). None of the con-
junctions are correctly tagged by any tagger. A pos-
sible explanation is that the data sets used to train the
taggers do not include any examples where the words
are used in this manner.
Whether the evaluated taggers learn general gram-
matical rules is questionable. We notice that around
a third of the sentences where a verb is misclassi-
fied as a noun are tagged such that they contain no
verb at all. A grammatically correct sentence typ-
ically has at least one verb, and the absence of it
should be a clear indication that there is a mistake in
the tagged output. For example, in the sentence (The,
DET), (glue, NOUN), (sets, NOUN), (in, ADP), (five,
NUM), (minutes, NOUN), it is clear that sets is a verb,
even though it can also be used as a noun in other con-
texts. The taggers do not consider the grammatical
correctness of the tagged sentences as a whole.
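A simple check along these lines, operating on a tagged sentence in the universal tag set, could look as follows (illustrative, not part of the taggers themselves):

```python
# Sketch: flagging tagged sentences that contain no VERB tag at all, which, as
# argued above, often indicates a verb misclassified as a noun.
def has_no_verb(tagged_sentence):
    """tagged_sentence: list of (word, universal_tag) tuples."""
    return all(tag != "VERB" for _, tag in tagged_sentence)

tagged = [("The", "DET"), ("glue", "NOUN"), ("sets", "NOUN"),
          ("in", "ADP"), ("five", "NUM"), ("minutes", "NOUN")]
print(has_no_verb(tagged))   # True: likely a tagging error somewhere in the sentence
```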
The DNN tagger predicts more nouns and verbs
correctly than the other taggers, as well as adjectives
when compared to Perceptron, Hunpos, and Brill.
However, comparing its results to the Stanford tagger,
the performance of the DNN tagger is not much better
for the remaining classes (ADP, ADV, CONJ, PRON).
In fact, the Stanford tagger correctly predicts more ad-
verbs than the DNN tagger does. It appears that the
DNN tagger often fails in similar circumstances as the
other four taggers, as we have illustrated in the exam-
ples in this section.
6.2 Ongoing Work
In current work, we are considering a corrective ap-
proach for improving tagger performance, where sen-
tences are first tagged by one of the taggers consid-
ered in this paper. In that process, certain words
and patterns (described below) are identified for fur-
ther analysis, which is carried out by applying a set
of rules. For example, short, common words such
as away, as, because, and so on, for which the tag
assignment (typically ADJ, ADP, ADV, or CONJ)
is often difficult and sometimes cannot be inferred
from surrounding words, would be flagged and then
checked against the rules. If any rule fits, the tag
prescribed by the rule would be assigned. If not, the
tag assigned by the original tagger would be kept. In
some cases, the rules could involve more generic pat-
terns and associated tag assignments. Consider, for
example, the phrase . . . a lousy quack ! (see Table 2).
In general, the word quack is either a noun or a verb,
and here evidently a noun. This case would be han-
dled correctly by a rule such as DET ADJ {NOUN or VERB} → DET ADJ NOUN, where, in this case,
the rule should only be applied at the end of a sen-
tence. The curly brackets indicate that, from dictio-
nary information (rather than just training data), the
only possible tags for the last word are as listed. Simi-
larly, this rule would also handle phrases such as . . . a
bitter harvest (if it ends a sentence), where harvest
could be a verb or a noun, but in this case is a noun.
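A sketch of how such a rule could be applied on top of an existing tagger's output is given below; the toy lexicon and the exact rule format are illustrative, since the corrective approach is still under development.

```python
# Sketch of the corrective rule DET ADJ {NOUN or VERB} -> DET ADJ NOUN, applied
# only at the end of a sentence. The lexicon of admissible tags per word is a toy
# stand-in for dictionary information (e.g., from Wiktionary).
ADMISSIBLE = {"quack": {"NOUN", "VERB"}, "harvest": {"NOUN", "VERB"}}

def apply_rule(tagged):
    """tagged: list of (word, universal_tag) tuples for one sentence."""
    end = len(tagged)
    while end > 0 and tagged[end - 1][1] == "PUNCT":   # skip trailing punctuation
        end -= 1
    if end >= 3:
        (_, t1), (_, t2), (w3, _) = tagged[end - 3:end]
        if t1 == "DET" and t2 == "ADJ" and ADMISSIBLE.get(w3.lower()) == {"NOUN", "VERB"}:
            tagged[end - 1] = (w3, "NOUN")             # rewrite the final tag to NOUN
    return tagged

tagged = [("a", "DET"), ("lousy", "ADJ"), ("quack", "VERB"), ("!", "PUNCT")]
print(apply_rule(tagged))   # -> [..., ("quack", "NOUN"), ("!", "PUNCT")]
```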
7 CONCLUSION
In this paper, we have introduced a novel, challenging
test set for POS tagging, with a single tagged word per
sentence. In the development of the data set, we delib-
erately chose cases where at least one of four standard
benchmark POS taggers fails to assign correct POS
tags. We then applied a state-of-the-art DNN-based
POS tagger to our data set, for a true out-of-sample
test, and found a considerable drop in accuracy, from
around 0.95–0.97 over standard POS data sets (in
line with reported values in the literature) to around
0.87 over our set, thus illustrating that POS tagging
still presents significant challenges. Importantly, in
our new data set, we explicitly removed ambiguous
cases, so that linguistic ambiguity cannot be invoked
as an explanation for tagger failure. Indeed, as our
analysis shows, we find many cases where the POS
tagging is quite straightforward, but where both the
four benchmark taggers and the DNN-based tagger (albeit to a lesser degree) nevertheless fail to assign a
correct tag.
REFERENCES
Akbik, A., Blythe, D., and Vollgraf, R. (2018). Contextual
string embeddings for sequence labeling. In Proceed-
ings of the 27th international conference on computa-
tional linguistics, pages 1638–1649.
Bird, S., Klein, E., and Loper, E. (2009). Natural language
processing with Python: analyzing text with the natu-
ral language toolkit. O’Reilly Media, Inc.
Brants, T. (2000). Tnt: A statistical part-of-speech tag-
ger. In Proceedings of the Sixth Conference on Ap-
plied Natural Language Processing, ANLC ’00, page
224–231, USA. Association for Computational Lin-
guistics.
Brill, E. (1992). A simple rule-based part of speech tag-
ger. In Proc. of the Third Conference on Applied Nat-
ural Language Processing, ANLC ’92, page 152–155,
USA. Association for Computational Linguistics.
Chiche, A. and Yitagesu, B. (2022). Part of speech tagging:
a systematic review of deep learning and machine
learning approaches. Journal of Big Data, 9(1):1–25.
Collins, M. (2002). Discriminative training methods for
hidden markov models: Theory and experiments with
perceptron algorithms. In Proceedings of the 2002
conference on empirical methods in natural language
processing (EMNLP 2002), pages 1–8.
Francis, W. N. and Kucera, H. (1979). Brown corpus
manual. Technical report, Department of Linguistics,
Brown University, Providence, Rhode Island, US.
Halácsy, P., Kornai, A., and Oravecz, C. (2007). Poster pa-
per: HunPos – an open source trigram tagger. In Proc.
of the 45th Annual Meeting of the Association for
Computational Linguistics Companion Volume Pro-
ceedings of the Demo and Poster Sessions, pages 209–
212, Prague, Czech Republic. Association for Compu-
tational Linguistics.
Ling, W., Luís, T., Marujo, L., Astudillo, R. F., Amir, S.,
Dyer, C., Black, A. W., and Trancoso, I. (2015). Find-
ing function in form: Compositional character mod-
els for open vocabulary word representation. arXiv
preprint arXiv:1508.02096.
Liu, S. and Ritter, A. (2022). Do CoNLL-2003 Named En-
tity Taggers Still Work Well in 2023? arXiv preprint
arXiv:2212.09747.
Manning, C. D. (2011). Part-of-speech tagging from 97%
to 100%: Is it time for some linguistics? In Gelbukh,
A. F., editor, Computational Linguistics and Intelli-
gent Text Processing, pages 171–189, Berlin, Heidel-
berg. Springer Berlin Heidelberg.
Marcus, M. P., Santorini, B., Marcinkiewicz, M. A., and
Taylor, A. (1999). Treebank-3. Linguistic Data Con-
sortium, Philadelphia.
Petrov, S., Das, D., and McDonald, R. (2012). A universal
part-of-speech tagset. In Proceedings of the Eighth In-
ternational Conference on Language Resources and
Evaluation (LREC’12), pages 2089–2096, Istanbul,
Turkey. European Language Resources Association
(ELRA).
Ramshaw, L. A. and Marcus, M. P. (1999). Text chunk-
ing using transformation-based learning. In Natural
language processing using very large corpora, pages
157–176. Springer.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y.
(2003). Feature-rich part-of-speech tagging with a
cyclic dependency network. In Proceedings of the
2003 Human Language Technology Conference of the
North American Chapter of the Association for Com-
putational Linguistics, pages 252–259.
Wu, Z., Deshmukh, A. A., Wu, Y., Lin, J., and Mou, L.
(2023). Unsupervised Chunking with Hierarchical
RNN. arXiv preprint arXiv:2309.04919.
Yasunaga, M., Kasai, J., and Radev, D. (2017). Robust mul-
tilingual part-of-speech tagging via adversarial train-
ing. arXiv preprint arXiv:1711.04903.