Identifying Boundaries and Semantic Labels of Economic
Entities Using Stacking and Re-sampling
Katia Lida Kermanidis
Artificial Intelligence Group, Department of Electrical and Computer Engineering
University of Patras, Rio 26500, Greece
Abstract. Semantic entities of the economic domain are detected and labeled in
free Modern Greek text using instance-based learning in two phases (stacking),
to force the classifier to learn from its mistakes, and random undersampling of
the majority class, to improve the classification accuracy of the minority-class
instances. Without making use of any external resources (gazetteers, etc.) and
with only limited linguistic information for pre-processing, a mean f-score of
73.3% is achieved for the minority classes.
1 Introduction
The tagging of semantic entities in written text is an important subtask for information retrieval and data mining and refers to the task of identifying the entities and
assigning them to the appropriate semantic category.
One major subclass of semantic entities is that of named entities (names of persons, organizations, locations, etc.). Automatic named entity recognition (NER) has been attracting the interest of numerous researchers in recent years. Hendrickx and van den Bosch [3] employ manually tagged and chunked English and German datasets, and use memory-based learning to learn new named entities belonging to four categories. They perform iterative deepening to optimize algorithmic parameters and feature selection, and extend the learning strategy by adding seed list (gazetteer) information, by performing stacking, and by making use of unannotated data. They report an average f-score over all four categories of 78.20% on the English test set. Another approach that makes use of external gazetteers is described in [1], where a Hidden Markov Model and a Semi-Markov Model are applied to the CoNLL 2003 dataset; the authors report a mean f-score of 90%. Multiple stacking is also employed in [10] on Spanish and Dutch data, with reported mean f-scores of 71.49% and 60.93%, respectively. The work in [9] focuses on the Natural History
domain. They employ a Dutch zoological database to learn three different named-
entity classes, and use the contents of specific fields of the database to bootstrap the
named entity tagger. In order to learn new entities they, too, train a memory-based
learner. Their reported average f-measure reaches 68.65% for all three entity classes.
Other approaches ([7], [11]) utilize combinations of classifiers in order to tag new
named entities by ensemble learning.
This paper describes the automatic recognition of semantic entities related to the
economic domain in Modern Greek free text. Unlike in previous approaches to NER,
the semantic entities in the present work are not limited to named entities only, such
as names of organizations, persons and locations. First, they also cover names of
stocks and bonds, as well as names of newspapers (due to the newswire genre of the
corpus). Furthermore, there are other semantic types that are important for economic
information retrieval, like quantitative units (e.g. denoting stock and fund quantities,
monetary amounts, stock values), percentages etc. Temporal words and expressions
are also identified due to their importance for data mining tasks.
This information appears in free text in either one-word or multi-word expressions. The present work views semantic entity recognition as a two-task experiment: the first task is to detect the boundaries (the beginning and the end) of these expressions; the second is to assign a semantic label to each of them.
The corpus used in the experiments is automatically tagged with part-of-speech (pos), basic token type information (whether a token is a number, a symbol, an abbreviation, an acronym, etc.) and elementary morphological information (case, number and gender). This information is represented by a set of features (described in detail in section 3.2) that form instance vectors. Context information is also taken into account for recognizing new entities, as the tokens surrounding the candidate entity often determine the classification outcome.
Supervised learning techniques have been employed to learn the boundaries and
the labels of the entities. Learning is performed in two stages: The learner is first
trained on the training data and used to classify new, unseen instances. In the second
stage, the classification predictions of the first stage are added to the instance vector
as extra features to force the classifier to learn from its mistakes.
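A minimal sketch of this two-phase scheme is given below, assuming the instance vectors and class labels have already been encoded as integers; scikit-learn's 1-nearest-neighbour classifier with a Hamming (overlap) distance is used as a stand-in for the memory-based learner, and only the prediction for the focus token is appended here for brevity (the experiments of section 4 also add the predictions for the context tokens).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

def two_phase_classify(X_train, y_train, X_test, cv=10):
    # Stage 1: plain instance-based (1-NN) classification.
    stage1 = KNeighborsClassifier(n_neighbors=1, metric="hamming", algorithm="brute")
    # Out-of-fold predictions on the training set, so the stage-2 learner
    # sees realistic (error-prone) stage-1 output rather than the gold labels.
    train_pred = cross_val_predict(stage1, X_train, y_train, cv=cv)
    stage1.fit(X_train, y_train)
    test_pred = stage1.predict(X_test)

    # Stage 2: append the stage-1 prediction as an extra symbolic feature
    # and retrain, forcing the classifier to learn from its own mistakes.
    X_train2 = np.column_stack([X_train, train_pred])
    X_test2 = np.column_stack([X_test, test_pred])
    stage2 = KNeighborsClassifier(n_neighbors=1, metric="hamming", algorithm="brute")
    stage2.fit(X_train2, y_train)
    return stage2.predict(X_test2)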
Another aspect of the present work that differentiates it from previous approaches
is the attempt to deal with the class imbalance problem. As every sentence token is
considered a candidate semantic entity, the class of negative instances (instances that
do not represent an entity) is highly over-represented in the dataset compared to the
positive classes (instances that do represent an entity). This imbalance has serious consequences for the classification accuracy of the minority-class instances. Random undersampling of the majority-class instances is applied to balance the dataset and improve classification performance.
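The following sketch illustrates the undersampling step, under the assumption that the NULL class is encoded as label 0; majority-class instances are randomly discarded until they are as numerous as the positive instances.

import numpy as np

def random_undersample(X, y, null_label=0, seed=None):
    # Indices of negative (NULL) and positive (entity) instances.
    rng = np.random.default_rng(seed)
    null_idx = np.flatnonzero(y == null_label)
    pos_idx = np.flatnonzero(y != null_label)
    # Keep all positives and an equally sized random sample of negatives.
    kept_null = rng.choice(null_idx, size=len(pos_idx), replace=False)
    keep = rng.permutation(np.concatenate([pos_idx, kept_null]))
    return X[keep], y[keep]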
2 Modern Greek
The Modern Greek language has certain properties that are significant for the present
task. First, it is highly inflectional. The case (nominative, accusative, genitive) of
nouns, adjectives or articles affects semantic labeling. For example, the genitive case
may denote possession, quantity, quality, origin, division, etc., as is shown in the
following examples:
Η τιμή ανήλθε στο ποσό των 12.33 €.
The[NOM] price[NOM] reached the[ACC] value[ACC] the[GEN] 12.33 €.
The price reached the value of 12.33 €.
Η Τράπεζα της Ελλάδος
The[NOM] Bank[NOM] the[GEN] Greece[GEN]
The Bank of Greece
As can be induced from these examples, another important property is the agreement of morphological features (case, person, gender and number values) between
consecutive words. The borders of the agreement define the borders of basic nominal
chunks.
Context information is often decisive when trying to detect a semantic entity. In the following example, the verb ανέρχομαι (to reach) is a strong indicator that the entity next to it is an amount/value, because this verb is typically used in Modern Greek to express ‘reaching a value’:
Οι μετοχές ανήλθαν στις 500.
The stocks reached the 500.
The number of stocks reached 500.
3 Data
The experiments described in this paper were run on free Modern Greek text of the economic domain. This section describes the corpus, as well as the extracted feature set.
3.1 Delos
The DELOS Corpus ([4]) is a collection of economic domain texts of approximately
five million words and of varying genre. It has been automatically annotated from the
ground up. Lemmatization and morphological tagging on DELOS was performed by
the analyzer described in [8]. Regarding the morphological information that is crucial
for the present task, tagging includes assigning part-of-speech (pos) categories to
words, assigning case, number and gender tags, detecting acronyms, abbreviations,
numbers and symbols. Accuracy in part-of-speech and case tagging reaches 98% and
94% respectively.
DELOS is a collection of newspaper and journal articles. More specifically, the
collection consists of texts taken from the financial newspaper EXPRESS, reports
from the Foundation for Economic and Industrial Research, research papers from the
Athens University of Economics and several reports from the Bank of Greece. The
documents are of varying genre (press reportage, news, articles, interviews and scientific studies) and cover all the basic areas of the economic domain, i.e. microeconomics, macroeconomics, international economics, finance, business administration, economic history, economic law, public economics, etc. The corpus therefore presents a richness in vocabulary, in linguistic structure, and in the use of idiomatic expressions and colloquialisms that is not encountered in the highly domain- and language-restricted texts normally used for named entity recognition (e.g. medical records, technical articles, tourist site descriptions).
The following table presents some statistical data regarding the composition of the
corpus. Residuals include transliterated words (foreign words written in the Greek
alphabet) and interjections.
Table 1. Statistical data on Delos.
Phrases
    Noun                       36.6%
    Verb                       30.9%
    Prepositional              27%
    Adverbial                  5.5%
Word tokens
    Words                      84.1%
    Punctuation marks          8.9%
    Abbreviations/Acronyms     3.3%
    Numbers                    2.9%
    Other Symbols              0.8%
Words
    In Greek alphabet          98.2%
    In Latin alphabet          1.8%
Words in Greek
    Nouns                      29.9%
    Verbs                      10.2%
    Adjectives                 10.6%
    Pronouns                   3%
    Articles                   15.2%
    Adverbs                    5.8%
    Numerals                   1.5%
    Conjunctions               6%
    Particles                  1.6%
    Prepositions               9.2%
    Residuals                  7%
3.2 Feature Set
Each token in the corpus constitutes a candidate semantic entity. Each candidate entity is represented by a feature-value vector. The features forming the vector are:
1. The token lemma. In the case where automatic lemmatization was not able to
produce the token lemma, the token itself is the value of this feature.
2. The pos category of the token. The values of this feature appear in table 2.
3. The morphological tag of the token. The morphological tag is a string of 3 characters encoding the case, number, and gender of the token, if it is nominal (noun, adjective or article).
4. The case tag of the token. The case tag is one of three characters denoting the
token case.
5. Capitalization. A Boolean feature encoding whether the first letter of the token
is capitalized or not.
Table 2. Values of the pos feature.
Tag Description
N Noun
V Verb
A Adjective
P Pronoun
T Article
N Numeral
C Conjunction
R Adverb
S Preposition
F Punctuation mark
U Particle
Xa Acronym
Xb Abbreviation
For each candidate entity, context information was included in the feature-value
vector, by taking into account the two tokens preceding and the two tokens following
it. Each of these tokens was represented in the vector by the five features described
above. As a result, a total of 25 features (5 features for each of the 5 tokens) are used to form the instance vectors.
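A sketch of how such an instance vector could be assembled is shown below; the Token structure and the padding values used at sentence boundaries are illustrative assumptions, not part of the corpus annotation.

from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    lemma: str  # lemma, or the token itself if lemmatization failed
    pos: str    # part-of-speech tag (table 2)
    morph: str  # 3-character case/number/gender string for nominals
    case: str   # 1-character case tag
    cap: bool   # is the first letter capitalized?

PAD = Token("_", "_", "___", "_", False)  # sentence-boundary padding (assumption)

def token_features(t: Token) -> List[str]:
    return [t.lemma, t.pos, t.morph, t.case, str(t.cap)]

def instance_vector(sentence: List[Token], i: int, window: int = 2) -> List[str]:
    # 5 features for the candidate token and for each context token:
    # 25 values for a (-2, +2) window, 15 for (-1, +1).
    vec: List[str] = []
    for j in range(i - window, i + window + 1):
        tok = sentence[j] if 0 <= j < len(sentence) else PAD
        vec.extend(token_features(tok))
    return vec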
The class label assigns a semantic tag to each candidate token. These tags represent
the entity boundaries (whether the candidate token is the start, the end or inside an
entity) as well as the semantic identity of the token. A total of 40,000 tokens were
manually tagged with their class value. Table 3 shows the various values of the class
feature, as well as their frequency among the total number of tokens.
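As an illustration of the tagging scheme (using the Bank-of-Greece example of section 2 and the tags of table 3), a three-word organization name receives a start, a middle and an end tag, while tokens outside any entity receive the NULL label; labeling the article as NULL is an assumption made for the sake of the example.

# "Η Τράπεζα της Ελλάδος" (The Bank of Greece)
tokens = ["Η", "Τράπεζα", "της", "Ελλάδος"]
labels = ["NULL", "AE", "ME", "TE"]  # start / middle / end of an organization/bank name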
Unlike most previous approaches, which focus on labeling three or four semantic categories of named entities, the present work deals with a total of 30 class values plus the non-entity (NULL) value, as can be seen in table 3.
Another important piece of information disclosed by table 3 is the imbalance between the positive instances (entities), which form only 15% of the total number of instances, and the negative instances (non-entities). This imbalance leads to serious classification problems when trying to classify instances that belong to one of the minority classes ([5]). By randomly removing negative examples, so that their number reaches that of the positive examples ([6]), the imbalance is addressed, and the results show that the classification accuracy of the positive instances improves considerably.
4 Experimental Setup and Results
Instance-based learning (IB1) was the algorithm selected to classify the candidate semantic entities. IB1 was chosen because, by storing all examples in memory, it can deal competently with exceptions and low-frequency events, which are important in language learning tasks ([2]) but are ignored by other learning algorithms.
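A bare-bones version of such a memory-based learner is sketched below: every training instance is stored, and a new instance receives the class of its nearest stored neighbour under the overlap metric (the number of mismatching feature values). Real memory-based learners additionally weight the features, which is omitted in this sketch.

from typing import Any, List, Sequence

def overlap_distance(a: Sequence[Any], b: Sequence[Any]) -> int:
    # Overlap metric: count of feature positions with differing values.
    return sum(x != y for x, y in zip(a, b))

class SimpleIB1:
    def fit(self, X: List[Sequence[Any]], y: List[Any]) -> "SimpleIB1":
        # Memory-based: keep every example, so exceptions and
        # low-frequency events are never abstracted away.
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X: List[Sequence[Any]]) -> List[Any]:
        return [self.y[min(range(len(self.X)),
                           key=lambda i: overlap_distance(x, self.X[i]))]
                for x in X]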
Several experiments were conducted to determine the optimal context window size for the candidate entities. Window sizes of (-2, +2) - two tokens preceding and two following the candidate entity - and (-1, +1) - one token preceding and one following the candidate entity - were tried, and comparative performance results were obtained. When decreasing the size from (-2, +2) to (-1, +1), the number of features forming the instance vectors drops from 25 to 15. The results are shown in table 4.
Another set of experiments focused on comparing classification in one stage and in two stages. In the first stage, the instance-based learner predicts the class labels of the test instances; these results, as noted previously, are presented in table 4. In the second stage, the predictions of the first phase are added to the set of features described in section 3.2. The total number of features in the second stage, when experimenting with the (-2, +2) context window, is 30. The results of learning in two stages with window size (-1, +1) are shown in the first column of table 5.
Table 3. Values of the class label.
Tag Description Percentage
AE Start of company/organization/bank name 1.4%
ME Middle of company/organization/bank name 0.74%
TE End of company/organization/bank name 1.4%
E Company/organization/bank 1-word name 1.1%
AP Start of monetary amount/price/value 0.88%
MP Middle of monetary amount/price/value 0.63%
TP End of monetary amount/price/value 0.88%
AAM Start of number of stocks/bonds 0.3%
MAM Middle of number of stocks/bonds 0.42%
TAM End of number of stocks/bonds 0.3%
AT Start of percentage value 0.73%
MT Middle of percentage value 0.08%
TT End of percentage value 0.73%
AX Start of temporal expression 1%
MX Middle of temporal expression 0.75%
TX End of temporal expression 1%
X 1-word temporal expression 0.55%
AO Start of stock/bond name 0.16%
MO Middle of stock/bond name 0.17%
TO End of stock/bond name 0.16%
ON 1-word stock/bond name 0.05%
AL Start of location name 0.21%
ML Middle of location name 0.48%
TL End of location name 0.21%
L 1-word location name 0.33%
F 1-word newspaper/journal name 0.14%
AN Start of person name 0.18%
MN Middle of person name 0.02%
TN End of person name 0.18%
N 1-word person name 0.06%
Comparative experiments were also performed with and without the removal of negative examples, in order to demonstrate the increase in performance after applying random undersampling to the data. With random undersampling, random instances of the majority class are removed from the dataset until their number reaches that of the positive classes. The classification results after applying the undersampling procedure are presented in the second column of table 5.
Testing of the algorithm was performed using 10-fold cross validation.
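A sketch of this evaluation, again using the 1-nearest-neighbour stand-in on integer-encoded features and labels, could compute the per-class f-scores from the out-of-fold predictions as follows (the encoding step is assumed to have been done already).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

def per_class_fscores(X, y, folds=10):
    clf = KNeighborsClassifier(n_neighbors=1, metric="hamming", algorithm="brute")
    pred = cross_val_predict(clf, X, y, cv=folds)
    labels = np.unique(y)
    return dict(zip(labels, f1_score(y, pred, labels=labels, average=None)))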
Table 4. Comparative results for different context window sizes.
Class F-score (-1,+1) F-score (-2,+2)
NULL 0.969 0.96
AE 0.728 0.683
ME 0.557 0.64
TE 0.768 0.74
AP 0.851 0.767
MP 0.865 0.852
TP 0.84 0.774
E 0.667 0.621
AAM 0.754 0.675
MAM 0.769 0.708
TAM 0.611 0.643
AO 0.353 0.465
MO 0.194 0.293
TO 0.143 0.35
AT 0.911 0.802
MT 0.588 0.857
TT 0.939 0.818
AX 0.585 0.558
TX 0.588 0.492
AL 0.421 0.449
ML 0.059 0.17
TL 0.278 0.293
X 0.452 0.457
F 0.889 0.947
AN 0.286 0.364
TN 0.378 0.632
MX 0.524 0.561
MN 0 0
ON 0 0
N 0.667 0.571
L 0.519 0.506
Table 5. Column A: Results with two-phase learning for context window size (-1,+1). Column
B: Results with two-phase learning for context window size (-1,+1) after applying random
undersampling.
Class F-score A F-score B
NULL 0.981 0.939
AAM 0.895 0.895
AE 0.882 0.899
AL 0.651 0.571
AN 0.65 0.756
AO 0.81 0.85
AP 0.96 0.96
AT 0.985 0.98
AX 0.755 0.806
E 0.721 0.803
F 0.944 1
L 0.55 0.565
MAM 0.944 0.911
ME 0.831 0.808
ML 0.562 0.632
MN 0 0
MO 0.55 0.5
MP 0.957 0.963
MT 0.952 0.952
MX 0.802 0.8
N 0.533 0.571
ON 0 0
TAM 0.865 0.838
TE 0.871 0.903
TL 0.524 0.465
TN 0.65 0.579
TO 0.629 0.611
TP 0.932 0.932
TT 0.954 0.96
TX 0.736 0.774
X 0.567 0.694
5 Discussion
The context window size plays an important role in classification performance. Certain types of entities require a larger window for their accurate detection, while a larger context is misleading for other types. To the former category belong the more ‘straightforward’ types, which are either normally introduced by one characteristic word or acronym/abbreviation, like person names and company names, or which end in one specific symbol or acronym/abbreviation, like monetary amounts and percentages.
As can be seen in table 4, classification for certain types reaches a poor score. Looking more closely at table 3, this can be attributed to the sparseness that characterizes these types (multi-word person names, multi-word stock/bond names, multi-word locations). An interesting exception to this rule is newspaper/journal names, which reach very high scores despite their low frequency, because they are normally introduced by specific words like ‘εφημερίδα’ (newspaper) or ‘περιοδικό’ (journal).
Table 4 also shows the high f-score achieved for the negative (NULL) class compared to that of the positive classes, due to its high over-representation in the dataset.
The first column of table 5 shows the positive effects of stacking on the task at hand: for some entity types, the f-score increases by more than 50 percentage points after applying two-phase learning.
This improvement is due to two reasons: First, the sequential nature of the class label
tags (start, middle, end). The class of one entity depends largely on the class of the
preceding and the following entities. Second, the inclusion of the predicted class of
the candidate entity (from the previous learning stage) in the feature vector of the
second stage forces the classifier to focus on the mistakes it made, and try to correct
them. Difficult cases like multi-word locations and multi-word names are now dealt
with satisfactorily.
Random undersampling also proved highly beneficial for the majority of the entity
categories. It forces the learner to pay more attention to the minority classes. The
random nature of the undersampling process is the reason that the results for certain
entity types were not improved, as certain useful negative examples may have been
removed.
One-word stock/bond names (ON) occur extremely rarely in the corpus. Person names consisting of more than two words (MN) are even rarer. The learner has not been able to detect these classes due to this sparseness.
Given, however, the nature and complexity of the corpus, the low level of pre-processing (compared to previous approaches that use phrase-chunked input), and the large number of class labels, the results of table 5 compare very favorably with those reported in the literature (section 1).
6 Conclusion
This paper has presented a set of methodologies that were applied to a Modern Greek economic corpus in order to detect and label semantic entities of the economic domain that are important for information retrieval, data mining and question-answering systems. Unlike previous approaches to named-entity recognition, the present work aims at identifying a wider range of entities linked to the economic domain (apart from names of persons, organizations and locations), such as names of stocks, newspapers and banks, quantities, percentages, etc. Stacking was performed to help the instance-based classifier focus on the tricky cases and learn from previous mistakes, thereby leading to a significant increase in accuracy. Another novel feature of the present work is the way it deals with the imbalance in the class distribution of the dataset. Further performance improvement was achieved after balancing the class distribution using undersampling of the majority-class instances. The above techniques deal very well with the large number of class labels, the low level of pre-processing, as well as the complicated nature of the corpus.
Acknowledgements
We thank the European Social Fund (ESF), Operational Program for Educational and
Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS
II, for funding the above work.
References
1. Ciaramita, M., Altun, Y.: Named Entity Recognition in Novel Domains with External
Lexical Knowledge. In Workshop on Advances in Structured Learning for Text and Speech
Processing (NIPS) (2005)
2. Daelemans, W., van den Bosch, A., Zavrel, J.: Forgetting Exceptions is Harmful in Language Learning. Machine Learning, Vol. 34 (1999) 11-41
3. Hendrickx, I., van den Bosch, A.: Memory-based One-step Named-entity Recognition: Effects of Seed List Features, Classifier Stacking and Unannotated Data. Proceedings of the 7th Conference on Computational Natural Language Learning (CoNLL), Edmonton, Canada (2003)
4. Kermanidis, K., Fakotakis, N., Kokkinakis, G.: DELOS: An Automatically Tagged Economic Corpus for Modern Greek. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas de Gran Canaria (2002) 93-100
5. Kubat, M., Matwin, S.: Addressing the Curse of Imbalanced Training Sets. Proceedings of the International Conference on Machine Learning (ICML) (1997) 179-186
6. Laurikkala, J.: Improving Identification of Difficult Small Classes by Balancing Class
Distribution. Proceedings of the 8th Conference on Artificial Intelligence in Medicine in
Europe. Cascais, Portugal (2001) 63-66
7. Florian, R., Ittycheriah, A., Jing, H., Zhang, T.: Named Entity Recognition through Classifier Combination. Proceedings of the 7th Conference on Computational Natural Language Learning (CoNLL), Edmonton, Canada (2003) 168-171
8. Sgarbas, K., Fakotakis, N., Kokkinakis, G.: A Straightforward Approach to Morphological
Analysis and Synthesis, In Proceedings of the Workshop on Computational Lexicography
and Multimedia Dictionaries (COMLEX), Kato Achaia, Greece (2000) 3134
9. Sporleder, C., van Erp, M., Porcelijn, T., van den Bosch, A., Arntzen, P.: Identifying Named Entities in Text Databases from the Natural History Domain. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC) (2006)
10. Tsukamoto, K., Mitsuishi, Y., Sassano, M.: Learning with Multiple Stacking for Named Entity Recognition. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL), Taipei, Taiwan (2002) 1-4
11. Wu, C., Jan, S., Tsai, T., Hsu, W.: On Using Ensemble Methods for Chinese Named Entity Recognition. Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing, Sydney, Australia (2006) 142-145