COMPARISON OF NEURAL NETWORKS USED FOR PROCESSING AND CATEGORIZATION OF CZECH WRITTEN DOCUMENTS

Pavel Mautner and Roman Mouček

Department of Computer Science and Engineering, University of West Bohemia in Pilsen

Univerzitní 8, 30614 Pilsen, Czech Republic

Keywords:

Document categorization, WEBSOM, ART-2.

Abstract:

The Kohonen Self-organizing Feature Map (SOM) was developed for clustering input vectors and for projecting a continuous high-dimensional signal onto a discrete low-dimensional space. One application area in which the map can also be used is the processing of collections of text documents. This article describes the basic principles of the WEBSOM method, the transformation of text information into a real-valued feature vector, and the results of document classification. The Carpenter-Grossberg ART-2 neural network, usually used for adaptive vector clustering, was also tested as a document categorization tool. The results achieved with this network are also presented here.

1 INTRODUCTION

Today a huge collection of documents is accessible in electronic libraries or on the Internet. Finding relevant information in this collection of documents is often a difficult and time-consuming task. Efficient search tools such as search engines have quickly emerged to aid in this endeavor.

To make searching faster, the categorization of documents according to their content has become a widely used method. Based on the keywords included in the query, it is possible to estimate the query class (or domain) and then to narrow the search space. This reduces either the searching time or the length of the list of references.

In the past, many document categorization methods have been developed (Manning et al., 2007). One of the interesting approaches to document representation and subsequent classification was presented by Finnish scientists in (Kaski et al., 1998). Their approach is called WEBSOM and it is based on the Kohonen self-organizing feature map (Kohonen, 2001). The method was developed for automatic processing and categorization of Internet documents written in English (or Finnish) and for subsequent information retrieval in these documents.

This paper deals with the application of the WEBSOM method to the categorization of documents written in Czech and with its modification, in which an ART-2 neural network is used as a document categorizer. The paper is organized as follows. Section 2 provides basic information about the architecture and features of the neural networks used for document processing and categorization. Section 3 describes the principles of document representation by a feature vector, word category construction, and document categorization. The results of the experiments and possible future extensions of this work are summarized in Section 4.

2 SYSTEM ARCHITECTURE

2.1 Basic WEBSOM Architecture

The WEBSOM method is based on a two-layer neural network architecture (see Figure 1). The first layer of the WEBSOM, the Word Category Map (WCM), processes an input feature vector. The second layer, the Document Map (DM), categorizes input documents according to information from the WCM output. Both layers of the WEBSOM are based on the Kohonen self-organizing feature map (SOM).

Mautner P. and Mouček R. COMPARISON OF NEURAL NETWORKS USED FOR PROCESSING AND CATEGORIZATION OF CZECH WRITTEN DOCUMENTS. DOI: 10.5220/0003116205100513. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2010), pages 510-513. ISBN: 978-989-8425-28-7. © 2010 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 1: Basic architecture of the WEBSOM.

The SOM is an artificial neural network developed by Teuvo Kohonen. It has been described in several research papers and books (Kohonen, 2001), (Fiesler and Beale, 1997), (Fausett, 1994). Its purpose is to map a continuous high-dimensional space into a discrete space of lower dimension (usually a one- or two-dimensional space). The map contains one layer of neurons, arranged in a two-dimensional grid, and two layers of connections. In the first layer of connections, each neuron is fully connected (through weights) to all feature vector components. Computations are feedforward in the first layer of connections: the network computes a distance between the feature vector $F$ and each of the neuron weight vectors $w_j$ by the following formula:

$$d_j(t) = \sum_{i=0}^{N-1} \left(F_i(t) - w_{i,j}(t)\right)^2, \quad j = 1, 2, \ldots, M, \tag{1}$$

where $t$ is the time point at which the output is observed, $F_i(t)$ are the components of the feature vector, $w_{i,j}(t)$ are the components of the neuron weight vector, $N$ is the number of feature vector components, and $M$ is the number of neurons (and also WCM units).

The second layer of connections acts as a recurrent excitatory/inhibitory network whose aim is to realize the winner-takes-all strategy, i.e. only the neuron with the highest activation level, which corresponds to the smallest distance $d_j(t)$, is selected and designated as the best matching unit (BMU).
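The distance computation and winner selection described above can be illustrated with a minimal sketch. It assumes that the distance in Equation (1) is the squared Euclidean distance and replaces the recurrent winner-takes-all layer with a direct search for the smallest distance; the function names and the toy weights are hypothetical and are not taken from the authors' implementation.

```python
def squared_distances(feature, weights):
    """Return d_j = sum_i (F_i - w_ij)^2 for every neuron j."""
    return [sum((f - w) ** 2 for f, w in zip(feature, w_j)) for w_j in weights]


def best_matching_unit(feature, weights):
    """Index of the neuron whose weight vector is closest to the feature vector."""
    d = squared_distances(feature, weights)
    return d.index(min(d))


# Hypothetical toy map: M = 3 neurons, N = 2 feature vector components.
weights = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
feature = [0.9, 0.8]
bmu = best_matching_unit(feature, weights)
```

In this toy case the second neuron (index 1) is the closest one, so it becomes the BMU.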

Document categorization by the WEBSOM method proceeds in the following manner. First, an input document is parsed and each word is preprocessed and translated into a feature vector (see the following section). The feature vector is clustered by the WCM and the BMU value of the feature vector is saved into the WCM output vector. After all the words from the input document have been processed, the WCM output vector is presented to the input of the document map (DM). The document map processes the WCM output vector and activates one of the output units (the BMU of the document map), which corresponds to the category of the input document. It can be shown (Kaski et al., 1998) that similar documents activate similar DM output units.
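The two-stage procedure can be sketched as follows. This is only an illustration under stated assumptions: the WCM output vector is taken to be a normalized histogram of BMU hits (one bin per WCM unit), both maps are assumed to be already trained, and all names and the toy weights are hypothetical.

```python
from collections import Counter


def nearest(vector, weights):
    """Index of the weight vector with the smallest squared distance to `vector`."""
    dists = [sum((v - w) ** 2 for v, w in zip(vector, w_j)) for w_j in weights]
    return dists.index(min(dists))


def wcm_output_vector(word_vectors, wcm_weights):
    """Histogram of WCM best matching units over all words of one non-empty document."""
    hits = Counter(nearest(v, wcm_weights) for v in word_vectors)
    total = len(word_vectors)
    return [hits[j] / total for j in range(len(wcm_weights))]


def categorize(word_vectors, wcm_weights, dm_weights):
    """Return the index of the document map unit activated by the document."""
    return nearest(wcm_output_vector(word_vectors, wcm_weights), dm_weights)


# Hypothetical toy maps: 2 WCM units over 2-D word vectors, 2 DM units.
wcm_weights = [[0.0, 0.0], [1.0, 1.0]]
dm_weights = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.9, 1.1], [1.0, 0.8], [0.1, 0.0]]
histogram = wcm_output_vector(document, wcm_weights)
unit = categorize(document, wcm_weights, dm_weights)
```

Two of the three words fall into the second word category, so the histogram is dominated by that bin and the second DM unit is activated.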

2.2 Document Categorization using ART Neural Network

Subsection 2.1 described the document categorization system based on the Kohonen map. In that system, the document map creates clusters of similar documents, which have to be calibrated after the training process. During calibration, the output units of the document map are labeled according to the categories of the input documents for which they have become the BMUs. The labeling process can be complicated because there are no clear borders between document clusters.

To simplify this problem, another neural network was used, one whose simple outputs correspond directly to the document categories. Since document separation based on topic similarity is often required, the ART (Adaptive Resonance Theory) network was selected as a good candidate for document categorization. The ART network developed by Carpenter and Grossberg (Carpenter and Grossberg, 1988) is also based on clustering, but its output provides direct information about the output class (document category). There are several ART networks (ART-1, ART-2, ARTMAP), which differ in their architecture and in the type of input feature vector. For our work, the ART-2 network, which processes real-valued feature vectors, was used. For a detailed description of the ART network see (Fausett, 1994) or (Carpenter and Grossberg, 1988).

The modified architecture of a document categorization system using the ART-2 network is illustrated in Figure 2.

Figure 2: Modified architecture of the system using the ART-2 network for document categorization.


3 DOCUMENT REPRESENTATION

Section 2 presented the system architecture for document categorization. Because the input layer of the document processing system uses the self-organizing map, which processes a real-valued input vector, it is essential to transform an input text into its numerical representation.

In (Kaski et al., 1998) the representation of documents by averaged context vectors was presented. The averaged context vectors are generated from the context of the words in the document collection by the following process:

1. Each word $s_i$ in the vocabulary, which was created for the given document corpus, is assigned a unique random real vector $w_i$ of dimension $n$.

2. The input document corpus is searched, and all occurrences of the word $s_i$ are found.

3. The context of the word $s_i$ is found, i.e. the $m$ words which precede/follow the word $s_i$ are taken from each document containing this word, and the vectors $pw_i$ (the average of all vectors $w$ of the $m$-tuples preceding the word $s_i$) and $nw_i$ (the average of all vectors $w$ of the $m$-tuples following the word $s_i$) are evaluated.

4. The average context vector $cw_i$ of the word $s_i$ is created from the values $pw_i$, $w_i$, $nw_i$ in the following way:

$$cw_i = \begin{bmatrix} pw_i \\ \varepsilon w_i \\ nw_i \end{bmatrix}, \tag{2}$$

where $\varepsilon$ is the weight of the vector representing the word $s_i$.

It is evident that words occurring in similar contexts have similar context vectors and belong to the same category. Based on this assumption, it is possible to train the word category map.
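The four steps of the averaged context vector construction can be sketched in a few lines. The sketch assumes that documents are already tokenized lists of words and that a missing left or right context is represented by a zero vector; the values of n, m and ε, the random number range, and all names are hypothetical choices for illustration, not values reported in this paper.

```python
import random


def average(vectors, n):
    """Component-wise mean; a zero vector if there is nothing to average."""
    if not vectors:
        return [0.0] * n
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def context_vectors(documents, n=4, m=1, eps=0.2, seed=0):
    """Map each word to cw = [pw, eps * w, nw], a vector of length 3n."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in documents for word in doc})
    # Step 1: one random real vector per vocabulary word.
    w = {word: [rng.uniform(-1.0, 1.0) for _ in range(n)] for word in vocab}
    before = {word: [] for word in vocab}
    after = {word: [] for word in vocab}
    # Steps 2 and 3: collect the vectors of the m preceding and m following words.
    for doc in documents:
        for pos, word in enumerate(doc):
            before[word].extend(w[x] for x in doc[max(0, pos - m):pos])
            after[word].extend(w[x] for x in doc[pos + 1:pos + 1 + m])
    # Step 4: concatenate the averaged contexts with the weighted word vector.
    return {
        word: average(before[word], n)
        + [eps * c for c in w[word]]
        + average(after[word], n)
        for word in vocab
    }


docs = [["team", "won", "match"], ["team", "lost", "match"]]
cw = context_vectors(docs)
```

In this toy corpus the words "won" and "lost" occur in the same context, so the first and last thirds of their context vectors coincide and only the middle part, scaled by ε, differs.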

4 RESULTS AND FUTURE WORK

All neural-network-based systems for document categorization described in this paper were implemented in Java. They can be downloaded and used for non-commercial purposes.

The systems were tested on a corpus of 6000 documents containing Czech Press Agency news. The whole corpus included approximately 146 000 words; stop words and insignificant words were removed from the corpus. The documents were categorized by hand into four categories, and the results were then compared with the results of automatic categorization. The distribution of documents into categories was the following:

Document category    % of all documents
sport                44
policy               51
foreign actuality     3
society               2

Figure 3: Trained Word Category Map.

With regard to the low numbers of documents representing some categories (e.g. there were approximately 80 documents about society and 200 documents dealing with foreign actuality in the corpus), a set of 160 documents (40 documents from each category) was selected for training the word category map and the neural-network-based categorizers. A vocabulary of words was generated from the training set of documents, and all words with a frequency of occurrence smaller than a predefined threshold were removed from the vocabulary. The vocabulary was then used for training the WCM.
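The vocabulary construction with a frequency threshold can be illustrated as follows. The threshold value used in the experiments is not stated here, so `min_count` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents, min_count):
    """Keep the words that occur at least `min_count` times in the training documents."""
    counts = Counter(word for doc in documents for word in doc)
    return sorted(word for word, count in counts.items() if count >= min_count)


# Hypothetical tokenized training documents.
training_docs = [
    ["vlada", "schvalila", "rozpocet"],
    ["vlada", "jedna", "o", "rozpocet"],
]
vocabulary = build_vocabulary(training_docs, min_count=2)
```

Only the two words that occur in both toy documents survive the threshold.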

The size of the WCM (the first layer of the classification system) was chosen so as to place approximately 25 words into each category (i.e. the map contains approximately 40 neurons for 1000 words). The word category map was trained on the numeric vectors representing the words in the dictionary.

The result of training the WCM and an example of word categories are illustrated in Figure 3. It is apparent that some output units respond only to words from a specific syntactic or semantic category (nouns, first names and surnames, etc.), while other units respond to words from various syntactic or semantic categories.

Table 1: Results of document categorization using Document Map.

          Number of documents (in %) for category:
DM unit   Sport   Policy   Foreign Actuality   Society
1          2.5    16.2      8.5                  0
2         45       9.4     16.5                 20
3         20       2.4      0                    0
4         22.5    25.5     25                    0
5          0       0        0                    0
6         10      25.5     41.5                 60
7          0       0        0                    0
8          0       0        0                    0
9          0      21        8.5                 20

The Document Map consists of nine neurons arranged in a 3x3 grid. The map receives and processes the vectors from the WCM output convolved by a Gaussian mask and produces the output which corresponds to the category of the input document. After the training, the DM output units were labeled manually.
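The Gaussian smoothing of the WCM output can be sketched as below. The size and width of the mask used in the experiments are not given, so the 3x3 mask, the value of sigma and the zero padding at the map border are assumptions, and the function names are hypothetical.

```python
import math


def gaussian_mask(radius=1, sigma=1.0):
    """Square Gaussian mask of side 2 * radius + 1, normalized to sum to 1."""
    offsets = range(-radius, radius + 1)
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)) for dx in offsets]
           for dy in offsets]
    total = sum(map(sum, raw))
    return [[value / total for value in row] for row in raw]


def smooth(grid, mask):
    """Apply the (symmetric) mask to a 2-D histogram; cells outside the map count as zero."""
    rows, cols, r = len(grid), len(grid[0]), len(mask) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < rows and 0 <= xx < cols:
                        out[y][x] += grid[yy][xx] * mask[dy + r][dx + r]
    return out


# Hypothetical WCM hit histogram laid out on a 3x3 grid: one hit in the center.
histogram = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
blurred = smooth(histogram, gaussian_mask())
```

The single hit is spread over the neighboring word categories, so documents that use related words produce overlapping inputs for the document map.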

The association of documents from particular categories with the clusters represented by the DM output units is presented in Table 1. It is evident that unit 2 is mostly activated for the sport category, units 4 and 6 are activated especially for the policy category, etc.

The ART-2 network was designed to give output comparable with that of the SOM-based categorizer. The ART-2 categorizer has nine output units (i.e. the network can create at most nine clusters). The set of documents used for training the SOM-based categorizer was also used here. The number of clusters actually created was strongly dependent on the parameter ρ (vigilance threshold). In our case ρ = 0.98 was used, because most documents were assigned to only one cluster if ρ had a smaller value. The results of categorization using the ART-2 categorizer are presented in Table 2. The meaning of the values in the table is similar to that for the SOM-based categorizer. Documents with sport, policy and foreign actuality topics are well separated (see the values for units 7, 5 and 1, respectively), while documents dealing with society news were mostly assigned to the same cluster as documents about policy (output unit 5).
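The influence of the vigilance threshold on the number of clusters can be illustrated with a deliberately simplified sketch. It is not the ART-2 network: the F1 layer dynamics and the remaining network parameters are omitted, cosine similarity stands in for the match criterion, and the learning rate, the toy vectors and all names are hypothetical. Only the role of ρ and the limit of nine output units follow the text.

```python
import math


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def vigilance_clustering(vectors, rho, max_units=9, rate=0.5):
    """Simplified vigilance-based clustering (not the full ART-2 dynamics)."""
    prototypes, labels = [], []
    for v in map(normalise, vectors):
        sims = [sum(a * b for a, b in zip(v, p)) for p in prototypes]
        best = sims.index(max(sims)) if sims else None
        if best is not None and sims[best] >= rho:
            # The match passes the vigilance test: pull the prototype towards the input.
            mixed = [(1 - rate) * p + rate * a for p, a in zip(prototypes[best], v)]
            prototypes[best] = normalise(mixed)
            labels.append(best)
        elif len(prototypes) < max_units:
            # No stored prototype is similar enough: commit a new output unit.
            prototypes.append(v)
            labels.append(len(prototypes) - 1)
        else:
            # All units are committed and none matches: leave the input unassigned.
            labels.append(None)
    return labels


# Hypothetical non-negative document vectors forming two obvious groups.
docs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0]]
strict = vigilance_clustering(docs, rho=0.98)
loose = vigilance_clustering(docs, rho=0.0)
```

With the high threshold the toy vectors split into two clusters, while with the low threshold every vector is accepted by the first unit, which mirrors the behavior reported above for smaller values of ρ.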

Table 2: Results of document categorization using ART-2 categorizer.

              Number of documents (in %) for category:
ART-2 unit    Sport   Policy   Foreign Actuality   Society
1              8.4    11.4     53.7                17.7
2              0.4     2.3      1.4                 2.4
3              0.1     0        0                   0
4             14.7     5.6      0.5                 8.9
5              5.8    58.3     10.4                44.4
6              0.2     0.1      0.5                 0
7             56.3    14.7     16.3                13.3
8              5.7     2.9      4.1                 4.4
9              8.4     4.7     13.1                 8.9

The comparison of the SOM and ART-2 based categorizers is quite difficult and is still being investigated. Since changes in the SOM network parameters affect the resulting clusters less than in the case of the ART-2 network, the SOM results seem to be more natural. The advantage of the SOM categorizer is its low number of parameters. The ART-2 network is very sensitive to parameter settings. There are seven parameters of the network (including ρ mentioned above) which have to be set before training the network. If the parameters are chosen properly, the network can give better categorization results than the SOM categorizer.
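The units quoted in the text as most active for each category can be recomputed directly from the values in Tables 1 and 2. The short sketch below does this; the data are copied from the tables, while the function and variable names are hypothetical.

```python
CATEGORIES = ["sport", "policy", "foreign actuality", "society"]

# Rows are output units 1..9, columns follow CATEGORIES (values in %).
DOCUMENT_MAP = [
    [2.5, 16.2, 8.5, 0], [45, 9.4, 16.5, 20], [20, 2.4, 0, 0],
    [22.5, 25.5, 25, 0], [0, 0, 0, 0], [10, 25.5, 41.5, 60],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 21, 8.5, 20],
]
ART2 = [
    [8.4, 11.4, 53.7, 17.7], [0.4, 2.3, 1.4, 2.4], [0.1, 0, 0, 0],
    [14.7, 5.6, 0.5, 8.9], [5.8, 58.3, 10.4, 44.4], [0.2, 0.1, 0.5, 0],
    [56.3, 14.7, 16.3, 13.3], [5.7, 2.9, 4.1, 4.4], [8.4, 4.7, 13.1, 8.9],
]


def peak_units(table):
    """For each category, list the unit numbers that hold its largest share."""
    peaks = {}
    for col, name in enumerate(CATEGORIES):
        column = [row[col] for row in table]
        top = max(column)
        peaks[name] = [unit + 1 for unit, value in enumerate(column) if value == top]
    return peaks


dm_peaks = peak_units(DOCUMENT_MAP)
art_peaks = peak_units(ART2)
```

For the Document Map, three of the four categories peak at unit 6 (policy ties between units 4 and 6), whereas for the ART-2 categorizer only policy and society share a peak unit, which is consistent with the separation described above.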

In our future work we plan to focus on the following tasks, which could improve the results of document categorization:

• introduction of another feature set for word description,

• application of other neural networks trained with supervision (e.g. multilayer perceptron, LVQ, etc.) as a second layer,

• usage of more sophisticated approaches for comparison of categorization results.

ACKNOWLEDGEMENTS

This work was supported by grant no. 2C06009 Cot-Sewing.

REFERENCES

Carpenter, G. A. and Grossberg, S. (1988). The art of adaptive pattern recognition by a self-organizing neural network. Computer, 21(3):77–88.

Fausett, L. V. (1994). Fundamentals of Neural Networks. Prentice Hall, Englewood Cliffs, NJ.

Fiesler, E. and Beale, R., editors (1997). Handbook of Neural Computation. Oxford University Press.

Kaski, S., Honkela, T., Lagus, K., and Kohonen, T. (1998). WEBSOM - self-organizing maps of document collections. Neurocomputing, pages 101–117.

Kohonen, T. (2001). Self-Organizing Maps. Springer-Verlag, Berlin Heidelberg.

Manning, C. D., Raghavan, P., and Schütze, H. (2007). An Introduction to Information Retrieval - Preliminary Draft. Cambridge University Press.
