ARABIC TEXT CATEGORIZATION SYSTEM

Using Ant Colony Optimization-based Feature Selection

Abdelwadood Moh’d A. Mesleh and Ghassan Kanaan

Faculty of Information Systems & Technology, Arab Academy for Banking and Financial Sciences, Amman, Jordan

Keywords: Arabic Text Classification, Feature Selection, Ant Colony Optimization, Arabic Language, SVMs.

Abstract: Feature subset selection (FSS) is an important step for effective text classification (TC) systems. This paper

describes a novel FSS method based on Ant Colony Optimization (ACO) and Chi-square statistic. The

proposed method adapts the Chi-square statistic as heuristic information and uses the effectiveness of a Support

Vector Machines (SVMs) text classifier as guidance for better selecting features for selected categories.

Compared to six classical FSS methods, our proposed ACO-based FSS algorithm achieved better TC

effectiveness. Evaluation used an in-house Arabic TC corpus. The experimental results are presented in

terms of the macro-averaging $F_1$ measure.

1 INTRODUCTION

The volume of Arabic information available on the Internet is increasing. This growth motivates researchers to find better ways of classifying Arabic articles. TC (Manning & Schütze, 1999) is the task of classifying texts into one of a pre-specified set of categories based on their contents.

The Arabic TC process comprises three main components (Mesleh, 2007): data pre-processing, text classifier construction, and document categorization. Data pre-processing makes the text documents compact and suitable for training the text classifier. Text classifier construction implements the function of learning from a training dataset. After the effectiveness of the text classifier has been evaluated, the TC system can be used to classify Arabic documents. Given a sufficient number of labeled examples (the training dataset), we can build a TC model to predict the category of new documents. Those examples include a huge number of features, and some of the features do not reveal significant document-category characteristics. This is why FSS techniques are essential to decrease the size of the training dataset, to speed up the training process, and to improve the text classifier’s effectiveness.

In this paper, much attention is paid to pre-

processing and in particular to the FSS process.

The rest of this paper is organized as follows. In

section 2, an overview of FSS methods is presented;

section 3 describes our proposed Ant Colony

Optimization-based FSS method (ACO-based FSS).

Experimental results and conclusions are discussed

in sections 4 and 5 respectively.

2 FSS OVERVIEW

FSS is a process that chooses a subset of features from an original feature set according to some criterion. The basic FSS steps are (Liu & Yu, 2005):

Feature Generation. In this step, a number of

candidate subsets of features are generated by some

search process.

Feature Evaluation. In this step, the candidate

feature subsets are evaluated to measure their

goodness. Evaluation is divided into filter and

wrapper methods. In filter methods, features are

selected by a filtering process that is based on scores

which were assigned by a specific weighting

method. On the other hand, in wrapper methods,

feature selection is based on the accuracy of some

given classifier.

Stopping Criteria. In this step, the FSS process

stops if a predefined criterion is met.

In TC tasks, many FSS approaches (Yang & Pedersen, 1997; Forman, 2003) are often used, such as Document Frequency thresholding (DF), Chi-square statistic (CHI), Term Strength (TS), Information Gain (IG), Mutual Information (MI), Odds Ratio (OR), NGL coefficient and GSS score.

Moh’d A. Mesleh A. and Kanaan G. (2008).

ARABIC TEXT CATEGORIZATION SYSTEM - Using Ant Colony Optimization-based Feature Selection.

In Proceedings of the Third International Conference on Software and Data Technologies - PL/DPS/KE, pages 384-387

DOI: 10.5220/0001892803840387

Copyright © SciTePress

The valuable FSS studies (Yang & Pedersen, 1997; Forman, 2003) investigated FSS methods for English TC tasks. Syiam, Fayed and Habib (2006), however, evaluated the effectiveness of many FSS methods (Chi-square, DF, IG, OR, GSS score, and NGL coefficient) for Arabic TC tasks with Rocchio and kNN classifiers. They concluded that a hybrid approach of DF and IG is a preferable FSS method for Arabic TC tasks.

In a recent FSS study, Mesleh (2007) conducted an empirical comparison of these FSS methods evaluated on an Arabic dataset with an SVMs classifier. Mesleh concluded that Chi-square works best with the SVMs classifier for Arabic TC tasks.

Theoretically, FSS has been shown to be an NP-hard problem (Blum & Rivest, 1992). As a result, automatic feature space construction and FSS from a large set have become an active research area. On the other hand, optimization algorithms (such as the genetic algorithm (Goldberg, 1989)) have become applicable to FSS processes.

In this work, the ACO algorithm is proposed to enhance (optimize) the Chi-square based FSS process. The following may justify the selection of the ACO algorithm for Arabic FSS in TC tasks:

Compared with other evolutionary-based algorithms, the ACO algorithm (Elbeltagi, Hegazy & Grierson, 2005) performs better in terms of processing time, which is very important when dealing with the huge number of features in an Arabic TC dataset.

Compared to English, the Arabic language (Yahya, 1989) is sparser, which means that English words are repeated more often than Arabic words for the same text length. Sparseness yields less weight for Arabic terms (features) than for English features. The difference in weight among Arabic word features is smaller, and this makes it more difficult to differentiate between different Arabic words (this may negatively affect the Arabic text classifier’s effectiveness).

The ACO algorithm, which imitates the foraging behavior of real-life ants (Dorigo, Maniezzo & Colorni, 1996), was first proposed to solve the traveling salesman problem. However, it has recently been proposed to solve many other problems, such as FSS.

3 PROPOSED ACO-BASED FSS

The ACO algorithm was used (Al-Ani, 2005) in FSS processes for speech segment and texture classification problems. Similarly, the ACO algorithm was used (Jensen & Shen, 2003) in an entropy-based modification of the original rough set-based approach for FSS problems. The ACO algorithm was also used (Schreyer & Raidl, 2002) to label point features, a pre-processing step to reduce the search space, and a hybrid method (Sivagaminathan & Ramakrishnan, 2007) of ACO and Neural Networks was used to select features.

The main difference between these FSS approaches is in the calculation of the heuristic values used. Heuristic values help the algorithm reach an optimal solution. Accordingly, we have tailored the ACO algorithm to fit the FSS process for Arabic TC tasks. The proposed FSS method adapts the Chi-square statistic as heuristic information and uses the effectiveness of the SVMs text classifier as guidance for better selecting features for selected text categories in our Arabic TC system.

The main steps of our proposed ACO-Based FSS method are as follows:

Initialization Step. Initially, the ant colony algorithm parameters are initialized:

Define the amount of pheromone change for each feature, $\Delta\tau_i = 0$, where $i$ is a feature index, $i \in [0, N]$, and $N$ is the total number of features in the feature space.

Define the pheromone level associated with each feature ($\tau_i = 1$).

Stopping criterion: define the maximum number of iterations (NIs = 30).

Define the desired macro-averaging $F_1$ measure ($BF_1$ = 88.11).

Define the number of solutions (number of ants) (NAs = 30).

Define the number of features in each candidate subset of features (NFs).

Define the number of Top Best Solutions (TBS = 10).

Local selection criterion: for all the features in the original feature set, Chi-square statistic scores are pre-calculated.
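The pre-calculation of Chi-square scores can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard two-by-two contingency form of the Chi-square statistic for a term and a category, and the function names `chi_square` and `chi_square_scores`, as well as the representation of documents as token sets, are hypothetical.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for a category from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - c * b) ** 2 / denominator


def chi_square_scores(docs, labels, category):
    """Pre-calculate the Chi-square score of every term for one category.

    docs is a list of token sets; labels is the parallel list of categories.
    """
    n_docs = len(docs)
    n_in_category = sum(1 for label in labels if label == category)
    vocabulary = set().union(*docs) if docs else set()
    scores = {}
    for term in vocabulary:
        a = sum(1 for doc, label in zip(docs, labels)
                if term in doc and label == category)
        b = sum(1 for doc, label in zip(docs, labels)
                if term in doc and label != category)
        c = n_in_category - a
        d = (n_docs - n_in_category) - b
        scores[term] = chi_square(a, b, c, d)
    return scores
```

Because the scores depend only on the training documents, they can be computed once and reused in every iteration.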

Step 2 – Generating Ants for the Initial Iteration. FOR each ant (solution) $ant_i$, $i = 1:NAs$, randomly select NFs features.
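A minimal sketch of this step, assuming that features are identified by integer indices and that each ant holds a set of distinct features. The function name `generate_initial_ants` and the `seed` parameter are hypothetical.

```python
import random


def generate_initial_ants(n_features_total, n_ants, n_features_per_ant, seed=None):
    """Return one random feature subset (a set of indices) per ant."""
    rng = random.Random(seed)
    return [
        set(rng.sample(range(n_features_total), n_features_per_ant))
        for _ in range(n_ants)
    ]
```

With the settings reported in this paper, a call would look like `generate_initial_ants(78699, 30, 180)`.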

Step 3 – Evaluating Solutions. FOR each solution ($ant_i$, $i = 1:NAs$), run the classifier (the SVMs text classifier) to evaluate the goodness of solution $i$. Evaluation is based on the macro-averaging $F_1$ measure.

Step 4 – Stopping Criterion. IF a predefined stopping criterion is met THEN stop the Ant Colony Optimization-Based FSS process.

ELSE:

(1) Pheromone Update. Update the pheromone levels associated with the features in the TBS solutions. The pheromone update is defined by:

$$\tau_i = \begin{cases} \rho\,\tau_i + \tau_i^{+} + w \cdot \tau_i^{+} & \text{if } f_i \in EBS_j \\ \rho\,\tau_i + \tau_i^{+} & \text{otherwise} \end{cases}$$

Where $\tau_i^{+}$ is defined by:

$$\tau_i^{+} = \begin{cases} \dfrac{\max_{g=1:TBS}(F_1^{g}) - F_1^{j}}{\max_{h=1:TBS}\left(\max_{g=1:TBS}(F_1^{g}) - F_1^{h}\right)} & \text{if } f_i \in S_j \\ 0 & \text{otherwise} \end{cases}$$

$\rho$ is a coefficient such that $(1 - \rho)$ represents the evaporation of the pheromone level. An Elitist Best Solution ($EBS_j$) is any solution $S_j$ among the TBS solutions that outperformed $BF_1$. $w$ is the performance effectiveness of solution $S_j$, and $f_i$ is a feature indexed by $i$.
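The pheromone update can be sketched as below. The sketch follows the update rule as literally as possible, but several points are assumptions that the text does not settle: increments from several TBS solutions containing the same feature are summed, a feature found in more than one elitist solution uses the largest $w$, and all increments are zero when the TBS solutions have identical scores. The function name `update_pheromone` and its argument layout are hypothetical.

```python
def update_pheromone(tau, top_solutions, rho, bf1):
    """Return the updated pheromone level of every feature.

    tau: list of current pheromone levels, indexed by feature
    top_solutions: list of (feature_index_set, macro_f1) for the TBS solutions
    rho: coefficient such that (1 - rho) is the evaporation
    bf1: the desired macro-averaging F1 measure (BF1)
    """
    best = max(f1 for _, f1 in top_solutions)
    gaps = [best - f1 for _, f1 in top_solutions]
    largest_gap = max(gaps)

    increment = [0.0] * len(tau)     # pheromone change per feature
    elite_weight = [0.0] * len(tau)  # w for features in an elitist solution

    for (features, f1), gap in zip(top_solutions, gaps):
        # Assumption: identical scores give a zero increment for everyone.
        share = gap / largest_gap if largest_gap > 0 else 0.0
        for i in features:
            increment[i] += share  # assumption: contributions are summed
            if f1 > bf1:           # the solution outperformed BF1
                elite_weight[i] = max(elite_weight[i], f1)

    # With elite_weight == 0 this reduces to the "otherwise" branch.
    return [rho * t + inc + w * inc
            for t, inc, w in zip(tau, increment, elite_weight)]
```

Features that appear in none of the TBS solutions only evaporate, since their increment is zero.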

(2) Probabilistic Feature Selection. Select new features for the NAs ants for the next iteration. Selection is defined by the following Chi-square based Feature Selection Probability (CHIFSP):

$$CHIFSP_i^{S_j} = \begin{cases} \dfrac{[\tau_i]^{\alpha} \cdot [CHI_i^{S_j}]^{\beta}}{\sum_{g \text{ is allowed}} [\tau_g]^{\alpha} \cdot [CHI_g^{S_j}]^{\beta}} & \text{if } f_i \notin S_j \\ 0 & \text{otherwise} \end{cases}$$

Where $CHI_i^{S_j}$ is the local importance of feature $f_i$ given the solution $S_j$, and $\alpha$ and $\beta$ are used to control the effects of the pheromone level and the Chi-square statistic.

Go to the evaluation step (Step 3).
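The selection probability can be sketched as follows. As a simplifying assumption, the sketch uses one pre-calculated Chi-square score per feature rather than a score that depends on the partial solution $S_j$, and it treats every feature not yet in the solution as allowed. The names `chifsp` and `select_feature` are hypothetical.

```python
import random


def chifsp(tau, chi, solution, alpha=1.0, beta=1.0):
    """Selection probability of every feature for one partial solution.

    tau: pheromone level per feature
    chi: pre-calculated Chi-square score per feature
    solution: set of feature indices already selected by this ant
    """
    weights = [
        0.0 if i in solution else (tau[i] ** alpha) * (chi[i] ** beta)
        for i in range(len(tau))
    ]
    total = sum(weights)
    if total == 0:
        return weights  # no allowed feature has positive weight
    return [w / total for w in weights]


def select_feature(tau, chi, solution, alpha=1.0, beta=1.0, rng=random):
    """Draw one new feature index according to the CHIFSP distribution."""
    probabilities = chifsp(tau, chi, solution, alpha, beta)
    return rng.choices(range(len(tau)), weights=probabilities, k=1)[0]
```

Calling `select_feature` repeatedly, and adding each drawn index to `solution`, builds an ant's subset of NFs features for the next iteration.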

4 EXPERIMENTAL RESULTS

In this work, we have used an in-house collected corpus (see Mesleh, 2007). It consists of 1445 documents of different lengths belonging to nine categories. We followed (Mesleh, 2007) in processing the Arabic dataset. Each article in the Arabic dataset was processed to remove digits and punctuation marks. Some Arabic letters were normalized, such as “ء” (hamza) in all its forms to “ا” (alef). All non-Arabic text was filtered out. Arabic function words (such as “آخر”, “أبدا”, “أحد” etc.) were removed. Arabic documents were represented by the vector space model. Lastly, all terms with a term frequency less than some threshold were filtered out (the threshold is set to three for positive features and to six for negative features in training documents).
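The first pre-processing steps can be sketched as below. This is an illustration under stated assumptions rather than the pipeline of (Mesleh, 2007): "hamza in all its forms" is taken to mean the four code points U+0621, U+0622, U+0623 and U+0625, Arabic text is taken to be the Unicode range U+0621 to U+064A, and the function name `preprocess` is hypothetical. Term-frequency thresholding and the vector space representation are not shown.

```python
import re

ALEF = "\u0627"
# Assumption: the hamza forms to normalize are hamza, alef with madda,
# alef with hamza above, and alef with hamza below.
HAMZA_FORMS = "\u0621\u0622\u0623\u0625"
NORMALIZE = str.maketrans({form: ALEF for form in HAMZA_FORMS})

# Assumption: anything outside this range (digits, punctuation, Latin text,
# diacritics) is treated as non-Arabic and replaced by a space.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def preprocess(text, stop_words=frozenset()):
    """Return the normalized Arabic tokens of a document."""
    text = NON_ARABIC.sub(" ", text)
    text = text.translate(NORMALIZE)
    # Normalize the function words too, so that they match the tokens.
    stops = {word.translate(NORMALIZE) for word in stop_words}
    return [token for token in text.split() if token not in stops]
```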

To implement the SVMs text classifier (Mesleh, 2007), we used an SVMs package, TinySVM (downloaded from http://chasen.org/~taku/), with the soft-margin parameter $C$ set to 1.0. In order to fairly compare our ACO-Based FSS with other FSS methods, six FSS methods (IG, CHI, NGL, GSS, OR and MI) were implemented. For the ACO-Based FSS method, $\alpha$ and $\beta$ are set to 1.

TC effectiveness (Baeza-Yates and Ribeiro-Neto, 1999) is measured in terms of Precision, Recall and the $F_1$ measure. Denote the precision, recall and $F_1$ measures for a class $C_i$ by $P_i$, $R_i$ and $F_i$, respectively. We have:

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad F_i = \frac{2 P_i R_i}{R_i + P_i}$$

Where $TP_i$, $FP_i$, $FN_i$, and $TN_i$ are defined in Table 1.

Table 1: The Contingency Table for Category $c_i$.

Category $c_i$                 Expert Judgment
                               YES        NO
Classifier Judgment    YES     $TP_i$     $FP_i$
                       NO      $FN_i$     $TN_i$
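The per-category measures can be computed from the contingency counts of Table 1 as in the following sketch. Returning zero when a denominator is zero is an assumption, since the text does not say how empty categories are handled, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 of one category from its contingency counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (recall + precision)
    return precision, recall, f1
```

Note that $TN_i$ does not enter any of the three measures.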

To evaluate the average performance over many categories, the macro-averaging $F_1$ ($F_1^M$) is used and defined as follows:

$$F_1^M = \frac{2 \left[ \sum_{i=1}^{C} R_i \sum_{i=1}^{C} P_i \right]}{N \left[ \sum_{i=1}^{C} R_i + \sum_{i=1}^{C} P_i \right]}$$
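A short sketch of the macro-averaged measure follows. It assumes that $N$ equals the number of categories $C$, which the text does not state explicitly. Under that assumption the formula is the $F_1$ of the macro-averaged precision and recall, because dividing both sums by $N$ and simplifying gives the same expression. The function name `macro_f1` is hypothetical.

```python
def macro_f1(precisions, recalls):
    """Macro-averaging F1 from per-category precision and recall values."""
    n = len(precisions)  # assumption: N is the number of categories
    sum_p = sum(precisions)
    sum_r = sum(recalls)
    if n == 0 or sum_p + sum_r == 0:
        return 0.0
    return 2 * sum_r * sum_p / (n * (sum_r + sum_p))
```

For example, with precisions 0.8 and 0.6 and recalls 0.5 and 0.7, the averages are 0.7 and 0.6, and both forms give the same value.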

To evaluate the effectiveness of our proposed ACO-based FSS method, we conducted three groups of TC experiments. For each group and for each text category, we randomly selected one third of the articles for testing, while the remaining articles were used for training the Arabic classifier. For each FSS method (ACO-based FSS, Chi-square, GSS, NGL, IG, OR, and MI), we conducted three experiments to select 180, 160, and 140 features respectively. We then conducted an additional experiment without any FSS method (the result of this experiment is referred to as the original classifier). In this work, only one category’s


features are selected by the ACO-based FSS method, i.e. the SVMs $F_1^M$ results were achieved by optimizing only one text category (the smallest category). We noted that optimizing any category will enhance the classifier’s effectiveness.
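The per-category hold-out used in each group of experiments can be sketched as follows. The rounding of the one-third share and the function name `split_per_category` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_per_category(labels, test_fraction=1 / 3, seed=None):
    """Randomly hold out a fraction of each category's documents for testing.

    Returns two sorted lists of document indices: (train, test).
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, label in enumerate(labels):
        by_category[label].append(index)

    train, test = [], []
    for indices in by_category.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # assumption: rounded share
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Running it with three different seeds would give three groups of experiments with different random test sets.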

Figure 1: $F_1^M$ values for SVMs with the seven FSS methods at different feature subset sizes.

Figure 1 shows the $F_1^M$ results for the SVMs text classifier with the seven FSS methods at different sizes of feature subsets. Our ACO-based FSS method outperformed the original classifier (where all the 78699 features are used for training the SVMs text classifier) and outperformed the other six FSS methods. The best Chi-square $F_1^M$ result was 88.11; after optimizing the feature selection of the smallest category, the $F_1^M$ result became 88.743.

5 CONCLUSIONS

Our proposed ACO-based FSS method adapts the Chi-square statistic as heuristic information and uses the effectiveness of SVMs as guidance for better selecting features in Arabic TC tasks. In this work, the proposed FSS method was selectively applied to a single text category (the Computer category, which is the smallest category). Compared to six classical FSS methods, it achieved better TC effectiveness results. Optimizing features for all categories, tuning the ACO-based FSS parameters and studying their effects, and comparing our proposed method with other ACO algorithm variants are left as future work.

REFERENCES

Manning, C., Schütze, H., 1999. Foundations of Statistical

Natural Language Processing. MIT Press.

Liu, H., Yu, L., 2005. Toward integrating feature selection

algorithms for classification and clustering. IEEE

Transactions on Knowledge and Data Engineering, vol.

17, no. 4, 491-502.

Yang, Y., Pedersen, J., 1997. A Comparative Study on

Feature Selection in Text Categorization. In J. D. H. Fisher, editor, The 14th International Conference on Machine Learning (ICML'97), Morgan Kaufmann, 412-420.

Forman, G., 2003. An Extensive Empirical Study of

Feature Selection Metrics for Text Classification,

Journal of Machine Learning Research, vol. 3, 1289-

1305.

Syiam, M., Fayed, Z., Habib, M., 2006. An Intelligent

System for Arabic Text Categorization. International

Journal of Intelligent Computing & Information Sciences, vol. 6, no. 1, 1-19.

Mesleh, A., 2007. Support Vector Machines based Arabic

Language Text Classification System: Feature

Selection Comparative Study, to appear in the

proceedings of the International Joint Conferences on

Computer, Information, and Systems Sciences, and

Engineering (CIS2E 07), December 3-12, Springer-

Verlag.

Blum, A., Rivest, R., 1992. Training a 3-Node Neural

Network is NP-Complete. Neural Networks, vol. 5,

no. 1, 117-127.

Goldberg, D., 1989. Genetic Algorithms in search,

optimization, and machine learning, Addison-Wesley.

Dorigo, M., Maniezzo, V., Colorni A., 1996. The ant

system: optimization by a colony of cooperating

agents. IEEE Transactions on Systems, Man, and

Cybernetics-Part B, vol. 26, no. 1, 29-41.

Elbeltagi, E., Hegazy, T., Grierson, D., 2005. Comparison

among five evolutionary-based optimization

algorithms, Advanced Engineering Informatics, vol.

19, no. 1, 43-53.

Yahya, A., 1989. On the complexity of the initial stages of

Arabic text processing, First Great Lakes Computer

Science Conference; Kalamazoo, Michigan, USA.

Al-Ani, A., 2005. Feature Subset Selection Using Ant

Colony Optimization, International Journal of

Computational Intelligence. vol. 2, no. 1, 53-58.

Jensen, R., Shen, Q., 2003. Finding rough set reducts with

ant colony optimization. In Proceedings of the 2003

UK workshop on computational intelligence, 15-22.

Schreyer, M., Raidl, G., 2002. Letting ants labeling point

features. In Proceedings of the 2002 IEEE congress on

evolutionary computation at the IEEE world congress

on computational intelligence, 1564-1569.

Sivagaminathan, R.K., Ramakrishnan, S., 2007. A hybrid

approach for feature subset selection using neural

networks and ant colony optimization. Expert Systems

with Applications, vol. 33, 49-60.

Baeza-Yates, R., Ribeiro-Neto, B., 1999. Modern

Information Retrieval. Addison-Wesley & ACM

Press.
