A New Tool for Textual Aggregation In Information Retrieval

Mustapha Bouakkaz, Sabine Loudcher and Youcef Ouinten

LIM Laboratory, University of Laghouat, Laghouat, Algeria

ERIC Laboratory, University of Lyon2, Lyon, France

Keywords:

Aggregation, OLAP, Textual Data, Algorithm.

Abstract:

We present a system for textual aggregation of scientific documents in the online analytical processing (OLAP) context. The system automatically extracts keywords from a set of documents according to the lists compiled on the Microsoft Academic Search web site. It lets users choose an aggregation method among those implemented: TOP-Keywords, TOPIC, TUBE, TAG, BienCube and GOTA. The performance of the chosen methods, in terms of recall, precision, F-measure and runtime, is investigated on two real corpora, ITINNOVATION and OHSUMED, with 600 and 13,000 scientific articles respectively. Users can also integrate other corpora into the system.

1 INTRODUCTION

The rapidly growing amount of complex data such as text, available on web sites, in e-mails, on corporate local networks, in electronic news and elsewhere, is overwhelming. This uncontrolled growth of information across fields makes it difficult to separate useful information from the rest, and it is starting to turn information from an asset into a burden. The capabilities of available OLAP tools, especially text OLAP tools, are not growing at the same pace as the volume of textual documents. The problem is exacerbated by the large quantity of textual documents that search engines index at every moment. As a result, text OLAP and knowledge extraction from textual documents remain very limited, which reduces the competitive advantage that can be gained from them. A large number of systems have been developed in recent years to address this kind of problem and to perform Information Retrieval tasks. Many of them perform specific tasks such as word counting and text summarization; however, they do not satisfy the growing need of users to extract useful information from documents using text OLAP approaches.

In this paper we describe a software platform for keyword extraction and aggregation in an OLAP context. The platform implements a new way of extracting keywords from a corpus of documents, based on the Microsoft Academic Search web site, together with six keyword aggregation algorithms that process a corpus of textual data to discover aggregated keywords.

The rest of the paper is organized as follows. Section 2 introduces related work on keyword extraction and aggregation in the OLAP context. Section 3 describes the main components of the software prototype along with their functionalities. Section 4 is devoted to numerical experiments. Finally, Section 5 presents conclusions and discusses further developments.

2 EXISTING APPROACHES AND TOOLS

Many approaches have been proposed for keyword extraction, but only a few for keyword aggregation. Moreover, most existing work is based on information retrieval, and only some of it addresses the OLAP context, where textual documents are stored in a data warehouse. In this section we survey the existing approaches in the OLAP context that describe a corpus of documents through its most representative aggregated keywords. Keyword extraction approaches are classically divided into supervised and unsupervised ones; here we introduce a new classification for the textual extraction and aggregation approaches proposed in the OLAP context. We classify the previous work found in the literature into four categories: the first uses statistical methods; the second is based on linguistic knowledge; the third is based on graphs; and the last uses external knowledge.

Bouakkaz, M., Loudcher, S. and Ouinten, Y.

A New Tool for Textual Aggregation In Information Retrieval.

In Proceedings of the 18th International Conference on Enterprise Information Systems (ICEIS 2016) - Volume 1, pages 232-237

ISBN: 978-989-758-187-8

The approaches based on statistical methods use the occurrence frequencies of terms and the correlations between terms to extract keywords. Hady et al. (Hady et al., 2007) proposed an approach called TUBE (Text-cUBE). They adapted a relational database to textual data following a cube design: each cell contains keywords, and each keyword is attached to an interestingness value. Zhang et al. (Zhang et al., 2009) proposed an approach called Topic Cube. The main idea of a Topic Cube is to use a hierarchical topic tree as the hierarchy for the text dimension. This structure allows users to drill down and roll up along the tree, and to explore the content of the text documents at different granularities and levels of topics in the cube. The first level of the tree contains the detailed topics, the second level contains more general types, and the last level contains the aggregation of all topics. A textual measure is needed to aggregate the textual data. The authors proposed two types of textual measures: word distribution and topic coverage. Topic coverage computes the probability that a document contains a topic. Aggregating the coverage over the corpus lets the user see which topic is dominant in the set of documents. Ravat et al. (Ravat et al., 2008) proposed an aggregation function called TOP-Keywords to aggregate keywords extracted from documents. They used the tf.idf measure and then selected the k most frequent terms. Bringay et al. (Bringay et al., 2011) proposed an aggregation function based on a new adaptive tf.idf measure that takes into account the hierarchies associated with the dimensions. Wartena et al. (Wartena and Brussee, 2008) proposed another method, which we call TOPIC, that uses the k-bisecting clustering algorithm based on the Jensen-Shannon divergence between probability distributions, as described in (Archetti and Campanelli, 2006). Their method starts by selecting two elements as seeds for the first two clusters. The remaining terms are each assigned to the cluster of the closer of the two selected elements. Once all the terms are assigned, the process is repeated for each cluster whose diameter is larger than a specified threshold value. Bouakkaz et al. (Bouakkaz et al., 2015) proposed a textual aggregation based on keywords. When a user wants a more aggregated view of the data, he or she performs a roll-up operation, which requires an adapted aggregation function. Their approach, called GOTA, is composed of four main steps: (1) extraction of keywords with their frequencies; (2) construction of the distance matrix between words using the Google similarity distance; (3) application of the k-means algorithm to distribute keywords according to their distances; and (4) selection of the k aggregated keywords.
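Step (2) of GOTA can be illustrated with a short sketch of the normalized Google distance of Cilibrasi and Vitanyi, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The sketch below is written in Python for brevity and is not the GOTA implementation: the function names are hypothetical, and, as an assumption made only for this example, the counts f(x), f(y) and f(x, y) are taken from document sets in a local corpus rather than from search-engine hit counts.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google distance computed from occurrence counts.

    fx, fy -- number of documents (or pages) containing each term
    fxy    -- number of documents containing both terms
    n      -- total number of documents (normalisation constant)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # The terms never co-occur: treat them as infinitely far apart.
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        # Both terms occur in every document, so they are indistinguishable.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denominator


def distance_matrix(keywords, doc_sets, n_docs):
    """Symmetric matrix of pairwise distances between keywords.

    doc_sets maps each keyword to the set of document ids that contain it
    (an assumption of this sketch, standing in for search-engine counts).
    """
    size = len(keywords)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            a, b = doc_sets[keywords[i]], doc_sets[keywords[j]]
            d = normalized_google_distance(len(a), len(b), len(a & b), n_docs)
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

A matrix of this kind is what a clustering step such as k-means would consume in step (3); pairs that never co-occur receive an infinite distance here and would need to be capped before clustering.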

The approaches based on linguistic knowledge consider a corpus as the set of vocabulary mentioned in its documents, but the results in this case are sometimes ambiguous. To overcome this obstacle, techniques based on lexical and syntactic knowledge have been introduced. In (Poudat et al., 2006; Kohomban and Lee, 2007) the authors described a classification of textual documents based on lexical variables of scientific discourse. Among these lexical variables, they chose nouns because they are more likely to emphasize scientific concepts than adverbs, verbs or adjectives.

The approaches based on external knowledge select certain keywords that represent a domain. These approaches often use knowledge models such as ontologies. Ravat et al. proposed another aggregation function that takes as input a set of keywords extracted from the documents of a corpus and outputs another set of aggregated keywords (Ravat et al., 2007). They assumed that both the ontology and the corpus of documents belong to the same domain. Oukid et al. proposed an aggregation operator, Orank (OLAP rank), that aggregates a set of documents by ranking them in descending order using a vector space representation (Oukid et al., 2013).

The approaches based on graphs use keywords to construct a keyword graph. The nodes represent the keywords obtained after pre-processing, candidate selection and edge representation. After the graph representation step, different types of keyword ranking approaches have been applied. The first approach, proposed in (Mihalcea and Tarau, 2004), is called TextRank; its graph nodes are the keywords and its edges represent the co-occurrence relations between keywords. The idea is that a keyword linked to a large number of other keywords is considered important. Bouakkaz et al. (Bouakkaz et al., 2014) proposed a method that aggregates the keywords of documents based on graph theory. Their approach, called TAG (Textual Aggregation by Graph), aims at extracting from a set of terms the most representative keywords for a corpus of textual documents using a graph. The function takes as input the set of all terms extracted from a corpus and outputs an ordered set containing the aggregated keywords. The aggregation process goes through three steps: (1) extraction of keywords with their frequencies; (2) construction of the affinity matrix and the affinity graph; and (3) cycle construction and selection of the aggregated keywords.
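The intuition behind graph-based ranking, that a keyword linked to many other keywords is important, can be shown with a deliberately simplified sketch. It builds a co-occurrence graph and ranks keywords by the number of distinct neighbours; it does not reproduce TextRank's iterative scoring or TAG's affinity matrix and cycle construction. The function names and the per-document co-occurrence rule are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(doc_keywords):
    """Build an undirected weighted graph from per-document keyword lists.

    Two keywords are linked when they appear in the same document; the edge
    weight is the number of documents in which they appear together.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for keywords in doc_keywords:
        for a, b in combinations(sorted(set(keywords)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def rank_by_degree(graph, k):
    """Return the k keywords linked to the largest number of other keywords.

    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(graph, key=lambda word: (-len(graph[word]), word))
    return ranked[:k]
```

Keywords that never co-occur with another keyword do not enter the graph at all, which is one reason a real system would combine this signal with term frequencies.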

One software system developed in this domain consists of two main components: Text Pre-processor and Topics Extractor. The Text Pre-processor offers learning and inference functionalities. The learning functionality pre-processes a document collection by exploiting a stop words list and a general purpose to obtain the word-document matrix according to the bag-of-words model. The user can choose the number of words to be used for document indexing. The inference functionality processes a document to obtain one of the following bag-of-words representations: binary, term frequency and inverse document frequency. The Topics Extractor implements a customized version of the Latent Dirichlet Allocation (LDA) model (Blei and Andrew, 2003). LDA learning is solved with the Expectation-Maximization and Gibbs sampling algorithms, which have been implemented in the C++ programming language on a single-processor machine. Each topic is summarized through the estimate of its prior probability and a sorted list of its most frequent words together with the estimates of their conditional probabilities. Semantria is a text analytics tool that offers an API for sentiment analysis and text analytics. Users can integrate the service to quickly obtain actionable data from their unstructured text, whether from review sites, blogs or other sources. Additionally, users can download a trial version and use Semantria for Excel, which installs directly into Office Excel to set up an environment for analyses.

3 THE SOFTWARE SYSTEM DESCRIPTION

To create a suitable environment for the online analysis of textual data, we propose a new software tool that performs keyword aggregation. The system described in this paper consists of three main components: Text Pre-processor, Keywords Extractor and Keywords Aggregator. These components have been integrated into a software system developed in the Java programming language.

3.1 Text Pre-processor

This software component implements functionalities devoted to document pre-processing and document corpus representation. It offers a word counter and represents the documents of the corpus as lists of words with their frequencies (Figure 1). Binary and term frequency representations are also supported. The system accepts PDF, Microsoft Word and plain-text (txt) files as valid inputs, as shown in Figure 1.

Semantria: https://semantria.com/

3.2 Keywords Extractor

This component performs keyword extraction. The keyword extraction function is based on the Microsoft Academic Search (MAS) web site. MAS is a free public service provided by Microsoft. It classifies scientific articles into fifteen categories according to their fields. In each category it extracts the scientific keywords from the articles and orders them by frequency. Our Keywords Extractor component uses these lists and takes from each field the 2000 most frequent keywords, which are saved in separate text files. The Keywords Extractor then compares the MAS keywords with all the words extracted by the Text Pre-processor component. When a MAS keyword exists in the word list, the extractor saves it in a text file with its frequency and the name of the document in which it occurs.

Once the process is finished, we obtain the useful keywords validated by MAS. The output of this component is a two-dimensional matrix of documents and keywords (MDKW), which is used by the third component to aggregate keywords.
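The pre-processing and extraction steps described above can be sketched as follows. This is an illustrative Python sketch, not the Java code of OLAP-TAS: the function names, the tokenisation rule (lower-cased alphabetic tokens) and the restriction to single-word keywords are assumptions, and the reference list passed in merely stands in for a keyword list taken from MAS.

```python
import re
from collections import Counter


def count_words(text):
    """Lower-case the text and count every alphabetic token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_mdkw(documents, reference_keywords):
    """Build a documents-by-keywords matrix.

    documents          -- dict mapping a document name to its raw text
    reference_keywords -- iterable of single-word keywords (stand-in for MAS)

    Returns the sorted list of keywords found in the corpus and a dict that
    maps each document name to its row of keyword frequencies.
    """
    reference = {keyword.lower() for keyword in reference_keywords}
    counts = {name: count_words(text) for name, text in documents.items()}
    found = sorted({w for c in counts.values() for w in c if w in reference})
    # Counter returns 0 for absent words, so every row has the same length.
    matrix = {name: [c[w] for w in found] for name, c in counts.items()}
    return found, matrix
```

Only words that appear both in the corpus and in the reference list become columns, which mirrors the validation role that the text assigns to the MAS lists; multi-word keywords would need a phrase matcher instead of the token lookup used here.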

3.3 Keywords Aggregation

The keywords aggregation component uses a set of textual aggregation algorithms (TOP-Keywords, TOPIC, TUBE, TAG, BienCube and GOTA) to aggregate the keywords obtained in the previous step. It also reports the recall, precision, F-measure and runtime of each algorithm.
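To make the idea of an aggregation function concrete, the following sketch scores each keyword of a documents-by-keywords matrix by a tf.idf weight summed over the corpus and keeps the k best. It is written in the spirit of TOP-Keywords but is only an assumption-laden illustration: the exact weighting used by Ravat et al. is not reproduced, the idf is taken as log(N / df), and the function name and data layout are hypothetical.

```python
import math


def top_k_keywords(keywords, matrix, k):
    """Select k keywords by corpus-level tf.idf score.

    keywords -- list of column labels
    matrix   -- dict mapping a document name to its row of term frequencies
    """
    n_docs = len(matrix)
    scores = {}
    for j, word in enumerate(keywords):
        column = [row[j] for row in matrix.values()]
        document_frequency = sum(1 for tf in column if tf > 0)
        if document_frequency == 0:
            continue  # keyword absent from the corpus
        idf = math.log(n_docs / document_frequency)
        scores[word] = sum(column) * idf
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

With this weighting a keyword present in every document receives a score of zero, so the selection favours terms that distinguish part of the corpus.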

3.4 Graphical User Interface

The graphical user interface (GUI) is a necessary element of our system (OLAP-TAS). We took ergonomics into consideration to make the interaction between the user and the platform easier. The aim of the GUI is to give the user simple access to the OLAP-TAS algorithms through a number of windows that help navigate the system and test the different implemented algorithms without any previous Java programming experience or knowledge.

It also helps students and researchers carry out their scientific work and research experiments in a visual platform. An interactive tool facilitates understanding and makes learning more beneficial for many learners.

Figure 1: System architecture.

The GUI consists of two components: the first is devoted to pre-processing and keyword extraction, and the second to keyword aggregation. The Text Pre-processor and Keywords Extraction interface allows the user to create the documents x keywords matrix based on the Microsoft Academic Search (MAS) web site, as shown in Figure 2. This interface lets users configure parameters such as the threshold level and select the type of corpus (computer science, medicine, chemistry or all fields of study). The second interface, devoted to keyword aggregation, allows the user (1) to run, test and compare the results obtained by the different implemented algorithms; (2) to visualize the aggregated keywords obtained by the different keyword aggregation approaches; (3) to compute statistics for the different approaches, such as recall, precision, F-measure and runtime, and to save the results in various formats (.xls, .txt or .doc); and (4) to change the corpus and run the Text Pre-processor and Keywords Extraction components to load another documents x keywords matrix, as shown in Figure 3.

Figure 2: The Text Pre-processor and Keywords Extraction component interface.

Figure 3: Keywords Aggregation component interface.

4 RESULTS AND USAGES

4.1 Test and Results

In this subsection, we present an example of how OLAP-TAS has been used. We compiled two real corpora. The first is from the IIT conference (http://www.it-innovations.ae), with conference and workshop papers from the years 2008 to 2012. It consists of 600 papers of 7 to 8 pages each in IEEE format, including tables and figures. The keywords are extracted from the full set of words according to the Microsoft Academic Search (academic.research.microsoft.com/) keywords. The second corpus, the Ohsumed collection (ftp://medir.ohsu.edu/pub/ohsumed), has been used by many authors to test their work, such as (Sebastiani, 2002), (Moschitti, 2003) and (Moschitti and Basili, 2004). It includes medical abstracts from MeSH (Medical Subject Headings, http://www.ncbi.nlm.nih.gov/mesh/) and contains 20,000 documents. In our case we selected 13,000 medical abstracts to test the performance of the algorithms implemented in OLAP-TAS.

Many measures have been proposed to evaluate keyword aggregation approaches; most evaluations rely on three of them: recall, precision and F-measure. These measures are defined as follows. The recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents.

$$\mathrm{Recall} = \frac{|\{\mathrm{RelevantDoc}\} \cap \{\mathrm{RetrievedDoc}\}|}{|\{\mathrm{RelevantDoc}\}|} \qquad (1)$$

The precision is the ratio of the number of relevant documents retrieved to the total number of retrieved documents.

$$\mathrm{Precision} = \frac{|\{\mathrm{RelevantDoc}\} \cap \{\mathrm{RetrievedDoc}\}|}{|\{\mathrm{RetrievedDoc}\}|} \qquad (2)$$

The F-measure, or balanced F-score, combines the two as the harmonic mean of precision and recall.
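As a worked illustration of Equations (1) and (2) and of the harmonic mean F = 2 · Precision · Recall / (Precision + Recall), the three measures can be computed from two sets. The sketch is in Python and is not taken from OLAP-TAS; the function name is hypothetical, and returning 0 when a denominator is empty is a convention chosen for this example.

```python
def evaluate(relevant, retrieved):
    """Return (recall, precision, f_measure) for two collections of items."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure
```

For example, with four relevant items of which two appear among three retrieved items, recall is 2/4 = 0.5, precision is 2/3, and the F-measure is 4/7.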

To show the kind of results and statistics produced by OLAP-TAS after execution, we take the first corpus as an example and present the graphs obtained for the different algorithms in Figures 4, 5, 6 and 7.

Figure 4: Comparison of the recall.

Figure 5: Comparison of the precision.

Figure 6: Comparison of the F-measure.

Figure 7: Comparison of the runtime.

4.2 Uses of OLAP-TAS

In this subsection we illustrate the use of the developed tool in both education and research.

Education: OLAP-TAS is a visual tool that instructors can use to help their students understand the basic concepts and algorithms they encounter during their studies. For example, it can be used to teach students how the k-bisecting clustering algorithm based on the Jensen-Shannon divergence between probability distributions works (Wartena and Brussee, 2008), as well as TF*IDF and its variations in TOP-Keywords (Ravat et al., 2008) and BienCube (Bringay et al., 2011). It can also help students understand how graphs are used for textual aggregation through the selection of cycles in TAG (Bouakkaz et al., 2014), and how the Google similarity distance is used (Cilibrasi and Vitanyi, 2007). In addition, it shows students how recall, precision and F-measure change with the number of aggregated keywords k chosen by the user. Instructors may ask their students to run experiments on a real corpus using OLAP-TAS, write applications that use its Java classes, extend an existing approach, or contribute a new algorithm to be integrated into OLAP-TAS.
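As an example of the kind of computation a student might inspect, the Jensen-Shannon divergence mentioned above can be written in a few lines. The sketch uses the standard definition, JSD(P, Q) = KL(P || M) / 2 + KL(Q || M) / 2 with M = (P + Q) / 2, in Python rather than in the Java of OLAP-TAS; the function names are hypothetical and the distributions are assumed to be discrete and of equal length.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; zero-probability terms are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two discrete probability distributions.

    With base-2 logarithms the value ranges from 0 (identical distributions)
    to 1 (distributions with disjoint support).
    """
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)
```

Because the midpoint distribution is positive wherever either input is positive, the divergence stays finite even when the two distributions share no support, which makes it convenient as a distance between term distributions in clustering.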

Research: OLAP-TAS contains implementations of several algorithms and approaches that address common problems such as textual aggregation in an OLAP context. It also comes with two corpora and annotated datasets. Implementations of other algorithms, as well as other corpora, can be integrated into the platform. This makes it a good resource for researchers to build systems and conduct experiments. OLAP-TAS has been used successfully in several research projects, as shown in (Bouakkaz et al., 2014).

5 CONCLUSIONS

In this paper, a system for textual aggregation in text OLAP (OLAP-TAS) has been described. The software helps the user discover the main aggregated keywords that best represent a document collection. Each approach is coded in a separate Java class so that users can extend it or export it to another system. OLAP-TAS reduces the amount of repeated code, simplifies common tasks, and provides a graphical interface for textual aggregation approaches without requiring knowledge of the Java programming language.

REFERENCES

Archetti, F. and Campanelli, P. (2006). A hierarchical document clustering environment based on the induced bisecting k-means. International Conference on Database and Expert Systems Applications, pages 257–269.

Blei, D. and Andrew, Y. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Bouakkaz, M., Loudcher, S., and Ouinten, Y. (2014). Automatic textual aggregation approach of scientific articles in OLAP context. 10th International Conference on Innovations in Information Technology.

Bouakkaz, M., Loudcher, S., and Ouinten, Y. (2015). GOTA: Using the Google similarity distance for OLAP textual aggregation. 17th International Conference on Enterprise Information Systems (ICEIS).

Bringay, S., Laurent, A., and Poncelet, P. (2011). Towards an on-line analysis of tweets processing. Database and Expert Systems Applications, pages 154–161.

Cilibrasi, R. and Vitanyi, P. (2007). The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, pages 370–383.

Hady, W., Ecpeng, L., and HweeHua, P. (2007). TUBE (Text-cUBE) for discovering documentary evidence of associations among entities. Symposium on Applied Computing, pages 824–828.

Kohomban, U. and Lee, W. S. (2007). Optimizing classifier performance in word sense disambiguation by redefining sense classes. International Joint Conference on Artificial Intelligence, pages 1635–1640.

Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order into texts. Empirical Methods in Natural Language Processing, pages 26–31.

Moschitti, A. (2003). Natural language processing and text categorization: a study on the reciprocal beneficial interactions. PhD thesis, University of Rome Tor Vergata, Rome, Italy, pages 34–47.

Moschitti, A. and Basili, R. (2004). Complex linguistic features for text classification: a comprehensive study. The 26th European Conference on Information Retrieval Research, pages 34–47.

Oukid, L., Asfari, O., and Bentayeb, F. (2013). CXT-cube: Contextual text cube model and aggregation operator for text OLAP. International Workshop on Data Warehousing and OLAP, pages 56–61.

Poudat, C., Cleuziou, G., and Clavier, V. (2006). Categorisation de textes en domaines et genres. Complementarite des indexations lexicale et morpho syntaxique. Lexique et morphosyntaxe en RI, 9:61–76.

Ravat, F., Teste, O., and Tournier, R. (2007). OLAP aggregation function for textual data warehouse. In International Conference on Enterprise Information Systems, pages 151–156.

Ravat, F., Teste, O., and Tournier, R. (2008). Top keyword extraction method for OLAP document. In International Conference on Data Warehousing and Knowledge Discovery, pages 257–269.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, pages 34–47.

Wartena, C. and Brussee, R. (2008). Topic detection by clustering keywords. International Conference on Database and Expert Systems Applications, pages 54–58.

Zhang, D., Zhai, C., and Han, J. (2009). Topic cube: Topic modeling for OLAP on multidimensional text databases. International Conference on Data Mining, pages 1124–1135.
