Learning Query Expansion from Association Rules Between Terms

Ahem Bouziri, Chiraz Latiri, Eric Gaussier and Yassin Belhareth

ENSI - Manouba University, LIPAH-FST, Manouba, Tunisia

ISAMM - Manouba University, LIPAH-FST, Tunis, Tunisia

Université Joseph Fourier - Laboratoire d’Informatique de Grenoble, Grenoble, France

ISAMM - Manouba University, Manouba, Tunisia

Keywords:

Query Expansion, Association Rules, Classiﬁcation.

Abstract:

Query expansion offers an interesting solution for obtaining a complete answer to a user query while preserving the quality of the retained documents. It relies mainly on an accurate choice of the terms added to the initial query. In this paper, we use data mining methods to extract dependencies between terms, namely a generic basis of association rules between terms. Faced with the large number of derived association rules, and in order to select the optimal combination of query terms from the generic basis, we propose to model the problem as a classification problem and to solve it using a supervised learning algorithm. For this purpose, we first generate a training set using a genetic algorithm based approach that explores the association rule space in order to find an optimal set of expansion terms, i.e. one that improves the MAP of the search results. We then build a model able to predict which association rules should be used when expanding a query. The experiments were performed on the SDA 95 collection, a data collection for information retrieval. The main observation is that combining text mining techniques with query expansion allows us to benefit from the strengths of both. As this is a preliminary attempt in this direction, there is considerable scope for enhancing the proposed method.

1 INTRODUCTION

Query expansion aims at reducing the usual query/document mismatch by expanding the query using terms that are related to the original query terms, but have not been explicitly mentioned by the user. The goal of this technique is not only to improve recall by retrieving relevant documents that cannot be retrieved by the user query, but also to improve precision by putting the most relevant documents at the top of the list of retrieved documents. We claim that a synergy between classical IR techniques and some advanced text mining methods, especially association rules between terms (Agrawal and Srikant, 1994), is particularly appropriate.

However, applying association rules in the context

of IR is far from being a trivial task, mostly because

of the huge number of potentially interesting rules

that can be drawn from a document collection. We

mainly concentrate in this work on reducing the num-

ber of selected rules for the expansion process while

retaining the most interesting ones.

In this paper, we propose to model query ex-

pansion based on association rules as a classiﬁcation

problem and solve it using a supervised learning al-

gorithm. Given a query and the set of its association

rules, the classiﬁer must be able to decide for each

association rule whether it is appropriate to use it for

expansion. Our automatic query expansion process is

based on the new generic basis of association rules.

The main thrust in the proposal is that the introduced

basis gathers a minimal set of rules allowing an ef-

fective selection of rules to be used in the expansion

process.

2 RELATED WORK

Different methods dedicated to query expansion have

been proposed in the literature such as those based on

user relevance feedback (Ruthven and Lalmas, 2003),

pseudo relevance feedback (Buckley et al., 1994; Mi-

tra et al., 1998), and terms co-occurrences (Lin et al.,

2008; Rungsawang et al., 1999). A recent survey of

automatic query expansion approaches is proposed in

(Carpineto and Romano, 2012).

Bouziri, A., Latiri, C., Gaussier, E. and Belhareth, Y..

Learning Query Expansion from Association Rules Between Terms.

In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 1: KDIR, pages 525-530

ISBN: 978-989-758-158-8


In a user relevance feedback context, related terms

come from user identiﬁed relevant documents or

queries. In (Fonseca et al., 2005), Fonseca et al. seg-

mented query sessions in search engine query logs

into subsessions and then used association rules to ex-

tract related queries from those subsessions. They cal-

culated the relatedness between all queries using the

association rule mining model and then built a query

relation graph. The query relation graph was used for

identifying terms related to a given user input query.

The approach presented in (Boughanem and Tamine, 2000) combines relevance feedback and genetic algorithms to optimise query reformulation.

On the other hand, in pseudo relevance feedback the expansion terms come from the top k retrieved documents, which are assumed to be relevant without any intervention from the user. The authors in (Buckley et al.,

1994; Mitra et al., 1998) proposed approaches for ex-

panding search engine queries. The related terms are

extracted from the top documents that are returned in

response to the original query using statistical heuris-

tics, and the query is expanded using these extracted

terms. The results of this approach on large collections are sometimes even negative, since the documents assumed to be relevant by the information retrieval system are not all relevant (Buckley et al., 1994). Another limitation of this technique is that it is a local query expansion technique, based only on the set of documents retrieved for the query. As a consequence, it is more focused on the given query than global analysis. Indeed, in

(Xu and Croft, 1996), the authors showed that using

global analysis techniques produces results that are

both more effective and more predictable than sim-

ple local feedback. In (Chifu and Mothe, 2014), the

pseudo relevance feedback is used in a selective query

expansion approach. To overcome some of the limits

of PRF, the authors propose to use it only for difﬁcult

queries. An evolutionary approach for improving efﬁ-

ciency of pseudo-relevance feedback-based query ex-

pansion is proposed in (Pragati Bhatnagar, 2015). In

this method, the candidate terms for query expansion

are selected from an initially retrieved list of docu-

ments, ranked on the basis of co-occurrence measure

of the terms with the query terms.

Association rules techniques extract relationships

based on term co-occurrences where the window size

used is a document. The authors of (Tangpong and Rungsawang, 2000) obtained a small improvement when using the APRIORI algorithm (Agrawal and Srikant, 1994) with a high confidence threshold (more than 50%), which generated a small number of association rules. Using a lower confidence threshold (10%), the authors obtained better results (Rungsawang et al., 1999). The same approach is followed by (Haddad et al., 2000), who report an improvement when using the APRIORI algorithm to extract association rules. The best improvements were obtained with low confidence values. The main limitation of this approach

consists in the huge number of generated association

rules while a large part of them are redundant in the

sense that several rules convey the same information.

The removal of redundancy within mined rules is then

a key step for improving the quality of the expan-

sion as performed in the approach we propose in this

work. A mining algorithm better adapted to text, which avoids redundancy, is proposed by (Latiri et al., 2012). A generic basis MGB of non-redundant association rules between terms is first derived from the tested document collection. This compact basis is then used to blindly expand the user query with all terms that appear in the conclusions of the non-redundant association rules whose premise is contained in the original query. Experimental evaluation of this approach shows an improvement of the mean precision for the tested document collections. In the present work, we refine this approach.

3 MINING ASSOCIATION RULES BETWEEN TERMS

In this work, we apply to the text mining field the theoretical framework of Formal Concept Analysis (FCA) presented in (Ganter and Wille, 1999). First, we formalize an extraction context made up of documents and index terms, called a textual context.

Definition 1. A textual context is a triplet $\mathcal{M} := (\mathcal{C}, \mathcal{T}, \mathcal{I})$ where:

• $\mathcal{C} := \{d_1, d_2, \ldots, d_n\}$ is a finite set of $n$ documents of a collection.

• $\mathcal{T} := \{t_1, \ldots, t_m\}$ is a finite set of $m$ distinct terms in the collection. The set $\mathcal{T}$ then gathers without duplication the terms of the different documents which constitute the collection.

• $\mathcal{I} \subseteq \mathcal{C} \times \mathcal{T}$ is a binary (incidence) relation. Each couple $(d, t) \in \mathcal{I}$ indicates that the document $d \in \mathcal{C}$ has the term $t \in \mathcal{T}$.

A termset is a set of terms. The support of a

termset is deﬁned as follows.

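Definition 2. Let $T \subseteq \mathcal{T}$. The support of $T$ in $\mathcal{M}$ is equal to the number of documents in $\mathcal{C}$ containing all the terms of $T$. The support is formally defined as follows:

$$\mathrm{Supp}(T) = |\{d \mid d \in \mathcal{C} \wedge \forall\, t \in T : (d, t) \in \mathcal{I}\}| \qquad (1)$$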

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval


$\mathrm{Supp}(T)$ is called the absolute support of $T$ in $\mathcal{M}$. The relative support (aka frequency) of $T$ in $\mathcal{M}$ is equal to $\mathrm{Supp}(T) / |\mathcal{C}|$.

A termset is said to be frequent (aka large or covering) if its terms co-occur in the collection a number of times greater than or equal to a user-defined support threshold, denoted minsupp. Otherwise, it is said to be infrequent (aka rare).

The derivation of association rules between terms

is achieved starting from the set of frequent termsets

extracted from a context M. Many representations

of frequent termsets were proposed in the litera-

ture where terms are characterized by the frequency

of their co-occurrence. The ones based on closed

termsets (Pasquier et al., 2005) and minimal gener-

ators (Bastide et al., 2000) are at the core of the

deﬁnitions of almost all generic bases of the litera-

ture. They result from the mathematical background

of FCA (Ganter and Wille, 1999), described in the

next subsection.

Definition 3. An association rule $R$ is an implication of the form $R: T_1 \Rightarrow T_2$, where $T_1$ and $T_2$ are subsets of $\mathcal{T}$, and $T_1 \cap T_2 = \emptyset$. The termsets $T_1$ and $T_2$ are, respectively, called the premise and the conclusion of $R$. The rule $R$ is said to be based on the termset $T$ equal to $T_1 \cup T_2$. The support of a rule $R: T_1 \Rightarrow T_2$ is then defined as:

$$\mathrm{Supp}(R) = \mathrm{Supp}(T), \qquad (2)$$

while its confidence is computed as:

$$\mathrm{Conf}(R) = \frac{\mathrm{Supp}(T)}{\mathrm{Supp}(T_1)}. \qquad (3)$$

An association rule R is said to be valid if its confidence value, i.e., Conf(R), is greater than or equal to

a user-deﬁned threshold denoted minconf. This con-

ﬁdence threshold is used to exclude non valid rules.

Also, the given support threshold minsupp is used to

remove rules based on termsets T that do not occur

often enough, i.e., rules having Supp(T) < minsupp.
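As an illustration only, the definitions of support, confidence and rule validity can be sketched in Python on a toy textual context. The documents, terms and function names below are hypothetical and are not taken from the collection used in this paper.

```python
# Toy textual context (hypothetical): each document is the set of its index terms.
CONTEXT = {
    "d1": {"query", "expansion", "retrieval"},
    "d2": {"query", "expansion"},
    "d3": {"query", "retrieval"},
    "d4": {"association", "rule"},
}

def supp(termset, context=CONTEXT):
    """Absolute support: number of documents containing every term of termset."""
    return sum(1 for terms in context.values() if set(termset) <= terms)

def conf(premise, conclusion, context=CONTEXT):
    """Confidence of premise => conclusion: Supp(T1 u T2) / Supp(T1)."""
    premise, conclusion = set(premise), set(conclusion)
    if premise & conclusion:
        raise ValueError("premise and conclusion must be disjoint")
    premise_support = supp(premise, context)
    if premise_support == 0:
        return 0.0  # convention chosen here for a premise that never occurs
    return supp(premise | conclusion, context) / premise_support

def is_valid(premise, conclusion, minsupp, minconf, context=CONTEXT):
    """Keep a rule only if it reaches both the support and confidence thresholds."""
    based_on = set(premise) | set(conclusion)
    return (supp(based_on, context) >= minsupp
            and conf(premise, conclusion, context) >= minconf)
```

With this toy context, the rule {query} ⇒ {expansion} is based on a termset contained in two documents, while its premise is contained in three.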

4 LEARNING QUERY EXPANSION FROM ASSOCIATION RULES

4.1 Model and Problem Statement

In our approach, the query expansion problem is defined as follows: given an original query $OQ := \{t_1, \ldots, t_{|OQ|}\}$ and the set of its related association rules, select the association rules whose conclusion terms are best suited to expand $OQ$ and return documents that meet the user need.

We model this problem as a supervised classification problem in which we attempt to predict whether or not to use a given association rule to expand the query at hand. Each observation $x$ lies in an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and is associated with a class $y \in \mathcal{Y}$, where $|\mathcal{Y}| = 2$. In the context of query expansion using association rules, $x_i \in \mathcal{X}$ denotes the vector representation of a query/association rule pair and $y_i \in \mathcal{Y}$ represents the class associated with $x_i$. The observation belongs to the positive class (i.e. $y_i = 1$) if, in the query/association rule pair represented by $x_i$, the association rule is used to expand the query; it belongs to the negative class ($y_i = 0$) otherwise.

4.2 Input Space Representation

Observations are vectors in the input space $\mathcal{X} \subseteq \mathbb{R}^d$ ($d = 14$). The values of these vectors are computed from statistical distribution measures of the terms in the query text and in the association rule conclusion. These features fall into three categories, which are explained in the following paragraphs.

4.2.1 Document Frequency based Features

The document frequency (DF) is a statistical predictor that measures whether a term is rare or common in the corpus. Its value for a query is the average of the DF of all query terms. The DF of a query $Q$ is computed as in (4), where $D$ is the set of documents of the collection and $t_i$ is the $i$-th term of $Q$:

$$DF(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|\{d \in D : t_i \in d\}|}{|D|} \qquad (4)$$

We also compute the inverse document frequency (iDF) of a query as follows:

$$iDF(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \log \frac{|D|}{|\{d \in D : t_i \in d\}|} \qquad (5)$$

Both DF and iDF are also computed for an association rule, where they represent the average of the DF or iDF

for all the terms in the conclusion of the association

rule.
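A minimal Python sketch of the two document frequency features is given below, assuming that DF and iDF are averaged over the terms of the query (or of the rule conclusion) as described above. The toy collection and the function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not specified here.

```python
import math

# Toy collection (hypothetical): each document is a list of term occurrences.
DOCS = [
    ["query", "expansion", "query"],
    ["query", "retrieval"],
    ["expansion", "rule"],
    ["rule", "mining"],
]

def doc_count(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for d in docs if term in d)

def df(terms, docs):
    """Average document frequency of the given terms (equation 4)."""
    return sum(doc_count(t, docs) / len(docs) for t in terms) / len(terms)

def idf(terms, docs):
    """Average inverse document frequency of the given terms (equation 5).
    Assumes that every term occurs in at least one document."""
    return sum(math.log(len(docs) / doc_count(t, docs)) for t in terms) / len(terms)
```

The same two functions can be applied to the terms of a query or to the terms of a rule conclusion, which yields four of the input features.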

4.2.2 Term Frequency based Features

We include features dealing with term frequency in

the documents of the collection. The term frequency

(TF(t, d)) is simply the number of occurrences of

term $t$ in document $d$. For each term $t_i$ of the query or of the association rule conclusion, we use the average of its frequencies over all documents, denoted $ATF(t_i)$. For a query, the term frequency is the average of the ATF of all the terms in the query and is calculated as in (6):

$$TF(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} ATF(t_i) \qquad (6)$$

$$ATF(t_i) = \frac{1}{|D|} \sum_{j=1}^{|D|} TF(t_i, d_j)$$

When dealing with an average, it is important to know how the values are distributed around it. We introduce a variance measure to evaluate the variation of term frequencies. For a query, we compute the average of these variances over all query terms, as given in equation (7):

$$V(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{|D| - 1} \sum_{j=1}^{|D|} \left( TF(t_i, d_j) - ATF(t_i) \right)^2 \qquad (7)$$
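The term frequency features can be sketched in the same way. The sketch below assumes that the variance of equation (7) is the sample variance of the term frequencies around ATF (squared deviations, divided by |D| − 1); this reading, the toy collection and the function names are assumptions made for illustration.

```python
# Toy collection (hypothetical): each document is a list of term occurrences.
DOCS = [
    ["query", "expansion", "query"],
    ["query", "retrieval"],
    ["expansion", "rule"],
    ["rule", "mining"],
]

def tf(term, doc):
    """Number of occurrences of term in doc."""
    return doc.count(term)

def atf(term, docs):
    """Average term frequency of term over all documents."""
    return sum(tf(term, d) for d in docs) / len(docs)

def tf_query(terms, docs):
    """Average of the ATF values of the given terms (equation 6)."""
    return sum(atf(t, docs) for t in terms) / len(terms)

def tf_variance(term, docs):
    """Sample variance of the frequencies of term around its ATF (assumed form)."""
    mean = atf(term, docs)
    return sum((tf(term, d) - mean) ** 2 for d in docs) / (len(docs) - 1)

def v_query(terms, docs):
    """Average of the per-term variances (equation 7, as read above)."""
    return sum(tf_variance(t, docs) for t in terms) / len(terms)
```

As with DF and iDF, both measures can be computed for the query terms and for the conclusion terms of a rule.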

4.2.3 Features Based on Association Rule Properties

In addition to the features based on term distribution measures, we use 6 features to characterise an association rule:

• number of terms in the association rule premise,

• number of terms in the association rule conclu-

sion,

• association rule confidence as given in (3),

• association rule support as given in (2),

• the ratio of the number of terms in the association rule premise to the number of terms in the query. This ratio measures how well the rule matches the query.

• the proportion of terms in the association rule con-

clusion that are present in the other association

rules. This proportion measures the importance

of the conclusion terms.
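The six rule-level features listed above can be assembled as in the following Python sketch. The `Rule` class and the function name are hypothetical. The last feature is computed here under the assumption that a conclusion term is "present in the other association rules" when it appears in the conclusion of at least one other rule; the text leaves this detail open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """Hypothetical container for an association rule and its statistics."""
    premise: frozenset
    conclusion: frozenset
    support: int
    confidence: float

def rule_features(rule, query_terms, other_rules):
    """Return the six rule-level features in the order of the list above."""
    # Terms found in the conclusions of the other rules (assumed interpretation).
    other_conclusions = set()
    for other in other_rules:
        if other is not rule:
            other_conclusions |= other.conclusion
    shared = sum(1 for t in rule.conclusion if t in other_conclusions)
    return [
        len(rule.premise),                      # size of the premise
        len(rule.conclusion),                   # size of the conclusion
        rule.confidence,                        # confidence, equation (3)
        rule.support,                           # support, equation (2)
        len(rule.premise) / len(query_terms),   # premise/query size ratio
        shared / len(rule.conclusion),          # shared conclusion terms
    ]
```

Together with the four term distribution measures computed for the query and the four computed for the rule conclusion, these six values give the 14 dimensions of the input space.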

4.3 Generating Training Instances

We represent training instances as (query, association rule) pairs. For each query, we generate as many instances as there are association rules with at least one premise term in the query. An instance belongs to the positive class if the corresponding association rule is used to build the optimal expanded query; it belongs to the negative class otherwise. Query expansion is optimised using an exploration process based on genetic algorithms. The original query and the set of association rules that are candidates for expansion are the input to the optimisation process. The process returns an optimal expanded query and generates the training instances corresponding to this optimal expansion. The principles of this exploration process are described in the following paragraphs.
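The labelling of training instances can be sketched as follows, assuming that the exploration process has already returned the identifiers of the rules used in the optimal expanded query. The data layout and names are hypothetical.

```python
def build_instances(query_terms, rules, optimal_rule_ids):
    """Create one labelled instance per rule sharing a premise term with the query.

    rules: dict mapping a rule identifier to a (premise, conclusion) pair of sets.
    optimal_rule_ids: identifiers of the rules used in the optimal expansion,
    assumed to be provided by the genetic exploration process.
    """
    query_terms = set(query_terms)
    instances = []
    for rule_id, (premise, _conclusion) in rules.items():
        if query_terms & set(premise):          # at least one premise term in the query
            label = 1 if rule_id in optimal_rule_ids else 0
            instances.append((rule_id, label))
    return instances
```

In a full pipeline, each (query, rule) pair would then be mapped to its 14-dimensional feature vector before being passed to the classifier.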

4.3.1 Chromosome Representation

A chromosome is a vector of terms representing a candidate expanded query formed by the initial query terms and the conclusion terms of a candidate association rule. An association rule is considered to be a candidate if its premise is composed of a subset of the initial query terms. The length of a chromosome is the number of initial query terms plus the total number of candidate expansion terms.

Given an original query $OQ = \{t_1, \ldots, t_{|OQ|}\}$, the set of candidate expansion terms obtained from the association rules of the MGB basis, denoted $TC$, is defined as follows (Latiri et al., 2012):

$$\forall R : T_1 \Rightarrow T_2, \text{ a non-redundant rule} \in MGB; \quad \text{if } T_1 \subseteq OQ \text{ then } TC = TC \cup T_2 \qquad (8)$$

Equation (8) means that if the premise of an association rule $R$ is contained in $OQ$, the conclusion terms of $R$ are candidates for expansion.

We adopt a binary coding of chromosomes which indicates whether or not a given term appears in the

expanded query.
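The construction of the candidate term set and of a binary chromosome can be illustrated with the short Python sketch below. The ordering of the positions (query terms first, then candidate terms, each sorted alphabetically) is an arbitrary choice made here, and the function names are hypothetical.

```python
def candidate_terms(query_terms, rules):
    """Union of the conclusions of the rules whose premise is contained in the query."""
    oq = set(query_terms)
    tc = set()
    for premise, conclusion in rules:
        if set(premise) <= oq:
            tc |= set(conclusion)
    return tc

def encode(query_terms, tc, rule_conclusion):
    """Binary chromosome over the initial query terms followed by the candidate terms.

    A bit is set when the term belongs to the candidate expanded query, i.e. when
    it is an initial query term or a conclusion term of the chosen rule.
    """
    oq = set(query_terms)
    positions = sorted(oq) + sorted(set(tc) - oq)
    active = oq | set(rule_conclusion)
    return positions, [1 if t in active else 0 for t in positions]
```

For a query with two terms and two candidate expansion terms, a chromosome therefore has four positions, and the first two bits are always set.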

4.3.2 Initial Population

To build the individuals of the initial population, we first filter the generic basis of association rules MGB, keeping only the rules whose premise terms are terms of the initial query. The size of the initial population is equal to the number of these rules.

4.3.3 Fitness Function

Query expansion aims at improving search results.

The ﬁtness of an individual measures the ability of the

candidate expanded query to return documents meeting the user’s need. We use the mean average precision (MAP) returned by the Terrier information retrieval system to measure the fitness of a chromosome.

4.3.4 Genetic Operators

Selection. This operator allows individuals in the

current generation to survive, to reproduce or die. In

general, the probability of survival of an individual


depends on its fitness. In this implementation, we opt for an elite selection, selecting the best n individuals needed to build the new generation.

Crossover. The crossover operator allows recombination of the information in the genetic heritage, thereby favouring exploration of the search space. In our case, crossover produces a new expanded query. We opt for a crossover operator which produces one child from two parents. The child inherits the expansion terms of both parents.

Replacement. At each iteration, new individuals

replace their parents.
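The three operators can be combined as in the following Python sketch, which is only an illustration under several assumptions. The fitness used in this paper is the MAP returned by Terrier; here it is replaced by a hypothetical toy function so that the sketch runs on its own. Keeping track of the best individual seen so far, the number of generations and all names are also choices made for this sketch.

```python
import random

def crossover(parent_a, parent_b):
    """One child that inherits the expansion terms of both parents (bitwise OR)."""
    return [a | b for a, b in zip(parent_a, parent_b)]

def evolve(population, fitness, n_elite=2, generations=5, rng=None):
    """Elite selection, crossover, and replacement of the parents by the children.

    Requires 2 <= n_elite <= len(population). The best individual seen so far
    is remembered, which is an addition made for this sketch.
    """
    rng = rng or random.Random(0)
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: keep the n_elite fittest individuals as parents.
        elite = sorted(population, key=fitness, reverse=True)[:n_elite]
        # Crossover and replacement: children form the new generation.
        population = [crossover(*rng.sample(elite, 2)) for _ in population]
        best = max(population + [best], key=fitness)
    return best

def toy_fitness(chromosome):
    """Hypothetical stand-in for MAP: rewards 'useful' terms, penalises the others."""
    useful = [1, 1, 1, 0, 0]
    return sum((1 if u else -1) for bit, u in zip(chromosome, useful) if bit)
```

In the actual setting, evaluating the fitness of a chromosome means running the corresponding expanded query against the collection, so the number of individuals and generations directly determines the number of retrieval runs.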

5 EXPERIMENTAL EVALUATION

5.1 Evaluation Framework

Experiments were conducted under TERRIER using the collection of French texts of the CLEF 2003 corpus. The French SDA 95 collection, referred to in the remainder of this paper as SDA-95, includes 42615 documents and 60 queries, each provided with a set of relevant documents. We configure Terrier so that the search includes the terms in both the description and narration fields.

5.2 Evaluation Steps

The evaluation of the automatic query expansion pro-

cess proposed in this paper proceeds in eight steps.

1. Evaluate search with the original queries to determine a baseline for comparison.

2. Evaluate search with queries expanded by pseudo relevance feedback (PRF).

3. Generate the minimal basis of association rules using the Charm tool.

4. Build the training set: this task is performed automatically.

5. Construct learning models using a decision tree (DT) under Weka.

6. Apply the classifier to the test set.

7. Expand the queries in the test set according to the classification results obtained in step 6.

8. Carry out search using the OKAPI BM25 weighting scheme (Jones et al., 2000).

The relevance of the returned documents is esti-

mated according to the following measures:

• The MAP (Mean Average Precision) which de-

ﬁnes the overall performance of the search en-

gine.

• Precision at 5, 10, 15 and 30 returned documents (P@5, P@10, P@15 and P@30).

• Precision at the 11 recall points (P@11).
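For reference, precision at k and mean average precision can be computed as in the Python sketch below. It follows the standard definitions of these measures and is not the evaluation code of Terrier; the function names and the toy rankings used to exercise it are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k returned documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant document found."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Each query contributes one (ranked list, relevant set) pair, so the MAP of a run over the test queries is the mean of the per-query average precision values.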

5.3 Preliminary Results and Discussion

Table 1: Experimental Results for SDA-95 Collection.

Run         MAP      P@5     P@10    P@20
AG          0.422    0.469   0.360   0.243
DT          0.345    0.376   0.282   0.203
PRF         0.356    0.385   0.300   0.217
Baseline    0.339    0.369   0.283   0.205
∆ AG        24.48%   27%     27%     19%
∆ DT        1.77%    1%      0%      0%
∆ PRF       5.01%    4%      6%      6%

Table 1 summarises the results obtained for the SDA-95 collection. Values on the AG line are obtained with expanded queries optimised through the GA-based exploration algorithm. These results can be considered as upper bounds for queries expanded using association rules. Values on the DT line are drawn from experiments on the same queries expanded using the association rules predicted by the Weka J48 classifier, which implements a decision tree. The PRF line reports results obtained for queries expanded using the pseudo relevance feedback method implemented in Terrier with default settings (T=10, D=3). Improvements relative to the baseline are reported in the second part of Table 1. These results show that expanding queries through the learning process does better than the baseline in terms of MAP and P@5; however, it does not reach the performance of PRF.
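The relative improvements on MAP can be recomputed from the absolute values of Table 1, as in the following Python sketch. The dictionary layout and the function name are ours; the figures are those reported in the table.

```python
# MAP values reported in Table 1.
MAP_VALUES = {"AG": 0.422, "DT": 0.345, "PRF": 0.356, "Baseline": 0.339}

def relative_gain(value, baseline):
    """Improvement over the baseline, in percent."""
    return 100 * (value - baseline) / baseline

def map_gains(values=MAP_VALUES):
    """Relative MAP improvement of each expansion strategy over the baseline."""
    base = values["Baseline"]
    return {name: round(relative_gain(v, base), 2)
            for name, v in values.items() if name != "Baseline"}
```

The same function applies to the P@k columns, using the corresponding baseline value of each column.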

[Figure: precision (y-axis) versus recall (x-axis) for SDA 95 under Okapi BM25, with one curve each for Baseline, Expansion-Dec-Tree, Expansion-PRF and Expansion-AG-AR.]

Figure 1: Recall/Precision curves of the expansion approaches for the SDA-95 collection.

Figure 1 shows the evolution of the precision at the 11 recall points (P@11) for the different query expansion strategies.


6 CONCLUSION

We have presented in this article a query expansion approach based on learning how to expand queries using association rules between terms. The expansion problem is modeled as a supervised classification problem. An exploratory process based on genetic algorithms is used both to explore the association rule space in search of the best terms for expansion and to generate the training instances that are later used to build a classifier. The learning problem is solved with a decision tree. Experiments conducted on the French text collection SDA-95 show that learning how to expand queries using association rules is a promising approach. These preliminary results are encouraging, and we plan to extend our representation of the input space by including new features in order to improve the learning process.

REFERENCES

Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, VLDB 1994, pages 478–499, Santiago, Chile.

Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., and Lakhal, L. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the 1st International Conference on Computational Logic, volume 1861 of LNAI, pages 972–986, London, UK. Springer-Verlag.

Boughanem, M. and Tamine, L. (2000). Query optimization using an improved genetic algorithm. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 6-11, 2000, pages 368–373.

Buckley, C., Salton, G., Allan, J., and Singhal, A. (1994). Automatic Query Expansion Using SMART: TREC-3. In Proceedings of the 3rd Text REtrieval Conference.

Carpineto, C. and Romano, G. (2012). A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR), 44(1):1.

Chifu, A. and Mothe, J. (2014). Expansion sélective de requêtes par apprentissage. In Moens, M., Viard-Gaudin, C., Zargayouna, H., and Terrades, O. R., editors, CORIA 2014 - Conférence en Recherche d’Infomations et Applications - 11th French Information Retrieval Conference. CIFED 2014 Colloque International Francophone sur l’Ecrit et le Document, Nancy, France, March 19-23, 2014, pages 257–272. ARIA-GRCE.

Fonseca, B. M., Golgher, P. B., Pôssas, B., Ribeiro-Neto, B. A., and Ziviani, N. (2005). Concept-based interactive query expansion. In Proceedings of the 14th International Conference on Information and Knowledge Management, CIKM 2005, pages 696–703, Bremen, Germany. ACM Press.

Ganter, B. and Wille, R. (1999). Formal Concept Analysis. Springer-Verlag.

Haddad, H., Chevallet, J. P., and Bruandet, M. F. (2000). Relations between terms discovered by association rules. In Proceedings of the Workshop on Machine Learning and Textual Information Access in conjunction with the 4th European Conference on Principles and Practices of Knowledge Discovery in Databases, PKDD 2000, Lyon, France.

Jones, K. S., Walker, S., and Robertson, S. E. (2000). A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management, 36(6):779–840.

Latiri, C., Haddad, H., and Hamrouni, T. (2012). Towards an effective automatic query expansion process using an association rule mining approach. Journal of Intelligent Information Systems, 39(1):209–247.

Lin, H. C., Wang, L. H., and Chen, S. M. (2008). Query expansion for document retrieval by mining additional query terms. Information and Management Sciences, 19(1):17–30.

Mitra, M., Singhal, A., and Buckley, C. (1998). Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’98, pages 206–214, Melbourne, Australia. ACM Press.

Pasquier, N., Bastide, Y., Taouil, R., Stumme, G., and Lakhal, L. (2005). Generating a condensed representation for association rules. Journal of Intelligent Information Systems, 24(1):25–60.

Pragati Bhatnagar, N. P. (2015). Genetic algorithm-based query expansion for improved information retrieval. In Intelligent Computing, Communication and Devices, Advances in Intelligent Systems and Computing, volume 308, pages 47–55.

Rungsawang, A., Tangpong, A., Laohawee, P., and Khampachua, T. (1999). Novel query expansion technique using apriori algorithm. In Proceedings of the 8th Text REtrieval Conference, TREC 8, pages 453–456, Gaithersburg, Maryland.

Ruthven, I. and Lalmas, M. (2003). A survey on the use of relevance feedback for information access systems. Knowledge Engineering Review, 18(2):95–145.

Tangpong, A. and Rungsawang, A. (2000). Applying association rules discovery in query expansion process. In Proceedings of the 4th World Multi-Conference on Systemics, Cybernetics and Informatics, SCI 2000, Orlando, Florida, USA.

Xu, J. and Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1996, pages 4–11, Zurich, Switzerland. ACM Press.
