Learning Query Expansion from Association Rules Between Terms
Ahem Bouziri¹, Chiraz Latiri², Eric Gaussier³ and Yassin Belhareth⁴
¹ ENSI - Manouba University, LIPAH-FST, Manouba, Tunisia
² ISAMM - Manouba University, LIPAH-FST, Tunis, Tunisia
³ Université Joseph Fourier - Laboratoire d'Informatique de Grenoble, Grenoble, France
⁴ ISAMM - Manouba University, Manouba, Tunisia
Keywords: Query Expansion, Association Rules, Classification.
Abstract: Query expansion offers an interesting solution for obtaining a complete answer to a user query while preserving the quality of the retained documents. It mainly relies on an accurate choice of the terms added to the initial query. In this paper, we use data mining methods to extract dependencies between terms, namely a generic basis of association rules between terms. Faced with the huge number of derived association rules, and in order to select the optimal combination of query terms from the generic basis, we propose to model the problem as a classification problem and to solve it using a supervised learning algorithm. For this purpose, we first generate a training set using a genetic algorithm based approach that explores the association rule space in order to find an optimal set of expansion terms, improving the MAP of the search results. We then build a model able to predict which association rules are to be used when expanding a query. The experiments were performed on the SDA 95 collection, a data collection for information retrieval. The main observation is that combining text mining techniques and query expansion in an intelligent way allows us to incorporate the good features of both. As this is a preliminary attempt in this direction, there is a large scope for enhancing the proposed method.
1 INTRODUCTION
Query expansion aims to reduce the usual query/document mismatch by expanding the query with terms that are related to the original query terms but have not been explicitly mentioned by the user. The goal of this technique is not only to improve recall by retrieving relevant documents that cannot be retrieved by the user query, but also to improve the precision of the retrieved documents by putting the most relevant ones at the top of the list of retrieved documents. We claim that a synergy between classical IR techniques and some advanced text mining methods, especially association rules between terms (Agrawal and Srikant, 1994), is particularly appropriate.
However, applying association rules in the context
of IR is far from being a trivial task, mostly because
of the huge number of potentially interesting rules
that can be drawn from a document collection. We
mainly concentrate in this work on reducing the num-
ber of selected rules for the expansion process while
retaining the most interesting ones.
In this paper, we propose to model query ex-
pansion based on association rules as a classification
problem and solve it using a supervised learning al-
gorithm. Given a query and the set of its association
rules, the classifier must be able to decide for each
association rule whether it is appropriate to use it for
expansion. Our automatic query expansion process is
based on the new generic basis of association rules.
The main thrust in the proposal is that the introduced
basis gathers a minimal set of rules allowing an ef-
fective selection of rules to be used in the expansion
process.
2 RELATED WORK
Different methods dedicated to query expansion have been proposed in the literature, such as those based on user relevance feedback (Ruthven and Lalmas, 2003), pseudo relevance feedback (Buckley et al., 1994; Mitra et al., 1998), and term co-occurrences (Lin et al., 2008; Rungsawang et al., 1999). A recent survey of automatic query expansion approaches is proposed in (Carpineto and Romano, 2012).
Bouziri, A., Latiri, C., Gaussier, E. and Belhareth, Y..
Learning Query Expansion from Association Rules Between Terms.
In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 1: KDIR, pages 525-530
ISBN: 978-989-758-158-8
Copyright © 2015 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
In a user relevance feedback context, related terms
come from user identified relevant documents or
queries. In (Fonseca et al., 2005), Fonseca et al. seg-
mented query sessions in search engine query logs
into subsessions and then used association rules to ex-
tract related queries from those subsessions. They cal-
culated the relatedness between all queries using the
association rule mining model and then built a query
relation graph. The query relation graph was used for
identifying terms related to a given user input query.
The approach presented in (Boughanem and Tamine, 2000) combines relevance feedback and genetic algorithms to optimise query reformulation.
On the other hand, in pseudo relevance feedback (PRF) the expansion terms come from the top k retrieved documents, which are assumed to be relevant without any intervention from the user. The authors in (Buckley et al., 1994; Mitra et al., 1998) proposed approaches for expanding search engine queries. The related terms are extracted from the top documents that are returned in response to the original query using statistical heuristics, and the query is expanded using these extracted terms. The results of this approach on large collections are sometimes even negative, since the assumed relevant documents retrieved by an information retrieval system are unfortunately not all relevant (Buckley et al., 1994). Another limitation of this technique is that it is a local query expansion technique, based on the set of documents retrieved for the query. As a consequence, the expansion terms are more focused on the given query than those obtained by global analysis. Indeed, in
(Xu and Croft, 1996), the authors showed that using
global analysis techniques produces results that are
both more effective and more predictable than sim-
ple local feedback. In (Chifu and Mothe, 2014), the
pseudo relevance feedback is used in a selective query
expansion approach. To overcome some of the limits
of PRF, the authors propose to use it only for difficult
queries. An evolutionary approach for improving effi-
ciency of pseudo-relevance feedback-based query ex-
pansion is proposed in (Pragati Bhatnagar, 2015). In
this method, the candidate terms for query expansion
are selected from an initially retrieved list of docu-
ments, ranked on the basis of co-occurrence measure
of the terms with the query terms.
Association rule techniques extract relationships based on term co-occurrences where the window size used is a document. The authors of (Tangpong and Rungsawang, 2000) obtained a small improvement when using the APRIORI algorithm (Agrawal and Srikant, 1994) with a high confidence threshold (more than 50%), which generated a small number of association rules. Using a lower confidence threshold (10%), the authors obtained better results (Rungsawang et al., 1999). The same approach is proposed by (Haddad et al., 2000), who also report improvements when using the APRIORI algorithm to extract association rules. The best improvements were obtained with low confidence values. The main limitation of this approach is the huge number of generated association rules, a large part of which are redundant in the sense that several rules convey the same information. The removal of redundancy within mined rules is then a key step for improving the quality of the expansion, as performed in the approach we propose in this work. A mining algorithm better adapted to text, which avoids redundancy, is proposed by (Latiri et al., 2012). A generic basis $\mathcal{M}GB$ of non-redundant association rules between terms is first derived from the tested document collection. This compact basis is then used to blindly expand the user query with all terms that appear in the conclusions of the non-redundant association rules whose premise is contained in the original query. Experimental evaluation of this approach shows an improvement of the mean precision for the tested document collections. In the present work, we refine this approach.
3 MINING ASSOCIATION RULES
BETWEEN TERMS
In this work, we apply to the text mining field the theoretical framework of Formal Concept Analysis (FCA) presented in (Ganter and Wille, 1999). First, we formalize an extraction context made up of documents and index terms, called a textual context.
Definition 1. A textual context is a triplet $\mathcal{M} := (\mathcal{C}, \mathcal{T}, \mathcal{I})$ where:
• $\mathcal{C} := \{d_1, d_2, \ldots, d_n\}$ is a finite set of $n$ documents of a collection.
• $\mathcal{T} := \{t_1, t_2, \ldots, t_m\}$ is a finite set of $m$ distinct terms in the collection. The set $\mathcal{T}$ then gathers without duplication the terms of the different documents which constitute the collection.
• $\mathcal{I} \subseteq \mathcal{C} \times \mathcal{T}$ is a binary (incidence) relation. Each couple $(d, t) \in \mathcal{I}$ indicates that the document $d \in \mathcal{C}$ has the term $t \in \mathcal{T}$.
A termset is a set of terms. The support of a termset is defined as follows.
Definition 2. Let $T \subseteq \mathcal{T}$. The support of $T$ in $\mathcal{M}$ is equal to the number of documents in $\mathcal{C}$ containing all the terms of $T$. The support is formally defined as follows:
$$Supp(T) = |\{d \mid d \in \mathcal{C} \wedge \forall t \in T : (d, t) \in \mathcal{I}\}| \quad (1)$$
KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval
$Supp(T)$ is called the absolute support of $T$ in $\mathcal{M}$. The relative support (aka frequency) of $T$ in $\mathcal{M}$ is equal to $\frac{Supp(T)}{|\mathcal{C}|}$.
A termset is said to be frequent (aka large or covering) if its terms co-occur in the collection a number of times greater than or equal to a user-defined support threshold, denoted minsupp. Otherwise, it is said to be infrequent (aka rare).
The derivation of association rules between terms
is achieved starting from the set of frequent termsets
extracted from a context M. Many representations
of frequent termsets were proposed in the litera-
ture where terms are characterized by the frequency
of their co-occurrence. The ones based on closed
termsets (Pasquier et al., 2005) and minimal gener-
ators (Bastide et al., 2000) are at the core of the
definitions of almost all generic bases of the litera-
ture. They result from the mathematical background
of FCA (Ganter and Wille, 1999), described in the
next subsection.
Definition 3. An association rule $R$ is an implication of the form $R: T_1 \Rightarrow T_2$, where $T_1$ and $T_2$ are subsets of $\mathcal{T}$, and $T_1 \cap T_2 = \emptyset$. The termsets $T_1$ and $T_2$ are, respectively, called the premise and the conclusion of $R$. The rule $R$ is said to be based on the termset $T$ equal to $T_1 \cup T_2$. The support of a rule $R: T_1 \Rightarrow T_2$ is then defined as:
$$Supp(R) = Supp(T), \quad (2)$$
while its confidence is computed as:
$$Conf(R) = \frac{Supp(T)}{Supp(T_1)}. \quad (3)$$
An association rule $R$ is said to be valid if its confidence value, i.e., $Conf(R)$, is greater than or equal to a user-defined threshold denoted minconf. This confidence threshold is used to exclude non-valid rules. Also, the given support threshold minsupp is used to remove rules based on termsets $T$ that do not occur often enough, i.e., rules having $Supp(T) < minsupp$.
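As an illustration, Definitions 2 and 3 and the two thresholds can be sketched in Python as follows. The toy context, the function names and the threshold values are assumptions introduced for this example only; they are not taken from the system described in this paper.

```python
from typing import Dict, Iterable, Set

# Toy textual context (illustrative data only): document id -> set of index terms.
CONTEXT: Dict[str, Set[str]] = {
    "d1": {"query", "expansion", "retrieval"},
    "d2": {"query", "expansion"},
    "d3": {"query", "retrieval"},
    "d4": {"mining", "rules"},
}


def supp(termset: Iterable[str], context: Dict[str, Set[str]]) -> int:
    """Absolute support (Eq. 1): number of documents containing every term."""
    wanted = set(termset)
    return sum(1 for terms in context.values() if wanted <= terms)


def conf(premise: Iterable[str], conclusion: Iterable[str],
         context: Dict[str, Set[str]]) -> float:
    """Confidence (Eq. 3): Supp(T1 U T2) / Supp(T1); 0.0 if T1 never occurs."""
    premise_supp = supp(premise, context)
    if premise_supp == 0:
        return 0.0
    return supp(set(premise) | set(conclusion), context) / premise_supp


def is_valid(premise: Iterable[str], conclusion: Iterable[str],
             context: Dict[str, Set[str]], minsupp: int, minconf: float) -> bool:
    """A rule is kept if it reaches both the minsupp and minconf thresholds."""
    rule_supp = supp(set(premise) | set(conclusion), context)  # Eq. 2
    return rule_supp >= minsupp and conf(premise, conclusion, context) >= minconf
```

On this toy context, the rule {query} ⇒ {expansion} is based on a termset found in two documents, while its premise occurs in three, so its confidence is 2/3.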
4 LEARNING QUERY
EXPANSION FROM
ASSOCIATION RULES
4.1 Model and Problem Statement
In our approach, the query expansion problem is defined as follows: given an original query $OQ := \{t_1, t_2, \ldots, t_n\}$ and the set $AR_Q$ of its related association rules, select the association rules whose conclusion terms are best suited to expand $OQ$ and to return documents that meet the user need.
We model this problem as a supervised classification problem in which we attempt to predict whether or not to use a given association rule to expand the query at hand. Each observation $x$ lies in an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and is associated with a class $y \in \mathcal{Y}$, where $|\mathcal{Y}| = 2$. In the context of query expansion using association rules, $x_i \in \mathcal{X}$ denotes the vector representation of a query/association rule pair and $y_i \in \mathcal{Y}$ represents the class associated with $x_i$. The observation belongs to the positive class (i.e., $y_i = 1$) if, in the query/association rule pair represented by $x_i$, the association rule is used to expand the query; it belongs to the negative class ($y_i = 0$) otherwise.
4.2 Input Space Representation
Observations are vectors in the input space $\mathcal{X} \subseteq \mathbb{R}^d$ ($d = 14$). The values of these vectors are computed from statistical distribution measures of terms in the query text and in the association rule conclusion. These features fall into three categories, explained in the following paragraphs.
4.2.1 Document Frequency based Features
The document frequency (DF) is a statistical predictor that measures whether a term is rare or common in the corpus. Its value for a query is the average of the DF of all query terms. The $DF(Q)$ of a query $Q$ is computed as in Equation (4).
$$DF(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|\{d_j : t_i \in d_j\}|}{|D|} \quad (4)$$
We also compute the inverse document frequency (iDF) for a query as follows:
$$iDF(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \log \frac{|D|}{|\{d_j : t_i \in d_j\}|} \quad (5)$$
Both DF and iDF are also computed for an association rule, where they represent the average of the DF or iDF of all the terms in the conclusion of the association rule.
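Equations (4) and (5) can be sketched as below. This is a minimal sketch under stated assumptions: documents are modelled as sets of terms, the logarithm is taken as the natural logarithm, and a term that occurs in no document contributes 0 to the iDF average; none of these choices is specified in the text. The same two functions apply to a query or to the conclusion terms of an association rule.

```python
import math
from typing import Sequence, Set


def doc_count(term: str, docs: Sequence[Set[str]]) -> int:
    """Number of documents d_j such that the term occurs in d_j."""
    return sum(1 for doc in docs if term in doc)


def df_feature(terms: Sequence[str], docs: Sequence[Set[str]]) -> float:
    """Eq. 4: average relative document frequency of the given terms."""
    if not terms or not docs:
        return 0.0
    return sum(doc_count(t, docs) / len(docs) for t in terms) / len(terms)


def idf_feature(terms: Sequence[str], docs: Sequence[Set[str]]) -> float:
    """Eq. 5: average inverse document frequency of the given terms."""
    if not terms or not docs:
        return 0.0
    total = 0.0
    for t in terms:
        n = doc_count(t, docs)
        if n > 0:  # assumption: terms absent from the collection contribute 0
            total += math.log(len(docs) / n)
    return total / len(terms)
```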
4.2.2 Term Frequency based Features
We include features dealing with term frequency in
the documents of the collection. The term frequency
(TF(t, d)) is simply the number of occurrences of
term t in document d. For each term of the query
or of the association rule conclusion we use an aver-
age of its frequencies in all documents and we note it
Learning Query Expansion from Association Rules Between Terms
527
ATF(t
i
) for term t
i
. For a query, the term frequency
is considered to be the average of the ATF of al the
terms in the query and is calculated as in 6.
TF(Q) =
1
|Q|
|Q|
i=1
ATF(t
i
) (6)
ATF(t
i
) =
1
|D|
|D|
j=1
TF(t, d
j
)
When dealing with an average, it is important to know how the values are distributed around it. We introduce a variance measure to evaluate the variation of term frequencies. For a query, we compute the average of these variations over all query terms, as given in Equation (7).
$$V(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{|D| - 1} \sum_{j=1}^{|D|} \left( TF(t_i, d_j) - ATF(t_i) \right)^2 \quad (7)$$
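The term frequency features of Equations (6) and (7) can be sketched as follows. The representation of a document as a list of tokens and the function names are assumptions made for this illustration; the variance uses the $|D| - 1$ denominator of Equation (7) and therefore needs at least two documents.

```python
from typing import List, Sequence


def tf(term: str, doc: List[str]) -> int:
    """TF(t, d): number of occurrences of the term in a tokenised document."""
    return doc.count(term)


def atf(term: str, docs: Sequence[List[str]]) -> float:
    """ATF(t): average of TF(t, d_j) over all documents of the collection."""
    return sum(tf(term, d) for d in docs) / len(docs)


def tf_variance(term: str, docs: Sequence[List[str]]) -> float:
    """Inner part of Eq. 7: sample variance of TF(t, d_j) around ATF(t)."""
    mean = atf(term, docs)
    return sum((tf(term, d) - mean) ** 2 for d in docs) / (len(docs) - 1)


def tf_feature(terms: Sequence[str], docs: Sequence[List[str]]) -> float:
    """Eq. 6: average ATF over the terms of a query or rule conclusion."""
    return sum(atf(t, docs) for t in terms) / len(terms)


def var_feature(terms: Sequence[str], docs: Sequence[List[str]]) -> float:
    """Eq. 7: average term frequency variance over the given terms."""
    return sum(tf_variance(t, docs) for t in terms) / len(terms)
```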
4.2.3 Features Based on Association Rule
Properties
In addition to the features based on term distribution measures, we use 6 features to characterise an association rule:
• the number of terms in the association rule premise,
• the number of terms in the association rule conclusion,
• the association rule confidence, as given in (3),
• the association rule support, as given in (2),
• the ratio of the number of terms in the association rule premise to the number of terms in the query. This ratio measures how well the rule matches the query.
• the proportion of terms in the association rule conclusion that are present in the other association rules. This proportion measures the importance of the conclusion terms.
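The six rule-level features listed above can be assembled as in the following sketch. The `Rule` data class and the function name are hypothetical. The last feature is ambiguous in the text; the sketch takes one possible reading, in which a conclusion term counts as present in another rule when it appears in that rule's premise or conclusion.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Set


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for an association rule T1 => T2."""
    premise: FrozenSet[str]
    conclusion: FrozenSet[str]
    support: int
    confidence: float


def rule_features(rule: Rule, query: Set[str],
                  other_rules: Sequence[Rule]) -> List[float]:
    """Return the six rule-property features in the order of the list above."""
    # Terms appearing anywhere in the other rules (one reading of the text).
    seen_elsewhere: Set[str] = set()
    for other in other_rules:
        seen_elsewhere |= other.premise | other.conclusion
    shared = sum(1 for t in rule.conclusion if t in seen_elsewhere)
    return [
        float(len(rule.premise)),
        float(len(rule.conclusion)),
        rule.confidence,
        float(rule.support),
        len(rule.premise) / len(query) if query else 0.0,
        shared / len(rule.conclusion) if rule.conclusion else 0.0,
    ]
```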
4.3 Generating Training Instances
We represent training instances as (query, association rule) couples. For each query, we generate as many instances as there are association rules with at least one premise term in the query. An instance belongs to the positive class if the corresponding association rule is used to build the optimal expanded query; it belongs to the negative class otherwise. Query expansion is optimised using an exploring process based on genetic algorithms. The original query and the set of association rules that are candidates for expansion are the input to the optimising process. The process returns an optimal expanded query and generates the training instances corresponding to this optimal expansion. The principles of this exploring process are described in the following paragraphs.
4.3.1 Chromosome Representation
A chromosome is a vector of terms representing a candidate expanded query formed by the initial query terms and the conclusion terms of a candidate association rule. An association rule is considered a candidate if its premise is composed of a subset of the initial query terms. The length of a chromosome is the number of initial query terms plus the total number of terms that are candidates for expansion.
Given an original query $OQ = \{t_1, \ldots, t_n\}$, the set of terms that are candidates for expansion, obtained from the association rules of the $\mathcal{M}GB$ basis (Latiri et al., 2012) and denoted $TC$, is built as follows:
for each non-redundant rule $R: T_1 \Rightarrow T_2 \in \mathcal{M}GB$, if $T_1 \subseteq OQ$ then $TC = TC \cup T_2$. (8)
Equation (8) means that if the premise of an association rule $R$ is contained in $OQ$, the conclusion terms of $R$ are candidates for expansion.
We adopt a binary coding of chromosomes which indicates whether or not a given term appears in the expanded query.
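Equation (8) and the binary coding can be illustrated by the short sketch below. Rules are represented as (premise, conclusion) pairs of sets, and the helper names are hypothetical. Two choices are assumptions of this example rather than statements of the paper: candidate terms already present in the query are left out of $TC$ so that each chromosome position denotes a distinct term, and $TC$ is sorted to fix the position of each gene.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

RulePair = Tuple[FrozenSet[str], FrozenSet[str]]  # (premise T1, conclusion T2)


def candidate_terms(query: Sequence[str], rules: Sequence[RulePair]) -> List[str]:
    """Eq. 8: collect the conclusions of rules whose premise is contained in OQ."""
    oq = set(query)
    tc: Set[str] = set()
    for premise, conclusion in rules:
        if premise <= oq:
            tc |= conclusion
    return sorted(tc - oq)  # assumption: drop terms that are already in the query


def encode(expanded: Set[str], positions: Sequence[str]) -> List[int]:
    """Binary chromosome: 1 if the term at that position is in the expanded query."""
    return [1 if term in expanded else 0 for term in positions]


def decode(chromosome: Sequence[int], positions: Sequence[str]) -> List[str]:
    """Recover the expanded query terms from a binary chromosome."""
    return [term for bit, term in zip(chromosome, positions) if bit]
```

The chromosome positions are the initial query terms followed by the candidate terms, for example `positions = list(query) + candidate_terms(query, rules)`, which matches the chromosome length stated above.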
4.3.2 Initial Population
To build the individuals of the initial population, we first filter the generic basis of association rules $\mathcal{M}GB$, keeping only the rules whose premise terms are terms of the initial query. The size of the initial population is equal to the number of these rules.
4.3.3 Fitness Function
Query expansion aims at improving search results. The fitness of an individual measures the ability of the candidate expanded query to return documents meeting the user's need. We use the mean average precision (MAP) returned by the Terrier IR system to measure the fitness of a chromosome.
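For reference, the measure used as fitness can be sketched as follows. This is the standard uninterpolated definition of average precision, written as a self-contained illustration; it does not reproduce Terrier's own evaluation code, and the function names are assumptions.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """MAP: mean of the average precision over all evaluated queries."""
    if not runs:
        return 0.0
    return sum(average_precision(r, qrels.get(q, set()))
               for q, r in runs.items()) / len(runs)
```

With a ranking `["d3", "d1", "d7", "d2"]` and relevant documents `{"d1", "d2", "d9"}`, relevant documents are found at ranks 2 and 4, giving an average precision of (1/2 + 2/4) / 3 = 1/3.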
4.3.4 Genetic Operators
Selection. This operator allows individuals in the current generation to survive, to reproduce or to die. In general, the probability of survival of an individual depends on its fitness. In this implementation, we opt for an elite selection, selecting the best n individuals needed to build the new generation.
Crossover. The crossover operator allows recombination of the information in the genetic heritage, thereby favouring exploration of the search space. In our case, crossover produces a new expanded query. We opt for a crossover operator which produces one child from two parents. The child inherits the expansion terms of both parents.
Replacement. At each iteration, new individuals replace their parents.
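The three operators can be combined as in the sketch below, which is one possible reading of the description rather than the implementation used in the experiments. Several points are assumptions: chromosomes are tuples of bits, the crossover is a bitwise OR (so that the child carries the expansion terms of both parents), every pair of elite individuals is mated, the loop stops after a fixed number of generations, and the best individual seen so far is returned. The `fitness` argument stands for the MAP obtained by running the expanded query; here it is any callable.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Chromosome = Tuple[int, ...]


def crossover(p1: Chromosome, p2: Chromosome) -> Chromosome:
    """One child from two parents: the union of the terms of both parents."""
    return tuple(a | b for a, b in zip(p1, p2))


def evolve(population: Sequence[Chromosome],
           fitness: Callable[[Chromosome], float],
           n_elite: int = 2,
           generations: int = 5) -> Chromosome:
    """Elite selection, union crossover and replacement of parents by children."""
    current: List[Chromosome] = list(population)
    best = max(current, key=fitness)
    for _ in range(generations):
        # Selection: keep the n_elite fittest distinct individuals.
        distinct = list(dict.fromkeys(current))
        elite = sorted(distinct, key=fitness, reverse=True)[:n_elite]
        if len(elite) < 2:
            break  # nothing left to recombine
        # Crossover and replacement: children replace their parents.
        current = [crossover(a, b) for a, b in combinations(elite, 2)]
        champion = max(current, key=fitness)
        if fitness(champion) > fitness(best):
            best = champion
    return best
```

Because the union crossover only ever adds terms, keeping track of the best individual across generations avoids returning an over-expanded query from the final generation.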
5 EXPERIMENTAL EVALUATION
5.1 Evaluation Framework
Experiments were conducted under TERRIER using the collection of French texts of the CLEF 2003 corpus. The French SDA 95 collection, referred to in the remainder of this paper as SDA-95, includes 42615 documents and 60 queries, each provided with a set of relevant documents. We configure Terrier so that the search includes terms from both the description and narrative fields.
5.2 Evaluation Steps
The evaluation of the automatic query expansion process proposed in this paper proceeds in eight steps.
1. Evaluate search with the original queries to determine a baseline for comparison.
2. Evaluate search with queries expanded by pseudo relevance feedback (PRF).
3. Generate the minimal basis of association rules using the Charm tool.
4. Build the training set: this task is performed automatically.
5. Construct learning models using a decision tree (DT) under Weka.
6. Apply the classifier to the test set.
7. Expand the queries in the test set according to the classification results obtained in step 6.
8. Carry out the search using the OKAPI BM25 weighting scheme (Jones et al., 2000).
The relevance of the returned documents is estimated according to the following measures:
• The MAP (Mean Average Precision), which measures the overall performance of the search engine.
• The precision at 5, 10, 15 and 30 returned documents (P@5, P@10, P@15 and P@30).
• The precision at the 11 standard recall points (P@11).
5.3 Preliminary Results and Discussion
Table 1: Experimental Results for SDA-95 Collection.
            MAP      P@5     P@10    P@20
Scores
AG          0.422    0.469   0.360   0.243
DT          0.345    0.376   0.282   0.203
PRF         0.356    0.385   0.300   0.217
Baseline    0.339    0.369   0.283   0.205
Improvement over baseline
AG          24.48%   27%     27%     19%
DT          1.77%    1%      0%      0%
PRF         5.01%    4%      6%      6%
Table 1 summarises the results obtained for the SDA-95 collection. Values in the AG line are obtained with expanded queries optimised through the genetic algorithm (GA) based exploring algorithm. These results can be considered as upper bounds for queries expanded using association rules. Values in the DT line are drawn from experiments on the same queries, expanded using the association rules predicted by the Weka J48 classifier, which implements a decision tree. The PRF line reports results obtained for queries expanded using the pseudo relevance feedback method implemented in Terrier with default settings (T=10, D=3). Improvements relative to the baseline are reported in the second part of Table 1. These results show that expanding queries through the learning process does better than the baseline; however, it does not reach PRF performance.
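The improvement figures in the second part of Table 1 are consistent with a relative gain computed as (score − baseline) / baseline. The short check below recomputes the MAP column from the reported scores; the function and variable names are introduced for this example only.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative gain over the baseline, expressed as a percentage."""
    return 100.0 * (score - baseline) / baseline


# MAP values reported in Table 1 for the SDA-95 collection.
BASELINE_MAP = 0.339
MAP_SCORES = {"AG": 0.422, "DT": 0.345, "PRF": 0.356}

improvements = {
    name: round(relative_improvement(score, BASELINE_MAP), 2)
    for name, score in MAP_SCORES.items()
}
print(improvements)  # {'AG': 24.48, 'DT': 1.77, 'PRF': 5.01}
```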
[Figure: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1) for SDA 95 under Okapi BM25, with four curves: Baseline, Expansion-Dec-Tree, Expansion-PRF and Expansion-AG-AR.]
Figure 1: Recall/Precision curves of the expansion approaches for SDA95 collection.
Figure 1 shows the precision at the 11 recall points (P@11) for the different query expansion strategies.
6 CONCLUSION
In this article, we have presented a query expansion approach based on learning how to expand queries using association rules between terms. The problem of expansion is modeled as a supervised classification problem. An exploratory process based on genetic algorithms is used both to explore the association rule space in search of the best terms for expansion and to generate training instances that are later used to build a classifier. The learning problem is solved with a decision tree. Experiments conducted on the French text collection SDA-95 show that learning how to expand queries using association rules is a promising approach. These preliminary results are encouraging, and we plan to extend our representation of the input space by including new features in order to improve the learning process.
REFERENCES
Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, VLDB 1994, pages 478–499, Santiago, Chile.
Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., and Lakhal, L. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the 1st International Conference on Computational Logic, volume 1861 of LNAI, pages 972–986, London, UK. Springer-Verlag.
Boughanem, M. and Tamine, L. (2000). Query optimization using an improved genetic algorithm. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 6-11, 2000, pages 368–373.
Buckley, C., Salton, G., Allan, J., and Singhal, A. (1994). Automatic Query Expansion Using SMART: TREC-3. In Proceedings of the 3rd Text REtrieval Conference.
Carpineto, C. and Romano, G. (2012). A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR), 44(1):1.
Chifu, A. and Mothe, J. (2014). Expansion sélective de requêtes par apprentissage. In Moens, M., Viard-Gaudin, C., Zargayouna, H., and Terrades, O. R., editors, CORIA 2014 - Conférence en Recherche d'Informations et Applications - 11th French Information Retrieval Conference. CIFED 2014 Colloque International Francophone sur l'Ecrit et le Document, Nancy, France, March 19-23, 2014, pages 257–272. ARIA-GRCE.
Fonseca, B. M., Golgher, P. B., Pôssas, B., Ribeiro-Neto, B. A., and Ziviani, N. (2005). Concept-based interactive query expansion. In Proceedings of the 14th International Conference on Information and Knowledge Management, CIKM 2005, pages 696–703, Bremen, Germany. ACM Press.
Ganter, B. and Wille, R. (1999). Formal Concept Analysis. Springer-Verlag.
Haddad, H., Chevallet, J. P., and Bruandet, M. F. (2000). Relations between terms discovered by association rules. In Proceedings of the Workshop on Machine Learning and Textual Information Access in conjunction with the 4th European Conference on Principles and Practices of Knowledge Discovery in Databases, PKDD 2000, Lyon, France.
Jones, K. S., Walker, S., and Robertson, S. E. (2000). A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management, 36(6):779–840.
Latiri, C., Haddad, H., and Hamrouni, T. (2012). Towards an effective automatic query expansion process using an association rule mining approach. Journal of Intelligent Information Systems, 39(1):209–247.
Lin, H. C., Wang, L. H., and Chen, S. M. (2008). Query expansion for document retrieval by mining additional query terms. Information and Management Sciences, 19(1):17–30.
Mitra, M., Singhal, A., and Buckley, C. (1998). Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'98, pages 206–214, Melbourne, Australia. ACM Press.
Pasquier, N., Bastide, Y., Taouil, R., Stumme, G., and Lakhal, L. (2005). Generating a condensed representation for association rules. Journal of Intelligent Information Systems, 24(1):25–60.
Pragati Bhatnagar, N. P. (2015). Genetic algorithm-based query expansion for improved information retrieval. In Intelligent Computing, Communication and Devices, Advances in Intelligent Systems and Computing, volume 308, pages 47–55.
Rungsawang, A., Tangpong, A., Laohawee, P., and Khampachua, T. (1999). Novel query expansion technique using apriori algorithm. In Proceedings of the 8th Text REtrieval Conference, TREC 8, pages 453–456, Gaithersburg, Maryland.
Ruthven, I. and Lalmas, M. (2003). A survey on the use of relevance feedback for information access systems. Knowledge Engineering Review, 18(2):95–145.
Tangpong, A. and Rungsawang, A. (2000). Applying association rules discovery in query expansion process. In Proceedings of the 4th World Multi-Conference on Systemics, Cybernetics and Informatics, SCI 2000, Orlando, Florida, USA.
Xu, J. and Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1996, pages 4–11, Zurich, Switzerland. ACM Press.