
 
the concept of “luxury bedding” depends on the 
brands and designs available on the market that are 
considered luxury, and on their attributes. Bridging 
the semantic gap is therefore, in essence, the problem 
of inferring the meaning of search phrases in all their 
nuances. 
Our Approach: In this paper we present an 
algorithm that (i) structures item information and (ii) 
uses a frequent itemset mining algorithm to learn the 
“target phrase” definitions. 
2 RELATED WORKS  
In (Aholen, 1998), generalized episodes and episode 
rules are used for descriptive phrase extraction. 
Episode rules are a modification of association 
rules, and an episode is a modification of a frequent set. 
An episode is a collection of feature vectors with a 
partial order; the authors claim that their approach is 
useful for phrase mining in Finnish, a language with 
a relaxed word order within sentences. In our 
previous work (Nguyen, 2003), we presented a 
co-occurrence clustering algorithm that identifies 
phrases that frequently co-occur with the target 
phrase in the meta-tags of Web documents. 
In this paper, however, we address a different 
problem: we attempt to mine phrase definitions 
in terms of extracted item information, so that the 
mined definitions can be used to connect “search 
phrases” to real items in all their nuances. 
     The frequent itemset mining problem is to 
discover sets of items shared among a large number 
of records in a database. There are two main 
search strategies for finding frequent itemsets. 
Apriori (Agrawal, 1994) and several other 
Apriori-like algorithms adopt a breadth-first search model, 
while Eclat (Zaki, 2000) and FPGrowth (Han, 2000) 
are well-known algorithms that search all frequent 
itemsets of a database in a depth-first manner. 
Our algorithm also searches for frequent itemsets in 
a depth-first manner. However, unlike the lattice 
structure used in Eclat or the conditional frequent 
pattern tree used in FPGrowth, we propose a 
so-called 2-frequent itemset graph and use heuristic 
syntheses to prune the search space and thereby 
improve performance. We plan to further 
optimize our algorithm and conduct detailed 
comparisons with the above algorithms. 
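As an illustration of the depth-first strategy only, the following minimal sketch mines frequent itemsets in the style of Eclat, using a vertical layout in which each item is mapped to the set of record ids that contain it. It is not the 2-frequent itemset graph algorithm proposed in this paper. The function name eclat, the sample records with their attribute=value strings, and the support threshold are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Candidates = List[Tuple[str, Set[int]]]


def eclat(records: List[Set[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least min_support records."""
    # Vertical layout: item -> ids of the records that contain it.
    tidsets: Dict[str, Set[int]] = {}
    for tid, record in enumerate(records):
        for item in record:
            tidsets.setdefault(item, set()).add(tid)

    frequent: Dict[FrozenSet[str], int] = {}

    def extend(prefix: FrozenSet[str], candidates: Candidates) -> None:
        # Every candidate is an item whose union with prefix is frequent.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Intersect tidsets with the remaining candidates, keeping
            # only extensions that still meet the support threshold.
            deeper: Candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    deeper.append((other, shared))
            if deeper:
                # Depth-first: explore this branch before its siblings.
                extend(itemset, deeper)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


if __name__ == "__main__":
    # Hypothetical records: each item is an attribute=value string.
    sample = [
        {"material=silk", "thread_count=high", "size=queen"},
        {"material=silk", "thread_count=high", "size=king"},
        {"material=cotton", "thread_count=high", "size=queen"},
        {"material=silk", "thread_count=low", "size=queen"},
    ]
    for itemset, support in eclat(sample, min_support=2).items():
        print(sorted(itemset), support)
```

Because the support of an itemset can only shrink as items are added, a branch is abandoned as soon as an intersection falls below the threshold, which is the basic pruning that depth-first miners rely on.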
     The relevance feedback method (Salton, 1990) 
can also be used to refine the original keyword 
phrase by using the document vectors (Baeza-Yates, 
1999) of the extracted relevant items as additional 
information. In Section 6, we present experimental 
results showing that the rules our system 
learns from the extracted relevant item 
information are easier to validate and perform better 
than retrieval with the relevance feedback method. 
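To make the baseline concrete, the sketch below refines a query vector with a Rocchio-style update, one common formulation of relevance feedback, using positive feedback only. The exact variant and weights used in the experiments are not stated here, so the term-frequency vectors, the function names term_vector and refine_query, and the weights alpha and beta are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def term_vector(text: str) -> Vector:
    """Build a raw term-frequency vector from whitespace-separated text."""
    return {term: float(count) for term, count in Counter(text.lower().split()).items()}


def refine_query(query: Vector, relevant: List[Vector],
                 alpha: float = 1.0, beta: float = 0.75) -> Vector:
    """Return alpha * query + beta * centroid of the relevant item vectors."""
    refined = {term: alpha * weight for term, weight in query.items()}
    for doc in relevant:
        for term, weight in doc.items():
            # Dividing by the number of relevant items averages their vectors.
            refined[term] = refined.get(term, 0.0) + beta * weight / len(relevant)
    return refined


if __name__ == "__main__":
    # Hypothetical search phrase and relevant item descriptions.
    query = term_vector("luxury bedding")
    relevant_items = [term_vector("silk sheet set"), term_vector("silk duvet cover")]
    print(refine_query(query, relevant_items))
```

The refined vector mixes the original phrase with terms from the relevant items, so it has no explicit structure to inspect, in contrast to a learned rule over attribute-value pairs.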
3 SYSTEM DESCRIPTION 
I. Item Name Structuring: This component takes a 
product catalogue and extracts structured 
information for mining the phrase-based and 
parametric definitions. Details are discussed in 
Section 4. 
II. Mining Search Phrase Definitions: In this 
phase, we divide the phrase definition mining 
problem into two subproblems: (i) mining the 
parametric definitions from the extracted attribute-value 
pairs of items, and (ii) mining phrase-based 
definitions from the long item descriptions. Details 
are discussed in Section 5. 
4 DATA LABELING 
For the sake of concrete examples, this section 
presents the techniques in an e-commerce domain; 
they can be customized for other 
domains. The major tasks in this phase are 
structuring and labeling the extracted data. 
Readers are referred to (Davulcu, 2003) for 
further details. 
4.1 Labeling and Structuring Extracted Data 
This section describes a technique for partitioning 
short product item names into their various 
attributes. We achieve this by grouping and aligning 
the tokens in the item names so that instances 
of the same attribute from multiple products fall 
under the same category, indicating that they are of 
similar type.  
The motivation for the partition is to 
organize the data. By discovering attributes in product 
data and arranging their values in a table, one can 
build a search engine that supports faster and 
more precise product searches.  
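The benefit of a tabular arrangement can be sketched as follows: once each item name has been partitioned into labeled attributes, a parametric query reduces to an exact match on columns. The rows, the attribute names and the function parametric_search are hypothetical and stand in for the output of the structuring step, which is not implemented here.

```python
from typing import Dict, List

Row = Dict[str, str]


def parametric_search(table: List[Row], criteria: Dict[str, str]) -> List[Row]:
    """Return the rows whose attributes match every requested value."""
    return [
        row for row in table
        if all(row.get(attribute) == value for attribute, value in criteria.items())
    ]


if __name__ == "__main__":
    # Hypothetical table: one row per item, one column per discovered attribute.
    table = [
        {"name": "queen silk sheet set", "size": "queen",
         "material": "silk", "product": "sheet set"},
        {"name": "king cotton duvet cover", "size": "king",
         "material": "cotton", "product": "duvet cover"},
        {"name": "king silk pillowcase", "size": "king",
         "material": "silk", "product": "pillowcase"},
    ]
    for row in parametric_search(table, {"material": "silk", "size": "king"}):
        print(row["name"])
```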
4.2 The Algorithm 
Before proceeding to the algorithm, it helps to 
view an item name as a sequence of tokens obtained 
by using white space as a delimiter. Since the 
sequences of tokens obtained from item names are 
BOOSTING ITEM FINDABILITY: BRIDGING THE SEMANTIC GAP BETWEEN SEARCH PHRASES AND ITEM
INFORMATION