
 
 
patents with the IPC code G10L 17 have something 
to do with “speech recognition”, the description of 
the IPC code. If a query patent belongs to the cluster
of patents sharing the same IPC code, a conflicting
patent is highly likely to be found in that cluster.
IPC codes make cluster-based retrieval possible, since
they can serve as the basis for semantic clustering
(Kang et al., 2007).
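The idea of treating each IPC code as a cluster can be
sketched as follows. This is only an illustration: the
function names, the patent identifiers, and all codes
other than G10L 17 are hypothetical, and a patent that
carries several IPC codes is assumed to join several
clusters.

```python
from collections import defaultdict


def cluster_by_ipc(patents):
    """Group patent ids by IPC code (one cluster per code)."""
    clusters = defaultdict(set)
    for patent_id, ipc_codes in patents.items():
        for code in ipc_codes:
            clusters[code].add(patent_id)
    return clusters


def candidate_set(query_ipc_codes, clusters):
    """Union of the clusters sharing an IPC code with the query patent."""
    candidates = set()
    for code in query_ipc_codes:
        candidates |= clusters.get(code, set())
    return candidates


# Hypothetical toy data: patent id -> assigned IPC codes.
patents = {
    "P1": ["G10L 17"],
    "P2": ["G10L 17", "G06F 17"],
    "P3": ["H04L 9"],
}
clusters = cluster_by_ipc(patents)
```

An invalidity search for a query patent coded G10L 17
would then be restricted to candidate_set(["G10L 17"],
clusters) rather than the whole collection.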
However, IPC-based clusters may present a 
problem when searches are performed within each of 
them. Since the documents in a cluster are similar to
each other, they share many terms, making it
difficult to discriminate among them. Since the
goal of invalidity search is to pinpoint the patent
documents claiming the same technology, retrieving
many grossly similar documents with ordinary index
terms would not be very helpful, especially when the
cluster is large. Because identifying
discriminating features is difficult but critical
for patent invalidity search, we need semantically
annotated terms that help make a fine
distinction between the patents claiming the same
technology or method and those that are only grossly
similar to each other based on all the index terms.
The main thrust of this paper, therefore, is to link 
problem/solution-based semantic annotations, 
clustering, and patent retrieval. We describe a patent 
retrieval model based on semantic clusters. The 
system proposed in this paper consists of two parts: 
semantic annotation for the PROBLEM and
SOLUTION categories, and cluster-based retrieval
using the extracted semantic key phrases. For the
retrieval part, we attempt to distinguish patent
documents in a cluster for the same PROBLEM or
SOLUTION from those in other clusters, assuming
that documents belonging to the same semantic
cluster are more likely to be similar and hence
to conflict with one another.
The rest of this paper is organized as follows. In 
Section 2, we present the related work in patent 
retrieval and cluster-based retrieval. In Section 3, we 
describe the semantic clustering method based on 
the problem and solution annotations and a semantic 
patent retrieval model. We illustrate and interpret the 
experimental results in Section 4 and finally present 
our conclusion in Section 5. 
2 RELATED WORK 
A cluster-based model for Information Retrieval (IR)
takes advantage of document clusters by assuming
that relevant documents are grouped within
the same cluster. In general, documents are
automatically grouped by their topical relatedness
and relevant clusters are chosen with respect to a
given query (Croft, 1980; Voorhees, 1985), so that
the query terms in the relevant cluster are heavily
weighted in the retrieval model. To verify
the superiority of the cluster-based retrieval model,
Liu and Croft (2004) compared it with a cluster-less
model on a large test collection, using the language
modeling approach.
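One way to picture cluster-based retrieval in the
language modeling framework is to smooth each document
model with the model of its cluster before falling back
to the whole collection. The sketch below assumes a
simple linear interpolation; the weights lam and beta,
the function names, and the toy data are assumptions
made for illustration and are not taken from the cited
work.

```python
import math
from collections import Counter


def ml_prob(term, counts, total):
    """Maximum-likelihood estimate of P(term) in a bag of words."""
    return counts[term] / total if total else 0.0


def query_log_likelihood(query, doc, cluster, collection, lam=0.5, beta=0.5):
    """log P(query | doc), interpolating document, cluster and collection models."""
    doc_c, clu_c, col_c = Counter(doc), Counter(cluster), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = ml_prob(term, doc_c, len(doc))
        p_clu = ml_prob(term, clu_c, len(cluster))
        p_col = ml_prob(term, col_c, len(collection))
        p = lam * p_doc + (1 - lam) * (beta * p_clu + (1 - beta) * p_col)
        if p == 0.0:
            # The term is unseen everywhere, so the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy data: two documents, each inside its own cluster.
doc_a = ["speech", "signal", "filter"]
doc_b = ["image", "signal", "filter"]
cluster_a = doc_a + ["speech", "recognition", "model"]
cluster_b = doc_b + ["image", "compression", "codec"]
collection = cluster_a + cluster_b
query = ["speech", "recognition"]

score_a = query_log_likelihood(query, doc_a, cluster_a, collection)
score_b = query_log_likelihood(query, doc_b, cluster_b, collection)
```

In this toy setting doc_a does not contain the term
"recognition", yet it still receives probability mass
for it from its cluster, which is the effect a
cluster-based model is meant to capture.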
Prior to the series of workshops related to patent 
retrieval, Larkey (1999) utilized IPC codes to divide 
an entire corpus of patents into sub-corpora. The
patents in each sub-corpus composed a large virtual
document, and a query patent was mapped to each
virtual document to select the n-best sub-collections.
In this approach, search techniques from distributed IR
(Callan et al., 1995) were applied in order to reduce
the long search time over several sub-collections. The
work is considered an important attempt to exploit a
unique aspect of patent documents.
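The virtual-document idea can be sketched as follows.
Cosine similarity over raw term counts is used here
only as a stand-in for the collection selection
techniques of distributed IR; the function names, the
shortened IPC codes, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_virtual_documents(patents):
    """Merge the terms of all patents sharing an IPC code into one bag of words."""
    virtual = defaultdict(Counter)
    for ipc_code, terms in patents:
        virtual[ipc_code].update(terms)
    return virtual


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def n_best_subcollections(query_terms, virtual, n=2):
    """Rank sub-collections by similarity to the query and keep the top n."""
    query = Counter(query_terms)
    ranked = sorted(virtual, key=lambda code: cosine(query, virtual[code]), reverse=True)
    return ranked[:n]


# Hypothetical toy data: (IPC code, index terms) per patent.
patents = [
    ("G10L", ["speech", "recognition", "acoustic"]),
    ("G10L", ["speech", "coding"]),
    ("H04L", ["network", "packet", "routing"]),
    ("G06T", ["image", "speech", "segmentation"]),
]
virtual = build_virtual_documents(patents)
best = n_best_subcollections(["speech", "recognition"], virtual, n=2)
```

Only the selected sub-collections would then be
searched in full, which is where the saving in search
time comes from.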
Chen et al. (2003a) proposed a patent document
retrieval system that considers semantic and syntactic
properties. They utilized Latent Semantic Indexing to
recognize synonymous expressions. The system first
finds the patent documents whose vectors lie in the
neighbourhood of the query vector. It then uses the
template matching algorithm developed by Chen &
Tokuda (2003b) to calculate the similarity of the
document and the query. Takaki (2004) proposed an
associative document retrieval method that extracts
sub-topics from each query and weights them by a
term frequency-based entropy model. The method was
applied to patent invalidity search by using a query
patent claim.
Many previous studies were presented in the
series of NTCIR workshops (Kando, 2004, 2005,
2007). Among the work related to this paper is that
of Konishi et al. (2004), who used an IPC code
as a category for each patent and combined TF/ICF
(term frequency and inverse category frequency)
with a general TF/IDF scoring formula. Fujii (2007)
integrated content and citation information: as in
the PageRank method, citation information was used
to identify authoritative patents (i.e., foundation
patents cited by a large number of other patents),
and this was combined with the Okapi BM25 model.
His system performed the best among all the
participants in the patent retrieval task of
NTCIR-6 (Fujii et al., 2007b).
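The combination of TF/ICF with TF/IDF can be
illustrated with a small sketch. The exact scoring
formula of the cited work is not reproduced here: the
linear mix controlled by alpha, the logarithmic forms
of IDF and ICF, the function names, and the toy data
are all assumptions made for illustration.

```python
import math


def idf(term, documents):
    """Inverse document frequency: log(N_docs / docs containing the term)."""
    df = sum(1 for doc in documents if term in doc["terms"])
    return math.log(len(documents) / df) if df else 0.0


def icf(term, documents):
    """Inverse category frequency: log(N_categories / categories containing the term)."""
    categories = {doc["ipc"] for doc in documents}
    cf = len({doc["ipc"] for doc in documents if term in doc["terms"]})
    return math.log(len(categories) / cf) if cf else 0.0


def score(query_terms, doc, documents, alpha=0.5):
    """Sum over query terms of TF times a mix of IDF and ICF (assumed form)."""
    total = 0.0
    for term in query_terms:
        tf = doc["terms"].count(term)
        total += tf * (alpha * idf(term, documents) + (1 - alpha) * icf(term, documents))
    return total


# Hypothetical toy data: each patent has one IPC category and a term list.
documents = [
    {"ipc": "G10L", "terms": ["speech", "recognition", "model"]},
    {"ipc": "G10L", "terms": ["speech", "coding", "model"]},
    {"ipc": "H04L", "terms": ["network", "model"]},
]
query = ["speech", "recognition"]
scores = [score(query, doc, documents) for doc in documents]
```

In this toy collection "speech" occurs in two of three
documents but in only one of two categories, so its ICF
is higher than its IDF, while "model", which occurs in
every category, contributes nothing under either
weight.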
The work by Kang et al. (2007) seems to be the most
relevant to our research. They proposed a
cluster-based retrieval model utilizing IPC classes.
Since the same IPC class would be assigned to somewhat
relevant patents, this approach is quite effective to
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval