
 
which applies Dice’s coefficient to Rand’s method 
in order to overcome some defects of Rand’s method.   
Of the three types of web searches (Broder, 2002), 
we chose queries from the informational class, 
which is considered the most appropriate one 
for blog searches (Gilad and Maarten, 2006). We 
chose one query each from the movie, music, and 
book categories. First, ‘X-Japan,’ the name of one 
of the most famous rock bands in Japan, is the query 
from music. Second, we chose ‘Cha T.H.,’ one of 
the most famous actors in Korea, as the query 
from movies. Lastly, ‘Ekuni Gaori,’ one of 
the most famous Japanese novelists, is the query 
from books. 
We tested 50 sets of blog data from the Naver 
search engine, without performing any editing. 
Although our prototype system could run its 
algorithm on all blog pages on the web, we 
limited this test to 50 pages. We will test 
more pages later to obtain a more reliable 
estimate. Before our prototype system performed 
its task, we built an ideal clustered set by hand. 
After the prototype system produced its result set, 
we compared it with the ideal set using CSIM. 
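To illustrate this comparison step, the following is a minimal sketch of one way Dice's coefficient can be applied to Rand-style pair counting. It is an assumption about the general shape of such a measure, not the exact CSIM definition used in our evaluation. The function names `co_clustered_pairs` and `pairwise_dice` and the sample labels are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of index pairs (i, j), i < j, whose items share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_dice(system_labels, ideal_labels):
    """Dice's coefficient over co-clustered pairs (assumed form, not the
    paper's exact CSIM): 2 * |S & I| / (|S| + |I|), where S and I are the
    pairs grouped together by the system and by the hand-made ideal set."""
    if len(system_labels) != len(ideal_labels):
        raise ValueError("both clusterings must label the same items")
    system_pairs = co_clustered_pairs(system_labels)
    ideal_pairs = co_clustered_pairs(ideal_labels)
    total = len(system_pairs) + len(ideal_pairs)
    if total == 0:
        # Neither clustering groups any two items, so they agree trivially.
        return 1.0
    return 2 * len(system_pairs & ideal_pairs) / total


# Hypothetical example: five blog pages, labels are cluster ids.
system_result = [0, 0, 1, 1, 2]
ideal_set = [0, 0, 0, 1, 1]
print(pairwise_dice(system_result, ideal_set))
```

Because the score counts pairs of pages rather than cluster ids, it does not depend on how the clusters are numbered, and a value of 1 means that the two clusterings group exactly the same pairs.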
Table 1: The evaluation result. 
Query word    X-Japan   Cha T.H.   Ekuni Gaori 
CSIM value    0.857     0.711      0.805 
Since the CSIM values are fairly close to 1 (0.711 
to 0.857), we conclude that the prototype system 
clusters blog information reasonably well. Although 
the test was performed on only 50 sets of blog data, 
the system grouped together the data that should be 
grouped, so we expect the results to become more 
exact as the example set grows. We expect that this 
system will be able to offer useful and distinctive 
information to users and companies that want to 
know the public’s response to their products or image. 
5 CONCLUSIONS 
In this paper, we discussed a blog search algorithm 
that considers the characteristics of blog content, 
based on the assumption that the resulting blog 
classification can provide more valuable information 
to users. We also built a simple prototype to 
evaluate our algorithm. To design this system, 
we identified the features of blogs and the problems 
of general search engines, and then looked for a 
solution that could address those problems to some 
extent. We chose K-means as the classification 
method and developed our own algorithm to adapt 
K-means to blog information. As shown in section 4, 
our algorithm and system benefit users by presenting 
results in clustered groups. They may not satisfy all 
users, but they can give users additional useful data 
and suggest a new approach in the blog search 
engine field. 
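For reference, the standard K-means procedure that serves as the starting point can be sketched as follows. This is plain K-means only, not our adapted algorithm: the weight extraction, the choice of K, and the critical point are not shown. The function names, the deterministic initialization from the first k vectors, and the toy term-weight vectors are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iter=100):
    """Plain K-means. Returns (assignment, centroids), where assignment[i]
    is the index of the centroid closest to vectors[i]."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Assumption: seed centroids with the first k vectors so that the
    # sketch is deterministic; real systems often seed randomly.
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return assignment, centroids


# Hypothetical term-weight vectors for four blog pages.
pages = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centers = k_means(pages, k=2)
print(labels, centers)
```

The sketch makes visible the points where a blog-specific variant has to make decisions: how the input vectors are weighted, how k is chosen, and when iteration stops.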
Several issues remain for future research. As 
described in section 2.2, there were three important 
issues in building an algorithm with K-means, and 
we do not think that the solution suggested in this 
paper is the only possible one. We will therefore 
look for a better solution, one that extracts better 
weights from the blog and chooses a better K and 
critical point. In addition, we will study other 
classification methods that may suit blog searches 
more closely. Finally, a variety of search algorithms 
and methods are now used in search engines. Since 
our final goal is to present the best blog search 
algorithm, we will also study other search 
mechanisms, including classification. 
ACKNOWLEDGEMENTS 
This research was financially supported by the 
Ministry of Knowledge Economy (MKE) and the Korea 
Industrial Technology Foundation (KOTEF) through the 
Human Resource Training Project for Strategic Technology. 
REFERENCES 
Aixin, S., Maggy, S., Ying, L. 2007. Blog Classification 
Using Tags: An Empirical Study. In ICADL 2007. 
Bloglines: http://www.bloglines.com/. 
Blogpulse: http://www.blogpulse.com/. 
BLOGRANGER: http://ranger.labs.goo.ne.jp/. 
BlogWatcher: http://blogwatcher.pi.titech.ac.jp/. 
Broder, A. 2002. A Taxonomy of Web Search. In SIGIR 
Forum. 
Chung, Y.M., Lee, J.Y. 2001. A corpus-based approach to 
comparative evaluation of statistical term association 
measures. In J. of the American Society for 
Information Science and Technology. 
Fujiki, T., Nanno, T., Suzuki, Y., Okumura, M. 2004. 
Identification of Bursts in a Document Stream. In First 
International Workshop on Knowledge Discovery 
2004. 
Fujimura, K., Toda, H., Inoue, T., Hiroshima, N., Kataoka, 
R., Sugizaki, M. 2006. BLOGRANGER – A multi-faceted 
Blog Search Engine. In WWW 2006. 
Gilad, M., Maarten, R. 2006. A Study of Blog Search. In 
ECIR 2006. LNCS 3936. 
Google: http://www.google.com/. 
Kumar, R., Novak, J., Raghavan, P., Tomkins, A. 2003. 
On the bursty evolution of blogspace. In WWW’03: 
ICEIS 2009 - International Conference on Enterprise Information Systems