
 
hazardous because it requires no guidance to mark 
these cases as failures. In the latter cases, however, 
the extracted data must be rejected manually. 
Using automatic extraction with existing domain 
knowledge, 85% of the extracted product attributes 
were correct and 10% were bogus. On average, 23 of 
27 available product attributes were correctly 
extracted and one false positive was mined. 
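The per-product averages above can be read as precision and recall. The following minimal sketch assumes that the averages (23 correct, 1 false positive, 27 available) may be treated as counts for one representative product, which the evaluation does not state explicitly. The function name `precision_recall` is hypothetical. The resulting values are per-product figures and need not coincide with the aggregate percentages quoted above.

```python
def precision_recall(true_positives, false_positives, available):
    """Return (precision, recall) for one extraction run."""
    extracted = true_positives + false_positives
    # Precision: share of extracted attributes that are correct.
    precision = true_positives / extracted if extracted else 0.0
    # Recall: share of available attributes that were correctly extracted.
    recall = true_positives / available if available else 0.0
    return precision, recall


# Assumption: the reported averages are used as counts for a single product.
precision, recall = precision_recall(true_positives=23, false_positives=1, available=27)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints "precision=0.958 recall=0.852"
```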
Overall, the evaluation indicates that the 
information extraction component is feasible. 
Assuming that the algorithms are included in an 
information platform used by consumers, users are 
expected to provide extraction hints to the system in 
a wiki-like form. After some running time and 
intensive collection of domain knowledge, the 
extraction success should increase further, so that 
information extraction by crawling would be 
required in only very few cases. 
6 CONCLUSIONS 
In this paper, we presented algorithms for locating 
and extracting product information from websites, 
given only a product name and its producer’s name. 
While the retrieval algorithm was developed from 
scratch, the extraction algorithm extends the 
previous work presented in Section 2, in particular 
by leveraging the special characteristics of product 
detail pages. The evaluation showed the feasibility 
of both approaches. Both the retrieval and the 
extraction components generated better results when 
supplied with domain knowledge used for 
bootstrapping. Future research will therefore focus 
on improving the system’s learning component so 
that it automatically creates extensive domain 
knowledge at runtime. 
Currently, additional algorithms are being 
developed for mapping the extracted specification 
keys to a central terminology and converting the 
corresponding values to standard formats. This 
would enable product comparisons at runtime. 
Evaluations will examine the success of these 
algorithms. Another direction of future research is 
the automatic extension of the product specification 
terminology in use, which is represented by an 
ontology. Such an extension is expected to improve 
the mapping algorithm’s evaluation results 
significantly. 
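To illustrate what mapping specification keys to a central terminology and converting values to standard formats could look like, the following is a minimal sketch only. It is not the algorithm under development, which this paper does not describe. The synonym table, the unit factors, the value pattern, and the function name `normalize` are all hypothetical, and a flat dictionary stands in for the ontology.

```python
import re

# Hypothetical synonym table: extracted key (lower-cased) -> central term.
KEY_SYNONYMS = {
    "weight": "weight",
    "net weight": "weight",
    "gewicht": "weight",
    "display size": "screen_size",
    "screen size": "screen_size",
}

# Hypothetical conversion factors into one standard unit per central term
# (kilograms for weight, inches for screen size).
UNIT_FACTORS = {
    "weight": {"kg": 1.0, "g": 0.001, "lb": 0.45359237},
    "screen_size": {"in": 1.0, "cm": 1 / 2.54},
}

# A number (decimal point or decimal comma) followed by a unit abbreviation.
VALUE_PATTERN = re.compile(r"^\s*([0-9]+(?:[.,][0-9]+)?)\s*([a-zA-Z]+)\s*$")


def normalize(key, value):
    """Map an extracted key/value pair to (central_term, value_in_standard_unit).

    Returns None when the key, the value format, or the unit is unknown,
    so that the caller can flag the pair instead of guessing.
    """
    term = KEY_SYNONYMS.get(key.strip().lower())
    match = VALUE_PATTERN.match(value)
    if term is None or match is None:
        return None
    number = float(match.group(1).replace(",", "."))
    factor = UNIT_FACTORS[term].get(match.group(2).lower())
    if factor is None:
        return None
    return term, number * factor
```

Under these assumptions, `normalize("Net Weight", "1500 g")` and `normalize("Gewicht", "1,5 kg")` would both map to the central term `weight` with a value of about 1.5 kilograms, which makes the two products directly comparable. Unknown keys or units return `None` rather than a guessed value.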
Integrating the algorithms of this paper and the 
described future extensions into a federated 
consumer product information system would enable 
users to create an all-embracing view of products of 
interest and to compare those products effectively, 
while requiring only a fraction of today’s effort for 
gathering product information from the information 
provider. In the same manner, the algorithms may 
be integrated into enterprise product information 
systems as well as online shopping systems, easing 
and accelerating the process of implementing 
product specifications. 
REFERENCES 
Arasu, A. and Garcia-Molina, H. (2003). Extracting 
Structured Data from Web Pages. In SIGMOD 
International Conference on Management of Data. 
San Diego, CA, USA 10-12 June 2003. ACM Press: 
New York. 
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M. 
and Etzioni, O. (2007). Open Information Extraction 
from the Web. In IJCAI 20th International Joint 
Conference on Artificial Intelligence. Hyderabad, 
India 9-12 January 2007. Morgan Kaufmann 
Publishers Inc.: San Francisco. 
Califf, M. E. and Mooney, R. J. (1997). Relational 
Learning of Pattern-Match Rules for Information 
Extraction. In ACL SIGNLL Meeting of the ACL 
Special Interest Group in Natural Language Learning. 
Madrid, Spain July 1997. T. M. Ellison: Madrid. 
Chang, C.-H. and Lui, S.-C. (2001). IEPAD: Information 
Extraction based on Pattern Discovery. In IW3C2 10th 
International Conference on the World Wide Web. 
Hong Kong, China 1-5 May 2001. ACM Press: New 
York. 
Crescenzi, V., Mecca, G. and Merialdo, P. (2001). 
Roadrunner: Towards Automatic Data Extraction from 
Large Web Sites. In VLDB Endowment 27th 
International Conference on Very Large Data Bases. 
Rome, Italy 11-14 September 2001. Morgan 
Kaufmann Publishers Inc.: San Francisco. 
Freitag, D. (1998). Information Extraction from HTML: 
Application of a General Machine Learning Approach. 
In AAAI 15th National Conference on Artificial 
Intelligence. Madison, WI, USA 26-30 July 1998. 
AAAI Press: Menlo Park. 
Hsu, C.-N. and Dung, M.-T. (1998). Generating Finite-
State Transducers for Semi-Structured Data Extraction 
from the Web. Journal of Information Systems, 23(8), 
pp.521-538. 
Kushmerick, N., Weld, D. S. and Doorenbos, R. (1997). 
Wrapper Induction for Information Extraction. In 
IJCAI 15th International Joint Conference on Artificial 
Intelligence. Nagoya, Japan 23-29 August 1997. 
Morgan Kaufmann Publishers Inc.: San Francisco. 
Laender, A. H. F., Ribeiro-Neto, B. and da Silva, A. S. 
(2002). DEByE - Data Extraction by Example. Data 
and Knowledge Engineering, 40(2), pp.121-154. 
Liu, B. (2007). Web Data Mining: Exploring Hyperlinks, 
Contents, and Usage Data. Springer: Heidelberg. 
 
LOCATING AND EXTRACTING PRODUCT SPECIFICATIONS FROM PRODUCER WEBSITES