A Lightweight Approach for Extracting Product Records from the Web

Andrea Horch, Holger Kett, Anette Weisbecker

2015

Abstract

Gathering product records from the Web is important to both shoppers and on-line retailers who want to compare products and prices. Consumers want to find the best price for a product, whereas on-line retailers want to compare their offers with those of their competitors in order to remain competitive. Because of the huge number and variety of product offers on the Web, an automated approach for collecting product data is needed. In this paper, we propose a lightweight approach to automatically identify and extract product records from arbitrary e-shop websites. For this purpose, we have adopted and extended an existing technique called Tag Path Clustering, which clusters similar HTML tag paths, and developed a novel filtering mechanism specifically for extracting product records from websites.



Paper Citation


in Harvard Style

Horch A., Kett H. and Weisbecker A. (2015). A Lightweight Approach for Extracting Product Records from the Web. In Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-106-9, pages 420-430. DOI: 10.5220/0005441404200430


in Bibtex Style

@conference{webist15,
author={Andrea Horch and Holger Kett and Anette Weisbecker},
title={A Lightweight Approach for Extracting Product Records from the Web},
booktitle={Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2015},
pages={420-430},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005441404200430},
isbn={978-989-758-106-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - A Lightweight Approach for Extracting Product Records from the Web
SN - 978-989-758-106-9
AU - Horch A.
AU - Kett H.
AU - Weisbecker A.
PY - 2015
SP - 420
EP - 430
DO - 10.5220/0005441404200430