Content-based Title Extraction from Web Page

Najlah Gali, Pasi Fränti

Abstract

Web pages are usually designed in a presentation-oriented fashion and therefore contain a large amount of non-informative data such as navigation banners, advertisements, and functional text. For a particular user, only informative data such as the title, main content, and representative images are considered useful. Existing methods for title extraction rely on the structural and visual features of the web page. In this paper, we propose a simpler but more effective method that analyses the content of the title and meta tags with respect to the main body of the page. We segment the title and meta tags using a set of predefined delimiters and score the segments using three criteria: placement in the tag, popularity within all header tags in the page, and position in the link of the web page. The method is fully automated, template independent, and not limited to any particular type of web page. Experimental results show that the method significantly improves the accuracy (average similarity to the ground truth title) from 62% to 84%.
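
The segment-and-score idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the delimiter set, the three scoring formulas, the equal weighting, and the function names (`split_segments`, `score_segment`, `extract_title`) are all assumptions made for illustration, since the abstract does not specify them. The sketch also assumes that the "link" is the page URL and only checks whether a segment's words occur in it, not where.

```python
import re

# Hypothetical delimiter set; the abstract does not list the actual delimiters.
DELIMITERS = re.compile(r"\s*[|»]\s*|\s+[-–:]\s+")


def split_segments(text):
    """Split a title or meta tag value into candidate segments."""
    return [part.strip() for part in DELIMITERS.split(text) if part.strip()]


def score_segment(segment, index, total, headers, url):
    """Score one segment with three illustrative criteria (equal weights assumed)."""
    seg = segment.lower()

    # Criterion 1 (assumed form): earlier placement in the tag scores higher.
    placement = 1.0 - index / total

    # Criterion 2 (assumed form): share of header texts that contain the segment.
    if headers:
        popularity = sum(seg in h.lower() for h in headers) / len(headers)
    else:
        popularity = 0.0

    # Criterion 3 (assumed form): share of the segment's words found in the URL.
    url_chars = re.sub(r"[^a-z0-9]", "", url.lower())
    words = re.findall(r"[a-z0-9]+", seg)
    in_url = sum(w in url_chars for w in words) / len(words) if words else 0.0

    return placement + popularity + in_url


def extract_title(tag_texts, headers, url):
    """Return the best-scoring segment over all title/meta tag values."""
    best, best_score = "", float("-inf")
    for text in tag_texts:
        segments = split_segments(text)
        for i, segment in enumerate(segments):
            score = score_segment(segment, i, len(segments), headers, url)
            if score > best_score:
                best, best_score = segment, score
    return best


if __name__ == "__main__":
    # Made-up example page data, for illustration only.
    print(extract_title(
        ["Pizza Hut - Helsinki | Restaurants"],
        ["Pizza Hut", "Pizza Hut menu", "Contact"],
        "https://www.pizzahut.fi/helsinki",
    ))  # prints "Pizza Hut"
```

In this made-up example, "Pizza Hut" wins because it comes first in the title tag, appears in two of the three headers, and both of its words occur in the URL. A real system would need to extract the tag and header texts from the HTML first, which is omitted here.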


Paper Citation


in Harvard Style

Gali N. and Fränti P. (2016). Content-based Title Extraction from Web Page. In Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-186-1, pages 204-210. DOI: 10.5220/0005794102040210


in Bibtex Style

@conference{webist16,
author={Najlah Gali and Pasi Fränti},
title={Content-based Title Extraction from Web Page},
booktitle={Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2016},
pages={204-210},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005794102040210},
isbn={978-989-758-186-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - Content-based Title Extraction from Web Page
SN - 978-989-758-186-1
AU - Gali N.
AU - Fränti P.
PY - 2016
SP - 204
EP - 210
DO - 10.5220/0005794102040210