EFFICIENT RSS FEED GENERATION FROM HTML PAGES

Jun Wang, Kanji Uchino

Abstract

Although RSS is a promising way to track and personalize the flow of new Web information, many current Web sites do not yet provide RSS feeds. Convenient approaches to “RSSify” suitable existing Web content have become a pressing need. This paper presents EHTML2RSS, an efficient system that translates semi-structured HTML pages into structured RSS feeds, using different approaches depending on the features of the HTML page. For information items that carry a release time, the system provides an automatic approach based on time pattern discovery. A second automatic approach, based on repeated tag pattern discovery, converts regular pages that lack a time pattern. A semi-automatic approach based on labelling handles irregular pages or specific sections of Web pages according to the user’s requirements. Experimental results show that our system is efficient and effective in facilitating RSS feed generation.
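The abstract does not spell out how time pattern discovery works, so the following is only a rough illustration of the general idea, not the EHTML2RSS algorithm. It assumes a single hard-coded date format (year-month-day) and assumes each date is followed by the link of the item it belongs to. The names `DATE_RE`, `TimedItemParser`, and `html_to_rss` are hypothetical and chosen for this sketch.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime
from html.parser import HTMLParser

# Assumption: dates look like 2005-03-14, 2005/03/14 or 2005.03.14.
# A real system would discover the page's time pattern rather than fix it.
DATE_RE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")


class TimedItemParser(HTMLParser):
    """Pairs each date found in page text with the next link after it."""

    def __init__(self):
        super().__init__()
        self.items = []            # dicts with keys: date, link, title
        self._pending_date = None  # most recent date not yet used
        self._href = None          # href of the <a> being read, if any
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title_parts = []

    def handle_data(self, data):
        if self._href is not None:
            # Text inside a link becomes the item title.
            self._title_parts.append(data)
            return
        match = DATE_RE.search(data)
        if match:
            year, month, day = (int(g) for g in match.groups())
            try:
                self._pending_date = datetime(year, month, day,
                                              tzinfo=timezone.utc)
            except ValueError:
                pass  # digits matched, but not a valid calendar date

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._title_parts).strip()
            if self._pending_date is not None and title:
                self.items.append({"date": self._pending_date,
                                   "link": self._href,
                                   "title": title})
                self._pending_date = None
            self._href = None


def html_to_rss(html, channel_title, channel_link):
    """Returns an RSS 2.0 document (as a string) for the dated links in html."""
    parser = TimedItemParser()
    parser.feed(html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Generated from " + channel_link
    for item in parser.items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # RSS 2.0 uses RFC 822 style dates in pubDate.
        ET.SubElement(node, "pubDate").text = format_datetime(item["date"])
    return ET.tostring(rss, encoding="unicode")
```

Calling `html_to_rss(page_source, "Site news", "http://example.com/")` on a news listing page would yield one `<item>` per dated link. Links with no preceding date are skipped, which is the case the abstract assigns to the repeated tag pattern and labelling approaches.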



Paper Citation


in Harvard Style

Wang J. and Uchino K. (2005). EFFICIENT RSS FEED GENERATION FROM HTML PAGES. In Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 972-8865-20-1, pages 311-318. DOI: 10.5220/0001230103110318


in Bibtex Style

@conference{webist05,
author={Jun Wang and Kanji Uchino},
title={EFFICIENT RSS FEED GENERATION FROM HTML PAGES},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2005},
pages={311-318},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001230103110318},
isbn={972-8865-20-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - EFFICIENT RSS FEED GENERATION FROM HTML PAGES
SN - 972-8865-20-1
AU - Wang J.
AU - Uchino K.
PY - 2005
SP - 311
EP - 318
DO - 10.5220/0001230103110318