Authors: Tarek Amr Abdallah and Beatriz de la Iglesia

Affiliation: University of East Anglia, United Kingdom

Keyword(s): Language Models, Information Retrieval, Web Classification, Web Mining, Machine Learning.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Clustering and Classification Methods ; Computational Intelligence ; Evolutionary Computing ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Machine Learning ; Methodologies and Technologies ; Mining Text and Semi-Structured Data ; Operational Research ; Optimization ; Soft Computing ; Symbolic Systems ; Web Mining

Abstract: This paper is concerned with the classification of web pages using their Uniform Resource Locators (URLs) only. There are a number of contexts these days in which it is important to have an efficient and reliable classification of a web page from the URL, without the need to visit the page itself. For example, emails or messages sent in social media may contain URLs and require automatic classification. The URL is very concise and may be composed of concatenated words, so classification with only this information is a very challenging task. Much of the current research on URL-based classification has achieved reasonable accuracy, but the current methods do not scale very well with large datasets. In this paper, we propose a new solution based on the use of an n-gram language model. Our solution shows good classification performance and is scalable to larger datasets. It also allows us to tackle the problem of classifying new URLs with unseen sub-sequences.
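
The abstract does not give implementation details, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes character-level trigrams, one count table per class, and add-one (Laplace) smoothing so that URLs containing unseen sub-sequences still receive a finite score. It scores a URL as a bag of character n-grams under each class, which is a simplification of a full conditional n-gram language model. The names `char_ngrams` and `NGramURLClassifier` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(url, n):
    """Return the overlapping character n-grams of a lowercased URL."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramURLClassifier:
    """Hypothetical per-class n-gram count model with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n                          # n-gram length (assumed, not from the paper)
        self.counts = defaultdict(Counter)  # class label -> n-gram counts
        self.totals = Counter()             # class label -> total n-grams seen
        self.vocab = set()                  # every n-gram seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            grams = char_ngrams(url, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def log_score(self, url, label):
        # The extra +1 in the vocabulary size reserves probability
        # mass for n-grams never seen during training.
        vocab_size = len(self.vocab) + 1
        total = self.totals[label]
        return sum(
            math.log((self.counts[label][gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(url, self.n)
        )

    def predict(self, url):
        # Choose the class whose smoothed model gives the URL the highest score.
        return max(self.counts, key=lambda label: self.log_score(url, label))
```

Training takes one pass over the URLs and prediction costs one table lookup per n-gram per class. A real system would also need class priors, a better smoothing scheme, and tuning of `n`.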

CC BY-NC-ND 4.0

Paper citation in several formats:
Amr Abdallah, T. and de la Iglesia, B. (2014). URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014) - KDIR; ISBN 978-989-758-048-2; ISSN 2184-3228, SciTePress, pages 14-21. DOI: 10.5220/0005030500140021

@conference{kdir14,
author={Tarek {Amr Abdallah} and Beatriz {de la Iglesia}},
title={URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014) - KDIR},
year={2014},
pages={14--21},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005030500140021},
isbn={978-989-758-048-2},
issn={2184-3228},
}

TY - CONF

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014) - KDIR
TI - URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models
SN - 978-989-758-048-2
IS - 2184-3228
AU - Amr Abdallah, T.
AU - de la Iglesia, B.
PY - 2014
SP - 14
EP - 21
DO - 10.5220/0005030500140021
PB - SciTePress