Authors: Maxence Delong 1 ; Baptiste David 1 and Eric Filiol 2 ; 3

Affiliations: 1 Laboratoire de Virologie et de Cryptologie Opérationnelles, ESIEA, Laval, France ; 2 Department of Computing, ENSIBS, Vannes, France ; 3 High School of Economics, Moscow, Federation of Russia

Keyword(s): Crawler Trap, Information Distance, Bots, TOR Network.

Abstract: In web security, websites want to protect themselves from data gathering performed by automated programs called bots. Crawler traps are an effective brake on such programs. By dynamically creating similar pages or random content, crawler traps feed fake information to the bot, causing it to waste time and resources. Currently, no available bot is able to detect the presence of a crawler trap. Our aim was to find a generic solution to escape any type of crawler trap. Since random generation is potentially endless, crawler trap detection can only be performed on the fly. Using machine learning, it is possible to compare datasets of webpages extracted from regular websites with those generated by crawler traps. Since machine learning requires distances, we designed our system using information theory. We compared widely used distances with a new one designed to take heterogeneous data into account. Indeed, two pages do not necessarily contain the same words, and it is operationally impossible to know all possible words in advance. To solve this problem, our new distance compares two webpages, and the results show that it is more accurate than the other distances tested. By extension, our distance has a much wider potential scope than crawler trap detection alone. This opens many new possibilities in data classification and data mining.
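To make the idea of an information-theoretic distance between two webpages concrete, here is a minimal sketch. The abstract does not specify the authors' new distance, so this is not their method: it uses the standard Jensen-Shannon distance as a stand-in, and the function names `word_distribution` and `jensen_shannon_distance` are hypothetical. It only illustrates the heterogeneity point: the two pages need not share a vocabulary, and no global word list has to be known in advance, because the comparison runs over the union of the words actually seen.

```python
import math
import re
from collections import Counter


def word_distribution(text):
    """Map each lowercase word of `text` to its relative frequency."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jensen_shannon_distance(text_a, text_b):
    """Jensen-Shannon distance (log base 2) between two texts, in [0, 1].

    Illustrative stand-in only; NOT the distance proposed in the paper.
    """
    p = word_distribution(text_a)
    q = word_distribution(text_b)
    # The union vocabulary handles words that appear on only one page.
    vocabulary = set(p) | set(q)
    divergence = 0.0
    for word in vocabulary:
        pa = p.get(word, 0.0)
        qa = q.get(word, 0.0)
        mean = 0.5 * (pa + qa)
        if pa > 0:
            divergence += 0.5 * pa * math.log2(pa / mean)
        if qa > 0:
            divergence += 0.5 * qa * math.log2(qa / mean)
    return math.sqrt(max(divergence, 0.0))
```

Under these assumptions, identical pages are at distance 0 and pages with no word in common are at distance 1. A crawler could, for example, compare each newly fetched page with the pages already seen on the same site, but any decision threshold would have to be learned from data, as the abstract suggests.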

CC BY-NC-ND 4.0

Paper citation in several formats:
Delong, M.; David, B. and Filiol, E. (2020). Detection of Crawler Traps: Formalization and Implementation Defeating Protection on Internet and on the TOR Network. In Proceedings of the 6th International Conference on Information Systems Security and Privacy - ForSE; ISBN 978-989-758-399-5; ISSN 2184-4356, SciTePress, pages 775-783. DOI: 10.5220/0009367207750783

@conference{forse20,
author={Maxence Delong and Baptiste David and Eric Filiol},
title={Detection of Crawler Traps: Formalization and Implementation Defeating Protection on Internet and on the TOR Network},
booktitle={Proceedings of the 6th International Conference on Information Systems Security and Privacy - ForSE},
year={2020},
pages={775-783},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009367207750783},
isbn={978-989-758-399-5},
issn={2184-4356},
}

TY - CONF

JO - Proceedings of the 6th International Conference on Information Systems Security and Privacy - ForSE
TI - Detection of Crawler Traps: Formalization and Implementation Defeating Protection on Internet and on the TOR Network
SN - 978-989-758-399-5
IS - 2184-4356
AU - Delong, M.
AU - David, B.
AU - Filiol, E.
PY - 2020
SP - 775
EP - 783
DO - 10.5220/0009367207750783
PB - SciTePress