Efficient Hashing of Multiple Spaced Seeds with Application

Eleonora Mian, Enrico Petrucci, Cinzia Pizzi, Matteo Comin

2023

Abstract

Alignment-free analysis of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Hashing k-mers is a common operation across many alignment-free applications, and it is widely used for indexing, querying, and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, have come into routine use in place of k-mers. Spaced seeds improve sensitivity with respect to k-mers in many applications; however, hashing spaced seeds substantially increases the computational time. Moreover, using multiple spaced seeds can further increase accuracy, at the cost of running time. In this paper we address the problem of efficient multiple spaced seed hashing. The proposed algorithms exploit the similarity of adjacent spaced seed hash values in an input sequence in order to efficiently compute the next hashes. We report results on several tests showing that our methods significantly outperform previously proposed algorithms, with a speedup that can reach 20x. We also apply these efficient spaced seed hashing algorithms to an application in the field of metagenomics, the classification of reads performed by Clark-S (Ounit and Lonardi, 2016), and we show that a significant speedup can be obtained, thus resolving the slowdown introduced by the use of multiple spaced seeds. Code available at: https://github.com/CominLab/MISSH.
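To make the notion of spaced seed hashing concrete, the following is a minimal sketch of the naive baseline, in which every hash is computed from scratch at each position. It is not the algorithm proposed in the paper or the MISSH implementation. The 2-bit nucleotide encoding, the seed written as a string of '1' (match) and '0' (don't care) characters, and the function names `spaced_seed_hashes` and `multi_seed_hashes` are all assumptions made for illustration.

```python
# Illustrative baseline only: NOT the paper's method.
# Assumed 2-bit encoding of nucleotides (a common convention, chosen here for illustration).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(seq, seed):
    """Hash every window of `seq` under one spaced seed, recomputing from scratch.

    `seed` is a string of '1' (position is hashed) and '0' (position is ignored).
    Each hashed base contributes 2 bits, packed in order of appearance.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    hashes = []
    for start in range(len(seq) - len(seed) + 1):
        h = 0
        for rank, offset in enumerate(care):
            h |= ENCODE[seq[start + offset]] << (2 * rank)
        hashes.append(h)
    return hashes


def multi_seed_hashes(seq, seeds):
    """Apply the naive routine once per seed (hypothetical helper).

    The cost grows linearly with the number of seeds, which is the
    slowdown that motivates faster multiple-seed hashing.
    """
    return {seed: spaced_seed_hashes(seq, seed) for seed in seeds}


if __name__ == "__main__":
    print(spaced_seed_hashes("ACGTACGT", "1101"))
    # A substitution under a '0' position leaves the hash unchanged.
    print(spaced_seed_hashes("ACGT", "1101") == spaced_seed_hashes("ACTT", "1101"))
```

In this sketch each window costs one lookup per '1' position and every seed is processed independently. Consecutive windows overlap in most of their bases, and that redundancy between adjacent hash values is what the abstract describes exploiting.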


Paper Citation


in Harvard Style

Mian E., Petrucci E., Pizzi C. and Comin M. (2023). Efficient Hashing of Multiple Spaced Seeds with Application. In Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2023) - Volume 3: BIOINFORMATICS; ISBN 978-989-758-631-6, SciTePress, pages 155-162. DOI: 10.5220/0011632900003414


in Bibtex Style

@conference{bioinformatics23,
author={Eleonora Mian and Enrico Petrucci and Cinzia Pizzi and Matteo Comin},
title={Efficient Hashing of Multiple Spaced Seeds with Application},
booktitle={Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2023) - Volume 3: BIOINFORMATICS},
year={2023},
pages={155-162},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011632900003414},
isbn={978-989-758-631-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2023) - Volume 3: BIOINFORMATICS
TI - Efficient Hashing of Multiple Spaced Seeds with Application
SN - 978-989-758-631-6
AU - Mian E.
AU - Petrucci E.
AU - Pizzi C.
AU - Comin M.
PY - 2023
SP - 155
EP - 162
DO - 10.5220/0011632900003414
PB - SciTePress