SMT: A High-Performance Approach for Counting Kmers

Jader M. C. Garbelini (1, a), Danilo Sipoli Sanches (2, b), André Yoshiaki Kashiwabara (2, c) and Aurora T. R. Pozo (1, d)

1 Federal University of Paraná, Curitiba, Brazil

2 Federal University of Technology, Cornélio Procópio, Brazil

Keywords:

Kmers, Motifs, Sequence Analysis, Optimization.

Abstract:

Motivation: Finding conserved motifs in DNA sequences is a key problem in bioinformatics. The growing availability of large-scale genomic data poses significant challenges for computational biology, particularly regarding efficient analysis, kmer identification, and the presence of noise. The detection of conserved motifs and patterns in DNA sequences is essential for understanding gene function and regulation. It is therefore necessary to develop novel approaches and methods that can handle these large volumes of information and provide accurate and fast results. Results: We present SMT, a tool designed to efficiently store and count kmers, optimizing memory usage and computation time. SMT has also proven effective in discovering motifs in ChIP-seq data, allowing the identification of conserved regions in sequences. Furthermore, SMT supports exact searches in time proportional to k, that is, O(k), and retrieves the most abundant kmers through a frequency table. This approach facilitates large-scale data analysis and provides important insights into the conserved properties of biological sequences. The application of SMT in motif discovery demonstrates its potential to drive research in bioinformatics and genomics. Availability and implementation: Supplementary data and results are available to provide additional information and support the conclusions. SMT and its source code can be found at https://github.com/jadermcg/smt.

1 BACKGROUND

In recent years, there has been an exponential increase

in the amount of available genomic data, thanks to

advances in DNA and RNA sequencing technologies.

The detection of motifs and conserved patterns in se-

quences is important for understanding gene func-

tions and regulations, as well as for identifying func-

tional and structural elements of the genome (Good-

win et al., 2016). To date, various data structures and

algorithms have been proposed to deal with the in-

creasing demand for efﬁcient large-scale kmer anal-

ysis. However, many of these solutions are not

sufﬁciently fast or require signiﬁcant computational

resources, which limits their applicability to ever-

growing genomic datasets (Deorowicz et al., 2019)

(Marchet et al., 2019).

In this context, we introduce the Sparse Motif Tree (SMT), a tool specifically designed to store and count kmers efficiently. SMT optimizes memory usage and computation time, allowing for the rapid and accurate analysis of large volumes of genomic data. SMT also allows histograms in hash table format to be extracted efficiently from the stored kmers. These features enable researchers to identify recurring patterns and conserved regions within sequences, as well as analyze variations within DNA and RNA sequences.

a https://orcid.org/0000-0003-2289-882X

b https://orcid.org/0000-0002-8972-5221

c https://orcid.org/0000-0003-3280-2035

d https://orcid.org/0000-0001-5808-3919

Therefore, SMT is particularly useful in the prob-

lem of motif discovery, a central challenge in bioin-

formatics and genomics (Bailey et al., 2015). De-

tecting conserved sequences and recurring patterns in

DNA and RNA sequences is a key task for identifying

functional elements and understanding gene regulation in different organisms. The efficiency and versatility of SMT allow researchers to quickly analyze large genomic datasets and reliably identify biologically relevant regions.

The results obtained suggest that SMT and the developed algorithms have great potential to count kmers efficiently and facilitate the analysis of ChIP-seq and other types of massive data. This work contributes to the development of new tools and methods to address the challenge of finding conserved regions and opens new perspectives for the analysis of molecular biology data. In addition to its application in motif discovery, SMT can also be used in several other contexts within bioinformatics and genomics, such as comparative genome analysis, identification of functional elements in sequences, study of genetic variations and metagenomics (Wood et al., 2019; Chaisson et al., 2019). SMT's versatility and efficiency make it a valuable tool for researchers looking for scalable, high-performance solutions to handle large-scale genomic data analysis.

Garbelini, J., Sanches, D., Kashiwabara, A. and Pozo, A.

SMT: A High-Performance Approach for Counting Kmers.

DOI: 10.5220/0012546500003657

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2024) - Volume 1, pages 545-552

ISBN: 978-989-758-688-0; ISSN: 2184-4305

In this paper, we describe the architecture and implementation of SMT. In addition, we discuss the potential of SMT to boost research in bioinformatics and genomics, helping researchers to explore and understand the complexity and diversity of genomes of different organisms. In the supplementary materials, we provide the pseudocode of the algorithms and some complexity analysis, and also demonstrate a practical application of SMT in discovering biological motifs. The remainder of this paper is organized as follows: Section 2 describes the workings and software implementation of SMT, along with its key satellite algorithms. Section 3 compares SMT with other state-of-the-art algorithms, and Section 4 analyzes the statistical significance of the observed differences. Finally, Section 5 presents our final considerations.

2 IMPLEMENTATION

SMT was developed with the primary objective of minimizing RAM consumption, thereby enabling execution on computers with modest specifications. The optimization of memory usage

is achieved through efﬁcient compression strategies

and a high-performance sparse data structure, which

together provide a robust solution without requiring

high-capacity hardware. This approach democratizes

access to genomic data analysis, allowing researchers

and professionals with limited hardware resources to

perform complex analyses without the need for sig-

niﬁcant investments in IT infrastructure.

Our method employs the divide and conquer strat-

egy to efﬁciently handle large genomic datasets. This

approach breaks down the original data into smaller

subsets, allowing the parallel and independent analy-

sis of each segment. By dividing the initial problem

into more manageable parts, the divide and conquer

strategy facilitates the identiﬁcation and counting of

kmers in a more effective and scalable manner.

SMT is represented through a two-dimensional data structure, implemented from text fragments of the same width. It was loosely inspired by the theory of Room squares (Archbold and Johnson, 1958), in which each element $M_{i,j}$ is either empty or holds a value between 1 and $\nu$. In this manner, each row of $M$ represents a node $1, 2, 3, \ldots, \nu$ of the tree, columns 1 to 4 represent the nucleotides A, C, G, T, and the last column stores the count of the fragments. SMT supports different types of efficient queries, such as KSEARCH, HMAP, and KDIVE. KSEARCH was designed to perform exact searches in the SMT, returning whether a specific kmer exists as well as its count. This algorithm capitalizes on the hierarchical tree structure, allowing an efficient and precise search whose execution time is linear in the size of the kmer, O(k).
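As a rough illustration of this layout, the sketch below stores a tree as a list of rows with four child columns (A, C, G, T) and a final count column, and performs an exact lookup in O(k) steps. It is a simplified reading of the description above, not the authors' code: the names `new_tree`, `insert`, `ksearch` and `count_kmers`, the use of row 0 as the root, and the use of 0 as the "empty" marker are all assumptions, and the rows here are dense lists rather than a sparse encoding.

```python
COL = {"A": 0, "C": 1, "G": 2, "T": 3}  # columns 1 to 4 of the matrix
COUNT = 4                               # last column: fragment count


def new_tree():
    """Row 0 is the root; a 0 in a child column means 'empty' (assumption)."""
    return [[0, 0, 0, 0, 0]]


def insert(M, kmer):
    """Walk or create one row per base, then increment the count of the last row."""
    node = 0
    for base in kmer:
        col = COL[base]
        if M[node][col] == 0:
            M.append([0, 0, 0, 0, 0])
            M[node][col] = len(M) - 1
        node = M[node][col]
    M[node][COUNT] += 1


def ksearch(M, kmer):
    """Exact search: follows len(kmer) cells, so the cost is O(k)."""
    node = 0
    for base in kmer:
        node = M[node][COL[base]]
        if node == 0:
            return 0  # kmer absent
    return M[node][COUNT]


def count_kmers(M, seq, k):
    for i in range(len(seq) - k + 1):
        insert(M, seq[i:i + k])


M = new_tree()
count_kmers(M, "ACGTACGT", 3)
print(ksearch(M, "ACG"), ksearch(M, "TTT"))  # 2 0
```

The sketch only accepts upper-case A, C, G and T; how SMT treats other symbols is not described in the text.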

HMAP was devised to swiftly extract a map con-

taining all the kmers and their respective frequencies

from the SMT. The efﬁciency and speed of HMAP are

attributed to its direct execution on the SMT tree struc-

ture, which facilitates organized access and recovery

of kmers and their counts. This approach leverages

the hierarchical and organized nature of the SMT, al-

lowing for efﬁcient traversal of the tree to compile

the kmer map. Furthermore, HMAP employs libraries

to parallelize the reads and insertions into the shared

hash structure, contributing to faster execution. The

synergy between the tree structure of the SMT and the

efﬁcient execution of HMAP results in a powerful tool

that provides quick and precise extraction of the kmer

map, thus facilitating subsequent genomic analyses.
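A sequential sketch of this kind of extraction is shown below. It assumes the same hypothetical matrix layout as described for SMT (four child columns plus a count column, with 0 meaning empty) and walks the tree to fill an ordinary Python dictionary. The helper names `build` and `hmap` are illustrative, and the parallel reads and insertions mentioned above are deliberately left out.

```python
BASES = "ACGT"


def build(seqs, k):
    """Build a small matrix-backed tree of all kmers (illustrative helper)."""
    M = [[0, 0, 0, 0, 0]]
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            node = 0
            for base in seq[i:i + k]:
                col = BASES.index(base)
                if M[node][col] == 0:
                    M.append([0, 0, 0, 0, 0])
                    M[node][col] = len(M) - 1
                node = M[node][col]
            M[node][4] += 1
    return M


def hmap(M):
    """Depth-first traversal that collects every stored kmer and its count."""
    table = {}
    stack = [(0, "")]  # (row index, kmer prefix spelled so far)
    while stack:
        node, prefix = stack.pop()
        if M[node][4] > 0:
            table[prefix] = M[node][4]
        for col, base in enumerate(BASES):
            child = M[node][col]
            if child:
                stack.append((child, prefix + base))
    return table


freq = hmap(build(["ACGTACGT", "ACGA"], 3))
top = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
print(top)  # [('ACG', 3), ('CGT', 2)]
```

Sorting the resulting table by count is one simple way to retrieve the most abundant kmers.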

SMT is especially notable when compared to other structures, such as JELLYFISH (Marçais and Kingsford, 2011) and KANALYZE (Audano and Vannberg, 2014). Although the first version of JELLYFISH was released in 2011, it remains under active development; the latest update, version 2.3.1, was released in December 2023. Among the algorithms operating on the SMT, KDIVE stands out primarily from the perspective of motif analysis. This algorithm was designed to perform fast searches for text fragments in the SMT, even when these contain up to d mutations. In other words, KDIVE returns a positive search result even if the search string exhibits up to d mismatches. An important aspect is that the algorithm assumes the probability of a mismatch to be uniform at every position of the sequence.
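One common way to search a tree with a mismatch budget is a depth-first walk that tries all four bases at each position and spends one unit of the budget whenever the base differs from the query. The sketch below applies that idea to the hypothetical matrix layout (four child columns plus a count column). It is an assumption about how such a search could work, not the published KDIVE algorithm, and the names `build` and `kdive` are illustrative.

```python
BASES = "ACGT"


def build(seqs, k):
    """Build a small matrix-backed tree of all kmers (illustrative helper)."""
    M = [[0, 0, 0, 0, 0]]
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            node = 0
            for base in seq[i:i + k]:
                col = BASES.index(base)
                if M[node][col] == 0:
                    M.append([0, 0, 0, 0, 0])
                    M[node][col] = len(M) - 1
                node = M[node][col]
            M[node][4] += 1
    return M


def kdive(M, query, d):
    """Return stored kmers within Hamming distance d of query, with their counts."""
    hits = {}
    stack = [(0, 0, 0, "")]  # (row, position in query, mismatches used, path)
    while stack:
        node, pos, used, path = stack.pop()
        if pos == len(query):
            if M[node][4] > 0:
                hits[path] = M[node][4]
            continue
        for col, base in enumerate(BASES):
            child = M[node][col]
            cost = used + (base != query[pos])
            if child and cost <= d:  # prune branches that exceed the budget
                stack.append((child, pos + 1, cost, path + base))
    return hits


M = build(["ACGTACGT", "ACGA"], 3)
print(kdive(M, "ACT", 1))  # {'ACG': 3}
```

With d = 0 the walk degenerates to an exact search; larger budgets explore more branches, so the cost grows with d.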

BIOINFORMATICS 2024 - 15th International Conference on Bioinformatics Models, Methods and Algorithms


3 RESULTS AND DISCUSSION

In this section, we present the results obtained from the evaluation and analysis of SMT, in comparison with the JELLYFISH and KANALYZE algorithms.

These comparisons were important to validate the ef-

ﬁcacy and efﬁciency of SMT in the context of genomic

data processing and analysis. The comparative anal-

ysis allows corroborating the superior performance

of SMT in various scenarios and also highlighting its

distinctive features that contribute to its high perfor-

mance and ﬂexibility in handling kmers. Throughout

this section, we will discuss the methodological as-

pects of the tests carried out, as well as the metrics

employed for the evaluation of the data structures and

algorithms under study. The insights derived from

this analysis provide a deep understanding of the po-

tential of SMT as a robust and effective data structure

for analyzing large genomic data sets.

To evaluate the algorithms, all .bed files from the JASPAR 2022 (Castro-Mondragon et al., 2022) repository relevant to ChIP-seq data with more than 10,000 sequences were selected, totaling 131 distinct datasets. The algorithms were executed with values of k ranging from 5 to 30. Execution times and RAM consumption were monitored through the command /usr/bin/time -v, available in practically all Linux/Unix operating systems. All tests were run on the Ubuntu 22.04.3 LTS operating system, on a machine equipped with AMD EPYC 7B12 processors, 8 GB of RAM, and a clock speed of approximately 2250 MHz. For reproducibility purposes, the following command line instructions were used:

1. SMT: smt -i <fasta file> -k <size of kmer> -s 500

2. JELLYFISH: jellyfish count -m <size of kmer> -s 500M

3. KANALYZE: kanalyze count -k <size of kmer> -f fasta <fasta file>
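For readers who want to collect the same two metrics programmatically, the following sketch parses the report that GNU `/usr/bin/time -v` writes to standard error. The function name `parse_time_v` and the sample report are illustrative assumptions; only the two field labels are taken from the GNU time output format.

```python
import re


def parse_time_v(report):
    """Return (wall-clock seconds, peak resident set size in MB) from a time -v report."""
    rss_kb = int(re.search(
        r"Maximum resident set size \(kbytes\): (\d+)", report).group(1))
    clock = re.search(
        r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\): ([\d:.]+)", report).group(1)
    seconds = 0.0
    for field in clock.split(":"):  # handles both m:ss and h:mm:ss
        seconds = seconds * 60 + float(field)
    return seconds, rss_kb / 1024


# Hypothetical excerpt of a report, used only to exercise the parser.
sample = """
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.50
    Maximum resident set size (kbytes): 51200
"""
print(parse_time_v(sample))  # (1.5, 50.0)
```

In practice the report could be captured with `subprocess.run(["/usr/bin/time", "-v", ...], capture_output=True, text=True).stderr`, where the remaining arguments are one of the command lines listed above.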

Figure 1 displays the average performance of the algorithms considering all datasets and all values of k. In the upper left part of this figure, illustrating RAM consumption, it is observed that the JELLYFISH algorithm requires considerably more memory compared to the other two. Specifically, it consumes about 1000 MB, while KANALYZE consumes approximately 250 MB and SMT uses about 50 MB. In the upper right graph, depicting time (in seconds), KANALYZE shows the longest execution time, reaching about 8 seconds. JELLYFISH, in turn, takes approximately 2 seconds, while SMT shows the shortest time, close to 0.5 seconds.

Continuing with Figure 1, observing the bottom-left graph, we note that JELLYFISH has a median close to 1400 MB with outliers exceeding 3000 MB. KANALYZE has its median around 300 MB, with peaks reaching almost 2000 MB. Conversely, SMT maintains lower consumption, with a median near 150 MB and outliers close to 3000 MB. In the bottom-right graph, JELLYFISH shows low dispersion with its median surpassing 4 seconds. KANALYZE exhibits high dispersion, with a median close to 7 seconds and outliers going beyond 55 seconds. SMT remains with the best performance, with a median close to 0.5 seconds and outliers nearing 18 seconds.

The analysis of the interquartile range (IQR) for the RAM consumption of the three algorithms reveals important insights about their dispersion. The JELLYFISH algorithm has the highest IQR, with 2284 MB, indicating a significant variation in its RAM consumption in the central half of the data. In contrast, KANALYZE and SMT have much lower IQRs, with 133 MB and 115 MB, respectively. This suggests that while JELLYFISH has a significant dispersion of RAM consumption, KANALYZE and SMT are more consistent in this aspect, at least in the central half of their distributions. These observations complement the analysis of the execution time, where SMT stands out with a considerably lower median time compared to the others.

Upon analyzing the interquartile range (IQR) for the execution time of the three algorithms, we can extract significant information regarding the variability of each method. JELLYFISH displayed an IQR of 4.19 seconds, indicating that the central half of its execution times spans this range. KANALYZE has a slightly higher IQR, with 4.73 seconds, suggesting a slightly larger variation in its central times compared to JELLYFISH. SMT, on the other hand, exhibited an IQR of only 0.5 seconds, reflecting good consistency in execution time.
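The interquartile range is the difference between the third and first quartiles, Q3 − Q1, so it measures the spread of the central half of the measurements. A minimal sketch with the Python standard library follows; the input values are made up for illustration and are not measurements from this study, and the choice of the "inclusive" quartile method is an assumption, since the paper does not say which quartile definition was used.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


times = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical execution times in seconds
print(iqr(times))  # 3.5
```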

Figure 2 illustrates the relationship between k-mer

size and two performance metrics: RAM consump-

tion and execution time for the algorithms JELLYFISH,

KANALYZE, and SMT. Regarding RAM consumption,

we observe that JELLYFISH shows an increasing con-

sumption relative to k-mer size, with a sharp increase

after 20 mers. Surprisingly, KANALYZE exhibits a

slight decrease in RAM consumption relative to k-mer

size. SMT, in turn, maintains an almost constant and

low consumption proﬁle, irrespective of the k-mer

size.

Figure 1: Comparative analysis among the algorithms SMT, JELLYFISH, and KANALYZE regarding RAM consumption (in megabytes) and execution time (in seconds). (a) Average RAM consumption. (b) Average execution time. (c) Distribution of RAM consumption, with median and outliers. (d) Distribution of execution time, with median and outliers. SMT proves to be more efficient in both aspects compared to the other two algorithms.

Figure 2: Comparative analysis among the algorithms SMT, JELLYFISH, and KANALYZE regarding RAM consumption (in megabytes) and execution time (in seconds) for various k sizes.

Figure 3 displays the behavior of the algorithms concerning RAM memory consumption, grouped by k values. Generally, it's observed that the SMT algorithm tends to have a more compact distribution of RAM usage, while JELLYFISH and KANALYZE exhibit more pronounced variations. In particular, JELLYFISH showed some peaks in RAM usage, especially in configurations 8, 12, 15, and 26. In many scenarios, it tends to use more RAM than the other two algorithms.

KANALYZE appears to have a more uniform varia-

tion of RAM usage compared to JELLYFISH, although

it still shows some oscillations. In some cases (such as

in conﬁgurations 11, 12, and 15), KANALYZE’s RAM

usage surpasses that of JELLYFISH and SMT. SMT

generally has the lowest RAM usage in almost all con-

ﬁgurations. However, in some situations (like 18 and

29), its RAM usage approaches or even surpasses the

other two algorithms.

Some conﬁgurations, like 7, 13, 21, and 25, show

a convergence in RAM usage among the three algo-

rithms, while others, like 8, 12, 15, and 26, exhibit

significant divergences between them. The SMT algorithm generally presented the lowest RAM usage and displayed a more consistent and compact distribution, while JELLYFISH and KANALYZE tend to be more volatile.

Lastly, Figure 4 illustrates the temporal performance of the algorithms across different k sizes, ranging from 5 to 30. According to this figure, JELLYFISH and SMT display more consistent temporal performance, with less variation. In most cases, both showed execution times below 10 seconds, with a slight advantage for JELLYFISH, since for some k values SMT exhibited times that surpassed 15 seconds. On the other hand, KANALYZE presents higher execution times, which frequently approach or reach the upper limit of 40 seconds. In almost all test sets, KANALYZE has the longest execution time among the three algorithms.

Figure 3: Comparative analysis among the boxplot graphs of the algorithms SMT, JELLYFISH, and KANALYZE concerning RAM consumption (in megabytes), grouped by k values.

Figure 4: Comparative analysis between the boxplot graphs of the algorithms SMT, JELLYFISH, and KANALYZE regarding time consumption (in seconds), grouped by k values.

Table 1 provides a comparative analysis of average execution time and RAM consumption among the three tools, JELLYFISH, KANALYZE, and SMT, for k from 5 to 30. On average, SMT uses less RAM than KANALYZE at every value of k and less than JELLYFISH from k = 14 onward, while maintaining a much lower execution time than JELLYFISH as k increases. The growth in execution time and RAM consumption with increasing k is more gradual in SMT, highlighting its efficiency, especially at higher values of k.

Furthermore, Table 1 shows that, starting from

k = 13, the execution time of SMT remains consis-

tently lower than that of JELLYFISH. This is an in-

teresting point, as it demonstrates the efﬁciency of

SMT in maintaining a low execution time even with

increasing k. Additionally, KANALYZE exhibits a sig-

niﬁcantly higher execution time compared to the other

two tools across all values of k, highlighting the efﬁ-

ciency of both SMT and JELLYFISH in terms of execu-

tion time.

In summary, while JELLYFISH and SMT exhibit

greater consistency in their time consumption, KAN-

ALYZE displays higher variability. Depending on the

context and speciﬁc needs of a project, the selection

of the ideal algorithm may vary. For instance, for

projects with stringent memory constraints, SMT may

be the ideal choice due to its efﬁciency. In contexts

where variability in RAM consumption is not an issue

and performance is a priority, both SMT and JELLY-

FISH are viable options. On the other hand, despite its

higher execution times, KANALYZE has the advantage

of being a Java-based tool, making it a versatile option

for systems that do not depend on speciﬁc hardware

and software.

4 STATISTICAL ANALYSIS

In this section, we will employ a rigorous statistical

approach to discern signiﬁcant differences in perfor-

mance among the algorithms in terms of execution

time and RAM memory consumption for each value

of k. Initially, we will conduct the Friedman test to

check for the existence of differences between the al-

gorithms. If a signiﬁcant difference is identiﬁed, we

will proceed with post-hoc Nemenyi analysis, with

Bonferroni correction, to determine which algorithm

stands out as the most efﬁcient for each value of k.

This analysis will enable us to draw robust conclu-

sions regarding the relative performance of the algo-

rithms under study.

Table 1: Comparative analysis of average time (in seconds) and RAM consumption (in megabytes) grouped by k.

k   Jellyfish   KAnalyze   SMT        Jellyfish   KAnalyze   SMT
    time (s)    time (s)   time (s)   RAM (MB)    RAM (MB)   RAM (MB)
5   0.01        6.06       0.19       5.12        383.39     44.51
6   0.01        6.45       0.13       5.15        377.63     50.06
7   0.01        6.87       0.18       5.13        380.32     61.39
8   0.01        7.24       0.23       5.14        378.42     77.83
9   0.01        7.46       0.28       5.18        371.73     94.32
10  0.02        7.74       0.32       6.02        371.51     106.11
11  0.08        8.20       0.37       9.53        376.00     121.10
12  0.28        8.46       0.41       23.58       370.78     129.10
13  1.09        8.68       0.45       79.76       370.00     140.45
14  4.35        8.66       0.49       304.45      367.68     153.74
15  3.27        8.76       0.54       1054.17     367.40     164.43
16  4.05        8.76       0.58       1203.98     365.46     173.34
17  4.07        9.01       0.62       1347.63     360.25     183.47
18  4.11        8.95       0.67       1473.61     359.49     194.03
19  4.12        8.89       0.70       1618.81     358.23     206.14
20  4.14        8.88       0.75       1753.24     354.65     220.10
21  4.35        8.75       0.78       1885.82     352.14     231.02
22  4.18        8.68       0.83       2025.08     348.31     236.46
23  3.49        8.72       0.87       2102.76     348.49     249.19
24  4.23        8.70       0.90       2293.25     347.14     262.14
25  4.25        8.53       0.94       2425.23     340.15     269.82
26  4.28        8.48       0.97       2558.67     339.98     275.51
27  4.29        8.44       1.01       2701.95     341.64     293.82
28  4.34        8.36       1.04       2801.81     337.39     302.62
29  4.34        8.35       1.07       2966.11     334.86     308.54
30  4.36        8.26       1.10       3105.75     335.34     310.63

Table 2 contains comparative results for the three tested algorithms: KANALYZE, JELLYFISH, and SMT. The comparison was made using the Friedman test with Nemenyi post-hoc for the time variable, and the results were categorized according to different values of k.

We can observe that the values of $\chi^2$ are substantially high for all values of k, oscillating mainly around 262 and gradually lowering to 229.481. The associated p-values are extremely low, often on the order of $10^{-50}$ or smaller. This indicates that the differences between the compared groups are statistically significant. The columns for paired comparisons (KAN VS JELLY, SMT VS JELLY, SMT VS KAN) show that, for many values of k, the p-values are so low that they almost reach zero, indicating that the differences between the methods are statistically significant. However, a p-value of 0.967 for k = 12 in the comparison between SMT and JELLY stands out as an exception, indicating that there is no significant difference between these two methods in this specific case. The RESULT column summarizes the efficacy of the SMT method in relation to the others. For values of k from 5 to 11, SMT underperforms, as indicated by the − sign in the RESULT column. For values of k from 13 to 30, SMT outperforms the other methods, as indicated by the + sign.
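The recurring value of 262 has a simple interpretation: with n = 131 datasets and 3 algorithms, the Friedman statistic reaches its maximum n(3 − 1) = 262 when the algorithms are ranked in the same order on every dataset. The sketch below computes the textbook Friedman statistic without tie correction to show this. The function name `friedman_chi2` is illustrative, and the input rows are a stand-in (the k = 5 averages from Table 1 repeated 131 times), not the per-dataset measurements, which are not reported here.

```python
import math


def friedman_chi2(rows):
    """Friedman statistic for rows of per-dataset measurements (no ties assumed)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: row[j])  # smallest value gets rank 1
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)


# Stand-in data: identical ordering (Jellyfish < SMT < KAnalyze) on all 131 datasets.
rows = [[0.01, 6.06, 0.19]] * 131
chi2 = friedman_chi2(rows)
p_value = math.exp(-chi2 / 2)  # chi-square survival function for 2 degrees of freedom
print(chi2, p_value)
```

Under these assumptions the statistic is exactly 262 and the p-value is about 1.28e-57, matching the rows of the tables in which one ordering holds across all datasets; smaller statistics indicate that the ordering varied between datasets.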

The analysis suggests that the performance of the

SMT method is strongly dependent on the value of the

k variable. The algorithm appears to be less effective

for smaller values of k and more effective for higher

values. Such observation may be important for the

selection of the most suitable method depending on

the application context and the considered value of k.

Finally, Table 3 contains comparative results for the three tested algorithms, KANALYZE, JELLYFISH, and SMT, in relation to the RAM variable. Similar to the previous test, this comparison was conducted using the Friedman test with Nemenyi post-hoc, and the results were categorized according to different values of k.

Analyzing the obtained $\chi^2$ values, we observe that they are consistently high, often close to 262 and gradually reducing to 226.733 at the end of the table. The corresponding p-values are extremely low, often on the order of $10^{-50}$ or lower, indicating that the differences between the compared groups are statistically significant. The p-values for the paired comparisons (KAN VS JELLY, SMT VS JELLY, SMT VS KAN) are, in most cases, extremely low, suggesting a significant difference between the compared methods. The only paired p-value that is not practically zero is 0.0495, for KAN VS JELLY at k = 14.

In the RESULT column, the efficacy of the SMT method is summarized in relation to the other algorithms. For k values from 5 to 12, the SMT algorithm shows inferior performance, as indicated by the "-" sign. For k values from 13 to 30, the SMT algorithm demonstrates superiority, represented by the "+" sign. The data suggest that the performance of the SMT method is strongly dependent on the value of k, similar to the observation made in the time analysis. The algorithm is less effective for lower k values and more effective for higher k values. This information may be crucial when choosing the most appropriate method, depending on the considered value of k and the system resource in question.

Table 2: Results of the Friedman test with Nemenyi post-hoc for the time variable grouped by k. Legend: (+) SMT won. (-) SMT lost. (=) there was no significant difference between first place and the others.

k   χ²       p-value   KAN vs JELLY  SMT vs JELLY  SMT vs KAN  Result
5   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
6   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
7   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
8   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
9   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
10  262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
11  259.53   4.4e-57   <2e-16        5.04e-14      3.59e-14    -
12  196.937  1.72e-43  <2e-16        0.967         <2e-16      =
13  254.244  6.19e-56  3.22e-14      1.08e-13      <2e-16      +
14  243.527  1.31e-53  2.1e-11       2.42e-14      <2e-16      +
15  258.061  9.18e-57  3.52e-14      5.73e-14      <2e-16      +
16  246.427  3.08e-54  2.5e-13       3.52e-14      <2e-16      +
17  248.29   1.21e-54  1.08e-13      3.72e-14      <2e-16      +
18  250.183  4.71e-55  5.21e-14      3.94e-14      <2e-16      +
19  242.794  1.9e-53   1.49e-12      3.22e-14      <2e-16      +
20  248.29   1.21e-54  1.08e-13      3.72e-14      <2e-16      +
21  239.282  1.1e-52   8.81e-12      2.94e-14      <2e-16      +
22  244.595  7.71e-54  6.07e-13      3.32e-14      <2e-16      +
23  256.046  2.51e-56  3.94e-14      5.04e-14      <2e-16      +
24  244.157  9.59e-54  9.5e-13       3.25e-14      <2e-16      +
25  241.023  4.6e-53   3.64e-12      3.04e-14      <2e-16      +
26  232.626  3.06e-51  2.63e-10      2.42e-14      <2e-16      +
27  235.893  5.98e-52  4.95e-11      2.7e-14       <2e-16      +
28  234.244  1.36e-51  1.15e-10      2.58e-14      <2e-16      +
29  235.142  8.7e-52   1.15e-10      2.58e-14      <2e-16      +
30  229.481  1.48e-50  1.31e-09      2.25e-14      <2e-16      +

Table 3: Results of the Friedman test with Nemenyi post-hoc for the RAM variable grouped by k. Legend: (+) SMT won. (-) SMT lost. (=) there was no significant difference between first place and the others.

k   χ²       p-value   KAN vs JELLY  SMT vs JELLY  SMT vs KAN  Result
5   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
6   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
7   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
8   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
9   262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
10  262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
11  262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
12  262      1.28e-57  <2e-16        3.94e-14      3.94e-14    -
13  262      1.28e-57  <2e-16        3.94e-14      3.94e-14    +
14  193.115  1.16e-42  0.0495        2.89e-14      <2e-16      +
15  258.061  9.18e-57  5.73e-14      <2e-16        3.52e-14    +
16  260.015  3.45e-57  4.59e-14      <2e-16        3.72e-14    +
17  260.015  3.45e-57  4.59e-14      <2e-16        3.72e-14    +
18  262      1.28e-57  3.94e-14      <2e-16        3.94e-14    +
19  260.015  3.45e-57  3.72e-14      <2e-16        4.59e-14    +
20  260.015  3.45e-57  3.72e-14      <2e-16        4.59e-14    +
21  256.137  2.4e-56   3.32e-14      <2e-16        5.21e-14    +
22  256.137  2.4e-56   3.32e-14      <2e-16        5.21e-14    +
23  258.061  9.18e-57  3.52e-14      <2e-16        5.73e-14    +
24  250.55   3.93e-55  2.94e-14      <2e-16        6.07e-13    +
25  245.237  5.59e-54  2.58e-14      <2e-16        8.81e-12    +
26  245.237  5.59e-54  2.58e-14      <2e-16        8.81e-12    +
27  238.58   1.56e-52  2.14e-14      <2e-16        2.63e-10    +
28  229.511  1.45e-50  3.2e-14       <2e-16        2.75e-08    +
29  229.511  1.45e-50  3.2e-14       <2e-16        2.75e-08    +
30  226.733  5.83e-50  2.83e-14      <2e-16        1.15e-07    +

5 CONCLUSIONS

The performance of algorithms in bioinformatics is of utmost importance, given that datasets in computational biology are often vast and complex. In our comparative analysis among the algorithms JELLYFISH, KANALYZE, and SMT, we observed significant differences in terms of temporal and spatial performance. The SMT algorithm stood out for its efficiency, demonstrating low execution times and RAM consumption across the test sets, with a statistically significant advantage for k from 13 to 30 (Tables 2 and 3).

This performance suggests that for projects with strin-

gent time and space constraints, SMT may be an ideal

choice. It is noteworthy that SMT is a data structure

recognized for its efﬁciency in processing and query-

ing strings. As a result, it beneﬁts greatly from non-

uniform distributions, a feature often observed in real-

world data. The non-uniformity frequently arises due

to various biological phenomena, and SMT’s ability to

capitalize on these irregularities further underscores

its suitability and robustness for handling biological

datasets. On the ﬂip side, JELLYFISH, although ex-

hibiting elevated variability in some cases, remains

a robust option, especially in contexts where consis-

tency in time consumption is important. KANALYZE,

despite its higher execution times, offers the ﬂexibil-

ity of being a Java-based tool, ﬁtting well into systems

seeking versatility in terms of hardware and software.

It is imperative to consider the speciﬁc context and

needs of each project when selecting an algorithm.

While efﬁciency is an important metric, adaptability

and versatility are also fundamental, especially in a

rapidly evolving ﬁeld like bioinformatics.

ACKNOWLEDGEMENTS

The authors would like to thank Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 - for the financial support given to this research.

REFERENCES

Archbold, J. and Johnson, N. (1958). A construction

for room’s squares and an application in experimen-

tal design. The Annals of Mathematical Statistics,

29(1):219–225.

Audano, P. and Vannberg, F. (2014). Kanalyze: a fast

versatile pipelined k-mer toolkit. Bioinformatics,

30(14):2070–2072.

Bailey, T. L., Johnson, J., Grant, C. E., and Noble, W. S.

(2015). The meme suite. Nucleic acids research,

43(W1):W39–W49.

Castro-Mondragon, J. A., Riudavets-Puig, R., Rauluseviciute, I., Berhanu Lemma, R., Turchi, L., Blanc-Mathieu, R., Lucas, J., Boddie, P., Khan, A., Manosalva Pérez, N., et al. (2022). Jaspar 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic acids research, 50(D1):D165–D173.

Chaisson, M. J., Sanders, A. D., Zhao, X., Malhotra, A.,

Porubsky, D., Rausch, T., Gardner, E. J., Rodriguez,

O. L., Guo, L., Collins, R. L., et al. (2019). Multi-

platform discovery of haplotype-resolved structural

variation in human genomes. Nature communications,

10(1):1784.

Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., and Danek, A. (2019). Kmer-db: instant evolutionary distance estimation. Bioinformatics, 35(1):133–136.

Goodwin, S., McPherson, J. D., and McCombie, W. R.

(2016). Coming of age: ten years of next-generation

sequencing technologies. Nature Reviews Genetics,

17(6):333–351.

Marçais, G. and Kingsford, C. (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770.

Marchet, C., Lecompte, L., Silva, C. D., Cruaud, C., Aury,

J.-M., Nicolas, J., and Peterlongo, P. (2019). De novo

clustering of long reads by gene from transcriptomics

data. Nucleic Acids Research, 47(1):e2–e2.

Wood, D. E., Lu, J., and Langmead, B. (2019). Improved

metagenomic analysis with kraken 2. Genome biol-

ogy, 20:1–13.
