Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles

Matteo Comin and Morris Antonello

Department of Information Engineering, University of Padova, Padova, Italy

Keywords:

Alignment-free, Sequence Comparison, Entropic Profiles.

Abstract:

Enhancers are stretches of DNA (100-1000 bp) that play a major role in development, gene expression, evolution and disease. It has recently been shown that in higher eukaryotes enhancers rarely work alone; instead, they collaborate by forming clusters of cis-regulatory modules (CRMs). Even though the binding of transcription factors is sequence-specific, identifying functionally similar enhancers is very difficult and cannot be carried out with traditional alignment-based techniques. In this paper we study the use of alignment-free measures for the classification of CRMs. However, alignment-free measures are generally tied to a fixed resolution k. Here we propose an alignment-free statistic based on multiple-resolution patterns derived from Entropic Profiles. An Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. We evaluate several alignment-free statistics on simulated data and real mouse ChIP-seq sequences. The new statistic is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods.

1 INTRODUCTION

Many articles (see e.g. Shlyueva et al., 2014) discuss recent views on enhancers, or cis-regulatory modules (CRMs), and their coordinated action in regulatory networks. Enhancers are stretches of DNA (100-1000 bp) that play a major role in development, gene expression, evolution and disease. Indeed, they can up-regulate, i.e. enhance, the transcription process. As a result, during animal development a single cell gives rise to a multitude of different cell types and organs, which acquire different morphologies and functions by expressing different sets of genes.

It is worthwhile summing up their main features. First, they contain short (6-15 bp) DNA motifs that act as transcription factor binding sites (TFBSs) and often tolerate different nucleotides at some of the binding positions; in other words, there may be word mismatches. Second, they act seemingly independently of the distance and orientation to their target genes, as a consequence of looping. It follows that the strand to which a CRM under study belongs is unknown, so both strands need to be considered. Third, they maintain their function independently of the sequence context: they are modular and contribute additively and partly redundantly to the overall expression pattern of their target genes. Finally, enhancers with similar transcription factor binding site content have a high probability of bearing the same function. Thus, the prediction and classification of enhancers can be addressed by similarity searches. However, the presence of multiple binding sites can make the localization of each enhancer very difficult. For these reasons, biologists first need to screen ChIP-seq datasets to select cell-specific regulatory sequences on the basis of their common content.

In this context, the idea of describing a sequence by its word content fits the model of CRMs very well, where we assume that a similar function is driven by the presence of different binding site contents (Comin and Verzotto, 2010; Comin and Verzotto, 2014). The comparison of sequences without an alignment, and thus based on word distributions, is usually referred to as alignment-free. Alignment-free methods have proved useful for comparing sequences in a variety of tasks (Foret et al., 2009; Comin et al., 2014; Comin and Verzotto, 2011; Comin and Schimd, 2014); see Vinga and Almeida (2003) for a comprehensive review. However, the major drawback of alignment-free measures is that they are all tied to the choice of the resolution k, which crucially influences performance but cannot be known in advance. In this paper we extend the idea of alignment-free measures to account for multiple resolutions. In particular, we will show that Entropic Profiles (Vinga and Almeida, 2007; Fernandes et al., 2009) pave the way to more robust but still efficient alignment-free methods.

Comin M. and Antonello M.

Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles.

DOI: 10.5220/0005251001710177

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2015), pages 171-177

ISBN: 978-989-758-070-3

© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

1.1 Previous Work on Alignment-free Measures

The common way to identify homologous sequences is sequence alignment, for which many algorithms have been proposed in the literature (Smith and Waterman, 1981; Altschul et al., 1990). Nevertheless, alignment algorithms are unsuitable for predicting and classifying enhancers through the matching of transcription factor binding sites, for several reasons (Vinga and Almeida, 2003; Song et al., 2014): 1) enhancer location and orientation do not matter, so no reliable alignment can be obtained; 2) they are time-consuming and inadequate for comparing sequences in realistically large datasets, e.g. large ChIP-seq datasets; and 3) enhancers do not work alone, and their coordinated action cannot be fully explored with a single alignment.

In contrast, alignment-free approaches provide viable alternatives (Vinga and Almeida, 2003; Song et al., 2014). With the aim of effectively summarizing sequence content, they are usually based on k-mer counts. Consider two genome sequences A and B, and let $A_w$ and $B_w$ be the frequencies of word w, of length k, in A and B.

Historically, $D_2$ (Blaisdell, 1986), see Formula 1, is one of the first proposed similarities and is defined as the inner product of the k-mer frequency vectors. Despite its simplicity and distance properties, $D_2$ can be dominated by the noise caused by the randomness of the background and has low statistical power to detect potential relationships. As a result, more powerful variants, $D_2^S$ and $D_2^*$ (Reinert et al., 2009), see Formulas 2 and 3, have been developed by standardizing the k-mer counts with their expectations and standard deviations. Let $\tilde{A}_w = A_w - (n-k+1)\,p_w$, and analogously $\tilde{B}_w$, where n is the sequence length and $p_w$ is the probability of w under the null model.

$$D_2 = \sum_{w \in \Sigma^k} A_w B_w \qquad (1)$$

$$D_2^S = \sum_{w \in \Sigma^k} \frac{\tilde{A}_w \tilde{B}_w}{\sqrt{\tilde{A}_w^2 + \tilde{B}_w^2}} \qquad (2)$$

$$D_2^* = \sum_{w \in \Sigma^k} \frac{\tilde{A}_w \tilde{B}_w}{(n-k+1)\,p_w} \qquad (3)$$
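The three statistics can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes a Bernoulli background whose letter probabilities are estimated from the two sequences pooled together, the function names are hypothetical, and the $D_2^*$ denominator is written as $\sqrt{n_A n_B}\,p_w$ (with $n_A$, $n_B$ the numbers of k-mer positions), which reduces to $(n-k+1)\,p_w$ when both sequences have the same length n.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(a, b, k, alphabet="ACGT"):
    """Return (D2, D2S, D2*) for sequences a and b under an i.i.d. background."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Assumption: letter probabilities are estimated from both sequences pooled.
    letters = Counter(a + b)
    total = sum(letters.values())
    freq = {c: letters[c] / total for c in alphabet}
    na, nb = len(a) - k + 1, len(b) - k + 1  # number of k-mer positions
    d2 = d2s = d2star = 0.0
    for word in map("".join, product(alphabet, repeat=k)):
        p = prod(freq[c] for c in word)      # p_w under the Bernoulli model
        x, y = ca[word], cb[word]
        d2 += x * y
        xt, yt = x - na * p, y - nb * p      # centred counts
        norm = sqrt(xt * xt + yt * yt)
        if norm > 0:
            d2s += xt * yt / norm
        if p > 0:
            d2star += xt * yt / (sqrt(na * nb) * p)
    return d2, d2s, d2star
```

Enumerating all words of $\Sigma^k$ is only practical for small k; an implementation for larger k would iterate over the observed words and handle the unobserved ones in closed form.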

These statistics have been used as a raw measure of similarity in a number of studies (Göke et al., 2012; Kantorovitz et al., 2007); however, a more formal computation of p-values is desirable (Foret et al., 2009). One of the best similarity measures is $N_2$ (Göke et al., 2012). $N_2$ aims at overcoming the limitation of exact word counts by taking into account word neighbourhood counts. $N_2$ is defined similarly to $D_2^*$, except that every word w is replaced with a set n(w) of words related to w, e.g. its reverse complement and words with mismatches.
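To make the notion of a word neighbourhood concrete, the following Python sketch builds one possible set n(w). The exact neighbourhood used by Göke et al. (2012) is not reproduced here; as an assumption for illustration, n(w) contains w, all words at Hamming distance one from w, and the reverse complements of all of these. The function names are hypothetical.

```python
def reverse_complement(word):
    """Reverse complement of a DNA word over the alphabet ACGT."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(word))


def neighbourhood(word, alphabet="ACGT"):
    """Illustrative n(w): w, its one-mismatch variants, and their reverse complements."""
    variants = {word}
    for i, original in enumerate(word):
        for c in alphabet:
            if c != original:
                variants.add(word[:i] + c + word[i + 1:])
    return variants | {reverse_complement(v) for v in variants}


def neighbourhood_count(counts, word):
    """Sum the counts of every word in n(w); counts maps k-mers to occurrences."""
    return sum(counts.get(v, 0) for v in neighbourhood(word))
```

Replacing each exact count with such a neighbourhood count makes the statistic tolerant to strand orientation and to single-nucleotide variation within binding sites.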

The major drawback of alignment-free measures is that they are all tied to the choice of the resolution k, which crucially influences performance but cannot be known in advance. In this paper we extend these alignment-free measures to account for multiple resolutions. In particular, we will show that Entropic Profiles pave the way to more robust but still efficient alignment-free methods.

1.2 Entropic Proﬁles

The concept of the Entropic Profile (EP) was introduced to analyze DNA sequences (Vinga and Almeida, 2007). An Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. This score is based on the Shannon entropies of the word distribution. The formal definition of entropic profiles (Vinga and Almeida, 2007; Fernandes et al., 2009) comes from the use of the chaos game representation (CGR) to estimate the sequence Rényi entropy on the basis of the Parzen window density estimation method. The EP is defined for every location i of the entire sequence S as:

$$f_{L,\varphi}(x_i) = \frac{1 + \frac{1}{l}\sum_{k=1}^{L} 4^k \varphi^k \cdot c([i-k+1, i])}{\sum_{k=0}^{L} \varphi^k} \qquad (4)$$

where l is the length of the entire sequence, L is the resolution, i.e. the k-mer length, $\varphi$ is a smoothing parameter, and $c([i-k+1, i])$ is the number of occurrences of $(x_{i-k+1} \ldots x_i)$, i.e. the suffix of length k that ends at position i. EP values are standardized with their arithmetic mean $m_{L,\varphi}$ and standard deviation $s_{L,\varphi}$:

$$EP_{L,\varphi}(x_i) = \frac{f_{L,\varphi}(x_i) - m_{L,\varphi}}{s_{L,\varphi}}, \quad \text{where} \qquad (5)$$

$$m_{L,\varphi} = \frac{1}{l}\sum_{i=1}^{l} f_{L,\varphi}(x_i) \qquad (6)$$

$$s_{L,\varphi} = \sqrt{\frac{1}{l-1}\sum_{i=1}^{l}\left(f_{L,\varphi}(x_i) - m_{L,\varphi}\right)^2} \qquad (7)$$
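A direct, quadratic-time Python sketch of the profile and its standardization is given below. It is only meant to illustrate the definition, assuming the form $f_{L,\varphi}(x_i) = \left(1 + \frac{1}{l}\sum_{k=1}^{L} 4^k \varphi^k c([i-k+1,i])\right) / \sum_{k=0}^{L}\varphi^k$ from Fernandes et al. (2009); it is not the linear-time algorithm of Comin and Antonello. The function name is hypothetical, and positions closer than k letters to the start of the sequence simply use the suffixes that fit.

```python
from collections import Counter
from statistics import mean, stdev


def entropic_profile(seq, L, phi):
    """Return (raw, standardized) profile values for every position of seq."""
    l = len(seq)
    # Occurrence counts of every substring of length 1..L.
    counts = Counter(seq[i:i + k] for k in range(1, L + 1)
                     for i in range(l - k + 1))
    denom = sum(phi ** k for k in range(L + 1))
    raw = []
    for i in range(l):  # i is the 0-based index of the last letter of the suffix
        total = 0.0
        for k in range(1, min(L, i + 1) + 1):
            suffix = seq[i - k + 1:i + 1]
            total += (4 ** k) * (phi ** k) * counts[suffix]
        raw.append((1 + total / l) / denom)
    m, s = mean(raw), stdev(raw)  # stdev divides by l - 1, as in Formula 7
    standardized = [(f - m) / s for f in raw] if s > 0 else [0.0] * l
    return raw, standardized
```

Positions whose suffixes are over-represented at several lengths at once receive the highest standardized values, which is what makes the profile a multi-resolution score.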

Entropic Profiles have proved useful for the discovery of patterns in genomes (Fernandes et al., 2009), and they can be computed efficiently in linear time and space (Comin and Antonello, 2013; Comin and Antonello, 2014). By definition, Entropic Profiles are based on multiple-resolution k-mer counts; thus, unlike almost all alignment-free measures, they are not tied to a fixed resolution k. Our intent is to extend this function to develop new alignment-free measures for the prediction and classification of enhancers.

2 METHOD: ENTROPIC PROFILES AS AN ALIGNMENT-FREE MEASURE

In order to establish a suitable alignment-free measure, we first need to study the statistical properties of Entropic Profiles. We can simplify the original Formula 4 and consider the main term, which we call the simple entropy $SE_w$ of a word $w = (w_1, \ldots, w_L)$ of length L:

$$SE_w = \frac{\sum_{k=1}^{L} a_k\, c_{w,k}}{\sum_{k=1}^{L} a_k} \qquad (8)$$

where $c_{w,k}$ is the number of occurrences of the k-mer suffix $s_{w,k}$ of w, and the weights $a_k$ have been generalized. Without loss of generality, the entire sequence $S = (X_1, X_2, \ldots, X_l)$ can be modeled by a stationary Markov chain (Robin et al., 2005), and the probability of a word can be denoted by $\mu(w)$. The expected entropy $E[SE_w]$ can be derived as:

$$E[SE_w] = E\left[\frac{\sum_{k=1}^{L} a_k\, c_{w,k}}{\sum_{k=1}^{L} a_k}\right] = \frac{\sum_{k=1}^{L} a_k\, E[c_{w,k}]}{\sum_{k=1}^{L} a_k}$$

where

$$E[c_{w,k}] = (l-k+1)\,\mu(s_{w,k})$$
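The simple entropy and its expectation can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the function names are hypothetical, the weights are passed in as a list $a_1, \ldots, a_L$, and the expectation is computed under a Bernoulli (M0) model, in which $\mu(s_{w,k})$ is simply the product of the letter probabilities of the suffix. A Markov background would replace that product with an initial probability times transition probabilities.

```python
from math import prod


def suffix_counts(seq, word):
    """c_{w,k} for k = 1..L: occurrences in seq of the length-k suffix of word."""
    counts = []
    for k in range(1, len(word) + 1):
        s = word[-k:]
        counts.append(sum(seq.startswith(s, i) for i in range(len(seq) - k + 1)))
    return counts


def simple_entropy(seq, word, weights):
    """SE_w: weighted average of the suffix counts of word in seq."""
    c = suffix_counts(seq, word)
    return sum(a * x for a, x in zip(weights, c)) / sum(weights)


def expected_simple_entropy(l, word, weights, letter_prob):
    """E[SE_w] for a sequence of length l under a Bernoulli (M0) background."""
    expected_counts = [
        (l - k + 1) * prod(letter_prob[ch] for ch in word[-k:])
        for k in range(1, len(word) + 1)
    ]
    return sum(a * e for a, e in zip(weights, expected_counts)) / sum(weights)
```

For example, in the sequence ACGACG the word CG has suffix counts 2 (for G) and 2 (for CG), so with equal weights its simple entropy is 2, against an expectation of about 0.91 under a uniform background.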

The variance $\mathrm{Var}[SE_w]$ is needed to take into account the dependence between entropies of overlapping words:

$$\mathrm{Var}[SE_w] = \mathrm{Var}\left[\frac{\sum_{k=1}^{L} a_k\, c_{w,k}}{\sum_{k=1}^{L} a_k}\right] = \frac{\sum_{k'=1}^{L}\sum_{k''=1}^{L} a_{k'}\, a_{k''}\, \mathrm{Cov}\left(c_{w,k'}, c_{w,k''}\right)}{\left(\sum_{k=1}^{L} a_k\right)^2}$$

where the derivation of the covariance of the counts is non-trivial. There are two cases to explore. If $k' = k'' \equiv k$, there is only one suffix of fixed length, and $\mathrm{Cov}(c_{w,k'}, c_{w,k''}) = \mathrm{Var}[c_{w,k}]$. Otherwise, if $s_{w,k'} \neq s_{w,k''}$, one word is the suffix of the other. Due to space limitations, here we consider only the first case by extending (Robin et al., 2005); the exact formula for the second case will be provided in the full version of this paper. In order to derive $\mathrm{Var}[c_{w,k}]$ we need to consider three terms, which respectively take into account: 1) the complete overlap of the word with itself; 2) partial self-overlap, i.e. the suffix of the word with its prefix or vice versa; 3) disjoint occurrences. Formally:

$$\begin{aligned}
\mathrm{Var}[c_{w,k}] = {} & (l-k+1)\,\mu(w)\,(1-\mu(w)) \\
& + 2\mu(w)\sum_{d=1}^{k-1}(l-k-d+1)\left[\varepsilon_{k-d}(w)\prod_{j=k-d+1}^{k}\pi(w[j-1], w[j]) - \mu(w)\right] \\
& + 2\mu^2(w)\sum_{t=1}^{l-2k+1}(l-2k-t+2)\left[\frac{\pi^{t}(w[k], w[1])}{\mu(w[1])} - 1\right]
\end{aligned}$$

where $\pi(a, b)$ is the transition probability of the Markov chain, $\varepsilon_u(w)$ is the asymmetric overlap indicator

$$\varepsilon_u(w) = \begin{cases} 1 & \text{if } w[k-u+1 \ldots k] = w[1 \ldots u] \\ 0 & \text{otherwise} \end{cases}$$

and $t = d - k + 1$, and $\pi^{t}(w[k], w[1])$ is the probability that the last letter of w is separated from an occurrence of w[1] by t − 1 letters.
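The structure of the variance formula is easiest to see in the Bernoulli (M0) special case, sketched below in Python. This is a simplification chosen for illustration, not the general Markov computation: under independence the product of transition probabilities becomes a product of letter probabilities, and the third term (disjoint occurrences) vanishes because $\pi^{t}(w[k], w[1]) = \mu(w[1])$. The function names are hypothetical; the brute-force helper enumerates all sequences over a uniform alphabet and is only usable for very small l.

```python
from itertools import product


def overlap_indicator(word, u):
    """epsilon_u(w): 1 if the last u letters of w equal its first u letters, else 0."""
    return 1 if word[len(word) - u:] == word[:u] else 0


def count_variance_m0(word, l, letter_prob):
    """Variance of the occurrence count of word in an i.i.d. sequence of length l."""
    k = len(word)
    mu = 1.0
    for ch in word:
        mu *= letter_prob[ch]
    # Term 1: each of the l - k + 1 positions taken on its own.
    var = (l - k + 1) * mu * (1 - mu)
    # Term 2: pairs of occurrences overlapping with shift d = 1..k-1.
    for d in range(1, k):
        pairs = l - k - d + 1
        if pairs <= 0:
            break
        tail = 1.0
        for ch in word[k - d:]:  # the d letters contributed by the shifted copy
            tail *= letter_prob[ch]
        var += 2 * mu * pairs * (overlap_indicator(word, k - d) * tail - mu)
    # Term 3 (disjoint occurrences) is zero under independence and is omitted.
    return var


def brute_force_variance(word, l, alphabet):
    """Exact variance by enumerating every sequence (uniform letters, tiny l only)."""
    k = len(word)
    counts = [sum(s[i:i + k] == word for i in range(l - k + 1))
              for s in map("".join, product(alphabet, repeat=l))]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)
```

Comparing the closed form with the enumeration on a two-letter alphabet is a quick way to check the overlap terms, since self-overlapping words such as AA and non-overlapping words such as AC behave differently.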

2.1 New Alignment-free Measures Derived from Entropic Profiles

Entropies and counts are very much alike; this suggests that state-of-the-art measures can be adapted by replacing the vector of k-mer counts with the vector of entropies. Consider two genome sequences A and B, and let $A_w$ and $B_w$ be the entropies of word w in A and B. We can redefine classical alignment-free measures as:

$$EP_2 = \sum_{w \in \Sigma^L} A_w B_w \qquad (9)$$

$$EP_2^* = \sum_{w \in \Sigma^L} \frac{(A_w - E[A_w])\,(B_w - E[B_w])}{\sqrt{\mathrm{Var}[A_w]}\,\sqrt{\mathrm{Var}[B_w]}} \qquad (10)$$

While the implementation of $EP_2$ is straightforward, $EP_2^*$ is instead based on the statistical properties of entropies. The theory developed in the previous section is preliminary to the implementation of $EP_2^*$. Note that, similarly to $N_2$, the background model is estimated separately for every sequence; this can cut down computational costs. Moreover, Entropic Profiles, expectations and variances can be computed in linear time and space by adapting the implementation in (Comin and Antonello, 2014). Thus $EP_2^*$ can be computed as efficiently as many other alignment-free measures.
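Once the per-word entropies, expectations and variances are available, the two measures reduce to simple sums. The Python sketch below assumes they are supplied as dictionaries keyed by word and sharing the same key set; the function names and this data layout are hypothetical choices for illustration, and words with zero variance in either sequence are skipped to avoid a division by zero.

```python
from math import sqrt


def ep2(ent_a, ent_b):
    """Inner product of two entropy vectors (dicts mapping word -> entropy)."""
    return sum(ent_a[w] * ent_b[w] for w in ent_a)


def ep2_star(ent_a, ent_b, mean_a, mean_b, var_a, var_b):
    """Standardized variant: centred entropies divided by their standard deviations."""
    total = 0.0
    for w in ent_a:
        denom = sqrt(var_a[w] * var_b[w])
        if denom > 0:  # assumption: skip words whose variance is zero
            total += (ent_a[w] - mean_a[w]) * (ent_b[w] - mean_b[w]) / denom
    return total
```

Because the expectations and variances are looked up per sequence, each sequence can carry its own background model, in line with the remark above about estimating the background separately.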

FastAlignment-freeComparisonforRegulatorySequencesusingMultipleResolutionEntropicProfiles

173

3 EXPERIMENTAL RESULTS

This section deals with the testing procedures used to study the statistical power of the proposed multi-resolution sequence similarity measures. The experimental setup is the same as in (Kantorovitz et al., 2007) and (Liu et al., 2011). In each experiment two equally sized sets of sequences, named the negative and the positive set, are built. Sequences in the former are dissimilar, while those in the latter are similar. The positive predictive value (PPV) is evaluated in two steps: 1) similarity scores are computed for each pair of sequences in the two sets; 2) the similarity scores are sorted in descending order, and the PPV is the fraction of pairs of sequences from the positive set in the first half of the ranking. The best PPV is 1 and means a perfect separation between the negative and positive sets, while a PPV close to 0.5 implies no statistical power. Performance depends on the choice of the background model, the k-mer length and the standard deviation σ of the Gaussian kernel, which is centered at k = L, i.e. $a_k = e^{-\frac{(L-k)^2}{2\sigma^2}}$. The choice of the background model can be so crucial that different measures have to be compared without changing it. For this reason, the results are mainly presented for the pair of similarity measures $EP_2^*$ and $N_2$, both of which estimate it on the single sequences.
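The PPV evaluation described above can be sketched in Python as follows. The function name is hypothetical, and the inputs are assumed to be the lists of pairwise similarity scores already computed for the positive and the negative set; ties are resolved by the stable sort, which is an arbitrary choice.

```python
def positive_predictive_value(positive_scores, negative_scores):
    """Fraction of positive-set pairs among the top-scoring half of all pairs."""
    labelled = [(s, 1) for s in positive_scores] + [(s, 0) for s in negative_scores]
    labelled.sort(key=lambda item: item[0], reverse=True)  # descending by score
    top_half = labelled[:len(labelled) // 2]
    return sum(label for _, label in top_half) / len(top_half)
```

With equally sized sets, a measure that ranks every positive pair above every negative pair yields a PPV of 1, while a measure that interleaves them at random yields a PPV near 0.5.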

3.1 Implanted Motifs on Drosophila Genome

In this simulation study, the sequences in the negative set are randomly picked from a real genome, while those in the positive set are built by implanting some motifs in those of the negative set. As in (Comin and Verzotto, 2014), we chose the intergenic sequences of the Drosophila genome (downloadable from FlyBase, http://flybase.org/dmel-all-intergenic-r5.49.fasta).

Patterns can be artificially implanted via the pattern transfer model (Reinert et al., 2009) or its revised version (Comin and Verzotto, 2014), with the aim of mimicking the exchange of genetic material. Under the former model only strings of the same length, e.g. 5, are implanted, whereas under the latter strings of different lengths, e.g. 4, 5 and 6, are implanted as well.
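The implanting step can be illustrated with a simplified Python sketch. It is not the exact pattern transfer model of the cited papers: as an assumption for illustration, each position starts an implant with a fixed insertion probability, the motif is drawn uniformly from a list (of equal or different lengths), it overwrites the background letters, and the scan resumes after the implanted motif so that implants do not overlap. The function name and parameters are hypothetical.

```python
import random


def implant_motifs(seq, motifs, prob, rng=None):
    """Overwrite parts of seq with motifs, starting one with probability prob per position."""
    rng = rng or random.Random()
    out = list(seq)
    i = 0
    while i < len(out):
        if rng.random() < prob:
            motif = rng.choice(motifs)
            if i + len(motif) <= len(out):   # only implant if the motif fits
                out[i:i + len(motif)] = motif
                i += len(motif)              # skip past the implant
                continue
        i += 1
    return "".join(out)
```

Passing motifs of a single length mimics the original model, while a list such as ["ACGT", "ACGTA", "ACGTAC"] mimics the revised one, in which similar motifs of different lengths share prefixes or suffixes.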

The goal of the first experiment is to assess the influence of the background model, so as to use the best one in the next tests. It was performed varying many parameters, such as implanted motifs, insertion probability, sequence length and k-mer length. Generally, the Markov model M1 outperforms the Bernoulli model M0. This is illustrated by Figure 1, which shows performance as a function of background model and k-mer length. In this example, only one motif of length 6 has been implanted, the insertion probability has been set to 0.004, the sequence length to 2000 and the standard deviation to 0.5. Before passing to the next test, it is also worth noting that $EP_2^*$ performs better than $N_2$ if the k-mer length is overestimated, i.e. k > 6, as a consequence of the multi-resolution property of entropic profiles. Of course, this effect depends on the standard deviation of the Gaussian kernel. Figure 2 shows the influence of the standard deviation when implanting many motifs of average length 5 on a random background; in this example the sequence length is 500 and the insertion probability is 0.01. A higher standard deviation positively impacts performance when the k-mer length is overestimated, because high values of the standard deviation give shorter motifs bigger weights. To exemplify the idea, if the standard deviation is 1.5, the four biggest weights are 1, 0.80, 0.41 and 0.13 and performance is influenced, while if the standard deviation is 0.1, the Gaussian bell is so thin that $EP_2^*$ is equivalent to $N_2$.

Figure 1: Background model M1 outperforms M0.

Figure 2: PPV as a function of standard deviation and k-mer length.
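The effect of the standard deviation on the weights can be checked numerically. The short Python sketch below assumes the Gaussian kernel $a_k = e^{-(L-k)^2 / (2\sigma^2)}$ centered at k = L; the function name is hypothetical. With σ = 1.5 it gives weights close to the values 1, 0.80, 0.41 and 0.13 quoted above, and with σ = 0.1 every weight except $a_L$ is negligible.

```python
from math import exp


def gaussian_weights(L, sigma):
    """Weights a_1..a_L of a Gaussian kernel centered at k = L."""
    return [exp(-((L - k) ** 2) / (2 * sigma ** 2)) for k in range(1, L + 1)]


# The last entries correspond to k = L, L-1, L-2, ... (largest weights last).
weights_wide = gaussian_weights(6, 1.5)
weights_thin = gaussian_weights(6, 0.1)
```

A wide kernel therefore lets the counts of shorter suffixes contribute to the score, which is why an overestimated k-mer length is less harmful when σ is larger.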

Considering our limited knowledge of regulatory sequences (Göke et al., 2012), it is interesting to evaluate performance when implanting similar motifs of different lengths via the more realistic revised pattern transfer model, where similar means having common substrings, e.g. suffixes and prefixes. To this end, we have performed many experiments varying both k-mer and sequence length. Figure 3 shows the results when the sequence length is 4000, the insertion probability is 0.008 and the standard deviation is 0.6. $EP_2^*$ outperforms $N_2$ and both variants of $D_2$, which do not take into account the statistical properties of counts or entropies. The peak is at k-mer length 5, which is the value selected for Figure 4; this figure shows that these results also hold when varying the sequence length. Performance does not tend to increase with the length of the sequence, even though the number of implanted motifs also increases, because sequences are taken from different parts of the genome, which might have different statistical properties.

Figure 3: PPV as a function of k-mer length and method.

Figure 4: PPV as a function of entire sequence length and method.

3.2 Comparison of Mouse Regulatory Sequences

This series of experiments involves neither artificial enhancers nor implanted transcription factor binding sites. The positive set is built from ChIP-seq data of real enhancers, which have already been identified in a genome-wide manner using the co-activator protein p300 (Visel et al., 2009; Blow et al., 2010). More precisely, it consists of sequences of length between 350 and 1000 randomly picked from tissue-specific enhancers of mouse embryos active in one of the following tissues: forebrain, midbrain, limb or heart. As a result of their limited size, the Bernoulli model behaves better than higher-order Markov models, which lead to over-fitting by exaggerating minor fluctuations in the data and thus to poor predictive performance.

In the first experiment, the negative set contains sequences taken at random from the mouse genome, which is downloadable from Ensembl (http://www.ensembl.org/, Mus_musculus.GRCm38.75.dna.toplevel.fa). The number of sequences per set is 20 and the results are averaged over 10 runs. Given that no artificial motif is implanted, which implies that the best motif length is unknown and depends on the tissue, the chosen standard deviation is 0.7, so that short motifs have bigger weights. The purpose is to take advantage of the multi-resolution property. The results in Tables 1 and 2 show that $EP_2^*$ is better than $N_2$ for different k-mer lengths.

Table 1: Average PPV with background model M0, k-mer length 4, standard deviation 0.7.

Tissue       $EP_2^*$   $N_2$
Limb         0.76       0.75
Forebrain    0.74       0.71
Midbrain     0.69       0.69
Heart        0.70       0.69
Average      0.72       0.71

Table 2: Average PPV with background model M0, k-mer length 7, standard deviation 0.7.

Tissue       $EP_2^*$   $N_2$
Limb         0.72       0.68
Forebrain    0.66       0.62
Midbrain     0.67       0.64
Heart        0.67       0.62
Average      0.68       0.64

The previous test shows that tissue-specific enhancers have similar word content. However, the comparison with random genomic sequences can be biased by the technology, e.g. when it is more likely to extract sequences with high or similar GC-content, as already described in (Comin and Verzotto, 2014) and (Göke et al., 2012). To avoid this bias, different ChIP-seq sequences are compared with each other. In other words, the positive set contains the enhancers active in one of the tissues, while the negative set contains the enhancers active in all the others. This is a much more challenging test, which can be used by biologists to select enhancers that drive a similar expression pattern.

Table 3: Average PPV with background model M1, L = 4, σ = 0.7.

Tissue       $EP_2^*$   $N_2$
Limb         0.64       0.63
Forebrain    0.60       0.55
Midbrain     0.51       0.49
Heart        0.59       0.59
Average      0.59       0.57

The results are averaged over 10 runs, the number of sequences per set is 35 and the standard deviation is 0.7 as before. The results in Tables 3 and 4 show that $EP_2^*$ is slightly better than $N_2$ for different k-mer lengths. Higher performance may be obtained by enforcing a maximum amount of repetitive sequence in every negative sample, as done in (Göke et al., 2012). Although the PPV values decrease compared to the previous tables, these later experiments confirm that similar tissue-specific enhancers have a higher sequence similarity, and thus they can be detected with alignment-free methods.

Table 4: Average PPV with background model M1, L = 7, σ = 0.7.

Tissue       $EP_2^*$   $N_2$
Limb         0.55       0.53
Forebrain    0.56       0.53
Midbrain     0.48       0.49
Heart        0.53       0.53
Average      0.53       0.52

4 CONCLUSIONS

In this paper we studied the use of alignment-free measures to detect functional and/or evolutionary similarities among regulatory sequences. We introduced a multiple-resolution alignment-free method based on Entropic Profiles that is designed around the use of variable-length words combined with statistical properties based on Information Theory. To evaluate the performance of several alignment-free methods, we devised a series of tests on both synthetic and real data. In almost all simulations our method outperforms all other statistics. Importantly, $EP_2^*$ is also able to detect similarities between in vivo identified enhancer sequences, e.g. of mouse. This will help to better understand the sequence-dependent code within CRMs, which is responsible for the large diversity of cell types.

ACKNOWLEDGEMENTS

M. Comin was partially supported by the P.R.I.N.

Project 20122F87B2.

REFERENCES

Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman,

D. (1990). Basic local alignment search tool. J. Mol.

Biol., 215:403–410.

Blaisdell, B. (1986). A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl Acad. Sci., 83:5155–5159.

Blow, M. et al. (2010). Chip-seq identiﬁcation of

weakly conserved heart enhancers. Nature Genetics,

42(9):806–810.

Comin, M. and Antonello, M. (2013). Fast computation of entropic profiles for the detection of conservation in genomes. In Proceedings of Pattern Recognition in Bioinformatics, Lecture Notes in Bioinformatics (LNBI), volume 7986, pages 277–288.

Comin, M. and Antonello, M. (2014). Fast entropic pro-

ﬁler: An information theoretic approach for the dis-

covery of patterns in genomes. IEEE/ACM Transac-

tions on Computational Biology and Bioinformatics,

11(3):500 – 509.

Comin, M., Leoni, A., and Schimd, M. (2014). Qcluster:

Extending alignment-free measures with quality val-

ues for reads clustering. Algorithms in Bioinformatics,

Lecture Notes in Computer Science, 8701:1–13.

Comin, M. and Schimd, M. (2014). Assembly-free

genome comparison based on next-generation se-

quencing reads and variable length patterns. BMC

Bioinformatics, 15(Suppl 9):S1.

Comin, M. and Verzotto, D. (2010). Classiﬁcation of pro-

tein sequences by means of irredundant patterns. BMC

bioinformatics, 11(Suppl 1):S16.

Comin, M. and Verzotto, D. (2011). The irredundant

class method for remote homology detection of pro-

tein sequences. Journal of Computational Biology,

18(12):1819–1829.

Comin, M. and Verzotto, D. (2014). Beyond ﬁxed-

resolution alignment-free measures for mammalian

enhancers sequence comparison. IEEE/ACM Trans-

actions on Computational Biology and Bioinformat-

ics, 11(4):628–637.

Fernandes, F., Freitas, A., Almeida, J., and Vinga, S.

(2009). Entropic proﬁler - detection of conservation

in genomes using information theory. BMC research

notes, 2:72.

Foret, S., Wilson, S., and Burden, C. (2009). Characterising

the d2 statistic: word matches in biological sequences.

Stat. Appl. Genet. Mol. Biol., 8(43).

Göke, J., Schulz, M., Lasserre, J., and Vingron, M. (2012). Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics, 28(5):656–663.

Kantorovitz, M., Robinson, G., and Sinha, S. (2007). A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics, 23(13):i249–i255.

Liu, X., Wan, L., Reinert, G., Waterman, M., Sun, F., and Li, J. (2011). New powerful statistics for alignment-free sequence comparison under a pattern transfer model. Journal of Theoretical Biology, 284(1):106–116.

Reinert, G., Chew, D., Sun, F., and Waterman, M. S.

(2009). Alignment-free sequence comparison (i):

statistics and power. Journal of Computational Biol-

ogy, 16(12):1615–1634.

Robin, S. et al. (2005). DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press.

Shlyueva, D., Stampfel, G., and Stark, A. (2014). Tran-

scriptional enhancers: from properties to genome-

wide predictions. Nature Reviews Genetics, 15:272

– 286.

Smith, T. and Waterman, M. (1981). Comparison of biose-

quences. Adv. Appl. Math., 2:482–489.

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

176

Song, K., Ren, J., Reinert, G., Deng, M., Waterman,

M. S., and Sun, F. (2014). New developments

of alignment-free sequence comparison: measures,

statistics and next-generation sequencing. Brief Bioin-

form, 15(3):343–353.

Vinga, S. and Almeida, J. (2003). Alignment-free sequence comparison - a review. Bioinformatics, 19(4):513–523.

Vinga, S. and Almeida, J. S. (2007). Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics, 8:393.

Visel, A. et al. (2009). Chip-seq accurately pre-

dicts tissue-speciﬁc activity of enhancers. Nature,

457(7231):854–858.

FastAlignment-freeComparisonforRegulatorySequencesusingMultipleResolutionEntropicProfiles

177