ON ORDER EQUIVALENCES BETWEEN DISTANCE AND SIMILARITY MEASURES ON SEQUENCES AND TREES

Martin Emms and Hector-Hugo Franco-Penya

School of Computer Science and Statistics, Trinity College, Dublin, Ireland

Keywords:

Similarity, Distance, Tree, Sequence.

Abstract:

Both ’distance’ and ’similarity’ measures have been proposed for the comparison of sequences and for the

comparison of trees, based on scoring mappings, and the paper concerns the equivalence or otherwise of these.

These measures are usually parameterised by an atomic ’cost’ table, deﬁning label-dependent values for swaps,

deletions and insertions. We look at the question of whether orderings induced by a ’distance’ measure, with

some cost-table, can be dualized by a ’similarity’ measure, with some other cost-table, and vice-versa. Three

kinds of orderings are considered: alignment-orderings, for fixed source S and target T; neighbour-orderings, where for a fixed S, varying candidate neighbours $T_i$ are ranked; and pair-orderings, where for varying $S_i$ and varying $T_j$, the pairings $\langle S_i, T_j \rangle$ are ranked. We show that (1) alignment-orderings by distance can be dualized by similarity, and vice-versa; (2) neighbour-ordering and pair-ordering by distance can be dualized by similarity; (3) neighbour-ordering and pair-ordering by similarity can sometimes not be dualized by distance.

A consequence of this is that there are categorisation and hierarchical clustering outcomes which can be

achieved via similarity but not via distance.

1 TREE DISTANCE AND SIMILARITY

In many pattern-recognition scenarios the data either

takes the form of, or can be encoded as, sequences or

trees. Accordingly, there has been much work on the

deﬁnition, implementation and deployment of mea-

sures for the comparison of sequences and for the

comparison of trees.

These measures are sometimes described as 'distances' and sometimes as 'similarities'. In what follows we first distinguish between these, and then address the question whether orderings induced by a 'distance' measure can be dualized by a 'similarity' measure, and vice-versa. To some extent this can be seen as applying the same kind of analysis to sequence and tree comparison measures as has been applied to set and vector comparison measures (Batagelj and Bren, 1995; Omhover et al., 2005; Lesot and Rifqi, 2010).

From statements such as the following

To compare RNA structures, we need a score

system, or alternatively a distance, which

measures the similarity (or the difference) be-

tween the structures. These two versions of

the problem score and distance are equivalent.

(Herrbach et al., 2006)

which are not uncommon in the literature (Alves

et al., 2002; Kondrak, 2003; Bose and van der Aalst,

2009), it would be easy to gain the impression that

similarity and distance (on sequences and trees) are

straightforwardly interchangeable notions. In sec-

tion 1.1 several distinct kinds of equivalence are de-

ﬁned. Sections 2, 3.1 and 3.2 then show that while

some kinds of equivalence hold, others do not.

To begin we need to clarify what we will mean

by ’distance’ and ’similarity’ on sequences and trees.

Because sequences can be encoded as vertical trees it

suffices to give definitions for trees. Tai first proposed a tree-distance measure (Tai, 1979). Where S and T are ordered, labelled trees, a Tai mapping $\alpha : S \mapsto T$ is a partial, 1-to-1 function from the nodes of S into the nodes of T, which respects left-to-right order and ancestry. (Footnote: so if $(i,j)$ and $(i',j')$ are in the mapping then (T1) $\mathit{left}(i,i')$ iff $\mathit{left}(j,j')$ and (T2) $\mathit{anc}(i,i')$ iff $\mathit{anc}(j,j')$.) For the purpose of assigning a score to such a mapping it is convenient to identify three sets:

M: the $(i,j) \in \alpha$: the 'matches' and 'swaps'
D: the $i \in S$ s.t. $\forall j \in T, (i,j) \notin \alpha$: the 'deletions'
I: the $j \in T$ s.t. $\forall i \in S, (i,j) \notin \alpha$: the 'insertions'

Thus M just is the mapping, as a set of node pairs, and D and I are just the remaining nodes of S and T which are not 'touched' by the mapping. Let $(\cdot)^{\gamma}$ give the label of a node and let $C^{\Delta}$ be a 'cost' table, indexed by $\{\lambda\} \cup \Sigma$, where $\Sigma$ is the alphabet of labels, which assigns 'costs' to M, D and I according to

for $(i,j) \in M$ the cost is $C^{\Delta}(i^{\gamma}, j^{\gamma})$
for $i \in D$ the cost is $C^{\Delta}(i^{\gamma}, \lambda)$
for $j \in I$ the cost is $C^{\Delta}(\lambda, j^{\gamma})$

Emms M. and Franco-Penya H. (2012). ON ORDER EQUIVALENCES BETWEEN DISTANCE AND SIMILARITY MEASURES ON SEQUENCES AND TREES. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pages 15-24. DOI: 10.5220/0003712500150024. SciTePress.

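Where $\alpha : S \mapsto T$ is any mapping from S to T, define $\Delta(\alpha : S \mapsto T)$ by

Definition 1. ('Distance' Scoring of an Alignment).

$$\Delta(\alpha : S \mapsto T) = \sum_{(i,j) \in M} C^{\Delta}(i^{\gamma}, j^{\gamma}) + \sum_{i \in D} C^{\Delta}(i^{\gamma}, \lambda) + \sum_{j \in I} C^{\Delta}(\lambda, j^{\gamma})$$

where $i^{\gamma}$ denotes the label of node $i$.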
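As an illustration of Definition 1, the following minimal sketch scores a given mapping. The encoding is an assumption made for the example only: trees are reduced to dictionaries from node identifiers to labels, the empty label λ is represented by `None`, and the names `distance_score` and `unit_cost` are hypothetical. The mapping is assumed to be a valid Tai mapping already; the order and ancestry conditions are not checked.

```python
LAMBDA = None  # assumed encoding of the empty label (lambda)


def unit_cost(x, y):
    """Unit-cost table: 0 for identical labels, 1 for swaps, deletions, insertions."""
    return 0 if x == y else 1


def distance_score(s_labels, t_labels, mapping, cost):
    """Sum the costs of matches/swaps (M), deletions (D) and insertions (I).

    s_labels, t_labels: dict node-id -> label.
    mapping: set of (i, j) node pairs, assumed to be a valid Tai mapping.
    """
    touched_s = {i for i, _ in mapping}
    touched_t = {j for _, j in mapping}
    m_cost = sum(cost(s_labels[i], t_labels[j]) for i, j in mapping)
    d_cost = sum(cost(s_labels[i], LAMBDA) for i in s_labels if i not in touched_s)
    i_cost = sum(cost(LAMBDA, t_labels[j]) for j in t_labels if j not in touched_t)
    return m_cost + d_cost + i_cost


# Two small sequences encoded as node-id -> label dictionaries.
s_labels = {0: "a", 1: "b"}
t_labels = {0: "a", 1: "c", 2: "b"}
print(distance_score(s_labels, t_labels, {(0, 0), (1, 2)}, unit_cost))  # 1
print(distance_score(s_labels, t_labels, {(0, 0), (1, 1)}, unit_cost))  # 2
```

The first mapping matches both nodes of the source and leaves one target node as an insertion; the second swaps b for c and inserts the final b.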

From this costing of alignments, a 'distance' score on tree pairs is defined by minimization:

Definition 2. ('Distance' Scoring of a Tree Pair). The Tree- or Tai-distance $\Delta(S,T)$ between two trees S and T is the minimum value of $\Delta(\alpha : S \mapsto T)$ over possible Tai-mappings from S to T, relative to a chosen cost table $C^{\Delta}$.

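There is an illustration of the definitions in Figure 1.

[Figure 1 (tree diagram not reproduced): an alignment between two trees with nodes labelled a and b. With $C^{\Delta}(x,\lambda) = C^{\Delta}(\lambda,x) = 1$, $C^{\Delta}(x,x) = 0$, $C^{\Delta}(x,y) = 1$ for $x \neq y$, the alignment has score $\Delta(\alpha) = 3$ and this is minimal for the given $C^{\Delta}$.]

Figure 1: An illustration of tree distance.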

∆(S, T) can be computed by the algorithm of (Zhang

and Shasha, 1989). Sequences can be encoded as ver-

tical trees, and on this domain of trees the tree dis-

tance coincides with a well known comparison mea-

sure on sequences, the (alphabet-weighted) string edit

distance (Wagner and Fischer, 1974; Gusﬁeld, 1997).
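For the sequence case, the minimisation of Definition 2 can be sketched with the standard dynamic programme for alphabet-weighted string edit distance. This is a minimal illustration for sequences only, not the Zhang and Shasha tree algorithm; the function names and the use of `None` for λ are assumptions of the sketch.

```python
LAMBDA = None  # assumed encoding of the empty label (lambda)


def unit_cost(x, y):
    """Unit-cost table: 0 for identical labels, 1 otherwise."""
    return 0 if x == y else 1


def distance(s, t, cost):
    """Minimum summed cost over all order-preserving alignments of s and t."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # delete a prefix of s
        d[i][0] = d[i - 1][0] + cost(s[i - 1], LAMBDA)
    for j in range(1, m + 1):          # insert a prefix of t
        d[0][j] = d[0][j - 1] + cost(LAMBDA, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),  # match or swap
                d[i - 1][j] + cost(s[i - 1], LAMBDA),        # deletion
                d[i][j - 1] + cost(LAMBDA, t[j - 1]),        # insertion
            )
    return d[n][m]


print(distance("aa", "aaa", unit_cost))          # 1.0
print(distance("kitten", "sitting", unit_cost))  # 3.0
```

Any cost function of two labels can be supplied in place of `unit_cost`, which is how the label-dependent tables $C^{\Delta}$ of the text would enter.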

We have formulated the definition in terms of costs applied to mappings which respect tree-ordering properties. In contrast to this declarative perspective, there is a procedural definition via the notion of an edit-script of atomic operations transforming S to T in a succession of stages. For both sequences and trees the mapping-based and script-based notions coincide (Wagner and Fischer, 1974; Tai, 1979; Kuboyama, 2007) and so we omit further details of the definition via edit-scripts.

Footnote: Note in this general setting even a pairing of two nodes with identical labels can in principle make a non-zero cost contribution.

Footnote: The literature contains quite a number of inequivalent notions, all referred to as 'tree distance'; in this article Definition 2 will be understood to define the term.

While the correctness of the Tai 'distance' algorithm (Zhang and Shasha, 1989), i.e. that it truly finds the minimal value of $\Delta(\alpha : S \mapsto T)$ given cost-table $C^{\Delta}$, does not require the cost-table $C^{\Delta}$ to satisfy any particular properties, some settings of $C^{\Delta}$ clearly make little sense. The combination of deletion/insertion cost-entries which are negative ($C^{\Delta}(x,\lambda) < 0$, $C^{\Delta}(\lambda,y) < 0$) with swap/match cost entries which are not negative gives the counter-intuitive effect that a supertree of S is 'closer' (in the sense of having a lower $\Delta$ score) to S than S itself. This is a rationale for the following non-negativity assumption

$$\forall x,y \in \Sigma \; (C^{\Delta}(x,y) \geq 0,\; C^{\Delta}(x,\lambda) \geq 0,\; C^{\Delta}(\lambda,y) \geq 0) \qquad (1)$$

which is an almost universal assumption, and from which it follows that $\Delta(S,T) \geq 0$, giving a minimum consistency with the everyday notion of 'distance'. In what follows we will confine attention to 'distance' $\Delta$ based on a table $C^{\Delta}$ which satisfies at least (1).

When the cost-table $C^{\Delta}(x,y)$ is constrained more strictly than this to satisfy all the conditions of a distance-metric, then it is well known that $\Delta(S,T)$ will also be a distance-metric. Whether such further restriction is desirable is moot: in so-called stochastic variants (Ristad and Yianilos, 1998; Bernard et al., 2008; Emms, 2010), in which the entries in $C^{\Delta}$ are interpreted as negated logs of probabilities, these additional distance-metric assumptions are not fulfilled. In this article we shall only assume the cost-table $C^{\Delta}$ satisfies the non-negativity requirement of (1).

Turning now to ’similarity’, rather than approach

the problem of comparison by minimizing accumu-

lated costs assigned to an alignment, a widely fol-

lowed alternative, especially for sequence compari-

son, has been to maximize a score assigned to an

alignment, with swaps/matches rewarded, and dele-

tions/insertions punished.

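Let $C^{\Theta}$ be a 'similarity' table, again indexed by $\{\lambda\} \cup \Sigma$, where $\Sigma$ is the alphabet of labels, and where $\alpha : S \mapsto T$ is any mapping from S to T, let $\Theta(\alpha : S \mapsto T)$ be defined by

Definition 3. ('Similarity' Scoring of an Alignment).

$$\Theta(\alpha : S \mapsto T) = \sum_{(i,j) \in M} C^{\Theta}(i^{\gamma}, j^{\gamma}) - \sum_{i \in D} C^{\Theta}(i^{\gamma}, \lambda) - \sum_{j \in I} C^{\Theta}(\lambda, j^{\gamma})$$

where $i^{\gamma}$ denotes the label of node $i$.

From this scoring of alignments, a 'similarity' score on tree pairs is defined by maximisation: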

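Footnote (to the earlier remark that, with negative deletion/insertion entries in $C^{\Delta}$, a supertree of S is 'closer' to S than S itself): or a subtree.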

ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods

Definition 4. ('Similarity' Scoring of a Tree Pair). The Tree- or Tai-similarity $\Theta(S,T)$ between two trees S and T is the maximum value of $\Theta(\alpha : S \mapsto T)$ over possible Tai-mappings from S to T, relative to a chosen cost table $C^{\Theta}$.

Applied to the same example as shown in Figure 1, with $C^{\Theta}(x,\lambda) = C^{\Theta}(\lambda,x) = 0$, $C^{\Theta}(x,x) = 2$, $C^{\Theta}(x,y) = 0$ for $x \neq y$, the shown alignment has score $\Theta(\alpha) = 9$, which is maximal for the given $C^{\Theta}$.

Θ(S, T) can be computed via a simple modiﬁca-

tion of the algorithm of (Zhang and Shasha, 1989).

Again on the domain of vertical trees this coincides

with a well known approach to sequence comparison,

the (alphabet-weighted) string similarity (Smith and

Waterman, 1981; Gusﬁeld, 1997).
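For sequences, the maximisation of Definition 4 can be sketched as the mirror image of the edit-distance dynamic programme: matches and swaps add their table entry, deletions and insertions subtract theirs, and the maximum is taken. The sketch below is illustrative only; the names and the `None` encoding of λ are assumptions, and the table used is the $\Theta$ setting of Example 1 in section 1.1.

```python
LAMBDA = None  # assumed encoding of the empty label (lambda)


def example_sim(x, y):
    """Similarity table with deletion/insertion 0, match 2, swap 1."""
    if x is LAMBDA or y is LAMBDA:
        return 0
    return 2 if x == y else 1


def similarity(s, t, sim):
    """Maximum alignment score, with deletion/insertion entries subtracted."""
    n, m = len(s), len(t)
    g = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g[i][0] = g[i - 1][0] - sim(s[i - 1], LAMBDA)
    for j in range(1, m + 1):
        g[0][j] = g[0][j - 1] - sim(LAMBDA, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = max(
                g[i - 1][j - 1] + sim(s[i - 1], t[j - 1]),  # match or swap
                g[i - 1][j] - sim(s[i - 1], LAMBDA),        # deletion
                g[i][j - 1] - sim(LAMBDA, t[j - 1]),        # insertion
            )
    return g[n][m]


print(similarity("aa", "aaa", example_sim))  # 4.0
print(similarity("ab", "ba", example_sim))   # 2.0
```

This is a global alignment score in the sense of Definition 4; it is not the local-alignment variant also associated with Smith and Waterman.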

As with $\Delta$, while the correctness of the algorithm for $\Theta$ is not dependent on any assumptions about the cost-table $C^{\Theta}$, some settings of $C^{\Theta}$ make little sense. Given the formulation in Definition 3, which subtracts the contribution from deletions and insertions, a setting where deletion/insertion cost entries are negative ($C^{\Theta}(x,\lambda) < 0$, $C^{\Theta}(\lambda,x) < 0$) gives the counter-intuitive effect that a supertree of S would be more 'similar' (in the sense of a higher $\Theta$ score) to S than S itself. This gives a rationale for the nearly universal assumption of non-negative deletion/insertion entries in $C^{\Theta}$:

$$\forall x,y \in \Sigma \; (C^{\Theta}(x,\lambda) \geq 0,\; C^{\Theta}(\lambda,y) \geq 0) \qquad (2)$$

In what follows we will confine attention always to 'similarity' $\Theta$ based on a table $C^{\Theta}$ satisfying (2).

For the $C^{\Theta}$-entries which are not deletions or insertions, it is quite common in biological sequence comparison to have both positive and negative entries. In contrast to the notion of a distance-metric, the notion of a set of axioms for a similarity $\Theta$ is less well established. (Chen et al., 2009) have recently made a proposal concerning this (see section 5).

To reiterate, for the purposes of this discussion a tree 'distance' measure will imply a cost-table $C^{\Delta}$, satisfying (1), used in accordance with Definitions 1 and 2 to score alignments and tree pairs. A tree 'similarity' measure will imply a cost-table $C^{\Theta}$, satisfying (2), used in accordance with Definitions 3 and 4 to score alignments and tree pairs. This is sufficient to distinguish the 'distance' approach from the 'similarity' approach in an intuitive way without committing to any further axioms.

Footnote: While Definition 3 formulates $\Theta$ with deletion/insertion contributions subtracted, as is often done (Smith and Waterman, 1981; Stojmirovic and Yu, 2009), an alternative formulation has these treated additively (Gusfield, 1997). With the additive formulation, the same consideration suggests making deletion/insertion entries non-positive.

1.1 Order-equivalence Notions between Tai Distance and Similarity

Given a 'distance' $\Delta$ scoring of alignments, it can be set to work to induce orderings of at least three different kinds of entities:

Alignment Ordering. Given fixed S, and fixed T, rank the possible alignments $\alpha : S \mapsto T$ by $\Delta(\alpha : S \mapsto T)$.

Neighbour Ordering. Given fixed S, and varying candidate neighbours $T_i$, rank the neighbours $T_i$ by $\Delta(S,T_i)$; this is typically used in k-NN classification.

Pair Ordering. Given varying $S_i$, and varying $T_j$, rank the pairings $\langle S_i, T_j \rangle$ by $\Delta(S_i,T_j)$; this is typically used in hierarchical clustering.

Similarly a 'similarity' $\Theta$ scoring of alignments induces orderings of the above kinds of entities. Comparing these orderings motivates the following definition.

Definition 5. (A-, N- and P-dual). When the alignment orderings induced by a choice of $C^{\Delta}$ (used in accordance with Definition 1) and by a choice of $C^{\Theta}$ (used in accordance with Definition 3) are the reverse of each other, we will say that $C^{\Theta}$ is an A-dual of $C^{\Delta}$. Similarly we will say we have an N-dual when neighbour ordering is reversed, and a P-dual where pair-ordering is reversed.

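For example, the following are A-duals in this sense (proven in section 2):

Example 1.
$\Delta$ with $C^{\Delta}(x,\lambda) = 1$, $C^{\Delta}(x,x) = 0$, $C^{\Delta}(x,y) = 1$
$\Theta$ with $C^{\Theta}(x,\lambda) = 0$, $C^{\Theta}(x,x) = 2$, $C^{\Theta}(x,y) = 1$

Example 2.
$\Delta$ with $C^{\Delta}(x,\lambda) = 0.5$, $C^{\Delta}(x,x) = 0$, $C^{\Delta}(x,y) = 0.5$
$\Theta$ with $C^{\Theta}(x,\lambda) = 0$, $C^{\Theta}(x,x) = 1$, $C^{\Theta}(x,y) = 0.5$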

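A natural question that presents itself then is whether for every choice of $C^{\Delta}$, there is a choice of $C^{\Theta}$ which is an A-dual, N-dual or P-dual, and vice-versa. More precisely there are the following

Order-relating Conjectures.

A-duality
(i) $\forall C^{\Delta} \, \exists C^{\Theta}$ ($C^{\Delta}$ and $C^{\Theta}$ are A-duals)
(ii) $\forall C^{\Theta} \, \exists C^{\Delta}$ ($C^{\Delta}$ and $C^{\Theta}$ are A-duals)

N-duality
(i) $\forall C^{\Delta} \, \exists C^{\Theta}$ ($C^{\Delta}$ and $C^{\Theta}$ are N-duals)
(ii) $\forall C^{\Theta} \, \exists C^{\Delta}$ ($C^{\Delta}$ and $C^{\Theta}$ are N-duals)

P-duality
(i) $\forall C^{\Delta} \, \exists C^{\Theta}$ ($C^{\Delta}$ and $C^{\Theta}$ are P-duals)
(ii) $\forall C^{\Theta} \, \exists C^{\Delta}$ ($C^{\Delta}$ and $C^{\Theta}$ are P-duals)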

Arguably these notions go to the heart of the

question whether there is really anything that can

be accomplished using an alignment ’distance’ score,


which cannot be accomplished via an alignment 'similarity' score, and vice-versa. For example, if it

turns out that all these order conjectures hold, then

any alignment outcome, any categorisation outcome

via k-NN and any hierarchical clustering outcome,

achieved by a particular distance can be replicated by

a similarity, and vice-versa, making the choice merely

a matter of personal taste. On the other hand, if these

duality conjectures do not hold, then there is substan-

tive difference, with the outcomes achievable by dis-

tances and similarities being distinct.

For a number of similarity and distance measures

based on sets and vectors, notions analogous to N-

dual and P-dual have been considered (Batagelj and

Bren, 1995; Omhover et al., 2005; Lesot and Rifqi,

2010), motivated similarly by the question whether

anything which can be accomplished with one or

other such measure can be replicated by another such

measure. It is for example shown there that a particu-

lar Dice measure will rank retrieval results inevitably

the same as a particular Jaccard measure. In the case

of alignment-based measures on sequences and trees, as far as we are aware, these notions seem not to have been systematically considered and the following sections endeavour to fill that gap.

2 ALIGNMENT-DUALITY

The following lemma will be useful for considering

the A-duality conjectures above:

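Lemma 1. For any $C^{\Delta}$, and some choice $\delta$ such that $0 \leq \delta/2 \leq \min(C^{\Delta}(\cdot,\lambda), C^{\Delta}(\lambda,\cdot))$, let $C^{\Theta}$ be defined according to (i) below. For any $C^{\Theta}$, and choice $\delta$ such that $\delta \geq 0$ and $\delta \geq \max(C^{\Theta}(\cdot,\cdot))$, let $C^{\Delta}$ be defined according to (ii) below.

(i)
$C^{\Theta}(x,\lambda) = C^{\Delta}(x,\lambda) - \delta/2$
$C^{\Theta}(\lambda,y) = C^{\Delta}(\lambda,y) - \delta/2$
$C^{\Theta}(x,y) = \delta - C^{\Delta}(x,y)$

(ii)
$C^{\Delta}(x,\lambda) = C^{\Theta}(x,\lambda) + \delta/2$
$C^{\Delta}(\lambda,y) = C^{\Theta}(\lambda,y) + \delta/2$
$C^{\Delta}(x,y) = \delta - C^{\Theta}(x,y)$

Then in either case, for any $\alpha : S \mapsto T$

$$\Delta(\alpha) + \Theta(\alpha) = \delta/2 \times \Big( \sum_{s \in S} 1 + \sum_{t \in T} 1 \Big) \qquad (3)$$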

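Proof of Lemma 1. If defining $C^{\Theta}$ from $C^{\Delta}$ by (i), by the choice of $\delta$ we have the non-negativity of $C^{\Theta}(x,\lambda)$ and $C^{\Theta}(\lambda,y)$. If defining $C^{\Delta}$ from $C^{\Theta}$ by (ii), by the choice of $\delta$, we have the non-negativity of all entries in $C^{\Delta}$. Whether defining $C^{\Theta}$ from $C^{\Delta}$ by (i), or $C^{\Delta}$ from $C^{\Theta}$ by (ii), it is straightforward to show

$$\Delta(\alpha) + \Theta(\alpha) = \delta/2 \times (2|M| + |D| + |I|)$$

since each pair in M contributes $C^{\Delta}(x,y) + C^{\Theta}(x,y) = \delta$ and each deletion or insertion contributes $\delta/2$. But then (3) follows since

$$2|M| + |D| + |I| = \sum_{s \in S} 1 + \sum_{t \in T} 1$$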
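The constant-sum property (3) can be checked numerically on small sequences. The sketch below enumerates every order-preserving partial mapping between two strings, scores each by Definition 1 and by Definition 3 using the (i) conversion, and tests that the two scores always sum to $\delta/2 \times (|S| + |T|)$. All function names, and the restriction to sequences, are assumptions of the illustration.

```python
from itertools import combinations

LAMBDA = None  # assumed encoding of the empty label (lambda)


def alignments(n, m):
    """Every order-preserving partial 1-to-1 mapping between positions of two
    sequences of lengths n and m (the sequence case of a Tai mapping)."""
    for k in range(min(n, m) + 1):
        for src in combinations(range(n), k):
            for tgt in combinations(range(m), k):
                yield list(zip(src, tgt))


def score(s, t, mapping, table, sign):
    """sign=+1: Definition 1 (all entries added).
    sign=-1: Definition 3 (deletion/insertion entries subtracted)."""
    used_s = {i for i, _ in mapping}
    used_t = {j for _, j in mapping}
    total = sum(table(s[i], t[j]) for i, j in mapping)
    total += sign * sum(table(s[i], LAMBDA) for i in range(len(s)) if i not in used_s)
    total += sign * sum(table(LAMBDA, t[j]) for j in range(len(t)) if j not in used_t)
    return total


def unit_cost(x, y):
    return 0 if x == y else 1


def dual_similarity(cost, delta):
    """Conversion (i) of Lemma 1, from a distance table to a similarity table."""
    def sim(x, y):
        if x is LAMBDA or y is LAMBDA:
            return cost(x, y) - delta / 2
        return delta - cost(x, y)
    return sim


def lemma1_holds(s, t, cost, delta):
    """True if Delta(alpha) + Theta(alpha) is the constant of (3) for all alpha."""
    sim = dual_similarity(cost, delta)
    target = delta / 2 * (len(s) + len(t))
    return all(
        abs(score(s, t, a, cost, +1) + score(s, t, a, sim, -1) - target) < 1e-9
        for a in alignments(len(s), len(t))
    )


print(all(lemma1_holds("ab", "bab", unit_cost, d) for d in (0, 1, 2)))  # True
```

The three values of $\delta$ tried are those permitted for the unit-cost table, where $\delta/2$ may not exceed the deletion/insertion cost of 1.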

Theorem 2. A-duality (i) and (ii) hold.

Proof of Theorem 2. A-duality (i): define $C^{\Theta}$ according to (i) in Lemma 1. Given the constant summation property of (3), the ordering on alignments by $\Delta$ must be the reverse of the ordering by $\Theta$.

A-duality (ii): similarly define $C^{\Delta}$ according to (ii) in Lemma 1.

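Example 1 Revisited. The $C^{\Theta}$ of Example 1 can be seen as derived from the $C^{\Delta}$ with $\delta = 2$. The table below shows outcomes for other choices of $\delta$.

          $C^{\Delta}$   $C^{\Theta}$ ($\delta = 2$)   $C^{\Theta}$ ($\delta = 1$)   $C^{\Theta}$ ($\delta = 0$)
(x,λ)     1              0                             0.5                           1
(x,x)     0              2                             1                             0
(x,y)     1              1                             0                             -1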

As a corollary one can obtain the following con-

cerning how one similarity table can be ’shifted’ to an

equivalent one, and similarly for distance tables.

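Corollary 3. ('Shifting'). For any $C^{\Theta}$, an alignment-equivalent $C'^{\Theta}$ can be derived by the conversion (a) below, and for any $C^{\Delta}$, an alignment-equivalent $C'^{\Delta}$ can be derived by the conversion (b).

(a)
$C'^{\Theta}(x,\lambda) = C^{\Theta}(x,\lambda) - \kappa/2$
$C'^{\Theta}(\lambda,y) = C^{\Theta}(\lambda,y) - \kappa/2$
$C'^{\Theta}(x,y) = C^{\Theta}(x,y) + \kappa$

(b)
$C'^{\Delta}(x,\lambda) = C^{\Delta}(x,\lambda) + \kappa/2$
$C'^{\Delta}(\lambda,y) = C^{\Delta}(\lambda,y) + \kappa/2$
$C'^{\Delta}(x,y) = C^{\Delta}(x,y) + \kappa$

Proof of Corollary 3. (a) is the composition of (ii), for some $\delta_1$, with (i), for some $\delta_2$, giving $\kappa = \delta_2 - \delta_1$. (b) is the composition of (i), for some $\delta_1$, with (ii), for some $\delta_2$, giving $\kappa = \delta_2 - \delta_1$.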

Example 1 Revisited Again. The three A-dualizing similarities $C^{\Theta}$ ($\delta = 2$), $C^{\Theta}$ ($\delta = 1$) and $C^{\Theta}$ ($\delta = 0$) derived from the unit-cost distance table using varying $\delta$ in the (i) conversion of Lemma 1 can be seen as related to each other by the (a) 'shifting' conversion of Corollary 3, with $\kappa = -1$ each time.
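The arithmetic of this example can be replayed directly. In the sketch below a table is reduced, as a simplifying assumption, to its three generic entries (deletion/insertion, match, swap) held in a dictionary with hypothetical keys; `shift_similarity` applies the (a) conversion.

```python
def shift_similarity(table, kappa):
    """Conversion (a): lower deletion/insertion entries by kappa/2 and raise
    match/swap entries by kappa."""
    return {
        "del/ins": table["del/ins"] - kappa / 2,
        "match": table["match"] + kappa,
        "swap": table["swap"] + kappa,
    }


sim_delta2 = {"del/ins": 0, "match": 2, "swap": 1}   # derived with delta = 2
sim_delta1 = shift_similarity(sim_delta2, -1)        # expected: delta = 1 column
sim_delta0 = shift_similarity(sim_delta1, -1)        # expected: delta = 0 column

print(sim_delta1)  # {'del/ins': 0.5, 'match': 1, 'swap': 0}
print(sim_delta0)  # {'del/ins': 1.0, 'match': 0, 'swap': -1}
```

Each shift changes every alignment score between fixed S and T by the same amount, $\kappa/2 \times (|S| + |T|)$, which is why the alignment ordering is unaffected.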

The property of alignment dualizability between dis-

tance and similarity (and vice-versa) expressed above


in Lemma 1 and Theorem 2 was essentially ﬁrst

proven for the case of sequence comparison by (Smith

and Waterman, 1981). On the basis of this perhaps it

is tempting to consider the case closed and treat ’dis-

tance’ and ’similarity’ as interchangeable. However,

as noted in Section 1.1, there is more than one kind

of ordering that one might wish to be sure of repli-

cating in switching between distance and similarity,

with N-duality coming to the fore in the context of

k-NN classiﬁcation, and P-duality coming to the fore

in the context of hierarchical clustering. Section 3.1

considers the N-duality (i) and P-duality (i) order con-

jectures, and Section 3.2 considers the N-duality (ii)

and P-duality (ii) conjectures.

3 NEIGHBOUR AND PAIR ORDERING

3.1 Distance to Similarity

Having seen that A-duals can always be created in

both directions, attention shifts to N-duals and P-

duals.

The case of using $\delta = 0$ in the (i) conversion of Lemma 1 from $C^{\Delta}$ to $C^{\Theta}$ gives non-positive values for all non-deletion, non-insertion entries in $C^{\Theta}$, and is an especially trivial case of dualizing a distance setting $C^{\Delta}$, with the effect that $\Theta(S,T) = -1 \times \Delta(S,T)$. Because of this, this distance-to-similarity conversion not only makes A-duals, but also N-duals and P-duals.

Theorem 4. N-duality (i) and P-duality (i) hold.

Proof of Theorem 4. By choosing $\delta = 0$ in the (i) conversion of Lemma 1 from $C^{\Delta}$ to $C^{\Theta}$, we have $\Theta(S,T) = -1 \times \Delta(S,T)$, and hence $\Theta(S,T) \leq \Theta(S',T') \Leftrightarrow \Delta(S,T) \geq \Delta(S',T')$.
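The negation argument can be illustrated on sequences. The sketch below uses one dynamic programme for both measures (an assumption of the illustration: `sign=+1` with `min` for Definition 2, `sign=-1` with `max` for Definition 4), derives the $\delta = 0$ similarity table from the unit-cost table, and compares the resulting neighbour orderings. Names are hypothetical.

```python
LAMBDA = None  # assumed encoding of the empty label (lambda)


def best_score(s, t, table, sign, pick):
    """Optimal alignment score: sign=+1, pick=min gives the distance;
    sign=-1, pick=max gives the similarity."""
    n, m = len(s), len(t)
    g = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g[i][0] = g[i - 1][0] + sign * table(s[i - 1], LAMBDA)
    for j in range(1, m + 1):
        g[0][j] = g[0][j - 1] + sign * table(LAMBDA, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = pick(
                g[i - 1][j - 1] + table(s[i - 1], t[j - 1]),
                g[i - 1][j] + sign * table(s[i - 1], LAMBDA),
                g[i][j - 1] + sign * table(LAMBDA, t[j - 1]),
            )
    return g[n][m]


def unit_cost(x, y):
    return 0 if x == y else 1


def negated_sim(x, y):
    """Conversion (i) of Lemma 1 with delta = 0, applied to unit_cost."""
    if x is LAMBDA or y is LAMBDA:
        return unit_cost(x, y)      # deletion/insertion entry unchanged
    return -unit_cost(x, y)         # match/swap entry negated


source = "abab"
neighbours = ["a", "ab", "ba", "aab", "abab", "bbbb"]
by_distance = sorted(neighbours, key=lambda t: best_score(source, t, unit_cost, +1, min))
by_similarity = sorted(neighbours, key=lambda t: -best_score(source, t, negated_sim, -1, max))
print(by_distance == by_similarity)  # True
```

Ascending distance and descending similarity give the same neighbour list here because every similarity is exactly the negated distance.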

This distance-to-similarity conversion by negation is well known. On the other hand, concerning similarity-to-distance, in the (ii) conversion of Lemma 1 from $C^{\Theta}$ to $C^{\Delta}$, one can only choose $\delta = 0$ if all $C^{\Theta}(x,y) \leq 0$, and clearly there are many natural settings of $C^{\Theta}$ where that is not true.

3.2 Similarity to Distance

The remaining order-equivalence conjectures of sec-

tion 1.1 are N-duality(ii) and P-duality(ii), concern-

ing the similarity-to-distance direction. Of the re-

maining conjectures, P-duality(ii) is stronger than N-

duality(ii). We can fairly easily show P-duality(ii)

does not hold

Theorem 5. P-duality (ii) does not hold, that is, there are $C^{\Theta}$ such that there is no $C^{\Delta}$ such that $C^{\Theta}$ and $C^{\Delta}$ are P-duals.

Proof of Theorem 5. It is clearly possible for $C^{\Theta}$ to be such that there is no maximum value for $\Theta(S,T)$. For example, for the table below:

$C^{\Theta}(a,a) = 1$
$C^{\Theta}(a,\lambda) = 1$

it is clear we have $\Theta(a,a) = 1$, $\Theta(aa,aa) = 2$ and in general $\Theta(a^n,a^n) = n$. Let $C^{\Theta}$ be any table defining a similarity with no maximum. On the other hand, for each $C^{\Delta}$ there will be a minimum value of $\Delta(S,T)$. Suppose some $C^{\Delta}$ is a P-dual to $C^{\Theta}$. For any $n$ let $[\Theta]_n$ (resp. $[\Delta]_n$) be the set of pairs with similarity (resp. distance) $n$. If $C^{\Delta}$ is a P-dual to $C^{\Theta}$, there is some bijection between the set of similarity classes $\{[\Theta]_n\}$ and the set of distance classes $\{[\Delta]_n\}$. Some similarity class $[\Theta]_m$ of $\Theta$ must correspond to the minimum distance class $[\Delta]_{min}$. Let $[\Theta]_{m'}$ be a higher $\Theta$ class than $[\Theta]_m$. It must correspond to some $\Delta$ class $[\Delta]_{d'}$ distinct from $[\Delta]_{min}$, and since $[\Delta]_{min}$ is the distance-minimum, this must be a higher distance class. Then for $(S,T) \in [\Delta]_{min}$ and $(S',T') \in [\Delta]_{d'}$ you have $\Delta(S,T) < \Delta(S',T')$, but also $\Theta(S,T) < \Theta(S',T')$. So the supposed dual $C^{\Delta}$ does not reverse the pair-ordering of $C^{\Theta}$.
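The unbounded growth of $\Theta(a^n,a^n)$ used in this proof can be observed with a small sequence-similarity sketch. The table assumes, beyond what the proof states, that insertion has the same entry as deletion; the function names are hypothetical.

```python
LAMBDA = None  # assumed encoding of the empty label (lambda)


def sim_table(x, y):
    """Table of the proof: match entry 1, deletion entry 1 (insertion entry
    assumed to be 1 as well)."""
    return 1


def similarity(s, t, sim):
    """Maximum alignment score with deletion/insertion entries subtracted."""
    n, m = len(s), len(t)
    g = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g[i][0] = g[i - 1][0] - sim(s[i - 1], LAMBDA)
    for j in range(1, m + 1):
        g[0][j] = g[0][j - 1] - sim(LAMBDA, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = max(
                g[i - 1][j - 1] + sim(s[i - 1], t[j - 1]),
                g[i - 1][j] - sim(s[i - 1], LAMBDA),
                g[i][j - 1] - sim(LAMBDA, t[j - 1]),
            )
    return g[n][m]


self_similarities = [similarity("a" * n, "a" * n, sim_table) for n in range(1, 6)]
print(self_similarities)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

By contrast any distance satisfying (1) is bounded below by 0, which is the asymmetry the proof exploits.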

Of the order-relating conjectures of section 1.1 the only remaining one is N-duality (ii), that is, the question whether every neighbour-ordering via some $C^{\Theta}$ can be replicated by a neighbour ordering via some $C^{\Delta}$. We can show that there are neighbour-orderings by a Tai-similarity which cannot be dualized by any Tai-distance whose deletion and insertion costs are symmetric.

Theorem 6. There is $C^{\Theta}$ such that there is no $C^{\Delta}$ with $C^{\Delta}(x,\lambda) = C^{\Delta}(\lambda,x)$ such that $C^{\Theta}$ and $C^{\Delta}$ are N-duals.

Proof of Theorem 6. Let $S = aa$, and the set of neighbours be $\{a, aaa\}$.

Let $C^{\Theta}(a,a) = x > 0$, and $C^{\Theta}(a,\lambda) = C^{\Theta}(\lambda,a) = y > 0$.

For $(aa,aaa)$, the alignments with 2, 1, and 0 a-matches have scores $2x - y$, $x - 3y$ and $-5y$, respectively, so the alignments maximising $\Theta$ are those with two a-matches, and $\Theta(aa,aaa) = 2x - y$.

For $(aa,a)$, the alignments with 1 and 0 a-matches have scores $x - y$ and $-3y$, respectively, so the alignments maximising $\Theta$ have one a-match, and $\Theta(aa,a) = x - y$.

Consider what is required for the $\Theta$-decreasing neighbour ordering to be $[aaa, a]$:


$$\Theta(aa,aaa) > \Theta(aa,a) \;\Leftrightarrow\; 2x - y > x - y \;\Leftrightarrow\; x > 0$$

So there is a $\Theta$-decreasing neighbour-ordering $[aaa, a]$.

Let $C^{\Delta}(a,a) = x'$, and $C^{\Delta}(a,\lambda) = C^{\Delta}(\lambda,a) = y'$. Note this assumes symmetric insertion and deletion costs.

For $(aa,aaa)$, the alignments with 2, 1, and 0 a-matches have scores $2x' + y'$, $x' + 3y'$ and $5y'$, respectively. We distinguish two cases: (i) $2y' < x'$ and (ii) $2y' \geq x'$.

For case (i), $x' = 2y' + \epsilon$, for some $\epsilon > 0$, and the 2, 1, and 0 a-match scores become $5y' + 2\epsilon$, $5y' + \epsilon$ and $5y'$, respectively, so taking the minimum, $\Delta(aa,aaa) = 5y'$.

For case (ii), $y' = x'/2 + \kappa$, for some $\kappa \geq 0$, and the 2, 1, and 0 a-match scores become $2.5x' + \kappa$, $2.5x' + 3\kappa$ and $2.5x' + 5\kappa$, respectively, and the 2-match case is amongst the minimal cases, so $\Delta(aa,aaa) = 2.5x' + \kappa$.

For $(aa,a)$, the alignments with 1 and 0 a-matches have scores $x' + y'$ and $3y'$ respectively. We again distinguish between cases (i) $2y' < x'$ and (ii) $2y' \geq x'$.

For case (i), the 1 and 0 a-match scores become $3y' + \epsilon$ and $3y'$ respectively, so taking the minimum, $\Delta(aa,a) = 3y'$.

For case (ii), the 1 and 0 a-match scores become $1.5x' + \kappa$ and $1.5x' + 3\kappa$ respectively, and the 1-match case is amongst the minimal cases, so $\Delta(aa,a) = 1.5x' + \kappa$.

Summarising the $\Delta$ possibilities:

                        $\Delta(aa,aaa)$      $\Delta(aa,a)$
(i)  $2y' < x'$         $5y'$                 $3y'$
(ii) $2y' \geq x'$      $2.5x' + \kappa$      $1.5x' + \kappa$

Since $x' \geq 0$ and $y' \geq 0$ by (1), in both cases $\Delta(aa,aaa) \geq \Delta(aa,a)$. So in neither case (i) nor case (ii) is it possible to achieve a $\Delta$-ascending neighbour ordering $[aaa, a]$, which was the $\Theta$-descending neighbour ordering achieved with the assumed $C^{\Theta}$.

Remark. If we drop the requirement that the N-dualizing $C^{\Delta}$ have $C^{\Delta}(x,\lambda) = C^{\Delta}(\lambda,x)$, then the argument does not go through. The $\Theta$-descending neighbour ordering $[aaa, a]$ can be replicated by a $\Delta$-ascending neighbour ordering with $C^{\Delta}(a,\lambda) > C^{\Delta}(\lambda,a)$. For most applications of alignment-based 'distances', such an asymmetric setting of deletion and insertion costs would be considered unnatural.
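Both Theorem 6 and the Remark can be probed numerically on the strings involved. The sketch below is an illustration under assumptions: it handles only strings over the single label a, takes the match, deletion and insertion costs as plain numbers, and searches a finite grid of symmetric settings, which is evidence rather than a proof. The asymmetric setting at the end is one hypothetical choice with deletion dearer than insertion.

```python
def edit_distance(s, t, match, delete, insert):
    """Distance of Definition 2 for strings over the single label 'a'."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + delete
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + match,
                d[i - 1][j] + delete,
                d[i][j - 1] + insert,
            )
    return d[n][m]


# Symmetric deletion/insertion cost y and match cost x, both on 0, 0.25, ..., 3.
grid = [k / 4 for k in range(13)]
symmetric_counterexamples = [
    (x, y)
    for x in grid
    for y in grid
    if edit_distance("aa", "aaa", x, y, y) < edit_distance("aa", "a", x, y, y)
]
print(symmetric_counterexamples)  # []

# Asymmetric setting: match 1, deletion 3, insertion 0.5.
asymmetric = (
    edit_distance("aa", "aaa", 1, 3, 0.5),
    edit_distance("aa", "a", 1, 3, 0.5),
)
print(asymmetric)  # (2.5, 4.0)
```

With the asymmetric costs, aaa is nearer to aa than a is, so the ascending-distance ordering is $[aaa, a]$, matching the similarity ordering of the proof.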

4 EMPIRICAL INVESTIGATION

(Lesot and Rifqi, 2010) consider distance and sim-

ilarity measures often used in information retrieval.

These are deﬁned over ﬁnite vectors, whose features

are either binary or real-valued. They basically con-

sider the neighbour orderings produced by different

measures. Besides demonstrating absolute equiva-

lence between some measures, between other mea-

sures they empirically determine equivalence degrees,

between 0 and 1, based on the Kendall-tau statistic

for comparing orderings (Kendall, 1945). While their work concerned comparison measures on vectors, it is natural to consider an analogous empirical quantified comparison of distance and similarity orderings on trees and sequences. Some preliminary findings of such a study are given below.

The (i) conversion of Lemma 1 converts distance settings to A-dual similarity settings and one thing to consider is the degree to which the derived similarities are also N-duals of the distance. Table 1 gives some distance and similarity settings: the first column gives the unit-cost settings for $\Delta$ and the columns to the right give different similarity settings $C^{\Theta}$ derivable by the (i) conversion of Lemma 1 as $\delta$ is varied through various values.

Table 1: Unit-cost distance setting and several A-dual similarity settings.

                         dual $C^{\Theta}$ for varying $\delta$
          $C^{\Delta}$   2      1.5    1      0.5    0.2    0.1    0
(x,λ)     1              0      0.25   0.5    0.75   0.9    0.95   1
(x,x)     0              2      1.5    1      0.5    0.2    0.1    0
(x,y)     1              1      0.5    0      -0.5   -0.8   -0.9   -1

An experiment was done to quantify how close the similarities defined by the varying $C^{\Theta}$ tables come to being N-duals for the distance. Using a set of 1334 trees, repeatedly a tree S was chosen, and neighbour files $N_{\Delta}(S)$ and $N_{\Theta}(S)$ were computed, with $N_{\Delta}(S)$ the ordering of the remaining trees by ascending $\Delta$, and $N_{\Theta}(S)$ the ordering by descending $\Theta$. $N_{\Delta}(S)$ and $N_{\Theta}(S)$ were then compared by the Kendall-tau measure $\tau$ (see the Appendix for the definition). For each $\delta$ the average of this $\tau$ comparison between the distance and similarity neighbour files is shown in Figure 2.
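A comparison of two neighbour files of this kind can be sketched as follows. The exact definition of $\tau$ used in the experiment is the one in the Appendix; the function below is an assumed stand-in, the normalised Kendall-tau distance, which counts the fraction of item pairs that the two orderings rank in opposite order, so that 0 means identical orderings and 1 means exact reversal. Ties are not handled.

```python
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Fraction of item pairs ranked in opposite order by the two orderings.

    Assumes both orderings list the same items, each exactly once."""
    position_b = {item: rank for rank, item in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))  # (x, y) with x before y in order_a
    discordant = sum(1 for x, y in pairs if position_b[x] > position_b[y])
    return discordant / len(pairs)


by_distance = ["t1", "t2", "t3", "t4"]
print(kendall_tau_distance(by_distance, ["t1", "t2", "t3", "t4"]))  # 0.0
print(kendall_tau_distance(by_distance, ["t4", "t3", "t2", "t1"]))  # 1.0
print(kendall_tau_distance(by_distance, ["t1", "t3", "t2", "t4"]))  # one of six pairs swapped
```

The tree identifiers `t1` to `t4` are placeholders; in the experiment the two lists would be $N_{\Delta}(S)$ and $N_{\Theta}(S)$ for a chosen S.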

The bottom-left corner, for δ = 0 is the special

case of Lemma 1 which amounts to the well-known

trivial distance-to-similarity conversion, Θ(S,T) =

−1 × ∆(S,T), noted in section 3.1. In this case the

distance and similarity neighbour ﬁles are identical.

Footnote (on the set of 1334 trees): See the Appendix for further details of this data set.


[Figure 2 (plot not reproduced): average tau on the vertical axis against delta on the horizontal axis, with delta ranging from 0.0 to 2.0.]

Figure 2: Average Kendall-tau comparison on neighbours using distance and derived similarities. Distance setting is first column of Table 1. Similarity settings are further columns of Table 1 defined by varying $\delta$.

As the graph clearly shows, as δ increases, the neigh-

bour ﬁles exhibit progressively greater difference in

ordering, until at δ = 2 the τ score is 0.73, which cor-

responds to a tendency more towards order reversal

than to replication. This experiment shows that al-

though each of these similarity settings is an A-dual

of the simple distance setting, they are not at all equiv-

alent to each other as far as neighbour ordering is con-

cerned.

The (ii) conversion of Lemma 1 converts similarity settings to A-dual distance settings. Table 2 gives a similarity setting and then several distance settings derivable by the (ii) conversion as $\delta$ is varied through various values.

Table 2: A similarity setting and several A-dual distance settings.

                         dual $C^{\Delta}$ for varying $\delta$
          $C^{\Theta}$   1      1.5    2      2.5    3      3.5    4
(x,λ)     0.5            1      1.25   1.5    1.75   2      2.25   2.5
(x,x)     1              0      0.5    1      1.5    2      2.5    3
(x,y)     0              1      1.5    2      2.5    3      3.5    4

Figure 3 plots the average τ comparison between

the similarity and distance neighbour ﬁles, as δ is var-

ied to give different distances. Again this experiment

shows that although each of the distance settings is an

A-dual of the similarity setting, they are not equiva-

lent to each other as far as neighbour ordering is con-

cerned.

Footnote: The nodes in these experiments have multi-part labels. Whilst the first experiment treated these simply as identical or not, for this second experiment, in the base-line similarity, node labels are compared via $C^{\Theta}(x,y) = 1 - ham(x,y)$, where $ham(x,y)$ is the standard Hamming distance. The table thus shows the extreme values of $C^{\Theta}(x,y)$ and $C^{\Delta}(x,y)$.

[Figure 3 (plot not reproduced): average tau on the vertical axis (ticks from 0.30 to 0.36) against delta on the horizontal axis, with delta ranging from 1.0 to 4.0.]

Figure 3: Average Kendall-tau comparison on neighbours using a similarity and derived distances. Similarity setting is first column of Table 2. Distance settings are further columns of Table 2 defined by varying $\delta$.

Theorem 5 concerned the non-replicability by distance of pair-orderings by similarity. To illustrate this, consider a set of strings $\{a^1, \ldots, a^5\}$. A table of pair-wise similarities of these was made with $C^{\Theta}(a,a) = 1$, $C^{\Theta}(a,\lambda) = 1$, and used to generate a single-link clustering, shown as the uppermost dendrogram in Figure 4.

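[Figure 4 (dendrograms not reproduced): three single-link clusterings, with panel titles "sim swap:1 del:1 single", "dist swap:0 del:1 single" and "dist swap:1 del:1 single".]

Figure 4: Similarity and distance clusterings. The instance labels i5 ... i1 represent $a^5 \ldots a^1$.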

No single-link clustering based on distance replicates this similarity clustering. The middle dendrogram in Figure 4 is the result with $C^{\Delta}(a,a) = 0$, $C^{\Delta}(a,\lambda) = 1$, with all five shown on the same level because $\Delta(a^m, a^{m+1}) = 1$. The lowest dendrogram in Figure 4 shows a result with $C^{\Delta}(a,a) = 1$, $C^{\Delta}(a,\lambda) = 1$. The same structure was found holding $C^{\Delta}(a,a) = 1$, and allowing the deletion/insertion cost to vary between 0.5 and 5.5 (which are $\geq C^{\Delta}(a,a)$) and between 0.4 and 0.1 (which are $< C^{\Delta}(a,a)$).

5 DISCUSSION AND COMPARISONS

In view of the outcomes noted in sections 2, 3.1 and

3.2 concerning the various ordering conjectures we

can say that

• Any hierarchical clustering outcome achieved via

∆ can be replicated via Θ, but not vice-versa.

• Any categorisation outcome using nearest-

neighbours achieved via ∆ can be replicated via

Θ, but not vice-versa.

and in this sense ’similarity’ and ’distance’ compar-

ison measures on sequences and trees are not inter-

changeable.

As far as we are aware this aspect of the choice

between a similarity-based versus a distance-based

comparison measure on sequences or trees has not

been noted before.

There are a number of papers concerning conversion from a similarity-based sequence comparison measure to a distance-based comparison measure, and particularly one satisfying distance-metric axioms (Spiro and Macura, 2004; Stojmirovic and Yu, 2009). An aim of these papers is to find techniques for accelerating so-called range similarity queries, which are requests to find all neighbours within a similarity threshold, $N_{\leq\theta}(S) = \{T : \Theta(S,T) \geq \theta\}$. To discuss these papers it will be as well to note the distance-metric axioms.

Definition 6. (Distance Metric). A binary relation $\Delta$ is a distance-metric if it satisfies

D1. $\Delta(S,T) = \Delta(T,S)$
D2. $\Delta(S,T) \geq 0$
D3. $\Delta(S,V) \leq \Delta(S,T) + \Delta(T,V)$
D4. $\Delta(S,T) = 0$ iff $S = T$

It is a pseudo-metric if D4 is dropped. It is a quasi-metric if D1 is dropped.

For a distance-metric on sequences there is a way to use the triangle-inequality to accelerate solution of a distance range query, $N_{\leq\delta}(S) = \{T : \Delta(S,T) \leq \delta\}$. Suppose $S$ is a query, and $T_1$ is a training-set point known to be far from $S$, and that another training-set point $T_2$ is known to be close to $T_1$. Intuitively $S$ is also going to be far from $T_2$. More specifically, if ∆ is a distance-metric, an instance of the triangle-inequality will be:

\[ \Delta(S,T_1) \leq \Delta(T_1,T_2) + \Delta(T_2,S) \qquad (4) \]

via which $\Delta(T_2,S)$ is bounded below by $\Delta(S,T_1) - \Delta(T_1,T_2)$. So if $T_1$ has already been excluded from a distance neighbourhood, $T_2$ can also be immediately excluded if $\Delta(S,T_1) - \Delta(T_1,T_2)$ exceeds the threshold.
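The exclusion test just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the cited papers: unit-cost Levenshtein distance stands in for a metric ∆, the distances between training-set points are assumed to be precomputed in a dictionary `pairwise`, and the names `levenshtein`, `range_query` and `pairwise` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, used here as an example of a distance-metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # swap or match
        prev = curr
    return prev[-1]


def range_query(query, training, pairwise, threshold):
    """Return (neighbours within threshold, number of distances computed)."""
    accepted = []
    excluded = []      # pairs (T1, dist(query, T1)) already known to be too far
    evaluations = 0
    for t2 in training:
        # dist(query, t2) >= dist(query, t1) - dist(t1, t2) by the triangle
        # inequality, so t2 can be skipped when that lower bound is too large.
        if any(d1 - pairwise[(t1, t2)] > threshold for t1, d1 in excluded):
            continue
        d2 = levenshtein(query, t2)
        evaluations += 1
        if d2 <= threshold:
            accepted.append(t2)
        else:
            excluded.append((t2, d2))
    return accepted, evaluations


training = ["bbbbbbbb", "bbbbbbbc", "aaab", "aaaa"]
pairwise = {(t1, t2): levenshtein(t1, t2) for t1 in training for t2 in training}
accepted, evaluations = range_query("aaaa", training, pairwise, threshold=1)
```

In this toy run the second training point is excluded without computing its distance to the query, because it lies close to a point already known to be far away. The pruning never changes the answer; it only reduces the number of distance computations.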

Most biological sequence comparison is done with similarity rather than distance, and the concern of (Spiro and Macura, 2004) is to find a corresponding means of accelerating similarity range queries. In terms of the notations used here, they essentially propose the following conversion from similarity to distance cost-table

\[
\begin{aligned}
\forall x,y \in \Sigma \quad & C^{\Delta}(x,y) = C^{\Theta}(x,x) + C^{\Theta}(y,y) - 2C^{\Theta}(x,y) \\
\forall x \in \Sigma \quad & C^{\Delta}(x,\lambda) = C^{\Theta}(x,\lambda) \\
\forall y \in \Sigma \quad & C^{\Delta}(\lambda,y) = C^{\Theta}(\lambda,y)
\end{aligned}
\]

and they prove that, under some conditions imposed on $C^{\Theta}$, the corresponding ∆ will satisfy all the conditions of a distance-metric, in particular the triangle-inequality, and that the relation between Θ and ∆ is then

\[ \Delta(X,Y) = \Theta(X,X) + \Theta(Y,Y) - 2\Theta(X,Y) \qquad (5) \]
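At the level of cost-tables the conversion is a simple entry-by-entry computation. The following is a minimal sketch of it in Python, written from the three equations as stated in this paper's notation and not from the cited authors' software. The dictionary representation of a cost-table, the marker `LAM` for the empty label λ, the function name and the example values are all assumptions made for illustration.

```python
LAM = None  # hypothetical marker standing in for the empty label (lambda)


def spiro_macura_table(c_theta, alphabet):
    """Build a distance cost-table from a similarity cost-table.

    c_theta maps (x, y), (x, LAM) and (LAM, y) to similarity table entries.
    """
    c_delta = {}
    for x in alphabet:
        # Deletion and insertion entries are carried over unchanged.
        c_delta[(x, LAM)] = c_theta[(x, LAM)]
        c_delta[(LAM, x)] = c_theta[(LAM, x)]
        for y in alphabet:
            # Swap entries combine the two self-similarities with the
            # similarity between the two labels.
            c_delta[(x, y)] = (c_theta[(x, x)] + c_theta[(y, y)]
                               - 2 * c_theta[(x, y)])
    return c_delta


# Toy similarity table over a two-letter alphabet (values are made up).
c_theta = {
    ("a", "a"): 2, ("b", "b"): 3, ("a", "b"): -1, ("b", "a"): -1,
    ("a", LAM): 1, ("b", LAM): 1, (LAM, "a"): 1, (LAM, "b"): 1,
}
c_delta = spiro_macura_table(c_theta, "ab")
```

Note that identical labels always receive a swap cost of 0 under this conversion, which is what axiom D4 requires of the derived distance.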

Substitution of (5) into the triangle-inequality and some re-arrangement gives that $\Theta(T_2,S)$ is bounded above by $\Theta(S,T_1) + \Theta(T_2,T_2) - \Theta(T_1,T_2)$, giving a means for rapid exclusion of $T_2$ from a similarity neighbourhood.
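The re-arrangement can be spelled out step by step. This is a worked sketch rather than the cited authors' own derivation; it writes $T_1$ for the training-set point already compared with the query $S$ and $T_2$ for the point to be excluded (labels assumed here), starts from the triangle-inequality instance (4), and replaces each distance using relation (5).

```latex
\begin{align*}
\Delta(S,T_1) &\le \Delta(T_1,T_2) + \Delta(T_2,S) \\
\Theta(S,S) + \Theta(T_1,T_1) - 2\Theta(S,T_1)
  &\le \bigl[\Theta(T_1,T_1) + \Theta(T_2,T_2) - 2\Theta(T_1,T_2)\bigr]
     + \bigl[\Theta(T_2,T_2) + \Theta(S,S) - 2\Theta(T_2,S)\bigr] \\
-2\Theta(S,T_1) &\le 2\Theta(T_2,T_2) - 2\Theta(T_1,T_2) - 2\Theta(T_2,S) \\
\Theta(T_2,S) &\le \Theta(S,T_1) + \Theta(T_2,T_2) - \Theta(T_1,T_2)
\end{align*}
```

So whenever the right-hand side falls below the similarity threshold θ, $T_2$ cannot belong to the similarity neighbourhood of $S$, and $\Theta(T_2,S)$ need not be computed.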

Besides the fact that equation (5) relating Θ and ∆ holds only under particular assumptions concerning $C^{\Theta}$, more importantly the obtained relationship in (5) is not sought in the context of deriving a P-dual or N-dual distance ∆ from a given similarity Θ, and in fact (5) does not do this. Thus while Spiro and Macura do provide a conversion from a similarity to a distance, it addresses concerns somewhat orthogonal to those of this paper.

(Stojmirovic and Yu, 2009) is a paper with similar concerns to (Spiro and Macura, 2004). In terms of the notations used here, they propose the following conversion from similarity to distance cost-table:

\[
\begin{aligned}
\forall x,y \in \Sigma \quad & C^{\Delta}(x,y) = C^{\Theta}(x,x) - C^{\Theta}(x,y) \\
\forall x \in \Sigma \quad & C^{\Delta}(x,\lambda) = C^{\Theta}(x,x) + C^{\Theta}(x,\lambda) \\
\forall y \in \Sigma \quad & C^{\Delta}(\lambda,y) = C^{\Theta}(\lambda,y)
\end{aligned}
\]

and prove, under some assumptions concerning $C^{\Theta}$, that the then derived 'distance' is a quasi-metric and that the relationship between ∆ and Θ is then:

\[ \Delta(S,T) = \Theta(S,S) - \Theta(S,T) \qquad (6) \]

Though it is asymmetric, and so not a distance-metric, it does satisfy the triangle-inequality $\Delta(X,Z) \leq \Delta(X,Y) + \Delta(Y,Z)$, and substituting (6) into the triangle-inequality and re-arranging again gives an upper bound which might be used to accelerate a similarity range query: $\Theta(S,T_2) \leq \Theta(S,T_1) + \Theta(T_2,T_2) - \Theta(T_2,T_1)$.

Though again this similarity to distance conversion is not sought in the context of finding P- or N-duals, Stojmirovic and Yu's equation in (6) does make the derived distance an N-dual of the similarity. This is not, however, inconsistent with the example in section 3.2 of a similarity with no N-dualizing distance. Stojmirovic and Yu's conversion generates asymmetric insertion and deletion entries in the distance cost-table $C^{\Delta}$, whereas the proof in section 3.2 concerned the impossibility of an N-dualizing distance with symmetric insertion and deletion entries.
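The contrast between relations (5) and (6) with respect to neighbour-ordering can be made explicit. The following is a short worked step, assuming that an N-dual is a distance whose ascending ranking of candidates $T$, for each fixed $S$, coincides with the descending ranking by similarity; $T$ and $T'$ are two arbitrary candidates introduced here for illustration.

```latex
\begin{align*}
\text{by (6):}\quad \Delta(S,T) - \Delta(S,T') &= \Theta(S,T') - \Theta(S,T) \\
\text{by (5):}\quad \Delta(S,T) - \Delta(S,T') &=
   \bigl[\Theta(T,T) - \Theta(T',T')\bigr] - 2\bigl[\Theta(S,T) - \Theta(S,T')\bigr]
\end{align*}
```

Under (6) the term $\Theta(S,S)$ cancels, so $\Delta(S,T) \leq \Delta(S,T')$ exactly when $\Theta(S,T) \geq \Theta(S,T')$. Under (5) the candidate-dependent self-similarities $\Theta(T,T)$ and $\Theta(T',T')$ remain, so the two rankings can disagree.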

Our findings on the various order-relating conjectures concern notions with specific, though widely used, definitions (Defs. 1, 2, 3 and 4). There are other closely related notions, and the corresponding questions concerning these have not been addressed. One variant is stochastic: in a stochastic similarity, probabilities are assigned to aspects of a mapping and multiplied. We conjecture that these will be A-, N- and P-dualisable to distance. This is because, under a logarithmic mapping, it seems such stochastic variants can be exactly simulated by a similarity as we have defined it. In the resulting table, all $C^{\Theta}(x,y) \leq 0$, allowing the (ii) conversion of Lemma 1 to define a $C^{\Delta}$, choosing $\delta = 0$. There are also normalised variants, which we have not considered. Throwing the net very much more widely, (Chen et al., 2009) study relationships between distance and similarity measures in a very general setting, not restricted to measures based on sequence or tree alignment. Parallel to the well-known axioms of a distance-measure, they propose a set of similarity axioms, and they define conversions from similarity to distance and in the other direction, showing that the derived score satisfies the relevant axioms if the score that is input to the conversion does. Their work, however, does not address the question whether the conversions give N- or P-duals, that is, whether they preserve relevant orderings.

Concerning directions for further work, the empirical investigation in section 4 was quite preliminary. For the Kendall-tau comparison of distance and similarity neighbourhoods, we looked at just one particular baseline distance and one particular baseline similarity, and compared only to A-duals as given by Lemma 1, so clearly there are other possibilities one could consider here. One is Spiro and Macura's relation in (5). The Appendix notes some further A-dualizing conversions, from distance to similarity and from similarity to distance, which might be considered. It is also the case that we applied the Kendall-tau comparison to full rankings, and it would be of interest to look also at top-k ranking, as has been done for vector- and set-based measures (Lesot and Rifqi, 2010).

ACKNOWLEDGEMENTS

This research is supported by the Science Foundation

Ireland (Grant 07/CE/I1142) as part of the Centre for

Next Generation Localisation (www.cngl.ie) at Trin-

ity College Dublin.

REFERENCES

Alves, C. E. R., Cáceres, E. N., and Dehne, F. (2002). Parallel dynamic programming for solving the string editing problem on a CGM/BSP. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '02, pages 275–281. ACM.

Batagelj, V. and Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12(1):73–90.

Bernard, M., Boyer, L., Habrard, A., and Sebban, M. (2008). Learning probabilistic models of tree edit distance. Pattern Recogn., 41(8):2611–2629.

Bose, R. P. J. C. and van der Aalst, W. M. P. (2009). Context aware trace clustering: Towards improving process mining results. In SIAM International Conference on Data Mining, SDM, pages 401–412.

Chen, S., Ma, B., and Zhang, K. (2009). On the similarity metric and the distance metric. Theoretical Computer Science, 410(24-25):2365–2376.

Emms, M. (2010). Trainable tree distance and an application to question categorisation. In KONVENS 2010.

Emms, M. and Franco-Penya, H. (2011). Data-set used in Kendall-Tau experiments. www.scss.tcd.ie/Martin.Emms/SimVsDistData.

Gusfield, D. (1997). Algorithms on strings, trees, and sequences. Cambridge Univ. Press.

Hajič, J., Ciaramita, M., Johansson, R., Kawahara, D., Meyers, A., Nivre, J., Surdeanu, M., Xue, N., and Zhang, Y. (2009). The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009).

Herrbach, C., Denise, A., Dulucq, S., and Touzet, H. (2006). Alignment of RNA secondary structures using a full set of operations. Technical Report 145, LRI.

Kendall, M. G. (1945). The treatment of ties in ranking problems. Biometrika, 33(3):239–251.

Kondrak, G. (2003). Phonetic alignment and similarity. Computers and the Humanities, 37.

Kuboyama, T. (2007). Matching and Learning in Trees. PhD thesis, Graduate School of Engineering, University of Tokyo.

Lesot, M.-J. and Rifqi, M. (2010). Order-based equivalence degrees for similarity and distance measures. In Proceedings of the Computational intelligence for knowledge-based systems design, and 13th international conference on Information processing and management of uncertainty, IPMU'10, pages 19–28, Berlin, Heidelberg. Springer-Verlag.

Omhover, J.-F., Rifqi, M., and Detyniecki, M. (2005). Ranking invariance based on similarity measures in document retrieval. In Adaptive Multimedia Retrieval, pages 55–64.

Ristad, E. S. and Yianilos, P. N. (1998). Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532.

Smith, T. F. and Waterman, M. S. (1981). Comparison of biosequences. Advances in Applied Mathematics, 2(4):482–489.

Spiro, P. A. and Macura, N. (2004). A local alignment metric for accelerating biosequence database search. Journal of Computational Biology, 11(1):61–82.

Stojmirovic, A. and Yu, Y.-K. (2009). Geometric aspects of biological sequence comparison. Journal of Computational Biology, 16:579–610.

Tai, K.-C. (1979). The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):433.

Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168–173.

Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18:1245–1262.

APPENDIX

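Proof of Alignment Sum Property from Lemma 1. In the proof of Lemma 1 it was claimed, with $C^{\Delta}$ and $C^{\Theta}$ related according to the (i) or (ii) conversions, that for any alignment α, $\Delta(\alpha) + \Theta(\alpha) = \delta/2 \times (2|M| + |D| + |I|)$. This is proven as follows.

If defining $C^{\Theta}$ from $C^{\Delta}$ by (i), for Θ(α) we have:

\[
\begin{aligned}
& \sum_{(i,j) \in M} [\delta - C^{\Delta}(i,j)] - \sum_{i \in D} [C^{\Delta}(i,\lambda) - \delta/2] - \sum_{j \in I} [C^{\Delta}(\lambda,j) - \delta/2] \\
&= \delta\left(|M| + \frac{|D|}{2} + \frac{|I|}{2}\right) - \sum_{(i,j) \in M} [C^{\Delta}(i,j)] - \sum_{i \in D} [C^{\Delta}(i,\lambda)] - \sum_{j \in I} [C^{\Delta}(\lambda,j)] \\
&= \frac{\delta}{2}\,(2|M| + |D| + |I|) - \Delta(\alpha)
\end{aligned}
\]

If defining $C^{\Delta}$ from $C^{\Theta}$ by (ii), for ∆(α) we have:

\[
\begin{aligned}
& \sum_{(i,j) \in M} [\delta - C^{\Theta}(i,j)] + \sum_{i \in D} [C^{\Theta}(i,\lambda) + \delta/2] + \sum_{j \in I} [C^{\Theta}(\lambda,j) + \delta/2] \\
&= \delta\left(|M| + \frac{|D|}{2} + \frac{|I|}{2}\right) - \sum_{(i,j) \in M} [C^{\Theta}(i,j)] + \sum_{i \in D} [C^{\Theta}(i,\lambda)] + \sum_{j \in I} [C^{\Theta}(\lambda,j)] \\
&= \frac{\delta}{2}\,(2|M| + |D| + |I|) - \Theta(\alpha)
\end{aligned}
\]

Hence in either case the claim holds.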
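The property can also be checked numerically on a small example. The sketch below is illustrative only: conversion (i) is read off from the first line of the proof above (swap entries become $\delta - C^{\Delta}$, deletion and insertion entries become $C^{\Delta} - \delta/2$), an alignment is represented simply by the label pairs in $M$ and the labels in $D$ and $I$, and the names `LAM`, `delta_score`, `theta_score` and `conversion_i` are hypothetical.

```python
import math

LAM = None  # hypothetical marker standing in for the empty label (lambda)


def delta_score(M, D, I, c_delta):
    """Distance score of an alignment: all costs are summed."""
    return (sum(c_delta[(x, y)] for x, y in M)
            + sum(c_delta[(x, LAM)] for x in D)
            + sum(c_delta[(LAM, y)] for y in I))


def theta_score(M, D, I, c_theta):
    """Similarity score: swap entries are added, deletions/insertions subtracted."""
    return (sum(c_theta[(x, y)] for x, y in M)
            - sum(c_theta[(x, LAM)] for x in D)
            - sum(c_theta[(LAM, y)] for y in I))


def conversion_i(c_delta, delta):
    """Similarity table from a distance table, as used in the proof above."""
    c_theta = {}
    for (x, y), cost in c_delta.items():
        if x is LAM or y is LAM:
            c_theta[(x, y)] = cost - delta / 2   # deletion or insertion entry
        else:
            c_theta[(x, y)] = delta - cost       # swap or match entry
    return c_theta


# Toy unit-cost distance table over the alphabet {a, b}.
c_delta = {("a", "a"): 0, ("b", "b"): 0, ("a", "b"): 1, ("b", "a"): 1,
           ("a", LAM): 1, ("b", LAM): 1, (LAM, "a"): 1, (LAM, "b"): 1}
delta = 3
c_theta = conversion_i(c_delta, delta)

# An alignment with two matched pairs, one deletion and two insertions.
M, D, I = [("a", "a"), ("a", "b")], ["b"], ["a", "a"]
lhs = delta_score(M, D, I, c_delta) + theta_score(M, D, I, c_theta)
rhs = delta / 2 * (2 * len(M) + len(D) + len(I))
```

Here the distance score is 4 and the similarity score is 6.5, and their sum equals $\delta/2 \times 7 = 10.5$, which depends only on the sizes of $M$, $D$ and $I$ and not on the labels involved.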

Definition of Kendall-Tau (with Ties). Let $N_1$ and $N_2$ be two assignments of ranks to the same set of objects, $U$ (with the possibility of ties). Where $P$ is the set of all two-element sets of distinct objects from $U$, define a penalty function $p$ on any $\{T_i,T_j\} \in P$, such that (i) $p(\{T_i,T_j\}) = 1$ if the order in $N_1$ is the reverse of the order in $N_2$, (ii) $p(\{T_i,T_j\}) = 0.5$ if there is a tie in $N_1$ but not in $N_2$ or vice-versa and (iii) $p(\{T_i,T_j\}) = 0$ otherwise. The Kendall-Tau distance (with ties) between $N_1$ and $N_2$, $\tau(N_1,N_2)$, is

\[ \tau(N_1,N_2) = \sum_{\{T_i,T_j\} \in P} p(\{T_i,T_j\}) \times \frac{2}{m \times (m-1)} \]
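A direct transcription of this definition into Python might look as follows. It is a sketch under assumptions: rank assignments are represented as dictionaries from objects to numeric ranks, $m$ is taken to be the number of objects in $U$, and the normalising factor is taken to be $2/(m(m-1))$, the reciprocal of the number of pairs, so that the result lies between 0 and 1. The function name is hypothetical.

```python
from itertools import combinations


def kendall_tau_with_ties(rank1, rank2):
    """Kendall-Tau distance between two rank assignments over the same objects."""
    objects = list(rank1)
    m = len(objects)
    if m < 2:
        raise ValueError("at least two objects are needed")
    total = 0.0
    for a, b in combinations(objects, 2):
        d1 = rank1[a] - rank1[b]
        d2 = rank2[a] - rank2[b]
        if d1 * d2 < 0:
            total += 1.0               # (i) opposite orders
        elif (d1 == 0) != (d2 == 0):
            total += 0.5               # (ii) tied in exactly one ranking
        # (iii) otherwise no penalty
    return total * 2 / (m * (m - 1))


same = kendall_tau_with_ties({"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "c": 3})
reverse = kendall_tau_with_ties({"a": 1, "b": 2, "c": 3}, {"a": 3, "b": 2, "c": 1})
one_tie = kendall_tau_with_ties({"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 1, "c": 3})
```

Identical rankings give 0, fully reversed rankings give 1, and a single pair tied in only one of the two rankings contributes half a penalty.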

Details of the Data Set for Kendall-Tau Experi-

ments. Section 4 reports experiments quantifying the

difference between neighbour ﬁles computed by dis-

tance and similarity, when the two are related by the

conversion in Lemma 1. The experiments used a set

of 1334 trees, taking each tree in turn and ranking all

the remaining trees. The trees represent syntax structures and originate in a data-set which was used in a shared-task on identifying inter-node semantic dependencies (Hajič et al., 2009). See (Emms and Franco-Penya, 2011) for download information concerning this data.

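Further A-dualizing Conversions. Concerning A-duals, there are, besides the conversions given in Lemma 1, others which also generate A-duals.

Lemma 7. For any $C^{\Delta}$, for any $k$, let $C^{\Theta}$ be defined according to (iii) below.

\[
\text{(iii)} \quad
\begin{cases}
C^{\Theta}(x,\lambda) = k\,C^{\Delta}(x,\lambda) \\
C^{\Theta}(\lambda,y) = k\,C^{\Delta}(\lambda,y) \\
C^{\Theta}(x,y) = (1-k)\,(C^{\Delta}(x,\lambda) + C^{\Delta}(\lambda,y)) - C^{\Delta}(x,y)
\end{cases}
\]

Then for any $\alpha : S \mapsto T$,

\[ \Delta(\alpha) + \Theta(\alpha) = (1-k) \times \Bigl( \sum_{s \in S} C^{\Delta}(s,\lambda) + \sum_{t \in T} C^{\Delta}(\lambda,t) \Bigr) \]

Lemma 8. For any $C^{\Theta}$, for any $k$, let $C^{\Delta}$ be defined according to (iv) below.

\[
\text{(iv)} \quad
\begin{cases}
C^{\Delta}(x,\lambda) = C^{\Theta}(x,\lambda) + k\,C^{\Theta}(x,x) \\
C^{\Delta}(\lambda,y) = C^{\Theta}(\lambda,y) + k\,C^{\Theta}(y,y) \\
C^{\Delta}(x,y) = k\,(C^{\Theta}(x,x) + C^{\Theta}(y,y)) - C^{\Theta}(x,y)
\end{cases}
\]

Then for any $\alpha : S \mapsto T$,

\[ \Delta(\alpha) + \Theta(\alpha) = k \times \Bigl( \sum_{s \in S} C^{\Theta}(s,s) + \sum_{t \in T} C^{\Theta}(t,t) \Bigr) \]

The proofs of these follow a similar pattern to that of Lemma 1 and are omitted. In a similar fashion both these conversions will give A-duals.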
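Since the proofs are omitted, a small numerical check of Lemma 7 may be helpful. The sketch below is illustrative only: it assumes the same scoring conventions as in the proof for Lemma 1 (distance sums all costs, similarity adds swap entries and subtracts deletion and insertion entries), represents an alignment by the label pairs in $M$ and the labels in $D$ and $I$, and uses hypothetical names and made-up table values throughout.

```python
import math

LAM = None  # hypothetical marker standing in for the empty label (lambda)


def delta_score(M, D, I, c_delta):
    return (sum(c_delta[(x, y)] for x, y in M)
            + sum(c_delta[(x, LAM)] for x in D)
            + sum(c_delta[(LAM, y)] for y in I))


def theta_score(M, D, I, c_theta):
    return (sum(c_theta[(x, y)] for x, y in M)
            - sum(c_theta[(x, LAM)] for x in D)
            - sum(c_theta[(LAM, y)] for y in I))


def conversion_iii(c_delta, alphabet, k):
    """Similarity table from a distance table, following (iii) of Lemma 7."""
    c_theta = {}
    for x in alphabet:
        c_theta[(x, LAM)] = k * c_delta[(x, LAM)]
        c_theta[(LAM, x)] = k * c_delta[(LAM, x)]
    for x in alphabet:
        for y in alphabet:
            c_theta[(x, y)] = ((1 - k) * (c_delta[(x, LAM)] + c_delta[(LAM, y)])
                               - c_delta[(x, y)])
    return c_theta


def lemma7_rhs(M, D, I, c_delta, k):
    """(1 - k) times the cost of deleting all of S plus inserting all of T."""
    source = [x for x, _ in M] + list(D)   # every source label is matched or deleted
    target = [y for _, y in M] + list(I)   # every target label is matched or inserted
    return (1 - k) * (sum(c_delta[(s, LAM)] for s in source)
                      + sum(c_delta[(LAM, t)] for t in target))


c_delta = {("a", "a"): 0, ("b", "b"): 0, ("a", "b"): 2, ("b", "a"): 2,
           ("a", LAM): 1, ("b", LAM): 3, (LAM, "a"): 1, (LAM, "b"): 3}
k = 0.5
c_theta = conversion_iii(c_delta, "ab", k)

M, D, I = [("a", "a"), ("a", "b")], ["b"], ["a"]
lhs = delta_score(M, D, I, c_delta) + theta_score(M, D, I, c_theta)
rhs = lemma7_rhs(M, D, I, c_delta, k)
```

In this example the distance score is 6 and the similarity score is -1, and their sum equals the right-hand side of Lemma 7, which is 5. Because that right-hand side depends only on $S$ and $T$ and not on the alignment, the alignment minimising ∆ is the one maximising Θ, which is why the conversion yields A-duals. A check of Lemma 8 would follow the same pattern.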

ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods