STATISTICAL SIGNIFICANCE IN OMIC DATA ANALYSES

Alternative/Complementary Method for Efﬁcient Automatic Identiﬁcation of

Statistically Signiﬁcant Tests in High Throughput Biological Studies

Christine Nardini, Luca Benini

DEIS, University of Bologna, Viale Risorgimento 2, Bologna, Italy

Michael D. Kuo

UCSD Medical Center HillCrest, 200 West Arbor Drive, San Diego, CA, USA

Keywords:

Statistical testing, statistical signiﬁcance, multiple hypothesis testing, false discovery rate, statistical resam-

pling methods, statistical meta-analysis, omic data.

Abstract:

The post-Genomic Era is characterized by the proliferation of high-throughput platforms that allow the par-

allel study of a complete body of molecules in one single run of experiments (omic approach). Analysis and

integration of omic data represent one of the most challenging frontiers for all the disciplines related to Sys-

tems Biology. From the computational perspective this requires, among others, the massive use of automated

approaches in several steps of the complex analysis pipeline, often consisting of cascades of statistical tests.

In this frame, the identiﬁcation of statistical signiﬁcance has been one of the early challenges in the handling

of omic data and remains a critical step due to the multiple hypotheses testing issue, given the large number

of hypotheses examined at one time. Two main approaches are currently used: p-values based on random

permutation approaches and the False Discovery Rate. Both give meaningful and important results; however, they suffer, respectively, from being computationally heavy (due to the large amount of data that has to be generated) and from being extremely flexible with respect to the definition of the significance threshold, leading to difficulties in standardization. We present here a complementary/alternative approach to these current ones and

discuss performances and limitations.

1 INTRODUCTION

In recent times high-throughput devices for genome-

wide analyses have greatly increased in size, scope

and type. In the post-Genomic Era, several solutions

have been devised to extend the successful approach

adopted for gene expression analyses with microar-

ray technology to other bodies of data such as pro-

teomes, DNA copy number, single nucleotide poly-

morphisms, promoter sites and many more (Nardini

et al., 2006). These data sources, and notably their

integration, represent the future of molecular biology;

for this reason the elucidation and deﬁnition of tools

and methods suited to handle the data produced by

these high-throughput devices is of great importance.

Early methods for such analyses mainly dealt with

gene expression data, their goal being

to extract items that appear to have coherent trends

among themselves (in this context commonly called

unsupervised methods) or with respect to external fea-

tures, such as clinical markers (supervised methods).

Both types of approaches have been used for example

for the classiﬁcation of subtypes of poorly understood

diseases with unpredictable outcomes (Ramaswamy

et al., 2003; Lapointe et al., 2004). Currently, other

approaches that take advantage of larger and more diverse

sources of information are being devised to address

questions of varying complexity in different areas of

research rooted in molecular biology. These methods

cover a broad variety of applications, from the study

of complex hereditary diseases (Rossi et al., 2006)

to the identiﬁcation of radiological traits’ surrogate

markers (the molecular origin of a clinical trait) for

enabling non-invasive personalized medicine (Segal

et al., 2007). Overall, besides the variety and com-

plexity of the analyses and methods adopted, some in-

variants can be identiﬁed. The most common atomic

step is the identiﬁcation on the large scale of similari-

ties or associations among molecular behaviors. Such

association measures consist for example of scores

that evaluate similarities across several samples of

genes’ expression proﬁles, or genetic coherence in

genes copy number or deletion, and more. Coher-

ence among expression proﬁles and other association


Nardini C., Benini L. and D. Kuo M. (2008).

STATISTICAL SIGNIFICANCE IN OMIC DATA ANALYSES - Alternative/Complementary Method for Efﬁcient Automatic Identiﬁcation of Statistically

Signiﬁcant Tests in High Throughput Biological Studies.

In Proceedings of the First International Conference on Bio-inspired Systems and Signal Processing, pages 56-63

DOI: 10.5220/0001059900560063

Copyright © SciTePress

measures can be assessed by means of statistical techniques, namely, by computing a measure of trend similarity (test score, θ) and evaluating the likelihood of this measure to occur by chance (α-level or p-value). The test score is then assumed to be either a measure of actual similarity or only a random effect, based on the value of the associated p-value. The p-value represents the probability of obtaining a score at least as extreme as the observed one when there is in fact no relationship, and is thus read as the risk of being wrong when assuming that the score represents an actual similarity. This error (type I error) can happen for non-extreme values of the test θ that are difficult to classify as good or bad, and results in erroneously refuting the null hypothesis ($H_0: \theta = 0$), which assumes that there is no relationship, when in fact the items are unrelated. The scientific community typically considers meaningful (i.e. statistically significant) the test scores that are coupled to p-values lower than or equal to one of the following nominal p-values: 0.05, 0.01, 0.001. These values represent the accepted probability of committing type I errors. Given these definitions, the highly dimensional nature of genome-wide data has posed problems and challenges to conventional biostatistical approaches. Indeed, when performing in parallel such a large number of tests, type I errors inherently rise in number, since over a large number of items, the possibility of faults increases.

For this reason, p-values need to be readjusted in a

more conservative way, accounting for the so called

multiple hypothesis testing issue. The most classical

technique to account for this problem is the Bonfer-

roni correction (R.R.Sokal and F.J.Rohlf, 2003) that

simply multiplies the actual p-value of every single

test by the total number of tests observed. However,

this approach is not considered viable in omic studies, as it often leads to the rejection of too many tests, since none of the corrected p-values is smaller than any of the nominal p-values. An alternative and

less conservative approach to this problem is the gen-

eration of a random distribution, based on random re-

sampling or on the generation of scores obtained from

the randomization of the data. Such approaches allow one

to build a distribution that represents the population’s

behavior, and can thus be used to test the hypothesis

of interest. When operating with omic data, another

statistic, the False Discovery Rate (FDR) has been in-

troduced (Benjamini and Hochberg, 1995; Storey and

Tibshirani, 2003; Tusher et al., 2001). Like the p-

value, the FDR measures the false positives, however

while the p-value controls the number of false positives over the number of truly null tests, the FDR controls the number of false positives over the number of significant tests. The utility of this statistic is undeniable; however, its interpretation is far less standardized than the better known p-value, and thus, very often, the value of acceptance of a test based on FDR is

much more ﬂexible and dependent on the investigator

experience. Globally, these characteristics make the

results assessed by FDR highly dependent on the re-

jection level the investigator chooses. This makes it

difﬁcult to automate with high parallelism the iden-

tiﬁcation of statistically signiﬁcant hypotheses. This

problem can become relevant due to the increasingly

common necessity to merge different sources of in-

formation to assess the validity of a given biologi-

cal hypothesis. Examples of such circumstances arise

whenever, for example, the analysis aims at reﬁning,

by means of cascades of statistical tests, a set of genes

candidate to explain a biological assumption. The

hypothesis in fact is reﬁned collecting information

across various databases or other forms of a priori

knowledge, that progressively ﬁlter out the spurious

data -only as an example see various tools presented

in (Tifﬁn et al., 2006; Rossi et al., 2006). To be ef-

ﬁcient, the analysis requires the result of each ﬁlter-

ing step to be automatically sent to the following one.

Thus the possibility to assess significance by means of

universally accepted values of signiﬁcance becomes

relevant. This latter observation was one of the stim-

uli motivating the search for an alternative/integrative

approach to the multiple hypotheses problem encoun-

tered when dealing with genomic datasets. We also

wanted this method to be reasonably efficient to

compute. We thus approached the problem based

on techniques that allow the intrinsic correction of

p-values in case of multiple tests (meta analyses ap-

proaches) used for the combination of various statis-

tical tests. Among them, we turned our attention to

the category of the omnibus tests (L.B.Hedges and

I.Olkin, 1985). These approaches are non-parametric,

meaning that they do not depend on the distribution

of the underlying data, as long as the test statistic is

continuous. In fact, p-values derived from such tests

have a uniform distribution under the null hypothe-

sis, regardless of the test statistic or the distribution

they have been derived from. However, omnibus tests

suffer from a strong limitation: they can only be used to assess whether there is a superior outcome in any of the

studies performed. This means that the combined sig-

niﬁcance is not a measure of the average signiﬁcance

of the studies performed. An omnibus test therefore

cannot be used as is, to assess the global statistical

validity of the number of tests considered simultane-

ously. Thus, we manipulated this approach to make it

applicable to the deﬁnition of a signiﬁcance threshold.

The main advantage of our solution is twofold. On

one side the p-values can be computed in very reason-

able times and can thus help managing the computa-

tional issues related to permutations techniques; on


the other side they represent p-values for which nominal thresholds of significance (e.g. 0.05, 0.01, 0.001)

can be applied, and can overcome the threshold selec-

tion issue faced when using FDR approaches. Addi-

tionally, this method appears to perform slightly bet-

ter than other methods in avoiding the selection of

false positives. However, this is coupled to a partially

diminished ability in identifying correctly true positives in complex patterns of association. These considerations support the findings of several authors that strongly suggest validating the results obtained from omic studies through the use of different techniques and thresholds of significance, given the highly noisy

nature of the data (Pan et al., 2005).

2 RELATED WORK

Two main methodologies are currently being used to

approach the multiple hypothesis testing issue. The

ﬁrst is based on the principles that deﬁne the resam-

pling statistical approaches (R.R.Sokal and F.J.Rohlf,

2003). In particular we adopted the permutation

method that requires the construction of a null dis-

tribution to which to compare the actual data. This

distribution must be built from the generation of a

large number of random data. When the distribu-

tion is built using the randomized data generated by

all the tests, the corresponding p-value is corrected

for these same multiple hypotheses. This represents

a structurally simple, robust, but computationally in-

tensive approach, given the large numbers involved

in the analysis of omic data. The computational ef-

ﬁciency issue can become extremely relevant, since

most of the interpreted languages commonly used for

their large libraries of bioinformatics related functions

(notably R and the Bioconductor Project (Gentleman

et al., 2005), and Matlab), cannot reasonably han-

dle such approaches. Even with the recent improve-

ments for (implicit) parallelization of the computa-

tion, time lags for the evaluation of the results re-

main large. Moreover, for large datasets, compiled

languages such as C also require intensive and long

lasting computational efforts, unless specific architectures are adopted to enhance efficiency. The second approach consists of novel methods, purposely introduced to handle omic data, that define the concept of False Discovery Rate. This statistic comes in a number of flavors, and relies on complex statistical assumptions. A full description is beyond the scope of this paper; here we briefly describe three of

the most used approaches: (i) the pioneering work

of Benjamini (Benjamini and Hochberg, 1995); (ii)

the deﬁnition of the q-value (Storey and Tibshirani,

2003); (iii) the FDR adopted in the tool Signiﬁcance

Analysis of Microarray -SAM, (Tusher et al., 2001)- a

widespread software used for the analysis of microar-

ray data.

Benjamini FDR: This approach controls the FDR

by modifying the p-values obtained on a single test,

rescaling it in the following way:

$$\mathrm{FDR}_{BEN} = \frac{K\,p_i}{i}\sum_{j=1}^{K} j^{-1},$$

where $p_i$ represents the i-th of the K single p-values.
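As a rough illustration of the rescaling just described, the following Python sketch applies it to a list of p-values. It reads the formula as the product of K p_i / i and the harmonic sum, and it assumes (the text does not state this) that the p-values are sorted in ascending order with i the 1-based rank. The function name `benjamini_rescale` is hypothetical, and capping the result at 1 is a convenience added here.

```python
def benjamini_rescale(pvalues):
    """Rescale p-values as K * p_i / i * sum_{j=1..K} 1/j.

    Assumption: i is the 1-based rank of p_i after ascending sort.
    """
    K = len(pvalues)
    harmonic = sum(1.0 / j for j in range(1, K + 1))
    ranked = sorted(pvalues)
    # Cap at 1.0 so the rescaled value can still be read as a rate (added here).
    return [min(1.0, K * p / i * harmonic) for i, p in enumerate(ranked, start=1)]


if __name__ == "__main__":
    print(benjamini_rescale([0.001, 0.02, 0.04, 0.5]))
```

The values returned follow the sorted order of the input, not the original order.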

q-value: The q-value is the minimum false discov-

ery rate. This measure can be approximated by the

ratio of the number of false positives over the number of significant tests; the implementation of the q-value provides several options to evaluate this estimate and to compare it to the corresponding p-values.

$q \approx \min(\#\text{false positives} / \#\text{significant tests})$.

SAM FDR: SAM is a tool that allows the ex-

traction of signiﬁcant genes that help differentiate 2

or more sample classes by means of various scores

suited to answer different questions (i.e. depending

on the number of sample classes observed and on

the meaning of the scores deﬁning the classes, such

as survival times, experimental points in time course

experiments etc.). Statistical validation of the score

value produced by SAM is performed by the genera-

tion of a distribution of random score values. These

scores are evaluated by means of random permuta-

tions of the class labels. These new values, along with

the ones from the original classiﬁcation are used to

evaluate the FDR as the average of falsely signiﬁcant

items:

$$\mathrm{FDR}_{SAM} = \frac{\#\text{signif. permuted scores} \,/\, \#\text{permutations}}{\#\text{signif. actual scores}},$$

i.e. the number

of items with permuted test scores called signiﬁcant

divided by the number of permutations over the num-

ber of items called signiﬁcant in actual data.
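The ratio described above can be written as a short Python sketch. It only restates the formula as given in the text and is not taken from the SAM software; the function name `sam_fdr` and the choice to return NaN when no item is called significant are assumptions made for illustration.

```python
import math


def sam_fdr(n_signif_permuted, n_permutations, n_signif_actual):
    """(# significant permuted scores / # permutations) / # significant actual scores."""
    if n_signif_actual == 0:
        # Assumption: the ratio is left undefined when nothing is called significant.
        return math.nan
    return (n_signif_permuted / n_permutations) / n_signif_actual


if __name__ == "__main__":
    # 50 permuted scores called significant over 100 permutations, 10 in the actual data.
    print(sam_fdr(50, 100, 10))
```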

The q-value approach is one of the most

widespread, both because of its quality and because of

the various and user-friendly implementations the authors have made available. For this reason we chose this method for comparison to ours. In general, FDR scores represent extremely valuable information

while dealing with omic data, however, the main is-

sue to the fully automated use of these techniques lies

in the ﬂexible acceptance of the threshold values for

signiﬁcance. In other words the investigator can set

his threshold for the acceptance of the False Discov-

ery Rate, but no universally accepted thresholds have

been recognized. This issue has been pointed out for

example in (Cheng et al., 2004). In this work the au-

thors designed three other statistical scores to help in

the choice of the threshold for signiﬁcance. Among

these scores, two are designed to assess general sig-

niﬁcance threshold criteria for large-scale multiple

tests and one is based on existing biological knowl-

edge. Our method does not represent a novel way to

BIOSIGNALS 2008 - International Conference on Bio-inspired Systems and Signal Processing


evaluate FDR, but it deﬁnes a p-value, for this reason

universally accepted thresholds for signiﬁcance can

be adopted.

More recently and independently from our ap-

proach (Yang and Yang, 2006) have designed a

method based on omnibus tests to improve the identi-

ﬁcation of the FDR. Again, one of our goals is to pro-

vide an efﬁcient way to evaluate a p-value that takes

into account the multiple hypotheses tested, in order

to be able to adopt the thresholds of signiﬁcance ac-

cepted by the scientiﬁc community (0.05,0.01,0.001),

easier to automate in long pipelines of tests. In this

paper we show that the p-value obtained with manipulation of the inverse $\chi^2$ method (one of the omnibus tests) can also be used directly as a measure of significance for the identification of statistically significant tests.

3 METHOD

We chose as the base for our approach the inverse $\chi^2$ method (L.B.Hedges and I.Olkin, 1985), an omnibus statistical test used to ascertain if at least one among several tests is significant, by evaluation of the following statistics: $S(k) = -2\sum_{i=1}^{k}\ln(p_i)$ and $s(k) = \chi^2(S, 2k)$, where $k = 1 \ldots K$ are the tests performed and $p_i$ the p-value of the i-th test. $S$ has a $\chi^2(s, 2k)$ distribution, where $s$ is the p-value of the $\chi^2$ distribution with $2k$ degrees of freedom, and represents the significance of the combined tests, meaning that it can assess if any of the tests can be considered significant, accounting for the total number of $K$ tests performed. Thus, in the following, $s$ will indicate the p-value we can use for assessing the statistical significance of the tests taking into account the multiple hypothesis issue, while $p$ will indicate the significance of the single test. The score θ is the value resulting from the statistical test. Making use of the inverse $\chi^2$ method means testing the null hypothesis $H_0: H_{0,1} = \ldots = H_{0,K} = 0$. Values of $s > 0.05$ indicate that $H_0$ cannot be rejected and thus that it holds for all the subhypotheses $H_{0,i} = 0$, $i \in [1, K]$. Conversely, more than one combination of rejection and non rejection of single hypotheses $H_{0,i}$ is possible to justify the rejection of the global null hypothesis $H_0$.

For example all but one of the subhypotheses could

be null, or only one could be null etc. Evaluating s

on all the tests performed would be of no interest in

terms of deﬁning a global threshold for signiﬁcance.

In fact, while a non signiﬁcant value of s would indi-

cate that none of the items has a score value that al-

lows the rejection of the null hypothesis, a low value

of s (< 0.05) would only mean that at least one item’s

score is relevant to the rejection of the null hypothesis,

with no indication on which one(s) are the relevant

items. To overcome this limitation we ranked the test scores θ in ascending order (assuming that significant values of the test are represented by high values of the score), and ordered the p-values consistently. We then evaluated $s$ for sets of p-values of increasing size, starting from a set made of only the p-value corresponding to the worst test score, then adding at each iteration of this algorithm another p-value coupled to the immediately higher or equal (better) score (θ), and closing the last iteration with all the p-values. By induction (Equation 1) we can show that whenever the value of $s$ drops below any of the standard values of significance (0.05, 0.01, 0.001) the score corresponding to the last p-value added is the threshold for significance, since it represents the specific test that accounts for the rejection of the global null hypothesis $H_0$. By construction, at each iteration, the p-value added is always smaller, and correspondingly, due to the logarithm properties, $S$ shows a fast growth ($S(k) = -2\sum_{i=1}^{k}\ln(p_i)$). At the same time the parameter $k$ of the $\chi^2$ function grows linearly ($2 \cdot k$). Because of the shape of the $\chi^2$ function and because of the logarithm properties, if there are enough small p-values, $S$ becomes quickly and abruptly very large, and moves to behaviors typical of the ones on the right hand side of Figure 1(c), $\chi^2_k(S) \to 0$ for $k \to \infty$, $S \to \infty$. This gives $s$ its typical shape (shown in Figure 1(b)), with a very abrupt drop from values very close to 1 to values very close to 0.

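$$
\begin{array}{lll}
\text{For } i = 1: & s(i) > 0.05 \Rightarrow & H_0 \text{ not rej.},\; H_{0,1} \text{ not rej.} \\
\text{Let } i = n: & s(i) > 0.05 \Rightarrow & H_0 \text{ not rej.},\; H_{0,i} \text{ not rej.},\; \forall i \in [1,n] \\
\text{Then } i = n+1: & s(i) > 0.05 \Rightarrow & H_0 \text{ not rej.},\; H_{0,i} \text{ not rej.},\; \forall i \in [1,n+1] \\
 & s(i) \le 0.05 \Rightarrow & H_0 \text{ rej.},\; H_{0,i} \text{ not rej.},\; \forall i \in [1,n],\; H_{0,n+1} \text{ rej.}
\end{array}
\tag{1}
$$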
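The iterative procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' Matlab implementation. It assumes that s is the upper-tail probability of the $\chi^2$ distribution with 2k degrees of freedom, which has a closed form for even degrees of freedom; the function names `chi2_upper_tail` and `significance_threshold` and the clamping of very small p-values are hypothetical choices made here.

```python
import math


def chi2_upper_tail(S, k):
    """P(X >= S) for X chi-square distributed with 2k degrees of freedom.

    For even degrees of freedom: exp(-x) * sum_{j<k} x^j / j!, with x = S / 2.
    Terms are combined in log space to avoid overflow and underflow.
    """
    x = S / 2.0
    if x <= 0.0:
        return 1.0
    logs = [j * math.log(x) - math.lgamma(j + 1) - x for j in range(k)]
    top = max(logs)
    return min(1.0, math.exp(top) * sum(math.exp(v - top) for v in logs))


def significance_threshold(scores, pvalues, alpha=0.05):
    """Return (k, score, s) at the first iteration where s <= alpha, else None."""
    # Ascending score order: the worst score enters first, the best last.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    S = 0.0
    for k, idx in enumerate(order, start=1):
        p = max(pvalues[idx], 1e-300)  # assumption: clamp to keep log() finite
        S += -2.0 * math.log(p)        # S(k) = -2 * sum_{i<=k} ln(p_i)
        s = chi2_upper_tail(S, k)
        if s <= alpha:
            return k, scores[idx], s
    return None


if __name__ == "__main__":
    scores = [0.3, 0.9, 0.1, 0.2]
    pvalues = [0.7, 1e-6, 0.9, 0.8]
    print(significance_threshold(scores, pvalues, alpha=0.05))
```

Recomputing the tail sum at every iteration keeps the sketch short at the cost of quadratic work in the number of tests; an implementation aiming at the running times discussed in the paper would need a faster evaluation of the $\chi^2$ tail.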

Figure 1 shows an example of the trends of the variables involved in the evaluation of global significance: the statistics $S$ and $s$ that define the global significance, the test score θ and the corresponding single p-value that are the basic units of the analysis. The statistic $S$ represents the argument of the $\chi^2$ function and is associated to a given degree of freedom ($k$). For any given degree of freedom it is possible to identify the minimum value (here called $S_{id\alpha}$) for which the inverse $\chi^2$ function returns the suited probability α. Since $S_{id}$ is the minimum value, the p-value that represents the threshold for significance is associated to $k_{sign\alpha}$ and can be conveniently visualized as the point


[Figure 1: three panels. One panel plots $S$ and $S_{id}$ (α = 0.05) against iterations; one plots $s$, θ and $p$ against iterations; panel (c) plots $s$ against $S$ for k = 1, ..., 10.]

Figure 1: Graphical representation of the different scores involved in the analysis. Figure (a) deals with the statistic $S$ and $S_{id}$. Figure (b) plots the corrected p-value $s$, the absolute value of the correlation score θ and the single p-value $p$. Figure (c) shows the $\chi^2$ probability density function.

in which $k_{sign\alpha} = k \mid S_{id\alpha}(k) = S(k)$. Equivalently for $s$ the threshold for significance at a given nominal level α can be defined as $k_{sign} = \min\{k \in [1,K] \mid s(k) \le \alpha\}$. In

our experiments θ is the Spearman correlation score

(R.R.Sokal and F.J.Rohlf, 2003). Before processing

the test values we separated positive from negative

scores, and then performed the previously described

operations on the absolute values. This sign segre-

gation of the data has a two-fold objective. On one

side this fulﬁlls the requirement for the applicabil-

ity of the test since one tailed p-values are required.

On the other side it satisﬁes the biological necessity

to discern between signiﬁcantly over and under ex-

pressed genes, based on positive and negative values

of the test scores. As far as the permutation approach is concerned, we generated 1000 random permutations of each trait's values, as was done in other applications with this same goal (Liang et al., 2005). We then re-evaluated the θ scores for all 1000 randomized instances of each trait; these constitute the null distribution. For the FDR approach, we used the q-value R package with default settings. For the identification of significant items, we adopted as thresholds the same values we used for the p-value. The method was implemented in Matlab; scripts for the method are available upon request.
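For comparison, the permutation approach mentioned above can be sketched as follows. This is a simplified, per-profile variant written for illustration only: it permutes the trait values, recomputes the Spearman score, and reports a one-tailed empirical p-value, whereas the paper builds the null distribution from the randomized scores of all tests. The function names and the "+1" correction in the empirical p-value are assumptions, not part of the original method.

```python
import random


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            result[order[pos]] = mean_rank
        start = end + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # constant input: correlation treated as zero in this sketch
    return sxy / (sxx * syy) ** 0.5


def permutation_pvalue(profile, trait, n_perm=1000, seed=0):
    """Spearman score and one-tailed empirical p-value from trait permutations."""
    rng = random.Random(seed)
    rank_profile = ranks(profile)
    rank_trait = ranks(trait)
    observed = pearson(rank_profile, rank_trait)  # Spearman = Pearson on ranks
    shuffled = list(rank_trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permuting the values permutes their ranks
        if pearson(rank_profile, shuffled) >= observed:
            hits += 1
    # Assumption: add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    print(permutation_pvalue(list(range(20)), list(range(20)), n_perm=200, seed=1))
```

Each score costs one pass per permutation, which is where the extra factor p in the complexity of the permutation method comes from.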

3.1 Data

To test our method, we simulated the typical set up of

a common genomic experiment. Namely, we gener-

ated a random expression matrix 1000x100 (i.e. 1000

genes and 100 samples) and we deﬁned 5 external

traits for which we search the surrogate markers. In

other words, these external traits mimic any clini-

cal trait or molecular marker. The goal of the ex-

periment is to identify the genes associated to the

external traits, to deﬁne the traits’ surrogate mark-

ers. This approach is then used to investigate the

molecular etiology of commonly used clinical mark-

ers. Several examples of such approaches can be

found in literature, only as a sample see (Lapointe

et al., 2004; Liang et al., 2005). At ﬁrst, we tested

the method’s ability to recognize surrogate markers

of variable size. The surrogate markers were obtained

either by simple copy of expression proﬁles (in vary-

ing number of copies, namely 0,1,5), or by sum of

varying numbers of proﬁles (namely 5, 30). The ﬁrst

group of external traits (#1,#2,#3) provides both the

negative control (0 copies, obtained by elimination of

a randomly chosen expression proﬁle, and exported

as external trait) and helps measure the comparative ability of the 3 different approaches (FDR, permutations and our method) in extracting small clusters of correlated profiles (1, 5 copies). The second set of traits (#4, #5) tests the approach with more challenging data (sums of 5, 30 copies). To each expression value we added varying levels of Gaussian noise (0%, 50%, 100%) proportional to the expression value, to better mimic real data (Bansal et al., 2007). To avoid specific case results, we replicated our approach 3 times for each noise level and averaged the

results of speciﬁcity, sensitivity, positive and negative

predictive value. We observed the approach for the

3 levels of signiﬁcance 0.05,0.01, 0.001. Finally, we

tested our method to assess its reliability with variable

numbers of genes.
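One possible reading of this simulated set-up is sketched below in Python. Several details are assumptions made for illustration, since the text does not specify them: baseline expression values are drawn from a standard normal distribution, noise is applied as v · (1 + level · N(0,1)) to the matrix only, and the rows used for copies and sums are chosen by position rather than at random. The function name `simulate` is hypothetical.

```python
import random


def simulate(n_genes=1000, n_samples=100, noise=0.5, seed=0):
    """Return (expression matrix, traits) for one simulated experiment.

    Requires n_genes >= 41 because fixed row positions are used below.
    """
    rng = random.Random(seed)
    # Assumption: baseline expression values are standard normal.
    genes = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
             for _ in range(n_genes + 1)]
    traits = {}
    # Trait 1 (negative control): a profile removed from the matrix, 0 copies left.
    traits[1] = genes.pop()
    # Trait 2: a profile that stays in the matrix (1 copy).
    traits[2] = list(genes[0])
    # Trait 3: a profile present in 5 copies (rows 1 to 5 are identical).
    for row in range(2, 6):
        genes[row] = list(genes[1])
    traits[3] = list(genes[1])
    # Traits 4 and 5: sums of 5 and of 30 distinct profiles.
    traits[4] = [sum(col) for col in zip(*genes[6:11])]
    traits[5] = [sum(col) for col in zip(*genes[11:41])]
    # Gaussian noise proportional to each expression value (matrix only).
    noisy = [[v * (1.0 + noise * rng.gauss(0.0, 1.0)) for v in row]
             for row in genes]
    return noisy, traits


if __name__ == "__main__":
    matrix, traits = simulate(noise=0.5, seed=42)
    print(len(matrix), len(matrix[0]), sorted(traits))
```

Running the sketch for noise levels 0, 0.5 and 1 with three different seeds each would mirror the replication scheme described above.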

3.2 Multiclass Statistical Scores

To compare our results we evaluated the speciﬁcity,

sensitivity, negative and positive predictive value of

the 3 methods: permutations, FDR and ours. These

statistics are used in combination to quantify different

aspects of the accuracy of a binary test, evaluating dif-

ferent proportions of correctly and incorrectly classi-

ﬁed items, when compared to a known classiﬁcation,

considered the gold standard. In this context the test is


the ensemble of all the operations performed to classify each item; positives and negatives label the items

according to the two classes c = N, P = 0,1 they be-

long to; true (T) and false (F) represent the ability of

the test to classify coherently or not a given item in

the test classiﬁcation with respect to the gold stan-

dard classiﬁcation. Thus, for example, in classical

deﬁnitions TN (true negative) labels items belonging

to class 0 (N) correctly classiﬁed by the test, and FP

(false positive) labels items incorrectly classiﬁed as 1

(P) by the test. Given these deﬁnitions, positive and

negative predictive value (PPV, NPV), sensitivity (Se)

and speciﬁcity (Sp) are usually formalized with the

relationships in the ﬁrst part of Equations 2.

Table 1: Classical definition and generalization to 3 classes for true, false, negatives, positives.

(a) Classical Definition

               Gold Standard
               T        F
Test    P      TP       FP       → P_t
        N      FN       TN       → N_t
               ↓        ↓
               P_gs     N_gs

(b) 3-Classes Definition

               Gold Standard
               2        1        0
Test    2      T_2      x_12     x_13     → C_2,t
        1      x_21     T_1      x_23     → C_1,t
        0      x_31     x_32     T_0      → C_0,t
               ↓        ↓        ↓
               C_2,gs   C_1,gs   C_0,gs

When the test classifies n > 2 categories, these definitions become more complex to apply. However, it still remains important to be able to characterize the performances of the test in terms of its ability to distinguish between items that belong and do not belong to any category (in our case between genes that constitute and do not constitute any molecular surrogate). To reach this goal and preserve the meaning of the 4 scores (PPV, NPV, Se, Sp) some caution must be used. In fact the meaning of positive and negative is not relevant anymore, since there are now several positive classes. Then, while the definition of true remains straightforward, as it indicates coherence between the classification of the test and the gold standard, the definition of false can be cumbersome, since there are n−1 ways to misclassify an item. Additionally, the possibly intuitive definition of false positives (or negatives) as items that are non-zero in the test (or in the gold standard) classification leads to ambiguity, since items can be false positives and false negatives at the same time. To avoid confusion and ambiguities the actual values of all false items can be identified by rewriting the problem in terms of a system of equations based on the relationships indicated in Table 1. Here $P_t$, $N_t$ represent the total number of positive and negative items that can be found in the test (t) categorization, and $P_{gs}$, $N_{gs}$ in the gold standard (gs) classification. The definitions can be generalized to n > 2 classes by replacing the terms negative and positive with the indices of the corresponding classes c = 0, 1, ..., n, with $C_c$ designating the total number of positives for each given class. The system of equations obtained from the relationships in the rows and columns of Table 1 contains 2·n equations (e.g. $TP + FP = P_t$) and 2·n unknowns ($x_{ij}$), thus it is completely specified. It is worth noticing that, with these general definitions, in the case of a 2-class test, Se and Sp appear to be dual scores. Thus, when generalizing to n classes it is possible to define the predictive ability of the test for each given class c ∈ {0, 1, ..., n} as $PV_c = T_c / C_{c,t}$ and the Sensitivity/Specificity (now called Sep) for the same class c as $Sep_c = T_c / C_{c,gs}$. To clarify the situation it is extremely useful to rewrite the definitions as they are written on the right hand side of Equation 2, namely:

$$
\begin{aligned}
PPV &= TP/(TP + FP) = TP/P_t \\
NPV &= TN/(TN + FN) = TN/N_t \\
Se &= TP/(TP + FN) = TP/P_{gs} \\
Sp &= TN/(TN + FP) = TN/N_{gs}
\end{aligned}
\tag{2}
$$

For n classes this gives:

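$$
\begin{aligned}
PPV &= \sum_c T_c \Big/ \sum_c C_{c,t}, \quad c = 1, \ldots, n \\
NPV &= T_0/N_t = T_0/C_{0,t} \\
Se &= \sum_c T_c \Big/ \sum_c C_{c,gs}, \quad c = 1, \ldots, n \\
Sp &= T_0/N_{gs} = T_0/C_{0,gs}
\end{aligned}
\tag{3}
$$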
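The generalized scores can be computed directly from a confusion matrix, as in the following Python sketch. It assumes a layout in which `confusion[t][g]` counts the items assigned to class t by the test and to class g by the gold standard, with class 0 as the negative class and classes listed in ascending order (Table 1 lists them in descending order). The function name `multiclass_scores` and the use of NaN for empty denominators are choices made here for illustration.

```python
import math


def multiclass_scores(confusion):
    """PPV, NPV, Se, Sp and per-class PV_c, Sep_c from a square confusion matrix."""
    n = len(confusion)
    true = [confusion[c][c] for c in range(n)]                   # T_c (diagonal)
    c_test = [sum(confusion[c]) for c in range(n)]               # row totals C_{c,t}
    c_gold = [sum(confusion[r][c] for r in range(n))             # column totals C_{c,gs}
              for c in range(n)]

    def ratio(num, den):
        return num / den if den else math.nan  # undefined when the class is empty

    return {
        "PPV": ratio(sum(true[1:]), sum(c_test[1:])),
        "NPV": ratio(true[0], c_test[0]),
        "Se": ratio(sum(true[1:]), sum(c_gold[1:])),
        "Sp": ratio(true[0], c_gold[0]),
        "PV": [ratio(true[c], c_test[c]) for c in range(n)],
        "Sep": [ratio(true[c], c_gold[c]) for c in range(n)],
    }


if __name__ == "__main__":
    # Two-class check: TN = 90, FN = 5, FP = 10, TP = 20.
    print(multiclass_scores([[90, 5], [10, 20]]))
```

With two classes the function reduces to the classical definitions of Equation 2, which gives a quick sanity check of the layout.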

4 RESULTS AND DISCUSSION

All the results with our method were obtained in much shorter times than with the permutation method, since the computational complexity of our algorithm is O(g·t) while that of the permutation (bootstrapping) method is O(g·t·p), with g indicating the number of

genes, t the number of external traits, and p the num-

ber of permutations. The comparison with FDR in

these terms is not relevant, since this method is com-

putationally efﬁcient. We performed 3 main experi-

ments: the ﬁrst for comparison among the 3 methods

across all the types of traits (global comparison, Ta-

ble 2); then more speciﬁcally, trait by trait (Table 3);

ﬁnally we explored the stability of the method across

varying numbers of tests performed.

As far as the first comparison is concerned, all methods showed good specificity, to varying degrees (Sp > 0.95), but none had satisfactory sensitivity (Se < 0.5 to Se << 0.5) except the permutation method, and only for the threshold 0.05 ($Se_{perm,\alpha=0.05} = .67$). In particular, our method has intermediate sensitivity (better than FDR) and specificity (better than


Table 2: Statistics of the performances of the 3 methods compared: our method, permuted p-values and FDR. The comparison is done on expression matrices 1000x100 and 5 traits as they are described in Section 3.1. Results are averaged over 3 instances of the random data generated with the same specifics. Standard deviations of these averages are below $10^{-2}$. The first column indicates the noise level (n), the second the threshold of significance chosen (α) and then all the scores for the 3 methods. Because of space constraints only values for noise 0.5 are shown.

              Our Method        Permutations      FDR - q-value
n      α      Se      Sp        Se      Sp        Se      Sp
       .05    .1905   .9998     .6746   .9512     .1667   .9948
0.5    .01    .1667   .9999     .4603   .9898     .1667   .9948
       .001   .1667   1.000     .3175   .9981     .1667   .9948

Table 3: Class by class comparison of the algorithms' performances. Our method performs better in terms of avoiding false positives and worse with false negatives. Data are shown as averages across the random replicates and across the 3 different levels of significance, for 3 different levels of noise (n). Figures in italic were inferred from NANs.

| n | Method | PV 0 | PV 1 | PV 2 | PV 3 | PV 4 | PV 5 | Sep 0 | Sep 1 | Sep 2 | Sep 3 | Sep 4 | Sep 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ours | .9998 | 1.000 | 1.000 | 1.000 | .3111 | .0556 | .9936 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| 0 | Perm. | .9797 | 1.000 | 1.000 | 1.000 | .9556 | .2852 | .9956 | .2510 | 0.000 | .3846 | .4325 | .3494 |
| 0.5 | Ours | .9999 | 1.000 | 1.000 | 1.000 | .0444 | .0037 | .9931 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| 0.5 | Perm. | .9797 | 1.000 | 1.000 | 1.000 | .9556 | .2852 | .9956 | .2510 | 0.000 | .3846 | .4325 | .3494 |
| 1 | Ours | .9999 | .3333 | 1.000 | .7333 | .0000 | .0037 | .9925 | 0.000 | 1.000 | .9506 | 0.000 | 0.000 |
| 1 | Perm. | .9797 | 1.000 | 1.000 | 1.000 | .9556 | .2852 | .9956 | .2510 | 0.000 | .3846 | .4325 | .3494 |

permutations). Since the FDR method, at the chosen significance thresholds, appears to behave in an extreme way, i.e. with better specificity and worse sensitivity than both other methods, we focused our attention on a more refined comparison between the permutation method and ours. Evaluating the results with other significance thresholds is outside our scope here and was not pursued.

We therefore performed the second experiment, on a trait by trait basis, with two goals: to investigate the reasons for the improved specificity of our method, and to assess the reasons for its poor global sensitivity. To this end we evaluated PV and Sep for each of the 6 classes (c = 0,1,..,n). In general our method has more problems with false negatives, while the permutation method collects a much larger number of false positives (Table 3). These characteristics depend on the intrinsic properties of s, as described in Section 3. The abrupt drop in the value of s is responsible for an almost binary behavior of this score. This leaves very few gray areas for spurious classification: ambiguous θ values are quickly coupled to high s values and discarded from the set of significant tests. Overall, trait #5 defines too complex a pattern (sum of 30 profiles) and neither method can treat it correctly; conversely, trait #4 (sum of 5 profiles) is better handled by the permutation method, and traits #1, #2 and #3 (1,0,5 correlated profiles) are better recognized by our method. It is difficult to speculate on whether surrogate markers of type #3 are more or less common than the ones

Figure 2: ROC curve for PV, AUC ≈ 0.6 (x-axis: False Positive Fraction; y-axis: True Positive Fraction).

of type #4 in actual biology; we can state, however, that our method is able to identify the surrogate markers of trait #3 with profiles that have a correlation as low as 0.33 (100% noise added). To summarize these results we evaluated ROC curves to assess whether either method strikingly outperformed the other (ROC curves in this case are not used to evaluate the relationship between sensitivity and specificity, but to compare two populations of data, which happen to be PV and Sep scores). We compared: (i) PV and Sep for each method, (ii) Sep only, (iii) PV only. Sensitivity and specificity combined, as well as sensitivity alone, lead to AUC ≈ 0.5, while the specificity test leads to AUC ≈ 0.6, which is slightly better but not statistically significant (Figure 2; AUC = 0.5 indicates tests with comparable performances).
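Used this way, the AUC can be read as the probability that a score drawn from one population exceeds a score drawn from the other, with ties counted as one half. The text does not state how the AUC was computed, so the rank-based formulation below is an assumption, and the function name and the two toy populations are hypothetical.

```python
def auc_two_populations(a, b):
    """Probability that a value drawn from `a` exceeds one drawn from `b` (ties count 1/2)."""
    wins = 0.0
    for x in a:
        for y in b:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(a) * len(b))


# Toy score populations for two methods (illustrative values only).
method_a = [0.9, 0.8, 0.4]
method_b = [0.7, 0.4, 0.1]
auc = auc_two_populations(method_a, method_b)

# Two identical populations are indistinguishable, which gives AUC = 0.5.
auc_same = auc_two_populations([1, 2, 3], [1, 2, 3])
```

Values close to 0.5 therefore mean that neither population of scores tends to dominate the other, which is the sense in which AUC = 0.5 indicates comparable performances.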

Finally, we tested our method on the same hypotheses for varying numbers of genes, from 100 to 2000 (in steps of 100 genes). Across 20 samples we obtained median values that reproduce the findings of the two previous experiments (global and trait by trait performances), with very small variances across the 20 samples ($\approx 10^{-2}$ for sensitivity and $\approx 10^{-3}$ for specificity). Thus, the method appears to be stable with respect to the number of items tested.

BIOSIGNALS 2008 - International Conference on Bio-inspired Systems and Signal Processing

5 CONCLUSIONS

We presented a method for the identification of p-values in omic studies. This approach is based on a meta-analysis and has two main advantages. On the one hand it is computationally efficient, and can thus be used in interpreted languages such as R and Matlab, which offer rich libraries of functions for omic analyses. On the other hand it is based on the identification of a p-value rather than FDR, and can thus take advantage of nominal thresholds for significance, allowing for an easier automation of filtering steps in analyses based on statistical tests. In contrast to the permutation technique, which remains a computationally intensive but very robust reference method, our approach appears, globally, to be more specific but less sensitive. This improved specificity can be extremely advantageous in the practice of Systems Biology, since novel compact functional subunits can emerge or remain undiscovered, and require longer and costlier experimental investigations to be extracted, depending on the noise with which they are identified. Application to real data still needs to be provided and represents our current research activity. For these reasons we believe the definition of an alternative and complementary method is appropriate.

ACKNOWLEDGEMENTS

The authors would like to thank Diego di Bernardo

and Mukesh Bansal for constructive discussion.

