Complementary Domain Prioritization: A Method to Improve

Biologically Relevant Detection in Multi-Omic Data Sets

Benjamin A. Neely

and Paul E. Anderson

Marine Biochemical Sciences, National Institute of Standards and Technology, Charleston, SC 29412, U.S.A.

Department of Computer Science, College of Charleston, Charleston, SC 29424, U.S.A.

Keywords:

Genomics, Trancriptomics, Proteomics, Proteogenomics, Proteotranscriptomics.

Abstract:

As the speed and quality of different analytical platforms increase, it is more common to collect data across

multiple biological domains in parallel (i.e., genomics, transcriptomics, proteomics, and metabolomics). There

is a growing interest in algorithms and tools that leverage heterogeneous data streams in a meaningful way.

Since these domains are typically non-linearly related, we evaluated whether results from one domain could

be used to prioritize another domain to increase the power of detection, maintain type 1 error, and highlight

biologically relevant changes in the secondary domain. To perform this feature prioritization, we developed

a methodology called Complementary Domain Prioritization that utilizes the underpinning biology to relate

complementary domains. Herein, we evaluate how proteomic data can guide transcriptomic differential ex-

pression analysis by analyzing two published colorectal cancer proteotranscriptomic data sets. The proposed

strategy improved detection of cancer-related genes compared to standard permutation invariant ﬁltering ap-

proaches and did not increase type I error. Moreover, this approach detected differentially expressed genes that

would not have been detected using ﬁltering alone while also highlighted pathways that might have otherwise

been overlooked. These results demonstrate how this strategy can effectively prioritize transcriptomic data

and drive new hypotheses, though subsequent validation studies are still required.

1 INTRODUCTION

Individually, the ﬁelds of genomics, transcriptomics,

and proteomics continue to receive signiﬁcant re-

search attention as their utility as novel discovery

platforms increases (Larance and Lamond, 2015;

de Klerk and a.C. t Hoen, 2015); however, there is

growing interest in algorithms and tools that leverage

these heterogeneous data streams (Boja et al., 2014).

This is for two main reasons: (i) it is becoming rea-

sonable both from an experimental and cost perspec-

tive to run two or more analytical methods simultane-

ously (e.g., proteomics and transcriptomics), and (ii)

it is believed that integrating data sources will give

rise to a deeper understanding of the system being

interrogated. Thus using multidomain data in a non-

trivial manner supports improved systems biology ap-

proaches to high-throughout biological analysis.

The most straightforward approach to harmoniz-

ing heterogeneous data streams is to combine p-

values (Alves and Yu, 2014) or simply to identify

statistical agreement following parallel analyses. For

example, Zhang et al. utilized proteomic analysis to

complement a seminal genomic/transcriptomic anal-

ysis of colorectal cancer (Cancer and Atlas, 2012).

By analyzing the same samples, the analyses could

be directly compared to identify shared changes at

the gene, transcript and protein level such that hy-

potheses from the genomic study were conﬁrmed at

the protein level. In other words, identiﬁed differen-

tially abundant proteins increased the conﬁdence of

differentially expressed genes. In addition to differ-

ential analyses of multiple domains, topological net-

work approaches and gene set enrichment analysis

can be used to predict activated/inhibited transcrip-

tion factors. Agreement between this type of anal-

ysis of transcriptomic and proteomic data has been

used in studies of renal cell carcinoma and psoriasis to

identify disease-relevant transcription targets(Neely

et al., 2016; Piruzian et al., 2010). These standard

approaches make comparisons in parallel, as opposed

to directly incorporating the data sets into a uniﬁed

computational analysis. It has yet to be demonstrated

how to capitalize fully on the relationship between ge-

nomic/transcriptomic and proteomic data sets.

Another approach to analyzing mutlidomain data

Neely B. and Anderson P.

Complementary Domain Prioritization: A Method to Improve Biologically Relevant Detection in Multi-Omic Data Sets.

DOI: 10.5220/0006151500680080

In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2017), pages 68-80

ISBN: 978-989-758-214-1

sets is to combine information from features in mul-

tiple biological domains into a singular analytical

space. Numerous tools that combine different data

modalities are being developed (see (Kumar et al.,

2016; Haider and Pal, 2013; Kuo et al., 2013) for ex-

tensive reviews). For example, tools such as 3Omics

offer a web-based one-click tool to combine data

from different domains by constructing correlation

networks and co-expression proﬁles to highlight tran-

script/protein/metabolites with strong agreement be-

tween the domains (Kuo et al., 2013). There is also

an effort to develop techniques that integrate net-

works between domains, such as metabolomic and

proteomic data (Pirhaji et al., 2016), to identify re-

lationships that are driving biological changes. It is

evident that merging data from different domains into

a shared analytical approach is a powerful approach.

A third approach is to use a uniﬁed computational

method that relies on data from complementary do-

mains to prioritize traditional methods of differential

expression to highlight biologically relevant changes.

Prioritization can be used to reduce the search space

of an analytical domain by using data from a sec-

ondary domain. Such prioritization methods are be-

coming more important as the number of variables

(e.g., transcripts, proteins, metabolites) that are stud-

ied using next-generation technology may range from

a few hundreds to tens of thousands. Transcrip-

tomics can include gene arrays which measure tens

of thousands of probes or RNA-seq analysis which

can sequence and measure tens of thousands of tran-

scripts, while proteomics (speciﬁcally mass spec-

trometry based shotgun proteomics) can provide a

proteome sampling of around ten thousand proteins

(Geiger et al., 2012). These high-dimensional search

spaces, along with relatively small sample sizes, re-

duce the power to detect potentially biologically rel-

evant changes while controlling the FDR. One of the

most common forms of analysis in high-dimensional

data such as these, is variable-by-variable statistical

testing. This is used to test the null hypothesis that be-

havior for a given variable does not change between

conditions. In the case of gene microarrays, this can

be accomplished on a gene-by-gene level using a t-

test, and extended to more complex experimental de-

signs using ANOVA. For next-generation sequenc-

ing projects where read counts are available, p-values

may be computed using read count statistics (Robin-

son and Smyth, 2007). For high-dimensional data this

results in a large number of hypotheses that need to

be evaluated, and thus, many genes may be detected

as signiﬁcant even though the null hypothesis is true

(i.e., there is really no change). This is known as type

I error and is controlled with various false positive

measures, including family-wise error rate (FWER)

or false discovery rate (FDR), that control for type I

error or the extent to which false positives occur but

also reduces the power to detect true positives. A re-

view of these methods can be found in (Dudoit et al.,

2003).

One general methodology to reduce the transcrip-

tomic search space is gene prioritization, which uti-

lizes a priori phenotype relationships in published

databases (reviewed extensively here (B

ornigen et al.,

2012)). Since this method requires manually cu-

rated and compiled information to prioritize exper-

imental data, it can overestimate relationships once

the data being queried becomes part of the a priori

relationship, and it is limited to the search space of

the database. Another method is to apply a ﬁltering

scheme prior to statistical testing. A comparison of

these methods can be found in Bourgon et al. In these

ﬁltering schemes, a set of variables (e.g., genes) are

identiﬁed that generate uninformative signal. Then

the formal statistical test is applied to the remaining

features. The goal of this approach is to select a ﬁl-

ter that makes multiple test correction less severe and

thus enhance true positive detection.

In order to overcome a priori biases, several au-

thors have recommended nonspeciﬁc or unsupervised

ﬁlters that do not make use of sample class labels, and

thus, have little inﬂuence on the formal statistical test-

ing (Talloen et al., 2007). One criticism of supervised

ﬁltering techniques is that they are in fact a statisti-

cal test and may result in optimistic adjusted p-values

and a true false positive rate that is larger than re-

ported. The goal therefore of any ﬁltering technique is

to enrich true differential expression/abundance, if not

then these techniques have the potential to generate

overly optimistic conclusions. Common permutation-

invariant ﬁlters include variance or abundance thresh-

olds since it is assumed that this removes genes that

are unlikely to be detected as signiﬁcantly different

and emphasize biologically relevant signal. While

these ﬁlters are a common ﬁrst step used by tools

commonly used to analyze transcriptomic data (e.g.,

limma, DESeq2, EdgeR, etc.), they do not incorpo-

rate data from multiple domains.

We propose a general method for feature prioriti-

zation and ﬁltering that combines data from different

biological domains and is built upon the biological re-

lationship between independent data domains: Com-

plementary Domain Prioritization. Unlike methods

that attempt to merge domains into a singular analyt-

ical space, this method prioritizes features in the tar-

get domain by integrating empirical data from com-

plementary domains through evidence-based relation-

ships in curated databases. Herein, we demonstrate

Complementary Domain Prioritization: A Method to Improve Biologically Relevant Detection in Multi-Omic Data Sets

how this general methodology can use proteomic data

to guide transcriptomic differential expression analy-

sis. The method utilizes heterogeneous data streams

by leveraging available pathway data in Kyoto Ency-

clopedia of Genes and Genomes [KEGG; (Kanehisa

et al., 2014)], WikiPathways (Kelder et al., 2009),

and MSigDB:Transcription Factor Target databases

(Matys et al., 2006) as a bridge between proteomic

and transcriptomic data to increase the power to de-

tect biologically relevant differential expression pat-

terns in transcriptomic data. We apply this prioriti-

zation strategy to two independent studies with pub-

licly available data. Our results indicate that this ap-

proach to prioritization followed by independent ﬁl-

tering provides an experimentally informed prioriti-

zation strategy leading to an increase in the power to

detect meaningful changes while still controlling type

I error.

2 METHODS

Complementary Domain Prioritization (CDP) is

a two-stage prioritization and ﬁltering approach

that combines data from two biological domains

with the goal of increasing the power of discovery

while still controlling type I error (Figure 1). In

the ﬁrst stage, enrichment analysis is performed

on one domain (in this example, proteomic data)

using pathway or gene set databases. Gene lists

from pathways or gene sets are then extracted and

provide the input to the second stage, which applies

these gene lists as prioritization criteria to the second

domain (in this case, transcriptomic data). This is

followed by a permutation-invariant gene list ﬁltering

approach. This general methodology, which can be

extended to other domains, is demonstrated herein

by using proteomic data to prioritize transcriptomic

data. An R package has been developed during

the course of this study and is available for general

use by the research community. This repository

contains all of the required scripts and documen-

tation for their use https://github.com/Anderson-

Lab/ComplementaryDomainPrioritization. This

package provides a programmatic method to obtain

the gene lists derived from the protein enrichment

data.

2.1 Gene List Generation

The ﬁrst stage of CDP is gene list generation, which

in this example is driven by proteomic data anal-

ysis. Quantitative proteomic data was used along

with class labels to detect differentially abundant

Protein

List

Gene

Lists

Prioritized

Gene List

Final

Gene List

Gene List Generation

Prioritization and Filtering

Prioritization

Transcriptomic

Data

Filtering

Figure 1: Complementary Domain Prioritization

ﬂowchart. This ﬁgure shows the overall workﬂow of prior-

itization and ﬁltering.

proteins at a given threshold. This protein list

was uploaded to WebGestalt (WEB-based GEne SeT

AnaLysis Toolkit), a web-based tool that can per-

form pathway and gene set enrichment against dif-

ferent databases for eight different species (Wang

et al., 2013), though Homo sapiens is the focus of

this example. For each enrichment test the refer-

ence set was speciﬁed as protein-coding genes (e.g.,

hsapiens entrezgene protein-coding) when using pro-

teomic data. The statistical parameters used for the

hypergeometric statistical method were BH p-value

adjustment, requiring a signiﬁcance level < 0.05 and

minimum of two proteins per pathway/gene set. For

clariﬁcation, transcription factor (TF) target analysis

enrichment is gene set enrichment analysis against the

transcription factor targets in the C3:motif gene set

of MSigDB (Matys et al., 2006). Following KEGG,

WikiPathways (WikiP) or TF enrichment analysis, the

resulting pathways and gene sets were downloaded

from WebGestalt as .tsv ﬁles.

BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms

2.2 Gene Prioritization and Filtering

The second stage of CDP ﬁrst uses the enriched path-

ways or gene sets to prioritize the genes in the second

domain belonging to each pathway or gene set of in-

terest. Pathway information for KEGG was extracted

using the KEGGREST R package. Pathway informa-

tion for WikiPathways was retrieved using the ofﬁcial

web service provided by WikiPathways. Gene set in-

formation from the transcription factor database (via

MSigDB) was downloaded and queried locally. The

prioritized gene list G is deﬁned as:

P = {P

}, (1)

genes(P

) = {g

}, (2)

G =

[

∈P

genes(P

), (3)

where P is the set of all pathways or gene sets identi-

ﬁed in stage 1, genes(P

) is the list of genes in path-

way P

, and G is the resulting prioritized gene list.

CDP prioritizes genes in the transcriptome in-

volved in pathways or gene sets showing enrichment

at the protein level and removing genes not present

in these pathways or gene sets. Next, permutation-

invariant ﬁltering is applied to further enhance the

power of detection by applying variance or mean

abundance ﬁltering. Variance ﬁltering is deﬁned as

ranking the genes according to variance across sam-

ples (ignoring class labels). This has been shown to

be similar to fold-change ﬁltering, but in an unsuper-

vised manner (see Bourgon et al. for a detailed dis-

cussion). Mean abundance ﬁltering ranks the genes

by the mean abundance of each gene. In the method-

ology described herein, these permutation-invariant

ﬁltering methods are applied after prioritization as

follows:

Var(G

) =

n − 1

∑

j=1

i j

−

)

(4)

Mean(G

) =

∑

j=1

i j

(5)

R = order(G) (6)

F =

[

i=1...θ

(7)

(8)

where E

i j

is the expression estimate of gene i and

sample j, R is an ordered list of genes that have been

sorted by their variance (Var(G

)) or mean abundance

(Mean(G

)), θ is the desired number of genes, and F

is the ﬁnal set of genes which have been prioritized

and ﬁltered.

2.3 Experimental Data Sets

Two data sets were selected for evaluation, referred to

herein as TCGA CRCa and Marra. The ﬁrst data set,

TCGA CRCa, is comprised of transcriptomic and pro-

teomic analyses of 87 tumor samples from individuals

with colorectal carcinoma, for which transcript levels

were measured by RNA sequencing yielding FPKM

measurements (Cancer and Atlas, 2012), and protein

levels were measured by label-free shotgun proteomic

analysis yielding spectral counts that were quantiled

and then log-transformed (Zhang et al., 2014). Tran-

scriptomic data for the TCGA CRCa data set were

retrieved from supplemental tables of the published

paper (Cancer and Atlas, 2012). Using the key pro-

vided by both the authors and the key in the follow-

up proteomics paper (Zhang et al., 2014), we deter-

mined the 87 samples which overlapped between the

two studies. The transcriptomic data supplied in sup-

plemental were utilized directly (as opposed to re-

processing raw data), which was given as FPKM val-

ues for 20,531 genes. These gene symbols were ﬁrst

conﬁrmed as being current accepted HGNC symbols,

followed by converting them to Entrez gene IDs. Of

the original 20,531 genes, 18,995 had Entrez gene ID

annotation and were used for analysis. A side note

is that we understand that more robust statistics can

be performed using direct count data with tools such

as EdgeR or baySeq, but for our purposes comparing

FPKM values with a t-test was acceptable.

Proteomic data for the TCGA CRCa data set were

retrieved from supplemental tables of the published

paper (Zhang et al., 2014). The supplemental data

was used directly and consisted of already processed

data that was spectral count data for 7,211 proteins,

which had been quantiled then log transformed. Of

these 7,211 proteins (with gene symbol identiﬁers),

7,147 had Entrez gene ID annotation and were used

for analysis. To generate a p-value for these proteins

to be used in CDP, a moderated t-test was applied

to the data using the limma package (Smyth, 2004)

followed by a Benjamini-Hochberg procedure to cor-

rect for multiple hypothesis testing. For this analy-

sis, the samples are dichotomized based on whether

lymphatic invasion was present in an attempt to clas-

sify tumor aggressiveness similar to the original anal-

ysis (Cancer and Atlas, 2012). The original paper uti-

lized a model consisting of tumour stage, lymph node

status, distant metastasis and vascular invasion at the

time of surgery. Of the 87 samples analyzed, eight did

not have a value speciﬁed, 44 had lymphatic invasion

and 35 did not.

The second data set, Marra, is comprised of tran-

scriptomic and proteomic analyses of paired adenoma

Complementary Domain Prioritization: A Method to Improve Biologically Relevant Detection in Multi-Omic Data Sets

and normal mucosa samples from individuals with

pre-cancerous colorectal tumors or lesions. Transcript

levels of 35 adenomas with matched normal mu-

cosa are measured using Affymetrix GeneChip Hu-

man Exon ST Arrays (Cattaneo et al., 2011). Tran-

scriptomic data for the Marra data set (Cattaneo et al.,

2011) were retrieved from the NCBI Gene Expres-

sion Omnibus database identiﬁer GSE21962, which

contained raw ﬁles from Affymetrix GeneChip Hu-

man Exon ST Arrays (HuEx-1 0-st-v2). Array data

that was speciﬁc to pre-cancerous tumors and lesions

was utilized, without duplicate samples resulting in

35 adenomas with matched normal mucosa. Files

were processed using the oligo package (Carvalho

and Irizarry, 2010) and pd.huex.1.0.st.v2 library. Ro-

bust multichip average (RMA) normalization was ap-

plied at the transcript level (i.e., core gene ST probe-

set) generating expression values on a log-scale. This

resulted in data for 20,011 probes, which contained

16,132 genes with Entrez gene ID annotation and

were used for analysis.

Proteomic analysis of 30 similar but not iden-

tically matched samples was performed later using

isobaric labeling (iTRAQ) to quantify protein abun-

dance (Uzozie et al., 2014). Even though the Marra

data set is comprised of data for similar but not iden-

tical samples, this is still relevant since often re-

searchers use publicly available transcriptomic data

to augment original analysis, such as Shimwell et

al. (Shimwell et al., 2013). Proteomic data for the

Marra data set (Uzozie et al., 2014) were retrieved

from the published paper. This supplemental data was

quantiﬁed at the peptide level using 8-plex iTRAQ

across 10 experiments, with iTRAQ labels 113 and

114 being a pooled reference sample and the remain-

ing six labels being three adenoma/normal mucosa

pairs from the same patient per experiment (i.e., 30

pairs total). These data were processed using Infer-

noRDN(Polpitiya et al., 2008) similar to the steps de-

scribed by the authors(Uzozie et al., 2014). Brieﬂy,

for each experimental data set only unique peptides

were utilized and label intensity measurements were

ﬁrst log2 transformed followed by mean centering

(central tendency adjustment). These mean centered

log transformed peptide level data were rolled up to

the protein level using reference peptide based scal-

ing (default parameters except one-hit wonders were

allowed). This resulted in protein level quantiﬁcation

that was mean normalized log2 transformed by ex-

periment. The distribution of 113 and 114 labels was

found to be nearly identical between the 10 experi-

ments, and so inter-experimental normalization was

not required. Therefore we utilized the normalized

log-intensity values for each adenoma/normal mucosa

pair to generate a log-fold change value. Next we

found proteins that were measured in all experiments,

and used these 820 proteins for analysis. Of these

820 proteins with UniProtKB Identiﬁers, 768 had En-

trez gene IDs, which were used to perform a moder-

ated t-test with the limma package (Smyth, 2004) fol-

lowed by a Benjamini-Hochberg procedure to correct

for multiple hypothesis testing.

Since we use pathway and gene set databases to

create prioritization ﬁlters, it is important to report rel-

evant percent overlap related to the data sets and these

databases. The TCGA CRCa data set is comprised of

7,147 unique proteins and 18,787 unique transcripts

with Entrez gene IDs. The KEGG database contains

3,185 (45%) and 6,307 (34%) of these proteins and

transcripts, respectively. The WikiP database con-

tains 3,719 (52%) and 7,215 (38%) of these proteins

and transcripts, respectively. The TF database con-

tains 5,208 (73%) and 12,035 (64%) of these proteins

and transcripts, respectively. The Marra data set is

comprised of 768 unique proteins and 15,807 unique

transcripts with Entrez gene IDs. The KEGG database

contains 474 (62%) and 6,073 (38%) of these proteins

and transcripts, respectively. The WikiP database

contains 533 (69%) and 6,841 (43%) of these proteins

and transcripts, respectively. The TF database con-

tains 628 (82%) and 11,556 (73%) of these proteins

and transcripts, respectively. The relative overlap is

higher for protein than transcript data sets against the

pathway databases (KEGG and WikiP) since path-

ways are comprised of proteins. Also for reference,

there are 27,228 genes with Entrez gene IDs in the

Homo sapiens genome, and 21,061 of these are pro-

tein coding (Ensembl 81, H. sapiens GRCh38.p3).

3 RESULTS

3.1 Prioritization and Novel Discovery

The goal of Complementary Domain Prioritization

(CDP) is to guide differential analysis in one do-

main using differential analysis of a parallel domain.

The proposed method was evaluated using two pub-

lished proteotranscriptomic data sets, TCGA CRCa

and Marra, with diverse gene expression and pro-

tein abundance patterns. The TCGA CRCa data set

presents an example of a difﬁcult differential expres-

sion discovery task when studying differences in tu-

mor aggressiveness using lymphatic invasion as a

phenotype. The original TCGA analysis identiﬁed

40 genes related to tumor aggressiveness, which was

determined by more factors than just lymphatic inva-

sion (Cancer and Atlas, 2012). In the current anal-

BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms

ysis, using lymphatic invasion as a phenotype there

were minor changes with respect to differential ex-

pression at the transcript level with 49 differentially

expressed genes (two-sided equal t-test, Benjamini-

Hochberg corrected p< 0.2; BH < 0.2) and 26 differ-

entially abundant proteins (BH < 0.2). Of these 49

differentially expressed genes, two were identiﬁed in

the original analysis. In contrast, using CDP detected

on average 10 of the 40 genes identiﬁed in the orig-

inal TCGA study by ﬁltering alone or prioritization

then ﬁltering (data not shown). This provided a mod-

est improvement in analysis to detect genes related to

tumor aggressiveness.

In order to demonstrate whether CDP was priori-

tizing otherwise disregarded genes that are interesting

candidates correlated to cancer phenotypes, we report

the number of rejected null hypotheses (e.g., number

of candidate discoveries) as a function of the FDR for

only cancer-related genes. To accomplish this, a list

of 1663 cancer-related genes, retrieved via the Hu-

man Protein Atlas (Uhl

en et al., 2015), were tracked

and evaluated for each data set (Figure 2 and Supple-

mentary Figure S1 online). In the TCGA CRCa data

set prioritization followed by ﬁltering consistently de-

tected approximately two to three-fold more cancer-

related transcripts with BH < 0.3 than using ﬁltering

only, and higher θ resulted in better detection (Fig-

ure 2). Also, cancer-related transcripts were detected

at BH < 0.1 only when using WikiPathways (WikiP)

and transcription factor (TF) enrichment based prior-

itization. The effect of CDP on the Marra data set,

which had a strong differential expression proﬁle with

40% of genes differentially expressed without CDP,

was less obvious (Figure 3). For the Marra data set,

ﬁlter-only performed as well as CDP in some cases

with CDP outperforming ﬁlter-only when used with

WikiP (Figure 3b and Figure 3e).

In addition to detecting previously reported genes

of interest, successful prioritization should also result

in detecting genes that would otherwise not be de-

tected. The Marra data set has 6,300 differentially

expressed genes (BH < 0.5) and 437 differentially

abundant proteins (BH < 0.5) and provides an in-

teresting example of the effect of CDP on prioritiz-

ing gene detection. Using CDP drastically decreased

the number of gene candidates by prioritizing 6 to

43% of the original 15,807 genes prior to permuta-

tion invariant ﬁltering. Additional ﬁltering targeted

3 to 22% of the original genes (488, 813 and 3,427

using WikiP, KEGG, and TF, respectively). Prioriti-

zation followed by variance ﬁltering detected 6 to 164

genes that would not have been detected with variance

ﬁltering alone, while prioritization followed by mean

abundance ﬁltering detected less than 10 genes that

would not have been detected with mean abundance

ﬁltering alone. Although these changes in detection

may seem minor, they lead to large differences at the

pathway level, which is investigated later in this pa-

per.

3.2 Controlling False Positive Rate

Using ﬁlters to improve detection power can lead to a

loss of false positive control depending on the choice

of ﬁlter. Permutation-invariant ﬁlters, such as vari-

ance and abundance, have been shown to be appro-

priate ﬁlters (Bourgon et al., 2010). An example

of a permutation-variant ﬁlter is a fold-change ﬁl-

ter, where the fold-change between two classes is de-

pendent on the labels of the samples, and thus, the

ordering of the samples is important in the calcula-

tion. In contrast, a variance ﬁlter, abundance ﬁlter,

or complementary domain based prioritization is in-

dependent on the ordering of the samples. To evalu-

ate false positive control, the conditional and uncon-

ditional marginal distributions of test statistics after

using permutation-invariant ﬁltering were compared.

It has been shown that the conditional marginal dis-

tributions of test statistics after using permutation-

invariant ﬁltering are the same as the unconditional

distributions before ﬁltering, where the conditional

distributions are the same as the distributions after ap-

plying the ﬁlter (Bourgon et al., 2010). This behav-

ior is a necessary criteria for multiple testing adjust-

ments that attempt to control experiment-wide type

I error rate. We show the distribution of t-statistics

for the conditioned and unconditioned TCGA CRCa

data set in order to empirically demonstrate the effect

of these ﬁlters (Figure 4). The ﬁndings on the Marra

data set were also consistent with these results (data

not shown). We deliberately eschewed fold-change

from other comparisons since this requires incorpo-

rating class labels, and thus, will affect type I error.

A variance based ﬁlter is similar in practice to a fold-

change ﬁlter, despite being independent of class lev-

els. This is discussed in detail by Bourgon et al., but

brieﬂy, for small sample sizes, the bound is essentially

a constant multiple of the cut-off on the variance.

3.3 Effect on Pathway Analysis

To more clearly deﬁne the similarities in the priori-

tization strategies, we compared the overlap of pri-

oritized and differentially expressed genes using dif-

ferent databases for prioritization. Since there are

only 26 differentially abundant proteins in the TCGA

CRCa data set (at BH < 0.2), we did not pursue us-

ing this data set as a proof of principle. Differentially

Complementary Domain Prioritization: A Method to Improve Biologically Relevant Detection in Multi-Omic Data Sets

0.0 0.1 0.2 0.3

Adjusted p−value cutoff

Rejected null hypotheses

100

500

1000

2000

2500

Filter Only

Permuted

(a) KEGG Variance Filter

0.0 0.1 0.2 0.3

Adjusted p−value cutoff

Rejected null hypotheses

(b) WikiP Variance Filter

0.0 0.1 0.2 0.3

Adjusted p−value cutoff

Rejected null hypotheses

0.0 0.1 0.2 0.3

Adjusted p−value cutoff

Rejected null hypotheses

(d) KEGG Mean Abundance Filter

0.0 0.1 0.2 0.3

Adjusted p−value cutoff

Rejected null hypotheses

(e) WikiP Mean Abundance Filter

0.0 0.1 0.2 0.3

Adjusted p−value cutoff

Rejected null hypotheses

(f) TF Mean Abundance Filter

Fig. 2: Power analysis and prioritization comparison on the TCGA CRCa data set. Figures a - f show the number of rejected null hypotheses

as a function of the p-value cutoff using three different ﬁlters (KEGG, WikiP and TF) and two different permutation invariant prioritization

strategies (variance and mean abundance). All methods were evaluated using thresholds θ=100, 500, 1000, 2000, and 2500. For presentation

clarity, only a single theta is shown for methods without prioritization (overall mean and overall variance). For these methods, the threshold

resulting in the maximum number of signiﬁcant genes for a FDR of 0.2 was chosen.

Figure 2: Power analysis and prioritization comparison on the TCGA CRCa data set. Figures a - f show the number of

rejected null hypotheses as a function of the BH adjusted p-value cutoff using three different prioritization strategies (KEGG,

WikiP and TF) and two different permutation invariant ﬁltering strategies (variance and mean abundance). All methods were

evaluated using thresholds θ=100, 500, 1000, 2000, and 2500, where θ is the desired number of genes. For presentation

clarity, only a single θ is shown for methods using only a permutation invariant strategy (ﬁlter only, no prioritization), which

was the threshold resulting in the maximum number of signiﬁcant genes for a FDR of 0.2, and is labeled as Filter Only.

Figure 3: Power analysis and prioritization comparison on the Marra data set. Figures a - f show the number of rejected

null hypotheses as a function of the BH adjusted p-value cutoff using three different prioritization strategies (KEGG, WikiP

and TF) and two different permutation invariant ﬁltering strategies (variance and mean abundance). All methods were eval-

uated using thresholds θ=100, 500, 1000, 2000, and 2500, where θ is the desired number of genes. For presentation clarity,

only a single θ is shown for methods using only a permutation invariant strategy (ﬁlter only, no prioritization), which was the

threshold resulting in the maximum number of signiﬁcant genes for a FDR of 0.2, and is labeled as Filter Only.

BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms

0.00

0.05

0.10

0.15

0.20

−20 0 20

t−statistic

Frequency

All

Fold-change

KEGG

ean

Variance

WikiP

Figure 4: Distribution of t-statistic values before and af-

ter conditioning. The proposed prioritization and ﬁltering

approaches are permutation invariant, in contrast to a fold-

change ﬁlter that is not permutation invariant.

abundant proteins in the Marra data set (437 proteins

at BH < 0.05) were used to perform pathway en-

richment or gene set analysis against KEGG, WikiP

or TF databases (Figure 5, and Figure 6). Regard-

less of the permutation-invariant ﬁlter used following

prioritization, the trends were the same between the

databases: TF based CDP resulted in the highest num-

ber of genes, of which 80% were unique to TF based

prioritization, while KEGG and WikiP prioritized far

fewer genes. Regardless of the prioritization approach

used, approximately half of the resulting genes were

detected as differentially expressed, which was the

same as in the untreated data set. Lastly, the KEGG

and WikiP approaches were more similar than TF,

likely since these are pathway databases as opposed

to gene set databases.

WikiP

(n = 488)

KEGG

(n = 813)

(n = 3427)

108

114

269

297

133

2864

WikiP

(n = 260)

KEGG

(n = 420)

(n = 1641)

125

171

1326

Figure 5: Comparison of the gene overlap following pri-

oritization treatments after variance ﬁltering. Different

prioritization approaches were used (KEGG, WikiP or TF)

followed by an independent ﬁlter (50% variance) for the

Marra data set. (A) Overlap of prioritized genes follow-

ing prioritization with different databases. (B) Overlap of

differentially expressed genes (BH adjusted p < 0.05) fol-

lowing prioritization with different databases.

The candidate discoveries identiﬁed in a differen-

tial expression studies are often the starting point for

pathway and gene set enrichment analyses; therefore,

WikiP

(n = 488)

KEGG

(n = 813)

(n = 3427)

220

332

168

145

2782

WikiP

(n = 274)

KEGG

(n = 482)

(n = 1859)

134

196

100

1489

A B

Figure 6: Comparison of the gene overlap following

prioritization treatments after mean abundance ﬁlter-

ing. Different prioritization approaches were used (KEGG,

WikiP or TF) followed by an independent ﬁlter (50%

mean abundance) for the Marra data set. (A) Overlap

of prioritized genes following prioritization with different

databases. (B) Overlap of differentially expressed genes

(BH adjusted p < 0.05) following prioritization with dif-

ferent databases.

it is critical to explore how the proposed prioritiza-

tion strategy affects downstream analysis. For this

we compared how ﬁltering alone and prioritization

followed by ﬁltering affected results when analyzing

the Marra data set (Table 1) and report the identiﬁed

pathways. We used the differentially expressed genes

(BH < 0.05) from these prioritized and/or ﬁltered data

to perform pathway enrichment analysis against the

KEGG database. The unﬁltered data resulted in 206

pathways, variance ﬁltering alone identiﬁed 189 path-

ways and prioritizing with KEGG, WikiP and TF fol-

lowed by variance ﬁltering identiﬁed 151, 87 and 154

pathways, respectively. Since often the top enriched

pathways are used for followup experimental studies,

the top 10 enriched pathways were compared (Table

1). Each approach yielded slightly different top 10

pathways and different rankings of speciﬁc pathways,

with only ’Metabolic pathways’ being shared across

approaches. Interestingly, the ’Complement and co-

agulation cascades’ pathway was ranked highest after

protein prioritization (using WikiP) with 26 genes be-

ing differentially expressed, while it was fourth high-

est using KEGG (albeit with 29 genes being differen-

tially expressed). For comparison, this pathway was

ranked 17th using a variance ﬁlter alone (with 28 dif-

ferentially expressed genes). Also, the ’Protein pro-

cessing in ER’ pathway was unique to KEGG based

protein prioritization, with 33 genes in this pathway

being differentially expressed. When using a variance

ﬁlter alone, this pathway was the 99th highest ranked

pathway with 30 differentially expressed genes, while

in the unﬁltered treatment this pathway was ranked

20th with 72 differentially expressed genes (data not

shown). For comparison, ’Protein processing in the

ER’ was ranked 15th with 13 differentially abundant

proteins following pathway enrichment analysis us-

ing the proteomic data. Overall, Using CDP on the

Marra data set changed the ranking of identiﬁed path-

Complementary Domain Prioritization: A Method to Improve Biologically Relevant Detection in Multi-Omic Data Sets

ways, possibly indicating pathways that are more bi-

ologically relevant than those identiﬁed using just the

transcriptomic results.

4 DISCUSSION

As the speed, quality and complexity of analysis

used to quantify the complex biochemical interactions

within biological systems continues to improve, stud-

ies are often confronted with the issue of over-ﬁtting

high-dimensional data with a relatively small sam-

ple set. These data sets can also span multiple bio-

logical domains (genes, transcripts, proteins, metabo-

lites), presenting the opportunity to utilize the data

across domains in a meaningful way. It was the aim

of this study to develop and evaluate a method that

leveraged proteomic data to prioritize transcriptomic

data while avoiding increased type I error. Similar

to methods that use permutation-invariant ﬁlters such

as variance or mean abundance to improve power

(Bourgon et al., 2010), our approach improved the

power to detect cancer-related genes while controlling

type I error in two different experimental data sets.

These results demonstrate that prioritization of tran-

scriptomic data using proteomic data provides similar

power improvements as other permutation-invariant

ﬁlters, while utilizing the underpinning biological hi-

erarchy to create an empirical prioritization ﬁlter.

One issue when applying ﬁlters to data is that us-

ing class level differences to ﬁlter or prioritize data

can inﬂate the number of signiﬁcant variables. This

problem is mitigated during Complementary Domain

Prioritization because the domains are independent

and class level differences across different biologi-

cal domains are largely non-linearly related. Even

though the majority of transcript and protein levels

are positively correlated, the average correlation from

a subset of published data is 0.26 (Gygi et al., 1999;

Foss et al., 2007; Zhang et al., 2014). It is possi-

ble that genes detected as differentially expressed do

not manifest changes at the protein level due to post-

transcriptional and post-translational regulation. This

has been observed when proteomic analysis has been

used in tandem with transcriptomic studies (Zhang

et al., 2014; Gygi et al., 1999; Foss et al., 2007),

where changes observed at the transcript level were

not propagated to the protein level. Alternatively, sig-

niﬁcant protein changes have been observed in the ab-

sence of transcript changes (Zhang et al., 2014; Gygi

et al., 1999), further emphasizing the dynamic nature

of post-transcriptional regulation, which can attenu-

ate protein abundance, as well as the independence

of the different domains. By utilizing protein levels,

which are arguably more informative of phenotypic

changes while also being non-linearly related to tran-

script changes, to prioritize pathways and gene sets in

the transcript domain, Complementary Domain Prior-

itization does not artiﬁcially inﬂate detection of sig-

niﬁcant changes across domains.

During Complementary Domain Prioritization a

ﬁlter is created using enrichment analysis against

pathway and gene set databases, thereby incorpo-

rating secondary experimental data with evidence-

based relationships. In other words, CDP is able

to leverage evidence-based relationships (databases)

with secondary domain data (proteomics) to priori-

tize signal at the transcriptomic level. Since Com-

plementary Domain Prioritization relies heavily on

databases and secondary domain data, both of these

offer room for improvement. Databases are inherently

ambiguous association lists that aren’t exhaustive or

completely accurate. This means that the choice of

database can affect downstream results since they rely

on different assumptions and curation. We show that

different prioritization databases affect detection of

differentially expressed genes, with Complementary

Domain Prioritization targeting 6 to 43% of the genes

detected by ﬁltering alone. Also, prioritization us-

ing pathway based databases (KEGG and WikiPath-

ways) generated results that were more similar than

prioritization based on a gene set database (MSigDB).

Moreover, if a database omits a gene in a given path-

way or gene set, then this gene will not be present

following the prioritization procedure. In both data

sets, the pathway based databases include approxi-

mately 40% of transcripts in the transcriptomic data,

whereas the gene set database includes approximately

75% of both proteomic or transcriptomic data. Also,

though the WikiPathways database consistently in-

cluded more proteins and transcripts from the data

sets than KEGG, fewer WikiPathways pathways were

identiﬁed by pathway enrichment analysis. This may

explain why although both KEGG and WikiPathways

prioritization strategies identiﬁed the ’Complement

and Coagulation Cascades’ pathway in the Marra data

set, though the number of differential expressed genes

was different between the two prioritization strate-

gies. We have shown performance using KEGG,

WikiPathways, and MSigDB (i.e., TF) databases, but

in the future this could be expanded to include other

databases (e.g., Reactome). As more heterogeneous

domain data sets become available, isolating database

speciﬁc effects from data set effects will help estab-

lish best practices for Complementary Domain Prior-

itization. In general, with continued development of

more accurate databases, the prioritization quality of

this approach will improve.

BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms

Table 1: Comparison of top 10 enriched pathways. Results for the Marra data set are shown using four different strategies:

permutation invariant ﬁlter only (50% variance) and prioritization (KEGG, WikiP or TF) followed by ﬁlter (50% variance).

After ranking enriched pathways using the BH adjusted p-value, the top 10 are shown in order. Unique pathways are in bold.

Filter Only KEGG + Filter WikiP + Filter TF + Filter

Metabolic pathways Metabolic pathways

Complement and

coagulation cascades

Metabolic pathways

Cell cycle Purine metabolism Metabolic pathways Cell cycle

Cell adhesion

molecules (CAMs)

RNA transport Ribosome Pathways in cancer

Pathways in cancer

Complement and

coagulation cascades

Parkinson’s disease Purine metabolism

Cytokine-cytokine

receptor interaction

Protein processing in

endoplasmic reticulum

Oxidative

phosphorylation

p53 signaling pathway

Purine metabolism

Pyrimidine metabolism

Alzheimer’s disease DNA replication

Chemokine signaling

pathway

Huntington’s disease Huntington’s disease MAPK signaling pathway

p53 signaling pathway Alzheimer’s disease

Starch and sucrose

metabolism

RNA transport

Rheumatoid arthritis Ribosome

Staphylococcus aureus

infection

Cell adhesion

molecules (CAMs)

Regulation of actin

cytoskeleton

Glycolysis /

Gluconeogenesis

Systemic lupus

erythematosus

Prostate cancer

In addition to reliance on databases, Complemen-

tary Domain Prioritization relies on the data qual-

ity of the secondary domain. Current shotgun pro-

teomics data sets are prone to false negatives by the

very nature of mass spectrometric analysis. As newer

techniques and technologies are developed and per-

fected, such as data independent analysis, false neg-

ative rates should decrease though this still poses a

problem when integrating across domains. For ex-

ample, the type of sample preparation and separa-

tion may not perform well at isolating and identify-

ing membrane-bound proteins, which could handicap

the Complementary Domain Prioritization procedure

away from signaling cascades. Proteomic analysis

also varies greatly in coverage (or sampling depth)

of the proteome based on separation techniques and

speed/resolution of the mass spectrometer. Our re-

sults demonstrate that a set of 800 proteins (Marra

data set) or 7,000 proteins (TCGA CRCa data set)

can successfully be used to prioritize companion tran-

scriptomic data. It is likely that there are more false

negatives in the smaller protein data set and using this

set is likely prioritizing pathways with higher abun-

dance proteins. If the goal is to discover a biomarker

with clinical utility, then this limitation is actually

an advantage at prioritizing transcript changes that

are related to measurable protein changes. Regard-

less, as both proteomic analysis and database quality

improves, Complementary Domain Prioritization will

also improve.

It was not the goal of this study to compare re-

sults to the original transcriptomic analysis. Regard-

less, using Complementary Domain Prioritization fol-

lowed by unsupervised ﬁltering performed better at

detecting known cancer-related genes in both data sets

relative to ﬁltering alone or using no ﬁlter at all. In the

ﬁrst data set, the TCGA authors investigated gene ex-

pression related to tumor aggressiveness, though tu-

mor aggressiveness was not evaluated in the follow-

up proteomic analysis by Zhang et al. Using lym-

phatic invasion as a proxy of tumor aggressiveness,

our method identiﬁed more of the 40 genes related

to tumor aggressiveness than ﬁltering alone (six ver-

sus four). If the pathways and gene sets prioritized

by the protein data did not include these 40 genes,

then they were omitted from further analysis. On av-

erage only 10 of these genes were selected follow-

ing ﬁltering alone or prioritization followed by ﬁlter-

ing. This ﬁnding demonstrates how Complementary

Domain Prioritization can be used to improve previ-

ous results based on a single domain. In the second

data set evaluated, Cattaneo et al. evaluated polyploid

pre-cancerous colorectal lesions whereas the follow-

up proteomics analysis by Uzozie et al., and therefore

the prioritization analysis described herein, focused

on normal versus pre-cancerous lesions. The pub-

lished proteomic analysis identiﬁed alterations in sor-

bitol dehydrogenase (SORD) levels in pre-cancerous

lesions, speciﬁcally in the sorbitol-aldose reductase

pathway. Though this pathway was only ranked 163

Complementary Domain Prioritization: A Method to Improve Biologically Relevant Detection in Multi-Omic Data Sets

in the unﬁltered transcriptomic data, following Com-

plementary Domain Prioritization (KEGG-based with

a variance ﬁlter of 50%), the rank improved to 29.

This clearly demonstrates that proteomics data can

prioritize transcriptomic data in a meaningful way

similar to published results.

In addition to having similar ﬁndings as the orig-

inal studies, Complementary Domain Prioritization

was able to highlight pathways that might have been

otherwise overlooked. Using Complementary Do-

main Prioritization and ﬁltering of the Marra data set

identiﬁed ’Protein Processing in the ER’ as being dif-

ferent in pre-cancerous colorectal lesions. Although

only 13 differentially abundant proteins were identi-

ﬁed in this pathway, the transcriptomic data was pri-

oritized such that this pathway was ranked 5th (versus

20th in the unﬁltered data). Protein folding is known

to be crucial in many oncogenic processes, especially

ER chaperones (Luo and Lee, 2012), and may be an

area of research to pursue further. These targets could

be conﬁrmed at the protein level by immunoblot anal-

ysis similar to the SORD conﬁrmation in the original

study. Lastly, by using both the KEGG and WikiP pri-

oritization approaches we identiﬁed similar pathways

emphasizing the importance of the complement cas-

cade in pre-cancerous lesions. By applying Comple-

mentary Domain Prioritization to the Marra data set,

we have shown how complementary proteomic data

can drive new hypotheses, though only future exper-

iments will demonstrate the relevance of these ﬁnd-

ings.

5 CONCLUSIONS

Utilizing data across biological domains is inherently

difﬁcult because it is not fundamentally understood

how multiple changes at each domain result in phe-

notypic changes. Signiﬁcant changes at the tran-

script level are not always present at the protein level

(Zhang et al., 2014; Gygi et al., 1999; Foss et al.,

2007), while studies of the low-abundance transcrip-

tome of some cancer have conﬁrmed that changes in

low-abundance genes are responsible for deleterious

biological changes (Bizama et al., 2014). This may

be extrapolated to other domain relationships mean-

ing that moving from gene to transcript to protein

to metabolite, changes at each step are not depen-

dent on single changes at previous steps. Likewise,

this means that signal from each domain is represen-

tative of signal from a larger number of features in

the previous domain. For this reason, we did not

focus on directly combining data between domains

but instead present an approach that utilizes results

from one domain to prioritize data from the underly-

ing domain. Our results demonstrate that proteomic

data can be used to prioritize transcriptomic data,

though this approach is not limited to these comple-

mentary domains. Lipidomic data could be used to

prioritize genomic/transcriptomic/proteomic data via

LIPID MAPS pathway database, and metabolomics

data could be likewise used to prioritize data follow-

ing analysis with XCMS Online. Stepping further

away from genes in this hierarchy of biological do-

mains creates smaller and smaller prioritization lists,

but ones that are more biologically relevant to phe-

notypic changes. Utilizing data from complementary

domains as a prioritization tool can be a powerful ap-

proach to integrating complex high-dimensional bio-

logical data sets.

ACKNOWLEDGEMENTS

The authors wish to thank Benilton Carvalho for his

help utilizing the oligo R package as well as Gian-

carlo Marra and Anuli Uzozie for invaluable assis-

tance in harmonizing published data sets. This re-

search was supported in part by the National Institute

of General Medical Sciences of the National Institutes

of Health under Award Number P20GM103542, the

South Carolina SmartState Center of Economic Ex-

cellence in Proteomics, and the Medical University of

South Carolina Proteomics Center. Identiﬁcation of

certain commercial equipment, instruments, software

or materials does not imply recommendation or en-

dorsement by the National Institute of Standards and

Technology, nor does it imply that the products identi-

ﬁed are necessarily the best available for the purpose.

REFERENCES

Alves, G. and Yu, Y. K. (2014). Accuracy evaluation of the

uniﬁed P-value from combining correlated P-values.

PLoS ONE, 9(3).

Bizama, C., Benavente, F., Salvatierra, E., Guti

errez-

Moraga, A., Espinoza, J. a., Fern

andez, E. a., Roa, I.,

Mazzolini, G., Sagredo, E. a., Gidekel, M., and Pod-

hajcer, O. L. (2014). The low-abundance transcrip-

tome reveals novel biomarkers, speciﬁc intracellular

pathways and targetable genes associated with ad-

vanced gastric cancer. International Journal of Can-

cer, 134:755–764.

Boja, E. S., Kinsinger, C. R., Rodriguez, H., and Srinivas,

P. (2014). Integration of omics sciences to advance

biology and medicine. 11(1):1–12.

ornigen, D., Tranchevent, L. C., Bonachela-Capdevila, F.,

Devriendt, K., De Moor, B., De Causmaecker, P., and

BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms

Moreau, Y. (2012). An unbiased evaluation of gene

prioritization tools. Bioinformatics, 28(23):3081–

3088.

Bourgon, R., Gentleman, R., and Huber, W. (2010). Inde-

pendent ﬁltering increases detection power for high-

throughput experiments. Proceedings of the National

Academy of Sciences of the United States of America,

107:9546–9551.

Cancer, T. and Atlas, G. (2012). Comprehensive molecu-

lar characterization of human colon and rectal cancer.

Nature, 487(7407):330–7.

Carvalho, B. S. and Irizarry, R. a. (2010). A framework for

oligonucleotide microarray preprocessing. Bioinfor-

matics, 26(19):2363–2367.

Cattaneo, E., Laczko, E., Buffoli, F., Zorzi, F., Bianco,

M. A., Menigatti, M., Bartosova, Z., Haider, R.,

Helmchen, B., Sabates-Bellver, J., Tiwari, A., Jiricny,

J., and Marra, G. (2011). Preinvasive colorectal le-

sion transcriptomes correlate with endoscopic mor-

phology (polypoid vs. nonpolypoid). EMBO molec-

ular medicine, 3(6):334–47.

de Klerk, E. and a.C. t Hoen, P. (2015). Alternative

mRNA transcription, processing, and translation: in-

sights from RNA sequencing. Trends in Genetics,

31(3):128–139.

Dudoit, S., Dudoit, S., Shaffer, J. P., Shaffer, J. P., Boldrick,

J. C., and Boldrick, J. C. (2003). Multiple hypothesis

testing in microarray experiments. Statistical Science,

18(1):71–103.

Foss, E. J., Radulovic, D., Shaffer, S. a., Ruderfer, D. M.,

Bedalov, A., Goodlett, D. R., and Kruglyak, L. (2007).

Genetic basis of proteome variation in yeast. Nature

genetics, 39(11):1369–1375.

Geiger, T., Wehner, a., Schaab, C., Cox, J., and Mann, M.

(2012). Comparative Proteomic Analysis of Eleven

Common Cell Lines Reveals Ubiquitous but Varying

Expression of Most Proteins. Molecular & Cellular

Proteomics, 11:M111.014050–M111.014050.

Gygi, S. P., Rochon, Y., Franza, B. R., and Aebersold,

R. (1999). Correlation between protein and mRNA

abundance in yeast. Molecular and cellular biology,

19(3):1720–1730.

Haider, S. and Pal, R. (2013). Integrated analysis of tran-

scriptomic and proteomic data. Current genomics,

14(2):91–110.

Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furu-

michi, M., and Tanabe, M. (2014). Data, informa-

tion, knowledge and principle: Back to metabolism in

KEGG. Nucleic Acids Research, 42(D1):199–205.

Kelder, T., Pico, A. R., Hanspers, K., Van Iersel, M. P.,

Evelo, C., and Conklin, B. R. (2009). Mining biologi-

cal pathways using WikiPathways web services. PLoS

ONE, 4(7):2–5.

Kumar, D., Bansal, G., Narang, A., Basak, T., Abbas, T.,

and Dash, D. (2016). Integrating transcriptome and

proteome proﬁling: Strategies and applications. Pro-

teomics, pages 1–12.

Kuo, T.-C., Tian, T.-F., and Tseng, Y. J. (2013). 3Omics: a

web-based systems biology tool for analysis, integra-

tion and visualization of human transcriptomic, pro-

teomic and metabolomic data. BMC systems biology,

7(1):64–78.

Larance, M. and Lamond, A. I. (2015). Multidimensional

proteomics for cell biology. Nature Reviews Molecu-

lar Cell Biology, 16(5):268–80.

Luo, B. and Lee, a. S. (2012). The critical roles of en-

doplasmic reticulum chaperones and unfolded protein

response in tumorigenesis and anticancer therapies.

Oncogene, 32(7):805–818.

Matys, V., Kel-Margoulis, O. V., Fricke, E., Liebich, I.,

Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev,

D., Krull, M., Hornischer, K., Voss, N., Stegmaier,

P., Lewicki-Potapov, B., Saxel, H., Kel, a. E., and

Wingender, E. (2006). TRANSFAC and its mod-

ule TRANSCompel: transcriptional gene regulation

in eukaryotes. Nucleic acids research, 34(Database

issue):D108–D110.

Neely, B. A., Wilkins, C. E., Marlow, L. A., Malyarenko,

D., Kim, Y., Ignatchenko, A., Sasinowska, H., Sasi-

nowski, M., Nyalwidhe, J. O., Kislinger, T., Cop-

land, J. A., and Drake, R. R. (2016). Proteotranscrip-

tomic Analysis Reveals Stage Speciﬁc Changes in the

Molecular Landscape of Clear-Cell Renal Cell Carci-

noma. PloS one, 11(4):e0154074.

Pirhaji, L., Milani, P., Leidl, M., Curran, T., Avila-

Pacheco, J., Clish, C. B., White, F. M., Saghatelian,

A., and Fraenkel, E. (2016). Revealing disease-

associated pathways by network integration of untar-

geted metabolomics. Nature methods, 13(9):770–776.

Piruzian, E., Bruskin, S., Ishkin, A., Abdeev, R.,

Moshkovskii, S., Melnik, S., Nikolsky, Y., and Nikol-

skaya, T. (2010). Integrated network analysis of tran-

scriptomic and proteomic data in psoriasis. BMC sys-

tems biology, 4(41).

Polpitiya, A. D., Qian, W. J., Jaitly, N., Petyuk, V. a.,

Adkins, J. N., Camp, D. G., Anderson, G. a., and

Smith, R. D. (2008). DAnTE: A statistical tool for

quantitative analysis of -omics data. Bioinformatics,

24(13):1556–1558.

Robinson, M. D. and Smyth, G. K. (2007). Mod-

erated statistical tests for assessing differences in

tag abundance. Bioinformatics (Oxford, England),

23(21):2881–7.

Shimwell, N. J., Bryan, R. T., Wei, W., James, N. D.,

Cheng, K. K., Zeegers, M. P., Johnson, P. J., Martin,

a., and Ward, D. G. (2013). Combined proteome and

transcriptome analyses for the discovery of urinary

biomarkers for urothelial carcinoma. British journal

of cancer, 108(9):1854–61.

Smyth, G. K. (2004). Linear Models and Empirical Bayes

Methods for Assessing Differential Expression in Mi-

croarray Experiments Linear Models and Empirical

Bayes Methods for Assessing Differential Expression

in Microarray Experiments. Statistical applications in

genetics and molecular biology, 3(1):1–26.

Talloen, W., Clevert, D.-A., Hochreiter, S., Amaratunga, D.,

Bijnens, L., Kass, S., and G

ohlmann, H. W. H. (2007).

I/NI-calls for the exclusion of non-informative genes:

a highly effective ﬁltering tool for microarray data.

Bioinformatics (Oxford, England), 23(21):2897–902.

Complementary Domain Prioritization: A Method to Improve Biologically Relevant Detection in Multi-Omic Data Sets

Uhl

en, M., Fagerberg, L., Hallstr

om, B. M., Lindskog, C.,

Oksvold, P., Mardinoglu, A., Sivertsson,

A., Kampf,

C., Sj

ostedt, E., Asplund, A., Olsson, I., Edlund, K.,

Lundberg, E., Navani, S., Szigyarto, C. A.-k., Ode-

berg, J., Djureinovic, D., Takanen, J. O., Hober, S.,

Alm, T., Edqvist, P.-h., Berling, H., Tegel, H., Mulder,

J., Rockberg, J., Nilsson, P., Schwenk, J. M., Ham-

sten, M., Feilitzen, K. V., Forsberg, M., Persson, L.,

Johansson, F., Zwahlen, M., Heijne, G. V., Nielsen, J.,

and Pont

en, F. (2015). Tissue-based map of the human

proteome. Science, 347(6220):1260419.

Uzozie, A., Nanni, P., Staiano, T., Grossmann, J., Barkow-

Oesterreicher, S., Shay, J. W., Tiwari, A., Buffoli, F.,

Laczko, E., and Marra, G. (2014). Sorbitol dehydro-

genase overexpression and other aspects of dysregu-

lated protein expression in human precancerous col-

orectal neoplasms: a quantitative proteomics study.

Molecular & Cellular Proteomics, 13(5):1198–1218.

Wang, J., Duncan, D., Shi, Z., and Zhang, B. (2013). WEB-

based GEne SeT AnaLysis Toolkit (WebGestalt): up-

date 2013. Nucleic acids research, 41(Web Server

issue):77–83.

Zhang, B., Wang, J., Wang, X., Zhu, J., Liu, Q., Shi, Z.,

Chambers, M. C., Zimmerman, L. J., Shaddox, K. F.,

Kim, S., Davies, S. R., Wang, S., Wang, P., Kinsinger,

C. R., Rivers, R. C., Rodriguez, H., Townsend, R. R.,

Ellis, M. J. C., Carr, S. a., Tabb, D. L., Coffey, R. J.,

Slebos, R. J. C., and Liebler, D. C. (2014). Proteoge-

nomic characterization of human colon and rectal can-

cer. Nature, 513(7518):382–387.

BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms