APPLYING CONCEPTUAL MODELING TO ALIGNMENT TOOLS
ONE STEP TOWARDS THE AUTOMATION OF DNA SEQUENCE ANALYSIS

Maria José Villanueva, Francisco Valverde and Oscar Pastor

Centro de Investigación en Métodos de Producción de Software, Universidad Politécnica de Valencia
Camino de Vera S/N Valencia, Spain

Keywords: Variation detection, Alignment tools, Software engineering.

Abstract: Nowadays, the search for variations in DNA samples with respect to a reference sequence is performed using several bioinformatic tools. Due to the complexity of the process, none of these tools fulfills all the functionality required by biologists. For that reason, defining an integration process between these tools becomes a mandatory requirement. One obstacle is that bioinformatic tools do not comply with any standard format for their output reports. As a consequence, the data flow among tools must be handled manually. This paper proposes a conceptual model that formalizes how the output of alignment tools must be produced, and provides a textual format based on this conceptual model. Thanks to both contributions, the integration is handled in the problem space and the related technological details are avoided. As a proof of concept, the proposed format has been applied in a DNA sequence analysis process that uses two bioinformatic tools.

1 INTRODUCTION

DNA sequence analysis is a process that is currently not efficiently solved in the context of disease diagnosis. Because of the complexity of the process, several different tools are required to produce an accurate analysis. Biologists claim that none of these tools provides all the functionality required to complete a sequence analysis process (Rusk, 2009). Briefly, this process is divided into several phases, each performed with a different tool or by the biologist:

1. Basecalling phase: basecalling tools obtain the nucleotide chain from an electropherogram.
2. Basecalling revision phase: biologists correct the sequence provided by the basecalling tools.
3. Variation detection phase: alignment tools obtain the variations of a sequence with respect to a reference sequence.
4. Phenotype assessment phase: variation analysis tools associate the suitable phenotype with every variation found.
5. Diagnosis report phase: biologists manually gather all the results and write down their conclusions in a report.

In order to support the whole analysis process, all these different tools and manual procedures must be combined. The main drawback of this approach is that moving data among them is not a trivial task, and it must be performed manually for every analysis. Concerning the variation detection (3) and phenotype assessment (4) phases, an automated integration of tools still cannot be achieved because of: 1) the lack of standards in the output results of alignment tools; and 2) the ambiguity about what exactly has to be imported by variation analysis tools.

This work proposes a solution for both problems in order to support the integration between alignment tools and variation analysis tools. The presented approach introduces the use of conceptual models (Kühne, 2005) to provide a formal definition of what exactly a variation is and which concepts are relevant in a variation report.

With the aim of formalizing variation reports, this work reviews several alignment tools used by biologists in the variation detection phase. This review has been useful to determine which kinds of variations are detected and how they are usually described. From the conclusions drawn, a conceptual model is defined to support the formal specification of these variation reports. Based on this conceptual model, a textual format is defined using the XML Schema specification (Biron and Malhotra, 2004). For validation purposes, this textual format has been applied to the integration between two alignment tools and one variation analysis tool.

Villanueva M., Valverde F. and Pastor O.
APPLYING CONCEPTUAL MODELING TO ALIGNMENT TOOLS: ONE STEP TOWARDS THE AUTOMATION OF DNA SEQUENCE ANALYSIS.
DOI: 10.5220/0003142001370142
In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2011), pages 137-142
ISBN: 978-989-8425-36-2
© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 explains the alignment tools review. Section 4 presents a conceptual model that formalizes the variation reports, together with the proposed XML format. In Section 5, the contributions of this work are applied to a real integration scenario. Finally, Section 6 presents the concluding remarks.

2 RELATED WORK

In order to formalize the heterogeneity of the concepts in the genomic domain, several works propose the use of conceptual models. For instance, Paton et al. (2000) describe a collection of conceptual models in yeast. Paton's work models general genomic data and data related to experiments, proteins or alleles. Another approach, PaGE-OM (Brookes et al., 2009), proposes a conceptual model that represents genomic data in relation to assays performed by biologists. The Atlas project (Shah et al., 2005) presents an integration attempt that defines the genomic data models to be integrated from different databases. Finally, the Gene Ontology (Gene Ontology Consortium, 2004) defines a set of vocabularies and classifications related to biological functions, processes, and cellular components.

A common issue in these approaches is that the proposed conceptualizations are closely tied to the experimental data, the technologies used or the representation formats. As a consequence, these approaches cannot be easily adopted by variation analysis tools. The purpose of this work is to use conceptual models to achieve a domain representation that only considers the precise biological concepts.

Focusing on the problem of alignment output representation, other attempts to solve the lack of standards can be found as well. For example, the Sequence Alignment/Map (SAM) format (Li et al., 2009) is a compact format to express the results of alignments. Its main drawbacks are the complexity of the syntax and the need to implement a low-level mechanism to extract the data. Our proposal overcomes these drawbacks by using a conceptual model that is easier for biologists to understand. The complexity of data representation is reduced thanks to the formalization of the variation detection domain. As one implementation of this conceptual model, a textual format based on XML is presented; XML is a standard language supported by several software development environments. The implementation of the software integration components is simplified by the conceptual model and the corresponding XML format, which can be used inside a model-driven software development process.

3 ALIGNMENT TOOLS REVIEW

With the purpose of detecting the most relevant concepts that alignment tools use in their reports, a set of the most representative ones has been reviewed: Sequencher (Gene Codes Corporation, 2010), SeqScape (Applied Biosystems, 2010), Mutation Surveyor (Softgenetics, 2010), Codon Code Aligner (Codon Code Corporation, 2010), Polyphred (Department Genomic Sciences, 2010), InSNP (Manaster et al., 2005) and the web tool BLAST from the NCBI (NCBI, 2010).

To perform this review, a real test has been carried out with these tools. Real samples of the BRCA1 gene were provided by a research laboratory to give the results practical value. The strategy followed in this test was:

1. Installation of the tools on a computer running Windows 7 (Sequencher, SeqScape, Mutation Surveyor, CodonCode Aligner, old versions of Staden and InSNP). The tools supported only on Linux were installed on another computer running Ubuntu v8.04 (Polyphred).
2. Reading of the introductory tutorials and user guides to understand the general principles of each tool, its graphical user interface and the supported functionality.
3. Checking of the functionality of each tool, using the samples provided.
4. Searching for variations within the samples in order to compare the results and limitations under the same conditions.

While working with these tools, the required concepts around variations have been gathered and three main issues have been detected. In the first place, a complete DNA sequence cannot be introduced at once because of technological limitations: sequencing machines are constrained to a maximum sequence length, so the sequenced region must be split into small pieces called contigs. In the second place, the limitations of the sequencing process produce erroneous bases, so, in order to improve the quality of the analysis, this process must be carried out several times.


Table 1: Alignment tools comparison.

                      Sequencher  SeqScape  CCAligner  M.Surveyor  Polyphred  InSNP  Blast
Sample edition        yes         yes       yes        no          no         no     no
Assembly
  among samples       yes         yes       yes        no          no         no     yes
  to RefSeq           yes         yes       yes        yes         yes        yes    yes
Variations
  Homozygosis
    insertions        yes         yes       yes        yes         no         no     yes
    deletions         yes         yes       yes        yes         no         no     yes
    indels            yes         yes       yes        yes         yes        yes    yes
  Heterozygosis
    insertions        no          yes       yes        yes         no         yes    -
    deletions         no          yes       yes        yes         no         yes    -
    indels            yes         yes       yes        yes         yes        yes    -
Report format
  PDF                 yes         yes       yes        yes         no         no     no
  TXT                 yes         yes       yes        yes         yes        yes    yes
  XLS                 yes         yes       yes        yes         no         no     no
  XML                 no          yes       no         yes         yes        no     yes
  HTML                yes         no        no         yes         no         no     no

From all the sequences obtained, a consensus sequence is derived. In the third place, some variations cannot be expressed with the common letters used to identify the DNA bases. Because a DNA sequence is made up of two alleles, variations can be homozygous, if the nucleotide changes in both alleles, or heterozygous, if the nucleotide changes in only one allele. As a consequence, an additional set of specific letters is required for reporting heterozygous variations.
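As an illustration of such an additional set of letters, the following Python sketch assumes that the reported letters follow the IUPAC nucleotide ambiguity codes. The text does not name the alphabet used by the reviewed tools, so this choice is an assumption, and the helper name `heterozygous_code` is hypothetical.

```python
# Two-base IUPAC ambiguity codes. Using this alphabet is an assumption:
# the paper only says that "an additional set of specific letters" is needed.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def heterozygous_code(allele_a: str, allele_b: str) -> str:
    """Return one letter describing the bases observed in the two alleles."""
    a, b = allele_a.upper(), allele_b.upper()
    if len(a) != 1 or len(b) != 1 or a not in "ACGT" or b not in "ACGT":
        raise ValueError("each allele must be a single base: A, C, G or T")
    if a == b:
        return a  # same base in both alleles: the plain letter is enough
    return IUPAC_TWO_BASE[frozenset((a, b))]
```

For example, a position where one allele carries A and the other carries G would be reported with the single letter R under this assumption.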

Concerning the functionality of the tools, the general procedure workflow is defined in five steps:

1. Alignment project creation: the contigs to be analyzed are introduced into the tool and the alignment is configured according to several parameters.
2. Contig assembly: the different contigs are ordered and aligned according to a reference sequence.
3. Contig basecalling correction: biologists check the different contig bases, using the reference sequence and their knowledge of the field as a guideline. The errors produced by the sequencing machine or the basecalling algorithms are then corrected and a consensus sequence is obtained.
4. Comparison: an alignment between the consensus sequence and a reference sequence is carried out to search for variations (insertions, deletions and indels in homozygosis or heterozygosis).
5. Report generation: all the detected variations are gathered in a variation report. This report can be exported and used in another bioinformatic task, for instance to document which variation can produce a disease.
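The comparison step (step 4) can be sketched in Python as follows. This is not the algorithm of any reviewed tool: it assumes that an aligner has already produced two gapped strings of equal length, with `-` marking gaps, and it only groups the mismatching columns into runs. The function names, the tuple layout and the convention of anchoring an insertion at the preceding reference base are illustrative assumptions, and heterozygosity is ignored.

```python
from itertools import groupby

GAP = "-"  # assumed gap symbol in both aligned strings


def column_kind(ref_base: str, cons_base: str) -> str:
    """Classify a single alignment column."""
    if ref_base == GAP and cons_base == GAP:
        return "match"
    if ref_base == GAP:
        return "insertion"      # base present only in the consensus
    if cons_base == GAP:
        return "deletion"       # base present only in the reference
    if ref_base.upper() != cons_base.upper():
        return "substitution"
    return "match"


def find_differences(ref_aln: str, cons_aln: str):
    """Return (kind, start, end, bases) runs using 1-based reference positions."""
    if len(ref_aln) != len(cons_aln):
        raise ValueError("aligned strings must have the same length")
    differences = []
    ref_pos = 0  # number of reference bases consumed so far
    columns = zip(ref_aln, cons_aln)
    for kind, run in groupby(columns, key=lambda col: column_kind(*col)):
        cols = list(run)
        consumed = sum(1 for ref_base, _ in cols if ref_base != GAP)
        if kind == "insertion":
            # Assumed convention: anchor at the last consumed reference base.
            bases = "".join(cons_base for _, cons_base in cols)
            differences.append((kind, ref_pos, ref_pos, bases))
        elif kind == "deletion":
            bases = "".join(ref_base for ref_base, _ in cols)
            differences.append((kind, ref_pos + 1, ref_pos + consumed, bases))
        elif kind == "substitution":
            bases = "".join(cons_base for _, cons_base in cols)
            differences.append((kind, ref_pos + 1, ref_pos + consumed, bases))
        ref_pos += consumed
    return differences
```

With the aligned pair `"ACGT-ACGTA"` (reference) and `"ACCTGAC--A"` (consensus), the function returns a substitution at position 3, an insertion of `G` after position 4 and a deletion covering positions 7 to 8.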

A comparison among all tools is summarized in Table 1. According to these results, all tools are able to assemble sequences into their correct position inside a reference, but only Sequencher, SeqScape and Codon Code Aligner support sample edition to correct basecalling. Regarding variation detection, SeqScape, Codon Code Aligner and Mutation Surveyor are the only tools that search for all kinds of variations. Each tool uses its own notation to generate the reports and several formats to export these reports.

4 CONCEPTUAL MODEL FOR VARIATION REPORTS

The main contribution of this work is to formalize the common concepts that are used in the alignment tools for generating the output reports. Taking into account the common expressiveness of these tools, a conceptual model has been defined (Figure 1).

While performing the variation detection phase, the first step is to align the input sequence and a reference sequence. One Alignment is always defined by a Consensus sequence and a Reference sequence. Both conceptual entities inherit from the conceptual entity DNASequence. The Alignment entity has an attribute called geneId, which identifies the gene to be analyzed. For standardization purposes, this attribute complies with the standard nomenclature of the HUGO Gene Nomenclature Committee (HGNC) (Povey et al., 2001).

Figure 1: Alignment Report Conceptual Model. [Class diagram: Alignment (geneId : string); DNASequence (id : int, refSource : string, sequence : string, startPos : int, endPos : int), specialized by Reference and Consensus; Difference (startPos : int, endPos : int, isHeterozygous : bool), specialized by Insertion (bases : string), Deletion (length : int) and Substitution (bases : string); one association is labelled 1..*.]

A DNASequence defines the set of features associated with a sequenced DNA sample. A DNASequence is represented by a numerical identifier (id), a sequence that is a string of letters representing the nucleotides of the sample, a refSource that indicates the data source (a database, a local file, etc.) where the sequence comes from, and a range composed of the startPos and endPos sequence positions, which can be used to delimit a region of the sequence. The Consensus entity models the DNA sequence that is analyzed, for instance a patient sample, and the Reference entity models a DNA sequence usually used for comparison purposes.

All the differences found in the Alignment between both sequences are considered variations and are modeled by the entity Difference. When a variation is found in one Alignment, the position where it is located has to be indicated. To avoid ambiguities, and following the recommendations of the Human Genome Variation Society (HGVS) (Den Dunnen and Antonarakis, 2000), the Difference entity has two attributes for defining where a variation starts and ends: startPos and endPos. Moreover, a Difference has the boolean attribute isHeterozygous, which indicates whether a variation occurs in one allele (heterozygosis) or in both alleles (homozygosis).

Differences are categorized into three entities according to the change performed in the sequence: Insertion (additional nucleotides are inserted), Deletion (several nucleotides are deleted), and Substitution (some nucleotides change their value). The entity Insertion has the attribute bases to indicate the inserted nucleotides; the entity Deletion has the attribute length to indicate how many nucleotides have been removed; and the entity Substitution also has an attribute bases, which indicates the new value of the changed nucleotides.
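The entities described above can be mirrored in code. The following Python sketch is one possible rendering, not the implementation used in this work: the snake_case field names, the check that start_pos does not exceed end_pos, and the choice to hold the differences as a list inside Alignment are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DNASequence:
    id: int
    ref_source: str   # database, local file, etc.
    sequence: str     # letters representing the nucleotides
    start_pos: int
    end_pos: int


@dataclass
class Reference(DNASequence):
    """Sequence used for comparison purposes."""


@dataclass
class Consensus(DNASequence):
    """Sequence under analysis, for instance a patient sample."""


@dataclass
class Difference:
    start_pos: int
    end_pos: int
    is_heterozygous: bool

    def __post_init__(self):
        # Assumed sanity check; the conceptual model does not state it.
        if self.start_pos > self.end_pos:
            raise ValueError("start_pos must not exceed end_pos")


@dataclass
class Insertion(Difference):
    bases: str        # inserted nucleotides


@dataclass
class Deletion(Difference):
    length: int       # number of removed nucleotides


@dataclass
class Substitution(Difference):
    bases: str        # new value of the changed nucleotides


@dataclass
class Alignment:
    gene_id: str      # gene symbol following the HGNC nomenclature
    reference: Reference
    consensus: Consensus
    # Assumption: the differences hang from the alignment as a list.
    differences: List[Difference] = field(default_factory=list)
```

An Alignment for BRCA1 would then hold one Reference, one Consensus and any number of Insertion, Deletion or Substitution objects.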

The presented conceptual model is implemented by defining a corresponding XML Schema. An example of this XML format is:

    ...
        </Reference>
        ...
            <sequence>atggta....aattggcca</sequence>
        </Consensus>
        ...
            <Sub endPos="20" initialPos="20" heterozygous="true">a</Sub>
            <Del length="5" endPos="52" initialPos="48">aaaaa</Del>
            <Del length="1" endPos="68" initialPos="68">g</Del>
        </Differences>
    </alignment>
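A consumer of this format could read the differences as sketched below, using Python's standard xml.etree.ElementTree module. The element and attribute names are copied from the example; the minimal wrapper document in `SAMPLE`, the function name `read_differences` and the rule that a missing `heterozygous` attribute means homozygous are assumptions made so that the sketch can run.

```python
import xml.etree.ElementTree as ET

# Minimal well-formed document built around the difference elements of the
# example. The wrapper is an assumption; the full report has more content.
SAMPLE = """
<alignment>
  <Differences>
    <Sub endPos="20" initialPos="20" heterozygous="true">a</Sub>
    <Del length="5" endPos="52" initialPos="48">aaaaa</Del>
    <Del length="1" endPos="68" initialPos="68">g</Del>
  </Differences>
</alignment>
"""


def read_differences(xml_text: str):
    """Return one dictionary per child element of <Differences>."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for elem in root.find("Differences"):
        records.append({
            "kind": elem.tag,
            "start": int(elem.get("initialPos")),
            "end": int(elem.get("endPos")),
            # Assumption: no 'heterozygous' attribute means homozygous.
            "heterozygous": elem.get("heterozygous", "false") == "true",
            "bases": (elem.text or "").strip(),
        })
    return records
```

Calling `read_differences(SAMPLE)` yields three records: one heterozygous substitution at position 20 and two deletions.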

5 PROOF OF CONCEPT: SEQUENCE ANALYSIS TOOLS INTEGRATION

Thanks to the conceptualization of the variation reports, the data flow between alignment tools and variation analysis tools becomes a systematized step. Concretely, the integration between these two types of tools can be easily implemented using the proposed XML format.

Figure 2: Integration process. [Diagram: in Scenario 1, the variations reports of Sequencher and Blast NCBI reach the PROS Prototype through Specific Translator A and Specific Translator B; in Scenario 2, the variations reports of both tools reach the PROS Prototype through the conceptual model and a common XML Schema.]

At the moment, data flow between alignment tools and variation analysis tools requires the development of format translators to achieve the communication among tools (see Figure 2). Alignment tools generate reports in their own formats and variation analysis tools import data also in their own formats. Hence, the integration requires a specific translator to transform the format of each alignment tool into the format of each variation analysis tool (Scenario 1). The problem lies in the fact that these translators are not reusable, so the more tools to be integrated, the more translators must be implemented.

This work, however, removes the dependency among tools and reduces the number of translators by using the conceptual model. As each tool manages the same concepts (already defined in the conceptual model), the integration is achieved by using the same XML format (Scenario 2). On the one hand, the reports generated by alignment tools must be expressed following the conceptual model; therefore, each report format is translated to the common XML format. On the other hand, variation analysis tools read data from the conceptual model, so the XML data is converted to each input format. With this solution, the developed translators can be reused in other integration processes. For this reason, it is only necessary to develop one translator for each tool to be integrated, whether it is an alignment tool or a variation analysis tool.
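The difference between the two scenarios can be made concrete with a small count. The Python sketch below assumes that, in Scenario 1, every alignment tool has to be connected to every variation analysis tool; the function names are hypothetical.

```python
def pairwise_translators(alignment_tools: int, analysis_tools: int) -> int:
    """Scenario 1: one translator per (alignment tool, analysis tool) pair."""
    return alignment_tools * analysis_tools


def shared_format_translators(alignment_tools: int, analysis_tools: int) -> int:
    """Scenario 2: one translator per tool, to or from the common format."""
    return alignment_tools + analysis_tools
```

Under this assumption, seven alignment tools and three analysis tools would need 21 translators in Scenario 1 and 10 in Scenario 2. The count grows multiplicatively in the first case and additively in the second, so the saving becomes visible as more tools of both kinds are integrated; with only a few tools the main gain is that each translator can be reused.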

For evaluation purposes, the alignment tools Sequencher and BLAST from the NCBI website have been integrated with a variation analysis tool. The selected tool is the PROS Prototype Tool (Martinez et al., 2010). Hence, three translators are implemented:

1. In the case of Sequencher, a translator that obtains the reference sequence, the consensus sequence and the variations, and creates the XML file.
2. In the case of the BLAST web tool, a translator that obtains the reference and consensus sequences, parses the output to obtain the variations and creates the XML file.
3. Regarding the PROS prototype, since it is developed in Java, the JAXB (Java Architecture for XML Binding) API (Ort and Mehta, 2003) has been used. This API allows XML data to be parsed into objects available in the context of the application. In order to consume the XML data, the translator instantiates the classes obtained with the JAXB binding compiler (xjc), extracts the required data and transforms it into objects that can be used by the variation analysis tool.
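The last step of the two alignment-side translators, creating the XML file, can be sketched as follows. The paper does not describe how those translators are written, so Python and its standard xml.etree.ElementTree module are used here purely for illustration (the PROS side uses JAXB in Java). The function name `build_report` and the input tuple layout are hypothetical; the element and attribute names follow the XML example of Section 4, and the deletion length is derived from the positions in the way that example suggests.

```python
import xml.etree.ElementTree as ET


def build_report(differences):
    """Serialize (tag, start, end, heterozygous, bases) tuples to XML text.

    'tag' is expected to be an element name from the example, e.g. "Sub"
    or "Del". Only the <Differences> part of a report is produced here.
    """
    root = ET.Element("alignment")
    container = ET.SubElement(root, "Differences")
    for tag, start, end, heterozygous, bases in differences:
        attrs = {"initialPos": str(start), "endPos": str(end)}
        if heterozygous:
            attrs["heterozygous"] = "true"
        if tag == "Del":
            # Matches the example: positions 48..52 give length 5.
            attrs["length"] = str(end - start + 1)
        elem = ET.SubElement(container, tag, attrs)
        elem.text = bases
    return ET.tostring(root, encoding="unicode")
```

A translator for a given alignment tool would first parse that tool's own report into such tuples and then call `build_report` to obtain the text of the common format.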

With the three translators implemented, the data flow among the tools is supported. Therefore, biologists perceive that the variation detection and the variation analysis phases are executed in a single step.

6 CONCLUDING REMARKS

This work proposes a conceptual model to achieve the integration of biological tools that perform two different phases of a DNA sequence analysis process.

The use of this conceptual model as an integration solution provides several advantages in relation to the current state. On the one hand, the conceptual model is based on the common biological concepts used by the alignment tools. Furthermore, because the proposed implementation of the conceptual model is based on the standard XML language, data exchange among different processes and tools is feasible. On the other hand, the use of conceptual models provides further advantages: 1) concepts are well defined; and 2) it is easier to reflect new changes and adapt the software to new requirements. If biological concepts change or alignment tools evolve, the conceptual model and its implementations can be easily modified in order to reflect the new concepts. For these reasons, biologists are free to choose the alignment tool that best fits their needs.

Apart from the benefits that this proposal offers, it must be taken into account that it also presents several issues. One issue is that the conceptual model could be incomplete because the commercial tools have been tested in trial versions, where some functionality is restricted. Therefore, it is possible that some concepts are missing. Another issue arises because it is not possible to modify the specific implementation of the alignment tools: the data has to be exported before it can be translated to the proposed format. This additional step must be carried out by biologists, so the process is not fully automated. As future work, with the goal of achieving a complete automation of DNA sequence analysis, some other phases must be addressed as well. For instance, the next step is to study how to create diagnosis reports automatically, taking into account the phenotype associated with the reported variations.

ACKNOWLEDGEMENTS

Thanks to the Instituto de Medicina Genómica (IMEGEN, http://www.imegen.es) for its support in providing the data. This research work is supported by the Spanish MICINN under the FPU grant AP2010-1985.

REFERENCES

Applied Biosystems (2010). Seqscape. http://www3.appliedbiosystems.com/ABHome/index.htm.
Biron, P. V. and Malhotra, A., editors (2004). XML Schema Part 2: Datatypes. W3C Recommendation. W3C, 2nd edition.
Brookes, A. J. et al. (2009). The Phenotype and Genotype Experiment Object Model (PaGE-OM): A Robust Data Structure for Information Related to DNA Variation. Human Mutation, 30(6):968–77.
Codon Code Corporation (2010). Codon Code Aligner. http://www.codoncode.com/aligner/.
Den Dunnen, J. T. and Antonarakis, S. E. (2000). Mutation Nomenclature Extensions and Suggestions to Describe Complex Mutations: A Discussion. Human Mutation, 15(1):7–12.
Department Genomic Sciences (2010). Polyphred. http://droog.gs.washington.edu/polyphred/.
Gene Codes Corporation (2010). Sequencher. http://www.genecodes.com/.
Gene Ontology Consortium (2004). The Gene Ontology (GO) Database and Informatics Resource. Nucleic Acids Research, 32(suppl1):D258–261.
Kühne, T. (2005). What is a Model. In Language Engineering for Model-Driven Software Development, number 04101 in Dagstuhl Seminar Proceedings, pages 200–0. IBFI, Schloss Dagstuhl, Germany.
Li, H. et al. (2009). The Sequence Alignment/Map Format and SAM Tools. Bioinformatics, 25(16):2078–2079.
Manaster, C. et al. (2005). InSNP: a tool for automated detection and visualization of SNPs and InDels. Human Mutation, 26(1):11–19.
Martinez, A. M. et al. (2010). Facing the challenges of genome information systems: a variation analysis prototype. CAiSE Forum.
NCBI (2010). BLAST (Basic Local Alignment Search Tool). http://blast.ncbi.nlm.nih.gov/Blast.cgi.
Ort, E. and Mehta, B. (2003). Java Architecture for XML Binding (JAXB). Technical Report, Sun Developer Network.
Paton, N. W. et al. (2000). Conceptual modelling of genomic information. Bioinformatics, 16(6):548–57.
Povey, S., Lovering, R., Bruford, E., Wright, M., Lush, M., and Wain, H. (2001). The HUGO Gene Nomenclature Committee (HGNC). Human Genetics, 109(6):678–680.
Rusk, N. (2009). Focus on Next-Generation Sequencing Data Analysis. Nature Methods, 6(11s):S1.
Shah, S. et al. (2005). Atlas - A Data Warehouse for Integrative Bioinformatics. BMC Bioinformatics, 6(1):34.
Softgenetics (2010). Mutation Surveyor. http://www.softgenetics.com/.
