Flow Index based Characterization of next Generation Sequencing

Errors

Visualizing Pyrosequencing and Semiconductor Sequencing to Cope

with Homopolymer Errors

Peter Sarkozy

, Márton Enyedi

and Peter Antal

Department of Measurement and Information Systems, Budapest University of Technology and Economics,

Magyar tudósok körútja 2, Budapest, Hungary

Institute of Genetics, Biological Research Centre, Hungarian Academy of Sciences, Szeged, Hungary

Keywords: Next Generation Sequencing, Homopolymer Errors, Pyrosequencing, Semiconductor Sequencing,

Visualization.

Abstract: We characterized the error sources of multiple resequencing measurements performed on the Ion Torrent

Personal Genome Machines and the Roche 454 sequencing platforms. Homopolymer insertions and

deletions are the most common error types for these platforms, and there are many underlying factors which

define their occurrence patterns. In the paper we investigate the effect of flow order, specifically the

difference in the average value of the flow values for each homopolymer run length, based on the position in

the flow cycle.

1 INTRODUCTION

Next generation sequencing (NGS) is rapidly

becoming a mature technology. With an increasing

number of platforms distributed throughout the

world, more and more researchers are gaining access

to the world of low-cost, high-throughput

sequencing. The total cost of sequencing is dropping

so rapidly, that other, previously marginal costs like

data analysis are overtaking the cost of consumables.

Researchers must familiarize themselves with the

error characteristics of their chosen platform, or

resort to very stringent quality filtering in order to

identify relevant results amidst the increasing

amount of data.

The first truly NGS platform was the Roche 454

sequencing platform. This technology utilizes clonal

amplification of template libraries on magnetic

beads in emulsion PCR. The beads are then loaded

onto a PicoTiterPlate, where a single bead fits into a

single well. After the addition of reagents

(polymerase, luciferase) to the plate –which allow

the detection of light emitted by luciferase on

nucleotide incorporation–, a repeated order of four

nucleotides are successively flowed over the plate –

with stringent washes in between each flow – and

the light signal generated by the incorporated

nucleotides are recorded and analysed to produce the

base called sequence corresponding the well. Only

one type of nucleotide (A, C, G or T) is flowed in

each cycle, and a nucleotide is incorporated into the

strands only if their complementary nucleotide is the

next free base on the template strand. If multiple

identical bases are next in the template, then the

recorded light signal is proportional to the number of

incorporated bases. Runs of identical nucleotides are

called homopolymers.

Ion Torrent semiconductor sequencing by Life

Technologies uses a very similar approach

(Rothberg et al., 2011). The library preparation stage

also employs emulsion PCR, and coated beads (ion

sphere particles) provide the immobilization of

template strands on a semiconductor plate (chip), but

the detection of nucleotide incorporation is not done

by detecting the light emitted by the luciferase

enzyme, but by using a CMOS semiconductor layer

at the bottom of the plate to detect the change in the

pH of the reaction solution caused by the emission

of a proton when a nucleotide is incorporated into

template strand.

In this paper, we investigate the effect of the

flow index (the position in the flow cycle) on the

271

Sarkozy P., Enyedi M. and Antal P..

Flow Index based Characterization of next Generation Sequencing Errors - Visualizing Pyrosequencing and Semiconductor Sequencing to Cope with

Homopolymer Errors.

DOI: 10.5220/0004924902710277

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2014), pages 271-277

ISBN: 978-989-758-012-3

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

flow values and resulting base calls and error types

in a sequencing run.

2 NGS SEQUENCING ERRORS

There are multiple sources of errors in

pyrosequencing and semiconductor sequencing, and

many of them are common to the two platforms

because of the technological similarity (Quail et al.,

2012, Metzker, 2010).

One of the main sources of errors are the carry

forward/incomplete extension (CAFIE) errors

(Margulies et al., 2005). The carry forward

phenomenon refers to the event when the

nucleotides from a previous flow are not fully

washed out of a well, and these residual nucleotides

incorporate into the clonal template strands if they

match the next complementary nucleotide in the

template.

Carry forward has two major effects: (1) a flow

signal that is higher (since more incorporation events

occurred) than what would be recorded in the

absence of any residual nucleotides in a well, and

(2), the clonal template strands in which the residual

nucleotides are incorporated become desynchronized

(out of phase) compared to the rest of the strands on

a bead, as they are further along in synthesis than the

majority of the clonal templates, and will thus

contribute to the flow signals in different flows than

the rest.

Incomplete extension occurs when not every

nucleotide on a strand whose next nucleotide is

complementary to the current flow is incorporated,

thus resulting in a lower than expected flow signal,

and templates that are lagging behind in synthesis.

As the sequencing progresses with each

successive flow, CAFIE events result in increasingly

asynchronous clonal template strands on each bead.

This effect is directly observable as the baseline

flow signal increases during the sequencing run, and

it results in homopolymer over and undercalls, as

well as single base mismatches.

Many methods have been proposed to help

mitigate the effect of CAFIE errors; including the

alteration of the flow order to allow clonal template

strands to catch up and synchronize, as well as post-

sequencing mathematical methods to model the

effects of CAFIE. The accumulation of CAFIE

errors on a bead during the sequencing run result in

higher variance of the flow signal, and lead to higher

error rates, which are reflected in the read’s base

quality scores. The methods used for reducing

CAFIE errors are commonly referred to as phase

correction, and are highly platform specific and are

implemented in the signal processing and base

calling pipelines of each vendors’ software

platforms.

2.1 Key Signal Normalization

Both sequencing platforms employ a 4-base TACG

key sequence ligated to the 3’ end of each fragment.

This allows easy identification of wells with

populated beads, as well as providing the

normalization levels for the flow signals of each

read. Incorrect normalization can result in multiple

under or overcalls in an affected read.

2.2 Flow Order Optimization

The Ion Torrent Personal Genome Machine

measurements in this paper were sequenced using a

flow order referred to as the Samba, which is a 32

step sequence of

TACGTACGTCTGAGCATCGATCGATGTACAGC

repeated 15.6 times, for a total of 500 flows.

Compared to the previously used flow order of

TCAG, this flow order has the advantage of

allowing the clonal template strands to

resynchronize to some degree, at the expense of non-

optimal read length. It can be demonstrated that for

random sequences, optimal flow order with respect

to read length is attained through repeating the same

flow order of the 4 possible nucleotides. If, for

example the flow order contains only three bases

(e.g. TACTACTACTAC), then all strands are

elongated to the next G nucleotide. The

3 PREVIOUS WORKS

Besides the standard vendor supplied solutions,

multiple approaches have been published that utilize

the flow values underlying the base calls to increase

the base call accuracy of pyrosequencing and

semiconductor sequencing.

3.1 Flow-space Alignment

A read is intrinsically represented as a series of flow

values, where each flow value is proportional to the

length of the corresponding homopolymer run. The

reference sequence can be transformed into series of

flow values, and the alignment can be performed in

flow space with the Flowgram Alignment Tool

(Vacic et al., 2008), allowing higher mapping

accuracy.

BIOINFORMATICS2014-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

272

3.2 Maximum Likelihood Sequence

Clustering

In an early, non-generative approach, a set of

flowgrams were clustered based on the distance

between a flowgram and a sequence, using a

probabilistic model derived from the alignments to

the parent sequence with an exact Needleman-

Wunsch algorithm that empirically models

sequencing noise, and is applied to amplicon

sequencing. Using maximum likelihood, the quantity

and number of true underlying sequences can be

reliably estimated, especially in metagenomics

studies where similar regions are present with

widely varying coverage with PyroNoise (Quince et

al., 2009).

The length of each homopolymer run can be

more accurately characterized using Bayesian

approaches, as the most probable number of

identical bases given the observed flow values with

the help of PyroBayes (Quinlan et al., 2008)

3.3 Hidden Markov Models

Generative approaches attempt to model the

generative process that produces the flowgrams. The

characteristic homopolymer errors in

pyrosequencing and semiconductor sequencing often

require special care when mapping the reads to a

reference sequence. Hidden Markov Models

(HMMs), specifically the pFam models and

extensions widely used in biological sequence

analysis allowed position specific scoring and

management of indels, e.g. to manage that the gap

open penalty of most aligners is often higher than

that of a substitution.

An extended Hidden Markov Model to emit

values in the flow space was proposed as an

integrated solution, which could cope with an error,

where a homopolymer indel follows a base

substitution. The HMM can be constructed with

parameter estimation from the raw flow values and

from the reference sequence transformed into flow

space, and the read alignments to the reference

sequence can be adjusted by decoding the model

with the Viterbi algorithm (Zeng et al., 2013). This

results in higher specificity and sensitivity in variant

detection.

4 MATERIALS AND METHODS

In this paper, we used a human BRCA1 and BRCA2

exon targeted resequencing run on an Ion Torrent

316 chip. All DNA samples were prepared from

blood samples obtained in the Biological Research

Centre (BRC, Szeged). The amplicons (81 PCR

fragments) were generated from germline blood

DNA and covered all coding sequences of the

BRCA1 and BRCA2 genes. These libraries were

prepared with the commercially available Ion Plus

Fragment Library Kit (Life Technologies) using

custom barcoded adaptor sequences.

Generated libraries were controlled for adaptor

dimers and size range using agarose gel

electrophoresis. Samples were isolated from the gel

using DNA fragment extraction kit (Geneaid).

Fragment library quantification was carried out by

Q-PCR (Kapa Biosystems) followed by emulsion

PCR with the Ion PGM™ Template OT2 200 Kit

using the Ion OneTouch™ 2 System (Life

Technologies). Ion sphere particles (ISP) were

enriched using the E/S module and were sequenced

with an Ion PGM in a 200-bp configuration run

using 316 chip (Life Technologies).

The run had a total number of 2,808,212 reads

and an average read length of 115 (σ = 55). The

mapped mismatch rate and insertion/deletion rates

were 0.92%, 0.66%, and 0.5% respectively. The

mismatch rate is inflated by multiple true SNP’s, but

the number insertions and deletions far outnumber

the true indel count in the reference sequence. The

results obtained from our run were compared to the

results from a publicly available Ion Torrent dataset

(FLO-528).

We also tested a publicly available

Acinectobacter baylyi shotgun sequencing run on

the Roche 454 platform as a comparison from the

CloVR public datasets. The dataset had 250,000

reads with an average read length of 467 (σ = 87).

The 454 sequencing measurements analysed in

our research use a fixed flow order of TACG

repeated (though recent advancements have allowed

variable flow order to resynchronize reads) 200

times, for a total of 800 flows.

In order to obtain a deeper understanding of the

underlying measurement, we did not use any quality

clipping or filtering on any of the datasets beyond

that offered by the Torrent Suite 3.6 software

defaults and by the Roche Signal Processing

application defaults, respectively.

The Torrent Suite software as of version 3.2+

does not support the exporting of the results into a

.sff (standard flowgram format) file with phase-

corrected flow values. Since it only exports the raw

key normalized flow values, we created a software

package that allows the conversion of the phase-

corrected flow values from the unaligned BAM files.

FlowIndexbasedCharacterizationofnextGenerationSequencingErrors-VisualizingPyrosequencingandSemiconductor

SequencingtoCopewithHomopolymerErrors

273

This software has been published at

https://github.com/psarkozy/sffviz.

All mapping computations were performed with

BOWTIE2 2.1.0 (Langmead et al., 2012), with the --

very-sensitive option enabled.

5 RESULTS

Previous works acknowledge the significant effect

of the flow index on flow signal distributions, but

they only visualize the flow signal distribution

histograms based on the frequency of each flow

signal value for each nucleotide (Balzer et al., 2010).

Plotting the flow signal distributions vs. the flow

indices allows greater insight into the characteristics

of pyrosequencing and semiconductor sequencing.

In the case of pyroseqencing, Figure 1. shows

that the spread of the flow signal increases with

homopolymer run length, and also increases with the

flow index. As CAFIE errors accumulate during the

sequencing run, the noise floor of the 0-mer flows

(flows which did not achieve sufficient signal

intensity to classify as a base call) increase, and the

average flow value per mer length decrease. The

flow signal distributions remain easily identifiable

and do not overlap at the beginning of the

sequencing run, but as the flow index increases, they

become more difficult to separate. In

semiconductor sequencing, the raw normalized flow

value histogram (Figure 2.) shows similar

characteristics to pyrosequencing. The variance of

the raw normalized flow values for each

homopolymer run length show greater variance from

the start of the run, and there is a marked signal level

droop as the mer count and flow index increase. A

visible high-frequency jitter is apparent on the

expected flow values of the shorter homopolymer

runs. The noise floor produced also shows a fast

increase rate, especially compared to the 454 dataset

Figure 1: Flow values (Y axis) vs. the flow indices (X

axis) heat map plotted for a Roche 454 shotgun

sequencing run shows good separation of the flow values

corresponding to each homopolymer run length.

Figure 2: The raw normalized (left) and phase corrected (right) flow values from an Ion Torrent 318 chip run. The X axis is

the index of the flow in the flow order, the Y axis represents the flow values. A phase corrected flow value is used to call

the number of bases in a homopolymer run. The cutoff points for each homopolymer length are at multiples of the length.

(100*(mer count)-50). The temperature of the heat map shows the count of each flow index – flow value pair. Flow values

greater than 1000 along with homopolymer lengths of 10+ are clipped from the images to maintain readability.

BIOINFORMATICS2014-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

274

The phase corrected flow values show a more

refined image of the underlying true homopolymer

lengths, and it can be seen that the phase correction

performs well in reducing the signal level droop and

lowering the noise floor. The increased separation of

the flow values corresponding to each mer count is

visible throughout the entire flow sequence. Base

calling is performed on the phase corrected flow

values (Equation 1.).

mer_count = round(flow_value/100) (1)

Single base mismatches occur predominantly in 1-

mer homopolymer base calls, as it is expected based

on the technological platform, with a higher number

of insertions and deletions in the longer

homopolymer runs (Figure 3.). The distribution of

insertions and deletions over the flow index vs. flow

value heat maps (Figure 4.) shows that the values

corresponding to each insertion and deletion usually

occur near the rounding points of Equation 1. The

error rate for all three types of errors increase as the

flow index increases, and the number of errors start

to drop as the read count at each flow index starts to

decrease, as the ends of the reads are surpassed

(Figure 5.).

The high frequency jitter in the flow values

based on the position in the flow cycle has been

observed by others (Bragg et al., 2013). The main

finding presented in the paper is that the cause of

Figure 3: Percentage of errors in N-mer homopolymer

lengths.

this variation in the average flow values per

homopolymer run length can be traced to the

distance between two identical bases in the flow

cycle. Indeed, if the distance between two identical

flows is small (in the Samba cycle, the possible

distance values are characteristic to each base), then

the average flow value corresponding to each mer

count is lower than the expected value of

100*mer_count. Respectively, the average flow

value is higher for greater distances. This

phenomenon is illustrated in Figure 6. Because both

carry forward and incomplete extension errors occur,

the distance metrics should take into account the

distance to the previous identical base, and the next

identical base.

Figure 4: The phase corrected flow value vs. flow index distributions for deletions (left) and insertions (right). The

temperature of the heat map shows the number of indels with a specific flow value and flow index. Homopolymer indels are

clustered near the cutoff points in the flow values, with the majority of deletions closer to the low cutoff points, and

insertions closer to the high cutoff points. The data used for this figure also may contain a small number of non

homopolymer and true indels in the sequencing runs, but their numbers are much lower than the 1% sequencer-specific

error. The final variant calling for this sequencing run is still under evaluation. Flow values greater than 1000 are clipped

from the images for readability.

050100

Mismatch

Insertion

Deletion

1 2

3 4

5 6

7 8

9 10

FlowIndexbasedCharacterizationofnextGenerationSequencingErrors-VisualizingPyrosequencingandSemiconductor

SequencingtoCopewithHomopolymerErrors

275

Figure 6: The four bases (A, C, G and T) have different repeat offsets in the Samba flow cycle. Each line represents the

average flow value for a given mer count vs. the flow index. Only mer counts of 1, 2 and 3 are plotted, because low flow

value counts for higher homopolymer lengths result in excessive variance to their average flow values. The lines are color

coded based on the distance of the previous identical base in the flow cycle; magenta = 2, red = 3, yellow = 4, cyan = 5,

green = 6, blue = 7.

Figure 5: The Y axis shows the number of errors in an Ion

Torrent sequencing run against the X axis of the flow

index. The blue line represents base mismatches, the green

line represents deletions, while the red line represents

insertions. The curves are smoothed with a window of 32

(the length of one Samba cycle) to allow visual

differentiation. Note that the counts are not normalized to

the number of reads still under sequencing at each flow

index.

6 CONCLUSIONS

The Samba flow order allows for the mitigation of

the CAFIE errors, but it introduces additional

complexity through dependence on the distance

between identical base flows. In the paper we

investigated this phenomenon, as an explanation for

the difference in the average value of the flow values

for each homopolymer run length, based on the

position in the Samba cycle. The better exploitation

of this effect can lead to improved variant detection

methods. Indeed, current methods are robust for high

coverage sequencing and identifications of germline

mutation, but quantitative applications such as

metagenomics and somatic mutation detection

require higher specificity at lower coverage values.

The software tools developed to convert

unaligned .BAM files exported from the Torrent

Suite software into standard flowgram format file

BIOINFORMATICS2014-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

276

with phase corrected flow values and the

visualization tools are available at

https://github.com/psarkozy/sffviz.

7 FURTHER WORK

The reported findings allow the refinements of

existing generative flowgram based models, to

improve the quality of sequencing measurements.

We are evaluating models that take into account the

position in the flow cycle and the distances to

previous and next identical bases in the flow cycle,

to allow for the correction of the flow signal

distributions, and to enable the reduction of

homopolymer insertions and deletions.

ACKNOWLEDGEMENTS

The publication was supported by the TÁMOP-

4.2.2.C-11/1/KONV-2012-0001 project. The project

has been supported by the European Union, co-

financed by the European Social Fund. This research

was partially supported by the ARTEMIS JU and the

Hungarian National Development Agency (NFÜ) in

frame of the R3-COP (Robust & Safe Mobile Co-

operative Systems) project. The research was also

partially supported by OTKA 81466, OTKA 81941,

OTKA 83766, and GOP-1.1.1-11-2012-0030.

REFERENCES

Rothberg, JM., Hinz, W., Rearick, TM., Schultz, J.,

Mileski, W., 2011. An integrated semiconductor

device enabling non-optical genome sequencing.

Nature 475: 348–352.

Metzker, ML., 2010. Sequencing technologies - the next

generation. Nature Reviews Genetics,11:31-46.

Quail, MA., Smith, M., Coupland, P., Otto, TD., Harris,

SR., Connor, TR., Bertoni, A., Swerdlow, HP., Gu,Y.,

2012. A tale of three next generation sequencing

platforms: comparison of Ion Torrent, Pacific

Biosciences and Illumina MiSeq sequencers. BMC

Genomics 2012. 13:341.

Margulies, M., Egholm, M., Altman, WE., Attiya, S.,

Bader, JS., Bemben, LA., Berka, J., Braverman, MS.,

Chen, YJ., Chen, Z., Dewell, SB., Du, L., Fierro, JM.,

Gomes, XV., Godwin, BC., He, W., Helgesen, S., Ho,

CH., Irzyk, GP., Jando, SC., Alenquer, ML., Jarvie,

TP., Jirage, KB., Kim, JB., Knight, JR., Lanza, JR.,

Leamon, JH., Lefkowitz, SM., Lei, M., Li, J., Lohman,

KL., Lu, H., Makhijani, VB., McDade, KE.,

McKenna, MP., Myers, EW., Nickerson, E., Nobile,

JR., Plant, R., Puc, BP., Ronan, MT., Roth, GT.,

Sarkis, GJ., Simons, JF., Simpson, JW., Srinivasan,

M., Tartaro, KR., Tomasz, A., Vogt, KA., Volkmer,

GA., Wang, SH., Wang, Y., Weiner, MP., Yu, P.,

Begley, RF., Rothberg, JM., 2005. Genome

sequencing in microfabricated high-density picolitre

reactors. Nature. 2005,437:376-80.

Vacic, V., Jin, H., Zhu, JK., Lonardi, S., 2008. A

probabilistic method for small RNA flowgram

matching. Pacific Symposium on Biocomputing

2008:75-86.

Quince, C., Lanzén, A., Curtis, TP., Davenport, RJ., Hall,

N., Head, IM., Read, LF., Sloan, WT., 2009. Accurate

determination of microbial diversity from 454

pyrosequencing data. Nat Methods. 2009,9:639-41.

Quinlan, AR., Stewart, DA., Strömberg, MP., Marth, GT.,

2008. Pyrobayes: an improved base caller for SNP

discovery in pyrosequences. Nature Methods.

2008,5:179 – 18.

Zeng, F., Jiang, R., Chen, T., 2013. PyroHMMsnp: an

SNP caller for Ion Torrent and 454 sequencing data.

Nucleic Acids Research, 2013 Jul;41(13):

Langmead, B., Salzberg, S., 2012. Fast gapped-read

alignment with Bowtie 2. Nature Methods. 2012,

9:357-359.

Balzer, S., Malde, K., Lanzén, A., Sharma, A., Jonassen,

I., 2010. Characteristics of 454 pyrosequencing data—

enabling realistic simulation with flowsim.

Bioinformatics. 2010. 26(18):i420-i425.

Bragg, LM., Stone, G., Butler, MK., Hugenholtz, P.,

Tyson, GW., 2013. Shining a Light on Dark

Sequencing: Characterising Errors in Ion Torrent

PGM Data. PLoS Comput Biol 9(4): e1003031.

FlowIndexbasedCharacterizationofnextGenerationSequencingErrors-VisualizingPyrosequencingandSemiconductor

SequencingtoCopewithHomopolymerErrors

277