Environmental Metagenome Classiﬁcation for Soil-based

Forensic Analysis

Jolanta Kawulok and Michal Kawulok

Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland

Keywords:

Metagenomes, Environmental Classiﬁcation, CoMeta, Forensic Analysis, Soil Sample.

Abstract:

Metagenome analysis makes it possible to extract essential information on the organisms that have left their

traces in a given environmental sample. In some cases, it is sufﬁcient to determine the origin of an environmen-

tal sample, rather than being able to accurately identify the organisms living there (which may be a challenging

task). For example, in forensic soil analysis, it could be possible to conﬁrm or exclude that a defendant was

present in a certain place by comparing a soil sample acquired from his belongings against the samples derived

from a variety of places (including the suspected place). In this paper, we present a method to identify the

environmental origins of metagenomic reads by comparing them with entire metagenomic collections derived

from reference samples. For this purpose, we exploit our CoMeta program, which allows for fast classiﬁcation

of metagenome samples, and we apply it to classify the extracted soil metagenomes to various collections of

soil samples. The experimental results reported in this paper indicate that the proposed approach is effective,

which allows us to outline the future research pathways to extend and improve our method.

1 INTRODUCTION

Nowadays, we may witness rapid development of the

methods for analysis of metagenomic reads, which

are sets of DNA fragments, represented as strings of

nucleotide symbols, derived from microbes living in a

given environment. The analysis of samples acquired

from explored places is aimed at answering the fol-

lowing questions: “Who is out there?”, “How much

of each?”, “What are their proportions?”, “What are

they doing?”, and “In what conditions appear?” (Han-

delsman, 2004; Simon and Daniel, 2011). Answer-

ing these questions requires solving two classiﬁcation

tasks, which respond to particular bioinformatic prob-

lems, falling into two major categories, namely super-

vised and unsupervised classiﬁcation.

1.1 Related Work

Supervised classiﬁcation of metagenomic reads con-

sists in comparing presented DNA fragments (termed

as a query sample) against a set of reference se-

quences, and the query sample is assigned to one

of these sequences (or to none of them). There are

many programs for sequence classiﬁcation, which can

be divided into (i) composition-based and (ii) sim-

ilarity search ones. The composition-based meth-

ods compare the features extracted from the refer-

ence sequences, such as the frequency, with which

certain substrings of a given length k occur in an an-

alyzed sequence (Weitschek et al., 2014). A num-

ber of methods are employed to classify the ex-

tracted feature vectors, including interpolated Markov

models, support vector machines (Patil et al., 2011),

k-nearest neighbors (Weitschek et al., 2015), ran-

dom forests (Chen and Lonardi, 2009), or naive

Bayes classiﬁer (Rosen et al., 2011). In the sim-

ilarity search methods, the reads are compared di-

rectly with the reference sequences—they include

MEGAN (Huson et al., 2007) and CARMA3 (Ger-

lach and Stoye, 2011) programs. There are also some

approaches to combine the elements of both strate-

gies (e.g., CoMeta (Kawulok and Deorowicz, 2015),

LMAT (Ames et al., 2013) or Kraken (Wood and

Salzberg, 2014)).

Comparing thousands of DNA fragments against

a huge database is very time consuming. Therefore,

in order to effectively search the databases, the simi-

larity measure between the DNA fragments is deﬁned

and computed employing speciﬁc optimization tech-

niques, including compression and indexing. Based

on the similarity between the reads and the refer-

ence sequences, the reads may be classiﬁed into some

groups of the reference sequences, deﬁned according

182

Kawulok, J. and Kawulok, M.

Environmental Metagenome Classiﬁcation for Soil-based Forensic Analysis.

DOI: 10.5220/0006659301820187

In Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2018) - Volume 3: BIOINFORMATICS, pages 182-187

ISBN: 978-989-758-280-6

to the objective of the study. Depending on the main

goal of the analysis, which determines the way the

reference groups are deﬁned, supervised classiﬁcation

of metagenomic data can be broken down as follows:

• Taxonomic classiﬁcation—each reference group

contains DNA fragments of organisms assigned

to the same taxon, whose rank may span from the

superkingdom to the species (Gerlach and Stoye,

2011; Bazinet and Cummings, 2012).

• Functional classiﬁcation—a reference group con-

tains the DNA fragments that enable the microor-

ganisms fulﬁll a certain function (e.g., degrada-

tion of petroleum alkanes) (Bazinet and Cum-

mings, 2012; Kennedy et al., 2011).

• Environmental classiﬁcation—each reference

group is formed with a metagenomic sample (or

samples) acquired from a certain environment.

The goal of such classiﬁcation scheme is to

determine the characteristics of the environment,

rather than identifying the organisms living there.

It is worth noting that the reference sequences

within each reference group do not have to

be annotated (assigned to speciﬁc species or

taxonomic units), which facilitates the procedure

in many cases.

The metagenomic reads may also be analysed

without using any reference sequences, which is re-

ferred to as unsupervised classiﬁcation. In such sce-

nario, the obtained reads are grouped into opera-

tional taxonomic units (OTUs) based on their mu-

tual similarities. This process is termed as binning—

among many applications, it is used for analysing

the proportion between different groups of organisms.

Here, similar optimization techniques (like compres-

sion and indexing of the sequences) can also be used

so as to accelerate the comparison process.

1.2 Contribution

In this paper, we address a problem of environmental

classiﬁcation, which has a wide variety of potential

practical applications. Here, we focus on analysing

soil samples for forensic purposes—the goal is to con-

ﬁrm or reject a hypothesis that a certain defendant

visited a speciﬁc place—soil traces acquired from his

belongings can be veriﬁed against a set of samples ac-

quired from a variety of places. Our contribution lies

in comparing the samples by measuring their simi-

larity directly in the space of the metagenome reads.

This is in contrast to earlier research in this ﬁeld

(Khodakova et al., 2014), in which the samples were

ﬁrst analysed to identify the microorganisms, whose

genomes are present in these samples, and then the

similarity was assessed by comparing the identiﬁed

species. Importantly, in our approach, we do not need

a reference database that is necessary to identify the

microorganisms. Although in this work we validate

our method for forensic data analysis, the developed

solution is generic and may be adopted to a differ-

ent scenario of environmental metagenomic classiﬁ-

cation. Our main point here is that the knowledge

of particular species is not necessary to recognize the

origin of the sample.

1.3 Paper Structure

The paper is structured as follows. Section 2.1

presents the metagenomic sets we use for validation,

whilst Section 2.2 describes how they are exploited

in the classiﬁcation process. Section 3 presents the

results of experimental validation, and Section 4 con-

cludes the paper.

2 MATERIALS AND METHODS

2.1 Metagenomic Sampling

For testing our method, we decided to select samples

derived from the soils. Owing to the large microbial

diversity of soil, soil sample classiﬁcation can serve

as a powerful tool for forensic soil examination. Soil

can be found on items submitted for forensic analysis.

Soil sticks under ﬁngernails, tools, weapons or cloths

and it can be transferred during the commission of a

criminal act (Khodakova et al., 2014).

We selected soil samples derived from four lo-

cations examined within three different projects.

The data sets were downloaded from EBI Metage-

nomics website

. Two of these projects were con-

ducted in the USA—in Alabama and Massachusetts

states (Stewart et al., 2011). Soil samples from the

third project were collected from two different sites

in Adelaide in South Australia (Khodakova et al.,

2014). These locations are approximately 3 km from

each other. The most relevant characteristics of these

metagenomic sets are presented in Table 1. The sets

contain 2 or 3 samples, each of which consists of hun-

dreds of thousands metagenomic reads.

2.2 Research Methodology

There are a number of tools for comparing metage-

nomic data, out of which for our study we selected our

Available at https://www.ebi.ac.uk/metagenomics (ac-

cessed on 9th October 2017)

Environmental Metagenome Classiﬁcation for Soil-based Forensic Analysis

183

CoMeta (Classiﬁcation of Metagenomes) program

(Kawulok and Deorowicz, 2015) due to its versatil-

ity and ease of use. Most of the existing tools are

intended for a speciﬁc application, and we found it

difﬁcult (if feasible at all) to deploy some of them for

a different purpose. Basically, CoMeta has been de-

signed for fast and accurate classiﬁcation of reads ob-

tained after sequencing entire environmental samples

and it allows a database to be built without any re-

strictions. The similarity (termed the match score) be-

tween the query read and each group (class) of the ref-

erence sequences is determined by counting the num-

ber of the nucleotides in those k-mers (i.e., all sub-

strings in the sequence of length k), which occur both

in the read and in the group. The read is classiﬁed to

that group, for which the match score is the largest.

We organize the reference sequences into the groups

of DNA sequences acquired from soils derived from

various places (each of these places is treated as a sep-

arate class). In this way, a new metagenomic sample

is classiﬁed to one of the created classes.

A simpliﬁed diagram of our classiﬁcation scheme

is shown in Figure 1. We have built N = 10 separate

sets of k-mers from the reads of metagenomic data

acquired from each sample in the reference (train-

ing) set. The reads acquired from a given sample

were compared only to other samples. The reads

derived from a query sample are compared against

the number of groups equal to the number of all in-

vestigated samples in the reference set (in the pre-

sented experiment—N − 1 = 9). The set of reference

sequences consists of n

+ n

I + ··· + n

databases,

wherein n

ones are derived from ﬁrst metagenomic

set (D

, D

, ..., D

), n

I from the second, and so on.

They are compared with q reads derived from a query

sample. The result of a single matching is termed

the match rate score (Ξ

Ri j

). During the intermediate

analysis, each read is attributed to one of the created

groups after exceeding a certain threshold deﬁned for

each group; otherwise it is marked as unknown (U).

Finally, the initial assignment of each read and collec-

tively all match rate scores (yellow boxes in the dia-

gram) are completely analyzed and the query sample

is classiﬁed to the appropriate class.

3 RESULTS

At the beginning of our experimental validation, we

veriﬁed the correctness of our framework. For this

purpose, we built four metagenomic databases—one

Available at https://github.com/jkawulok/cometa (ac-

cessed on 9th October 2017)

N 1

N 2

N n

Comparison

Query Sample

R1I1

R1I2

R1In

R1N 1

R1N 2

R1N n

R2I1

R2I2

R2In

R2N 1

R2N 2

R2N n

RqI1

RqI2

RqIn

RqN 1

RqN 2

RqN n

Intermediate analysis

I/II/

. . .

/N/U

I/II/

. . .

/N/U

I/II/

. . .

/N/U

Intermediate analysis

Complete analysis

I/II/. . . /N/U

Figure 1: The processing pipeline for metagenomic reads

classiﬁcation to the one of the created classes.

for each place. Subsequently, we compared each sam-

ple with all of them so as to make sure that every read

is properly assigned to the environment which it came

from. In the next step, the databases of each sample

BIOINFORMATICS 2018 - 9th International Conference on Bioinformatics Models, Methods and Algorithms

184

Table 1: Metagenomic data sets.

No. Project ID Site

Number Average number

of samples of reads

1 SRP016569 Bankhead National Forest (Alabama, US) 2 322 856

2 SRP005264

(Stewart et al.,

2011)

Harvard Forest (Massachusetts, US) 2 1 182 612

3A ERP004852

(Khodakova

et al., 2014)

1st Adelaide park (AU) 3 402 093

3B ERP004852

(Khodakova

et al., 2014)

2nd Adelaide park (AU) 3 330 957

were created separately. As a result, we received 10

databases. At ﬁrst, we used them also by comparing

each read against all the databases. We have achieved

the same effect as in the previous case—100% of the

reads were assigned correctly. This was expected,

given that CoMeta uses an approximate matching of

two sequences, hence for each read we received the

exact matching to that very read that was found in the

sample of origin.

In order to validate the classiﬁcation method, it

was necessary to compare the reads with the sets,

in which they were not located. This approach was

already described in Section 2.2—each read derived

from one of N = 10 samples was compared with

N − 1 = 9 databases.

Figure 2 shows how many percent of the reads

were matched to each location. If a read from the ﬁrst

sample in a given location is correctly matched to its

location, then it means that the read has been assigned

to a database built on the second (and/or third) sam-

ple (samples) from that location (it is not compared

with other reads from its sample of origin). We mea-

sure the classiﬁcation accuracy at a sample level—we

consider a sample as correctly assigned when the ma-

jority of the reads from that sample are attributed to

the correct location.

From Figure 2, it can be noticed that 7 out of 10

samples are classiﬁed correctly, which generally con-

ﬁrms that our approach is correct. It is worth not-

ing that only the samples from the 2nd Adelaide park

location were incorrectly classiﬁed—actually, all of

them were assigned to the 1st Adelaide park loca-

tion. Certainly, these samples are similar to each other

due to small distance between these locations, but it

is worth noting that the ﬁrst location contains more

samples than the second one (see Table 1), so the

classiﬁcation could be biased towards the ﬁrst loca-

tion. We will address this problem by (i) introducing

the weights according to the cardinality of a sample

and (ii) rejecting (or reducing the impact) of the reads

classiﬁed to several samples. Basically, if a read is

assigned to only one sample, than it may be consid-

ered as a more speciﬁc indicator of a certain location

than another read that is matched to several locations.

In our initial study reported here, we do not take into

account the uniqueness of the reads and we suspect

that this could be the main reason for the misclassiﬁed

samples observed for the 2nd Adelaide park location.

Overall, these results clearly indicate that it is feasi-

ble to identify the origin of a sample without the need

for identifying the microorganisms that have left their

traces in that sample.

4 CONCLUSIONS AND FUTURE

WORK

In this paper, we proposed a method for classify-

ing metagenomic reads to the reference environmen-

tal groups. The presented experimental results proved

the feasibility of our approach and it may be consid-

ered for the purpose of forensic analysis. A very im-

portant advantage of our approach lies in measuring

the sample similarity at the reads level without the

necessity to understand the contents of these samples.

We also consider a hybrid method to exploit both the

information on the organisms identiﬁed in the sam-

ples, as well as to beneﬁt from the reads-level similar-

ity. Hence, if some organisms are identiﬁed in a sam-

ple, then this can be utilized during classiﬁcation, but

information of the unknown organisms whose traces

are found in a sample, will not be lost.

Our ongoing research is aimed at improving the

environmental classiﬁcation engine, following our

observations reported earlier in Section 3. Instead of

counting the matching reads, we intend to analyse the

matches in terms of their uniqueness. Also, we plan

to improve the testing methodology and split the sam-

ples into smaller parts (so as to analyze the classiﬁca-

Environmental Metagenome Classiﬁcation for Soil-based Forensic Analysis

185

1st Adelaide park 2nd Adelaide park

(Soil metagenomes (Soil metagenomes Harvard Forest Bankhead National Forest

place-1) place-2)

sample

correctly classiﬁed as

1st Adelaide park

incorrectly classiﬁed as

1st Adelaide park

correctly classiﬁed as

Harvard Forest

correctly classiﬁed as

Bankhead National

Forest

sample

correctly classiﬁed as

1st Adelaide park

incorrectly classiﬁed as

1st Adelaide park

correctly classiﬁed as

Harvard Forest

correctly classiﬁed as

Bankhead National

Forest

sample

correctly classiﬁed as

1st Adelaide park

incorrectly classiﬁed as

1st Adelaide park

Figure 2: Results of environmental classiﬁcation.

tion scores at a lower-than-sample level).

The second important direction of our future work

is concerned with applying the proposed framework

to solve other practical challenges in medicine, en-

gineering, agriculture, and ecology. In particular,

we plan to compare the performance of our method

against the state of the art (Turnbaugh et al., 2009;

Cui and Zhang, 2013) in diagnostics. Here, the

metagenome is exploited to conﬁrm or exclude a spe-

ciﬁc disorder for a patient, whose metagenomic sam-

ple is compared against two groups of metagenomic

reads, derived from (i) positively diagnosed patients

and (ii) a control group.

ACKNOWLEDGEMENTS

This research was supported in part by PL-Grid In-

frastructure. This work was supported by the Pol-

ish National Science Centre under the project DEC-

2015/19/D/ST6/03252 .

BIOINFORMATICS 2018 - 9th International Conference on Bioinformatics Models, Methods and Algorithms

186

REFERENCES

Ames, S., Hysom, D. A., Gardner, S. N., Lloyd, G. S.,

Gokhale, M. B., and Allen, J. E. (2013). Scalable

metagenomic taxonomy classiﬁcation using a refer-

ence genome database. Bioinformatics, 29(18):2253–

2260.

Bazinet, A. L. and Cummings, M. P. (2012). A comparative

evaluation of sequence classiﬁcation programs. BMC

Bioinformatics, 13(1):1–13.

Chen, J. Y. and Lonardi, S. (2009). Biological data mining.

CRC Press.

Cui, H. and Zhang, X. (2013). Alignment-free supervised

classiﬁcation of metagenomes by recursive SVM.

BMC Genomics, 14(1).

Gerlach, W. and Stoye, J. (2011). Taxonomic classiﬁcation

of metagenomic shotgun sequences with CARMA3.

Nucleic Acids Research, 39(14):E91–E101.

Handelsman, J. (2004). Metagenomics: application of ge-

nomics to uncultured microorganisms. Microbiol Mol

Biol Rev., 68(4).

Huson, D. H., Auch, A. F., Qi, J., and Schuster, S. C. (2007).

MEGAN analysis of metagenomic data. Genome Re-

search, 17(3):377–386.

Kawulok, J. and Deorowicz, S. (2015). CoMeta: Clas-

sication of metagenomes using k-mers. PLoS ONE,

10(4):e0121453.

Kennedy, J., O’Leary, N., Kiran, G., Morrissey, J., O’Gara,

F., Selvin, J., and Dobson, A. (2011). Functional

metagenomic strategies for the discovery of novel en-

zymes and biosurfactants with biotechnological appli-

cations from marine ecosystems. Journal of Applied

Microbiology, 111(4):787–799.

Khodakova, A. S., Smith, R. J., Burgoyne, L., Abarno, D.,

and Linacre, A. (2014). Random whole metagenomic

sequencing for forensic discrimination of soils. PloS

one, 9(8):e104996.

Patil, K. R., Haider, P., Pope, P. B., Turnbaugh, P. J., Mor-

rison, M., Scheffer, T., and McHardy, A. C. (2011).

Taxonomic metagenome sequence assignment with

structured output models. Nature Methods, 8(3):191–

192.

Rosen, G. L., Reichenberger, E. R., and Rosenfeld, A. M.

(2011). NBC: The naive Bayes classiﬁcation tool

webserver for taxonomic classiﬁcation of metage-

nomic reads. Bioinformatics, 27(1):127–129.

Simon, C. and Daniel, R. (2011). Metagenomic Analy-

ses: Past and Future Trends. Appl Environ Microbiol,

77(4):1153–1161.

Stewart, F. J., Sharma, A. K., Bryant, J. A., Eppley,

J. M., and DeLong, E. F. (2011). Community tran-

scriptomics reveals universal patterns of protein se-

quence conservation in natural microbial communi-

ties. Genome biology, 12(3):R26.

Turnbaugh, P. J., Hamady, M., Yatsunenko, T., Cantarel,

B. L., Duncan, A., Ley, R. E., Sogin, M. L., Jones,

W. J., Roe, B. A., Affourtit, J. P., Egholm, M., Henris-

sat, B., Heath, A. C., Knight, R., and Gordon, J. I.

(2009). A core gut microbiome in obese and lean

twins. Nature, 457(7228):480–484.

Weitschek, E., Fiscon, G., Fustaino, V., Felici, G., and

Bertolazzi, P. (2015). Clustering and classiﬁcation

techniques for gene expression proﬁle pattern analy-

sis. Pattern Recognition in Computational Molecular

Biology: Techniques and Approaches, page 347.

Weitschek, E., Santoni, D., Fiscon, G., De Cola, M. C.,

Bertolazzi, P., and Felici, G. (2014). Next generation

sequencing reads comparison with an alignment-free

distance. BMC research notes, 7(1):869.

Wood, D. E. and Salzberg, S. L. (2014). Kraken: Ultra-

fast metagenomic sequence classiﬁcation using exact

alignments. Genome Biology, 15(3):R46.

Environmental Metagenome Classiﬁcation for Soil-based Forensic Analysis

187