FUZZY CONCEPT LATTICE-BASED APPROACH FOR REACTIVE
MOTIFS DISCOVERY
Thanapat Kangkachit and Kitsana Waiyamai
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand
Keywords:
Complete substitution group, Fuzzy concept lattice, Reactive motifs, Enzyme function classification, Binding
and catalytic site, Amino acid sustitution matrix, Biochemical knowledge.
Abstract:
Reactive motifs are short conserved regions discovered from binding and catalytic sites of enzymes sequences.
Thus, reactive motifs provide more biological meaning than statistic-based motifs because they are directly
extracted from where the chemical reaction mechanism occurs. Main problem of discovering reactive motifs
is that only 4.94% enzymes sequences contain sites information. To overcome this problem, we present fuzzy
concept lattice-based (FCL-based) method for discovering more general reactive motifs by incorporating bio-
chemical knowledge. Fuzzy concept lattices are used to represent both binary and multi-value biochemical
knowledge. The fuzzy concept lattice Join operator is applied to determine complete substitution groups that
obtains more general reactive motifs. Experiments are conducted among dierent methods of determining
complete substitution groups: FCL-based, concecpt lattice-based (CL-based) and similarity-based method.
Experimental results show that FCL-based method significantly outperforms other methods in term of cov-
erage value and F-measure with SVM learning algorithm. Therefore, fuzzy concept lattice provides more
ecient computational support for complete substitution groups operation than that of other existing methods.
1 INTRODUCTION
Reactive motifs are short conserved regions discov-
ered from binding and catalytic sites of enzymes
sequences. Compared with statistic-based motifs
(Sander and Schneider, 1991; Eidhammer et al., 1999;
Huang and Brutlag, 2001; Bennett et al., 2003), en-
zyme function classification model using reactive mo-
tifs gives the better accuracy with explanation in
terms of reactive motifs combination. Compared with
expert-based motifs i.e. PROSITE (Bairoch, 1993),
the performance of classification model using reactive
motifs is more ecient in term of accuracy due to a
small number of occurrences in protein sequences of
those expert-based motifs. Main problem in discover-
ing reactive motifs is the lack of binding and catalytic
sites information, only 4.94% of enzymes sequences
contain binding and catalytic sites information in the
UNiProtKB/Swiss-Prot Version 9.2 (Bairoch and Ap-
weiler, 2000). (Waiyamai et al., 2008) have pro-
posed a concept lattice-based reactive motifs discov-
ery method called CL-based. They introduced the
concept of mutation control determine a complete
amino acid substitution group for each position in the
sequences, such that the substitution group contains
all possible amino acids that can be substituted. Con-
cept lattice operators have been defined to support
mutation control operations. The proposed technique
yields good results (70 % accuracy of enzyme func-
tion classification) and can overcome problems such
as lack of information about binding and catalytic
sites. However, only binary-value context is sup-
ported, conversion method is needed to support both
binary and multi-value knowledge. Moreover, the oc-
currences of reactive motifs need to be improved to
obtain better accuracy of enzyme function classifica-
tion model.
This paper presents FCL-based approach for reac-
tive motifs discovery. Main objective is to further in-
crease the occurrences of reactive motifs in enzymes
sequences dataset and to improve their quality by in-
corporating various types of biochemical knowledge.
Both binary and multi-value biochemical knowledge
is formally constructed in a unique structure fuzzy
concept lattice that overcomes the problem of infor-
mation loss while converting multi-value knowledge
into binary-value context. Fuzzy concept lattices are
then used to determine complete substitution groups
in reactive motifs. As result, more general reactive
motifs are generated. We also show that FCL-based
approach provides ecient computational support for
complete substitution groups determination in reac-
326
Kangkachit T. and Waiyamai K..
FUZZY CONCEPT LATTICE-BASED APPROACH FOR REACTIVE MOTIFS DISCOVERY.
DOI: 10.5220/0003787503260330
In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2012), pages 326-330
ISBN: 978-989-8425-90-4
Copyright
c
2012 SCITEPRESS (Science and Technology Publications, Lda.)
tive motifs discovery. The quality of FCL-based re-
active motifs can be expressed through the high per-
formance of enzyme function classification model us-
ing them as features to learning algorithm i.e. C4.5
(Quinlan, 1993) and SVM (Boser et al., 1992; Cris-
tianini and Shawe-Taylor, 2010).
The experiments are conducted among dierent
complete substitution groups determination methods:
FCL-based, CL-based and similarity-based methods.
The performance of reactive motifs is measured using
the coverage value (occurrences of motifs in enzyme
sequences dataset) and F-measure with SVM and
C4.5. The experimental results show that FCL-based
method gives better performance than other methods
in terms of coverage value, F-measure, Precision and
Recall. In the following, we describe the process of
reactive motifs discovery using fuzzy-concept lattice.
Experimental results and conclusion are presented in
section 3 and section 4 consequently.
2 FUZZY CONCEPT
LATTICE-BASED APPROACH
FOR REACTIVE MOTIFS
DISCOVERY
In this section, the overall process of reactive motifs
discovery is presented. It is composed of three steps:
data preparation and block scan filtering, complete
substitution groups determination, and reactive site
group definition as the same framework as introduced
in (Waiyamai et al., 2008).
2.1 Data Preparation and Block Scan
Filtering
In this step, we collect enzyme sequences dataset
(Bairoch, 1993; Apweiler et al., 2004), covers 22,637
sequences of 237 functions, while only 4.94% of en-
zyme sequences contain binding or catalytic sites.
By considering sites position as center, enzyme sub-
sequences having 15 amino acids length; initial sites
patterns are extracted. This is based on the 21-atoms
average substrate size in BRENDA database (Schom-
burg et al., 2004) which corresponds to 7 amino acids
(3 atoms per 1 amino acid) on each side of the sites
position.
To solve the problem of lack information at bind-
ing and catalytic sites, the generalization is performed
at each position of the initial sites patterns using a
block scan filtering. Then, a scan operation is per-
formed to retrieve all the sub-sequences having the
same site description to generate sequences blocks.
Figure 1: Enzyme sub-sequence filtering to obtain high
quality block.
Figure 2: A block pattern obtained from the MSA operation
on a quality block.
Sequences are ranked according to amino-acid sim-
ilarity scores using BLOSUM62 table (Heniko and
Heniko, 1992). In Figure 1, to obtain high quality
block, each sub-sequence in the block must has at
least 3 positions presenting the same type of amino
acids as proposed in (Smith et al., 1990) for the high
quality motifs.
To discard some positions that may not involve in
protein functional mechanism, a multiple sequences
alignment (MSA (Ramu et al., 2003)) is performed
to all sub-sequences in each high quality block. Fi-
nally, a block pattern is extracted as a represen-
tative block in the form of regular expression e.g.
x[FMV]x[GS]C[DQN][ST]CHxxxxx, where x be
any amino acid, [ ] be a set of possible amino acids at
a given position, called substitution group, as shown
in Figure 2.
Due to the lack sites information, there will pos-
sibly be other amino acids in some substitution group
positions of a block pattern. Thus, determining com-
plete substitution groups to generate more general and
high quality reactive motifs from a set of block pat-
terns is required.
2.2 Complete Substitution Groups
Determination
In the following, we explain how the complete sub-
stitution groups are determined using FCL-based
method. First, both binary and multi-value biochem-
ical knowledge from various sources can be formally
constructed in a unique fuzzy concept lattice struc-
ture (Yahia and Jaoua, 2001; S., 2003; Elloumi et al.,
2004). The values in each properties of biochemi-
FUZZY CONCEPT LATTICE-BASED APPROACH FOR REACTIVE MOTIFS DISCOVERY
327
cal knowledge are normalized to [0,1] and then for-
mulated as an amino acid properties context to con-
struct fuzzy concept lattice. Due to the fact that bio-
chemical knowledge can contain a large number of
amino acid properties, FCL-based uses (Nakai et al.,
1988) approach to select proper properties. A hierar-
chical cluster analysis is used to investigate the rela-
tionships among amino acid properties (circles with
index) and represent them using a minimum spanning
tree as shown in Figure 3(a). Figure 3(b) shows the
selection of properties in largest cluster (region with
dots) to be input to the amino acid properties context.
Once constructed from the amino acid properties con-
text, the fuzzy concept lattice structure is able to ex-
press a hierarchy of concepts which represent amino-
acid substitution groups sharing the common proper-
ties. Then, complete substitution groups operations
can be defined based on the fuzzy concept lattice Join
operator. More details of fuzzy concept lattice can be
found in (Yahia and Jaoua, 2001; S., 2003; Elloumi
et al., 2004).
Here, we show how FCL-based method works
to determine complete substitution groups. In Fig-
ure 2, the substitution group [FMV] at 2
nd
position
of a block pattern is used as an example. Accord-
ing to amino acid properties context in Table 3(b),
the common properties of this group can be deter-
mined by the minimum value of all properties i.e.
(#383
0.52
, #96
0.60
, #159
0.15
, . . . , #27
0
). Then, a set of
amino acids {F,M,V,I,L} is extracted as a complete
substitution group with respect to their common prop-
erties of this substitution group. Finally, we obtain
pattern x[FMVIL]x[GS]C[DQN][ST]CHxxxxx as a
set of complete reactive motifs.
2.3 Reactive Sites Groups Generation
Enzymes can have dierent catalytic and binding
structures to fit and function the same substrate. Thus,
complete reactive motifs can be grouped together
according to their sites description as specified in
UNiProtKB/Swiss-Prot database. As result, 291 re-
active sites groups which correspond to the reactive
motifs groups are generated and used as input features
to build the enzyme functions classification model.
3 EXPERIMENTAL RESULTS
In the following, the results of dierent types of re-
active motifs: FCL-based, baseline (without use of
background knowledge), CL-based, similarity-based,
are compared using a dataset containing 22,637 pro-
tein sequences with 237 enzyme functions collected
from UNiProtKB/Swiss-Prot Version 9.2. The dif-
ferent types of reactive motifs are explained in Sec-
tion 3.1. The performance comparison is conducted
in terms of the reactive motif generalization and the
prediction model accuracy.
3.1 Dierent Types of Reactive Motifs
Here, implementation details of the dierent types of
reactive motifs are given.
Baseline method discovers reactive motifs with-
out use of biochemical knowledge. The propose
of this method is to compare how ecient the
other methods gain in using proper background
knowledge.
CL-based method integrates binary-value bio-
chemical knowledge via a concept lattice, and per-
forms mutation control operations to determine
complete substitution groups to generate reactive
motifs. More details of CL-based can be found in
(Waiyamai et al., 2008).
CL-based* method extends CL-based method by
performing a multiple sequence alignment oper-
ation on quality blocks (refer to step 2.1) before
determining complete substitution groups.
Similarity-based method uses amino acid substi-
tution matrices instead of biochemical properties
knowledge. For each substitution group, the min-
imum similarity score obtained by any pairs of
amino acids in such group is computed. The other
amino acids are determined as complete substitu-
tion group if the similarity scores between them
and any amino acid in the substitution group are
higher than or equal to minimum similarity score.
For example, the minimum similarity score of
substitution group [MFV] is -1 derived from sim-
ilarity score between F and V using BLOSUM62.
Then, we can extract [MFVHILWYARCQKST]
as a complete substitution group.
3.2 Evaluation of Reactive Motifs
Discovery Methods
The performance of reactive motifs is evaluated in
terms of their occurrences in enzymes sequences
dataset through coverage value and the quality of pre-
diction model using them as input features measured
by using F-Measure, Precision and Recall. Table1
comparesperformance of the dierent methods for re-
active motifs discovery.
In term of coverage value, FCL-Based and
similarity-based methods produce significantly higher
value compared to no use of knowledge. It means that
BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms
328
(a) (b)
Figure 3: AAIndex: physicochemical properties (a) Representation of 3 clusters (region with dots) using minimum spanning
tree (b) Amino acid-properties context derived from the largest cluster in (a).
Table 1: Coverage value, F-measure, recall and precision comparison among dierent types of reactive motifs.
F-Measure
4
Recall
2
Precision
3
Reactive motifs types Coverage(%) SVM C4.5 SVM C4.5 SVM C4.5
Similarity-based method
- BLOSUM62 94.55 0.699 0.662 0.709 0.672 0.745 0.675
- BLOSUM80 93.80 0.691 0.666 0.700 0.675 0.742 0.686
- PAM30 95.43 0.705 0.661 0.683 0.677 0.720 0.687
- PAM70 94.33 0.689 0.654 0.698 0.664 0.735 0.668
- PAM250 94.35 0.699 0.667 0.708 0.676 0.743 0.684
Mutation control-based methods
- FCL-Based
1
98.02 0.746 0.635 0.751 0.646 0.766 0.646
- CL-Based* 91.81 0.672 0.660 0.684 0.676 0.722 0.691
- CL-Based 64.84 0.590 0.586 0.549 0.545 0.615 0.609
Baseline method 91.91 0.662 0.660 0.675 0.670 0.713 0.684
1
Using integrated biochemical knowledge
2
Recall = TP / (TP + FN)
3
Precision = TP / (TP + FP)
4
F-Measure = 2*Precision*Recall / (Precision + Recall);
both methods eciently utilize biochemical knowl-
edge in order to produce more general reactive motifs.
Moreover, FCL-based method provide best computa-
tional support for generating general reactive motifs
due to its highest coverage value.
In term of prediction model accuracy, FCL-based
method gives highest precision, recall and F-measure
with SVM. It means that FCL-based method also pro-
duces highest true positive(TP) value but lowest false
positive(FP) and false negative(FN) values. In con-
trast, the other methods aim to produce high precision
but low recall that aected F-Measure i.e. high FP and
FN values. However, there is no significantly dier-
ent of all measures obtained by FCL-based method
and other methods with C4.5.
In summary, with highest coverage, F-Measure,
precision and recall values, FCL-based method pro-
vides more general reactive motifs while retains accu-
racy of the prediction model by reducing the eect of
FP and FN values.
4 CONCLUSIONS AND
DISCUSSION
Main problem of discovering reactive motifs is that
only 4.94% enzymes sequences contain sites infor-
mation. To overcome this problem, we present fuzzy
concept lattice-based (FCL-based) method for discov-
ering more general reactive motifs by incorporating
biochemical knowledge. Fuzzy concept lattices are
used for both representing binary and multi-value bio-
chemical knowledge, and determining complete sub-
FUZZY CONCEPT LATTICE-BASED APPROACH FOR REACTIVE MOTIFS DISCOVERY
329
stitution groups that produces more general reactive
motifs. Used as input features of SVM, generated
FCL-based reactive motifs provide highest coverage,
F-measure, precision and recall values without aect-
ing the FP and FN values in the prediction model.
Rule-based learning methods can be investigated
to provide meaningful and understandable informa-
tion to biologists. Among rule-based methods, asso-
ciative classification technique (Liu et al., 1998; Li
et al., 2001; Belohlvek et al., 2007) is recognized to
be more accurate over traditional classification tech-
niques i.e. C4.5 for very large number of classes to
predicted. In the future work, we will increase both
accuracy and explanatory ability of the protein func-
tion classification model using reactive motifs as in-
put features to an associative classification.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the Thailand Re-
search Fund (TRF) and Kasetsart University for finan-
cial support through the Royal Golden Jubilee Ph.D.
scholarship program (1.0.KU/49/A.1).
REFERENCES
Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C.,
Boeckmann, B., Ferro, S., Gasteiger, E., Huang,
H., Lopez, R., Magrane, M., Martin, M. J., Natale,
D. A., O’Donovan, C., Redaschi, N., and Yeh, L.-
S. L. (2004). Uniprot: the universal protein knowl-
edgebase. Nucleic Acids Research, 32(Database-
Issue):115–119.
Bairoch, A. (1993). The prosite dictionary of sites and pat-
terns in proteins, its current status. Nucleic Acids Re-
search, 21(13):3097–3103.
Bairoch, A. and Apweiler, R. (2000). The swiss-prot protein
sequence database and its supplement trembl in 2000.
Nucleic Acids Res, 27:49–54.
Belohlvek, R., Baets, B. D., Outrata, J., and Vychodil, V.
(2007). Inducing decision trees via concept lattices.
In CLA, volume 331 of CEUR Workshop Proceedings.
CEUR-WS.org.
Bennett, S. P., Lu, L., and Brutlag, D. L. (2003). 3matrix
and 3motif: a protein structure visualization system
for conserved sequence motifs. Nucleic Acids Re-
search, 31(13):3328–3332.
Boser, B. E., Guyon, I., and Vapnik, V. (1992). A training
algorithm for optimal margin classifiers. In COLT,
pages 144–152.
Cristianini, N. and Shawe-Taylor, J. (2010). An Introduction
to Support Vector Machines and Other Kernel-based
Learning Methods. Cambridge University Press.
Eidhammer, I., Jonassen, I., and Taylor, W. R. (1999).
Structure comparison and structure patterns. JOUR-
NAL OF COMPUTATIONAL BIOLOGY, 7:685–716.
Elloumi, S., Youssef, C. B., and Yahia, S. B. (2004). The
fuzzy classifier by concept localization in a lattice of
concepts. In Proceedings of the CLA 2004 Interna-
tional Workshop on Concept Lattices and their Appli-
cations (CLA).
Heniko, S. and Heniko, J. G. (1992). Amino acid sub-
stitution matrices from protein blocks. Proceedings
of the National Academy of Sciences, 89(22):10915–
10919.
Huang, J. Y. and Brutlag, D. L. (2001). The emotif database.
Nucleic Acids Research, 29(1):202–204.
Li, W., Han, J., and Pei, J. (2001). Cmar: Accurate and e-
cient classification based on multiple class-association
rules. In Proceedings of the 2001 IEEE International
Conference on Data Mining (ICDM), pages 369–376.
Liu, B., Hsu, W., and Ma, Y. (1998). Integrating classi-
fication and association rule mining. In Knowledge
Discovery and Data Mining, pages 80–86.
Nakai, K., Kidera, A., and Kanehisa, M. (1988). Cluster
analysis of amino acid indices for prediction of protein
structure and function. Protein Eng, 2(2):93–100.
Quinlan, J. R. (1993). C4.5: programs for machine learn-
ing. Morgan Kaufmann Publishers Inc., San Fran-
cisco, CA, USA.
Ramu, C., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J.,
Higgins, D. G., and Thompson, J. D. (2003). Multi-
ple sequence alignment with the clustal series of pro-
grams. Nucleic Acids Research, 31(13):3497–3500.
S., K. (2003). Cluster based ecient generation of fuzzy
concepts. In Neural Network World, pages 521–530.
Sander, C. and Schneider, R. (1991). Database of
homology-derived protein structures and the structural
meaning of sequence alignment. Proteins: Structure,
Function, and Genetics, 9(1):56–68.
Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt,
C., Huhn, G., and Schomburg, D. (2004). BRENDA,
the enzyme database: updates and major new de-
velopments. Nucleic acids research, 32(Database
issue):D431–433.
Smith, O., T., A. M., and S., C. (1990). Finding se-
quence motifs in groups of functionally related pro-
teins. Proceedings of the National Academy of Sci-
ences, 87(2):826–830.
Waiyamai, K., Liewlom, P., Kangkachit, T., and Rakthan-
manon, T. (2008). Concept lattice-based mutation
control for reactive motifs discovery. In PAKDD,
pages 767–776.
Yahia, S. B. and Jaoua, A. (2001). Discovering knowledge
from fuzzy concept lattice, pages 167–190. Physica-
Verlag GmbH, Heidelberg, Germany, Germany.
BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms
330