Correlated Mutations of Positions Among Structural Proteins in Delta
and Omicron Variants for SARS-CoV-2 Amino Acid Sequences
Yuichi Shimaya and Kouich Hirata
Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan
Keywords:
Correlated Mutation, Structural Proteins, Spike Protein Substitution, Amino Acid Sequences, SARS-CoV-2,
Delta Variant, Omicron Variant.
Abstract:
In this paper, we find the correlated mutations of positions among the structural proteins, namely the
spike, envelope, membrane and nucleocapsid proteins, in amino acid sequences of SARS-CoV-2. Here, we adopt
the algorithm designed by Shimada et al. (2012) for finding correlated mutations formulated by joint entropy.
In particular, we discuss whether or not the found correlated mutations contain spike protein substitutions
in the SARS-CoV-2 Delta and Omicron variants.
1 INTRODUCTION
Coronavirus disease 2019 (COVID-19) is caused by a novel coronavirus designated as severe acute
respiratory syndrome coronavirus 2 (SARS-CoV-2). SARS-CoV-2 is an enveloped virus with a
positive-sense, single-stranded RNA genome. Figure 1 illustrates the schematic presentation of the
SARS-CoV-2 genome organization, the canonical subgenomic mRNAs and the virion structure (Kim et al., 2020).
Coronaviruses (CoVs) were so named because they are characterized by spike protein projections on the
surface of the viral particles, and their shape resembles a crown (corona) under electron microscopy,
as shown in the virion structure in Figure 1.
The CoVs carry the largest genomes among all RNA virus families. The genomic RNA is translated to
produce nonstructural proteins (nsps) from two open reading frames (ORFs), ORF1a and ORF1b.
ORF1a and ORF1b encode 11 and 5 nonstructural proteins, nsp1 to nsp11 and nsp12 to nsp16, respectively.
Whereas ORF1a is translated directly from the genomic RNA, the expression of ORF1b requires a -1
ribosomal frameshift near the end of ORF1a, resulting in a single ORF1ab polypeptide.
Downstream from ORF1ab, there exist ORFs encoding a few or more than 10 structural/nonstructural
proteins. The common structural proteins of CoVs are the spike (S), envelope (E), membrane (M) and
nucleocapsid (N) proteins. SARS-CoV-2 is also known to have at least 6 accessory proteins, ORF3a,
ORF6, ORF7a, ORF7b, ORF8 and ORF10. However, these ORFs have not yet been experimentally verified
for expression. Therefore, it is currently unclear which accessory genes are actually expressed from
this compact genome (Kim et al., 2020).
On the other hand, since the correlated mutations of positions in amino acid sequences are frequently
observed among spatially close residues and are valuable for analyzing the structure of proteins, many
researchers (cf. (Jeong and Kim, 2010; Lee and Kim, 2009; Oliveria et al., 2002)) have developed
methods to find them. In their works, the correlated mutations are formulated by using an entropy
(Oliveria et al., 2002), a CM-score (Lee and Kim, 2009) or log-odds scores (Jeong and Kim, 2010).
In this paper, we adopt the correlated mutations of positions based on a joint entropy ratio introduced
by Shimada et al. (Shimada et al., 2012). Here, the joint entropy ratio of positions is the ratio of the
minimum entropy over the individual positions to their joint entropy (cf., (Cover and Thomas, 2006)).
Then, we say that positions are correlated (Shimada et al., 2012) if their joint entropy ratio is at least
a given joint entropy threshold τ (0 < τ ≤ 1). Furthermore, based on these correlated mutations, they
designed the algorithm FINDCM¹ for finding all of the correlated mutations with the joint entropy
threshold τ and an exclusion threshold ρ (ρ > 0) to exclude the positions in which elements do not
change well.
Recently, Yonashiro et al. (Yonashiro et al., 2022) have applied the algorithm FINDCM to find the
correlated mutations of positions for the structural S, E, M and N proteins from amino acid sequences
of SARS-CoV-2, focusing on the positions observed as spike protein substitutions for the SARS-CoV-2
Delta variant reported by CDC² and NIID³.

¹ The algorithm FINDCM in this paper coincides with the algorithm named FINDCM2 in (Shimada et al., 2012).

Shimaya, Y. and Hirata, K.
Correlated Mutations of Positions Among Structural Proteins in Delta and Omicron Variants for SARS-CoV-2 Amino Acid Sequences.
DOI: 10.5220/0011676000003411
In Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2023), pages 344-349
ISBN: 978-989-758-626-2; ISSN: 2184-4313
Copyright © 2023 by SCITEPRESS Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

Figure 1: Schematic presentation of the SARS-CoV-2 genome organization, the canonical subgenomic mRNAs and the virion
structure (Kim et al., 2020).
Here, the SARS-CoV-2 Delta and Omicron variants were classified by CDC as Variants of Concern (VOCs)
from June 2021 to April 2022 and from November 2021 to the time of writing, respectively. A VOC is a
variant for which there is evidence of an increase in transmissibility, more severe disease, a
significant reduction in neutralization by antibodies generated during previous infection or
vaccination, reduced effectiveness of treatments or vaccines, or diagnostic detection failures².
These variants are characterized by their spike protein substitutions. Table 1 lists the spike protein
substitutions for the SARS-CoV-2 Delta and Omicron variants. Here, $a_1 n a_2$ denotes that an amino
acid $a_1$ at position $n$ is substituted by another amino acid $a_2$. Also, “(*)” denotes a
substitution detected in some sequences but not all. Furthermore, “del211” denotes that the amino acid
at position 211 is deleted, “del69-70” denotes that the amino acids from position 69 to 70 are deleted,
and “ins214EPE” denotes that the amino acids EPE are inserted between positions 214 and 215.
Hence, in this paper, after finding the correlated mutations obtained by the algorithm FINDCM under
τ and ρ, we discuss whether or not the found correlated mutations among all of the structural proteins
contain spike protein substitutions not only in the SARS-CoV-2 Delta variant but also in the
SARS-CoV-2 Omicron variant. Note that the results for the Delta variant are updated from the previous
work (Yonashiro et al., 2022) by using newly collected amino acid sequences.
² CDC, Centers for Disease Control and Prevention. https://www.cdc.gov.
³ NIID, National Institute of Infectious Diseases. https://www.niid.go.jp/nidd/ja.
Table 1: The spike protein substitutions for SARS-CoV-2
Delta and Omicron variants.
Delta variant
T19R (V70F*) T95I G142D del156
del157 R158G (A222V*) (W258L*) (K417N*)
L452R T478K D614G P681R D950N
Omicron variant
A67V del69-70 T95I del142-144 Y145D
del211 L212I ins214EPE G339D S371L
S373P S375F K417N N440K G446S
S477N T478K E484A Q493R G496S
Q498R N501Y Y505H T547K D614G
H655Y N679K P681H N764K D796Y
N856K Q954H N969K L981F
2 FINDING CORRELATED MUTATION
In this section, we introduce the algorithm FINDCM designed by Shimada et al. (2012).
Let $\Sigma$ be an alphabet and $\Sigma^n$ the set of all strings on $\Sigma$ with length $n$. Also let
$[n]$ be $\{1, \ldots, n\}$. For a string $w \in \Sigma^n$ and a set $I = \{i_1, \ldots, i_k\} \subseteq [n]$
($1 \leq i_1 < \cdots < i_k \leq n$), we denote by $w[I]$ the string $w[i_1] \cdots w[i_k]$ constructed by
concatenating the symbols at positions $i_1$ to $i_k$. Furthermore, let $P(w[I] = v)$ be the
probability that $w[I]$ is $v \in \Sigma^k$.
Definition 1. Let $S \subseteq \Sigma^n$ and $I \subseteq [n]$. Then, we define the joint entropy
$H_S(I)$ of $S$ at a set $I$ of positions (cf., (Cover and Thomas, 2006)) as:
$$H_S(I) = -\sum_{v \in \Sigma^{|I|},\, w \in S} P(w[I] = v) \times \log P(w[I] = v).$$
An entropy of $S$ at a position $i \in [n]$ coincides with $H_S(\{i\})$ (cf., (Cover and Thomas, 2006)),
which we denote by $H_S(i)$. Then, we define the joint entropy ratio $R_S(I)$ of $S$ at a set $I$ of
positions as:
$$R_S(I) = \frac{\min\{H_S(i) \mid i \in I\}}{H_S(I)}.$$
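As an illustration of Definition 1 and of the joint entropy ratio, the following is a minimal Python sketch. It assumes that $P(w[I] = v)$ is the empirical frequency of $v$ among the strings in $S$, that logarithms are taken to base 2, and that positions are 0-based. The function names `joint_entropy` and `entropy_ratio`, and the convention of returning 0 when $H_S(I) = 0$, are choices made here for illustration and are not taken from (Shimada et al., 2012).

```python
from collections import Counter
from math import log2


def joint_entropy(seqs, positions):
    """Empirical joint entropy (in bits) of the columns `positions` (0-based)."""
    total = len(seqs)
    # Count how often each combination of symbols occurs at the chosen positions.
    counts = Counter(tuple(s[i] for i in positions) for s in seqs)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def entropy_ratio(seqs, positions):
    """Minimum single-position entropy divided by the joint entropy."""
    joint = joint_entropy(seqs, positions)
    if joint == 0.0:
        # Assumption: the ratio is undefined for constant columns; report 0.
        return 0.0
    return min(joint_entropy(seqs, [i]) for i in positions) / joint


if __name__ == "__main__":
    toy = ["AC", "AC", "GT", "GC"]  # hypothetical aligned sequences
    print(joint_entropy(toy, [0]), joint_entropy(toy, [0, 1]))
    print(entropy_ratio(toy, [0, 1]))
```

Because the ratio divides one entropy by another, the choice of logarithm base does not affect its value.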
For the joint entropy, the following lemma holds.
Lemma 1. (Shimada et al., 2012) Let $S \subseteq \Sigma^n$ and $I, J \subseteq [n]$. If
$I \subseteq J$, then $H_S(I) \leq H_S(J)$.
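One consequence of Lemma 1, spelled out here as a short derivation that is not part of the original text and assumes $H_S(I) > 0$, is that the joint entropy ratio cannot increase when positions are added:

```latex
I \subseteq J \;\Longrightarrow\;
R_S(J) = \frac{\min\{H_S(j) \mid j \in J\}}{H_S(J)}
\;\le\; \frac{\min\{H_S(i) \mid i \in I\}}{H_S(J)}
\;\le\; \frac{\min\{H_S(i) \mid i \in I\}}{H_S(I)} = R_S(I).
```

The first inequality holds because a minimum taken over a larger set cannot be larger, and the second follows from $H_S(I) \le H_S(J)$. This is why an enumeration may stop extending a set of positions as soon as its ratio falls below the threshold.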
Then, we formulate the correlated mutation by introducing the joint entropy threshold as the main
topic in this paper.
Definition 2. (Shimada et al., 2012) Let $S \subseteq \Sigma^n$ and $I \subseteq [n]$. Also let
$0 < \tau \leq 1$ be a joint entropy threshold. Then, we say that $I$ is correlated if
$R_S(I) \geq \tau$. Furthermore, we call the mutations at $I$ in $S$ correlated mutations of $S$.
When τ is set to 1, that is, $R_S(I) = 1$ is required, the correlated mutations are exact. On the other
hand, when τ is set to less than 1, so that $R_S(I) < 1$ is allowed, the correlated mutations contain
exceptions. Furthermore, we use another threshold ρ ≥ 0, called an exclusion threshold, to exclude the
positions in which elements do not change well.
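As a small worked example (the four strings are hypothetical, and logarithms are taken to base 2, which is an assumption that does not affect the ratio), let $S = \{\mathrm{AC}, \mathrm{AC}, \mathrm{GT}, \mathrm{GC}\}$ and $I = \{1, 2\}$:

```latex
H_S(1) = 1, \qquad
H_S(2) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811, \qquad
H_S(I) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5,
\\[4pt]
R_S(I) = \frac{\min\{1,\ 0.811\}}{1.5} \approx 0.541 .
```

Hence $I$ is correlated for τ = 0.5 but not for τ = 0.8. If the last string GC is replaced by GT, the two columns determine each other, so $H_S(1) = H_S(2) = H_S(I) = 1$ and $R_S(I) = 1$, which is the exact case.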
The algorithm FINDCM (Shimada et al., 2012) in
Algorithm 1, which is based on the set enumeration
algorithm (Rymon, 1992), describes the algorithm of
finding all of the correlated mutations with an entropy
pruning at lines from 5 to 7. Here, the correctness of
the entropy pruning follows from the following theo-
rem.
Theorem 1. (Shimada et al., 2012) Let $S \subseteq \Sigma^n$ and $I \subseteq [n]$. If
$R_S(I) \geq \tau$, then it holds that $\tau H_S(i) \leq H_S(j) \leq H_S(i)/\tau$ for every $i, j \in I$.
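The enumeration with entropy pruning described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes empirical frequencies, base-2 logarithms and 0-based positions, and it simplifies the handling of ρ by dropping every position whose entropy does not exceed ρ before the enumeration starts. The names `find_cm`, `expand`, `entropy` and `ratio` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(seqs, positions):
    """Empirical joint entropy (bits) of the given 0-based positions."""
    n = len(seqs)
    counts = Counter(tuple(s[i] for i in positions) for s in seqs)
    return -sum(c / n * log2(c / n) for c in counts.values())


def ratio(seqs, positions, h1):
    """Joint entropy ratio; h1 caches the single-position entropies."""
    joint = entropy(seqs, positions)
    return min(h1[i] for i in positions) / joint if joint > 0 else 0.0


def expand(seqs, current, candidates, tau, h1, out):
    """Try to grow `current` by each candidate, recursing only on success."""
    for k, i in enumerate(candidates):
        grown = current + [i]
        if ratio(seqs, grown, h1) >= tau:
            out.append(grown)
            # Only candidates after i are passed on, so each set is visited once.
            expand(seqs, grown, candidates[k + 1:], tau, h1, out)


def find_cm(seqs, tau, rho):
    """Enumerate all position sets (size >= 2) whose ratio is at least tau."""
    n = len(seqs[0])
    h1 = [entropy(seqs, [i]) for i in range(n)]
    # Assumption: positions with entropy <= rho are excluded up front.
    active = [i for i in range(n) if h1[i] > rho]
    out = []
    for idx, i in enumerate(active):
        # Entropy pruning: keep j only inside the window allowed by Theorem 1.
        window = [j for j in active[idx + 1:]
                  if tau * h1[i] <= h1[j] <= h1[i] / tau]
        expand(seqs, [i], window, tau, h1, out)
    return out


if __name__ == "__main__":
    toy = ["ACA", "ACA", "GTA", "GTA"]  # hypothetical aligned sequences
    print(find_cm(toy, tau=1.0, rho=0.0))
```

Recursing only when the ratio test succeeds is safe because adding positions can never increase the joint entropy ratio, which follows from Lemma 1.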
3 EXPERIMENTAL RESULTS
In this section, by focusing on the structural proteins,
that is, spike (S), envelope (E), membrane (M) and
nucleocapsid (N) proteins, we give experimental re-
sults of finding correlated mutations among them.
We use amino acid sequences of the four structural proteins for the Delta variant and the Omicron
variant provided by NCBI⁴, where the lengths of the amino acid sequences for the S, E, M and N proteins
are 1,273, 75, 222 and 419, respectively (Nakagawa and Miyazaki, 2020). The amino acid sequences of the
Delta variant were collected from April 1, 2021 to May 31, 2022, and the number of sequences is 834,870.
On the other hand, the amino acid sequences of the Omicron variant were collected from October 1, 2021
to July 29, 2022, and the number of sequences is 393,655.
We use the 20 amino acid characters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y, so the
set of these characters is regarded as Σ. Also, we insert a gap symbol ‘-’ at each position where an
amino acid is deleted. Furthermore, since a character ‘X’ is inserted where an amino acid is unknown in
the amino acid sequences provided by NCBI⁴, we replace each ‘X’ at a position with the most frequent
amino acid at that position over all the amino acid sequences.

⁴ NCBI, National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov.

procedure FINDCM(S, τ, ρ)
     /* S ⊆ Σ^n, 0 < τ ≤ 1, ρ ≥ 0 */
 1   C ← {1, ..., n};
 2   for i = 1 to n do
 3     if H_S(i) ≥ ρ then
 4       C ← C \ {i}; C_next ← C;
 5       foreach j ∈ C_next do
 6         if H_S(j) ≥ ρ and (H_S(j) < τH_S(i) or H_S(i)/τ < H_S(j)) then
 7           C_next ← C_next \ {j};
 8       EXPAND(S, {i}, C_next, τ);

procedure EXPAND(S, I, C, τ)
     /* I, C ⊆ {1, ..., n} */
 9   C_next ← C;
10   foreach i ∈ C do
11     C_next ← C_next \ {i};
12     if R_S(I ∪ {i}) ≥ τ then
13       output I ∪ {i};
14       EXPAND(S, I ∪ {i}, C_next, τ);
Algorithm 1: FINDCM.
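The replacement of unknown residues described above can be sketched as follows. This is a minimal sketch under stated assumptions: the sequences are already aligned to equal length, the gap symbol is not counted as an amino acid when determining the most frequent one, and ties are broken by first occurrence. The function name `replace_unknown` is hypothetical.

```python
from collections import Counter


def replace_unknown(seqs, unknown="X", gap="-"):
    """Replace each `unknown` symbol by the most frequent amino acid in its column."""
    modes = []
    for column in zip(*seqs):
        # Assumption: neither 'X' nor the gap symbol counts as an amino acid.
        counts = Counter(c for c in column if c not in (unknown, gap))
        # If a column holds no amino acid at all, leave the symbol unchanged.
        modes.append(counts.most_common(1)[0][0] if counts else unknown)
    return ["".join(m if c == unknown else c for c, m in zip(s, modes))
            for s in seqs]


if __name__ == "__main__":
    print(replace_unknown(["AXC", "AGC", "TGX"]))
```

Gap symbols themselves are left in place, so a deletion remains visible as a distinct symbol at its position.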
3.1 Running Time
In this subsection, we investigate the running time of the algorithm FINDCM. The computing
environment is WSL2 on Windows 10 with an Intel Core i7-7700 3.6GHz CPU and 16GB of RAM.
Table 2 and Table 3 show the running time of the algorithm FINDCM for amino acid sequences of the
Delta variant and the Omicron variant, respectively.
Tables 2 and 3 show that, for 0.80 ≤ τ ≤ 0.90, the computation time for the Delta variant is larger
than that for the Omicron variant for ρ = 0.10 and 0.05, whereas it is much smaller than that for the
Omicron variant for ρ = 0.01. Note that the number of sequences for the Delta variant is 834,870 and
that for the Omicron variant is 393,655, so the former is more than twice the latter. Hence, comparing
the computation time of FINDCM with the number of sequences, the number of candidates of correlated
mutations not pruned by τ and ρ in FINDCM is much larger for the Omicron variant than for the Delta
variant when ρ is small.
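The comparison can be made concrete with a few lines of arithmetic on the figures reported in Tables 2 and 3 for ρ = 0.01. The snippet below only restates those numbers; the variable names are chosen here for illustration.

```python
# Numbers of sequences reported for each variant.
delta_sequences, omicron_sequences = 834_870, 393_655

# Running times in seconds at rho = 0.01, keyed by tau (Tables 2 and 3).
delta_time = {0.90: 1366, 0.85: 2089, 0.80: 2817}
omicron_time = {0.90: 11840, 0.85: 80522, 0.80: 427746}

# How many times more sequences the Delta data set contains.
size_ratio = delta_sequences / omicron_sequences

# How many times longer FINDCM runs on the Omicron data set.
slowdown = {tau: omicron_time[tau] / delta_time[tau] for tau in delta_time}

if __name__ == "__main__":
    print(f"Delta/Omicron sequence count: {size_ratio:.2f}")
    for tau, factor in slowdown.items():
        print(f"tau = {tau:.2f}: Omicron/Delta running time = {factor:.1f}")
```

Although the Delta data set is roughly twice as large, the Omicron runs take several times to more than a hundred times longer at ρ = 0.01, which is consistent with far more unpruned candidates for the Omicron variant.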
Table 2: The running time (sec.) of the algorithm FINDCM for amino acid sequences of the Delta variant.
τ      ρ      time (sec.)
0.90   0.10   56
0.90   0.05   74
0.90   0.01   1,366
0.85   0.10   89
0.85   0.05   114
0.85   0.01   2,089
0.80   0.10   144
0.80   0.05   173
0.80   0.01   2,817
Table 3: The running time (sec.) of the algorithm FINDCM for amino acid sequences of the Omicron variant.
τ      ρ      time (sec.)
0.90   0.10   2
0.90   0.05   9
0.90   0.01   11,840
0.85   0.10   3
0.85   0.05   11
0.85   0.01   80,522
0.80   0.10   4
0.80   0.05   16
0.80   0.01   427,746
0.75   0.10   5
0.70   0.10   5
0.65   0.10   5
0.60   0.10   6
0.55   0.10   6
0.50   0.10   7
0.45   0.10   21
0.40   0.10   30
0.35   0.10   40
0.30   0.10   78
0.25   0.10   127
0.20   0.10   242
3.2 Correlated Mutation for Delta Variant
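In the following subsections, we discuss the correlated mutations found by the algorithm FINDCM.
Here, we denote a correlated mutation by a set of elements of the following form, that is, the symbol
of a structural protein immediately followed by a position in it (for example, S452):
structural protein + position.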
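The notation can be illustrated by mapping a position index to such a label. The sketch below assumes, which the text does not state, that the four protein sequences are concatenated in the order S, E, M, N and indexed from 1; the lengths are those given at the beginning of Section 3, and the names `LENGTHS` and `label` are hypothetical.

```python
# Protein lengths as reported in the text (Nakagawa and Miyazaki, 2020).
LENGTHS = [("S", 1273), ("E", 75), ("M", 222), ("N", 419)]


def label(index):
    """Map a 1-based index in the assumed S+E+M+N concatenation to e.g. 'S452'."""
    if index < 1:
        raise ValueError("index must be at least 1")
    offset = 0
    for name, length in LENGTHS:
        if index <= offset + length:
            return f"{name}{index - offset}"
        offset += length
    raise ValueError("index exceeds the total length of the four proteins")


if __name__ == "__main__":
    print(label(452), label(1273 + 75 + 82), label(1273 + 75 + 222 + 204))
```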
In this subsection, we investigate the found correlated mutations for the Delta variant. When we fix
ρ = 0.01, Table 4 shows the correlated mutations found for the Delta variant by varying τ as 0.90, 0.85
and 0.80. Here, “id” means the index “CMi” of the i-th correlated mutation. Also, we denote the
positions in the spike protein substitutions for the SARS-CoV-2 Delta and Omicron variants in Table 1
in Section 1 by bold faces. Furthermore, we denote the newly added positions in a correlated mutation
as underlined.
Table 4 shows that, whereas the correlated mutations CM1 and CM4 do not change even if τ is varied
from 0.90 to 0.80, new positions are added to the correlated mutations CM2 and CM3 as τ decreases.
Also, all of the correlated mutations in Table 4 contain positions of S and of the other structural
proteins.

Table 4: The correlated mutations for the Delta variant obtained by varying τ as 0.90, 0.85 and 0.80 under ρ = 0.01.
τ      id    correlated mutation
0.90   CM1   S570, S716, S982, S1118, N3, N235
0.90   CM2   S452, S478, M82
0.90   CM3   S190, N80
0.90   CM4   S501, N204
0.85   CM1   S570, S716, S982, S1118, N3, N235
0.85   CM2   S19, S452, S478, M82, N377
0.85   CM3   S20, S190, S655, N80
0.85   CM4   S501, N204
0.80   CM1   S570, S716, S982, S1118, N3, N235
0.80   CM2   S19, S452, S478, S681, S950, M82, N63, N203, N377
0.80   CM3   S20, S26, S190, S655, S1027, S1176, N80
0.80   CM4   S501, N204
0.80   CM5   S289, N18
In particular, the correlated mutation CM2 at τ = 0.80 contains five positions of spike protein
substitutions, which are exactly its positions in S. Note that these five positions are known to
mainly characterize the SARS-CoV-2 Delta variant (Chen et al., 2022).
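The count of five can be checked directly against Table 1. The snippet below is a small illustration that transcribes the Delta positions from Table 1 and CM2 at τ = 0.80 from Table 4; the names `DELTA_SPIKE_POSITIONS`, `CM2_TAU_080` and `spike_hits` are hypothetical.

```python
# Positions of the Delta spike protein substitutions listed in Table 1.
DELTA_SPIKE_POSITIONS = {19, 70, 95, 142, 156, 157, 158, 222, 258,
                         417, 452, 478, 614, 681, 950}

# The correlated mutation CM2 at tau = 0.80 from Table 4.
CM2_TAU_080 = ["S19", "S452", "S478", "S681", "S950",
               "M82", "N63", "N203", "N377"]


def spike_hits(cm, spike_positions):
    """Return the S positions of a correlated mutation that appear in Table 1."""
    return sorted(int(item[1:]) for item in cm
                  if item.startswith("S") and int(item[1:]) in spike_positions)


if __name__ == "__main__":
    print(spike_hits(CM2_TAU_080, DELTA_SPIKE_POSITIONS))
```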
3.3 Correlated Mutation for Omicron Variant Under ρ = 0.01
In this and the next subsection, we investigate the correlated mutations for the Omicron variant.
First, when we fix ρ = 0.01, Table 5 shows the correlated mutations found for the Omicron variant by
varying τ as 0.90, 0.85 and 0.80. Here, “+” means that the correlated mutation is obtained by adding
the presented positions to the correlated mutation in the row above.

Table 5: The correlated mutations for the Omicron variant obtained by varying τ as 0.90, 0.85 and 0.80 under ρ = 0.01.
τ      correlated mutation
0.90   S109, S110, S114, S115, S116, S117, S119, S121, S122, S125, S126, S129,
       S131, S136, S139, S140, S149
0.85   + S113, S120, S122, S124, S127, S137, S144, S150
0.80   + S130, S134, S141, S154

Table 5 shows that, under ρ = 0.01 and 0.80 ≤ τ ≤ 0.90, we find one correlated mutation, which involves
only S, with positions from 109 to 154. Also, position 144 is involved in a spike protein substitution
in Table 1.
3.4 Correlated Mutation for Omicron Variant Under ρ = 0.10
Next, when we fix ρ = 0.10, Table 6 shows the correlated mutations found for the Omicron variant by
varying τ from 0.70 to 0.20 in steps of 0.05.
Table 6: The correlated mutations for the Omicron variant obtained by varying τ from 0.70 to 0.20 decreasing by 0.05
under ρ = 0.10.
τ           id    correlated mutation
0.70        CM1   S289, N18
0.65–0.55   CM1   S289, N18
0.65–0.55   CM2   S222, N215
0.50        CM1   S289, N18
0.50        CM2   S112, S222, N215
0.45        CM1   S5, S289, S809, S1104, S1264, N9, N18, N63
0.45        CM2   S112, S222, N215
0.45        CM3   S95, S142
0.40–0.35   CM4   S5, S112, S222, S289, S809, S1104, S1264, N9, N18, N63, N215
0.40–0.35   CM3   S95, S142
0.30–0.20   CM5   S5, S95, S112, S142, S222, S289, S809, S1104, S1264, N9, N18, N63, N215
Table 6 shows that all of the correlated mutations except CM3 contain positions of S and of the other
structural proteins. Also, the correlated mutation CM3 contains positions of spike protein
substitutions.
Table 6 also shows that the correlated mutation CM4 is the combination of CM1 and CM2, and the
correlated mutation CM5 is the combination of CM3 and CM4. Hence, we can regard the correlated
mutation CM5 as the convergence of the other found correlated mutations.
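The combination claims can be verified with set unions on the entries of Table 6. The snippet below transcribes CM1 and CM2 as they appear at τ = 0.45, together with CM3, CM4 and CM5; the variable names are chosen here for illustration.

```python
# Correlated mutations transcribed from Table 6 (CM1 and CM2 as of tau = 0.45).
cm1 = {"S5", "S289", "S809", "S1104", "S1264", "N9", "N18", "N63"}
cm2 = {"S112", "S222", "N215"}
cm3 = {"S95", "S142"}
cm4 = {"S5", "S112", "S222", "S289", "S809", "S1104", "S1264",
       "N9", "N18", "N63", "N215"}
cm5 = {"S5", "S95", "S112", "S142", "S222", "S289", "S809", "S1104",
       "S1264", "N9", "N18", "N63", "N215"}

# CM4 should be the union of CM1 and CM2, and CM5 the union of CM3 and CM4.
cm4_is_union = cm4 == cm1 | cm2
cm5_is_union = cm5 == cm3 | cm4

if __name__ == "__main__":
    print(cm4_is_union, cm5_is_union)
```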
On the other hand, the number of positions of spike protein substitutions occurring in the correlated
mutations is just two, which is small. Hence, it is a future work to find correlated mutations
containing more positions of spike protein substitutions.
4 CONCLUSION
In this paper, we have found the correlated mutations of positions among structural proteins in amino
acid sequences of the SARS-CoV-2 Delta and Omicron variants by using the algorithm FINDCM designed by
Shimada et al. (2012). We have obtained correlated mutations that span several structural proteins and
that contain positions occurring in the spike protein substitutions of the SARS-CoV-2 Delta and
Omicron variants.
In particular, we have found the correlated mutation CM5 in Table 6 as the convergence of several
correlated mutations containing the positions in the spike protein substitutions. On the other hand,
it is a future work to investigate, from a genomic viewpoint, the positions outside the spike protein
in CM2 at τ = 0.80 in Table 4 and in CM5 in Table 6.
The algorithm FINDCM is based on the set enumeration algorithm (Rymon, 1992). Hence, it is a future
work to design an algorithm for finding correlated mutations based on another enumeration algorithm,
introducing other thresholds like τ and ρ.
Whereas we have found correlated mutations involving the positions in the spike protein substitutions,
their number is small, in particular, for the Delta variant. Also, whereas the algorithm FINDCM finds
all of the correlated mutations under given τ and ρ, it is necessary to find the correlated mutations
involving the positions in the spike protein substitutions directly and efficiently. Hence, it is a
future work to design a new algorithm for finding the correlated mutations that contain several given
positions, such as the positions in spike protein substitutions, which may be more efficient than
FINDCM.
REFERENCES
Chen, K.-W. K., Huang, D. T.-N., and Huang, L.-M. (2022).
SARS-CoV-2 variants – evolution, spike protein, and
vaccines. Biomed. J., 45:573–579.
Cover, T. M. and Thomas, J. A. (2006). Elements of infor-
mation theory (Second edition). John Wiley & Sons.
Jeong, C. and Kim, D. (2010). Linear predictive coding
representation of correlated mutation for protein se-
quence alignment. BMC Bioinform., 11:52.
Kim, D., Lee, J.-Y., Yang, J.-S., Kim, J. W., Kim, V. N., and
Chang, H. (2020). The architecture of SARS-CoV-2
transcriptome. Cell, 181.
Lee, B.-C. and Kim, D. (2009). A new method for reveal-
ing correlated mutations under the structural and func-
tional constraints in proteins. Bioinform., 25:2506–
2513.
Nakagawa, S. and Miyazaki, T. (2020). Genome evolution
of SARS-CoV-2 and its virological characteristics. Inflamm.
Regen., 40:17.
Oliveria, L., Paiva, A. C. M., and Vriend, G. (2002). Corre-
lated mutation analyses on very large sequence fami-
lies. ChemBioChem, 3:1010–1017.
Rymon, R. (1992). Search through systematic set enumera-
tion. In Proc. KR’92, pages 539–550.
Shimada, T., Hazemoto, T., Makino, S., Hirata, K.,
Yonezawa, K., and Ito, K. (2012). Finding correlated
mutations among RNA segments in H3N2 influenza
viruses. In Proc. SCIS-ISIS’12, pages 1696–1705.
Yonashiro, K., Shimaya, Y., and Hirata, K. (2022). Find-
ing correlated mutations of positions among structural
proteins in SARS-CoV-2 amino acid sequences. In
Proc. ESKM’22, pages 61–64.