(residues 221-228) based on H3 numbering (Das et 
al., 2009; Durand et al., 2015; Jiang et al., 2012; 
Stevens et al., 2006). Mutations in the RBD are 
thought to affect the receptor-binding avidity 
and specificity of hemagglutinin 
(Chen et al., 2011; de Vries et al., 2013; de Vries et 
al., 2014; Schrauwen and Fouchier, 2014). The RBD 
is the primary target of neutralizing antibodies, 
which are induced by virus infection or by 
vaccination with specific antigen (Bright et al., 
2003; Chen et al., 2011; Jiang et al., 2012; Khurana 
et al., 2011; McCullough et al., 2012). However, 
mutations in the RBD lead to changes in viral 
immunogenicity and antigenicity (Chen et al., 2011; 
Xu et al., 2010). Jiang et al. (2012) state that the 
RBD plays a critical role in the elicitation of 
antiviral immune responses and protective immunity. 
McCullough et al. (2012) also state that a better 
understanding of mutations in the RBD may be 
useful in vaccine and drug design efforts. To prepare 
for the future emergence of potentially dangerous 
outbreaks caused by divergent influenza strains, 
including human-adapted H5N1 strains, it is 
imperative that we understand the rules encoded in 
the sequence of the RBD.
  
The information of life is stored as a code composed 
of four nucleotides: adenine (A), cytosine (C), 
guanine (G), and thymine (T). We can therefore 
regard the DNA or gene of each organism as a code 
that reflects its inherent structure. In a 
protein-coding region, each group of three 
consecutive nucleotides is called a codon, and each 
codon corresponds to one amino acid. The number of 
possible nucleotide triplets is 4^3 = 64, so there 
are 64 codons. However, the standard genetic code 
encodes only 20 proteinogenic amino acids. Moreover, 
it is supposed that the third nucleotide of a codon 
often does not play an essential role in specifying 
the amino acid. This shows that a gene has 
redundancy that can correct errors to some extent. 
In other words, it has a structure similar to that 
of an error-correcting/detecting code for the 
transmission of information. In life-science 
research, it is important to determine the code 
structure of the target gene. Once we know the code 
structure, we can apply mathematical results from 
coding theory to research in life science. How can 
the RBD sequences of influenza A viruses be analyzed 
using coding theory? The present study was conducted 
to determine the code structure of the 220 loop of 
influenza A viruses and to predict sequence changes 
in the 220 loop of the H5N1 virus. 
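The codon counting and the redundancy of the third nucleotide described above can be checked with a short sketch. It is an illustration only and not part of the analysis in this study: it assumes the standard genetic code (NCBI translation table 1), and the names CODON_TABLE and third_position_free_prefixes are hypothetical helpers introduced here.

```python
from itertools import product

# Conventional base ordering of the standard codon table (DNA alphabet).
BASES = "TCAG"

# Standard genetic code (NCBI translation table 1), listed in TCAG order
# with the third base varying fastest; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4^3 = 64 codons to its amino acid (or stop signal).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def third_position_free_prefixes(table):
    """Return the two-nucleotide prefixes for which all four choices of
    the third nucleotide encode the same amino acid."""
    prefixes = []
    for first, second in product(BASES, repeat=2):
        encoded = {table[first + second + third] for third in BASES}
        if len(encoded) == 1:
            prefixes.append(first + second)
    return prefixes


num_codons = len(CODON_TABLE)
amino_acids = set(CODON_TABLE.values()) - {"*"}
free_prefixes = third_position_free_prefixes(CODON_TABLE)

print("codons:", num_codons)
print("amino acids:", len(amino_acids))
print("prefixes that ignore the third nucleotide:", free_prefixes)
```

Under this assumption, the third nucleotide is fully redundant for only some of the 16 two-nucleotide prefixes, which is why the statement about the third position is best read as a tendency rather than a strict rule.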
2 METHODS 
2.1 Sequence Data 
We applied artificial codes from coding theory to 
sequence analysis of the 220 loop in the H1, H3, H5, 
and H7 RBD. All full-length amino acid and 
nucleotide sequences of hemagglutinin from 
influenza A H1, H3, H5, and H7 subtypes were 
downloaded from the Influenza Research Database 
in September 2014. The hemagglutinin data set 
consists of 8,941 human sequences from the H1 
subtype between 1918 and 2014, 6,013 human 
sequences from the H3 subtype between 1968 and 2014, 
230 human sequences from the H5 subtype between 
1997 and 2013, and 51 human sequences from the H7 
subtype between 1996 and 2014. The sequences 
were aligned using MAFFT (Katoh and Toh, 2008), 
which can quickly process a large dataset. 
2.2 Sequence Analysis of the 220 Loop by Coding Theory 
We explain how to encode the nucleotide sequence 
of the 220 loop to detect the code structure. The 
method for applying artificial codes to sequence 
analysis has been described in detail previously 
(Ohya and Sato, 2000; Sato et al., 2013). Since the 
Galois field GF(4) consists of four elements, 0, 1, 
$\alpha$, and $\alpha^2$, such that 
$\alpha^2 + \alpha + 1 = 0$, each of the four 
nucleotides can be assigned to one of the four 
elements. There are a total of 24 (= 4!) possible 
ways to map the four nucleotides to the four 
elements of GF(4). 
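The arithmetic of GF(4) and the 24 possible nucleotide assignments can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the integers 2 and 3 stand for the primitive element and its square, and the function names gf4_add, gf4_mul, nucleotide_mappings, and to_gf4 are hypothetical.

```python
from itertools import permutations

# Assumed representation: 0 and 1 are themselves, 2 stands for alpha,
# and 3 stands for alpha^2 = alpha + 1.
ELEMENTS = (0, 1, 2, 3)

# Powers of alpha: alpha^0 = 1, alpha^1 = 2, alpha^2 = 3 (and alpha^3 = 1).
_EXP = (1, 2, 3)
_LOG = {1: 0, 2: 1, 3: 2}


def gf4_add(a, b):
    """Addition in GF(4): bitwise XOR of the two-bit representations."""
    return a ^ b


def gf4_mul(a, b):
    """Multiplication in GF(4) via discrete logarithms to base alpha."""
    if a == 0 or b == 0:
        return 0
    return _EXP[(_LOG[a] + _LOG[b]) % 3]


def nucleotide_mappings(nucleotides="ACGT"):
    """Enumerate all 4! = 24 one-to-one assignments of nucleotides to GF(4)."""
    return [dict(zip(nucleotides, perm)) for perm in permutations(ELEMENTS)]


def to_gf4(sequence, mapping):
    """Convert a nucleotide string into a list of GF(4) elements."""
    return [mapping[nucleotide] for nucleotide in sequence]


# Check the defining relation alpha^2 + alpha + 1 = 0.
alpha = 2
relation = gf4_add(gf4_add(gf4_mul(alpha, alpha), alpha), 1)
print("alpha^2 + alpha + 1 =", relation)

mappings = nucleotide_mappings()
print("number of mappings:", len(mappings))
print("example:", to_gf4("ACGT", mappings[0]))
```

Representing the elements as two-bit integers keeps addition a single XOR, and iterating over the list returned by nucleotide_mappings makes it straightforward to try every assignment in turn.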
First, the essential part of the nucleotide 
sequence of the 220 loop from an influenza strain, 
namely the nucleotide sequence excluding the third 
nucleotide of each codon, is transformed into an 
information sequence consisting of elements 
of GF(4). Next, the information sequence is grouped 
into blocks and then encoded into code words of an 
error-correcting/detecting code C. The code word 
length of such a code is a multiple of 3, and the 
information block length is a multiple of 2. The 
check symbols in each code word are placed at the 
positions corresponding to the third nucleotide of 
each codon. Then, the encoded sequence, which 
consists of the set of code words, is converted back 
into a nucleotide sequence, which we call the 
encoded nucleotide sequence. After that, the encoded 
nucleotide sequence is translated into an amino acid 
sequence, which we call the encoded amino acid 
sequence. Finally, the degree of 
similarity between the amino acid sequence of the