To improve the AM further, the individual elemental
contributions are then analyzed in order to estimate the
errors attributable to C, H, N, O and S.
2  EXPERIMENT AND METHOD 
Comprehensive simulation analysis can often reveal the
essential patterns hidden in complicated data sets. To
find regular patterns in the mass errors that arise when
the averagine model (AM) is applied to all human proteins,
an in-house program was developed in MATLAB, whose
functions and bioinformatics tools can handle large
amounts of protein data and conveniently turn large
numerical results into diagrams such as scatter plots
and bar charts.
The averagine model serves as the basis of the
experiment; it also provides the basic idea of how the
elemental composition of unknown large proteins can be
estimated.
The molecular information for each protein in the
human protein database was used, and the estimated
masses were compared with the theoretical masses
calculated from the formulas provided by the database.
Both the average mass errors and the monoisotopic mass
errors were obtained across the different mass ranges.
All statistical calculations presented here are based
on the Human Protein Database, a collection of 20,341
protein sequences (June 2019). The primary task of our
study is to obtain the mass error distribution over the
full mass range, which provides the experimental
foundation for improving the AM by reducing its
estimation errors when applied to large proteins with
MW above 30 kDa.
2.1  Main Analysis Process 
To obtain the estimated mass errors, four computational
steps are carried out as follows (Figure 1):
  Step 1: Compute the molecular formula of every
protein in the Human Protein Database;
  Step 2: Use the formulas obtained in Step 1 and the
emass algorithm to compute the theoretical isotopic
distributions;
  Step 3: Use the AM and the average mass obtained in
Step 2 to estimate the formula of each protein;
  Step 4: Generate two types of mass errors, i.e.,
average mass errors and monoisotopic mass errors.
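
The formula-related steps above can be sketched in a few lines. The following Python sketch (not the in-house MATLAB program) illustrates Step 1 and Step 3 only; the emass calculation of Step 2 is not reproduced. It assumes the standard residue compositions of the 20 common amino acids, the averagine unit C4.9384 H7.7583 N1.3577 O1.4773 S0.0417 with an average mass of 111.1254 Da, and plain rounding to integer atom counts. The function names are illustrative.

```python
from collections import Counter

# Elemental composition of each amino acid residue (free amino acid minus H2O).
RESIDUES = {
    "G": {"C": 2, "H": 3, "N": 1, "O": 1},
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},
    "S": {"C": 3, "H": 5, "N": 1, "O": 2},
    "P": {"C": 5, "H": 7, "N": 1, "O": 1},
    "V": {"C": 5, "H": 9, "N": 1, "O": 1},
    "T": {"C": 4, "H": 7, "N": 1, "O": 2},
    "C": {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1},
    "L": {"C": 6, "H": 11, "N": 1, "O": 1},
    "I": {"C": 6, "H": 11, "N": 1, "O": 1},
    "N": {"C": 4, "H": 6, "N": 2, "O": 2},
    "D": {"C": 4, "H": 5, "N": 1, "O": 3},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
    "K": {"C": 6, "H": 12, "N": 2, "O": 1},
    "E": {"C": 5, "H": 7, "N": 1, "O": 3},
    "M": {"C": 5, "H": 9, "N": 1, "O": 1, "S": 1},
    "H": {"C": 6, "H": 7, "N": 3, "O": 1},
    "F": {"C": 9, "H": 9, "N": 1, "O": 1},
    "R": {"C": 6, "H": 12, "N": 4, "O": 1},
    "Y": {"C": 9, "H": 9, "N": 1, "O": 2},
    "W": {"C": 11, "H": 10, "N": 2, "O": 1},
}
WATER = {"H": 2, "O": 1}  # added once for the free N- and C-termini

# Assumed averagine unit and its average mass in Da.
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
AVERAGINE_MASS = 111.1254


def sequence_formula(sequence):
    """Step 1: sum residue compositions plus one water.

    Non-standard residue codes (e.g. U or X) raise KeyError in this sketch.
    """
    formula = Counter(WATER)
    for residue in sequence:
        formula.update(RESIDUES[residue])
    return dict(formula)


def averagine_formula(average_mass):
    """Step 3: scale the averagine unit to the given average mass.

    Atom counts are simply rounded; no further adjustment is applied here.
    """
    units = average_mass / AVERAGINE_MASS
    return {element: round(count * units) for element, count in AVERAGINE.items()}
```

For example, `sequence_formula("GA")` gives the formula of glycylalanine, C5H10N2O3, and `averagine_formula(111.1254)` gives a single rounded averagine unit, C5H8N1O1S0.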
 
Figure 1: Diagram of the four computing steps. 
 
Figure 2: Key process of AM application. 
Although the average mass is widely used for mass
estimation of large molecules, the monoisotopic mass
still represents the most accurate mass for a
compound.
In this experiment, two sets of errors are computed
through the four steps introduced above: monoisotopic
mass errors and average mass errors (Figure 2).
Both errors are considered because the former offers
hints on how to improve the AM, while the latter offers
information about unknown large molecules that can be
validated against the information from the database.
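
As an illustration of the two error types, the sketch below computes the average and monoisotopic masses of a formula and the difference between an estimated and a theoretical formula. It is a minimal Python sketch under stated assumptions: the atomic masses are standard tabulated values, the error is taken as estimated minus theoretical mass in Da (a sign convention assumed here, not stated in the text), and the function names are illustrative.

```python
# Average (abundance-weighted) atomic masses in Da.
AVERAGE_MASS = {"C": 12.0107, "H": 1.00794, "N": 14.0067,
                "O": 15.9994, "S": 32.065}

# Masses of the most abundant isotopes (12C, 1H, 14N, 16O, 32S) in Da.
MONO_MASS = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052,
             "O": 15.9949146221, "S": 31.97207069}


def formula_mass(formula, table):
    """Mass of a formula given as {element: atom count}."""
    return sum(table[element] * count for element, count in formula.items())


def mass_errors(theoretical_formula, estimated_formula):
    """Return (average mass error, monoisotopic mass error) in Da.

    Assumed convention: error = estimated mass - theoretical mass.
    """
    average_error = (formula_mass(estimated_formula, AVERAGE_MASS)
                     - formula_mass(theoretical_formula, AVERAGE_MASS))
    mono_error = (formula_mass(estimated_formula, MONO_MASS)
                  - formula_mass(theoretical_formula, MONO_MASS))
    return average_error, mono_error
```

Calling `mass_errors` with the database formula and the averagine-derived formula of one protein yields one point of each error distribution; repeating this over all proteins and grouping by mass gives the errors per mass range.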
2.2  Simulation on the Estimated Mass 
Errors for All Proteins from 
Human Database 
We statistically computed two types of mass errors
between the averagine-fit and theoretical isotopic
clusters. Based on the resulting distributions, we then
compared the mass error ranges for average masses and
monoisotopic masses. The results showed that the mass
accuracy can be improved remarkably for large proteins
in terms of the monoisotopic mass errors.
However, this is not sufficient for high-resolution
mass spectrometers. Therefore, a further analysis of the
elemental contributions is provided to estimate the mass
errors arising from the individual elements C, H, N, O,
and S.
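
One plausible way to split a mass error into elemental contributions is sketched below in Python. This is an assumption about how such a decomposition could be done, not a description of the analysis reported here: each element's contribution is taken as its atom-count difference (estimated minus theoretical) multiplied by its monoisotopic mass, so that the five contributions sum to the total monoisotopic mass error. The function name is illustrative.

```python
# Masses of the most abundant isotopes (12C, 1H, 14N, 16O, 32S) in Da.
MONO_MASS = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052,
             "O": 15.9949146221, "S": 31.97207069}


def elemental_contributions(theoretical_formula, estimated_formula,
                            table=MONO_MASS):
    """Per-element mass error in Da: (estimated - theoretical count) * mass.

    Elements missing from a formula are treated as a count of zero.
    """
    return {
        element: (estimated_formula.get(element, 0)
                  - theoretical_formula.get(element, 0)) * mass
        for element, mass in table.items()
    }
```

Summing the returned values gives the total mass error of the estimated formula, which makes it easy to see which element dominates the error for a given protein.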
More detailed results are shown in the next
section.
3  RESULT AND CONCLUSION 
As  stated  previously,   the  estimated   average  mass