3 RESULTS AND DISCUSSION 
First, we had to determine specific cutoff values to systematically control how similar or dissimilar to the test set the complexes incorporated for training are. After trying different settings, we decided that, for protein structure similarity, the cutoff value c increases from 0.40 to 1.00 with a step size of 0.01 in the direction that progressively admits training complexes increasingly similar to the test set, and decreases from 0.99 to 0.40 and then to 0 in the opposite direction that progressively admits increasingly dissimilar ones; for ligand fingerprint similarity, c increases from 0.55 to 1.00 in the former direction, and decreases from 0.99 to 0.55 and then to 0 in the latter; for pocket topology dissimilarity, c decreases from 10.0 to 0 with a step size of 0.2 in the former direction, and increases from 0.2 to 10.0 and then to +∞ in the latter.
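To make these schedules concrete, the sketch below (Python, a minimal illustration rather than the original implementation; the helper cutoff_schedule and all variable names are ours) enumerates the cutoff values used in each direction for the three measures.

```python
# A minimal sketch of the cutoff schedules described above. Direction 1 is the
# one in which the training set grows by admitting complexes increasingly
# similar to the test set; direction 2 is the opposite (our reading of the text).

def cutoff_schedule(start, stop, step, tail=None):
    """Enumerate cutoffs from start to stop (inclusive) with a signed step,
    optionally appending a final extreme value such as 0 or +inf."""
    n = int(round((stop - start) / step)) + 1
    values = [round(start + i * step, 2) for i in range(n)]
    if tail is not None:
        values.append(tail)
    return values

# Protein structure similarity: 0.40 -> 1.00, and 0.99 -> 0.40 -> 0.
protein_dir1 = cutoff_schedule(0.40, 1.00, 0.01)
protein_dir2 = cutoff_schedule(0.99, 0.40, -0.01, tail=0.0)

# Ligand fingerprint similarity: 0.55 -> 1.00, and 0.99 -> 0.55 -> 0.
ligand_dir1 = cutoff_schedule(0.55, 1.00, 0.01)
ligand_dir2 = cutoff_schedule(0.99, 0.55, -0.01, tail=0.0)

# Pocket topology dissimilarity: 10.0 -> 0, and 0.2 -> 10.0 -> +inf.
pocket_dir1 = cutoff_schedule(10.0, 0.0, -0.2)
pocket_dir2 = cutoff_schedule(0.2, 10.0, 0.2, tail=float("inf"))

print(len(protein_dir1), len(protein_dir2))  # 61 cutoffs in each direction
```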
Next, we plotted the number of training complexes against the three types of cutoff (Figure 1) to show visually that these distributions are far from uniform. In fact, the distribution of training
complexes under the protein structure similarity 
measure is extraordinarily skewed, e.g. as many as 
859 training complexes (accounting for 31% of the 
original full training set of 2764 complexes) have a 
test set similarity greater than 0.99 (note the sheer 
height of the rightmost bar), and 156 training 
complexes have a test set similarity in the range of 
(0.98, 0.99]. Incrementing the cutoff by just 0.01 from 
0.99 to 1.00 will include 859 additional training 
complexes, whereas incrementing the cutoff by the 
same step size from 0.90 to 0.91 will include merely 
17 additional training complexes, and none at all from 0.72 to 0.73. One would therefore naturally expect
a significant performance gain from raising the cutoff 
by just 0.01 if the cutoff is already at 0.99. This is also 
true, although less apparent, for ligand fingerprint 
similarity, where 179 training complexes have a test 
set similarity greater than 0.99. The distribution under 
the pocket topology dissimilarity measure, however, 
seems relatively more uniform, with just 15 
complexes falling in the range of [0, 0.2) and just 134 
complexes in the range of [10, +∞). Hence this supplementary similarity measure based on pocket topology, introduced for the first time in this study, offers a different tool for investigating the influence of data similarity on the scoring power of SFs, without the training set size being strongly biased towards either end of the cutoff range.
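As a further illustration, the following minimal Python sketch (our own, with a random placeholder array sim_to_test standing in for each training complex's protein structure similarity to the test set, and assuming a complex is admitted in the first direction when its similarity does not exceed the cutoff) shows how the per-bin counts in Figure 1 and the nested training-set sizes can be derived.

```python
import numpy as np

# Placeholder for the real data: sim_to_test[i] is training complex i's
# protein structure similarity to its closest counterpart in the test set.
rng = np.random.default_rng(0)
sim_to_test = rng.uniform(0.0, 1.0, size=2764)

# Number of training complexes per 0.01-wide similarity bin, as in Figure 1.
bin_edges = np.linspace(0.40, 1.00, 61)
counts, _ = np.histogram(sim_to_test, bins=bin_edges)

# Size of the nested training set at each cutoff c in the first direction,
# assuming complexes with similarity <= c are admitted.
cutoffs = np.round(np.arange(0.40, 1.001, 0.01), 2)
nested_sizes = [int((sim_to_test <= c).sum()) for c in cutoffs]

# The jump from c = 0.99 to c = 1.00 equals the number of training complexes
# whose test-set similarity exceeds 0.99 (859 complexes in the real data).
print(nested_sizes[-1] - nested_sizes[-2])
```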
Keeping in mind the uneven distributions illustrated above, we re-trained the three classical SFs
(MLR::Xscore, MLR::Vina, MLR::Cyscore) and the 
four machine-learning SFs (RF::Xscore, RF::Vina, 
RF::Cyscore, RF::XVC) on the 61 nested training sets 
generated with the protein structure similarity measure,
evaluated their scoring power on the PDBbind v2013 
core set, and plotted their predictive performance (in 
terms of Rp, Rs and RMSE) on a consistent scale against both the cutoff value and the number of training complexes, in both similarity directions (Figure 2).
Looking at the top row alone, where RF::Xscore was unable to surpass MLR::Xscore until the similarity cutoff reached 0.99, it is not surprising that Li and Yang concluded that, after removal of training proteins highly similar to the test proteins, machine-learning SFs did not outperform classical SFs in Rp (Li and Yang, 2017) (note that the v2007 dataset employed in previous studies has a skewed distribution analogous to that of the v2013 dataset employed in this study; data not shown). Nonetheless, if one looks at the second row,
which plots essentially the same result but against the 
associated number of training complexes instead, it 
becomes clear that RF::Xscore trained on 1905 complexes (the number associated with cutoff 0.99, about 69% of the full 2764 complexes) was able to outperform MLR::Xscore,
which was already the best performing classical SF 
considered here. In terms of RMSE, RF::Xscore 
surpassed MLR::Xscore at cutoff=0.91 when they 
were trained on just 1458 (53%) complexes whose 
proteins are not so similar to those in the test set. This 
is more apparent for RF::XVC, which outperformed 
MLR::Xscore at a cutoff of just 0.70, corresponding 
to only 1243 (45%) training complexes. In other 
words, even if the original training set were split into two halves and the half with proteins dissimilar to the test set were used for training, machine-learning SFs would still produce a smaller prediction error than the best classical SF. Having said that, it does not make sense for anyone to exclude the most relevant samples from training (Li et al., 2018). When the full training set
was used, a large performance gap between machine-
learning and classical SFs was observed. From a 
different viewpoint, comparing the top two rows, which show basically the same result but with different horizontal axes, reveals that the crossing point where RF::Xscore started to overtake MLR::Xscore lies near the right edge of the subfigures in the first row, whereas the same crossing point is noticeably shifted to the left in the second row. This suggests that the outstanding scoring power of RF::Xscore and RF::XVC is actually attributable to the increasing training set size rather than exclusively to a high similarity cutoff value, as claimed previously.
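A minimal sketch of this re-training and evaluation loop is given below, assuming scikit-learn stand-ins (LinearRegression for the MLR-based SFs and RandomForestRegressor for the RF-based ones) and random placeholder features and affinities; Rp, Rs and RMSE are computed with the standard Pearson, Spearman and root-mean-square-error formulas.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Random placeholders standing in for the real feature matrices, measured
# affinities and per-complex similarities to the test set.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2764, 6)), rng.uniform(2, 12, 2764)
X_test, y_test = rng.normal(size=(195, 6)), rng.uniform(2, 12, 195)
sim_to_test = rng.uniform(0.0, 1.0, 2764)

def scoring_power(y_true, y_pred):
    """Pearson's Rp, Spearman's Rs and RMSE of predicted binding affinities."""
    rp = stats.pearsonr(y_true, y_pred)[0]
    rs = stats.spearmanr(y_true, y_pred)[0]
    rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
    return rp, rs, rmse

results = []
for cutoff in np.round(np.arange(0.40, 1.001, 0.01), 2):
    mask = sim_to_test <= cutoff  # nested training set for this cutoff
    for name, model in (("MLR", LinearRegression()),
                        ("RF", RandomForestRegressor(n_estimators=500,
                                                     random_state=0))):
        model.fit(X_train[mask], y_train[mask])
        rp, rs, rmse = scoring_power(y_test, model.predict(X_test))
        results.append((cutoff, int(mask.sum()), name, rp, rs, rmse))
# results can then be plotted against cutoff (first row of Figure 2) or
# against the nested training-set size (second row).
```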
Due to the skewness of the distribution of training 
complexes under the protein structure similarity 
measure, it is understandable to anticipate a remarkable performance gain from raising the cutoff by only 0.01 if it has already reached 0.99, because doing so will