Authors:
Kam-Heung Sze
1
;
Zhiqiang Xiong
1
;
Jinlong Ma
1
;
Gang Lu
2
;
Wai-Yee Chan
2
and
Hongjian Li
1
;
2
Affiliations:
1
Bioinformatics Unit, SDIVF R&D Centre, Hong Kong Science Park, Sha Tin, New Territories, Hong Kong
;
2
CUHK-SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong
Keyword(s):
Molecular Docking, Binding Affinity Prediction, Machine Learning, Feature Engineering, Data Similarity.
Abstract:
Inconsistent conclusions have been drawn from recent studies exploring the influence of data similarity on the scoring power of machine-learning scoring functions, but they were all based on the PDBbind v2007 refined set whose data size is limited to just 1300 protein-ligand complexes. Whether these conclusions can be generalized to substantially larger and more diverse datasets warrants further examinations. Besides, the previous definition of protein structure similarity, which relied on aligning monomers, might not truly reflect what it was supposed to be. Moreover, the impact of binding pocket similarity has not been investigated either. Here we have employed the updated refined set v2013 providing 2959 complexes and utilized not only protein structure and ligand fingerprint similarity but also a novel measure based on binding pocket topology dissimilarity to systematically control how similar or dissimilar complexes are incorporated for training predictive models. Three empirica
l scoring functions X-Score, AutoDock Vina, Cyscore and their random forest counterparts were evaluated. Results have confirmed that dissimilar training complexes may be valuable if allied with appropriate machine learning algorithms and informative descriptor sets. Machine-learning scoring functions acquire their remarkable scoring power through mining more data to advance performance persistently, whereas classical scoring functions lack such learning ability. The software code and data used in this study and supplementary results are available at https://GitHub.com/HongjianLi/MLSF.
(More)