Influence of Data Similarity on the Scoring Power of Machine-learning Scoring Functions for Docking

Kam-Heung Sze, Zhiqiang Xiong, Jinlong Ma, Gang Lu, Wai-Yee Chan, Hongjian Li

2020

Abstract

Recent studies exploring the influence of data similarity on the scoring power of machine-learning scoring functions have drawn inconsistent conclusions, but all of them were based on the PDBbind v2007 refined set, which is limited to just 1300 protein-ligand complexes. Whether these conclusions generalize to substantially larger and more diverse datasets warrants further examination. In addition, the previous definition of protein structure similarity, which relied on aligning monomers, might not truly reflect the similarity it was intended to capture, and the impact of binding pocket similarity has not been investigated. Here we employed the updated refined set v2013, which provides 2959 complexes, and used not only protein structure and ligand fingerprint similarity but also a novel measure based on binding pocket topology dissimilarity to systematically control how similar or dissimilar the complexes incorporated for training predictive models are. Three empirical scoring functions (X-Score, AutoDock Vina and Cyscore) and their random forest counterparts were evaluated. The results confirm that dissimilar training complexes may be valuable if allied with appropriate machine learning algorithms and informative descriptor sets. Machine-learning scoring functions acquire their remarkable scoring power by mining more data to improve performance persistently, whereas classical scoring functions lack such learning ability. The software code and data used in this study, together with supplementary results, are available at https://GitHub.com/HongjianLi/MLSF.
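The idea of controlling which complexes enter the training set by their similarity to the test set can be illustrated with a minimal sketch. It is not the code from the repository above. It assumes that ligand fingerprints are represented as sets of "on" bit indices, that similarity is the Tanimoto coefficient, and that a training complex is kept when its highest similarity to any test complex does not exceed a cutoff. The names `tanimoto` and `select_training` and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Assumption: a fingerprint is the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient |a & b| / |a | b|; 0.0 if both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_training(
    train: Dict[str, Fingerprint],
    test: Dict[str, Fingerprint],
    cutoff: float,
) -> List[str]:
    """Return names of training complexes whose highest similarity
    to any test complex is no greater than `cutoff` (assumed rule)."""
    selected = []
    for name, fp in train.items():
        max_sim = max((tanimoto(fp, t) for t in test.values()), default=0.0)
        if max_sim <= cutoff:
            selected.append(name)
    return sorted(selected)


# Toy data, invented purely for illustration.
train = {
    "c1": frozenset({1, 2, 3, 4}),
    "c2": frozenset({1, 2, 7, 8}),
    "c3": frozenset({9, 10}),
}
test = {"t1": frozenset({1, 2, 3, 5})}

# Raising the cutoff admits progressively more similar training complexes.
for cutoff in (0.0, 0.5, 1.0):
    print(cutoff, select_training(train, test, cutoff))
```

Sweeping the cutoff upward yields nested training sets of growing size. A model could then be retrained on each set to see how its scoring power changes as more similar complexes are added.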



Paper Citation


in Harvard Style

Sze K., Xiong Z., Ma J., Lu G., Chan W. and Li H. (2020). Influence of Data Similarity on the Scoring Power of Machine-learning Scoring Functions for Docking. In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) - Volume 3: BIOINFORMATICS; ISBN 978-989-758-398-8, SciTePress, pages 85-92. DOI: 10.5220/0008873800850092


in Bibtex Style

@conference{bioinformatics20,
author={Kam-Heung Sze and Zhiqiang Xiong and Jinlong Ma and Gang Lu and Wai-Yee Chan and Hongjian Li},
title={Influence of Data Similarity on the Scoring Power of Machine-learning Scoring Functions for Docking},
booktitle={Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) - Volume 3: BIOINFORMATICS},
year={2020},
pages={85-92},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0008873800850092},
isbn={978-989-758-398-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) - Volume 3: BIOINFORMATICS
TI - Influence of Data Similarity on the Scoring Power of Machine-learning Scoring Functions for Docking
SN - 978-989-758-398-8
AU - Sze K.
AU - Xiong Z.
AU - Ma J.
AU - Lu G.
AU - Chan W.
AU - Li H.
PY - 2020
SP - 85
EP - 92
DO - 10.5220/0008873800850092
PB - SciTePress