Sharing Bioinformatic Data for Machine Learning: Maximizing Interoperability through License Selection

Alexander Bernier, Adrian Thorogood


Efficient machine learning in bioinformatics requires a large volume of data from different sources. Bioinformatics is shifting from a paradigm of siloed analysis of individual datasets by researchers to the aggregation and analysis of disparate sets of health and biomedical data from across academic, healthcare and commercial settings. Data-generating organizations must therefore select legal terms for dataset release that promote compatibility with other datasets. In releasing bioinformatic data for open use, care must be taken that the terms of the selected licenses maximize interoperability. The following technical elements should inform the choice of license: license hybridity; waivers of liability, warranties and guarantees; commercial/non-commercial use; attribution and copyleft; granular permission; and bilateral or multilateral licensing. Licenses are compared to inform optimal license selection and enable data integration and analysis; consideration is given to an eventual standard license for open sharing of bioinformatic data.

