Authors:
Carly A. Bobak¹; Alexander J. Titus² and Jane E. Hill¹
Affiliations:
¹ Dartmouth School of Graduate and Advanced Studies, United States; ² Dartmouth School of Graduate and Advanced Studies and Dartmouth Geisel School of Medicine, United States
Keyword(s):
Tuberculosis, Random Forest, Machine Learning, Transcriptional Signatures, Data Integration.
Abstract:
There is increasing concern within the scientific community about a reproducibility crisis, particularly
in the field of bioinformatics. Published research results often fail to translate into clinical success. One
theory explaining this phenomenon is that findings from homogeneous cohort studies do not generalize
to an inherently heterogeneous population. In this work, we integrate data from four distinct tuberculosis (TB)
cohorts, for a total of 1164 samples, to find common differentially regulated genes that may be used to
distinguish active TB from latent TB, treated TB, other diseases, and healthy controls. Using random forest, we
selected 25 genes that achieved an AUC of 0.89 in our training data and 0.86 in our test data. Of these
25 genes, 18 had previously been associated with TB in independent studies, suggesting that integrating data may
be an important tool for increasing the reproducibility of microarray research.