Authors:
Maham Khokhar (1); Burcu Bakir-Gungor (2) and Malik Yousef (3)
Affiliations:
(1) Department of Data Science, Social Sciences Institute, Abdullah Gul University, Kayseri, 38080, Turkey
(2) Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, 38080, Turkey
(3) Department of Information Systems, Galilee Digital Health Research Center, Zefat Academic College, 13206, Zefat, Israel
Keyword(s):
Transcriptomics Data Analysis, Feature Selection, Machine Learning, Biomarker Discovery.
Abstract:
The advent of high-throughput transcriptomic technologies has generated vast transcriptomic datasets whose volume and complexity challenge current analytical methodologies. The Grouping-Scoring-Modeling (G-S-M) approach is a recent method that operates on groups (or clusters) of genes, combining prior biological knowledge with machine learning to detect the most significant groups for classification tasks. G-S-M may need to score thousands or even tens of thousands of groups, which can affect the speed and performance of the algorithm. In response, this study introduces the Pre-Scoring G-S-M model, an enhancement of the established G-S-M framework. This approach incorporates a Pre-Scoring component that uses the empirical Bayes methods of the Limma package to optimize the initial evaluation of transcriptomic data through a percentage-based selection of statistically significant gene groups. Aimed at reducing computational demand and streamlining feature selection, the model also addresses data redundancy by eliminating duplicate gene-disease associations. Application to nine human gene expression datasets from the GEO database showed promising results: compared to the traditional G-S-M approach, the model improved computational efficiency and analytical precision and reduced the number of features selected per dataset, without compromising accuracy. These initial findings highlight the Pre-Scoring G-S-M model's potential to enhance transcriptomic data analysis, indicating a promising direction for future bioinformatics research.
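The pre-scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. As assumptions made only for illustration: a plain Welch t-statistic stands in for Limma's moderated (empirical Bayes) statistics, a group's score is taken to be the mean absolute t-statistic of its genes, and the names `welch_t`, `pre_score_groups`, and `keep_percent` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two samples.

    Assumption: a simple stand-in for Limma's moderated t-statistic.
    Each sample needs at least two values (required by variance()).
    """
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if denom == 0 else (mean(a) - mean(b)) / denom


def pre_score_groups(expr, labels, associations, keep_percent=50):
    """Return the top `keep_percent` percent of groups, best first.

    expr:         dict mapping gene -> expression values, one per sample
    labels:       0/1 class label per sample, same order as the values
    associations: iterable of (group, gene) pairs, possibly with duplicates
    """
    # Drop duplicate (group, gene) associations while preserving order.
    unique_pairs = list(dict.fromkeys(associations))

    # Gene-level score: absolute t-statistic between the two classes.
    gene_score = {}
    for gene, values in expr.items():
        case = [v for v, y in zip(values, labels) if y == 1]
        ctrl = [v for v, y in zip(values, labels) if y == 0]
        gene_score[gene] = abs(welch_t(case, ctrl))

    # Collect the measured genes that belong to each group.
    members = {}
    for group, gene in unique_pairs:
        if gene in gene_score:
            members.setdefault(group, []).append(gene)

    # Group-level score: mean of member gene scores (an assumed aggregation).
    group_score = {
        group: mean([gene_score[g] for g in genes])
        for group, genes in members.items()
    }

    # Percentage-based selection: keep the best-scoring share of groups.
    ranked = sorted(group_score, key=group_score.get, reverse=True)
    n_keep = max(1, len(ranked) * keep_percent // 100)
    return ranked[:n_keep]
```

Only the groups returned by `pre_score_groups` would then be passed on to the more expensive scoring and modeling stages, which is where the reduction in computation would come from under these assumptions.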