Authors:
Karolina Widzisz (1); Mateusz Kania (2); Joanna Zyla (3) and Andrzej Polański (1)
Affiliations:
(1) Department of Computer Graphics, Vision and Digital Systems, Silesian University of Technology, Akademicka 16, Gliwice, Poland
(2) Department of Applied Informatics, Silesian University of Technology, Akademicka 16, Gliwice, Poland
(3) Department of Data Science and Engineering, Silesian University of Technology, Akademicka 16, Gliwice, Poland
Keyword(s):
scRNA-seq, Clustering Performance, Binary Data, Data Information Reduction.
Abstract:
The primary objective of this study was to test the hypothesis that binary information on the presence or absence of gene expression is sufficient to capture the heterogeneity of single-cell RNA sequencing (scRNA-seq) data. Under this hypothesis, cellular diversity can be characterized even without detailed expression levels. Such a reduction is particularly advantageous for large datasets, which are common in scRNA-seq. In this paper, we evaluate the clustering performance and cluster separability of model-based and distance-based methods applied to both expression-level data and threshold-encoded binarized data. We examined the Bernoulli-mixture model and the Gaussian-mixture model and compared them against traditional clustering techniques (hierarchical clustering, K-means, and the Louvain algorithm) on a range of scRNA-seq datasets. Our findings reveal that mixture models depend less on the specific dataset than distance-based methods do, and that they are more effective at estimating the number of clusters present in the data. Among the analyzed algorithms, the Bernoulli-mixture model stands out, outperforming distance-based approaches in several key aspects. Binary data, i.e., the presence or absence of gene expression, thus appear adequate to capture the heterogeneity of scRNA-seq data when clustered with methods designed specifically for binary datasets. This finding opens up new possibilities for simplifying data analysis in scRNA-seq studies without compromising the accuracy of the results.
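To make the idea of threshold-encoded binarization followed by Bernoulli-mixture clustering concrete, the following is a minimal sketch and not the authors' implementation. The function names `binarize` and `fit_bernoulli_mixture`, the default threshold of 0, the use of plain expectation-maximization, and the random initialization are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def binarize(counts, threshold=0):
    """Encode expression as presence (1) or absence (0).

    The cutoff `threshold` is an assumed parameter; the abstract only
    states that the data are threshold-encoded.
    """
    return (np.asarray(counts) > threshold).astype(np.float64)


def fit_bernoulli_mixture(X, n_components, n_iter=100, seed=0, eps=1e-6):
    """Fit a Bernoulli mixture to a binary cells-by-genes matrix with EM.

    Returns mixing weights, per-component gene "on" probabilities,
    and a hard cluster label for each cell.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    weights = np.full(n_components, 1.0 / n_components)
    # Random initialization away from 0 and 1 (an illustrative choice).
    probs = rng.uniform(0.25, 0.75, size=(n_components, n_genes))

    for _ in range(n_iter):
        # E-step: log of weight_k * prod_g p_kg^x (1 - p_kg)^(1 - x).
        log_p = (
            X @ np.log(probs).T
            + (1.0 - X) @ np.log1p(-probs).T
            + np.log(weights)
        )
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update weights and per-gene probabilities.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        probs = np.clip((resp.T @ X) / nk[:, None], eps, 1.0 - eps)

    return weights, probs, resp.argmax(axis=1)


# Toy example: two groups of cells expressing disjoint gene sets.
counts = np.zeros((40, 20))
counts[:20, :10] = 5   # first group expresses genes 0..9
counts[20:, 10:] = 3   # second group expresses genes 10..19
X = binarize(counts)
weights, probs, labels = fit_bernoulli_mixture(X, n_components=2)
```

In this sketch the expression magnitudes (5 versus 3) are discarded by `binarize`, and the clustering relies only on which genes are detected in each cell. Selecting the number of components, for example with an information criterion, is left out here.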