pregnancy history. LR assigned these variables
moderate importance (e.g., age odds ratio: 1.15,
p=0.02), consistent with cumulative metabolic wear or
hormonal shifts over time. In contrast, RF ranked
them lower, possibly because interactions between
age and other variables (e.g., age-specific glucose
thresholds) were overshadowed by stronger
predictors like BMI. This discrepancy
underscores the importance of model selection: LR’s
interpretability aids hypothesis testing (e.g., age as an
independent risk factor), while RF’s flexibility may
better reflect multifactorial risk profiles in
heterogeneous populations.

Both models agreed on the limited predictive value of
skinfold thickness and insulin levels. Skinfold
thickness, a proxy for subcutaneous fat, may lack
specificity compared to BMI, which also reflects
visceral fat, a more direct contributor to insulin
resistance. Similarly, insulin
levels alone might fail to capture dynamic feedback
mechanisms (e.g., pancreatic β-cell compensation)
critical in early diabetes stages. These findings
suggest that simple biomarkers (glucose, BMI) are
more useful in screening protocols than more
specialized measurements.

The complementary strengths of LR and RF argue for
their combined use in clinical practice. For instance,
LR could prioritize
high-risk patients based on glucose/BMI thresholds,
while RF might refine predictions by incorporating
subtle interaction effects (e.g., age-adjusted BMI
thresholds). Such integration could enhance
personalized prevention strategies, enabling early
interventions like lifestyle modifications or targeted
glucose monitoring. Future studies should validate
these models across diverse populations and explore
hybrid algorithms to balance interpretability and
predictive power (Zimmet et al., 2014).
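
To make the effect size reported above for age more concrete, the following sketch converts an odds ratio of 1.15 into a logistic-regression coefficient and compounds it over several units of age. It assumes the odds ratio applies per one-unit (one-year) increase in age, which the text does not state, and the 20% baseline risk is a purely hypothetical value chosen for illustration.

```python
import math

# Reported in the text: odds ratio for age in the LR model.
AGE_ODDS_RATIO = 1.15


def log_odds_coefficient(odds_ratio):
    """Logistic-regression coefficient (log-odds per unit) for an odds ratio."""
    return math.log(odds_ratio)


def compounded_odds_ratio(odds_ratio, units):
    """Odds ratio for a difference of `units` in the predictor."""
    return odds_ratio ** units


def shifted_probability(baseline_prob, odds_ratio, units):
    """Probability after multiplying the baseline odds by odds_ratio ** units."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * compounded_odds_ratio(odds_ratio, units)
    return new_odds / (1.0 + new_odds)


# ASSUMPTION: the odds ratio is per one year of age (not stated in the text).
beta_age = log_odds_coefficient(AGE_ODDS_RATIO)
or_ten_units = compounded_odds_ratio(AGE_ODDS_RATIO, 10)

# ASSUMPTION: a hypothetical 20% baseline risk, for illustration only.
risk_ten_units_older = shifted_probability(0.20, AGE_ODDS_RATIO, 10)

print(f"coefficient per unit: {beta_age:.4f}")
print(f"odds ratio over 10 units: {or_ten_units:.2f}")
print(f"risk after 10 units (baseline 20%): {risk_ten_units_older:.2%}")
```

Under these assumptions, a ten-unit difference in age multiplies the odds by roughly four, which illustrates how a modest per-unit odds ratio can still matter when ranking patients for screening.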
5 CONCLUSION
The findings of this study, while insightful, are
inherently constrained by the demographic
homogeneity of the dataset. All observations were
derived exclusively from female Pima Indians aged
21 and older, a population with a well-documented
genetic predisposition to metabolic disorders. While
this homogeneity reduces confounding, it severely
limits the generalizability of the models. For
instance, gender-related biological differences and
racial/ethnic variation in diabetes risk factors may
render the current models inapplicable to broader
populations. Future research must prioritize
ethnically diverse cohorts (including Asian, African,
and European ancestries) and balanced gender
representation to validate and refine these predictive
frameworks. Methodologically, advancements could
be achieved through feature engineering, ensemble
techniques, or deep learning architectures.
Additionally, the dataset’s class imbalance (a common
issue in medical datasets, where non-diabetic cases
dominate) could be addressed with techniques like
SMOTE or cost-sensitive learning to reduce bias
toward the majority class. Integrating real-world
clinical variables, such as dietary habits, physical
activity metrics, and polygenic risk scores, would
further bridge the gap between algorithmic
predictions and clinical utility. For example, wearable
device data could dynamically update risk
assessments based on lifestyle changes. Finally,
ethical considerations around data privacy and model
transparency must accompany technical
improvements. By adopting these strategies, future
studies can develop robust, equitable tools for
diabetes prevention, ultimately supporting
personalized healthcare interventions across diverse
global populations.
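
As a rough illustration of the two imbalance remedies mentioned above, the following sketch computes inverse-frequency class weights (one simple form of cost-sensitive learning) and generates SMOTE-style synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified stand-in written with the standard library only, not the reference SMOTE implementation. The class counts (500 non-diabetic, 268 diabetic) are those commonly reported for the public Pima dataset and are treated here as an assumption, and the (glucose, BMI) points are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Sort the remaining minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Choose one of the k nearest neighbours and interpolate toward it.
        mate = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


# ASSUMPTION: class counts commonly reported for the public Pima dataset.
labels = [0] * 500 + [1] * 268
weights = balanced_class_weights(labels)

# HYPOTHETICAL minority-class points as (glucose, BMI) pairs.
minority_points = [(148.0, 33.6), (183.0, 23.3), (137.0, 43.1), (197.0, 30.5)]
synthetic_points = smote_like_oversample(minority_points, n_new=5)

print(weights)
print(synthetic_points)
```

The weights could be passed to any learner that accepts per-class costs, while the synthetic points would be added to the training split only, so that evaluation still reflects the original class distribution.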
ACKNOWLEDGEMENTS
All the authors contributed equally, and their names
are listed in alphabetical order.
REFERENCES
Carlsson, S., Hammar, N., & Grill, V., 2005. Alcohol
consumption and type 2 diabetes: Meta-analysis of
epidemiological studies indicates a U-shaped
relationship. Diabetologia, 48, 1051–1054.
De Luis, D. A., Fernandez, N., Arranz, M. L., Aller, R.,
Izaola, O., & Romero, E., 2005. Total homocysteine
levels relation with chronic complications of diabetes,
body composition, and other cardiovascular risk factors
in a population of patients with diabetes mellitus type 2.
Journal of Diabetes and its Complications, 19(1), 42–46.
Dehghan, A., Kardys, I., de Maat, M. P., Uitterlinden, A.
G., Sijbrands, E. J., Bootsma, A. H., ... & Witteman, J.
C., 2007. Genetic variation, C-reactive protein levels,
and incidence of diabetes. Diabetes, 56(3), 872–878.
Deshpande, A. D., Harris-Hayes, M., & Schootman, M.,
2008. Epidemiology of diabetes and diabetes-related
complications. Physical Therapy, 88(11), 1254–1264.
Gale, E. A., & Gillespie, K. M., 2001. Diabetes and gender.
Diabetologia, 44, 3–15.
Gulati, S., & Misra, A., 2014. Sugar intake, obesity, and
diabetes in India. Nutrients, 6(12), 5955–5974.
Kaggle Datasets, n.d. Pima Indians Diabetes Database.
Available at: