4.2 Interpretability Insights: Smoking,
BMI, and Residual Patterns
The SHAP analysis (Figures 7–8) identifies smoking
status and BMI as the most influential predictors,
consistent with prior studies linking these variables to
chronic diseases and elevated healthcare utilization
(Panay et al., 2020). Smoking’s outsized impact
(SHAP value range: +5,000 to +20,000) reflects its
association with conditions such as lung cancer and
cardiovascular diseases, which incur high treatment
costs (Boodhun & Jayabalan, 2018). The positive
correlation between BMI and charges (SHAP range:
+1,000 to +8,000) may stem from obesity-related
comorbidities like diabetes and hypertension, which
drive long-term medical expenses (Jain & Singh,
2018). Notably, the interaction between smoking and
BMI—though not explicitly modeled—could amplify
risks, as smoking exacerbates metabolic dysfunction
in obese individuals (Vijayalakshmi et al., 2023).
Future work should incorporate interaction terms or
employ interpretable models like Generalized
Additive Models (GAMs) to disentangle these
effects.
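To illustrate why an unmodeled smoking and BMI interaction is hard to read off per-feature SHAP values, the following minimal sketch computes exact Shapley values for a two-feature toy cost function. Everything in it is an assumption made for illustration: `toy_charges`, its coefficients, the binary `obese` flag, and the single non-smoker, non-obese baseline are hypothetical and are not the fitted model or data from this study. SHAP implementations typically average over a background dataset rather than a single baseline.

```python
from itertools import permutations

# Hypothetical toy cost function (NOT the fitted model from this study):
# a base charge, additive smoking and obesity effects, and an interaction
# that only applies when both risk factors are present.
def toy_charges(smoker: int, obese: int) -> float:
    return 8000.0 + 12000.0 * smoker + 3000.0 * obese + 10000.0 * smoker * obese

def exact_shapley(f, instance: dict, baseline: dict) -> dict:
    """Exact Shapley values of f at `instance`, relative to `baseline`.

    Features not yet added to the coalition keep their baseline value.
    Averaging marginal contributions over all feature orderings is exact,
    and is cheap here because there are only two features.
    """
    names = list(instance)
    orderings = list(permutations(names))
    phi = {name: 0.0 for name in names}
    for order in orderings:
        current = dict(baseline)
        prev = f(**current)
        for name in order:
            current[name] = instance[name]
            new = f(**current)
            phi[name] += (new - prev) / len(orderings)
            prev = new
    return phi

baseline = {"smoker": 0, "obese": 0}   # assumed reference individual
instance = {"smoker": 1, "obese": 1}   # smoker who is also obese
phi = exact_shapley(toy_charges, instance, baseline)
print(phi)  # {'smoker': 17000.0, 'obese': 8000.0}
```

In this toy setting the 10,000 interaction effect is split evenly between the two features (5,000 each), so neither attribution equals its feature's main effect. This is the kind of entanglement that explicit interaction terms or GAM-style models would help separate.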
Residual patterns further reveal that ensemble
methods minimize systematic errors for high-cost
cases, whereas simpler models struggle with outliers.
This aligns with the findings of Jain and Singh
(2018), who emphasized that tree-based models
inherently handle skewed distributions through
hierarchical partitioning, unlike linear models that
rely on Gaussian error assumptions.
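The contrast described above can be sketched on synthetic data. The example below is a simplified illustration under stated assumptions, not a reproduction of the study's residual analysis: the `bmi` and `charges` values are made up, a depth-1 regression tree (a "stump") stands in for the ensemble models, and the 30,000 cut-off for "high-cost" cases is a hypothetical choice.

```python
from statistics import mean

# Synthetic, made-up data: charges jump sharply once BMI reaches 31.
bmi = [22, 24, 26, 28, 29, 31, 33, 35]
charges = [9000, 9500, 10000, 10500, 11000, 38000, 40000, 42000]

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def fit_stump(x, y):
    """Depth-1 regression tree: choose the split minimizing squared error."""
    best = None
    for t in sorted(set(x))[1:]:
        left = [b for a, b in zip(x, y) if a < t]
        right = [b for a, b in zip(x, y) if a >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((b - ml) ** 2 for b in left) + sum((b - mr) ** 2 for b in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return t, ml, mr

def high_cost_bias(predict, threshold=30000):
    """Mean signed residual (actual - predicted) over high-cost cases."""
    return mean(y - predict(x) for x, y in zip(bmi, charges) if y > threshold)

intercept, slope = fit_line(bmi, charges)
split, low_mean, high_mean = fit_stump(bmi, charges)

linear_bias = high_cost_bias(lambda x: intercept + slope * x)
stump_bias = high_cost_bias(lambda x: high_mean if x >= split else low_mean)
print(f"linear: {linear_bias:.0f}, stump: {stump_bias:.0f}")
```

On this toy data the linear fit underpredicts the high-cost cases on average (a positive mean residual), while the single partition at the jump leaves no systematic bias in that group. Real ensembles combine many such partitions, so the same mechanism applies at finer resolution.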
4.3 Limitations and Future Directions
While this study advances medical insurance cost
prediction, several limitations warrant attention. First,
the dataset (n=1,338) is relatively small and lacks
granular clinical variables (e.g., pre-existing
conditions, medication history), limiting the model’s
ability to capture nuanced health risks. Expanding
data sources to include electronic health records
(EHRs) or claims histories could improve predictive
granularity (Panda et al., 2022). Second, the study
focuses on tree-based models and polynomial
regression; alternative approaches like Bayesian
networks or transformer-based architectures remain
unexplored. Recent work by Ejiyi et al. (2022)
suggests that Bayesian methods excel in uncertainty
quantification, a valuable feature for risk-sensitive
applications like insurance pricing.
Finally, the ethical implications of using
behavioral features (e.g., smoking) for pricing require
careful consideration. While these variables improve
accuracy, they risk penalizing individuals for lifestyle
choices influenced by socioeconomic factors. Future
research should integrate fairness-aware machine
learning techniques to ensure equitable premium
calculations (Billa & Nagpal, 2024).
By addressing these limitations and building on
the interpretability frameworks established here,
subsequent studies can further bridge the gap between
predictive accuracy and ethical, transparent insurance
pricing.
5 CONCLUSION
This paper establishes a robust framework for
predicting medical insurance costs by harmonizing
advanced machine learning techniques with model
interpretability tools. Through systematic
comparisons of algorithms including XGBoost,
Random Forest, and polynomial regression, the
research demonstrates that ensemble methods
outperform traditional linear models in capturing
complex feature interactions, achieving an R² of 0.88.
Crucially, the integration of SHAP values provides
granular insights into the drivers of insurance costs—
notably smoking status and BMI—while residual
analysis validates the generalizability of these
models. By bridging predictive accuracy with
transparency, this work addresses a critical gap in
actuarial science, where interpretability is essential
for ethical pricing strategies and stakeholder trust.
This research validates the utility of machine
learning approaches, particularly ensemble methods
like XGBoost and Random Forest, in accurately
predicting insurance costs. By integrating SHAP
values, the study identifies significant predictors and
explains their impact on individual predictions
transparently. The findings highlight the importance
of behavioral factors, such as smoking status, in
determining insurance premiums, offering insights
into the complex interplay of health-related variables
and financial risk.
Financially, these findings have several
implications. Firstly, recognizing smoking status as a
critical cost driver underscores the need for insurers
to develop risk-adjusted premium structures that adequately
reflect behavioral health risks. This could
encourage healthier lifestyles among policyholders,
potentially reducing overall claims and enhancing the
financial sustainability of insurance portfolios.
Secondly, the study underscores the value of
advanced machine learning techniques in risk
assessment, offering insurers a tool to improve the
accuracy of their pricing strategies. Accurate cost