commercial health insurance in China highlights the
current research on the impact of aging on insurance
purchase intention and believes that participation of
middle-aged and elderly people in health insurance
plans has a positive impact on their physical condition
(Huang et al., 2022). Bhawani analyzed its impact on
the health service industry in terms of enterprise
competition and the supply of production factors, and
proposed strengthening financial support for the
industry and the guiding role of the government to
avoid vicious competition (Bhawani, 2010). Zhao & Li
(2024) argue that consumers' participation in
commercial medical insurance is mainly influenced by
subjective factors such as gender, job category, and
economic status, and by objective factors such as the
management and operation of insurance companies.
According to Davilla & Jones (2025), forward-
looking policies aimed at improving the efficiency of
healthcare systems through better control and
distribution of healthcare services require information
from the general population. Existing research has not
distinguished the impact of various factors on
insurance demand among different insured
individuals. This study establishes four linear
regression models, selects the optimal one, and
examines the key factors and characteristics that
influence the insurance purchasing decisions of
insured individuals or their family members.
2 PRINCIPLES OF MULTIPLE LINEAR REGRESSION AND MODEL SELECTION CRITERIA
Multiple linear regression builds a regression equation
with several independent variables and one dependent
variable, and uses the values of the independent
variables to explain and predict the value of the
dependent variable. Predicting the dependent variable
from a well-chosen set of independent variables is
generally more accurate than predicting it from a
single independent variable. The core of the model is
to find the function that minimizes the sum of squared
differences between the predicted and true values.
The multiple linear regression model is usually used
to describe the stochastic linear relationship between
the dependent variable Y and the independent
variables X. Its use requires three conditions. First,
there must be a stochastic linear relationship between
Y and X. Second, the observations of Y must be
mutually independent. Third, the residuals should
follow a normal distribution with mean 0 and variance
σ²; that is, for any group of observations of the
independent variables, the dependent variable Y has
the same variance and follows a normal distribution
(Lu et al., 2025).
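The least-squares idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the estimation procedure used in this paper: the helper names (solve_linear_system, fit_ols, predict) and the toy data are assumptions introduced here, and the normal equations are solved with plain Gaussian elimination so that only the standard library is needed.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular, i.e. the independent
    variables are not perfectly collinear.
    """
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, ..., bk] minimizing the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]  # prepend the constant term
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)  # normal equations (X'X) b = X'y


def predict(beta, row):
    """Predicted value of the dependent variable for one observation."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


# Hypothetical noise-free data generated from y = 1 + 2*x1 - 0.5*x2.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, 0.5, 2.5, 4.5]
beta = fit_ols(rows, y)
print([round(b, 6) for b in beta])
```

Because the toy data contain no noise, the fitted coefficients coincide with the generating ones up to floating-point error; with real data the residuals would be non-zero and the three conditions listed above would need to be checked.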
The purchase-factor prediction model for medical and
health insurance studied in this paper addresses a
regression problem. The adjusted coefficient of
determination (R²_adj), the predicted coefficient of
determination (R²_pred), the root mean square error
(RMSE), the prediction residual error sum of squares
(PRESS), and Mallows' Cp statistic are used to
evaluate the prediction model. These indicators were
calculated for the original model and the new model
and compared one by one, and the optimal model was
finally selected (Qin, 2024).
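For reference, the five criteria are commonly defined as shown here. The notation is an assumption introduced for this sketch and does not come from the original text: n is the number of observations, p the number of model parameters (independent variables plus the constant term), \hat{y}_i the fitted value, \hat{y}_{(i)} the prediction for observation i from a model fitted without that observation, and \hat{\sigma}^2_{full} the residual mean square of the model containing all candidate variables. Software packages differ on the RMSE divisor (n or n - p); the version with n is shown.

```latex
\begin{aligned}
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \\
R^2_{\mathrm{adj}} &= 1 - \frac{\mathrm{SSE}/(n-p)}{\mathrm{SST}/(n-1)}, \\
\mathrm{PRESS} &= \sum_{i=1}^{n} \bigl(y_i - \hat{y}_{(i)}\bigr)^2, \\
R^2_{\mathrm{pred}} &= 1 - \frac{\mathrm{PRESS}}{\mathrm{SST}}, \\
\mathrm{RMSE} &= \sqrt{\mathrm{SSE}/n}, \\
C_p &= \frac{\mathrm{SSE}}{\hat{\sigma}^2_{\mathrm{full}}} - n + 2p.
\end{aligned}
```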
Adjusted R-squared adjusts R-squared for the number
of independent variables and measures how well the
independent variables explain the dependent variable.
Predicted R-squared measures how well a linear
regression model fits and predicts new data that were
not used for model training. In practical applications,
both values usually lie between 0 and 1: the closer to
1, the better the model fits the data; the closer to 0,
the worse the fit. RMSE quantifies the typical size of
the model's prediction error; the smaller the value,
the higher the prediction accuracy. PRESS evaluates
the generalization ability of the model and reflects its
prediction error on new data; the smaller the value,
the stronger the predictive ability of the model and
the lower the risk of overfitting. The Cp value
measures the combined effect of model bias and
variance. The closer Cp is to p, the number of model
parameters (the independent variables plus the
constant term), the better the model.
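The following Python sketch shows one way these five indicators could be computed for a candidate model. It is an illustration under stated assumptions rather than the procedure used in this paper: the function names and the small data set are hypothetical, PRESS is obtained by explicit leave-one-out refitting, RMSE divides by n, and Cp uses the residual mean square of the full model as its variance estimate.

```python
import math


def solve(a, b):
    """Gaussian elimination with partial pivoting (assumes a non-singular system)."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit(rows, y):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    d = [[1.0] + list(r) for r in rows]
    p = len(d[0])
    xtx = [[sum(v[i] * v[j] for v in d) for j in range(p)] for i in range(p)]
    xty = [sum(v[i] * t for v, t in zip(d, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def sse(rows, y):
    """Residual sum of squares of the model fitted on (rows, y)."""
    beta = fit(rows, y)
    return sum((t - predict(beta, r)) ** 2 for r, t in zip(rows, y))


def selection_metrics(rows, y, sigma2_full):
    """Adjusted R2, predicted R2, RMSE, PRESS and Mallows' Cp for one model."""
    n, p = len(y), len(rows[0]) + 1  # p counts the constant term
    resid_ss = sse(rows, y)
    mean_y = sum(y) / n
    sst = sum((t - mean_y) ** 2 for t in y)
    press = 0.0
    for i in range(n):  # refit without observation i, then predict it
        beta_i = fit(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(beta_i, rows[i])) ** 2
    return {
        "r2_adj": 1 - (resid_ss / (n - p)) / (sst / (n - 1)),
        "r2_pred": 1 - press / sst,
        "rmse": math.sqrt(resid_ss / n),
        "press": press,
        "cp": resid_ss / sigma2_full - n + 2 * p,
    }


# Hypothetical data: compare a two-variable model with a one-variable model.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]
full_rows = list(zip(x1, x2))
sub_rows = [(a,) for a in x1]
sigma2_full = sse(full_rows, y) / (len(y) - 3)  # full model has 3 parameters
full = selection_metrics(full_rows, y, sigma2_full)
sub = selection_metrics(sub_rows, y, sigma2_full)
```

By construction the full model always has Cp equal to its own parameter count, so Cp is informative only for the reduced models being compared against it. Refitting the model n times is simple to read but slow for large samples; statistical packages usually obtain PRESS from the leverage values of a single fit.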
3 EMPIRICAL ANALYSIS
3.1 Data Collection and Processing
To study the factors affecting medical and health
insurance, this paper collects and collates from the
Kaggle website a statistical data set of factors
affecting the personal medical expenses charged by
health insurance and uses it for empirical analysis.
This data set is the first-hand