enterprise customer retention rate and proposes
solutions to improve it
(Schmidt, Kabir, et al. 2022). With the development
of Internet technology and online shopping models,
the traditional e-commerce industry has gradually
begun to transform and upgrade to digitalization,
intelligence and automation. In this process, big data
and machine learning technologies have gradually
been integrated into the e-commerce industry and
play an important role. Both can help companies
identify and predict potential customer churn risks,
which is why this paper adopts them (Swamy, 2022).
This paper takes a live streaming e-commerce
platform as its research object. It first collects the
platform's user data, related product data, marketing
campaign data, and user purchase behavior data as
the data source, then uses the Spark framework to
preprocess the data, and finally uses the random
forest and XGBoost algorithms to construct a
customer churn risk prediction model (Swamy, 2022).
2 RESEARCH METHODS
This paper uses three machine learning classification
methods (decision tree, support vector machine, and
random forest) for predictive analysis of live
streaming e-commerce data (Thipparaju, Sushmitha
et al. 2022).
Data preprocessing includes labeling the samples and
dividing the dataset into a training set and a test set.
The data are also normalized to improve model
training performance. When building a model, the
parameters must be tuned to balance model accuracy
against computational complexity. In this study, to
facilitate further research, a decision tree algorithm
was selected to construct the customer churn risk
prediction model (Yang, Lee, et al. 2022).
Decision trees are favored for their simple structure,
wide applicability, and robustness, and they require
relatively little data preprocessing. Model
performance is measured by classification accuracy,
the area under the ROC curve (AUC), and the mean
absolute error (MAE). In general, the closer the AUC
is to 1, the better the model's classification
performance, and the smaller the MAE, the better the
model performance.
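As an illustration of the two metrics named above, the following is a minimal sketch rather than the evaluation code used in this study. It assumes binary labels (1 = churned, 0 = retained) and predicted churn probabilities; the function names and the sample values are hypothetical. AUC is computed by pairwise comparison, as the fraction of positive-negative pairs in which the positive sample receives the higher score.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_absolute_error(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average absolute gap between the true label and the predicted score."""
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and of equal length")
    return sum(abs(y - s) for y, s in zip(labels, scores)) / len(labels)


# Hypothetical data (not from the paper): 1 = churned, 0 = retained.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

auc = roc_auc(y_true, y_score)
mae = mean_absolute_error(y_true, y_score)
print(f"AUC = {auc:.3f}, MAE = {mae:.3f}")
```

The pairwise form is quadratic in the number of samples, so it is suitable only for illustration; on a large dataset a rank-based computation or a library routine would be the usual choice.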
3 RESEARCH PROCESS
When processing the large live streaming e-commerce
dataset, a large amount of unstructured data must be
preprocessed through data cleaning, feature
extraction, and data standardization. The detailed
steps are as follows. First, the data are cleaned: given
the characteristics of live streaming e-commerce,
users' purchase records are carefully cleaned, and
abnormal data and duplicate records are removed to
improve the accuracy of data processing. Next,
feature extraction organizes user attributes such as
gender, age, and watch time, which serve as features
for subsequent modeling. After feature extraction, a
predictive model is built to predict the purchase
behavior of new users and prevent customer churn.
Customer
churn risk prediction is divided into two categories,
classification and regression. For classification
problems, random forest and related models, such as
XGBoost, XPadding, and XGBoost+XGBoost, are
used to predict whether new users will buy products.
For the regression problem, the logistic regression
algorithm is used to make predictions. Given the
large number of attributes and the large amount of
data in live streaming e-commerce, the logistic
regression algorithm is prone to overfitting, so the
K-means algorithm is used to cluster the data and
reduce the risk of overfitting. Finally, the performance
of the model
is evaluated by cross-validation using indicators such
as accuracy, recall, and F1 score. Based on the results
of this comprehensive evaluation, the optimal model
is selected and used to make recommendations for
new users.
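The cross-validated evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: the single feature (watch time in minutes), the sample values, and the threshold classifier standing in for the real models are all hypothetical. The sketch shows only how accuracy, recall, and F1 score are averaged over k held-out folds.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, int]  # (watch time in minutes, label); label 1 = purchased


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def fit_threshold(train: Sequence[Sample]) -> float:
    """Stand-in model: midpoint between the mean feature value of each class.

    Assumes the training data contains both classes.
    """
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (mean(pos) + mean(neg)) / 2


def classification_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, recall, and F1 score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "recall": recall, "f1": f1}


def cross_validate(data: Sequence[Sample], k: int) -> Dict[str, float]:
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    results = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        threshold = fit_threshold(train)
        y_pred = [1 if x > threshold else 0 for x, _ in test]
        y_true = [y for _, y in test]
        results.append(classification_scores(y_true, y_pred))
    return {name: mean(r[name] for r in results) for name in results[0]}


# Hypothetical samples (not from the paper).
data: List[Sample] = [
    (5.0, 0), (42.0, 1), (8.0, 0), (37.0, 1), (12.0, 0), (55.0, 1),
    (3.0, 0), (48.0, 1), (15.0, 0), (30.0, 1), (20.0, 0), (60.0, 1),
]

cv_result = cross_validate(data, k=3)
print(cv_result)
```

Comparing candidate models then amounts to running the same folds with each model in place of the threshold classifier and selecting the one with the best averaged scores.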
3.1 Data Collection and Pre-Processing
In the field of live streaming e-commerce, user data
plays a crucial role. In practice, however, data
collection and preprocessing are often flawed,
producing incomplete and non-standard data that
make model building harder. The work should
therefore begin with cleaning and organizing the
data, and build the model on that basis.
In this study, we used the R language to analyze the
user data of live streaming e-commerce, and selected
user browsing history as the main data input. Because
users' browsing history on live streaming e-commerce
platforms is recorded on an hourly rather than a daily
basis, the study adopted an innovative
approach to data cleaning and processing. The