distillation to build a streamlined model of the student
model to ensure the accuracy of the teacher model
(Liu et al., 2024). Earlier, Michael T. Lash and
Kang Zhao used social network analysis and text
mining to examine a film's cast and release season
and, from these, to predict the film's profitability
(Lash & Zhao, 2016). N. Darapaneni et al. worked with
film plot outlines, tested several models, and chose
XGBoost as the final film success prediction algorithm
(Darapaneni et al., 2020). In contrast, Rijul Dhir and
Anand Raj built their data set from a film's release
data rather than from the opinions of critics and
others on whether the film would succeed. Their final
model's predictive accuracy was not very high, and
they proposed including analysis of social media user
comments in future work (Dhir & Raj, 2018). Rather
than simply choosing a single best model, Vedika Gupta
et al. integrated and compared different models before
using them to predict a film's success (Gupta et al.,
2023).
At the same time, social media analysis can further
improve the accuracy of the prediction model and thus
provide more scientific decision-making support for
film investors. Social media is an important platform
for modern information dissemination and contains a
large amount of user-generated content, including film
reviews and topic discussions. This data reflects the
audience's real-time feedback on a film, as well as
their emotional inclination, and is therefore of high
research value (Castillo et al., 2021).
Based on this, this paper examines the correlation
between a film's rating and the various factors that
influence its popularity, and develops a prediction
model.
2 METHOD
2.1 Research Design
The purpose of this paper is to explore the potential
value and underlying patterns of film-related data.
Through a comprehensive analysis of multi-dimensional
data such as film release time and audience rating,
this study seeks to reveal the relationship between
film popularity and these factors, as well as the
degree of influence (the weight) of each factor, in
order to establish a film rating prediction model.
Meanwhile, social media analysis can be used to
capture audience expectations, i.e., the social media
buzz, before a film's release, and thus to help
predict the film's rating and popularity.
Because the factors that influence a film's popularity
are complex, it is difficult for a simple prediction
model to produce accurate results (Geng & Guo, 2021;
He & Yuan, 2021). First, this paper obtains the film
data set from Kaggle, which was chosen because it is a
relatively authoritative and data-rich website whose
data sets are widely used by scholars in various
fields for academic research. Null values and outliers
are then removed from the data set. To explore the
factors that affect the popularity of films, this
paper carries out statistical analysis and correlation
analysis. Because popularity itself is difficult to
express in numerical terms, the audience rating is
selected as the indicator of popularity and used as
the dependent variable in the correlation analysis;
the influencing factors are then identified by
combining the correlation results with the statistical
analysis. The first step of this paper is to
conduct a descriptive statistical analysis of the data
to explore the relationship between film scores and
the non-numerical items, namely actors, directors, and
specific actor-director combinations. The second step
is to conduct a correlation analysis between the
numerical items and film scores and to use the results
to establish a prediction model of film popularity
(Abidi et al., 2020). Because the ultimate goal is to
predict the popularity of films, popularity is defined
on five levels, namely terrible, poor, satisfactory,
good, and excellent. The research design is shown in
Fig. 1.
Figure 1. Research design.
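The cleaning, correlation, and five-level labelling steps described in
this section can be illustrated with a minimal sketch. It is not the
implementation used in this paper. The column names (rating, runtime),
the 1.5 x IQR outlier rule, the use of Pearson correlation, and the
equal-width thresholds on a 0-10 rating scale are all assumptions made
for illustration, since the text does not specify them.

```python
import pandas as pd

# Five popularity levels named in the research design.
LEVELS = ["terrible", "poor", "satisfactory", "good", "excellent"]


def clean(df, numeric_cols):
    """Drop rows with nulls, then rows outside 1.5 * IQR on any numeric column.

    The IQR rule is an assumed outlier criterion, not one stated in the paper.
    """
    df = df.dropna(subset=numeric_cols)
    q1 = df[numeric_cols].quantile(0.25)
    q3 = df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    inside = (
        (df[numeric_cols] >= q1 - 1.5 * iqr)
        & (df[numeric_cols] <= q3 + 1.5 * iqr)
    ).all(axis=1)
    return df[inside]


def correlations_with_rating(df, numeric_cols, target="rating"):
    """Pearson correlation of each numeric feature with the audience rating."""
    features = [c for c in numeric_cols if c != target]
    return df[features].corrwith(df[target])


def rating_to_level(rating):
    """Map a 0-10 rating onto five equal-width levels (assumed thresholds)."""
    return pd.cut(
        rating,
        bins=[0, 2, 4, 6, 8, 10],
        labels=LEVELS,
        include_lowest=True,
    )
```

In this sketch the rating serves as the dependent variable, as in the
design above, and the correlation of each numeric feature with it gives
a first indication of which factors merit inclusion in the prediction
model.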
2.2 Data Collection
To predict the popularity of movies more accurately, a
complete, high-quality data set is needed. The data
set used in this paper contains 880,990 original
records across 27 dimensions, such as