new approach to style transfer that employs CNNs (Miah, 2023). Gauri et al. summarized the research on
using artificial intelligence technologies to filter,
diagnose, monitor, and disseminate information
about COVID-19 through human audio signals.
This overview will help in developing automated systems that support COVID-19-related efforts by utilizing non-invasive, user-friendly biosignals in human verbal and non-verbal audio (Deshpande, 2022). One of the most important directions in this field is the prediction of music popularity; some representative examples follow. HuaFeng et al. developed a model for
predicting song popularity that combines multimodal
feature fusion with LightGBM. The model consists of
a LightGBM component, a multimodal feature
extraction framework and a logistic regression
component (Zeng, 2022). Notably, the research by
Seon et al. empirically examined how acoustic
features enhance the likelihood of songs reaching the
top 10 on the Billboard Hot 100, analyzing data from
6,209 unique songs that appeared on the chart
between 1998 and 2016, with a particular emphasis
on acoustic features supplied by Spotify (Kim, 2021). The research by Bang Dang et al. focuses on predicting the rankings of popular songs over the following six months. The dataset, used for the Hit Song
Prediction problem in the Zalo AI Challenge 2019,
includes not only songs but also details like
composers, artist names, release dates, and more. The
paper advocates for treating hit song prediction as a
ranking problem using Gradient Boosting techniques,
rather than the typical regression or classification
methods employed in previous studies. The optimal model demonstrated strong performance in predicting whether a song would become a top-ten dance hit rather than reach only lower-ranked positions (Pham, 2020).
Thanks to the robust development of this field, this paper likewise employs AI algorithms for popularity prediction. To achieve this objective, the study utilizes extensive streaming data, including official metrics such as Spotify's track play counts, together with relevant datasets from Kaggle.
Experimental results validate the effectiveness of the
proposed methods.
2 METHODS
2.1 Dataset Preparation
The dataset selected for this paper is a Spotify Songs dataset recording 114,000 songs together with their popularity, artists, genre, duration, and other attributes.
These features can be used to predict a song's popularity and to explore how they influence it. Additionally, this study
conducted an online search for streaming play counts
and popularity data for singles from 2004 to 2024. To
account for regional differences, data was collected
primarily from Spotify, YouTube Music, and QQ
Music. These datasets served as another critical source of information; using them, the study conducts a regression task to examine the relationship between play counts and a song's popularity.
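As an illustration of this regression task, the following sketch fits a simple linear regression with scikit-learn; the file name and the column names ("play_count", "popularity") are hypothetical placeholders rather than the exact fields of the collected data.

# Illustrative sketch of the play-count regression; file and column names are assumed placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

streams = pd.read_csv("streaming_play_counts.csv")  # hypothetical merged play-count data
X = streams[["play_count"]]                         # independent variable: streaming play counts
y = streams["popularity"]                           # dependent variable: popularity score

reg = LinearRegression().fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_, "R^2:", reg.score(X, y))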
After cleaning the data, this paper selected all features other than popularity as the independent variables. Only popularity was chosen as the dependent variable, since it is the target that this study aims to predict. For data preprocessing, this paper normalized the features before training. To properly evaluate the model's performance, the dataset must be split into training and testing sets. This paper uses the "train_test_split" function from the "sklearn.model_selection" module, allocating 80% of the data to the training set and 20% to the testing set.
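A minimal sketch of this preprocessing and splitting step is shown below; it assumes the Spotify Songs dataset is loaded from a hypothetical CSV file, uses standardization as one possible form of normalization, and drops non-numeric columns (e.g., artists, genre) for brevity, although in practice they would be encoded.

# Sketch of the preprocessing described above; file name and normalization choice are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("spotify_songs.csv")          # hypothetical file name for the Kaggle dataset
y = df["popularity"]                           # dependent variable (prediction target)
X = df.drop(columns=["popularity"])            # independent variables: all other features
X = X.select_dtypes(include="number")          # keep numeric features; categorical ones would need encoding

X_scaled = StandardScaler().fit_transform(X)   # normalization step

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)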
2.2 Machine Learning Models-Based
Prediction
For modeling, this study selected three different models: Random Forest Regressor (RF), Gradient Boosting Machines (GBM), and Simple Linear Regression.
2.2.1 Random Forest
Firstly, RF, shown in Figure 1, is an ensemble method that constructs multiple decision trees and merges their results. It leverages bootstrapping and feature randomness to enhance model performance and reduce overfitting. Its methodology includes ensemble construction, which generates multiple decision trees from bootstrap samples of the training data. In addition, each tree is trained on a distinct subset of the data, which helps to minimize variance and prevent overfitting. The reasons this study chose it are as follows: 1. it is a powerful ensemble learning method; 2. it effectively handles both linear and non-linear relationships; 3. it offers robustness against overfitting, especially in datasets with many features.
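For concreteness, the sketch below fits scikit-learn's RandomForestRegressor on the split prepared in Section 2.1 (reusing the X_train, X_test, y_train, y_test variables from the earlier sketch); the hyperparameter values are illustrative defaults rather than the settings tuned in this study.

# Illustrative Random Forest regression; hyperparameters are example values, not tuned settings.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rf = RandomForestRegressor(
    n_estimators=100,      # number of trees grown on bootstrap samples
    max_features="sqrt",   # random feature subset considered at each split
    random_state=42)
rf.fit(X_train, y_train)   # training split from Section 2.1

pred = rf.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))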