2 LITERATURE REVIEW
Effective prediction of restaurant revenue has received
considerable attention in recent years. Numerous
studies have applied machine learning and statistical
techniques to this complex problem. These studies
highlight the growing interest in leveraging advanced
computational techniques to provide more accurate and
reliable financial forecasts, which are crucial for the
operational success of businesses in the restaurant
industry.
Siddamsetty et al. (2021) predicted restaurant
revenue using different machine learning methods.
Their study evaluated the effectiveness of several
commonly used machine learning algorithms in revenue
prediction, including the CatBoost and Random Forest
algorithms, alongside typical regression models. The
results indicated that Random Forest and CatBoost
could significantly improve prediction accuracy
compared with traditional Bayesian linear regression,
especially when dealing with large-scale,
multidimensional datasets. This study provides
important insights into how to select and optimize
machine learning algorithms for more efficient
predictions.
Bera (2021) approached the topic from an
operational analytics perspective, studying the
application of machine learning algorithms to
predicting restaurant sales revenue. The study
emphasized the importance of selecting truly relevant
features and demonstrated how various machine
learning algorithms (such as random forests and
gradient boosting machines) can be used to analyze
and predict restaurant sales revenue. Several
criteria, such as relevance, were tried for comparing
and choosing features. For model selection, different
basic regression models and ensemble models were
tested, providing a solid basis for future studies.
Gogolev and Ozhegov (2019) have compared the
performance of different machine learning algorithms
in restaurant revenue prediction. In their work, they
examined the differences in the performance of
various algorithms (such as random forests, elastic
net, and support vector regression) on different
datasets and discussed the strengths and limitations of
each algorithm. The study found that while all
algorithms can effectively predict sales under certain
conditions, some algorithms perform better when
dealing with certain types of data. They concluded
that support vector regression (SVR) and random
forests (RF) outperformed linear regression.
Parh et al. (2021) further explored the application
of supervised learning methods in predicting
restaurant sales. Their study not only compared the
performance of different supervised learning
algorithms, but also suggested several possible
improvements to increase the accuracy of these
algorithms in real-world applications. The results
showed that Lasso regression outperformed the others
in terms of prediction accuracy.
In summary, the existing literature indicates that
machine learning techniques have promising
applications in predicting restaurant sales. However,
previous studies appear to have made little use of
p-values to determine the relevance of variables.
This paper therefore uses p-values to select relevant
features before building the model and presents the
results.
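As an illustration of p-value based feature screening,
the following minimal Python sketch fits an ordinary
least squares model and keeps the features whose
coefficient p-value falls below a threshold. It is a
sketch under stated assumptions, not the exact procedure
of this paper: the synthetic data, the feature names
(rating, meal_price, unrelated), the helper names and the
0.05 threshold are all hypothetical choices made for the
example.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Two-sided t-test p-values for each OLS slope (intercept excluded)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = (resid @ resid) / dof                 # unbiased residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)          # covariance of the estimates
    se = np.sqrt(np.diag(cov))                     # standard errors
    p = 2.0 * stats.t.sf(np.abs(beta / se), dof)   # two-sided p-values
    return p[1:]                                   # drop the intercept


def select_features(X, y, names, alpha=0.05):
    """Keep the names whose p-value is below alpha (hypothetical threshold)."""
    p_values = ols_p_values(X, y)
    return [name for name, p in zip(names, p_values) if p < alpha]


# Synthetic data for illustration only: two informative features, one unrelated.
rng = np.random.default_rng(0)
n_samples = 500
X = rng.normal(size=(n_samples, 3))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n_samples)
names = ["rating", "meal_price", "unrelated"]

p_values = ols_p_values(X, y)
selected = select_features(X, y, names)
print(dict(zip(names, p_values)))
print(selected)
```

A small p-value indicates that the data give little
support to a zero coefficient, so the corresponding
feature is retained. Categorical variables would need to
be encoded numerically before being passed to such a
routine.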
3 METHODOLOGIES
Predicting a restaurant's annual turnover is a classic
regression problem, as the aim is to predict the value
of the restaurant's annual turnover based on several
features of the restaurant. In this article, to compare
the fitting performance of a tree-based model with that
of a simple regression model, multiple regression and
random forest are chosen to represent the regression
method and the tree-based model, respectively.
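The comparison described above can be sketched in Python
with scikit-learn. The example below is only a minimal
illustration on synthetic data: the data-generating
formula, the 80/20 split, the number of trees and the use
of R^2 as the score are assumptions made for the example,
not settings taken from this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for restaurant features and revenue (illustration only).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(1000, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

# Hold out 20% of the samples to compare the models on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "multiple_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Evaluating both models on the same held-out split keeps
the comparison fair, since neither model is scored on
data it was fitted to.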
3.1 Summary of the Whole Dataset
The dataset used in this study was downloaded from
Kaggle. The first step is to summarize the whole
dataset to obtain an overview of its contents. Table 1
shows the results.
According to the table, the dataset contains 8368
non-null samples. Each sample has 15 characteristics,
including 3 categorical variables (location, cuisine,
parking availability) and 11 numerical variables. The
numerical variables cover different aspects of the
restaurant, for example, its rating, its average meal
price and the chef's years of experience. The data
therefore describe each restaurant from many angles.
Revenue is the target variable to be predicted. Based
on this dataset, the final goal is to select the useful
variables from these characteristics to predict the
annual revenue of the restaurant.
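A summary of this kind can be produced with pandas. The
sketch below is illustrative only: the four rows are
invented toy values rather than records from the Kaggle
dataset, and the column names and the helper
summarize_dataset are hypothetical. In practice the
DataFrame would be loaded from the downloaded file
instead.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from the actual dataset.
df = pd.DataFrame(
    {
        "Location": ["Downtown", "Rural", "Suburban", "Downtown"],
        "Cuisine": ["Italian", "Indian", "Mexican", "Japanese"],
        "Parking Availability": ["Yes", "No", "Yes", "No"],
        "Rating": [4.0, 3.2, 4.7, 3.9],
        "Average Meal Price": [45.0, 20.0, 32.5, 60.0],
        "Chef Experience Years": [10, 3, 7, 15],
        "Revenue": [650000.0, 210000.0, 480000.0, 900000.0],
    }
)


def summarize_dataset(df, target):
    """Count rows and split the predictors into categorical and numerical."""
    predictors = [c for c in df.columns if c != target]
    numerical = [c for c in predictors if pd.api.types.is_numeric_dtype(df[c])]
    categorical = [c for c in predictors if c not in numerical]
    return {
        "n_rows": len(df),
        "n_complete_rows": int(df.notna().all(axis=1).sum()),  # rows without nulls
        "categorical": categorical,
        "numerical": numerical,
    }


summary = summarize_dataset(df, target="Revenue")
print(summary)
print(df.describe())  # basic statistics of the numerical columns
```

Separating the categorical from the numerical predictors
at this stage is convenient because the categorical ones
must be encoded before they can enter a regression model.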