E-Commerce Sales Analysis and Prediction in UK

Xinyi Wu

College of Business, City University of Hong Kong, Hong Kong, China

Keywords: E-Commerce, Prediction, UK, Random Forest.

Abstract: In recent years, the e-commerce industry has grown rapidly. It gives customers access to a wide range of products and lets them make purchases from the convenience of their homes. Nevertheless, this growth has brought many new challenges to businesses, especially regarding sales, which is the most crucial aspect of e-commerce. To obtain insights and engineer features, the study adopted exploratory data analysis (EDA) and the RFM model. Subsequently, several predictive models, including Artificial Neural Network, Linear Regression, Decision Tree, and Random Forest, were used to forecast sales. The performance of these models was evaluated and compared using the coefficient of determination. The study's limitations and directions for future work are also discussed. This research contributes to a better understanding of e-commerce sales dynamics in the UK and could provide valuable insights for businesses and researchers in the field, thus improving sales prediction accuracy and decision making.

1 INTRODUCTION

In recent years, e-commerce has grown remarkably fast. The digital economy has expanded exponentially, giving customers access to an almost limitless range of products and services. Ease of use and accessibility have revolutionized shopping habits, allowing people to make purchases anytime and anywhere, free of location and time constraints. This has not only brought more convenience to consumers but also changed the corporate environment significantly. To reach a wider customer base, companies of all sizes, from small startups to massive multinational corporations, have rushed to create an online presence. The intensified competition has led to continuous innovation in delivery methods, customer service, marketing strategies, and so on. Furthermore, technological advancements in fields such as mobile applications, secure payment gateways, and personalized recommendation systems have further contributed to the growth of e-commerce. It has integrated seamlessly into people's lives, becoming an indispensable component of the contemporary shopping experience.

https://orcid.org/ 0009-0009-7648-3663

Despite this remarkable growth, existing studies in the field of e-commerce have left certain aspects unexplored or inadequately explored, which motivates the current research. This study focuses on the analysis and prediction of online retail operations that do not have brick-and-mortar stores. To this end, the study used a publicly available transactional dataset from the Kaggle website. It then employed exploratory data analysis and the recency, frequency, and monetary value (RFM) model to extract valuable insights and engineer effective features. Additionally, it applied a diverse range of methods for modelling and prediction: an artificial neural network, linear regression, a decision tree, and a random forest. By comparing the performance of these algorithms, the research aimed to determine the most appropriate approach for precise predictions in the e-commerce domain.

This research holds significant importance as it is

expected to fill the existing knowledge gaps, provide

a deeper understanding of online retail patterns, and

offer valuable guidance for businesses operating in

this highly competitive and rapidly evolving

landscape. It has the potential to enhance business

strategies, optimize customer experiences, and drive

the continued growth and success of e-commerce

enterprises.

Wu, X.

E-Commerce Sales Analysis and Prediction in UK.

DOI: 10.5220/0013268600004568

In Proceedings of the 1st International Conference on E-commerce and Artiﬁcial Intelligence (ECAI 2024), pages 447-452

ISBN: 978-989-758-726-9


2 LITERATURE REVIEW

In the realm of e-commerce, prediction has become a

crucial area of research. Previous studies have

utilized a variety of methods to forecast different

aspects of online retail. For instance, Zheng et al.

(2013) employed artificial neural networks to predict

customer behavior. This approach demonstrated the

power of machine learning algorithms in

understanding complex patterns in e-commerce data.

Similarly, Usmani et al. (2017) focused on

predicting sales in e-commerce, highlighting the

importance of accurate predictions for business

planning and decision-making. The use of advanced

analytics and data-driven models in these studies has

shown promising results in terms of improving the

efficiency and effectiveness of e-commerce

operations.

However, there are several aspects where the

present study differentiates itself. While existing

studies have often focused on a single prediction

method or a limited set of variables, this research

takes a comprehensive approach by incorporating

multiple algorithms such as Artificial Neural

Network, Linear Regression, Decision Tree, and

Random Forest. This allows for a more robust

comparison and identification of the most suitable

method for accurate predictions in the specific

context of online retail without physical stores.

Moreover, previous studies may not have fully

explored the potential of combining exploratory data

analysis and the RFM model. In this study, the use of

Exploratory Data Analysis helps in understanding the

characteristics and patterns of the dataset, providing a

solid foundation for further analysis. The RFM model

(Wei et al., 2010), with its segmentation of customers

based on Recency, Frequency, and Monetary value,

offers a detailed view of customer behavior and

allows for targeted strategies.

By integrating these techniques and a diverse set

of algorithms, this study aims to fill the gaps in

existing research and provide more valuable insights

and practical solutions for businesses in the rapidly

evolving field of e-commerce. It is expected to

contribute to a better understanding of online retail

patterns and enhance the ability to make accurate

predictions, ultimately leading to improved business

strategies and customer experiences.

3 METHODOLOGY AND RESULTS

3.1 Feature Engineering

3.1.1 EDA

Exploratory Data Analysis (EDA) is a fundamental approach in data analysis that summarizes the key features of a dataset. It involves a range of techniques for visualizing data, enabling a better understanding of aspects such as its distribution, structure, and relationships (Komorowski et al., 2016).

This methodology provides a foundation for extracting meaningful patterns and insights from complex datasets. It gives an initial understanding of the data's nature, thereby laying the groundwork for more in-depth analyses and informed decision-making. Total sales by month and by day are shown in Figure 1 and Figure 2, respectively.

• Total sales by month: December, November, October, and September have the highest sales of the year.

• Total sales by day: Sunday, Monday, Tuesday, Wednesday, Thursday, and Friday have the highest sales of the week.

Figure 1: Total Sales by Month.

Figure 2: Total Sales by Day.

Table 1: RFM Segmentation Summary.

| Segment | Recency mean | Recency count | Recency sum | Frequency mean | Frequency count | Frequency sum | Monetary mean | Monetary count | Monetary sum |
|---|---|---|---|---|---|---|---|---|---|
| About to sleep | 53.819 | 343 | 18460 | 1.201 | 343 | 412 | 441.32 | 343 | 151372.76 |
| At risk | 152.159 | 611 | 92969 | 3.074 | 611 | 1878 | 1188.878 | 611 | 726404.651 |
| Can't lose | 124.117 | 77 | 9557 | 9.117 | 77 | 702 | 4099.45 | 77 | 315657.65 |
| Champions | 7.119 | 663 | 4720 | 12.554 | 663 | 8323 | 6852.264 | 663 | 4543051.143 |
| Hibernating | 213.886 | 1015 | 217094 | 1.126 | 1015 | 1143 | 403.978 | 1015 | 410037.504 |
| Loyal customers | 36.287 | 742 | 26925 | 6.830 | 742 | 5068 | 2746.067 | 742 | 2037581.976 |
| Need attention | 53.266 | 207 | 11026 | 2.449 | 207 | 507 | 1060.357 | 207 | 219493.9 |
| New customers | 8.580 | 50 | 429 | 1.000 | 50 | 50 | 386.199 | 50 | 19309.960 |
| Potential loyalists | 18.793 | 517 | 9716 | 2.017 | 517 | 1043 | 729.511 | 517 | 377157.18 |
| Promising | 25.747 | 87 | 2240 | 1.000 | 87 | 87 | 367.087 | 87 | 31936.55 |
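As an illustration of how the monthly and weekday totals behind Figure 1 and Figure 2 might be produced, the following minimal sketch sums quantity times unit price per calendar month and per weekday. The sample rows, the timestamp format, and the function name `total_sales` are assumptions made for this example and are not taken from the study's dataset or code.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical transactions: (invoice timestamp, quantity, unit price).
rows = [
    ("2011-11-04 10:15", 6, 2.55),
    ("2011-11-06 12:30", 2, 7.95),
    ("2011-12-01 09:00", 10, 1.25),
    ("2011-12-02 14:45", 1, 4.25),
]


def total_sales(rows):
    """Sum quantity * unit price per calendar month and per weekday."""
    by_month = defaultdict(float)
    by_weekday = defaultdict(float)
    for stamp, quantity, unit_price in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        amount = quantity * unit_price
        by_month[when.strftime("%Y-%m")] += amount
        # weekday() returns 0 for Monday through 6 for Sunday.
        by_weekday[WEEKDAYS[when.weekday()]] += amount
    return dict(by_month), dict(by_weekday)


sales_by_month, sales_by_weekday = total_sales(rows)
print(sales_by_month)
print(sales_by_weekday)
```

Sorting either dictionary by value then gives the ranking of months or weekdays by total sales.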

3.1.2 RFM

The RFM (Recency, Frequency, Monetary) model

was adopted for its comprehensive view of customer

behavior. It helps identify engaged and at-risk

customers, understand loyalty through frequency, and

highlight valuable customers via monetary value. The

study used the RFM model to segment the customer

base into distinct groups, like "Champions" and

"About to sleep." This enabled analyzing group

characteristics and patterns, allowing for more precise

marketing and customer engagement. The study

focused on reactivating at-risk customers, nurturing

potential loyalists, and rewarding loyal customers to

enhance retention and increase revenue.

Recency: This refers to how recently a customer made a purchase. The more recent a customer's last purchase, the more likely that customer is to purchase again. This information can therefore be used to encourage lapsed customers to resume buying and to attract recent customers to make more purchases (Segal, 2022).

Frequency: This refers to how often a customer makes purchases. It can be influenced by several factors, such as product type, purchase price, and the need for replenishment or replacement. By predicting the purchase cycle, a business can remind customers to visit within a given time frame (Segal, 2022).

Monetary: This is based on how much a customer spends within a specific time period. In general, the more a customer spends, the more valuable that customer is to the business. However, it is important not to alienate customers who purchase consistently but spend less per transaction (Segal, 2022). Summary statistics for each RFM segment are shown in Table 1.
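The three RFM quantities can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the transactions, the snapshot date used as the reference point for recency, and the function name `rfm_table` are hypothetical, and the segment labels in Table 1 would require an additional scoring step that is not shown here.

```python
from datetime import date

# Hypothetical transactions: (customer id, invoice id, invoice date, amount).
transactions = [
    ("C1", "INV1", date(2011, 11, 1), 120.0),
    ("C1", "INV2", date(2011, 12, 5), 80.0),
    ("C2", "INV3", date(2011, 6, 10), 35.5),
]
snapshot = date(2011, 12, 10)  # assumed reference date for recency


def rfm_table(transactions, snapshot):
    """Return recency (days), frequency (invoices), monetary (total spend)."""
    raw = {}
    for customer, invoice, day, amount in transactions:
        entry = raw.setdefault(
            customer, {"last": day, "invoices": set(), "monetary": 0.0}
        )
        entry["last"] = max(entry["last"], day)  # most recent purchase
        entry["invoices"].add(invoice)           # distinct invoices
        entry["monetary"] += amount              # total amount spent
    return {
        customer: {
            "recency": (snapshot - entry["last"]).days,
            "frequency": len(entry["invoices"]),
            "monetary": entry["monetary"],
        }
        for customer, entry in raw.items()
    }


rfm = rfm_table(transactions, snapshot)
print(rfm)
```

A low recency value combined with high frequency and monetary values corresponds to the kind of customer grouped under "Champions", while a high recency value with low frequency points toward "Hibernating".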

3.2 Data Preprocessing

In the data preprocessing stage of the study, several

crucial steps were carried out.

Firstly, data cleaning was performed to handle

missing values, outliers, and any data anomalies to


ensure the data's quality and reliability. The creation

of the 'quantity per invoice' feature was another

important aspect. This feature helped in

understanding the distribution and patterns of the

quantity of items per invoice, providing valuable

insights into the purchasing behavior. Besides,

bucketing the quantity and unit price feature was also

implemented to categorize the data into meaningful

ranges, facilitating easier analysis and interpretation.

Furthermore, extracting and bucketing dates was

essential for temporal analysis and identifying

patterns based on specific time periods.

Finally, the dataset was separated into training and

testing subsets through the 'Train – Test split'

procedure. The test size was set to 0.20, and a random

state of 42 was used. This division allowed for the

accurate assessment and validation of the models

during the subsequent analysis stages. These

preprocessing steps were fundamental in preparing

the data for effective analysis and ensuring the

validity and reliability of the subsequent results.
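The bucketing and splitting steps can be sketched with the standard library alone. The helper names `bucket` and `split_rows`, the bucket edges, and the toy rows are assumptions for illustration. A library routine such as scikit-learn's `train_test_split` shuffles differently, so the exact rows that land in each subset would not match this sketch even with the same seed.

```python
import bisect
import random


def bucket(value, edges):
    """Return the index of the first edge that is >= value.

    Values larger than every edge fall into the last bucket, len(edges).
    """
    return bisect.bisect_left(edges, value)


def split_rows(rows, test_size=0.20, seed=42):
    """Shuffle a copy of rows with a fixed seed and hold out a test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical quantity bucket edges: up to 1, up to 5, up to 20, above 20.
quantity_edges = [1, 5, 20]
print([bucket(q, quantity_edges) for q in (1, 3, 12, 50)])

train, test = split_rows(range(100))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is the purpose of the random state of 42 mentioned above.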

3.3 Modelling

3.3.1 ANN

An Artificial Neural Network (ANN) is a computational model inspired by biological neural processing. Because of its reliability, nonlinearity, simplicity, and robustness, an ANN can be used to model and solve numerous complex environmental systems (Malekian & Chitsaz, 2021).

Typically, an ANN has three parts: an input layer, one or more hidden layers, and an output layer. Each layer contains multiple neurons.

The output $y$ of a neuron is calculated from the weighted sum of its inputs $x_i$, which are the outputs of the previous layer, plus a bias term $b$, followed by the application of an activation function $f$. It can be represented mathematically as:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) \tag{1}$$

where $w_i$ are the weights associated with the inputs $x_i$ and $n$ is the number of inputs. Three activation functions are commonly used: the sigmoid function $f(z) = \frac{1}{1 + e^{-z}}$, the rectified linear unit (ReLU) function $f(z) = \max(0, z)$, and the hyperbolic tangent function $f(z) = \tanh(z)$. The most accurate and effective choice depends on the circumstances (Yang & Wang, 2020).
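To make Equation (1) concrete, the sketch below evaluates a single neuron with each of the three activation functions. The weights, inputs, and bias are arbitrary illustrative values, and the function names are chosen for this example only.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def neuron_output(weights, inputs, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


weights = [0.5, -0.25]   # illustrative values
inputs = [2.0, 4.0]
bias = 0.0

for name, f in (("sigmoid", sigmoid), ("relu", relu), ("tanh", math.tanh)):
    print(name, neuron_output(weights, inputs, bias, f))
```

With these values the weighted sum is zero, so the sigmoid returns 0.5 while ReLU and tanh return 0, which shows how the same pre-activation value is mapped differently by each function.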

During the training of an ANN, the weights $w_i$ and the bias $b$ are adjusted to make a loss function $L$ as small as possible (Yang & Wang, 2020). This function measures the difference between the network's output and the desired output. The minimization is typically accomplished with optimization algorithms such as gradient descent.
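As a minimal sketch of gradient descent, the example below fits a single weight $w$ of a linear neuron without bias or activation by repeatedly stepping against the gradient of a mean squared error loss. The data, learning rate, and number of steps are assumptions chosen so that the procedure converges. A real ANN applies the same idea to many weights at once through backpropagation.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the predictions w * x against the targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, steps=50):
    """Minimize the MSE loss over a single weight, starting from w = 0."""
    w = 0.0
    losses = []
    for _ in range(steps):
        losses.append(mse_loss(w, xs, ys))
        # Derivative of the loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad  # step against the gradient
    return w, losses


# Illustrative data generated by the relationship y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, losses = gradient_descent(xs, ys)
print(w, losses[0], losses[-1])
```

The recovered weight approaches 2 and the recorded loss falls toward zero, which is the behaviour the training procedure aims for.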

ANNs have shown remarkable capabilities across different problems, for instance handling vast amounts of data, pattern recognition, classification, regression, and prediction. Because of these characteristics, they have been widely applied in various fields, such as image and speech recognition, financial forecasting, and natural language processing.

3.3.2 Linear Regression

As James et al. (2023) state, linear regression is a statistical method. It is used to build a model that describes the relationship between a dependent variable and one or more independent variables.

In simple linear regression, the relationship is assumed to follow the equation:

$$y = \beta_0 + \beta_1 x + e \tag{2}$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope of the line, and $e$ is the error term, which represents the deviation from a perfect linear relationship.

The goal of linear regression is to estimate the values of $\beta_0$ and $\beta_1$ that minimize the sum of the squared residuals. A residual is the difference between an observed value $y_i$ and the value predicted by the linear equation (James et al., 2023).

Mathematically, the residual sum of squares (RSS) over the $n$ observations is given by:

$$RSS = \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 \tag{3}$$

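To find the optimal values of $\beta_0$ and $\beta_1$, the least squares method can be used. The formulas for the estimators $\hat{\beta}_1$ and $\hat{\beta}_0$ are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{4}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{5}$$

where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ over the $n$ observations, respectively.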
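The least squares estimators in Equations (4) and (5) translate directly into code. The sketch below is an illustration on made-up data. The function names and the sample points are assumptions for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0_hat, beta1_hat) from the closed-form least squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx              # slope, Equation (4)
    beta0 = y_bar - beta1 * x_bar    # intercept, Equation (5)
    return beta0, beta1


def residual_sum_of_squares(xs, ys, beta0, beta1):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative points lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
beta0, beta1 = fit_simple_linear_regression(xs, ys)
print(beta0, beta1, residual_sum_of_squares(xs, ys, beta0, beta1))
```

Because the sample points lie on a straight line, the estimators recover the intercept and slope exactly and the residual sum of squares is zero. With noisy data the same formulas return the line that makes the residual sum of squares as small as possible.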

Linear regression therefore has wide applications in diverse fields, including prediction, trend analysis, and understanding the relationships between variables.


3.3.3 Decision Tree

A Decision Tree has a flowchart-like structure. It is a non-parametric supervised learning algorithm that can be used for tasks such as classification and regression. It has a hierarchical tree structure, and a complete decision tree consists of four kinds of parts: a root node, branches, internal nodes, and leaf nodes (Song & Lu, 2015).

Each internal node of the decision tree evaluates a feature in order to separate the data into subsets. The goal is to create splits that maximize the purity, or homogeneity, of the target variable in the resulting subsets.

In practice, there are numerous ways to select the best attribute at each node. Two criteria are commonly used for splitting in decision tree modeling: Gini impurity and entropy.

The Gini impurity of a node $t$ with $K$ classes, where $p_k$ is the proportion of instances belonging to class $k$, is given by:

$$Gini(t) = 1 - \sum_{k=1}^{K} p_k^2 \tag{6}$$

The entropy of a node $t$ is defined as:

$$Entropy(t) = -\sum_{k=1}^{K} p_k \log_2(p_k) \tag{7}$$
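Both impurity measures can be computed from the class counts at a node, as in the following minimal sketch. The function names and the example counts are assumptions for illustration, and the entropy uses a base-2 logarithm.

```python
import math


def class_proportions(counts):
    """Convert per-class instance counts at a node into proportions p_k."""
    total = sum(counts)
    return [c / total for c in counts]


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(counts))


def entropy(counts):
    """Entropy in bits; classes with zero instances contribute nothing."""
    return -sum(p * math.log2(p) for p in class_proportions(counts) if p > 0)


for counts in ([10, 0], [9, 1], [5, 5]):
    print(counts, gini(counts), entropy(counts))
```

A pure node scores zero on both measures, and an evenly mixed two-class node reaches the two-class maximum of 0.5 for Gini impurity and 1 bit for entropy. A split is preferred when it moves the child nodes toward the pure end of this range.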

In general, decision trees are easy to understand and interpret. Additionally, they require little to no data preparation and are more flexible than many other algorithms.

3.3.4 Random Forest

Random Forest is an ensemble learning method that can be used for tasks such as classification and regression. It builds multiple decision trees at training time and combines their outputs when making predictions (Rigatti, 2017).

The fundamental idea of Random Forest is to build a collection of decision trees, each trained on a random subset of the training data and using a randomly selected subset of the features. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the input features and $Y = \{y_1, y_2, \ldots, y_n\}$ be the corresponding target variables. When a decision tree in a Random Forest is trained, a random subset of features is considered for splitting the data at every node of the tree. Each split is chosen to reduce an impurity measure as much as possible. Two common choices of measure are entropy and Gini impurity.

For a node $t$ with classes $C_1, C_2, \ldots, C_k$ and the proportion of class $C_i$ given by $p(C_i)$, the Gini impurity is defined as follows:

$$Gini(t) = 1 - \sum_{i=1}^{k} p(C_i)^2 \tag{8}$$

After training, the Random Forest prediction $\hat{y}(x)$ for an input $x$ is obtained by averaging the predictions of all the individual trees:

$$\hat{y}(x) = \frac{1}{T} \sum_{i=1}^{T} \hat{y}_i(x) \tag{9}$$

where $T$ is the number of decision trees and $\hat{y}_i(x)$ is the prediction of the $i$-th tree for the input $x$.
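The averaging step in Equation (9) can be sketched as follows. The "trees" here are hypothetical one-split regression stumps with hand-picked thresholds and leaf values, standing in for trees that would normally be learned from random subsets of the data and features. The feature names and all numbers are assumptions for illustration.

```python
def make_stump(feature, threshold, left_value, right_value):
    """Build a one-split tree: predict left_value if x[feature] <= threshold."""
    def predict(x):
        return left_value if x[feature] <= threshold else right_value
    return predict


def forest_predict(trees, x):
    """Average the predictions of all trees for the input x."""
    return sum(tree(x) for tree in trees) / len(trees)


# Three hypothetical stumps; in a real forest each would be fitted on a
# different random sample of rows and a random subset of features.
trees = [
    make_stump("quantity", 5, 10.0, 30.0),
    make_stump("unit_price", 2.0, 12.0, 24.0),
    make_stump("quantity", 10, 14.0, 40.0),
]

x = {"quantity": 8, "unit_price": 1.5}
prediction = forest_predict(trees, x)
print(prediction)
```

Each stump on its own gives a coarse estimate, and the forest's output is the mean of those estimates, which is what smooths out the errors of individual trees.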

Random Forest therefore has several advantages, including high accuracy and the ability to handle high-dimensional data.

3.4 Results and Comparison

Each model's performance is measured using the coefficient of determination ($R^2$) (see Table 2).

Table 2: Model Performance Evaluation by Coefficient of Determination.

| | ANN | Linear Regression | Decision Tree | Random Forest |
|---|---|---|---|---|
| $R^2$ | -0.0173 | 0.1563 | 0.5521 | 0.5859 |

The coefficient of determination for the ANN method was -0.0173. $R^2$ measures the degree to which a statistical model predicts the outcome. A negative coefficient of determination indicates that the model performs poorly at capturing the relationship between the input and output variables.
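To show how a negative value can arise, the sketch below uses the standard definition $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the observations from their mean. The function name and the toy values are assumptions for illustration and are unrelated to the study's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean
```

Perfect predictions give 1, always predicting the mean gives 0, and predictions that are worse than the mean give a negative value. A slightly negative score therefore means the model did no better than a constant prediction of the average.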

Moreover, the better a model is at making predictions, the closer its $R^2$ will be to 1 (Turney, 2022). The random forest model achieved the highest $R^2$ value, 0.5859.

Therefore, random forest performed best at sales prediction on this dataset, followed by the decision tree (0.5521) and linear regression (0.1563), while the ANN had the poorest performance (-0.0173). The superior performance of random forest can be attributed to the nature of ensembles, which combine multiple decision trees to make more accurate predictions. The randomness in feature selection and tree construction helps reduce the risk of overfitting and improves generalization.


4 CONCLUSIONS

The e-commerce industry has seen rapid growth in recent years. Customers can access a wide variety of products and make purchases from the comfort of their homes. However, this has presented numerous new challenges to businesses, particularly in sales, which is the most critical aspect of e-commerce. To gain insights and engineer features, the study employed EDA and the RFM model. Then, several predictive models, namely Artificial Neural Network, Linear Regression, Decision Tree, and Random Forest, were used to predict sales. The performance of these models was measured and compared using the coefficient of determination. The comparison revealed that Random Forest performed best, followed by the Decision Tree model, whereas the Linear Regression and Artificial Neural Network models performed considerably worse. This research could support a better understanding of the dynamics of e-commerce sales in the UK and can offer valuable insights for businesses and researchers in the field, thereby enhancing sales prediction accuracy and decision-making.

The study has limitations that future research

should address. In feature engineering, not all features

were fully utilized, potentially missing hidden

patterns. Future efforts need to explore more

comprehensive techniques to extract deeper insights.

Generalizability is a concern as the model may not

work well on new e-commerce datasets.

Incorporating diverse datasets and cross-validation

can improve applicability. Additionally,

experimenting with the models' hyperparameters and

architectures can enhance e-commerce sales

predictions. Validating on external datasets is crucial

for understanding robustness and effectiveness. By

focusing on these areas, future research can build on

current findings and provide more reliable solutions

for e-commerce businesses.

REFERENCES

James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). Linear Regression. An Introduction to Statistical Learning, 69–134.

Komorowski, M., Marshall, D. C., Salciccioli, J. D., & Crutain, Y. (2016). Exploratory Data Analysis. Secondary Analysis of Electronic Health Records, 185–203.

Rigatti, S. J. (2017). Random Forest. Journal of Insurance Medicine, 47(1), 31–39.

Segal, T. (2022, November 19). Inside Recency, Frequency, Monetary Value (RFM). Investopedia. https://www.investopedia.com/terms/r/rfm-recency-frequency-monetary-value.asp

Song, Y. Y., & Lu, Y. (2015). Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry, 27(2), 130–135.

Turney, S. (2022, April 22). Coefficient of Determination (R2) | Calculation & Interpretation. Scribbr. https://www.scribbr.com/statistics/coefficient-of-determination/#:~:text=coefficient%20of%20determination-

Usmani, Z. A., Manchekar, S., Malim, T., & Mir, A. (2017). A predictive approach for improving the sales of products in e-commerce. 2017 Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB).

Wei, J. T., Lin, S. Y., & Wu, H. H. (2010). A review of the application of RFM model. African Journal of Business Management, 4(19), 4199.

Yang, G. R., & Wang, X.-J. (2020). Artificial Neural Networks for Neuroscientists: A Primer. Neuron, 107(6), 1048–1070.

Zheng, B., Thompson, K., Lam, S. S., Yoon, S. W., & Gnanasambandam, N. (2013). Customers' behavior prediction using artificial neural network. In IIE Annual Conference. Proceedings (p. 700). Institute of Industrial and Systems Engineers (IISE).
