Data-Driven Consumer Behaviour Prediction: Key Factors and
Machine Learning Approaches
Yiling Wang
a
University College London School of Management, University College London, London, U.K.
Keywords: Consumer Behaviour Prediction, Machine Learning, Big Data Analytics, Marketing Strategy, Data Privacy.
Abstract: In today's data-driven market environment, consumer behaviour prediction has become an indispensable tool
for companies seeking to develop more targeted marketing strategies, optimise supply chains, and enhance
customer experience. This article systematically explores the internal and external factors that influence
consumer behaviour prediction, including psychological characteristics, demographic factors, market trends,
advertising campaigns, and social media influences. It also examines the intricate interplay between these
elements, highlighting how they can collectively shape purchasing decisions. In addition, the article analyses
in detail different prediction methods, ranging from traditional statistical models (e.g., linear regression and
decision trees) to emerging machine learning techniques (e.g., integrated learning and clustered classification
models), and assesses their respective strengths and weaknesses. Addressing challenges such as data privacy,
data imbalance, model interpretability, and real-time performance, this paper proposes a series of practical
solutions while looking ahead to future trends in consumer behaviour prediction. Ultimately, the research
provides valuable insights for enterprises, marketers, and researchers to optimise data-driven marketing
strategies and business decisions.
1 INTRODUCTION
In the digital age, understanding and predicting
consumer behaviour has become an important part of
marketing, business planning and economic
forecasting. Accurate prediction of consumer
preferences and purchasing decisions help companies
to optimise marketing strategies, product offerings
and enhance customer satisfaction. Consumer
behaviour prediction is influenced by factors such as
psychological characteristics, demographics, social
influences and market dynamics. With the
development of big data and artificial intelligence,
prediction methods have evolved from traditional
statistical models to sophisticated machine learning
techniques to improve accuracy.
This paper explores the key factors affecting the
prediction of consumer behaviour, which are divided
into two dimensions: internal (emotions, values,
demographic characteristics) and external (market
trends, advertising strategies, social media,
macroeconomics). Meanwhile, different prediction
methods from linear regression and decision trees to
a
https://orcid.org/0009-0004-0086-5108
integrated learning and cluster classification are
analysed and their strengths and weaknesses are
evaluated to provide practical guidance.
In addition, this paper discusses the challenges of
consumer behaviour prediction, such as data privacy,
data imbalance, model interpretability and real-time
prediction challenges, and explores solutions such as
privacy protection, data augmentation and
interpretable AI. The outlook emphasises the
combination of ethical AI, real-time data processing
and innovative machine learning frameworks to
enhance the accuracy and reliability of predictive
models.
Through multi-dimensional analysis of consumer
behaviour and the development of prediction
technologies, this paper aims to deepen the
understanding of the consumer decision-making
process, provide data-driven strategic support for
enterprises, marketers and policymakers, and help
build a more efficient and personalised consumer
interaction system.
Wang, Y.
Data-Driven Consumer Behaviour Prediction: Key Factors and Machine Learning Approaches.
DOI: 10.5220/0013685900004670
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 2nd International Conference on Data Science and Engineering (ICDSE 2025), pages 247-252
ISBN: 978-989-758-765-8
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
247
2 FACTORS INFLUENCING THE
PREDICTION OF CONSUMER
BEHAVIOUR
In the field of consumer behaviour forecasting, the
study of influencing factors is crucial. These factors
not only determine consumers' purchasing decisions,
but also influence market trends and corporate
marketing strategies. From the view of internal
factors, consumer behaviour is driven by
psychological characteristics, social identity,
personality preferences, etc. From the view of the
external environment, market supply and demand,
advertising, social media influence, etc. also play a
key role. In addition, with the development of big data
and artificial intelligence, data such as consumers'
historical behaviours, search records and product
evaluations have become an important basis for
predicting consumer behaviour.
2.1 Internal Factors
Consumers' emotions play an important role in
shopping decisions. Research has shown that
pleasurable moods typically motivate consumers to
make faster, more impulsive purchasing decisions,
while negative moods reduce the desire to buy or
prompt more cautious choices. For example, during
the festive season, consumers tend to be willing to
spend more due to pleasurable moods, while stress or
anxiety may increase the demand for comfort or
emotional comfort products.
In addition, demographic characteristics
significantly influence consumer behaviour. Age,
gender, income, occupation and education level
determine consumption preferences, e.g. younger
consumers favor personalised products, while older
consumers pay more attention to practicality and
quality assurance. Highly educated groups tend to
consume rationally, and there are also differences in
product choices between consumers of different
genders.
Values also influence consumption decisions,
especially in terms of environmental protection,
sustainable development and social responsibility.
For example, environmentally conscious consumers
tend to choose recyclable or eco-friendly products,
while socially responsible consumers are more likely
to support brands with good CSR performance. As the
concept of sustainable consumption becomes more
popular, companies are also starting to introduce
products that are in line with consumer values.
Family structure is also a key factor. Family size
and life cycle influence consumption patterns, such as
families with children spending more on food and
education, while single consumers are more inclined
to spend on personal interests and experiences. Newly
married families prioritise household products, while
retired groups are more concerned with health and
retirement products.
Income level directly restricts consumers'
purchasing power. Low-income groups are more
concerned about value for money and tend to prefer
discounted goods, while high-income groups place
more emphasis on brand, quality and personalised
experience. For example, in the luxury goods market,
high-income earners are more likely to accept high
premiums, while low and middle-income earners tend
to favor promotions or substitutes.
Price sensitivity affects how responsive
consumers are to price changes. High price sensitivity
consumers tend to buy promotional items or wait for
discounts, while low price sensitivity consumers are
more concerned with product value. For example, in
the electronics market, some consumers will wait for
price reductions to buy, while loyal users are willing
to pay a premium to get new products. This suggests
that price works in conjunction with other
psychological and social factors to influence the final
purchase decision.
2.2 External Factors
In the digital age, the Internet and social media
profoundly influence consumer behaviour. The
popularity of online shopping has made product
information readily available to consumers, and
social media advertising and recommendation
algorithms have enhanced this trend. Users often refer
to reviews, watch review videos and discuss product
experiences on social media platforms before making
purchases, and this socialised shopping model
enhances the interaction between brands and
consumers.
Advertising and promotional strategies directly
shape consumer decisions. Precise advertising,
discount promotions and membership offers can
effectively stimulate the desire to buy. For example,
promotions such as limited-time discounts and buy-
one-get-one-free boost sales, while brand precision
marketing attracts target users and influences their
brand choice through personalised recommendations.
Brand awareness, image and reputation affect
consumer trust and loyalty. Well-known brands are
more likely to win favour by virtue of their market
accumulation and word-of-mouth, such as Apple and
ICDSE 2025 - The International Conference on Data Science and Engineering
248
Nike, which shape a high degree of loyalty and
remain competitive even at higher prices. In addition,
a brand's social responsibility, innovation and after-
sales service also affect long-term loyalty.
Public emergencies change consumer demand and
behaviour. During major disasters or public health
crises, demand shifts from non-essentials to
necessities and health products. For example, during
the New Crown epidemic, demand for protective gear
and telecommuting equipment surged, while the
travel and entertainment sectors were hit. In addition,
such events prompt consumers to pay more attention
to food safety, health protection and emergency
stockpiling, influencing long-term consumption
trends.
3 CONSUMER BEHAVIOUR
FORECASTING METHODS
3.1 Conventional Prediction Methods
3.1.1 Linear Regression
Linear Regression (LR) is a basic statistical method
for consumer behaviour prediction, which assumes
that there is a linear relationship between the
independent variables (influencing factors) and the
dependent variable (consumer behaviour). For
example, it can be used to analyse the relationship
between consumers' purchase amount and frequency
of purchase and factors such as income level,
advertising investment and brand loyalty. Multiple
linear regression improves predictive accuracy by
introducing multiple independent variables (Thalji,
2022).
Thalji's study used multiple linear regression to
explore the impact of after-sales service information
on consumer purchase decisions on e-commerce
platforms (Thalji, 2022). Based on 533 questionnaires
from the city of Dammam, Saudi Arabia, the study
analysed the effect of return policy, maintenance,
communication, and warranty on purchasing
decisions, and the results showed that these factors
were significantly and positively related to
purchasing behaviour. In addition, the study used
hierarchical clustering and K-mean clustering
methods to group consumers by behavioural and
demographic characteristics to reveal the attention of
different groups to after-sales service information.
Multiple linear regression has the advantage of
being able to analyse the effects of multiple variables
on consumer behaviour simultaneously, capturing
complex relationships. For example, in addition to
after-sales service factors, the study can incorporate
variables such as consumers' demographic
characteristics (age, gender, income level) as well as
psychological characteristics (brand loyalty, purchase
motivation) to provide a more comprehensive
forecast. The method is particularly suitable for
consumption scenarios where linear relationships are
more obvious, such as pricing sensitivity analyses and
evaluation of the effectiveness of promotional
activities.
3.1.2 Classification Decision Trees
However, Gkikas et al. pointed out that decision trees
are susceptible to overfitting and have weak
generalisation ability. For this reason, the study
combines Genetic Algorithm (GA) to optimise the
decision tree structure, reduce redundancy through
feature selection and pruning, and improve model
robustness. In addition, the study suggests the use of
integrated learning methods (e.g., Random Forest,
Gradient Boosting Tree) to enhance the classification
performance and improve the prediction stability and
accuracy (Gkikas et al., 2022)
3.1.3 Nearest Neighbour Algorithm
The KNN algorithm is an instance-based learning
method that selects the nearest K neighbors by
calculating the distance between a new sample and
the samples in the training dataset and performs
classification or regression based on the categories of
these neighbors. In consumer behaviour prediction,
KNN is widely used to analyse user preferences. For
example, Putra & Ilmi used KNN algorithms for
consumer behaviour identification and product
personalisation recommendation (Putra & Ilmi,
2024).
Putra & Ilmi's study shows that KNN can
effectively capture consumer purchase patterns and
improve the accuracy of personalised
recommendations in a big data environment (Putra &
Ilmi, 2024). The study uses KNN for consumer
classification and product matching based on data
such as social media interactions, purchase history
and product browsing behaviour. The results show
that consumers with similar behaviours tend to prefer
similar products, which improves the effect of
precision marketing. In addition, KNN does not
require a complex training process and is suitable for
real-time recommendation systems on e-commerce
platforms, providing support for dynamic adjustment
of marketing strategies.
Data-Driven Consumer Behaviour Prediction: Key Factors and Machine Learning Approaches
249
However, KNN also has the disadvantages of high
computational complexity and sensitivity to data size.
Computing distance on large-scale datasets consumes
more resources, and noisy or redundant features may
affect prediction accuracy. The researchers suggest
optimising data quality through principal component
analysis (PCA) dimensionality reduction, feature
selection, and accelerating the computation by
combining it with KD trees or ball trees.
Another study PLOS ONE extends the application
of KNN in online shopping behaviour prediction
Azad et al. (Azad et al., 2023). The study analysed
consumers’ browsing history, shopping cart
behaviour, purchase records and other data, and the
results showed that KNN can effectively identify
consumers’ potential needs, accurately recommend
products, and improve the effect of personalised
marketing.
3.2 Novel Prediction Methods
In the field of consumer behaviour prediction, with
the continuous advancement of data analytics,
researchers have developed a variety of novel
methods to improve the accuracy and reliability of
predictions. The following are several common novel
prediction methods.
3.2.1 Integrated Learning Methods
Integrated learning improves prediction by
combining the results of multiple models. The
method reduces the bias and variance of a single
model and improves prediction robustness. Common
methods include Bagging, Boosting, and Stacking.
e.g., Bagging can train multiple models by sampling
the dataset multiple times and reduce the risk of
overfitting by averaging or voting.
Liu et al. showed that integrated learning
demonstrated strong comprehensive capabilities in
consumer behaviour prediction (Liu et al., 2020). The
study combines models such as decision trees,
support vector machines, and neural networks to
construct an integrated learning framework to
improve prediction accuracy. The results show that
Bagging effectively reduces overfitting and improves
model generalisation, especially for complex and
non-linear data.
Verma further explored the application of
Bagging method in the field of e-commerce (Verma,
2020). The study trains multiple decision tree models
through random sampling and uses a voting
mechanism to synthesise the results, reducing
variance and enhancing the robustness of the model
to noisy data, especially when dealing with high-
dimensional data.
The application of Boosting methods in consumer
behaviour prediction has also attracted much
attention.Alizamir et al. used Gradient Boosting Tree
(GBT) to analyse consumers' purchase history,
browsing behaviour and demographic characteristics
to predict their future purchase propensity (Alizamir
et al., 2022). It was found that Boosting performs well
with unbalanced data (e.g., few class samples) and
can effectively identify potential high-value
customers.
3.2.2 Hybrid Method of Clustering and
Classification
The hybrid method of clustering and classification
improves prediction accuracy by first analysing
consumers by clustering and then applying
classification or regression models to each group.
Common clustering algorithms include K-means and
hierarchical clustering, which can be used by
companies to identify different groups of consumers
and develop precise marketing strategies. For
example, Cai & Rodavia (2022) used K-means to
perform a cluster analysis of consumer behaviour and
applied classification models across the clusters to
optimise purchase prediction.
Cai & Rodavia's study shows that the approach
combines the pattern recognition capabilities of
cluster analysis with the accurate prediction
capabilities of classification models to improve the
accuracy of the models. The study used the K-means
algorithm to cluster consumers based on purchase
history, browsing behaviour and demographic
characteristics, and applied decision tree or logistic
regression models within each cluster to predict
purchase propensity.
In addition, Cai & Rodavia further validated the
effectiveness of the method by using K-means
clustering and applying a random forest classification
model within each cluster. The results show that the
method can handle the heterogeneity of consumer
behaviour data more effectively and improve the
prediction performance. For example, the study
identifies high-value customers and price-sensitive
customers and develops differentiated marketing
strategies for different clusters, which significantly
improves marketing effectiveness.
Hierarchical clustering eliminates the need to
preset the number of clusters, and can be applied to
exploratory analyses by showing the hierarchical
structure of data through a tree diagram. For example,
Anitha & Patil (2022) used the K-Means algorithm
ICDSE 2025 - The International Conference on Data Science and Engineering
250
combined with the RFM (Recency, Frequency,
Monetary) model to perform cluster analysis of
consumer purchase behaviour, identify different
consumer groups, and develop targeted marketing
strategies.
4 EXISTING LIMITATIONS AND
FUTURE OUTLOOK
4.1 Current Status of Data Privacy
With the in-depth analysis of consumer behaviour by
predictive models, the risk of user privacy leakage
increases significantly. For example, unauthorised
data collection and data misuse issues are frequent,
making user information security a serious challenge.
Therefore, future development requires further
improvement of laws and policies to ensure legal and
compliant use of data, and enhancement of data
security through encrypted storage, access control
and other technical means.
4.2 Challenge of Data Imbalance
Problem
The data imbalance problem is a common challenge
in consumer behaviour prediction, especially when
dealing with a few types of data (e.g., customer churn,
uncommon purchasing behaviours), the model often
struggles to learn efficiently, resulting in a
degradation of prediction performance. To address
this problem, a synthetic minority class oversampling
technique (SMOTE) can be used to balance the data
distribution by generating new minority class
samples. In addition, the WOE (Weight of Evidence)
method can be used for variable binning and
transformation to compute the ratio of minority
classes to majority classes in each feature bin to
improve model performance on unbalanced data
(Hambali & Andrew, 2024).
4.3 The Need for Model
Interpretability
In the application of predictive models, it is crucial to
understand the logic behind the prediction results,
especially in highly sensitive fields such as healthcare
and finance, and improving model interpretability can
help to enhance decision transparency and user trust.
Flowchart visualisation techniques can be used to
show the decision path of the model, for example, in
decision tree models, using branch nodes to visually
present the prediction process. In addition, interactive
visualisation tools (e.g. Tableau, Power BI) can
dynamically display the model's prediction process
and feature contributions, enabling users to gain a
deeper understanding of the model's decision-making
mechanism (Kovalerchuk et al., 2023).
4.4 Real-time challenges of models
The problem of latency in data processing has been a
challenge for consumer behaviour prediction models,
mainly due to the long time consuming data
preprocessing and feature engineering, which affects
the real-time performance of the model. To improve
the real-time performance of the model, the
computational cost can be reduced by streamlining
the feature set and selecting key features without
significantly reducing the model performance. In
addition, optimising the real-time data acquisition
process and adopting incremental learning methods
can reduce the redundant steps in data processing and
improve the real-time prediction capability of the
model.
4.5 Future Outlook
With the continuous progress of data analysis
technology and machine learning algorithms, future
consumer behaviour prediction models will be
continuously optimized in terms of data security, data
quality, interpretability and real-time performance.
The parallel promotion of data privacy protection
through legal and technical means, the improvement
of fairness in data collection and processing, and the
combination of efficient model optimisation
strategies will enable predictive models to play a
greater role in business decision-making and user
experience enhancement.
5 CONCLUSIONS
Consumer behaviour forecasting is becoming
increasingly important in the modern business world.
Through in-depth analysis of the internal and external
factors that influence consumer behaviour,
companies are able to identify market needs more
accurately and optimise their marketing strategies.
This study examines the key role of factors such as
consumer psychology, demographic characteristics,
social media influence, and advertising in predicting
consumer behaviour. In addition, with the
advancement of data analytics and artificial
Data-Driven Consumer Behaviour Prediction: Key Factors and Machine Learning Approaches
251
intelligence technologies, the combination of
traditional statistical methods and modern machine
learning algorithms significantly improves the
accuracy and usefulness of forecasting.
Despite significant progress in accuracy and
applicability, current prediction methods still face
many challenges. For example, data privacy
protection is a growing concern, and organisations
need to follow strict compliance requirements when
using consumer data for forecasting. In addition, the
problem of data imbalance makes the model limited
in dealing with niche consumer behaviours, and how
to optimise data sampling and model training
becomes an important direction for future research.
Meanwhile, model interpretability and real-time
performance are also key factors that affect the
effectiveness of consumer behaviour prediction, and
companies need to find a balance between increasing
model complexity and transparency.
Looking ahead, with the further development of
big data, deep learning and personalised
recommendation technologies, the accuracy of
consumer behaviour prediction will continue to
improve. Enterprises need to continuously optimise
their data collection and analysis methods and
combine them with the latest AI technologies to
ensure the reliability and feasibility of prediction
results. In addition, strengthening interdisciplinary
cooperation, such as combining research in the fields
of behavioural economics and psychology, will help
to understand consumer behaviour patterns more
comprehensively. Ultimately, while utilising data-
driven strategies, companies should also focus on
ethical and legal norms in order to achieve sustainable
development and long-term competitive advantage in
the market.
REFERENCES
Anitha, P., & Patil, M. M. (2022). RFM model for customer
purchase behavior using K-Means algorithm. Journal of
King Saud University-Computer and Information
Sciences, 34(5), 1785-1792.
Azad, M. S., Khan, S. S., Hossain, R., Rahman, R., &
Momen, S. (2023). Predictive modeling of consumer
purchase behavior on social media: Integrating theory
of planned behavior and machine learning for
actionable insights. PLOS ONE, 18(12), e0296336.
https://doi.org/10.1371/journal.pone.0296336
Cai, K., & Rodavia, M. R. (2022, October). K-Means
Cluster Analysis Based on Consumer Behavior. In 2022
4th International Conference on Artificial Intelligence
and Advanced Manufacturing (AIAM) (pp. 143-146).
IEEE.
Gkikas, D. C., Theodoridis, P. K., & Beligiannis, G. N.
(2022, May). Enhanced marketing decision making for
consumer behaviour classification using binary
decision trees and a genetic algorithm wrapper. In
Informatics (Vol. 9, No. 2, p. 45). MDPI.
Hambali, M. A., & Andrew, I. (2024). Bank Customer
Churn Prediction Using SMOTE: A Comparative
Analysis. Qeios, Mar.
Kovalerchuk, B., Dunn, A., Worland, A., & Wagle, S.
(2023). Interactive decision tree creation and
enhancement with complete visualization for
explainable modeling. arXiv preprint
arXiv:2305.18432.
Liu, X., Zhang, Y., Chen, W., & Li, H. (2020). Ensemble
learning-based consumer behavior prediction: A hybrid
model approach. IEEE Transactions on Neural
Networks and Learning Systems, 31(9), 3725–3738.
https://doi.org/10.1109/TNNLS.2020.2995362
Putra, S. E., & Ilmi, M. (2024). Application of K-Nearest
Neighbor Algorithm for Consumer Behaviour
Identification and Product Personalisation Based on
Big Data Analysis. Jurnal Ecotipe (Electronic, Control,
Telecommunication, Information, and Power
Engineering), 11(2), 205-213.
Thalji, Z. J. (2022). Using multiple linear regression model
to predict the customers' purchase decision based on
after-sales services. Research Square.
Verma, A. (2020). Consumer behaviour in retail: Next
logical purchase using deep neural network. arXiv
preprint arXiv:2010.06952.
ICDSE 2025 - The International Conference on Data Science and Engineering
252