Machine Learning-Based Customer Segmentation: A Comprehensive Investigation of Techniques, Challenges and Applications

Yazhi Zhang

Steinhardt School of Culture, Education, and Human Development, New York University, New York, U.S.A.

Keywords: Customer Segmentation, Machine Learning, Deep Learning.

Abstract: Customer segmentation is vital for optimizing targeted marketing strategies, improving customer experiences, and driving profitability across industries. This study provides a comprehensive analysis of how machine learning can improve segmentation accuracy and offer deeper insights into customer behaviours. To this end, the paper examines machine learning methods used for customer segmentation in the banking, telecommunications, and healthcare industries. The methods reviewed include decision trees, random forests, k-means clustering, hierarchical clustering, automated machine learning (AutoML) tools such as H2O, and deep learning models. The study also analyzes the machine learning workflow, including problem definition, data collection, preprocessing with techniques such as Local Outlier Factor (LOF) and Principal Component Analysis (PCA), model selection, training, evaluation, and deployment. Each industry-specific case study is examined to highlight the effectiveness and risks of these methods in real-world applications. The results indicate that while machine learning significantly enhances customer segmentation, it also introduces challenges related to model interpretability, domain applicability, and privacy. The study supports the hypothesis that incorporating interpretability tools such as SHAP and LIME, leveraging transfer learning, and adopting federated learning are crucial for overcoming these challenges.

1 INTRODUCTION

Customer segmentation entails dividing a customer base into distinct groups based on shared characteristics such as reactions, requirements, and preferences (Monil et al., 2020). In the context of machine learning, this process uses advanced algorithms to analyze vast datasets, uncovering patterns that traditional methods might miss. Employing machine learning techniques in customer segmentation helps to strengthen targeted marketing strategies, improve customer experiences, optimize marketing efforts, and increase profitability for businesses across various sectors, particularly in the highly competitive landscape of modern commerce.

Machine learning can be broadly divided into two main categories: supervised learning and unsupervised learning. Supervised learning is commonly used for classification and regression problems, where the data include a target value that the model aims to predict in future scenarios, such as estimating a student’s grade or forecasting the number of transactions (Narayana et al., 2022). In contrast, unsupervised learning does not involve predicting a specific label or target variable. Instead, it focuses on identifying patterns and grouping data based on similarities, such as grouping students according to their purchasing behavior or learning patterns, without predefined labels (Narayana et al., 2022). Previous studies on customer segmentation in industry have predominantly focused on traditional statistical methods, such as cluster analysis and regression models. These methods, while useful, often cannot handle large and complex datasets effectively. Recent research has started to explore machine learning techniques, including k-means clustering, automated machine learning, and neural networks, to enhance segmentation accuracy and provide deeper insights. For instance, Narayana et al. used K-means, Agglomerative, and Mean Shift clustering to identify possible customer segments in a mall based on gender, age, yearly income, and spending score (Narayana et al., 2022); Turkmen investigated k-means, hierarchical clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for the online retail industry (Turkmen, 2022); and Yadegaridehkordi et al. applied k-means, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), and Classification and Regression Trees (CART) to analyze travelers at eco-friendly hotels (Yadegaridehkordi et al., 2021).

https://orcid.org/0009-0001-8080-1451

Zhang, Y.
Machine Learning-Based Customer Segmentation: A Comprehensive Investigation of Techniques, Challenges and Applications.
DOI: 10.5220/0013207200004568
In Proceedings of the 1st International Conference on E-commerce and Artificial Intelligence (ECAI 2024), pages 97-101
ISBN: 978-989-758-726-9

The primary aim of the article is to investigate the

latest trends in customer segmentation in various

industries using machine learning within the context

of an omnichannel world. Initially, this paper

examines the current state and innovations in

customer segmentation. The methods utilized in

different industries are then recapitulated. Subsequent

sections discuss the advantages and disadvantages of

implementing these models for customer

segmentation, offering valuable insights for

practitioners and researchers alike. The final part

summarizes the theoretical and practical conclusions.

2 METHODS

2.1 Introduction of Machine Learning Workflow

Machine learning is a branch of artificial intelligence that focuses on developing algorithms that enable computers to learn from data and make decisions or predictions. Figure 1 shows that machine learning can be divided into four categories: Supervised Learning, Unsupervised Learning, Semi-supervised Learning, and Reinforcement Learning (Taye, 2023). Supervised learning covers classification and regression, in which data are fitted to pre-existing labels or target values. Typical algorithms include Linear Regression, Logistic Regression, Random Forest, and Neural Networks. Unsupervised learning, on the other hand, consists of clustering and association. It often includes K-means clustering, principal component analysis, t-distributed stochastic neighbor embedding, and association rules. Semi-supervised learning combines classification and clustering, with tools such as uClassify and GATE. Reinforcement learning covers classification and control, with methods such as Q-learning, Monte Carlo Tree Search, Temporal Difference learning, and Asynchronous Actor-Critic Agents.

A machine learning workflow, shown in Figure 2, encompasses several key stages, each crucial for creating effective and reliable models. The first stage is problem definition, which establishes whether the task is classification, regression, or clustering. The second stage is data preparation and preprocessing, which involves data cleaning, data transformation, and data splitting. The third stage is model selection and training, in which appropriate machine learning algorithms are chosen according to the type of problem and the nature of the data. After training, the model’s performance must be evaluated and validated. Visualizing the results and interpreting the model’s predictions are crucial for understanding the insights generated and communicating them to stakeholders. The final stage is model deployment and monitoring, where the model is put to use in real-world applications. This often involves integrating the model into existing systems through APIs or user interfaces, making it accessible and functional. Continuous monitoring of the model’s performance is required to detect any degradation or drift over time.

Figure 1: Different Categories and Algorithms of Machine Learning (Taye, 2023).

ECAI 2024 - International Conference on E-commerce and Artificial Intelligence

Figure 2: Workflow of the Machine Learning Procedure (Monge et al., 2021).
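The workflow stages described above can be illustrated with a minimal Python sketch. It assumes scikit-learn is available and uses synthetic data in place of real customer records; the choice of logistic regression, the 75/25 split, and the function name `run_workflow` are illustrative assumptions, not part of the reviewed studies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_workflow(seed: int = 0) -> float:
    """Walk through the workflow stages on synthetic data; return held-out accuracy."""
    # Problem definition: binary classification (a hypothetical "will respond" label).
    X, y = make_classification(
        n_samples=600, n_features=8, n_informative=5, n_redundant=0,
        n_clusters_per_class=1, class_sep=2.0, random_state=seed,
    )
    # Data preparation: split before fitting anything, so the test set stays unseen.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Model selection and training: the scaler and classifier share one pipeline,
    # so scaling statistics are learned from the training split only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    # Evaluation: score the fitted pipeline on the held-out split.
    return accuracy_score(y_test, model.predict(X_test))


test_accuracy = run_workflow()
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Deployment and monitoring are outside the scope of this sketch; in practice the fitted pipeline would be serialized and its held-out metric tracked over time to detect drift.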

2.2 Customer Segmentation in Banking

Yuping et al. explore the changing landscape of customer behavior and payment methods, especially in an omnichannel world where physical money is becoming obsolete (Yuping et al., 2020). Because of this shift, commercial banks face challenges when evaluating personal credit using traditional methods. In their personal credit evaluation procedure, the Local Outlier Factor (LOF) method is first used to identify and remove outliers from the sample data. Missing values in the original sample data, as well as those that result from removing outliers, are then filled in using a random forest model. This replaces the traditional statistical practice of filling in missing values with averages or most frequent values, thereby improving the accuracy of the predictions. Next, a gradient boosting decision tree model identifies the key variables that affect individual credit evaluation. Finally, a scorecard model built with either logistic regression or a Backpropagation (BP) neural network uses the selected key variables to produce a personal credit score. This approach compensates for inaccuracies in credit scoring that may arise from relying solely on credit data and the traditional scoring model.
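The four steps of this procedure (LOF outlier removal, random forest imputation, gradient boosting variable selection, and a logistic regression scorecard) can be sketched as follows. This is a simplified illustration on synthetic data, assuming scikit-learn; the features, the default label, the contamination rate, and the choice of two key variables are all hypothetical. Unlike the procedure described above, the sketch drops whole outlier rows rather than individual values, and it evaluates on the training data for brevity.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic stand-in for applicant data: five hypothetical numeric features.
X = rng.normal(size=(n_rows, 5))
X[:, 2] = 0.7 * X[:, 0] + 0.7 * rng.normal(size=n_rows)  # feature 2 correlates with feature 0
# Hypothetical default label driven by features 0 and 1 only.
y = (X[:, 0] + 0.8 * X[:, 1] + 0.3 * rng.normal(size=n_rows) > 0).astype(int)
X[:10] += rng.choice([-8.0, 8.0], size=(10, 5))  # inject ten gross outliers
X[rng.random(n_rows) < 0.1, 2] = np.nan          # about 10% of feature 2 is missing

# Step 1: LOF on the fully observed columns; fit_predict returns -1 for outliers.
observed = [0, 1, 3, 4]
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
keep = lof.fit_predict(X[:, observed]) == 1
X, y = X[keep], y[keep]

# Step 2: fill the missing values with a random forest trained on complete rows.
missing = np.isnan(X[:, 2])
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X[~missing][:, observed], X[~missing, 2])
X[missing, 2] = forest.predict(X[missing][:, observed])

# Step 3: rank variables with a gradient boosting decision tree model.
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
key_features = np.argsort(gbdt.feature_importances_)[::-1][:2]

# Step 4: logistic regression on the selected variables gives a default probability.
scorecard = LogisticRegression().fit(X[:, key_features], y)
default_prob = scorecard.predict_proba(X[:, key_features])[:, 1]
train_accuracy = scorecard.score(X[:, key_features], y)
print("key features:", key_features, "training accuracy:", round(train_accuracy, 3))
```

A production scorecard would map `default_prob` to a score band and validate on held-out applicants; both are omitted here.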

In examining machine learning methods for predicting banking crises, Beutel et al. note that neural networks, especially with a single hidden layer, can offer greater flexibility than traditional logit models (Beutel et al., 2019). However, they also note the risk of overfitting, particularly with more complex architectures such as deep neural networks. Whether neural networks outperform simpler models such as the logit model depends heavily on controlling this overfitting risk and managing the network’s complexity. The authors also find that in some circumstances machine learning methods do not perform better than traditional methods.

2.3 Customer Segmentation in Telecommunication

Risk management in the telecommunications industry requires efficient handling of large datasets, scalable ML models, and the ability for non-expert users to deploy and maintain these models (Ferreira et al., 2020). Specifically, Ferreira et al. focus on automating and scaling ML applications for tasks such as churn detection, event forecasting, and fraud detection. AutoML tools, particularly H2O AutoML, were considered a robust solution for developing machine learning models with minimal human intervention. H2O AutoML is noted for its ability to handle large datasets and integrate various machine learning algorithms, such as Gradient Boosting Machine (GBM), Generalized Linear Model (GLM), XGBoost, Random Forest, and Deep Learning models. Its selection by the project’s stakeholders suggests a strong endorsement of its capabilities in efficiently managing complex machine learning tasks.

To understand and segment telecom customers based on their behavior and tailor services more effectively, Sharaf Addin et al. used PCA to reduce the dimensionality of the dataset and the elbow method to determine the most suitable number of clusters before clustering (Sharaf Addin et al., 2022). K-means clustering was the primary approach employed to segment telecom customers based on demographic, behavioral, and regional attributes. Subsequently, an interactive web-based dashboard called INSIGHT was created to help telecom managers gain a thorough understanding of their customers and make more informed business decisions.
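The combination of PCA, the elbow method, and k-means described above might look like the following sketch, assuming scikit-learn and synthetic data in place of telecom records. The rule used to locate the elbow (stop at the first k where adding one more cluster lowers inertia by less than 20%) and the helper name `pick_elbow` are assumptions made for illustration; in practice the elbow is often read off a plot.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer attributes: four groups in ten dimensions.
X, _ = make_blobs(n_samples=400, n_features=10, centers=4,
                  cluster_std=1.0, random_state=0)

# Standardize, then project onto the first three principal components.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3, random_state=0).fit_transform(X_scaled)

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
ks = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced).inertia_
    for k in ks
]


def pick_elbow(ks, inertias, min_relative_gain=0.2):
    """Return the first k whose successor reduces inertia by less than the threshold."""
    for k, current, following in zip(ks, inertias, inertias[1:]):
        if (current - following) / current < min_relative_gain:
            return k
    return ks[-1]


best_k = pick_elbow(ks, inertias)
segments = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_reduced)
print("chosen k:", best_k)
```

The segment labels in `segments` could then feed a dashboard or a per-segment summary table.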

2.4 Customer Segmentation in Healthcare

K-means clustering and hierarchical clustering are identified as the most prevalent and useful techniques for understanding healthcare consumer behaviors and attitudes, which is crucial in the shift towards patient-centered care (Swenson et al., 2018). While both methods are significant, K-means clustering is highlighted as particularly useful compared with hierarchical clustering because of its ease of use, ability to handle large datasets, and widespread application in healthcare market segmentation. Its capability to form well-defined clusters based on similarity measures makes it a preferred choice for segmenting healthcare data into actionable groups, which can inform targeted healthcare services and marketing strategies.
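To make the comparison between the two techniques concrete, the sketch below runs k-means and Ward-linkage agglomerative (hierarchical) clustering on the same synthetic data and measures how closely their segment assignments agree. It assumes scikit-learn; the cluster centers and sample size are arbitrary choices, not healthcare data.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three synthetic, well-separated groups (illustrative centers).
centers = [[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.8, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Adjusted Rand index: 1.0 means the two partitions are identical up to relabeling.
agreement = adjusted_rand_score(kmeans_labels, hier_labels)
# Silhouette score: values near 1 indicate compact, well-separated clusters.
kmeans_silhouette = silhouette_score(X, kmeans_labels)
hier_silhouette = silhouette_score(X, hier_labels)
print(f"agreement={agreement:.2f}, "
      f"silhouette: k-means={kmeans_silhouette:.2f}, hierarchical={hier_silhouette:.2f}")
```

On clean data like this the two methods tend to agree; the practical difference on large datasets is cost, since agglomerative clustering works from pairwise distances while k-means only needs distances to the current centroids.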

With the advent of big data in healthcare, customized segmentation can be applied to web-based healthcare content (Guni et al., 2021). Deep learning, as used in advanced recommender systems such as YouTube’s, processes a rich set of user and content features to generate and rank candidate items. It is highlighted for its ability to handle and integrate a wide variety of content features, both high-level (semantic features such as tags, genre, and actors) and low-level (stylistic features such as colors, texture, and lighting). Deep learning is reported to outperform traditional methods by providing more accurate and personalized content recommendations, making it a powerful tool for processing large and complex datasets in the healthcare industry.

3 DISCUSSIONS

3.1 Limitations and Challenges

Interpretability is a significant challenge in target

marketing, particularly when complex machine

learning models are used. These models often

function as “black boxes”, making it difficult to

understand which factors are influencing the

outcomes. In business contexts where trust and

compliance depend on understanding the decision-

making process, a lack of transparency can be

problematic. For example, if a marketing model

predicts customer behavior but does not provide clear

reasoning behind these predictions, it may be difficult

for marketers to explain or justify the strategies to

stakeholders, potentially leading to skepticism and

reduced confidence in the model’s results.

Another challenge is the limited applicability of

specific models across different scenarios. Many

models are tailored to specific datasets or marketing

environments, which means they might not generalize

well to other cases. This limitation is especially

problematic in target marketing, where consumer

behavior can vary significantly across different

industries, regions, and time periods. The inability to

generalize can reduce the effectiveness of the model

when applied outside its initial context, requiring

extensive retraining or adjustments that can be

resource-intensive.

Privacy concerns are another significant

limitation in target marketing, particularly with the

increasing amount of customer data being used for

personalized marketing strategies. The use of

sensitive customer data raises the risk of privacy

breaches, which can lead to significant legal and

reputational consequences. If customer data is not

handled securely, there is a high likelihood of privacy

leaks, which can result in the loss of customer trust

and, subsequently, customers themselves. This

challenge underscores the need for rigorous data

protection measures and compliance with privacy

regulations, such as GDPR, to mitigate the risks associated

with customer data usage in marketing.

3.2 Future Prospects

To address the challenge of interpretability in target marketing models, future efforts could focus on integrating expert systems and advanced interpretability tools such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). These methods show how individual features affect the model’s predictions, providing an easier way to understand and explain the decisions made by complex models. SHAP, for instance, provides a unified measure of feature importance across different models, while LIME explains individual predictions by approximating the model locally. By incorporating these tools, marketers can gain greater transparency and trust in AI-driven decisions, ultimately leading to better-informed marketing strategies.
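The idea behind SHAP can be illustrated by computing exact Shapley values for a small model by brute-force enumeration. The sketch below does not use the SHAP library; it is a hand-written illustration in plain Python, and the scoring function, its three features (recency, frequency, spend), and the all-zero baseline are hypothetical. Features outside a coalition are replaced by their baseline values, which is one common convention among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating all feature coalitions."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the instance value, the rest the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                # Marginal contribution of feature i when added to this coalition.
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def predict(features):
    """Hypothetical customer score with one interaction term."""
    recency, frequency, spend = features
    return 2.0 * frequency + 0.5 * spend - 1.0 * recency + 0.1 * frequency * spend


x = [1.0, 4.0, 10.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
print(dict(zip(["recency", "frequency", "spend"], phi)))
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between frequency and spend. Enumeration costs grow as 2^n in the number of features, which is why practical tools rely on approximations or model-specific shortcuts.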

To overcome the limitations of applicability, transfer learning and domain adaptation techniques could be employed. Transfer learning enables models to carry knowledge learned in one domain over to another, lessening the need for substantial retraining when models are applied to new datasets or scenarios. Domain adaptation, on the other hand, focuses on adjusting models to perform well in different but related domains. These techniques can significantly improve the generalizability of target marketing models, allowing them to be more flexible and effective across various contexts and industries. This would reduce the resource-intensive process of building new models from scratch for different applications.

Privacy concerns in target marketing can be addressed by adopting federated learning approaches. Federated learning makes it possible to train models across multiple decentralized devices or servers that hold local data samples, without exchanging the data itself. Because customer data remain on the devices where they were collected, the risk of privacy breaches is reduced. Federated learning can also support compliance with stringent data privacy regulations, such as GDPR, by enabling secure, decentralized learning processes. By implementing federated learning, organizations can enhance customer trust while still benefiting from personalized marketing strategies that leverage large-scale data insights.
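A minimal sketch of federated averaging, one common federated learning scheme, is given below using NumPy only. Three simulated clients each fit a linear model on their own synthetic data, and the server combines only the resulting weights, weighted by client dataset size. The client sizes, learning rate, number of rounds, and the linear model itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # ground-truth weights of the synthetic task


def make_client(n_rows):
    """Create one client's private dataset; it is never sent to the server."""
    X = rng.normal(size=(n_rows, 3))
    y = X @ true_w + 0.1 * rng.normal(size=n_rows)
    return X, y


clients = [make_client(n) for n in (50, 80, 120)]


def local_update(w, X, y, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on mean squared error using local data only."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


global_w = np.zeros(3)
sizes = np.array([len(y) for _, y in clients], dtype=float)
for _ in range(30):
    # Each client starts from the current global weights and trains locally.
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    # The server sees only the weights and averages them by dataset size.
    global_w = np.average(local_ws, axis=0, weights=sizes)

print("federated estimate:", np.round(global_w, 3))
```

Only model weights cross the client boundary in this sketch. Real deployments typically add protections such as secure aggregation or differential privacy, since shared weights can still leak information about the underlying data.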

4 CONCLUSIONS

The investigation into the latest trends in customer

segmentation reaffirms the critical role that machine

learning plays in enhancing targeted marketing

strategies. As highlighted in the introduction, the

transition from traditional methods to advanced

algorithms offers businesses a competitive edge in

understanding customer behavior and optimizing

marketing efforts. This review’s main contribution

lies in synthesizing the current state of machine

learning applications across various industries,

providing a comprehensive analysis of their strengths

and challenges. By examining case studies from

banking, telecommunications, and healthcare, this

paper demonstrates the effectiveness of machine

learning models like k-means clustering, automated

machine learning, decision trees, and neural networks

in improving segmentation accuracy and customer

insights. However, the review also identifies

significant limitations, including the challenges of

model interpretability, domain applicability, and

privacy concerns. Addressing these issues will

require future research focused on integrating

interpretability tools like SHAP and LIME, exploring

transfer learning and domain adaptation, and adopting

federated learning to enhance privacy. These

advancements are essential for ensuring that machine

learning continues to provide valuable and

trustworthy insights in customer segmentation.

REFERENCES

Beutel, J., List, S., & von Schweinitz, G. 2019. Does machine learning help us predict banking crises? Journal of Financial Stability, 45, 100693.

Ferreira, L., Pilastri, A. L., Martins, C., Santos, P., & Cortez, P. 2020. An Automated and Distributed Machine Learning Framework for Telecommunications Risk Management. In ICAART (2) (pp. 99-107).

Guni, A., Normahani, P., Davies, A., & Jaffer, U. 2021. Harnessing machine learning to personalize web-based health care content. Journal of Medical Internet Research, 23(10), e25497.

Monge, M., Quesada-López, C., Martínez, A., & Jenkins, M. 2021. Data mining and machine learning techniques for bank customers segmentation: A systematic mapping study. In Intelligent Systems and Applications: Proceedings of the 2020 Intelligent Systems Conference (IntelliSys) Volume 2 (pp. 666-684). Springer International Publishing.

Monil, P., Darshan, P., Jecky, R., Vimarsh, C., & Bhatt, B. R. 2020. Customer segmentation using machine learning. International Journal for Research in Applied Science and Engineering Technology (IJRASET), 8(6), 2104-2108.

Narayana, V. L., Sirisha, S., Divya, G., Pooja, N. L. S., & Nouf, S. A. 2022. Mall customer segmentation using machine learning. In 2022 International Conference on Electronics and Renewable Systems (ICEARS) (pp. 1280-1288). IEEE.

Sharaf Addin, E. H., Admodisastro, N., Mohd Ashri, S. N. S., Kamaruddin, A., & Chong, Y. C. 2022. Customer Mobile Behavioral Segmentation and Analysis in Telecom Using Machine Learning. Applied Artificial Intelligence, 36(1).

Swenson, E. R., Bastian, N. D., & Nembhard, H. B. 2018. Healthcare market segmentation and data mining: A systematic review. Health Marketing Quarterly, 35(3), 186-208.

Taye, M. M. 2023. Understanding of machine learning with deep learning: architectures, workflow, applications and future directions. Computers, 12(5), 91.

Turkmen, B. 2022. Customer Segmentation with machine learning for online retail industry. The European Journal of Social & Behavioural Sciences.

Yadegaridehkordi, E., Nilashi, M., Nasir, M. H. N. B. M., Momtazi, S., Samad, S., Supriyanto, E., & Ghabban, F. 2021. Customers segmentation in eco-friendly hotels using multi-criteria and machine learning techniques. Technology in Society, 65, 101528.

Yuping, Z., Jílková, P., Guanyu, C., & Weisl, D. 2020. New methods of customer segmentation and individual credit evaluation based on machine learning. In “New Silk Road: Business Cooperation and Prospective of Economic Development” (NSRBCPED 2019) (pp. 925-931). Atlantis Press.
