
• Lead scoring: after cleaning the data and segmenting it into interest groups, the score can be defined. The available historical data will be used, considering the activity and behavior of each lead and customer (leads already converted and segmented in the Segmentation step), as well as their profiles (as defined by the company), to assign an initial score. The best ML model will be used to define the score;
• Calibration: a critical step of the entire process, in which contacts will be revisited according to the score obtained in the previous Lead Scoring stage. Those who have shown some change in behavior (online or offline) will receive an appropriate score, and the results obtained in the sentiment analysis will also be incorporated. In this case, reviews with 1 to 3 stars are classified as negative, while those with 4 or 5 stars are classified as positive;
• Evaluation of results: this step consists of evaluating whether the score obtained is sufficient for the contact to be considered a qualified lead. Partial results will be evaluated against the cutoff score (the minimum score defined for the model); in practice, reaching the minimum score indicates that the leads in question are ready to be sent to the sales team.
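The calibration and evaluation rules above can be sketched as follows. The star-to-sentiment mapping comes directly from the text; the score adjustment (±0.1) and the cutoff value (0.5) are illustrative assumptions, not values from this work:

```python
from typing import Optional

def star_sentiment(stars: int) -> str:
    """Reviews with 1-3 stars are negative; 4 or 5 stars are positive."""
    return "negative" if stars <= 3 else "positive"

def calibrate(score: float, review_stars: Optional[int]) -> float:
    """Adjust a lead's score with its review sentiment (weights assumed)."""
    if review_stars is None:
        return score
    delta = 0.1 if star_sentiment(review_stars) == "positive" else -0.1
    return score + delta

def is_qualified(score: float, cutoff: float = 0.5) -> bool:
    """A lead reaching the cutoff score is ready for the sales team."""
    return score >= cutoff

print(star_sentiment(2))                  # negative
print(is_qualified(calibrate(0.45, 5)))   # True under the assumed weights
```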
For this study, we selected a public dataset provided by the Kaggle community (Kaggle, 2024), containing 9,240 records and 37 attributes related to lead behavior and the profile of a fictitious education-focused company. This dataset was chosen for being reasonably balanced between users who became customers (3,561) and those who did not convert (5,679), in addition to already containing the leads' behavioral history. These factors enable a more accurate analysis of the experimental results.
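As a quick sanity check, the reported counts correspond to roughly a 60/40 split between non-converted and converted leads:

```python
# Class balance of the Kaggle dataset, using the counts stated in the text.
total, converted, not_converted = 9240, 3561, 5679
assert converted + not_converted == total
print(f"not converted: {not_converted / total:.1%}")  # 61.5%
print(f"converted:     {converted / total:.1%}")      # 38.5%
```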
Since the calibration step requires sentiment analysis, it is important to define the best strategy to perform it. We therefore investigated how machine learning models and an artificial intelligence-based algorithm (ChatGPT) can be applied to identify sentiment in reviews posted by social network users. The goal is, first, to identify the best machine learning model for this context, and then to investigate whether it is worth using a model trained specifically on the texts under analysis or whether it is better to use the generic ChatGPT model. Details and results of this study are in Subsection 4.4.
4.1 Predictive Model
Initially, the pre-processing step was performed (see Figure 2). The first analysis focused on the balance between the classes "converted leads" and "unconverted leads". Although the data were not fully equalized in proportion, we understand that they represented the real conversion rate, with a proportion of approximately 60% (not converted) to 40% (converted). It was then necessary to clean the dataset, as it contained many blank or null values. For this reason, some variables were removed from the dataset; the criterion adopted was to exclude variables with more than 50% null values. In this first stage the dimension was reduced from 37 to 22 attributes. It is worth mentioning that, among the attributes that remained, two represent the company's expertise and were previously evaluated based on the lead's activities. Continuing with a more careful analysis, some outliers were excluded, as they could distort the results of the statistical analyses; it is therefore important to identify and treat them appropriately (Mitchell, 1997). With this step, the Pre-Processing stage of the solution was concluded.
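The pre-processing rules above can be sketched with pandas. The 50%-null criterion comes from the text; the 1.5×IQR outlier rule is an assumption, since the paper does not state which outlier criterion was used:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only columns with at most 50% missing values (criterion from the text).
    df = df.loc[:, df.isna().mean() <= 0.5]
    # Drop rows outside 1.5*IQR on each numeric column (assumed outlier rule).
    for col in df.select_dtypes(include=np.number):
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return df

demo = pd.DataFrame({
    "mostly_null": [None, None, None, 1.0],  # 75% null -> column dropped
    "visits": [1, 2, 3, 100],                # 100 is an outlier -> row dropped
})
clean = preprocess(demo)
print(clean.columns.tolist())  # ['visits']
```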
In the segmentation stage (Figure 2), the dataset was divided into two parts, one referring to converted leads and the other to non-converted leads. Using this approach, it was possible to identify some interesting behaviors and patterns. For example, both converted and non-converted leads come from the same source: Google. Also, in most cases, the last recorded activity of converted leads was sending an SMS, while for non-converters it was sending an email. These are some examples of the observed behavior.
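This segmentation amounts to splitting the dataset by conversion status and comparing the most frequent values in each group. A minimal sketch on toy data follows; the column names ("Converted", "Lead Source", "Last Activity") match the common Kaggle lead-scoring dataset but are assumptions here:

```python
import pandas as pd

# Toy data mimicking the patterns described in the text.
df = pd.DataFrame({
    "Converted": [1, 1, 0, 0, 0],
    "Lead Source": ["Google", "Google", "Google", "Direct", "Google"],
    "Last Activity": ["SMS Sent", "SMS Sent", "Email Sent", "Email Sent", "Email Sent"],
})

# Segment into converted and non-converted leads.
converted = df[df["Converted"] == 1]
not_converted = df[df["Converted"] == 0]

# Most frequent value per group reveals the behavioral patterns.
print(converted["Lead Source"].mode()[0])        # Google
print(not_converted["Last Activity"].mode()[0])  # Email Sent
```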
After cleaning the data and segmenting it into groups, the score could be defined in the Lead Scoring stage of the process. To this end, it was necessary to choose a Machine Learning (ML) model appropriate for the database in use, which was Logistic Regression. The choice was made due to its simplicity of application and the success stories observed in similar situations reported in the literature (Jadli et al., 2022; Yadavilli and Seshadri, 2021). The module used to build the model was LogisticRegression from the Scikit-learn library (Scikit-learn, 2024b).
For model training, the dataset was divided into two sets: training data (70%) and test data (30%). After evaluating the first training results, it was observed that the excess of variables to be analyzed harmed the results. Therefore, Recursive Feature Elimination (RFE) (Scikit-learn, 2024c) was used to assist in choosing the most important variables for the final model, with the 15 best ranked by the method being selected. As a last step, the p-values were analyzed, and the attributes with a p-value > 0.05 were eliminated.
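The modeling steps described above can be sketched with Scikit-learn on synthetic data (the real dataset is not reproduced here): a 70/30 train/test split, RFE keeping the 15 best-ranked features, and a Logistic Regression classifier. The p-value filter mentioned in the text would require a statistics package such as statsmodels and is omitted from this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 22-attribute dataset after pre-processing.
X, y = make_classification(n_samples=1000, n_features=22,
                           n_informative=10, random_state=42)

# 70% training / 30% test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# RFE ranks features with a Logistic Regression and keeps the 15 best.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15)
selector.fit(X_train, y_train)

# Fit the final model on the selected features only.
model = LogisticRegression(max_iter=1000)
model.fit(selector.transform(X_train), y_train)
acc = model.score(selector.transform(X_test), y_test)
print(f"kept {selector.n_features_} features, test accuracy {acc:.2f}")
```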
The results obtained when we applied the model
to the test data revealed an excellent specificity of
ICEIS 2025 - 27th International Conference on Enterprise Information Systems
460