Suggesting Product Prices in Automotive E-Commerce: A Study

Assessing Regression Models and Explicability

Andr

e Gomes Regino

1 a

, Gilson Yuuji Shimizu

1 b

, Fernando Rezende Zagatti

1,4 c

Filipe Loyola Lopes

1 d

, Rodrigo Bonacin

1 e

, Julio Cesar Dos Reis

2 f

and

Cristina Dutra de Aguiar

3 g

DIMEC, Center for Technology Information Renato Archer, Campinas, Brazil

Institute of Computing, UNICAMP, Campinas, Brazil

Institute of Mathematical and Computer Sciences, USP, S

ao Carlos, Brazil

Department of Computing, UFScar, S

ao Carlos, Brazil

ﬂ

Keywords:

Pricing, E-Commerce Analysis, Machine Learning.

Abstract:

E-commerce pricing may involve complex processes, including various factors such as cost, perceived value,

and market demand. Exploring machine learning (ML) for informing pricing in the automotive sector presents

signiﬁcant open research challenges that require innovative solutions. This investigation examines a real-

world Brazilian e-commerce dataset to train, test, and compare several state-of-the-art regression models to

understand their applicability. Our study originally includes how SHapley Additive exPlanations (SHAP) help

to interpret the most inﬂuential features for price prediction. Results indicate that Light GBM and XGBoost

performed best, combining high predictive accuracy with computational efﬁciency, and reveal features such

as product weight, stock levels, and physical dimensions as the most inﬂuential on ﬁnal pricing. This study

outcome paves the way for novel data-driven pricing strategies in Brazilian automotive e-commerce.

1 INTRODUCTION

The advancement of technology and the widespread

adoption of e-commerce have increased the volume

and variety of data generated in retail environments.

Unlike in-person shopping, online platforms enable

simultaneous interactions with numerous customers

while facilitating the systematic collection and stor-

age of consumer information. This transformation

has expanded data availability and reshaped sev-

eral business processes, from customer acquisition

(Patel, 2023) to fraud detection (Mutemi and Ba-

cao, 2024), demanding intelligent tools to support

decision-making and operational efﬁciency.

Pricing in e-commerce is a strategic decision that

directly affects a company’s proﬁtability and compet-

https://orcid.org/0000-0001-9814-1482

https://orcid.org/0000-0003-3711-5592

https://orcid.org/0000-0002-7083-5789

https://orcid.org/0000-0002-4172-6532

https://orcid.org/0000-0003-3441-0887

https://orcid.org/0000-0002-9545-2098

https://orcid.org/0000-0002-7618-1405

itiveness (Kotler and Keller, 2022). Even small price

changes in high-volume retail can impact overall rev-

enue, especially in a low-margin market. Mispricing

can result in ﬁnancial losses and, in extreme cases,

business failure (Xuming, 2024). Effective pricing

strategies aim to maximize revenue, increase market

share, and enhance customer satisfaction, balancing

production costs, consumer demand, and competition.

Traditional pricing methods like mark-up or

competition-based pricing are often inadequate in

digital commerce scenarios where pricing dynamics

are complex. Online markets demand real-time price

adjustments informed by large-scale and heteroge-

neous datasets. These may include historical pricing

trends, user behavior patterns, and competitor pro-

motions. Dynamic pricing, powered by algorithmic

models, addresses these needs (El Youbi et al., 2023)

but poses challenges like data integration, continuous

competitor tracking, and adapting to fast-changing

market conditions.

Research into data-driven pricing models is es-

sential for competitiveness and ﬁnancial sustainabil-

ity in digital markets. In the same direction, Artiﬁcial

Regino, A. G., Shimizu, G. Y., Zagatti, F. R., Lopes, F. L., Bonacin, R., Reis, J. C. and Dutra de Aguiar, C.

Suggesting Product Prices in Automotive E-Commerce: A Study Assessing Regression Models and Explicability.

DOI: 10.5220/0013830400004000

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2025) - Volume 1: KDIR, pages 147-158

147

Intelligence (AI) and Machine Learning (ML) help

businesses analyze large volumes of data, detect de-

mand patterns, adjust prices in real-time, and antici-

pate market trends (El Youbi et al., 2023). These tech-

nologies enable more dynamic and data-driven pric-

ing strategies (Aparicio and Misra, 2023).

Automatically determining the optimal price re-

mains complex and challenging due to factors like

seasonality, competition, and production costs. These

variables are often interdependent and can change

rapidly, introducing high levels of volatility and un-

certainty into pricing strategies. As a result, compa-

nies struggle to understand the rationale behind pric-

ing recommendations, which can hinder trust in au-

tomated systems and limit their adoption in dynamic

markets.

This article investigates regression models for the

pricing task. These models offer a practical balance

between simplicity, precision, and performance. They

may enable interpretation of how input variables –

such as product weight, dimensions, and stock levels–

affect pricing outcomes. In particular, our investiga-

tion considers the Brazilian automotive sector as the

context of our data-driven pricing models. This con-

text presents high production costs, ﬂuctuating de-

mands, and a strong inﬂuence on perceived values.

Such context was poorly studied in the literature and

motivates novel analyses and contributions.

Our investigation introduces the following contri-

butions:

• Original analyses leveraging real-world dataset

from the Brazilian automotive e-commerce sector.

• Evaluation of nine well-studied ML models with

default and ﬁne-tuned hyperparameters, includ-

ing linear, tree-based, and neural network models:

Linear Regression, Random Forest, Lasso, Ridge,

Support Vector Machine (SVM), XGBoost, Light

GBM, Long Short Term Memory (LSTM), and

Feed Forward.

• Interpretation of the model predictions using

SHapley Additive exPlanations (SHAP) (Lund-

berg and Lee, 2017), chosen for its ability to pro-

vide consistent local explanations. SHAP high-

lights the most inﬂuential features in pricing,

making it easier to understand how each variable

impacts predictions in the automotive sector.

Our study found that LightGBM and XGBoost

offered the best balance between accuracy and efﬁ-

ciency for price prediction tasks. Hyperparameter op-

timisation further enhanced model performance, con-

sistently reducing prediction errors. SHAP analysis

revealed that key factors inﬂuencing product price in-

clude weight, stock quantity, sales volume, and phys-

ical dimensions.

The remaining of this article is organized as fol-

lows. Section 2 reviews related work. Section 3 de-

scribes the overall methodology, including data col-

lection, exploratory analysis, preprocessing, model

development, training, and assessment. Section 4

presents the results whereas Section 5 discusses them.

Section 6 concludes the article and highlights future

investigations.

2 RELATED WORK

The study presented by (Bhaskar et al., 2022) ad-

dresses the issue of consumer deception in the used

car market through price manipulation. To mitigate

this problem, three regression models are developed

to predict the selling price of used vehicles based on

features such as listed price and mileage. The mod-

els evaluated include Linear Regression, Lasso Re-

gression, and an ensemble approach combining both.

The dataset, obtained from Kaggle, underwent pre-

processing steps like removing missing values and

categorical encoding. Among the evaluated models,

the ensemble regression achieved the highest predic-

tive accuracy (94%), indicating its effectiveness in es-

timating used car prices. In contrast to this approach,

our study explores the pricing problem within a dif-

ferent context—namely, the Brazilian automotive e-

commerce sector—where pricing dynamics are inﬂu-

enced by additional factors such as inventory levels,

product dimensions, and temporal variables like the

date of sale. We expand the methodological scope

by evaluating nine distinct regression models, encom-

passing linear, tree-based, and neural network mod-

els. While both studies share the goal of price pre-

diction in the automotive sector, our work addresses a

broader, more complex set of variable.

(Chowdhury et al., 2024) explored the use of su-

pervised ML models to optimize pricing strategies in

e-commerce, with a focus on predicting customer sat-

isfaction. Using a dataset with features such as histor-

ical prices, customer demographics, and transaction

data, the authors compare the performance of Linear

Regression, Decision Trees, Random Forest, SVM,

and Neural Networks models, to predict customer sat-

isfaction based on price prediction. The main con-

tribution of the study is in the comparative evalua-

tion of the models regarding their predictive capac-

ity, where Neural Networks presented the best overall

performance. However, the high computational cost

of the Neural Network may be a barrier to practical

use. Random Forest emerged as a viable alternative,

balancing accuracy (MAE 0.130, R

0.82) with inter-

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

148

pretability and lower resource demands. In contrast to

this general-purpose approach, our study focuses on

pricing in the Brazilian automotive e-commerce sec-

tor.

The work described by (Akash et al., 2024) inves-

tigated the price of the “Toyota” cars based on the cus-

tomer’s requirement. The dataset, obtained from Kag-

gle, contained 1442 examples. ML training used 80%

of the data, while 20% was used for testing. Ridge

was used to predict the sales price, and the model

showed results with almost 93% accuracy. In contrast,

our study focuses on the automotive parts and materi-

als, instead of the whole car. Furthermore, we are not

limited to just one ML model and provide model ex-

plainability, highlighting the most impactful features.

(Das et al., 2024) investigated real-time dynamic

pricing strategies in the context of retail and e-

commerce, benchmarking Linear Regression, Ran-

dom Forest, and Gradient Boosting Machines (GBM)

algorithms to predict optimal prices. The dataset,

composed of records from different sectors (e.g., elec-

tronics and apparel), included variables such as price

history, competitor prices, promotional status, inven-

tory levels, and consumer demographics. The results

indicated that GBM achieved the best performance in

all evaluated metrics, reaching MAE of 1.73, RMSE

of 2.01, and R² of 0.94, demonstrating a greater abil-

ity to capture nonlinear patterns and complex interac-

tions between variables. In addition to predictive eval-

uation, the authors highlight the relevance of attribute

engineering, class balancing, and simulation of real

scenarios to validate the applicability of the proposed

strategies. In contrast, our work focuses speciﬁcally

on the automotive domain and also seeks to promote

the explainability of the best model used.

Table 1 compares the related studies regarding the

models employed. To the best of our knowledge, our

study is the ﬁrst to investigate how these models be-

have experimentally and are relevant to apply to the

Brazilian automotive e-commerce context.

3 INVESTIGATING PRODUCT

PRICES THROUGH

REGRESSION METHODS

This section describes the methodology employed to

investigate the problem of product prices in Brazilian

automotive e-commerce. The methodology is struc-

tured into six main stages: data acquisition (Sec-

tion 3.1), exploratory analysis and cleaning (Section

3.2), model selection (Section 3.3), algorithm exe-

cution (Section 3.4), result evaluation (Section 3.5),

and model explainability (Section 3.6) as illustrated

in Figure 1.

3.1 Data Collection

We employed a dataset provided by GoBots

, a

Brazilian startup offering AI-based services for e-

commerce platforms. The dataset contains detailed

information about products, pricing, and sales history.

Data extraction was performed using MongoDB.

Filters: To ensure data quality and relevance, we

applied the following ﬁlters:

1. Sample Size: A maximum of 100, 000 products

were selected to ensure analytical feasibility while

maintaining representative diversity. This sample

was drawn from a larger pool of 4.5 million prod-

ucts.

2. Time Range: Data spanned from January to De-

cember 2023, covering seasonal events such as

Black Friday.

3. Automotive Sector: Chosen based on GoBots’ ex-

pertise and the strategic importance of automotive

parts within e-commerce, a segment associated

with heightened consumer urgency. Observations

indicate that consumers searching for automotive

products tend to exhibit stronger purchase intent

and greater immediacy than in other sectors. For

instance, a customer searching for a replacement

tire must substitute a worn-out tire to maintain the

vehicle’s operability promptly.

4. Status: Only products sold and active (i.e., not out

of stock) were considered.

Collected Fields: We collected the following at-

tributes (summarized in Table 2):

• orderCreated: A timestamp indicating when the

purchase was registered. This temporal ﬁeld en-

ables derivation of features such as month, week

of the year, or even position within the month

(e.g., begin, end). These derived features allow

the models to capture temporal price variation pat-

terns, such as consumer behavior around salary

periods or promotional events like Black Friday

and Christmas.

• initialQuantity: Represents the initial inventory

available at the start of a product’s sales cycle.

This value serves as a proxy for the product’s sup-

ply conditions and potential market expectations.

A higher initial quantity may indicate widespread

availability or popularity, which can inﬂuence

pricing strategies;

Available at: https://gobots.ai/

Suggesting Product Prices in Automotive E-Commerce: A Study Assessing Regression Models and Explicability

149

Table 1: Comparison of our work and related studies. The ML models considered are: Linear Regression (LR); Decision

Tree (DT); Lasso (La); Ridge (Ri); Random Forest (RF); Gradient Boosting Machines (GBM); XGBoost (XG); LightGBM

(Li); Support Vector Machine (SVM); Feedforward Neural Network (FF); and Long Short-Term Memory Neural Network

(LSTM). The ✓symbol indicates the ML models addressed by each study.

Study Domain LR DT La Ri RF GBM XG Li SVM FF LSTM

(Bhaskar et al., 2022) Automobilistic ✓ ✓

(Chowdhury et al., 2024) Not deﬁned ✓ ✓ ✓ ✓ ✓ ✓

(Akash et al., 2024) Automobilistic ✓

(Das et al., 2024) Not deﬁned ✓ ✓ ✓

This study Automobilistic ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Figure 1: Methodology: A) Data Collection; B) Data Analysis and Cleaning; C) Model Selection; D) Algorithm Execution;

E) Evaluation; F) Explainability.

• availableQuantity: Number of units in stock dur-

ing data collection. This dynamic attribute pro-

vides insights into stock pressure. A low value

suggests scarcity, justifying higher prices, while

a high value may indicate overstock, encouraging

discounts to boost demand.

• soldQuantity: Cumulative number of units sold.

This variable indicates product demand. High

sales volumes may be associated with stronger

pricing power, while low sales could trigger price

reductions or promotional activity.

• domain: Denotes the market vertical of the prod-

uct. In this study, the dataset is constrained to

the automotive domain, where consumers are of-

ten driven by necessity and urgency, which can

inﬂuence willingness to pay.

• category: Indicates the product subcategory (e.g.,

“brake light”, “air ﬁlter”). Different categories

may reﬂect different levels of urgency, elasticity,

and competition. Safety-critical parts may com-

mand premium prices and faster purchase deci-

sions, while others are more discretionary.

• condition: Describes the state of the product (e.g.,

new, used, refurbished). Product condition di-

rectly impacts consumer expectations and accept-

able pricing. New items generally have higher

perceived value, whereas used items may require

price adjustments to remain competitive.

• weight and dimensions: The logistical attributes

(width, height, length and weight) can indirectly

affect pricing, especially when shipping costs or

storage constraints are factored into the pricing

strategy.

• price: The target variable for the regression mod-

els. It reﬂects the ﬁnal listed price at the time of

sale and encapsulates the combined effect of all

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

150

Table 2: Summary of the Collected Fields.

Field Data Type Example

orderCreated datetime "2024-06-07T23:59:59.000Z"

initialQuantity number 1392

availableQuantity number 86

soldQuantity number 1306

domain string "MLB-AUTOMOTIVE TIRES"

category string "Luz de Freio"

condition string "new"

weight number 47.0

height number 2.0

width number 14.0

length number 19.0

price number 105

other features. The primary objective of the pre-

dictive model is to estimate this value based on the

available product and contextual attributes.

3.2 Exploratory Analysis and Data

Cleaning

Initial Exploratory Analysis: After data collection,

we performed an initial exploratory analysis by exam-

ining the numerical features using descriptive statis-

tics. Table 3 summarizes these statistics.

The variable price revealed great variability,

with prices ranging from R$5 to approximately

R$109,000. The variable initialQuantity, which rep-

resents the initial quantity of products in stock, pre-

sented an average of 2, 372.88 units and a high stan-

dard deviation (6, 503.62), indicating wide variation

in stock capacity among products and retailers. The

same dispersion pattern was observed in available-

Quantity and soldQuantity, which varied from 1 to

99, 999 available products (average of 873.56) and

from 0 to 45, 056 products sold (average of 1, 500.13),

respectively.

Next, we investigated missing values. About

27.1% of the rows had null values in the ﬁelds weight,

height, width, and length. These rows were removed

from the dataset. The analysis of unique values was

applied to the categorical variables. We observed

that the variable domain contained 329 unique val-

ues, category had 375 distinct categories, and con-

dition presented only three variations. Next, a cor-

relation analysis, based on Pearson’s correlation co-

efﬁcient, was constructed among the numerical vari-

ables

. We identiﬁed a strong positive correlation

between the variables initialQuantity and available-

Quantity (coefﬁcient of 0.91), indicating that these

variables tend to increase or decrease together. We

also observed a moderate correlation between ini-

tialQuantity and soldQuantity (0.51), suggesting that

products with larger initial stock tend to have higher

Available at: https://github.com/andreregino/pricing/

blob/main/matriz-correlacao.png

sales volumes. Further, the variables dimension and

weight exhibited moderate correlations both among

themselves and with the target variable price, support-

ing their inclusion as relevant features in the predic-

tive modeling process. The diversity of categories is

illustrated in Figure 2, which shows a word cloud fea-

turing the most frequent terms in the category vari-

able, including items such as “tires”, “ﬁlters” and

“speakers”.

Figure 2: Category word cloud.

Finally, we examined the behavior of the variable

price, which revealed a pronounced skewness. The

calculated skewness value was 2003.62, suggesting

the presence of a long right tail in the distribution.

This observation is supported by the descriptive statis-

tics presented in Table 3, which indicate a mean of

R$ 257.34 contrasted with a signiﬁcantly higher max-

imum value of R$ 109,097.10. Such skewness high-

lights the need for standardization to prevent extreme

values from negatively impacting the performance of

certain machine learning algorithms.

Data Cleaning and Transformation: We exe-

cuted the data cleaning and transformation, focusing

on preparing data for predictive modeling.

• Timestamp transformation: The orderCreated

timestamp was split into components (month, day,

weekday, week, and hour), enabling ﬁner-grained

temporal analysis;

• Categorical encoding: All categorical columns

were one-hot encoded. In particular, the category

and domain ﬁelds generated numerous new binary

columns due to their high cardinality. However,

given the dataset size (100, 000 rows), the added

dimensionality did not introduce sparsity-related

issues;

• Numeric standardization: All numerical features

were standardized using z-score normalization.

This standardization adjusts the data to have a

mean of zero and a standard deviation of one, con-

tributing to the performance and convergence of

various machine learning models employed dur-

ing the predictive modeling phase.

Suggesting Product Prices in Automotive E-Commerce: A Study Assessing Regression Models and Explicability

151

Table 3: Descriptive Statistics for the Numeric Features.

Variable Mean Std Min 25% 50% 75% Max

initialQuantity 2,372.88 6,503.62 1.00 81.00 301.00 1,536.00 101,401.00

availableQuantity 873.56 4,653.38 1.00 9.00 345.00 1,520.00 99,999.00

soldQuantity 1,500.13 4,347.25 1.00 26.00 125.00 713.00 45,056.00

weight 3,148.53 7,476.56 1.00 340.00 840.00 2,295.00 900,000.00

height 15.86 25.24 1.00 7.00 11.00 18.00 419.00

width 24.20 15.51 1.00 14.00 20.00 29.00 348.00

length 41.91 42.04 1.00 19.00 30.00 47.00 2,720.00

price 257.34 354.10 5.53 70.12 155.24 309.53 109,097.10

3.3 Techniques and Model Choice

The techniques were selected according to a previ-

ous literature review on dynamic pricing and the ap-

plication of machine learning in this context. The

following ML techniques were selected for evalua-

tion: Linear Regression, Lasso Regression, Ridge

Regression, Support Vector Machines (SVM), XG-

Boost, LightGBM, Feedforward Neural Networks,

and LSTM Neural Network. Cross-validation tech-

niques were used to ensure robust and generalizable

models.

3.4 Algorithm Execution

The next step involved developing the models using

popular ML frameworks (e.g. Scikit-learn

and Ten-

sorFlow

). Two executions were performed for each

ML technique: the baseline model training (using

the default hyperparameters provided by the libraries)

and the tuned model training (hyperparameters from

grid search).

The dataset was split into training, validation, and

test sets using a 70/15/15 ratio. All experiments were

carried out on Google Colab

, using a cloud server

with the following conﬁguration: 100 GB storage, 12

GB system RAM, and 15 GB GPU RAM (T4 GPU).

Further details regarding model conﬁgurations, hy-

perparameters, and experimental setup are provided

in Section 4.

3.5 Evaluation

In this step, we evaluated the models’ results. Efﬁ-

ciency metrics such as MSE, RMSE, MAE, and R

were employed. These metrics allow for assessing the

model’s accuracy in terms of the difference between

the predicted price and the actual price. We further

Available at: https://scikit-learn.org/

Available at: https://www.tensorﬂow.org/

Available at: https://colab.research.google.com/

measured the total training and prediction times —

i.e., the time required to run the models on the training

and test sets. These metrics are important for assess-

ing the practical feasibility of the models, particularly

in pricing systems that require fast responses for large

volumes of data.

3.6 Explainability

Once the best model and conﬁguration were identi-

ﬁed, the SHAP explainability algorithm was applied

to the test set to understand which product features

most signiﬁcantly contribute to the ﬁnal price.

4 RESULTS

This section presents our outcomes. Section 4.1 dis-

cusses the results of the baseline models, i.e., mod-

els trained using default hyperparameters. Section 4.2

presents the results of the reﬁned models, where hy-

perparameters were tuned. Section 4.3 describes the

results related to the model’s performance, focusing

on the time required to execute them. Section 4.4 ad-

dresses aspects related to model explainability.

4.1 Baseline Models

Table 4 presents the obtained results sorted in ascend-

ing order based on the R

values for the test set. These

values provide an initial assessment of each model’s

predictive performance.

The Lasso regression model yielded the poorest

results, with a R

of 0% on the test set. This suggests

that when trained using default hyperparameters, the

model could not learn meaningful patterns from the

data and failed to generalize appropriately.

The LSTM and XGBoost models presented com-

petitive effectiveness. The LSTM model achieved an

MAE of 0.264 and an R

of 64.12% on the test set.

Similarly, the XGBoost model showed strong results,

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

152

Table 4: Evaluation of Results for the Baseline Models. Training and testing times were measured in seconds. The best results

for each metric are shown in bold. SVM values are not reported in the table as the model did not complete execution within a

reasonable time (< 9,999 seconds). The values are displayed in ascending order by R

Model Validation Test Time (s)

MAE MSE RMSE R² MAE MSE RMSE R² Train Test

SVM - - - - - - - - - -

Lasso 0.570 0.845 0.919 0.00% 0.583 1.030 1.014 0.00% 2.040 0.089

Ridge 0.354 0.431 0.656 49.03% 0.363 0.557 0.747 45.88% 1.826 0.089

Linear Regression 0.351 0.428 0.654 48.90% 0.362 0.497 0.705 48.50% 9.400 0.159

Feed Forward 0.290 0.315 0.561 62.43% 0.299 0.378 0.615 60.78% 215.345 1.510

Light GBM 0.274 0.273 0.523 67.68% 0.278 0.399 0.632 61.24% 3.093 0.181

XGBoost 0.260 0.258 0.508 69.45% 0.264 0.370 0.608 64.12% 35.503 1.017

LSTM 0.279 0.280 0.529 66.58% 0.287 0.346 0.588 64.13% 176.017 1.466

Random Forest 0.207 0.222 0.471 73.78% 0.212 0.334 0.578 67.54% 734.724 0.851

followed closely by the LightGBM, which also per-

formed well under default settings. These models

stand out due to their ability to model complex, non-

linear relationships and temporal dependencies.

The Feedforward Neural Network, while outper-

forming the linear models in terms of MAE, still fell

short of the results exhibited by the tree-based mod-

els. This result reﬂects the model’s capacity to han-

dle non-linearities, with some limitations compared

to ensemble-based approaches.

Both Linear Regression and Ridge Regression

presented similar outcomes. As relatively inter-

pretable and straightforward models, they serve as

useful baselines, yet their inability to capture non-

linear patterns limits their effectiveness. It is worth

noting that the SVM model could not be evaluated

due to excessive training time. The model exceeded

9, 999 seconds without completing the training phase,

prompting the decision to abort its execution to ensure

computational feasibility. The Random Forest model

exhibited the longest training time among all models

whose execution duration did not exceed 9, 999 sec-

onds.

4.2 Tuned Models

Table 5 presents the obtained results from the re-

gression models after hyperparameter tuning. We

chose the best hyperparameter conﬁgurations

for

each model using a grid search strategy. The evalu-

ation was conducted on both the validation and test

sets using the same performance metrics employed in

the baseline evaluation, namely MAE, MSE, RMSE,

and R

The tuned models showed notable improvements,

signiﬁcantly varying across different approaches.

Among them, the tuned LightGBM model stood out

as the most effective. It achieved the lowest abso-

Available at: https://github.com/andreregino/pricing

lute and squared errors on validation and test sets,

along with R

scores exceeding 72%. This consis-

tent performance highlights its strong ability to gen-

eralize to unseen data, making it a robust choice for

price prediction tasks. The tuned XGBoost model

was followed closely, delivering slightly lower met-

rics than LightGBM and maintaining solid overall

performance.

These results suggest that XGBoost remains a vi-

able and competitive alternative, particularly in sce-

narios where interpretability and efﬁciency are also

valued. The tuned Random Forest showed compet-

itive results, though marginally below the top two

models. Meanwhile, the Feedforward Neural Net-

work showed more modest performance, with a lower

ability to explain the variance in the data, indicating

some limitations in capturing complex relationships

under the tested conﬁguration.

Despite hyperparameter tuning, linear models

such as Ridge and Lasso remained limited in accuracy

and explanatory power, with R

scores around 48%.

These ﬁndings underscore the importance of ﬁne-

tuning in more complex models—especially those

based on decision trees and gradient boosting frame-

works — as a way to maximize predictive perfor-

mance in tasks involving multiple, potentially non-

linear features.

4.3 Computational Performance Results

Regarding computational performance, Figure 3a il-

lustrates the training time across all models

. The

models were trained on 75, 000 products and vali-

dated on an additional 15, 000 products.

The training time analysis (Figure 3a) distin-

guishes between base (green) and tuned (blue) mod-

els. The tuned Feed Forward network required the

Prediction inference time available at: https://github.

com/andreregino/pricing/blob/main/prediction-time.png

Suggesting Product Prices in Automotive E-Commerce: A Study Assessing Regression Models and Explicability

153

Table 5: Evaluation of the Results for the Tuned Models. Training and test times were measured in seconds. The best results

for each metric are shown in bold. The values are displayed in ascending order by R

Model Validation Test Time (s)

MAE MSE RMSE R² MAE MSE RMSE R² Train Test

SVM 0.427 0.548 0.740 34.57% 0.441 0.624 0.790 35.26% 999.99+ 139.215

Lasso 0.351 0.427 0.654 49.00% 0.363 0.500 0.708 48.07% 88.272 0.095

Linear Regression 0.351 0.428 0.654 48.90% 0.362 0.497 0.705 48.50% 9.400 0.159

Ridge 0.351 0.428 0.654 48.90% 0.362 0.497 0.705 48.50% 9.052 0.092

Feed Forward 0.286 0.317 0.563 62.21% 0.291 0.354 0.595 63.26% 999.99+ 1.467

LSTM 0.276 0.275 0.525 67.14% 0.282 0.325 0.702 66.26% 360.400 1.170

Random Forest 0.243 0.219 0.468 73.85% 0.249 0.282 0.531 70.73% 637.962 0.383

XGBoost 0.241 0.211 0.459 74.87% 0.247 0.273 0.523 71.67% 169.045 0.505

Light GBM 0.227 0.203 0.450 75.79% 0.235 0.268 0.517 72.23% 21.386 0.768

longest training time (1274.8 seconds, approximately

21 minutes), more than double of the tuned Random

Forest model (637.9 seconds). These results highlight

the computational cost associated with hyperparam-

eter tuning in neural networks, particularly with re-

spect to parameters such as the number of layers, reg-

ularization techniques, and optimization strategies.

Conversely, Ridge and Lasso regression models —

both in base and tuned versions — were the fastest to

train, taking less than 10 seconds, reﬂecting their al-

gorithmic simplicity. Most of the linear models (e.g.,

Ridge, Lasso, and Linear Regression) had low com-

putational overhead, while tuned neural networks and

tree-based models required substantially more train-

ing time, as expected.

Figure 3b reports the prediction time for 10, 000

data entries. The tuned Feed Forward model was

again the slowest (1.510 seconds), closely followed

by its base version (1.467 seconds). LSTM mod-

els, both base and tuned, also recorded relatively long

prediction times due to their complex internal struc-

ture. Although Random Forest demonstrated high

predictive accuracy, it lagged behind in inference ef-

ﬁciency. Notably, LightGBM (both base and tuned)

showed exceptional prediction speed, with execution

times under 0.2 seconds, conﬁrming its suitability for

real-time pricing tasks.

4.4 Explainability

Figure 4 presents the results regarding explainabil-

ity. We identiﬁed which features contribute to the ﬁ-

nal predicted price and how much they inﬂuence the

model’s output.

The LightGBM model was chosen for this analy-

sis, as it achieved the best performance. The test set

(unseen during training) assessed the model’s ability

to generalize to new data. The SHAP summary plot in

Figure 4 presents the explainability results. The fea-

ture weight stands out as the most inﬂuential variable

in the model (cf. the top of the ﬁgure). The impact of

weight on price is substantial, with higher values gen-

erally increasing the predicted price. Heavier prod-

ucts tend to be more expensive due to higher manu-

facturing, shipping, and material costs.

The soldQuantity feature, representing the num-

ber of units sold, also plays a key role. Products with

higher sales volume are associated with higher prices,

likely reﬂecting their popularity and market demand.

This relationship suggests that the model captures de-

mand as a key factor in pricing. Similarly, the ini-

tialQuantity, the initial stock level of a product, di-

rectly affects the predicted price.

The presence of speciﬁc categories, such as VE-

HICLE AMPLIFIERS, indicates that the model can

capture nuanced market segments and their corre-

sponding price ranges, suggesting a degree of granu-

larity in how the model incorporates categorical prod-

uct information.

5 DISCUSSION

Figure 5 presents a comparative analysis of the R

scores across the evaluated regression models, high-

lighting the impact of hyperparameter tuning. The

top-performing models in terms of R

(cf. top of

the Figure 5) were LightGBM, XGBoost, and Ran-

dom Forest. In contrast, neural network-based models

such as LSTM and Feed Forward exhibited R

values

below 0.7, although still comparable to the top-tier

models. Linear models—including Linear Regres-

sion, Ridge, Lasso, and SVM—yielded the lowest R

scores.

Tree-based models, particularly LightGBM and

XGBoost, beneﬁted from hyperparameter optimiza-

tion, achieving improvements in R

and demonstrat-

ing enhanced explanatory power regarding the vari-

ance in the dataset. The tuned LightGBM model

achieved the best overall effectiveness, with an in-

crease of nearly 18% in R

compared to its base ver-

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

154

(a) Training times of base and tuned regression models for product pricing. Values are shown in seconds.

(b) Prediction times of base and tuned regression models for pricing 10,000 products. Values are shown in seconds.

Figure 3: Training and prediction times for regression models evaluated in the product pricing task.

sion. This result underscores its robustness in the

pricing task. Similarly, the tuned XGBoost model

presented a notable increase of approximately 12% in

, conﬁrming its effectiveness when ﬁne-tuned ap-

propriately.

We analyzed the trade-off between predictive ef-

fectiveness (measured by R

– Figure 5) and com-

putational efﬁciency (in terms of training and time –

Figure 3a). Linear models such as Linear Regression

and Ridge showed negligible or marginal improve-

ments in R

after hyperparameter tuning (+0.00% and

+5.70%, respectively). Nevertheless, these models

were the fastest in both training and prediction phases.

While this low computational cost could be advanta-

geous for applications requiring rapid responses, their

limited predictive accuracy renders them inadequate

for effective pricing tasks. Tree-based models, in

contrast, achieved higher R

scores at the cost of in-

creased computational time.

The tuned LightGBM model was the best-

performing option because it combined the highest R

score with fast prediction time (0.181 seconds) and

Suggesting Product Prices in Automotive E-Commerce: A Study Assessing Regression Models and Explicability

155

Figure 4: SHAP Summary Plot illustrating the explainability of the LightGBM regression model on the test set. Each row in

the plot corresponds to a model feature, ordered from the most inﬂuential (at the top) to the least inﬂuential (at the bottom).

The SHAP values (horizontal axis) indicate the impact of each feature on the model’s prediction: positive values increase the

prediction, while negative values decrease it. The spread of the points along the horizontal axis reﬂects the variation in the

impact of each feature across different predictions.

relatively low training time (21.4 seconds), making it

particularly well-suited for the pricing domain. Simi-

larly, the tuned XGBoost model showed strong effec-

tiveness improvements with acceptable training and

prediction times. Although Random Forest yielded

competitive accuracy, its excessively long training

time (637 seconds) makes it less practical for frequent

retraining scenarios.

Lasso and SVM improved slightly with hyper-

parameter tuning among the underperforming base

models. Their overall R

outcome remained uncom-

petitive. Furthermore, SVM incurred higher train-

ing and inference times, reducing its practical utility.

Neural network-based models (LSTM and Feed For-

ward) exhibited high computational costs for training

and prediction, despite achieving moderate R

values.

Although their effectiveness improved after tuning,

the associated resource requirements limit their ap-

plicability in pricing systems where rapid retraining

is necessary. In summary, the tuned LightGBM and

XGBoost models were revealed as the most viable

choices for this pricing task, offering an optimal bal-

ance between accuracy and computational efﬁciency.

In terms of explainability, the results indicate that

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

156

Figure 5: Comparison of R

scores for base and tuned regression models. The x-axis represents the R

, while the y-axis lists

the regression models. Red bars indicate the percentage increase in R

for the tuned models relative to their base versions.

higher initial availability is associated with higher

prices. This could reﬂect a pricing strategy based on

inventory management and the availability of automo-

tive products. Physical product characteristics such as

height and length signiﬁcantly inﬂuence the model’s

predictions. In an e-commerce context, these vari-

ables may relate to the product’s size, which affects

production and transportation costs. Larger products

typically require more materials and resources, result-

ing in higher prices.

In summary, the results show that tree-based mod-

els, especially LightGBM and XGBoost, are suitable

for production use in our automobile e-commerce

pricing context, combining high predictive accuracy

with low latency. Their explainability highlights key

pricing drivers that automotive e-commerce sellers

should monitor to make informed, data-driven deci-

sions.

Future work may explore the following directions:

(a) System Integration: Embedding the model into

GoBots’ systems to support automated, explainable

price recommendations for e-commerce vendors; (b)

Advanced Categorical Encoding: Using embed-

ding techniques to replace one-hot encoding, cap-

turing semantic relationships between categories and

enhancing multilingual applications; (c) Temporal

Price Analysis: Incorporating time series models

(e.g., SARIMA, Prophet) to detect seasonal trends

and price ﬂuctuations over time; (d) Scalable Data

Processing: Utilizing high-performance or cloud-

based infrastructure to process the complete dataset

and evaluate model scalability; (e) Cross-Domain

Application: Adapting the model to other pricing do-

mains, such as real estate, involving various data char-

acteristics and inﬂuencing factors.

6 CONCLUSION

The accurate pricing in online commerce plays a

strategic role for businesses. This study explored and

compared the effectiveness of several regression mod-

els applied to product pricing in the Brazilian au-

tomotive e-commerce sector. Our study evaluated

how various ML techniques contribute to this task,

balancing prediction accuracy with computational ef-

Suggesting Product Prices in Automotive E-Commerce: A Study Assessing Regression Models and Explicability

157

ﬁciency. Analysing a dataset comprising 100, 000

product records, we handled missing values, gener-

ated derived variables from temporal features, en-

coded categorical attributes, and standardized numer-

ical features. Our experimental results highlighted

the value of base and optimized models in the pric-

ing task. The comparative evaluation of nine re-

gression algorithms, including hyperparameter tun-

ing, revealed that tree-based ensemble methods such

as LightGBM and XGBoost offer a substantial trade-

off between predictive accuracy and training/predic-

tion time. The RMSE, MSE, and R

metrics were

used to quantify effectiveness, while runtime met-

rics supported the analysis of each model’s practical

feasibility in real-world scenarios. This study con-

tributed with methodological and original practical

assessment by systematically comparing regression

models in Brazilian automotive e-commerce. Our in-

vestigation contributed to the application of SHAP

to interpret model outputs. This analysis provided

insights into the most inﬂuential features for price

prediction, such as product weight, inventory levels,

and units sold. This investigation established a foun-

dation for future studies and is a reference for new

researchers and practitioners seeking to implement

data-driven pricing strategies.

ACKNOWLEDGMENTS

This work was supported by the S

ao Paulo Re-

search Foundation (FAPESP) and the Coordenac¸

ao de

Aperfeic¸oamento de Pessoal de N

ıvel Superior, Brazil

(CAPES). Prof. Dr. Julio Cesar Dos Reis thanks the

National Council of Technological and Scientiﬁc De-

velopment (CNPq), Brazil, grant #301337/2025-0.

REFERENCES

Akash, C. R., Vivekanandhan, P. K., Adam Khan, M.,

Ebenezer, G., Vinoth, K., Prithivirajan, J., and Kis-

han, V. J. P. (2024). Assessment of ridge regression-

based machine learning model for the prediction of

automotive sales based on the customer requirements.

Interactions, 245(1):289.

Aparicio, D. and Misra, K. (2023). Artiﬁcial intelligence

and pricing. Artiﬁcial intelligence in marketing, pages

103–124.

Bhaskar, T., Shiney, S. A., Rani, S. B., Maheswari, K., Ray,

S., and Mohanavel, V. (2022). Usage of ensemble re-

gression technique for product price prediction. In

2022 4th International Conference on Inventive Re-

search in Computing Applications (ICIRCA), pages

1439–1445. IEEE.

Chowdhury, M. S., Shak, M. S., Devi, S., Miah, M. R.,

Al Mamun, A., Ahmed, E., Hera, S. A. S., Mahmud,

F., and Mozumder, M. S. A. (2024). Optimizing e-

commerce pricing strategies: A comparative analysis

of machine learning models for predicting customer

satisfaction. The American Journal of Engineering

and Technology, 6(09):6–17.

Das, P., Pervin, T., Bhattacharjee, B., Karim, M. R., Sul-

tana, N., Khan, M. S., Hosien, M. A., and Kamruzza-

man, F. (2024). Optimizing real-time dynamic pric-

ing strategies in retail and e-commerce using machine

learning models. The American Journal of Engineer-

ing and Technology, 6(12):163–177.

El Youbi, R., Messaoudi, F., and Loukili, M. (2023). Ma-

chine learning-driven dynamic pricing strategies in e-

commerce. In 2023 14th International Conference

on Information and Communication Systems (ICICS),

pages 1–5. IEEE.

Kotler, P. and Keller, K. L. (2022). Marketing Management

16e. Pearson India.

Lundberg, S. M. and Lee, S.-I. (2017). A uniﬁed approach

to interpreting model predictions. Advances in neural

information processing systems, 30.

Mutemi, A. and Bacao, F. (2024). E-commerce fraud detec-

tion based on machine learning techniques: System-

atic literature review. Big Data Mining and Analytics,

7(2):419–444.

Patel, R. (2023). Customer acquisition and retention

in e-commerce using ai & machine learning tech-

niques. Journal of Harbin Engineering University,

44(8):879–886.

Xuming, Y. (2024). Research on the pricing wrong of elec-

tronic commerce contract. International Journal of

Frontiers in Sociology, 6(8).

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

158