TabM vs. Traditional ML for e-Commerce Product Ranking: A
Multi-Signal Framework for Frequently Bought Together
Recommendations
Dilge Karakaş, Enis Teper and Okan Kaya
Hepsiburada, Istanbul, Turkey
Keywords:
e-Commerce, Purchase Frequency Prediction, Customer Behavior, Machine Learning.
Abstract:
We present a machine learning framework for ranking products in e-commerce recommendation systems,
specifically targeting “Frequently Bought Together” scenarios. Leveraging a TabM neural architecture with
parameter-efficient BatchEnsemble mechanisms for ensemble learning, our system integrates similarity scores,
position signals, and commercial performance metrics to optimize purchase probability predictions. Deployed
on a major e-commerce platform, our approach demonstrates improved ranking performance while main-
taining computational efficiency through strategic weight sharing across ensemble members. The TabM
model achieves a 23.5% improvement in HR@5 over a position-based baseline and a 14.5% improvement in
NDCG@10 over logistic regression. The model effectively handles class imbalance through diverse ensemble
perspectives and significantly outperforms traditional machine learning approaches, including gradient
boosting and logistic regression.
1 INTRODUCTION
In the competitive e-commerce landscape, product
ranking plays a pivotal role in "Frequently Bought
Together" (FBT) contexts, especially in shaping user be-
havior and driving sales (Keskin et al., 2024b). FBT
recommendations traditionally rely on co-purchase
signals, but capturing these effectively while balanc-
ing relevance, business objectives, and computational
constraints remains a significant challenge for large-
scale platforms.
Recent BERT-like models treat purchase histories
as sequences, significantly improving recommenda-
tion accuracy and NDCG scores (Sun et al., 2019).
Studies on e-commerce ranking identify key chal-
lenges including heterogeneous data, class imbalance,
and balancing customer relevance with business goals
(Kabir et al., 2024). New recommendation systems
use logical reasoning to learn asymmetric product re-
lationships (e.g., batteries complement flashlights but
not vice versa), going beyond simple co-purchase pat-
terns (Wu et al., 2022). Analyses of commercial FBT
systems reveal built-in biases and strategic position-
ing effects that impact business outcomes.
Despite these advances, existing approaches often
suffer from: (1) limited scalability for real-time in-
ference, (2) lack of interpretability required for busi-
ness operations, or (3) insufficient integration of di-
verse signal types (similarity, commercial, and posi-
tional). There is a need for a practical framework that
balances predictive performance with operational re-
quirements.
We compare advanced neural architectures
(TabM with parameter-efficient BatchEnsemble
mechanisms) against traditional machine learning
approaches, ultimately demonstrating that deep
learning models like TabM (Gorishniy et al., 2024)
can achieve strong performance while meeting
production constraints.
This paper makes several contributions:
• A systematic comparison of neural (TabM) and traditional ML approaches for FBT ranking, with detailed analysis of performance-complexity trade-offs.
• A comprehensive feature selection analysis using Weight of Evidence binning and Information Value.
• A production deployment case study demonstrating real-world implementation challenges and solutions in a large-scale e-commerce environment.
2 RELATED WORKS
Product ranking in e-commerce has evolved sig-
nificantly, transitioning from rule-based heuristics
and standalone retrieval systems to learning-to-rank
(LTR) models that integrate heterogeneous signals.
Early approaches, such as collaborative filtering and
content-based ranking, often decoupled retrieval and
ranking stages. Modern architectures now favor uni-
fied pipelines that jointly optimize both tasks to align
with user preferences and business objectives (Kabir
et al., 2024).
LTR models, especially those based on gradient
boosting, have gained prominence for their predic-
tive strength and flexibility in handling mixed fea-
ture types. XGBoost (Chen and Guestrin, 2016) and
LightGBM (Ke et al., 2017) are frequently employed
in large-scale ranking tasks due to their scalability and
regularization techniques. CatBoost (Prokhorenkova
et al., 2018), in particular, excels at handling categor-
ical variables without preprocessing, making it well-
suited for e-commerce data with diverse categorical
attributes. In parallel, ensemble methods like Ran-
dom Forests and Extra Trees (Geurts et al., 2006)
serve as strong baselines for both model interpretabil-
ity and feature importance estimation. HistGradient-
Boosting, available via scikit-learn, offers computa-
tional efficiency by combining histogram-based train-
ing with support for monotonic constraints and miss-
ing values.
Despite the advancements in tree-based models,
logistic regression continues to be widely used in
real-time production environments for its low infer-
ence latency, simplicity, and well-calibrated proba-
bilistic outputs. When paired with systematic fea-
ture selection techniques—such as sequential forward
selection and Weight of Evidence (WoE) (Raymaek-
ers et al., 2021) binning—logistic regression achieves
strong performance while maintaining interpretability
(Loukili et al., 2023).
Beyond product recommendation, similar ranking
strategies have been applied to other personalization
tasks. For instance, a recent study on homepage ban-
ner optimization demonstrates that click prediction-
based ranking using logistic regression leads to mea-
surable improvements in click-through and conver-
sion rates (Keskin et al., 2024a). This application fur-
ther supports the viability of interpretable models in
latency-sensitive production systems.
Finally, fairness and transparency concerns are
increasingly relevant in commercial ranking. Stud-
ies have revealed that certain platforms may intro-
duce systemic biases—such as favoring private-label
or sponsored items—through opaque ranking poli-
cies. These findings underscore the importance of
explainability and bias-aware evaluation in deployed
recommendation models.
3 DATA COLLECTION
To support "Frequently Bought Together" (FBT) rec-
ommendations, we construct a training dataset by in-
tegrating user interaction history, candidate recom-
mendations, and product metadata. As illustrated in
Figure 1, users generate both recommendation ex-
posures and order events, which are then merged
with product-level features to form labeled product
pairs. Candidate products are first retrieved using
embedding-based similarity, ranked by (1 − distance).
Positive labels are assigned when the candidate was
co-purchased with the main product in the same order,
while negatives are drawn from unpurchased but rec-
ommended items. Data is split chronologically into
70% training, 15% validation, and 15% test to simu-
late real-world deployment.
Figure 1: Overview of the data collection and merging
pipeline.
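For illustration, a minimal sketch of this chronological split, assuming a pandas DataFrame with an event-timestamp column (column and function names are illustrative, not our production configuration):

```python
# Chronological 70/15/15 split: the model is always evaluated on interactions
# that occur strictly after those it was trained on, mimicking deployment.
import pandas as pd

def chronological_split(df: pd.DataFrame, time_col: str = "event_time"):
    """Sort by time, then cut at the 70% and 85% quantile positions."""
    df = df.sort_values(time_col)
    n = len(df)
    train = df.iloc[: int(0.70 * n)]
    valid = df.iloc[int(0.70 * n) : int(0.85 * n)]
    test = df.iloc[int(0.85 * n) :]
    return train, valid, test
```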
The dataset integrates features from multiple
sources, including product catalog (category hierar-
chy, brand, merchant, reviews), pricing (listing prices,
view events, and historical orders), and user engage-
ment metrics (views, clicks, sales across six plat-
form touchpoints). Performance signals such as click-
through rate (CTR), conversion rate (CR), and per-
centile rankings are also incorporated.
Table 1 summarizes the groups. Prices follow a hierarchical imputation strategy (listing → view → order → default) and are log-transformed to reduce skew.
Table 1: Overview of feature groups used in the ranking
model.
Feature Group          # Features   Source
Similarity & Position       2       Embedding, rank
Categorical Matching        5       Category, brand
Engagement Metrics         12       Views, clicks
Performance Ratios          6       CTR, CR
Commercial Signals          8       Price, reviews
Derived Features            4       Composites
The binary target variable indicates whether the
candidate product was purchased after the user
viewed the main product. Temporal constraints en-
sure that purchases are linked to recent exposures to
prevent spurious attribution. This binary setup aligns
with the goal of maximizing conversion probability
and supports interpretable model evaluation.
Feature selection is performed using Information
Value (IV) analysis to identify the most predictive fea-
tures. The IV metric quantifies a feature’s discrimi-
native power based on the distribution of positive and
negative classes across binned intervals, calculated as:
$$\mathrm{IV} = \sum_{i} (P_i - N_i) \times \ln \frac{P_i}{N_i}$$

where $P_i$ and $N_i$ are the proportions of positive and negative samples falling in bin $i$.
Based on IV scores, we retained the features with the highest discriminative power for the final ranking model.
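For concreteness, the following Python sketch computes WoE-based IV over quantile bins; the column names, bin count, and smoothing constant are illustrative choices, not our production configuration:

```python
# Information Value via quantile-binned Weight of Evidence, following the
# formula above: IV = sum_i (P_i - N_i) * ln(P_i / N_i).
import numpy as np
import pandas as pd

def information_value(feature: pd.Series, label: pd.Series, bins: int = 10) -> float:
    df = pd.DataFrame({"x": feature, "y": label})
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
    pos = grouped["sum"] / grouped["sum"].sum()                       # P_i
    neg_counts = grouped["count"] - grouped["sum"]
    neg = neg_counts / neg_counts.sum()                               # N_i
    woe = np.log((pos + 1e-6) / (neg + 1e-6))  # epsilon guards empty bins
    return float(((pos - neg) * woe).sum())

# Features whose IV exceeds a chosen threshold (e.g., 0.1) are kept.
```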
4 SYSTEM ARCHITECTURE
AND PIPELINE
Our modeling framework supports an end-to-end
pipeline for "Frequently Bought Together" (FBT) rec-
ommendations, integrating both candidate generation
and ranking stages in a production environment. The
full architecture is illustrated in Figure 2.
The process begins with user interactions col-
lected from e-commerce clickstream data, which are
passed through a trained embedding model to com-
pute dense product representations. These embed-
dings are enriched with metadata (e.g., category,
brand, price, and reviews), and stored in a centralized
product metadata repository. The resulting vectors are
indexed using a K-Nearest Neighbor (KNN) indexer,
enabling fast retrieval of similar products.
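As a concrete, simplified example of this retrieval step, the sketch below builds an exact nearest-neighbor index with scikit-learn and scores candidates by (1 − cosine distance), as in Section 3; a production system would typically substitute an approximate index, and all names here are illustrative:

```python
# Exact KNN retrieval over product embeddings; candidates are returned with
# a similarity score of (1 - cosine distance).
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.random.rand(10_000, 64).astype("float32")  # stand-in vectors
index = NearestNeighbors(n_neighbors=20, metric="cosine").fit(embeddings)

def retrieve_candidates(product_id: int, k: int = 20):
    """Return top-k (candidate_id, similarity) pairs, excluding the query item."""
    distances, ids = index.kneighbors(
        embeddings[product_id : product_id + 1], n_neighbors=k + 1
    )
    return [(int(i), 1.0 - float(d))
            for i, d in zip(ids[0], distances[0]) if i != product_id][:k]
```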
4.1 Serving API and Online Inference
The Serving API acts as the central online service for
delivering FBT recommendations to end users in real
time. When a user visits a product detail page, the
API orchestrates the following steps:
1. Candidate Retrieval: The API queries the KNN
index with the embedding of the currently viewed
product to retrieve top-k candidate items.
2. Feature Enrichment: Retrieved candidates are
enriched with additional metadata (e.g., product
category, brand, price, ratings, and recent engage-
ment statistics) from the centralized product meta-
data repository.
3. Ranking Inference: The enriched candidates are
passed to the Product Ranking API, which ap-
plies the trained model to compute purchase like-
lihood scores. This ranking step balances simi-
larity, commercial performance, and engagement
signals.
4. Response Delivery: The final ranked list is re-
turned as a JSON response to the frontend service,
where it is displayed as "Frequently Bought Together" recommendations.
The Serving API is optimized for low-latency,
high-throughput environments. It leverages caching
for popular product embeddings, parallel batch
queries for feature retrieval, and asynchronous com-
munication with the ranking service. This ensures
that recommendation responses are typically gener-
ated within 20–30 milliseconds, meeting the strict la-
tency requirements of large-scale e-commerce plat-
forms.
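For illustration, the orchestration above can be condensed into a single asynchronous handler; the sketch below assumes hypothetical service clients (knn_index, metadata_store, ranking_api) rather than our actual interfaces:

```python
# The four serving steps as one async handler. The three client objects stand
# in for internal services and are assumptions, not the platform's real APIs.
from typing import Any

async def serve_fbt(product_id: str, knn_index: Any, metadata_store: Any,
                    ranking_api: Any, k: int = 20) -> list[dict]:
    # 1. Candidate retrieval: query the KNN index with the viewed product.
    candidate_ids = await knn_index.query(product_id, top_k=k)
    # 2. Feature enrichment: parallel batch lookup on the metadata repository.
    features = await metadata_store.batch_get(candidate_ids)
    # 3. Ranking inference: purchase-likelihood scores from the ranking model.
    scores = await ranking_api.score(candidate_ids, features)
    # 4. Response delivery: JSON-serializable ranked list for the frontend.
    ranked = sorted(zip(candidate_ids, scores), key=lambda cs: cs[1], reverse=True)
    return [{"product_id": pid, "score": float(s)} for pid, s in ranked]
```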
Figure 2: End-to-end FBT candidate generation and ranking
pipeline.
5 METHODOLOGY
We conduct a comprehensive experimental evalua-
tion comparing TabM neural ensembles with tradi-
tional machine learning approaches on the FBT rank-
ing task.
5.1 Experimental Setup
Our evaluation framework compares the following al-
gorithms:
• TabM: Parameter-efficient neural ensemble with BatchEnsemble mechanisms
• CatBoost: Gradient boosting optimized for categorical features
• XGBoost: Tree-based ensemble with regularization
• HistGradientBoosting: Histogram-based gradient boosting from scikit-learn
• Random Forest: Bagging ensemble for baseline comparison
• Extra Trees: Extremely randomized trees ensemble
• Logistic Regression: Linear model with L2 regularization
• Position Baseline: Ranking by the initial retrieval order
Table 2: Model Performance Comparison on Test Set.
Model                 HR@1   HR@3   HR@5   NDCG@1  NDCG@3  NDCG@5  NDCG@10  P@1    P@5    P@10
TabM                  0.451  0.709  0.826  0.451   0.551   0.608   0.666    0.451  0.219  0.138
CatBoost              0.429  0.698  0.814  0.429   0.534   0.590   0.650    0.429  0.215  0.137
XGBoost               0.411  0.684  0.802  0.411   0.518   0.575   0.639    0.411  0.210  0.136
Random Forest         0.400  0.670  0.790  0.400   0.506   0.563   0.629    0.400  0.206  0.135
HistGradientBoosting  0.389  0.669  0.786  0.389   0.500   0.556   0.621    0.389  0.204  0.134
Extra Trees           0.339  0.596  0.724  0.339   0.439   0.497   0.569    0.339  0.186  0.127
Logistic Regression   0.359  0.614  0.732  0.359   0.456   0.510   0.574    0.359  0.189  0.125
Position Baseline     0.321  0.561  0.669  0.321   0.412   0.459   0.531    0.321  0.170  0.118
Our model selection process balances three criti-
cal factors:
1. Offline Performance: Evaluated using HR@5,
NDCG@10, and Precision metrics
2. Production Constraints: Inference latency <
30ms, memory footprint < 1GB
3. Operational Requirements: Model inter-
pretability for business stakeholders
5.2 TabM Model Architecture
TabM (Tabular Multiple predictions) builds on parameter-efficient ensembling techniques (Gorishniy et al., 2024), employing BatchEnsemble mechanisms designed to represent multiple MLPs efficiently while sharing most of their parameters. The model combines strategic weight sharing through learnable adapters with simultaneous training of all members to achieve strong performance on tabular datasets.
The core TabM architecture consists of k implicit
ensemble members, where each member processes in-
puts through modified linear layers with shared and
non-shared components:
$$l_{\mathrm{BE}}(X) = \big((X \odot R)\,W\big) \odot S + B \tag{1}$$

where $X \in \mathbb{R}^{k \times d}$ contains the $k$ object representations (one per ensemble member), $R, S, B \in \mathbb{R}^{k \times d}$ are the non-shared adapters for each ensemble member, $W \in \mathbb{R}^{d \times d}$ is the shared weight matrix, and $\odot$ denotes element-wise multiplication.
The feature transformation for each ensemble
member i follows:
$$h^{[i]} = s_i \odot \big(W (r_i \odot x_i)\big) + b_i \tag{2}$$

where $r_i$, $s_i$, and $b_i$ are the individual adapters for ensemble member $i$, and $x_i$ is the input representation for that member.
The critical first adapter initialization ensures
proper ensemble diversity:
$$R^{[1]} \sim \mathcal{N}(0, 1), \qquad R^{[l]}, S^{[l]} = 1 \ \text{ for } l > 1 \tag{3}$$
The final prediction aggregates predictions from
all ensemble members:
$$\hat{y} = \frac{1}{k} \sum_{i=1}^{k} f\big(h^{[i]}\big) \tag{4}$$

where $f(\cdot)$ is the output transformation (sigmoid for classification, identity for regression).
For our implementation, we configure TabM with
k = 32 ensemble members, network depth N = 3 lay-
ers, and hidden dimension d = 512. The preprocess-
ing pipeline follows TabM’s advanced data prepara-
tion strategy: numerical features undergo Quantile-
Transformer normalization with noise injection for
stability, while categorical features are label-encoded
with cardinality tracking. We employ PiecewiseLin-
earEmbeddings with 48 bins and 16-dimensional em-
beddings for numerical features, and apply dropout
regularization during training. The model is trained
using AdamW optimizer with weight decay, and Fo-
cal Loss (α = 0.25, γ = 2.0) to handle class imbal-
ance.
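To make Eqs. (1)–(3) concrete, the sketch below implements a single BatchEnsemble linear layer in PyTorch. It is a minimal reconstruction from the formulas above, not our production code; class and variable names are illustrative:

```python
# One shared weight matrix W plus per-member rank-1 adapters R, S, B.
import torch
import torch.nn as nn

class BatchEnsembleLinear(nn.Module):
    """BatchEnsemble linear layer for k implicit ensemble members (Eq. 1)."""

    def __init__(self, d_in: int, d_out: int, k: int, first_layer: bool = False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_in, d_out))   # shared W
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # Non-shared adapters: R scales inputs, S scales outputs, B is the bias.
        # Eq. (3): only the first layer's R is random, to break member symmetry.
        self.r = nn.Parameter(torch.randn(k, d_in) if first_layer
                              else torch.ones(k, d_in))
        self.s = nn.Parameter(torch.ones(k, d_out))
        self.b = nn.Parameter(torch.zeros(k, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, k, d_in): one representation per ensemble member.
        return ((x * self.r) @ self.weight) * self.s + self.b   # Eq. (1)

# Usage with the configuration above: k = 32 members, hidden width d = 512.
k, d = 32, 512
layer = BatchEnsembleLinear(d, d, k, first_layer=True)
x = torch.randn(8, k, d)                 # 8 objects, replicated per member
h = torch.relu(layer(x))                 # shape: (8, 32, 512)
# Eq. (4): the final score averages the k member predictions, e.g.
# y_hat = torch.sigmoid(head(h)).mean(dim=1) for a per-member output head.
```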
6 RESULTS AND DISCUSSION
The Information Value analysis demonstrates that ini-
tial ranking position provides the strongest predictive
signals, while commercial performance differentials
and categorical matching offer valuable supplemen-
tary information.
6.1 Model Performance Comparison
Figure 3: Ranking model outputs for Android Phone (left) and Laptop (right) showing recommended complementary products. The first row shows the baseline recommendation order; the second row shows our model's reranked choices. Items outlined in green represent actual co-purchases from historical transaction data, demonstrating the model's ability to prioritize frequently bought together items.

We evaluate multiple models using ranking metrics commonly adopted in recommendation systems: Hit Rate at rank 5 (HR@5), Normalized Discounted Cumulative Gain at rank 10 (NDCG@10), and Precision at ranks 5 and 10 (P@5, P@10). Among these, NDCG is particularly informative, as it accounts for both the position and relevance of recommended items, assigning higher scores when correct predictions appear earlier in the ranked list.
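For reference, minimal implementations of HR@k and NDCG@k for a single ranked list with binary relevance (function names are illustrative):

```python
# HR@k: did any relevant item land in the top k?
# NDCG@k: discounted gain of the predicted order vs. the ideal order.
import numpy as np

def hit_rate_at_k(relevance: np.ndarray, k: int) -> float:
    return float(relevance[:k].any())

def ndcg_at_k(relevance: np.ndarray, k: int) -> float:
    discounts = np.log2(np.arange(2, k + 2))
    dcg = (relevance[:k] / discounts).sum()
    ideal = (np.sort(relevance)[::-1][:k] / discounts).sum()
    return float(dcg / ideal) if ideal > 0 else 0.0

# Example: the co-purchased item sits at rank 2 of a 5-item list.
rel = np.array([0, 1, 0, 0, 0])
print(hit_rate_at_k(rel, 5), round(ndcg_at_k(rel, 5), 3))  # 1.0 0.631
```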
TabM emerges as the clear performance leader,
achieving the highest scores across all evaluation
metrics. The neural ensemble demonstrates supe-
rior ranking quality with an HR@5 of 0.826 and
NDCG@10 of 0.666, representing substantial im-
provements over traditional machine learning ap-
proaches. CatBoost follows as the second-best per-
former among tree-based methods, while logistic re-
gression provides a strong baseline despite its sim-
plicity.
Figure 3 illustrates these quantitative improve-
ments through qualitative examples, showing how our
model effectively prioritizes items that were actu-
ally co-purchased (highlighted in green) compared to
baseline rankings. This demonstrates improved align-
ment with actual user purchase behavior while main-
taining recommendation diversity.
7 CONCLUSION AND FUTURE
WORK
This paper presented a comprehensive evaluation
of machine learning approaches for product rank-
ing in "Frequently Bought Together" e-commerce
recommendation systems. We systematically com-
pared neural (TabM) and traditional ML methods,
ultimately demonstrating that deployment decisions
require careful consideration of the performance-
complexity-interpretability trade-off space.
7.1 Key Contributions
Our research makes several contributions to practical
recommendation systems:
1. Empirical Comparison: We provide a thor-
ough comparison of TabM neural ensembles
against traditional ML approaches, demonstrat-
ing that deep learning can indeed outperform gra-
dient boosting methods in tabular recommenda-
tion tasks. TabM achieves superior performance
across all metrics, with notable improvements of
+23.5% HR@5 over position baseline, +14.5%
NDCG@10 over logistic regression, and consis-
tent gains over state-of-the-art gradient boosting
methods including CatBoost and XGBoost.
2. Neural Architecture Validation: We
demonstrate that TabM’s parameter-efficient
BatchEnsemble mechanisms can effectively
challenge the dominance of gradient boosting in
tabular data scenarios, particularly in e-commerce
ranking tasks where complex feature interactions
are crucial.
Future research directions include:
• Multi-objective optimization techniques balancing conversion probability with revenue and inventory management objectives
• Personalization strategies incorporating user-specific preferences and historical interaction patterns
• Dynamic reranking approaches that adapt to real-time inventory and promotional considerations
• Cross-category recommendation expansion using graph-based relationship modeling
This research contributes to the growing field of
practical recommendation systems in e-commerce,
providing a framework that prioritizes deployabil-
ity, interpretability, and business alignment alongside
predictive performance.
REFERENCES
Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.
Gorishniy, Y., Kotelnikov, A., and Babenko, A. (2024). TabM: Advancing tabular deep learning with parameter-efficient ensembling.
Kabir, M. A., Hasan, M. A., Mandal, A., Tunkelang, D.,
and Wu, Z. (2024). A survey on e-commerce learning
to rank. arXiv preprint arXiv:2412.03581.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
Keskin, M., Teper, E., Bük, B., and Sezgin, M. S. (2024a). Banner ranking based on click prediction in the e-commerce. In 2024 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), pages 1–6.
Keskin, M., Teper, E., and Kurt, A. (2024b). Comparative
evaluation of word2vec and node2vec for frequently
bought together recommendations in e-commerce. In
2024 9th International Conference on Computer Sci-
ence and Engineering (UBMK), pages 1–5.
Loukili, M., Messaoudi, F., and El Ghazi, M. (2023).
Machine learning based recommender system for e-
commerce. IAES International Journal of Artificial
Intelligence, 12(4):1803–1811.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31.
Raymaekers, J., Verbeke, W., and Verdonck, T. (2021).
Weight-of-evidence 2.0 with shrinkage and spline-
binning. arXiv preprint arXiv:2101.01494.
Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., and Jiang, P. (2019). BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformers. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1441–1450. ACM.
Wu, L., Zhou, Y., and Zhou, D. (2022). Towards high-order
complementary recommendation via logical reason-
ing network. In 2022 IEEE International Conference
on Data Mining (ICDM), pages 1227–1232. IEEE.