AMAKAN: Fully Interpretable Adaptive Multiscale Attention Through
Kolmogorov-Arnold Networks
Felice Franchini (https://orcid.org/0009-0000-8887-800X) and Stefano Galantucci (https://orcid.org/0000-0002-3955-0478)
University of Bari Aldo Moro, 70125, Bari, Italy
Keywords:
Adaptive Multiscale Attention, Interpretability, Kolmogorov–Arnold Networks, Tabular Data.
Abstract:
This paper introduces AMAKAN, a novel method for tabular data classification combining the Adaptive Multi-
scale Deep Neural Network with Kolmogorov–Arnold Network to ensure full interpretability without sacrific-
ing predictive performance. The Adaptive Multiscale Deep Neural Network dynamically focuses on relevant
features at different scales by using learned attention mechanisms. These multiscale features are then refined
by Kolmogorov–Arnold Network layers, which replace typical dense layers with learnable univariate func-
tions on network edges, providing transparency by allowing practitioners to visually see and inspect feature
transformations directly. Experimental results on a variety of real-world datasets demonstrate that AMAKAN
achieves performance equivalent to or better than state-of-the-art baselines while providing transparent and
actionable explanations for its predictions. By seamlessly combining interpretable attention mecha-
nisms with Kolmogorov–Arnold Network layers, the paper presents an explainable and efficient deep learning
method for tabular data across a vast spectrum of application domains.
1 INTRODUCTION
Machine learning research has progressively focused
on creating advanced deep learning models capable
of processing various data modalities, such as text,
images, and time series signals. Despite these suc-
cesses, applying deep learning to tabular data, per-
haps one of the most prevalent data structures in nu-
merous industries, remains an open challenge. Con-
ventional multilayer neural networks tend to struggle
with maintaining high accuracy without compromis-
ing interpretability, a concern that is particularly im-
portant in areas where model transparency is the foun-
dation of trust and accountability.
Recent work has demonstrated that specialized ar-
chitectures can benefit performance and explainabil-
ity for tabular data classification. In particular, the
Adaptive Multiscale Deep Neural Network (Denta-
maro et al., 2024) was developed to provide better
feature weighting with design-time interpretability.
This is achieved using parallel Excitation Layers of
differing compression levels and a Trainable Atten-
tion Mechanism, which dynamically re-weights the
learned features. Although these mechanisms partly
unlock the internal decision process of the model, its
final layers remain opaque and must be revised to
render it a wholly interpretable system.
Kolmogorov-Arnold Networks (KANs) (Liu
et al., 2024) are another powerful recent option: they
model complicated relationships without relying on
pre-defined activation functions at the nodes. Instead,
they place learnable univariate functions on the edges.
In doing so, they automatically offer a path to-
wards transparency and interpretability: any edge in a
KAN may be viewed and analyzed to reveal the evolu-
tion of the input signals. Above all, there are rigorous
theoretical foundations for Kolmogorov-Arnold Net-
works, in the form of the Kolmogorov-Arnold rep-
resentation theorem (Schmidt-Hieber, 2021), to sup-
port their ability to approximate multivariate func-
tions with versatility.
This paper incorporates Kolmogorov-Arnold Net-
works into the final layers of the Adaptive Mul-
tiscale Deep Neural Network to obtain end-to-end in-
terpretability. That is, the final two dense layers of the
Adaptive Multiscale Deep Neural Network are sub-
stituted with Kolmogorov-Arnold Network layers, al-
lowing explicit insight into feature transformations.
Experimental findings show that such changes consis-
tently result in improved performance while granting
the full transparency required for high-stakes applica-
tions.
The remainder of this paper is organized as fol-
lows. Section 2 discusses recent research on in-
terpretable deep learning methods for tabular data,
including other architectures employing attention
mechanisms and tree-based ensembles. Section 3 in-
troduces the initial Adaptive Multiscale Deep Neural
Network architecture and outlines how Kolmogorov-
Arnold Networks are incorporated into its architec-
ture. Section 4 discusses the experimental setup and
datasets considered, and Section 5 presents results
on classification performance and interpretability im-
pact. Section 6 concludes and gives future work di-
rections for how this improved architecture can serve
as a faithful, fully interpretable deep learning solution
for tabular data.
2 RELATED WORK
2.1 Adaptive Multiscale Deep Neural
Network
The Adaptive Multiscale Deep Neural Network, pre-
sented by Dentamaro et al. (Dentamaro et al., 2024),
has been particularly engineered to improve feature
weighting in tabular data classification, with a high
level of interpretability. Through the integration of
multiple Excitation Layers operating at different lev-
els of compression, the model enables a dynamic
evaluation of input feature relevance. These repre-
sentations are then combined within a Merge Layer
through operations like addition, averaging, concate-
nation, or the Hadamard product, thus enabling the
combination of multi-scale feature representations.
Finally, a Trainable Attention Mechanism further op-
timizes the determined feature importance by adjust-
ing weight parameters during training. This method
yields considerably enhanced classification perfor-
mance, thanks to the ability of the model to ana-
lyze input data at a multi-resolution level, while inter-
pretability is also improved through attention-based
feature weighting.
This model aligns with other explainable deep
learning approaches developed for tabular data. Neu-
ral Oblivious Decision Tree Ensembles (NODE)
(Popov et al., 2019) offers differentiable decision
trees that support gradient-based learning, thus al-
lowing a balance between interpretability and perfor-
mance, similarly to other decision-tree-based method-
ologies (Luo et al., 2021; Dentamaro et al., 2018). As
opposed to typical tree-based algorithms, NODE sup-
ports end-to-end training while also ensuring trans-
parency at the feature level. Similarly, the Soft Deci-
sion Tree Regressor (SDTR) (Luo et al., 2021) blends
decision tree approaches with deep learning, leverag-
ing hierarchical representations to both improve pre-
dictive ability and model explainability.
Besides hybrid tree-based models, there have been
various novel deep learning architectures proposed
for tabular data analysis. TabNet (Arik and Pfis-
ter, 2021) leverages an interpretable sequential at-
tention mechanism that selectively extracts relevant
features step by step, thus improving feature selec-
tion and interpretability. However, it requires large
datasets for the best training outcomes. Also, Cat-
Boost (Prokhorenkova et al., 2018), a gradient boost-
ing system that was particularly tailored for efficient
categorical feature management, combines ordered
boosting and fast categorical variable processing to
maintain high prediction quality with improved inter-
pretability. Aversano et al. (Aversano et al., 2023) fur-
ther contributed to the field by proposing a data-aware
explainable deep learning approach that demonstrates
how interpretability can be integrated into predictive
systems, particularly in process-aware applications.
2.2 Kolmogorov-Arnold Networks
(KANs)
Kolmogorov-Arnold Networks (KANs) (Liu et al.,
2024) leverage the Kolmogorov-Arnold represen-
tation theorem (Schmidt-Hieber, 2021), which
states that every multivariate continuous function
$f : [0, 1]^n \to \mathbb{R}$ can be represented as a superposition
of continuous single-variable functions:

$$f(x_1, x_2, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left(\sum_{p=1}^{n} \phi_{p,q}(x_p)\right)$$

where $\Phi_q$ and $\phi_{p,q}$ are learnable univariate functions.
In contrast to Multi-Layer Perceptrons (MLPs)
that apply fixed activation at neurons, KANs apply
learnable functions to the edges, which increases flex-
ibility and expressiveness.
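To make the edge-function idea concrete, the following sketch implements a simplified KAN-style layer in PyTorch: every edge carries its own learnable univariate function, approximated here with a fixed Gaussian basis and trainable per-edge coefficients rather than the adaptive B-splines of the reference implementation. All names and hyperparameters are illustrative assumptions, not part of the cited papers.

```python
# Minimal sketch of a KAN-style layer: each edge (p -> q) carries its own
# learnable univariate function, approximated by a fixed set of Gaussian
# basis functions with trainable coefficients (the reference implementation
# uses B-splines on an adaptive grid instead).
import torch
import torch.nn as nn

class EdgeFunctionLayer(nn.Module):
    def __init__(self, in_features: int, out_features: int, num_basis: int = 8):
        super().__init__()
        # Fixed basis centres on [-1, 1]; only the per-edge coefficients are learned.
        self.register_buffer("centres", torch.linspace(-1.0, 1.0, num_basis))
        self.coeffs = nn.Parameter(torch.randn(out_features, in_features, num_basis) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features) -> basis activations: (batch, in_features, num_basis)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centres) ** 2) / 0.1)
        # phi_{p,q}(x_p) for every edge, then sum over the input dimension p.
        edge_outputs = torch.einsum("bik,oik->boi", basis, self.coeffs)
        return edge_outputs.sum(dim=-1)          # (batch, out_features)

x = torch.randn(4, 3)
layer = EdgeFunctionLayer(in_features=3, out_features=2)
print(layer(x).shape)   # torch.Size([4, 2])
```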
Some of the most significant distinctions between
KANs and MLPs are as follows:
- Learnable vs. Fixed Activations: KANs make use of learnable function compositions instead of fixed activations, thus showing improved data pattern adaptability.
- Weight Multiplication vs. Function Approximation: KANs allow for dynamic transformations, unlike traditional static weighted sums, by supporting the learning of functions.
- Shallower yet More Expressive: KANs achieve strong function approximation using fewer layers, thus reducing the need for large architectures.
KANs are highly interpretable due to their explicit
modeling of feature transformation. Their learned
functions can be visualized, thus offering important
insights into how features influence predictions. In
addition, their architecture makes them more resis-
tant to catastrophic forgetting, thus improving robust-
ness in sequential learning tasks.
Several research works have advanced the
interpretability of Kolmogorov-Arnold
Networks. Xu et al. (Xu et al., 2024) have incor-
porated symbolic regression into KANs, as shown
by their implementations in Temporal KAN (T-KAN)
and Multi-Task KAN (MT-KAN), and improved
transparency in time-series tasks. Galitsky (Galit-
sky, 2024) has utilized KANs in natural language
processing (NLP) by using inductive logic program-
ming to give word-level explanations. A similar strat-
egy of combining explainability with advanced mod-
eling has also been applied in biomedical contexts,
such as AI-assisted spectroscopy for cancer assess-
ment (Esposito et al., 2023; Gattulli et al., 2023b;
Dentamaro et al., 2021; Gattulli et al., 2023a). De
Carlo et al. (De Carlo et al., 2024) and Sun (Sun,
2024) have explored using KANs in graph neural net-
works and scientific discovery, where inherent com-
plexity presents interpretability challenges. Bozor-
gasl and Chen (Bozorgasl and Chen, 2024) have pro-
posed Wav-KAN, which incorporates wavelets for en-
hanced transparency. Thanks to these advancements,
KANs have demonstrated their capability to add inter-
pretability in different architectural frameworks. For
this reason, this paper integrates them into the Adap-
tive Multiscale Deep Neural Network to replace its
black-box components and achieve full interpretabil-
ity.
3 REVISED ARCHITECTURE
3.1 Original Network Structure
The Adaptive Multiscale Deep Neural Network was
proposed to improve feature weighting and attention mecha-
nisms for tabular data classification. The model archi-
tecture consists of multiple parallel Excitation Layers
with different levels of compression to capture and
emphasize relevant features dynamically. The Exci-
tation Layers apply a dense transformation followed by an
Exponential Linear Unit (ELU) activation, which improves
gradient flow, and then a second dense transformation
followed by a sigmoid activation that normal-
izes the attention weights to [0, 1]. The number of exci-
tation layers is defined proportionally to the number
of input features to achieve adaptability to different
datasets.
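A minimal sketch of one Excitation Layer following the description above; the class name, compression ratios, and number of parallel branches are illustrative assumptions rather than the authors' code.

```python
# Sketch of one Excitation Layer: dense -> ELU -> dense -> sigmoid, producing
# per-feature attention weights in [0, 1]. Compression ratios are placeholders.
import torch
import torch.nn as nn

class ExcitationLayer(nn.Module):
    def __init__(self, num_features: int, compression: float):
        super().__init__()
        hidden = max(1, int(num_features * compression))
        self.squeeze = nn.Linear(num_features, hidden)   # first dense transformation
        self.elu = nn.ELU()                              # improves gradient flow
        self.excite = nn.Linear(hidden, num_features)    # second dense transformation
        self.sigmoid = nn.Sigmoid()                      # attention weights in [0, 1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sigmoid(self.excite(self.elu(self.squeeze(x))))

# The number of parallel branches scales with the number of input features.
num_features = 20
branches = nn.ModuleList(
    ExcitationLayer(num_features, c) for c in (0.25, 0.5, 0.75)
)
```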
Excitation Layers’ outputs are merged through a
Merge Layer that carries out one of four operations:
Addition, Averaging, Concatenation, or Hadamard
Product. Through this operation, multi-scale feature
representations are effectively merged before further
processing. To normalize feature representations, the
network carries out Layer Normalization, followed by
a Hadamard Product between the normalized output
and original input features. This operation enhances
feature weighting without losing raw data properties.
A Trainable Attention Mechanism updates feature
importance dynamically using a trainable weight ma-
trix. The matrix, which is set to an identity ma-
trix with a small scaling factor at initialization, is
trained to learn the best feature weighting. With an
explicit model of feature contribution to the classifica-
tion task, this mechanism improves both performance
and interpretability.
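The normalization, Hadamard re-weighting with the original input, and identity-initialized trainable attention described above can be sketched as follows; the scaling constant and exact wiring are assumptions made for illustration, not the published implementation.

```python
# Sketch of the normalization, input re-weighting, and trainable attention
# steps. Assumes the merged representation has the same width as the input
# (as with Addition, Averaging, or the Hadamard Product merge).
import torch
import torch.nn as nn

class TrainableAttention(nn.Module):
    def __init__(self, num_features: int, init_scale: float = 0.01):
        super().__init__()
        # Weight matrix initialized to a scaled identity, then learned.
        self.weight = nn.Parameter(torch.eye(num_features) * init_scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight

num_features = 20
norm = nn.LayerNorm(num_features)
attention = TrainableAttention(num_features)

x = torch.randn(8, num_features)          # original input features
merged = torch.rand(8, num_features)      # stand-in for the Merge Layer output
weighted = norm(merged) * x               # Hadamard product with the raw input
out = attention(weighted)                 # learned per-feature re-weighting
```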
The interpretability of the Adaptive Multiscale At-
tention Block is realized through an extensive analy-
sis of features at multiple scales. First, the Excita-
tion Layers present an immediate view of the most
important features relevant for classification by com-
puting attention weights at various levels of granular-
ity. These attention weights enable the investigation
of local feature importance, thereby indicating which
features are emphasized in different parts of the net-
work. Second, the overall importance of each input
variable is determined by summing attention scores
over all instances, providing a high level of under-
standing of feature relevance. Further, the model ex-
plores the variation of feature importance across dif-
ferent classes, allowing an analysis of class-specific
feature dynamics. Finally, the network gains insights
into nonlinear interactions between features, reveal-
ing complex dependencies that are not addressed in
conventional models.
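As an illustrative example of the global importance computation mentioned above, per-instance attention scores can be aggregated over the whole dataset; the array names and shapes below are hypothetical.

```python
# Sketch of dataset-level feature importance: sum the per-instance attention
# scores over all instances and rank the features by the aggregate.
import numpy as np

attention_scores = np.random.rand(1000, 20)          # stand-in for real scores
global_importance = attention_scores.sum(axis=0)      # aggregate over instances
ranking = np.argsort(global_importance)[::-1]         # most important first
print("top-5 features:", ranking[:5])
```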
With these mechanisms for interpretability built
into its design, the Adaptive Multiscale Deep Neural
Network is not based on post hoc explainability tech-
niques but is instead inherently transparent regarding
its choices. Having the ability to visualize attention
distributions and examine feature interactions makes
it particularly well-suited to transparency-requiring
applications such as medical diagnosis and financial
modeling.
In order to improve the interpretability of the
model, the final two dense layers have been replaced
by two Kolmogorov-Arnold Network layers, as
depicted in Figure 1. To main-
tain consistency with the previous architecture, the
first layer of the Kolmogorov-Arnold Network is set
to have a number of units equal to the number of in-
put features, while the second layer is set to have a
number of units equal to the number of total output
classes.
Figure 1: Adaptive Multiscale Attention Network KAN Ar-
chitecture.
In contrast with the conventional dense layers that
depend on linear transformations followed by pre-set
activation functions, Kolmogorov-Arnold Network
layers inherently model complicated nonlinear transfor-
mations through adjustable function compositions. This
inherent non-linearity obviates the necessity for the
ELU activation function that was used in the dense
layers of the original network. The flexibility of the
Kolmogorov-Arnold Network layers ensures that the
feature transformations are data-dependent, and the
model can adaptively learn the most significant rep-
resentations without pre-defined activation functions.
Every layer in the Kolmogorov-Arnold Network is
implemented using a grid size of 5 and a spline order
of 3, in line with the methodology of Liu et al. (Liu
et al., 2024) in their original research on Kolmogorov-
Arnold Networks. Such dimensions achieve the best
possible trade-off between computational efficiency
and function approximation capacity and thus ensure
that the network enjoys a considerable amount of ex-
pressiveness while remaining interpretable.
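A sketch of the replacement head under these settings, assuming the reference pykan package is used (the paper does not state which Kolmogorov-Arnold Network implementation it relies on); the dimensions below are placeholders.

```python
# Sketch of the two-layer KAN head: first layer sized to the number of input
# features, second layer sized to the number of classes, grid size 5 and
# spline order 3. Assumes the pykan package; dimensions are placeholders.
from kan import KAN

attention_output_dim = 20   # width of the attention block output (grows under Concatenation)
num_features = 20           # units of the first KAN layer = number of input features
num_classes = 2             # units of the second KAN layer = number of output classes

# width lists layer sizes, so this builds exactly two KAN layers.
kan_head = KAN(width=[attention_output_dim, num_features, num_classes], grid=5, k=3)
```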
4 EXPERIMENTAL SETUP
The datasets for evaluation are identical to those in
(Dentamaro et al., 2024). An overview of the datasets
is presented in Table 1, including the number of in-
stances, features, and their balancing properties.
The datasets were chosen to provide an overall as-
sessment of the suggested architecture. The UCI Ar-
rhythmia dataset was chosen because it has a very
high number of features, and hence it is a suitable
benchmark for testing models that deal with intri-
cate feature interactions. In addition, the collection of
datasets includes imbalanced and balanced datasets so
that the model’s performance is evaluated under vary-
ing class distributions. The inclusion of the Higgs Bo-
son dataset with more than 11 million samples gives
a large-scale benchmark for the scalability of the new
approach. Likewise, the Click-Through Rate dataset
also gives a large yet structured classification problem
with a reasonable feature space. Including the smaller
datasets also tests the model's ability to handle low-
data situations.
To compare and evaluate the suggested architec-
ture against its predecessor version, all experimental
approaches were performed using the same model pa-
rameters defined in (Dentamaro et al., 2024). In ad-
dition, a new baseline experiment was added, where
a Kolmogorov-Arnold Network with a single hidden
layer was used. The number of units in this layer was
set to f × 0.8, where f is the number of features in the
dataset, to provide consistency with the MLP setup
described in (Dentamaro et al., 2024).
Following (Dentamaro et al., 2024), the base-
line Kolmogorov-Arnold Network was trained for a
maximum of 150 epochs with data standardization (Z-
score normalization) being applied before the training
phase was initiated. This approach enables fair com-
parisons by ensuring that all models operate under the
same preprocessing settings.
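The shared preprocessing and baseline sizing can be sketched as follows; the helper names are illustrative, and the optimizer and batch size are not reported, so only the Z-score normalization, the f x 0.8 hidden width, and the 150-epoch budget come from the text.

```python
# Sketch of the shared preprocessing and baseline KAN sizing described above:
# Z-score standardization fitted on the training split, a single hidden layer
# of round(0.8 * f) units, and a maximum of 150 training epochs.
import numpy as np
from sklearn.preprocessing import StandardScaler

def prepare_splits(X_train: np.ndarray, X_test: np.ndarray):
    scaler = StandardScaler()                    # Z-score normalization
    return scaler.fit_transform(X_train), scaler.transform(X_test)

def baseline_kan_width(num_features: int, num_classes: int) -> list:
    hidden = max(1, round(0.8 * num_features))   # f x 0.8 hidden units
    return [num_features, hidden, num_classes]

MAX_EPOCHS = 150
```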
In order to create a general comparative frame-
work, the methodologies being considered can be
classified as follows:
- Tree-based approaches:
  - Decision Tree
  - Random Forests (Breiman, 2001)
  - XGBoost (Chen and Guestrin, 2016)
  - CatBoost (Prokhorenkova et al., 2018)
- Non-tree-based approaches:
  - TabNet (Arik and Pfister, 2021)
  - Multi-Layer Perceptron (MLP)
  - Support Vector Machine (SVM) with RBF kernel
  - Kolmogorov–Arnold Network (KAN) (Liu et al., 2024)
  - Original Adaptive Multiscale Attention Network (Dentamaro et al., 2024)

Table 1: Summary of Datasets Used.

| Dataset | Instances | Features | Size | Balancing |
|---|---|---|---|---|
| UCI Arrhythmia (Guvenir and Quinlan, 1997) | 452 | 279 | Small | Imbalanced |
| UCI Wisconsin Breast Cancer (Wolberg and Street, 1993) | 569 | 32 | Small | Balanced |
| UCI Cervical Cancer (Fernandes and Fernandes, 2017) | 72 | 19 | Small | Imbalanced |
| UCI Diabetic Retinopathy (Antal and Hajdu, 2014) | 1,151 | 20 | Small | Balanced |
| UCI Heart Disease (Janosi and Detrano, 1989) | 303 | 14 | Small | Imbalanced |
| Click-Through Rate (Aden and Wang, 2012) | 1,000,000 | 12 | Large | Balanced |
| Higgs Boson (Baldi et al., 2014) | 11,000,000 | 28 | Very Large | Balanced |
With regard to the architecture suggested in this
paper, all experimental protocols were carried out un-
der consistent conditions as outlined in (Dentamaro
et al., 2024). For every dataset, four different model
configurations were tested, differing in the type of
merge layer used. Let $s_i$ be the output of the i-th
Excitation Layer, where n is the number of Excitation
Layers. The merge layer integration was performed
using one of the following four methodologies:

- Addition: The outputs produced by every Excitation Layer are combined through element-wise addition to create the composite representation:

  $$W_l = \sum_{i=1}^{n} s_i \tag{1}$$

  where $W_l$ is the resulting weighted feature representation.

- Averaging: The outcomes obtained from the Excitation Layers undergo averaging to allow for an equitable aggregation:

  $$W_l = \frac{1}{n} \sum_{i=1}^{n} s_i \tag{2}$$

  where each $s_i$ preserves a fair contribution to the final merged representation.

- Hadamard Product: The outputs resulting from the Excitation Layers are element-wise multiplied:

  $$W_l = \prod_{i=2}^{n} s_{i-1} \odot s_i \tag{3}$$

- Concatenation: The outputs of the Excitation Layers are concatenated along the feature dimension:

  $$W_l = [\, s_1 \,\|\, s_2 \,\|\, \cdots \,\|\, s_n \,] \tag{4}$$

  where $\|$ represents the concatenation operator, thus increasing the feature space dimensionality.
The four setups enable a thorough evaluation of
the impact that different merging methods have on the
performance of the model, thus providing insightful
information on their respective contributions to inter-
pretability and accuracy.
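For illustration, the four merge strategies can be expressed as a single helper over the Excitation Layer outputs; this is a sketch following Eqs. (1)-(4) as written above, not the authors' implementation, and the function and mode names are assumptions.

```python
# Sketch of the four merge strategies applied to the Excitation Layer outputs
# s_1 ... s_n, each a tensor of shape (batch, features). The Hadamard variant
# follows Eq. (3) as written above.
import torch

def merge(outputs, mode: str) -> torch.Tensor:
    stacked = torch.stack(outputs, dim=0)             # (n, batch, features)
    if mode == "add":                                  # Eq. (1)
        return stacked.sum(dim=0)
    if mode == "avg":                                  # Eq. (2)
        return stacked.mean(dim=0)
    if mode == "hadamard":                             # Eq. (3)
        result = torch.ones_like(outputs[0])
        for prev, curr in zip(outputs[:-1], outputs[1:]):
            result = result * (prev * curr)
        return result
    if mode == "concat":                               # Eq. (4)
        return torch.cat(outputs, dim=-1)
    raise ValueError(f"unknown merge mode: {mode}")
```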
5 RESULTS AND DISCUSSION
In this section, an empirical comparison is made
among the proposed architecture, the earlier
Adaptive Multiscale Deep Neural Network (Denta-
maro et al., 2024), and the baseline models. The com-
parison is carried out across various datasets and mea-
sured in terms of classification performance using the F1-
weighted score, which is a balanced measure in case
of class imbalance. The motivation behind the com-
parison is to analyze if the addition of Kolmogorov-
Arnold Network layers can enhance the performance
of classification along with interpretability.
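For reference, the F1-weighted score used throughout the comparison can be computed with scikit-learn; weighting by class support keeps the metric meaningful on the imbalanced datasets, and the toy labels below are purely illustrative.

```python
# Weighted F1: per-class F1 scores averaged with weights proportional to
# class support, as used for all results in Table 2.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(f1_score(y_true, y_pred, average="weighted"))
```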
5.1 Performance Evaluation
The results presented in Table 2 demonstrate that
the suggested Adaptive Multiscale Deep Neural Net-
work, equipped with Kolmogorov-Arnold Network
layers, performs better than the baseline architecture
on nearly all datasets, the sole exceptions being Heart
Disease and Higgs Boson. This is seen consistently
across merging strategies, indicating that the substitu-
tion of the last dense layers with Kolmogorov-Arnold
Network layers significantly improves the model’s ca-
pacity to comprehend intricate feature relationships
without sacrificing accuracy.
As with the original Adaptive Multiscale Deep Neu-
ral Network (Dentamaro et al., 2024), the best con-
figurations in the new architecture include the use of
either the Hadamard Product or Concatenation within
the merge layer. Both methods appear to maximize
the use of multi-scale feature representations, thus im-
proving classification performance. However, such
high performance levels are also maintained using
other merging layers, such as Addition and Averag-
ing, thereby highlighting the resilience of the new ar-
chitecture with different configurations.
Despite these general improvements, the sug-
gested architecture failed to surpass the baseline
model’s performance on the Heart Disease and Higgs
Boson datasets.
- Heart Disease Dataset: Given the intrinsic complexity of Kolmogorov-Arnold Networks, the relatively low number of instances likely limited the model's ability to generalize well.
- Higgs Boson Dataset: The new architecture performs on par with the original one; both remain competitive but fail to beat tree-based models such as XGBoost and CatBoost.
The results show that the suggested changes sig-
nificantly improve the performance of the Adaptive
Multiscale Deep Neural Network on most datasets,
while at the same time retaining its interpretability
benefits. The slight underperformance on the Heart
Disease and Higgs Boson datasets is likely due to
problem-specific difficulties, as opposed to funda-
mental limitations of the proposed architecture.
5.2 Impact of Interpretability
Kolmogorov–Arnold Networks (KANs) have a clear
interpretability advantage compared to standard dense
layers. As shown in (Liu et al., 2024), every connec-
tion in the network corresponds to a learnable univari-
ate function, enabling the visualization and investiga-
tion of the routes that input signals take through dif-
ferent layers. Spline-based functions naturally enable
investigation and can be optimized, where necessary,
with techniques like pruning or symbolic fitting.
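A minimal sketch of such an inspection: evaluate one learned edge function over a one-dimensional grid and plot the resulting curve. The edge_function callable is a stand-in for whatever accessor the chosen KAN implementation exposes for a single (input p, output q) edge; pykan, for instance, provides a model-level plot() that renders whole layers like Figures 2 and 3.

```python
# Sketch: visualize one learned univariate edge function phi_{p,q} by
# sampling it on a 1-D grid. The callable passed in stands in for the real
# learned spline of a trained KAN layer.
import numpy as np
import matplotlib.pyplot as plt

def plot_edge(edge_function, p: int, q: int, lo: float = -2.0, hi: float = 2.0):
    xs = np.linspace(lo, hi, 200)
    ys = np.array([edge_function(x) for x in xs])
    plt.plot(xs, ys)
    plt.xlabel(f"input feature x_{p}")
    plt.ylabel(f"phi_{{{p},{q}}}(x_{p})")
    plt.title(f"Learned univariate function on edge ({p} -> {q})")
    plt.show()

# Example with a stand-in function (a trained KAN would supply the real one).
plot_edge(lambda x: np.tanh(1.5 * x) + 0.1 * x**2, p=7, q=0)
```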
The Adaptive Multi-Scale Attention Block out-
lined in (Dentamaro et al., 2024) always preserves the
integrity of the original feature set. Specifically, the
recurrent processing of the input before the trainable
attention layer ensures that the subsequent transfor-
mations preserve information relevant to the original
features. Therefore, inspection of the last two dense
layers is instrumental for obtaining a complete under-
standing of the influence that these original features
have on the classification.
A representative case is demonstrated by the ex-
amination of the second KAN layer of the model
trained on the Click dataset. This dataset has 11 in-
put features and a single binary target. Figure 2 shows
the layer after initialization, with each branch having
uniform intensity. While the functions might look su-
perficially equivalent, they contain differences due to
small random fluctuations, which result from the ran-
dom initialization process outlined in the initial paper
regarding Kolmogorov–Arnold networks.
Figure 2: 2nd KAN layer plot after the initialization on
Click-Through Rate (Aden and Wang, 2012).
When the second layer is trained on the Click
dataset, it undergoes a significant change, as can be
seen in Figure 3. Each of the functions has been
adapted to fit the distribution of the data, a fact well
represented by the difference in intensity between the
branches, which represent how much each of the fea-
tures affects the classification. It can be seen that the
seventh feature seems to have the greatest impact, as
represented by the darker branch leading from it to
the output of the layer. This representation well illus-
trates how the spline-based functions are optimized
for the problem at hand, as well as providing an in-
stantaneous way of determining the most relevant in-
puts for the final prediction.

Table 2: F1-weighted average scores (± standard deviation) on various datasets.

| Method | Arrhythmia | Wisconsin Breast Cancer | Cervical Cancer | Diabetic Retinopathy | Heart Disease | Click Through Rate | Higgs Boson |
|---|---|---|---|---|---|---|---|
| Support Vector Machines with RBF | 0.486 ± 0.090 | 0.970 ± 0.010 | 0.991 ± 0.010 | 0.704 ± 0.020 | 0.521 ± 0.060 | 0.521 ± 0.060 | 0.764 ± 0.001 |
| Decision Tree | 0.646 ± 0.060 | 0.902 ± 0.030 | 0.994 ± 0.004 | 0.611 ± 0.020 | 0.490 ± 0.050 | 0.500 ± 0.030 | 0.716 ± 0.001 |
| Multi Layer Perceptron | 0.634 ± 0.060 | 0.971 ± 0.010 | 0.989 ± 0.008 | 0.706 ± 0.020 | 0.503 ± 0.070 | 0.487 ± 0.050 | 0.761 ± 0.001 |
| Kolmogorov Arnold Network | 0.673 ± 0.070 | 0.973 ± 0.010 | 0.992 ± 0.008 | 0.712 ± 0.020 | 0.510 ± 0.050 | 0.533 ± 0.020 | 0.758 ± 0.002 |
| Random Forest | 0.665 ± 0.030 | 0.948 ± 0.020 | 0.987 ± 0.010 | 0.685 ± 0.030 | 0.525 ± 0.080 | 0.536 ± 0.060 | 0.741 ± 0.001 |
| XGBoost | 0.675 ± 0.030 | 0.970 ± 0.008 | 0.991 ± 0.010 | 0.694 ± 0.001 | 0.507 ± 0.050 | 0.507 ± 0.050 | 0.774 ± 0.001 |
| TabNet | 0.392 ± 0.090 | 0.935 ± 0.050 | 0.990 ± 0.010 | 0.617 ± 0.050 | 0.464 ± 0.050 | 0.464 ± 0.050 | 0.763 ± 0.001 |
| CatBoost | 0.664 ± 0.050 | 0.970 ± 0.020 | 0.987 ± 0.009 | 0.682 ± 0.020 | 0.539 ± 0.050 | 0.539 ± 0.050 | 0.769 ± 0.001 |
| Adaptive Multiscale Attention - Add | 0.679 ± 0.040 | 0.970 ± 0.004 | 0.996 ± 0.004 | 0.740 ± 0.020 | 0.533 ± 0.050 | 0.550 ± 0.050 | 0.769 ± 0.001 |
| Adaptive Multiscale Attention - Avg | 0.619 ± 0.030 | 0.970 ± 0.010 | 0.995 ± 0.005 | 0.744 ± 0.010 | 0.535 ± 0.050 | 0.551 ± 0.050 | 0.765 ± 0.001 |
| Adaptive Multiscale Attention - Mul | 0.648 ± 0.004 | 0.969 ± 0.010 | 0.995 ± 0.005 | 0.724 ± 0.010 | 0.585 ± 0.050 | 0.609 ± 0.040 | 0.775 ± 0.001 |
| Adaptive Multiscale Attention - Conc | 0.630 ± 0.040 | 0.967 ± 0.010 | 0.995 ± 0.005 | 0.723 ± 0.020 | 0.497 ± 0.050 | 0.558 ± 0.050 | 0.766 ± 0.001 |
| Adaptive Multiscale Attention With KANs - Add | 0.667 ± 0.083 | 0.986 ± 0.017 | 0.996 ± 0.008 | 0.739 ± 0.027 | 0.548 ± 0.121 | 0.582 ± 0.001 | 0.767 ± 0.003 |
| Adaptive Multiscale Attention With KANs - Avg | 0.661 ± 0.077 | 0.988 ± 0.018 | 0.996 ± 0.008 | 0.740 ± 0.027 | 0.576 ± 0.083 | 0.591 ± 0.001 | 0.773 ± 0.002 |
| Adaptive Multiscale Attention With KANs - Mul | 0.682 ± 0.077 | 0.979 ± 0.017 | 0.997 ± 0.008 | 0.706 ± 0.029 | 0.558 ± 0.056 | 0.615 ± 0.001 | 0.761 ± 0.002 |
| Adaptive Multiscale Attention With KANs - Conc | 0.699 ± 0.075 | 0.984 ± 0.018 | 0.993 ± 0.009 | 0.753 ± 0.032 | 0.562 ± 0.085 | 0.589 ± 0.001 | 0.766 ± 0.002 |
Figure 3: 2nd KAN layer plot after the training on Click-
Through Rate (Aden and Wang, 2012).
6 CONCLUSION AND FUTURE
WORK
This paper presents AMAKAN, a fully interpretable
version of the Adaptive Multiscale Deep Neural Net-
work architecture for tabular data classification, re-
placing the last two dense layers of the original model
with Kolmogorov–Arnold Network layers. This en-
hancement takes advantage of the inherent flexibil-
ity of Kolmogorov–Arnold Networks, which replace
fixed activations with spline-based, learnable func-
tions, thereby offering explicit insight into feature
transformations at every stage of the network.
Empirical evaluations conducted over several
datasets demonstrate that the proposed modified ar-
chitecture performs better than the original Adaptive
Multiscale Deep Neural Network in the majority of
the test cases with higher F1-weighted scores. The
rationale behind such improvement is the fact that
the Kolmogorov–Arnold Network layers are adaptive
and thus efficiently learn complicated nonlinear map-
pings of the input data, offering a better and more
precise feature representation compared to standard
dense layers.
One of the most important advantages of Kol-
mogorov–Arnold Networks is the increased inter-
pretability. In contrast to dense layers founded on
linear transformations followed by fixed activation
functions, Kolmogorov–Arnold Networks allow one
to utilize visualizable and learnable spline functions,
which are explicit descriptions of complicated non-
linear transformations. This makes it possible to clearly see the
contribution of every feature to predictions with con-
siderably greater transparency.
Looking ahead, several promising direc-
tions for future work have emerged. One direction is
the extension of this hybrid architecture to regression
problems, which may further cement the general use-
fulness and applicability of the method beyond clas-
sification problems. Another direction is the appli-
cation of other interpretability techniques instead of
Kolmogorov–Arnold Networks for the final two lay-
ers, which may yield novel insights and tools, and po-
tentially give a deeper insight into feature relations
and interactions in the data.
Overall, the integration of Adaptive Multiscale
Attention mechanisms with Kolmogorov–Arnold
Networks represents a significant advance towards
fully interpretable, efficient, and effective deep learn-
ing models for the specific needs of tabular data anal-
ysis. The hybrid solution is an attractive direction
of future research that addresses the essential trade-
off between model interpretability and performance
in real-world machine learning applications.
REFERENCES
Aden and Wang, Y. (2012). Kdd cup 2012, track 2.
https://kaggle.com/competitions/kddcup2012-track2.
Antal, B. and Hajdu, A. (2014). Diabetic Retinopathy De-
brecen. UCI Machine Learning Repository. DOI:
https://doi.org/10.24432/C5XP4P.
Arik, S. Ö. and Pfister, T. (2021). Tabnet: Attentive inter-
pretable tabular learning. In Proceedings of the AAAI
conference on artificial intelligence, volume 35, pages
6679–6687.
Aversano, L., Bernardi, M. L., Cimitile, M., Iammarino,
M., and Verdone, C. (2023). A data-aware explain-
able deep learning approach for next activity predic-
tion. Engineering Applications of Artificial Intelli-
gence, 126.
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching
for Exotic Particles in High-Energy Physics with Deep
Learning. Nature Commun., 5:4308.
Bozorgasl, Z. and Chen, H. (2024). Wav-kan: Wavelet
kolmogorov-arnold networks. arXiv preprint
arXiv:2405.12832.
Breiman, L. (2001). Random forests. Machine learning,
45:5–32.
Chen, T. and Guestrin, C. (2016). Xgboost: A scalable
tree boosting system. In Proceedings of the 22nd acm
sigkdd international conference on knowledge discov-
ery and data mining, pages 785–794.
De Carlo, G., Mastropietro, A., and Anagnostopoulos, A.
(2024). Kolmogorov-arnold graph neural networks.
arXiv preprint arXiv:2406.18354.
Dentamaro, V., Giglio, P., Impedovo, D., Pirlo, G., and
Ciano, M. D. (2024). An interpretable adaptive mul-
tiscale attention deep neural network for tabular data.
IEEE Transactions on Neural Networks and Learning
Systems, pages 1–15.
Dentamaro, V., Impedovo, D., and Pirlo, G. (2018). Licic:
less important components for imbalanced multiclass
classification. Information, 9(12):317.
Dentamaro, V., Impedovo, D., and Pirlo, G. (2021). An
analysis of tasks and features for neuro-degenerative
disease assessment by handwriting. In International
Conference on Pattern Recognition, pages 536–545.
Springer.
Esposito, C., Janneh, M., Spaziani, S., Calcagno, V.,
Bernardi, M. L., Iammarino, M., Verdone, C., Taglia-
monte, M., Buonaguro, L., Pisco, M., Aversano, L.,
and Cusano, A. (2023). Assessment of primary human
liver cancer cells by artificial intelligence-assisted ra-
man spectroscopy. Cells, 12(22).
Fernandes, K., Cardoso, J., and Fernandes, J. (2017). Cer-
vical Cancer (Risk Factors). UCI Machine Learning
Repository. DOI: https://doi.org/10.24432/C5Z310.
Galitsky, B. A. (2024). Kolmogorov-arnold network for
word-level explainable meaning representation.
Gattulli, V., Impedovo, D., Pirlo, G., and Semeraro,
G. (2023a). Handwriting task-selection based on
the analysis of patterns in classification results on
alzheimer dataset. In DSTNDS, pages 18–29.
Gattulli, V., Impedovo, D., Pirlo, G., and Volpe, F. (2023b).
Touch events and human activities for continuous
authentication via smartphone. Scientific Reports,
13(1):10515.
Guvenir, H. A., Acar, B., Muderrisoglu, H., and Quinlan, R. (1997). Ar-
rhythmia. UCI Machine Learning Repository. DOI:
https://doi.org/10.24432/C5BS32.
Janosi, A., Steinbrunn, W., Pfisterer, M., and Detrano, R. (1989). Heart
Disease. UCI Machine Learning Repository. DOI:
https://doi.org/10.24432/C52P4X.
Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J.,
Soljačić, M., Hou, T. Y., and Tegmark, M. (2024).
Kan: Kolmogorov-arnold networks. arXiv preprint
arXiv:2404.19756.
Luo, H., Cheng, F., Yu, H., and Yi, Y. (2021). Sdtr: Soft
decision tree regressor for tabular data. IEEE Access,
9:55999–56011.
Popov, S., Morozov, S., and Babenko, A. (2019). Neural
oblivious decision ensembles for deep learning on tab-
ular data. arXiv preprint arXiv:1909.06312.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush,
A. V., and Gulin, A. (2018). Catboost: unbiased boost-
ing with categorical features. Advances in neural in-
formation processing systems, 31.
Schmidt-Hieber, J. (2021). The kolmogorov–arnold rep-
resentation theorem revisited. Neural networks,
137:119–126.
Sun, J. Q. (2024). Evaluating kolmogorov–arnold networks
for scientific discovery: A simple yet effective ap-
proach.
Wolberg, W., Mangasarian, O., Street, N., and Street, W.
(1993). Breast Cancer Wisconsin (Diagnos-
tic). UCI Machine Learning Repository. DOI:
https://doi.org/10.24432/C5DW2B.
Xu, K., Chen, L., and Wang, S. (2024). Kolmogorov-arnold
networks for time series: Bridging predictive power
and interpretability. arXiv preprint arXiv:2406.02496.