Modeling (MLM) and Next Sentence Prediction
(NSP). By pre-training on large-scale natural language corpora with these two tasks, BERT learns rich contextual semantic representations. BERT exhibits excellent performance on tasks such as natural language inference, named entity recognition, and machine translation. Consequently, fine-tuning
BERT for downstream tasks has emerged as a popular
method.
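As a concrete illustration of this practice, the following minimal sketch shows how a pre-trained BERT checkpoint can be fine-tuned for a binary classification task. It assumes the PyTorch and Hugging Face transformers libraries and the bert-base-uncased checkpoint; the toy batch and hyperparameters are illustrative only and are not those used in this study.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load a pre-trained checkpoint and attach a randomly initialized classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step on a toy batch.
texts = ["great product, works as advertised", "you have won a free prize, click here"]
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

model.train()
loss = model(**batch, labels=labels).loss   # cross-entropy over the [CLS] representation
loss.backward()
optimizer.step()
optimizer.zero_grad()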
To further enhance the performance of pre-trained
language models in various NLP tasks, researchers
have proposed diverse methods based on the BERT
architecture. Robustly Optimized BERT Pretraining
Approach (RoBERTa) optimizes the training strategy
of BERT by removing the NSP task (Liu, 2019). To
overcome the limitations of BERT in capturing global
bidirectional context, XLNet introduces permutation language modeling within a bidirectional
autoregressive framework (Yang, 2019). ALBERT
improves parameter efficiency through factorized embedding parameterization and cross-layer parameter sharing. Additionally, ALBERT replaces the NSP
task with Sentence Order Prediction (SOP),
considerably enhancing the model's ability to capture
sentence coherence (Lan, 2019). ELECTRA does not rely on [MASK]-token prediction for its main model. Instead, it performs a Replaced Token Detection (RTD) task and
trains a discriminator to predict whether each token in
the corrupted input has been replaced by a generator
sample (Clark, 2020). Decoding-enhanced BERT with disentangled attention (DeBERTa) incorporates
a disentangled attention mechanism and an enhanced
masked decoder, further refining the model's
understanding of content and positional information
(He, 2020). All of the aforementioned models aim to improve performance by modifying BERT's pre-training objectives or internal mechanisms. Furthermore, researchers have proposed numerous lightweight enhancements to BERT. Stacked DeBERT adds a Denoising
Transformer Layer on top of the BERT architecture,
improving the model's robustness to incomplete data
(Sergio et al., 2021). Adapter tuning inserts two adapter modules into each Transformer layer, allowing the model to adapt to downstream tasks by updating only the adapter parameters, which greatly reduces the cost of fine-tuning (Houlsby et al., 2019). Similarly, P-tuning introduces learnable prompt tokens into the input, reducing tuning cost and improving efficiency (Liu et al., 2021).
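To make the adapter structure described above concrete, the following minimal sketch outlines a bottleneck adapter in the style of Houlsby et al. (2019); the hidden and bottleneck dimensions and the class name are illustrative rather than taken from the original work.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project, non-linearity, up-project, residual connection.
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection preserves the frozen Transformer output;
        # only the small adapter projections are trained during fine-tuning.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

During fine-tuning, the pre-trained Transformer weights stay frozen and only such adapter modules and the task head are updated, which is the source of the cost savings noted above.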
The focus of this research is on enhancing BERT's
performance by integrating advanced attention
mechanisms with efficient fine-tuning strategies. The
proposed architecture, Prompt-enhanced BERT with
Denoising Disentangled Attention Layer (PE-BERT-
DDAL), introduces two key innovations. First, deep
prompt tuning is incorporated within the BERT layers based on P-tuning v2, where learnable prompt tokens are prefixed to the input sequence of each layer. This deep prompt
tuning allows for a more nuanced adaptation of the
model across multiple layers, leading to improved
context understanding. Second, the Disentangled
Attention mechanism from DeBERTa is integrated
into the model as part of a new Denoising
Transformer Layer, referred to as the DDAL. This
layer addresses the positional bias and content
distortion that may arise from prompt tuning, as well
as noise from incomplete data. By enhancing the
robustness of BERT’s output, DDAL refines the
model's capacity to handle both noisy and clean data
inputs. Experimental results show that fine-tuning
only the final layer of BERT yields competitive
performance, significantly reducing computational
costs. Unlike DeBERTa, which applies disentangled
attention across all layers, PE-BERT-DDAL requires
fewer parameters, making it a more resource-efficient
alternative while maintaining strong classification
capabilities.
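To make the overall design more concrete, a minimal sketch of one possible wiring of these components is given below. It is a simplified illustration rather than the exact implementation: it assumes PyTorch and the Hugging Face transformers library, uses single-head disentangled attention, and, for brevity, prefixes the learnable prompts only at the embedding level, whereas the deep prompt tuning described above injects prompts at every layer; class and parameter names are illustrative.

import torch
import torch.nn as nn
from transformers import BertModel

class DisentangledSelfAttention(nn.Module):
    # Simplified single-head disentangled attention: content-to-content,
    # content-to-position, and position-to-content terms over relative positions.
    def __init__(self, d=768, max_rel=128):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.rel_emb = nn.Embedding(2 * max_rel, d)  # relative position embeddings
        self.max_rel = max_rel
        self.scale = (3 * d) ** 0.5                  # scaling over the three score terms

    def forward(self, h):
        B, L, _ = h.shape
        q, k, v = self.q(h), self.k(h), self.v(h)
        pos = torch.arange(L, device=h.device)
        rel = (pos[:, None] - pos[None, :]).clamp(-self.max_rel, self.max_rel - 1) + self.max_rel
        r = self.rel_emb(rel)                                   # (L, L, d) relative vectors
        c2c = q @ k.transpose(-1, -2)                           # content-to-content
        c2p = torch.einsum("bid,ijd->bij", q, self.k(r))        # content-to-position
        p2c = torch.einsum("bjd,jid->bij", k, self.q(r))        # position-to-content
        attn = ((c2c + c2p + p2c) / self.scale).softmax(dim=-1)
        return attn @ v

class DDAL(nn.Module):
    # Denoising Transformer layer built around disentangled attention.
    def __init__(self, d=768):
        super().__init__()
        self.attn = DisentangledSelfAttention(d)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, h):
        h = self.norm1(h + self.attn(h))
        return self.norm2(h + self.ffn(h))

class PEBertDDAL(nn.Module):
    # Frozen BERT backbone + learnable prompt prefix + DDAL + classification head.
    def __init__(self, num_labels=2, n_prompts=16, d=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False               # only prompts, DDAL, and head are trained
        self.prompts = nn.Parameter(torch.randn(n_prompts, d) * 0.02)
        self.ddal = DDAL(d)
        self.classifier = nn.Linear(d, num_labels)

    def forward(self, input_ids, attention_mask):
        B = input_ids.size(0)
        emb = self.bert.embeddings(input_ids=input_ids)
        prefix = self.prompts.unsqueeze(0).expand(B, -1, -1)
        h = torch.cat([prefix, emb], dim=1)       # prepend prompt tokens (embedding level)
        mask = torch.cat(
            [torch.ones(B, prefix.size(1), dtype=attention_mask.dtype,
                        device=attention_mask.device), attention_mask], dim=1)
        ext_mask = self.bert.get_extended_attention_mask(mask, h.shape[:2])
        h = self.bert.encoder(h, attention_mask=ext_mask).last_hidden_state
        h = self.ddal(h)                          # denoise and refine the frozen BERT output
        return self.classifier(h[:, 0])           # predict from the first position

In this sketch, only the prompt embeddings, the DDAL parameters, and the classification head receive gradients, which mirrors the reduced fine-tuning cost discussed above.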
2 METHODOLOGIES
2.1 Dataset Description and
Preprocessing
This study utilizes three Kaggle datasets for different
classification tasks: sentiment classification, fake
news detection, and spam detection. The Twitter
Entity Sentiment Analysis Dataset includes tweets
with sentiment labels (positive, negative, or neutral).
Tweets are often unstructured and contain noise such
as emojis, spelling errors, and slang, which renders
this dataset appropriate for testing the model’s
capability to handle complex and noisy text. The Fake
News Detection Dataset consists of textual content
labeled as real or fake, where fake news is
characterized by ambiguous wording and misleading
information. This task requires the model to comprehend semantics and identify deceptive reasoning. The Email Spam Detection Dataset contains
email bodies labeled as spam or non-spam, testing the
model's robustness in handling distracting features
and promotional language common in spam emails
(jp797498e, 2024; iamrahulthorat, 2024;
nitishabharathi, 2024).
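For reference, a minimal loading sketch for the three tasks is shown below; the file names, column names, and label strings are placeholders that depend on the specific Kaggle release rather than values reported here.

import pandas as pd

# Placeholder file names, column names, and label strings.
TASKS = {
    "sentiment": ("twitter_sentiment.csv", {"Positive": 0, "Negative": 1, "Neutral": 2}),
    "fake_news": ("fake_news.csv", {"real": 0, "fake": 1}),
    "spam":      ("email_spam.csv", {"ham": 0, "spam": 1}),
}

def load_task(name):
    path, label_map = TASKS[name]
    df = pd.read_csv(path)
    texts = df["text"].astype(str).tolist()        # raw text field
    labels = df["label"].map(label_map).tolist()   # map string labels to integer classes
    return texts, labels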
During data preprocessing, Hyper Text Markup
Language (HTML) tags, special characters, and