2.2.1 UNet
UNet, originally developed for image segmentation
tasks, is employed in this study to handle the complex
structure of email data by capturing both local and
global features (Ronneberger et al., 2015). Its encoder-
decoder architecture, combined with skip
connections, allows for efficient feature extraction
and reconstruction. Although traditionally used in
image processing, UNet has proven to be adaptable to
other domains, including text classification.
The UNet architecture comprises two main elements: the encoder and the decoder. The encoder progressively reduces the resolution of the input while capturing its salient features, whereas the decoder up-samples these features back toward the original resolution to reconstruct them effectively.
Skip connections between corresponding layers of the
encoder and decoder ensure that important
information is retained during the reconstruction
process, preventing the loss of critical features. The
key strength of UNet lies in its ability to retain both
high-level abstract features and detailed local
patterns. This capability is highly beneficial for identifying spam emails, as even slight variations in the structure or content of an email can serve as indicators for distinguishing legitimate messages from spam. The skip connections in the UNet
architecture allow the model to preserve these fine
details while processing the overall structure of the
email.
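As an illustration of this mechanism, the sketch below (a PyTorch-style example, not the authors' implementation) shows a single decoder step in which the up-sampled features are concatenated with the matching encoder output before further convolution.

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder step: up-sample, then concatenate the matching encoder features (skip connection)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv1d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU())

    def forward(self, x, skip):
        x = self.up(x)                   # restore resolution lost in the encoder
        x = torch.cat([x, skip], dim=1)  # re-inject fine-grained encoder detail
        return self.conv(x)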
In this experiment, UNet was adapted for email
classification by applying 1D convolutional layers
instead of the traditional 2D layers used in image
processing. The email data, after being pre-processed
and vectorized, was passed through the UNet model,
where the encoder extracted important features, and
the decoder reconstructed them for final
classification. The model was trained with the Adam optimizer for stable convergence and a cross-entropy loss function. The input, consisting of tokenized emails, was processed through several down-sampling layers, followed by up-sampling and feature reconstruction. Batch normalization and dropout were applied to prevent overfitting and improve robustness. The final output was a probability distribution over the two classes (spam and ham), obtained through a softmax activation.
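A minimal sketch of such an adapted model is given below, assuming a PyTorch implementation; the vocabulary size, embedding dimension, channel widths, and dropout rate are illustrative assumptions rather than values reported here.

import torch
import torch.nn as nn

class UNet1DClassifier(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Encoder: Conv1d + BatchNorm + ReLU stages separated by pooling
        self.enc1 = nn.Sequential(nn.Conv1d(embed_dim, 64, 3, padding=1),
                                  nn.BatchNorm1d(64), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(64, 128, 3, padding=1),
                                  nn.BatchNorm1d(128), nn.ReLU())
        self.pool = nn.MaxPool1d(2)
        self.bottleneck = nn.Sequential(nn.Conv1d(128, 256, 3, padding=1),
                                        nn.BatchNorm1d(256), nn.ReLU())
        # Decoder: up-sampling with skip connections back to encoder outputs
        self.up2 = nn.ConvTranspose1d(256, 128, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv1d(256, 128, 3, padding=1),
                                  nn.BatchNorm1d(128), nn.ReLU())
        self.up1 = nn.ConvTranspose1d(128, 64, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv1d(128, 64, 3, padding=1),
                                  nn.BatchNorm1d(64), nn.ReLU())
        self.dropout = nn.Dropout(0.5)
        self.head = nn.Linear(64, num_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # -> (batch, embed_dim, seq_len)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        pooled = self.dropout(d1.mean(dim=2))      # global average pooling over the sequence
        return self.head(pooled)                   # logits; softmax applied at the loss/inference stage

# Training setup as described: Adam optimizer with cross-entropy loss
model = UNet1DClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # applies log-softmax internally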
2.2.2 Diffusion Model
The Diffusion Model, a type of probabilistic
generative model, plays a complementary role in the
proposed framework. Diffusion models have been
widely used for tasks requiring the modelling of
complex, non-linear data distributions (Hu et al.,
2020). In this study, the Diffusion Model is used to
process noisy email data, simulating how spam
characteristics can evolve over time and how the
model can reverse these transformations to accurately
classify emails.
Diffusion models operate by progressively adding noise to the input data and then learning to reverse this corruption through a sequence of denoising steps. This
process of reverse diffusion allows the model to
effectively grasp the inherent patterns within the data,
which proves highly advantageous in spam detection
scenarios where spam attributes are frequently
concealed or camouflaged. The strength of the
Diffusion Model lies in its ability to handle noisy and
complex datasets. Emails, especially spam, often
contain noise in the form of obfuscation techniques
designed to bypass filters. The Diffusion Model is
capable of recognizing these patterns and
reconstructing the original email features, leading to
more accurate classification results.
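In the standard denoising-diffusion formulation, which this description follows (the exact noise schedule is not stated here, so the schedule \beta_t below is an assumption), the forward process adds Gaussian noise over T steps and admits a closed form at any step t:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \qquad
\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s),

where x_0 is the vectorized email and x_t its noised version at step t; the reverse (denoising) model is trained to approximate p_\theta(x_{t-1} \mid x_t), recovering the original features step by step.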
The application of the Diffusion Model involved systematically adding Gaussian noise to the email data and, over a series of iterations, training the model to reverse this noise and recover the original input. The fidelity of the reconstructed email features was enforced with a mean squared error (MSE) loss, which encourages close resemblance between the original input and its reconstruction. The implementation used 100 incremental noise steps, with the model trained to undo this progression. The denoised output was then passed through a fully connected layer for classification. As with UNet, the Adam optimizer was employed, and a grid search was used to tune hyperparameters such as the learning rate and the number of noise steps.
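The sketch below illustrates such a setup, assuming a PyTorch implementation; the denoiser width, the linear noise schedule, the learning rate, and the cross-entropy term for the classification head are illustrative assumptions rather than reported details.

import torch
import torch.nn as nn

NUM_STEPS = 100                                # number of incremental noise steps, as described above
betas = torch.linspace(1e-4, 0.02, NUM_STEPS)  # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Predicts the clean feature vector from its noised version and the step index."""
    def __init__(self, feature_dim=512, hidden_dim=256, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim + 1, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, feature_dim))
        self.classifier = nn.Linear(feature_dim, num_classes)  # fully connected head on the denoised output

    def forward(self, x_noisy, t):
        t_feat = t.float().unsqueeze(1) / NUM_STEPS             # simple step conditioning
        x_denoised = self.net(torch.cat([x_noisy, t_feat], dim=1))
        return x_denoised, self.classifier(x_denoised)

model = Denoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()            # reconstruction loss between original and denoised features
ce = nn.CrossEntropyLoss()    # assumed classification term for the fully connected head

def training_step(x0, labels):
    """x0: vectorized email features (batch, feature_dim); labels: 0 = ham, 1 = spam."""
    t = torch.randint(0, NUM_STEPS, (x0.size(0),))
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].unsqueeze(1)
    x_noisy = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise    # forward noising in closed form
    x_denoised, logits = model(x_noisy, t)
    loss = mse(x_denoised, x0) + ce(logits, labels)             # reconstruction + classification
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()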
2.2.3 Loss Function
The combined architecture of UNet and the Diffusion
Model required carefully chosen loss functions to