posts, averaging around 270 words of content and 28 words of summary each. Each record contains several string fields, including author, body, normalizedBody, content, summary, subreddit, and subreddit_id. The content field is used as the main text document, while the summary field holds the corresponding summary. A notable challenge with this dataset, however, is its lack of predefined labels, which are crucial for supervised learning tasks.
To address this limitation, this study turned to TextBlob, a Python library that simplifies common natural language processing tasks. TextBlob makes it possible to generate labels for the dataset automatically, at a scale that would be impractical through manual labeling: 3,848,330 labels are needed in total. Once the labels are produced, they are consolidated into a .csv file, which can then be read easily with Pandas.
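For illustration, the following is a minimal sketch of how such labels could be generated with TextBlob and stored for Pandas. The polarity threshold, column names, and file name are assumptions, since the exact labeling rule is not specified here:

```python
import pandas as pd
from textblob import TextBlob

# Illustrative labeling rule (an assumption): a post is labeled 1 if
# TextBlob's sentiment polarity for its content is positive, else 0.
def label_posts(contents):
    return [1 if TextBlob(text).sentiment.polarity > 0 else 0
            for text in contents]

contents = ["I love this subreddit!", "This was a terrible experience."]
df = pd.DataFrame({"content": contents, "label": label_posts(contents)})

# Consolidate the generated labels into a .csv file (name is illustrative).
df.to_csv("reddit_labels.csv", index=False)

# Later, the labels can be read back with Pandas.
df = pd.read_csv("reddit_labels.csv")
```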
The primary focus of this article is to explore how varying the number of training epochs and the batch size influences the overall accuracy of machine learning models. By conducting experiments with these parameters, this article aims to provide insight into the critical factors that affect model accuracy when predicting outcomes on the labeled Reddit dataset.
2 METHOD
This project focuses on the influence of the number of epochs and the batch size. By setting these parameters to different values, the effects of the two variables can be analyzed.
In detail, the number of epochs is set to 10, 20, and 30, and the batch size to 32, 64, and 128. Every experiment is repeated 3 times, using different seeds for the random generator that splits the data into train and test sets, as sketched below.
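The experiment grid can be summarized as follows. Here, build_model and split_dataset are hypothetical placeholders standing in for the model of Section 2.2 and the split of Section 2.1:

```python
def run_experiments(build_model, split_dataset):
    """Sketch of the grid: 3 seeds x 3 epoch settings x 3 batch sizes.
    build_model() returns a fresh Keras model (Section 2.2);
    split_dataset(seed) returns (x_train, x_test, y_train, y_test)."""
    results = {}
    for seed in (42, 58, 84):                      # three repetitions
        x_train, x_test, y_train, y_test = split_dataset(seed)
        for epochs in (10, 20, 30):
            for batch_size in (32, 64, 128):
                model = build_model()
                model.fit(x_train, y_train,
                          epochs=epochs, batch_size=batch_size)
                _, accuracy = model.evaluate(x_test, y_test)
                results[(seed, epochs, batch_size)] = accuracy
    return results
```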
2.1 Dataset Preparation
This project uses the “Reddit” dataset from the TensorFlow Datasets package. As described above, the dataset contains 3,848,330 Reddit posts, whose content field serves as the main text document and whose summary field holds the corresponding summary. Because the dataset originally contains no labels, the labels used in this project are generated with TextBlob. A random 20% of the dataset is used as the test set, while the remaining 80% is used as the train set. Fixing the seed of the random generator ensures that the split is reproducible and that the splitting itself does not affect the results of the model. The seeds used in this project are 42, 58, and 84.
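A minimal sketch of this split, assuming the labels have been consolidated into a .csv file as described above; the use of scikit-learn's train_test_split and the file name are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the TextBlob-generated labels (file name is illustrative).
df = pd.read_csv("reddit_labels.csv")

# 80/20 split with a fixed seed so the split is reproducible;
# the experiment is repeated with random_state set to 42, 58, and 84.
x_train, x_test, y_train, y_test = train_test_split(
    df["content"], df["label"], test_size=0.2, random_state=42)
```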
2.2 Deep Learning-based Classification
This project uses a 4-layer TensorFlow Keras sequential model (Pang, 2020). The 4 layers are an embedding layer, a GlobalAveragePooling1D layer, a dense layer with ReLU activation, and a dense layer with sigmoid activation.
The embedding layer takes integers as input and maps them to dense vector representations. The parameter “input_dim” is set to 5000, so the input data consists of integers between 0 and 4999. The “output_dim” is set to 128, which means each word is represented as a 128-dimensional vector. This layer enables the model to learn a fixed-length vector for each word, capturing semantic meaning.
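As a quick shape check (illustrative, not part of the project code), the embedding layer maps a batch of token ids in [0, 5000) to 128-dimensional vectors:

```python
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=128)
tokens = tf.constant([[12, 47, 4999, 0]])  # one sample, sequence length 4
print(embedding(tokens).shape)             # (1, 4, 128)
```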
The GlobalAveragePooling1D layer reduces the dimensionality of the embedding layer's output by averaging over the sequence dimension for each feature, producing a single vector per input sample and hence reducing the complexity of the model.
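Concretely, the pooling collapses the sequence axis, as this small illustrative check shows:

```python
import tensorflow as tf

# (batch, seq_len, features) -> (batch, features) by averaging over seq_len.
x = tf.random.normal((8, 100, 128))  # 8 samples, 100 tokens, 128-dim embeddings
pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
print(pooled.shape)                  # (8, 128)
```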
The dense layer with ReLU activation is a fully connected layer that takes the output of the pooling layer and applies transformations to learn complex patterns. The number of neurons in this layer is set to 10, and each neuron learns different features of the input data. The Rectified Linear Unit activation function outputs the input directly if it is positive and outputs 0 otherwise, introducing the non-linearity that lets the model learn more expressive, higher-level features on top of the pooled representation.
The dense layer with sigmoid activation acts as the final layer, generating the model’s output. The number of neurons in this layer is set to 1, since the model produces a single output. The sigmoid function maps this output to a value between 0 and 1, representing the probability of the positive class in the binary text classification task. This layer takes the learned features from the previous layer and outputs the final prediction.
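Putting the four layers together yields the following sketch of the model described above; the compile settings (optimizer and loss) are assumptions, as the text does not specify them, though binary cross-entropy matches the single sigmoid output:

```python
import tensorflow as tf

# The four-layer sequential model described in Section 2.2.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=128),  # token ids -> 128-dim vectors
    tf.keras.layers.GlobalAveragePooling1D(),                   # average over the sequence axis
    tf.keras.layers.Dense(10, activation="relu"),               # 10 neurons, learns higher-level features
    tf.keras.layers.Dense(1, activation="sigmoid"),             # probability of the positive class
])

# Assumed compile settings for binary classification.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```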