Facebook Fake News Dataset, representing fact-
checked news articles that were shared over social
media platforms. Aside from these datasets,
researchers can gather data through scraping online
news sources, social media sites, or fact-checking sites such as Snopes and FactCheck.org. To avoid
bias in the model, it is important to have a balanced
number of real and fake news articles in the dataset.
Additional information for categorization might be
gleaned from metadata like publication date, author,
and social media engagement metrics.
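For illustration, a minimal class-balance check along these lines can be performed with pandas; the file name news_dataset.csv and the label column used below are hypothetical placeholders, not a specific benchmark:

import pandas as pd

# Hypothetical combined dataset with a 'label' column ('real' or 'fake')
df = pd.read_csv("news_dataset.csv")

# Inspect the class distribution to detect imbalance before training
counts = df["label"].value_counts()
print(counts)
print("Fake/real ratio:", counts.get("fake", 0) / max(counts.get("real", 0), 1))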
B. Data Pre-processing
Once the dataset is collected, it undergoes pre-processing to clean and structure the textual content for analysis. Raw news articles contain noise such as punctuation, stop words, special characters, and inconsistent formatting, which can degrade model performance. The key pre-processing steps include text cleaning, in which special characters, punctuation, and numbers are removed and text is converted to lowercase to ensure uniformity; HTML tags and URLs from web-scraped content are also eliminated. Stop word removal is performed using libraries such as NLTK and SpaCy to filter out common words such as "the," "is," and "and," which contribute little contextual information. Tokenization and lemmatization are then applied to split the text into individual words or phrases and convert words to their base form (e.g., "running" → "run"), improving the model's ability to handle word variations.
To make the text machine-readable, text vectorization methods such as Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), as well as word embeddings such as Word2Vec, GloVe, or transformer-based embeddings such as BERT, are used to capture contextual meaning.
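As a brief sketch of TF-IDF vectorization using scikit-learn (the toy corpus below is purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-cleaned article texts (illustrative only)
corpus = [
    "government confirms new economic policy",
    "celebrity secretly controls world banks claims viral post",
    "study shows vaccine is safe and effective",
]

# Fit a TF-IDF vectorizer over unigrams and bigrams, keeping the most frequent terms
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(X.shape)                                 # (3, number_of_terms)
print(vectorizer.get_feature_names_out()[:10])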
Another crucial step is handling class imbalance, where techniques like
oversampling the minority class, undersampling the
majority class, or using Synthetic Minority Over-
sampling Technique (SMOTE) help ensure balanced
training data. Effective pre-processing enhances
model efficiency, reduces noise, and allows machine
learning models to extract relevant patterns for
accurate fake news classification, ultimately
improving the reliability of automated
misinformation detection systems.
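As an illustration of the class-imbalance handling described above, a minimal sketch using the imbalanced-learn library is given below; the feature matrix is generated synthetically here only to stand in for an already vectorized, imbalanced news dataset:

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for a vectorized, imbalanced dataset (about 90% real, 10% fake)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE synthesizes new minority-class samples until both classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_resampled))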
4 PROPOSED METHODOLOGY
One of the significant applications of Natural
Language Processing (NLP) is fake news detection,
tasked with classifying news articles as fake or real.
The prevalence of misinformation across digital media is increasing, making the development of an automatic system to evaluate the authenticity of news content highly important. The steps followed in our methodology are data collection, pre-processing, feature extraction, model training, and testing, yielding an effective classification system.
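As a hedged end-to-end sketch of this workflow using scikit-learn (the file name news_dataset.csv and the text/label column names are assumptions, not tied to a particular benchmark):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical labelled dataset with 'text' and 'label' (0 = real, 1 = fake) columns
df = pd.read_csv("news_dataset.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)

# Feature extraction (TF-IDF) and classification (logistic regression) in one pipeline
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))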
The process begins with data collection, where a diverse dataset of real and fake news articles is gathered from trusted sources. Publicly available datasets such as LIAR, FakeNewsNet, or Kaggle datasets consist of labelled news samples that serve as the basis for training and testing. These datasets include news headlines, article content, and metadata, which enable the model to learn the language patterns of both false and true news. Since news stories may be biased towards certain issues, data covering a variety of sources and categories helps improve the generalizability of the model. After the data is collected, pre-processing is done to clean and normalize the text.
This operation removes non-essential elements such as punctuation, stop words, special characters, and HTML tags and converts the text into a normalized form, typically lowercase. The text is split into words or phrases using tokenization, and lemmatization or stemming is used to reduce words to their root forms. These processes help eliminate redundancy and improve feature extraction, so that only useful parts of the text contribute to the classification model.
Next, in order to transform the raw text into numerical values readable by machine learning models, feature extraction is performed. Traditional methods such as Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW) are commonly used to represent the importance and frequency of words in a document. More advanced approaches use word embeddings (e.g., Word2Vec, GloVe, or contextual embeddings from transformer models such as BERT) that capture semantic relationships between words. In addition to text-based features, metadata such as source credibility, writing style, and sentiment can provide additional information that helps the model differentiate between real and fake news.
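A minimal sketch of the word-embedding approach, using gensim's Word2Vec on a toy, pre-tokenized corpus (the corpus is purely illustrative; a real model would be trained on the full article collection):

from gensim.models import Word2Vec

# Toy pre-tokenized corpus (illustrative only)
sentences = [
    ["government", "announces", "new", "tax", "policy"],
    ["viral", "post", "claims", "celebrities", "control", "government"],
    ["official", "statement", "denies", "viral", "claims"],
]

# Train a small Word2Vec model: each word is mapped to a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["government"].shape)            # (50,)
print(model.wv.most_similar("viral", topn=3))  # nearest words in this toy corpus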
Machine learning and deep learning models trained on the extracted features then perform the classification by identifying patterns. Classical classifiers such as logistic regression, Support Vector Machines (SVM), and Random Forest have worked well in previous research. However, more recent deep learning architectures such as Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNN), and Transformer-based models such as BERT have been shown to achieve superior performance on most NLP