Noise in data describes random error in raw data. There are two primary categories of noise: attribute noise and class noise. The former refers to distortions in the values of attributes, including insufficient, missing, or unknown attribute values, whereas the latter refers to incorrectly labeled instances, which can be introduced by contradictory labeling or misclassification. Model performance is closely tied to the quality of the dataset and to the resilience of the model itself to noise. Hence, one of the main tasks in data cleaning is identifying noise in a dataset. Numerous strategies have been investigated for handling noisy data; the most important are noise filters, data polishing methods, and robust learners. Robust learners are
designed to be less affected by noisy data by using
pruning techniques to prevent overfitting. However,
even robust learners may perform poorly if the noise level is excessive. Data polishing methods aim to correct noisy examples prior to training, but they are most effective for small datasets because of their time-consuming nature. According to
research, performance can be enhanced by correcting
noise in training data while leaving test data noisy.
Noise filters, on the other hand, detect and remove noisy instances from the training data, and are particularly helpful for learners that are sensitive to noise.
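As a minimal illustration of a noise filter, the sketch below removes training instances that a cross-validated classifier misclassifies, treating them as suspected class noise. It assumes a scikit-learn style workflow; the choice of random forests and the simple agreement criterion are illustrative, not prescribed by the studies discussed here.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    def filter_noisy_instances(X, y, n_splits=5):
        # Out-of-fold predictions: each instance is predicted by a model
        # that never saw it during training.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        y_pred = cross_val_predict(clf, X, y, cv=n_splits)
        keep = y_pred == y  # keep only the instances the filter agrees with
        return X[keep], y[keep]

In practice, the fraction of discarded instances should be monitored, since an aggressive filter applied to clean data may remove legitimate but hard examples.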
2.3 Data Normalization
Data normalization is a preprocessing step in which the data is scaled or transformed so that each feature contributes on an equal footing. This improves data quality and, in turn, the performance of the machine learning algorithm. The normalization step transforms features into a common range so that features with large numeric values cannot dominate those with small values (Singh & Singh, 2020). Note, however, that this process does not imply that all features are equally important. Some features are tightly correlated with others, whereas others are superfluous or completely irrelevant; this is a problem addressed by data reduction.
As mentioned, there are two main concerns in data normalization: re-scaling dominant features and handling outliers. Mean and Standard Deviation Normalization and Min-Max Value Based Normalization are two broadly used methods, since they are simple and effective in most cases, and the former performs notably better when outliers are present. Other methods include decimal scaling normalization, median and median absolute deviation (MAD) normalization, and tanh-based normalization.
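For reference, the NumPy sketch below writes out these three alternatives in their commonly quoted textbook forms; the constant 0.01 in the tanh estimator and the exact scaling conventions are standard choices rather than values taken from a specific source.

    import numpy as np

    def decimal_scaling(x):
        # Divide by the smallest power of ten that maps all values into (-1, 1)
        # (assumes at least one non-zero value).
        j = int(np.floor(np.log10(np.max(np.abs(x))))) + 1
        return x / (10.0 ** j)

    def median_mad(x):
        # Center on the median and scale by the median absolute deviation,
        # both of which are robust to outliers.
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        return (x - med) / mad

    def tanh_norm(x):
        # Tanh-based (Hampel-style) normalization, mapping values into (0, 1).
        mu, sigma = np.mean(x), np.std(x)
        return 0.5 * (np.tanh(0.01 * (x - mu) / sigma) + 1.0)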
In Mean and Standard Deviation Normalization, the statistical mean and standard deviation are used to normalize the data. For example, in z-score normalization the raw data are rescaled so that the resulting features have zero mean and unit variance. These techniques help minimize the impact of outliers. In Min-Max Value Based Normalization, the data are typically rescaled to fall between 0 and 1 or between -1 and 1. These methods preserve the relationships among the original input values, unlike mean and standard deviation based methods, whose scaling parameters change as the data change. However, the biggest issue with min-max methods is their vulnerability to extreme values and outliers.
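A minimal sketch of both methods, assuming scikit-learn, is given below; the toy matrix X is a placeholder in which the second feature has a much larger numeric range than the first.

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 900.0]])

    # z-score normalization: each feature ends up with zero mean and unit variance.
    X_z = StandardScaler().fit_transform(X)

    # min-max normalization: each feature is rescaled into the range [0, 1].
    X_mm = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

In either case, the scaler is typically fitted on the training split only and then applied unchanged to held-out data, so that no information leaks from the test set.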
2.4 Data Reduction
Data reduction is the process of minimizing the size
of datasets while preserving the essential information
by removing redundant and irrelevant data, or by summarizing the data into a more concise form. This definition sounds very similar to that of data cleaning.
However, data cleaning focuses more on fixing
errors, inconsistencies and inaccuracies, while data
reduction aims to decrease the size and complexity of
data. In short, data cleaning improves data
quality, whereas data reduction simplifies analysis
and storage. Dimensionality reduction, numerosity
reduction, and cardinality reduction are the three
primary categories into which basic data reduction
techniques can be divided. Examples of each
technique are PCA and t-SNE, sampling and
clustering, and encoding and discretization
respectively.
Dimensionality reduction focuses on reducing the
number of features or random variables in the data
set. The main algorithms fall into two groups: feature selection and feature extraction. Feature selection
identifies and removes irrelevant and redundant
information to obtain a subset of the features; this reduces the risk of overfitting (García et al.,
2016). Feature extraction reduces the number of
dimensions by generating a whole new set of features
by combining the original ones. One of the baseline approaches is Principal Component Analysis (PCA).
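The brief sketch below contrasts the two approaches using scikit-learn; the Iris data, the ANOVA F-score criterion, and the choices k = 2 and n_components = 2 are illustrative assumptions.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Feature selection: keep a subset of the original columns, here the two
    # features with the highest ANOVA F-score against the class label.
    X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

    # Feature extraction: PCA builds new features as linear combinations of the
    # original ones, ordered by the amount of variance they explain.
    X_pca = PCA(n_components=2).fit_transform(X)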
Sample numerosity reduction extracts a smaller, alternative representation of the original data; the methods are either parametric or non-parametric. Cardinality reduction applies
transformations to obtain a reduced representation of
the original data. Discretization is one of the most broadly used techniques in ML. This process transforms
quantitative data into qualitative data, i.e. numerical
attributes into discrete attributes. The main job of