A Brief Introduction to Data Preprocessing
Huiyi Wu
Warwick Mathematics Institute, University of Warwick, Coventry, CV4 7AL, U.K.
https://orcid.org/0009-0007-0530-4643
Keywords: Data Preprocessing, Big Data, Data Cleaning, Machine Learning.
Abstract: The contemporary digital era is characterized by an exponential growth in data volume. However, this
proliferation is paradoxically met with a decline in raw data quality, with most datasets suffering from
inconsistencies, missing values, and noise. This paper argues that data preprocessing is the indispensable
discipline that bridges the gap between low-quality raw data and reliable, high-performance machine learning
models. Specifically, this essay will explore the multifaceted process of data preprocessing, examining its
four primary stages: data cleaning, which removes inaccuracies; data integration, which harmonizes disparate
data sources; data transformation, which normalizes and structures data; and data reduction, which increases
storage and computational efficiency. Special emphasis will be placed on the unique challenges of preparing
unstructured, non-numerical data for analysis, detailing the conversion techniques required to make them
compatible with quantitative models. Ultimately, this essay demonstrates that a thorough understanding of
data preprocessing is the foundational bedrock upon which all effective, data-driven insights are built.
1 INTRODUCTION
Machine learning traces its roots back to the dawn of
artificial intelligence in the 1950s. Simultaneously,
the field of data science was evolving from traditional
statistics, focusing on learning from data instead of
classical mathematical statistics. Exploratory Data
Analysis (EDA), Data Preprocessing (DP), and
Feature Engineering (FE) are distinct but related
stages in the data science workflow. The primary
difference lies in their objective and outcome. EDA is
the initial investigation phase, corresponding to "Data
Understanding." Its purpose is to analyze and
summarize a dataset to uncover patterns, spot
anomalies, and form hypotheses, typically through
statistical summaries and visualizations, without
actually changing the source data. Following this, DP
and FE fall under "Data Transformation," where the
data is actively modified. Data Preprocessing
emphasises processing and cleaning the raw data so
that it may be used with a model, while Feature
Engineering is the intellectual process of using the
data that already exists to create new, more insightful
features to boost a model's predictive power.
The history of data preprocessing evolved from
manual clerical work to a sophisticated, automated
science. Early forms consisted of manual "data
editing", where statisticians and census takers would
visually inspect tables for errors. A foundational shift
towards automation occurred in 1976 with the work of Fellegi and Holt (Fellegi & Holt, 1976), who introduced a mathematical theory for automatic
editing and imputation, providing the first systematic
approach for computer-based data cleaning. This
formalisation was adopted and expanded in the 1990s
with the rise of data warehousing, where the
"Transform" stage of the Extract, Transform, Load
(ETL) process became a dedicated preprocessing step
to ensure data quality for business intelligence
(Kimball & Ross, 2013). Today, in the era of machine
learning, preprocessing is a critical discipline
encompassing techniques from feature scaling to
advanced imputation, recognised as essential for
model performance and codified in modern data
science libraries.
In essence, data preprocessing is the foundational
cleanup that ensures data quality and reliability,
creating a solid base for both feature engineering and
the final modeling phase.
2 DATA PREPROCESSING
In classification with neural networks, for example, preprocessing can eliminate irrelevant data, which speeds up learning on a smaller data set and reduces the confusion such data causes. Overall, data preprocessing yields cleaner, more presentable data. There is some overlap between data preprocessing and feature engineering, for example in handling duplicates and missing values. However, data preprocessing deals with raw and imperfect data so that the data set becomes understandable and useful, whereas feature engineering is an iterative process that creates optimal features. The key techniques in data preprocessing are data integration, data cleansing, data normalization, data reduction, and data transformation.
2.1 Data Integration
Data integration refers to the process of combining
data from multiple sources into a single, unified
database. This process involves identifying and
accessing different data sources, converting the data
to a common format, and reconciling any
inconsistencies between sources. As a result, integrating data helps users avoid issues that would otherwise limit the precision and speed of machine learning models.
The traditional method of achieving this integration is ETL (Extract, Transform, Load). A semantic ETL framework can bring together a series of tools, including uniform resource identifiers (URIs), the Resource Description Framework (RDF), and the Protégé ontology editor (Bansal, 2014). ETL describes the process of extracting data from external sources, transforming it to fit operational requirements, and loading it into the target database. Looking at the three stages in turn: extraction involves selecting data from appropriate sources, where data is usually available in flat-file formats such as CSV or XLS; transformation covers data normalization, filtering, sorting, and grouping, topics this review returns to later; and loading is the propagation of data into a data mart or data warehouse, which stores high-quality, accessible data. A data warehouse offers a multidimensional view of databases that simplifies data summarisation and analysis. Taniar and Rahayu discuss data warehousing in support of modern data-driven decision-making (Taniar & Rahayu, 2022).
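To make the ETL flow concrete, the brief Python sketch below (an illustration only, not the semantic framework described by Bansal) extracts records from a hypothetical flat file, applies simple filtering and aggregation, and loads the result into a local SQLite table; the file, table, and column names are assumed for the example.

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a flat file (hypothetical path and columns).
raw = pd.read_csv("sales_raw.csv")

# Transform: drop duplicates, filter invalid rows, and aggregate by region.
clean = raw.drop_duplicates()
clean = clean[clean["amount"] > 0]                      # simple filtering
summary = clean.groupby("region", as_index=False)["amount"].sum()

# Load: propagate the transformed data into a target database table.
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("sales_by_region", conn, if_exists="replace", index=False)
```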
Application programming interface (API) integration is the process of connecting multiple applications or systems so that they can exchange data and perform actions (Dominte, 2023). APIs come in many types and are used in areas ranging from mobile apps to the operating system of a laptop or even a fridge, which is why so many varieties exist, such as push/stream APIs, native APIs, SDK-based APIs, and REST (representational state transfer). The REST API is one alternative solution for data exchange and therefore for data integration. It follows three main design precepts: addressability, a uniform interface, and statelessness. A REST API typically transmits and receives data in JSON (JavaScript Object Notation) format (Alimuddin et al., 2020).
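As a hedged illustration of REST-based data integration, the snippet below issues a single stateless GET request to a hypothetical endpoint and loads the JSON payload into a tabular structure; the URL and field names are invented for the example.

```python
import pandas as pd
import requests

# Hypothetical REST endpoint returning JSON records (illustrative URL).
url = "https://api.example.com/v1/customers"

response = requests.get(url, timeout=10)   # stateless request, no session kept
response.raise_for_status()                # fail loudly on HTTP errors

records = response.json()                  # JSON (JavaScript Object Notation) payload
df = pd.DataFrame(records)                 # integrate into a tabular structure
print(df.head())
```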
2.2 Data Cleansing
Broadly, most real-life data sets are dirty and imperfect, containing missing data, wrong data, and non-standard representations of the same data. Such dirty data harms model performance, producing biased, inaccurate, or unreliable predictions; more subtly, it can cause overfitting or underfitting, which is harder to spot. Dirty data typically manifests as either noisy data or missing values (MVs).
Missing values can usually be handled in several ways:
Discard the examples with MVs. There are two ways to achieve this: listwise deletion, which removes every row containing any missing value, and pairwise deletion, which omits MVs only from the specific analysis that needs the affected variables while keeping the rest of the row for other analyses. However, deletion reduces the size of the data set, so it is only advisable when there are few MVs. The more popular way to solve the problem is the second method below; both approaches are illustrated in the short sketch that follows it.
Imputation of MVs. One of the main advantages
of this technique is that it doesn't depend on the
learning algorithm being utilised. Therefore, users
can choose the most appropriate methods to deal with
MVs regardless of what model is chosen for ML.
There are numerous imputation techniques available,
ranging from fundamental ones like k-nearest
neighbours (KNN) and mean substitution, to more
advanced approaches that explore inter-attribute
relationships. These include methods based on
support vector machines (SVM), clustering
algorithms, logistic regression, maximum likelihood
estimation, and multiple imputation strategies. These
approaches are commonly used for quantitative data
since they are easily adaptable to other knowledge
areas (García, Luengo, & Herrera, 2015).
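For illustration only, the sketch below shows listwise deletion with pandas alongside two of the imputation techniques mentioned above, mean substitution and k-nearest neighbours, on a small invented table; it is a minimal example rather than a recommendation of any particular method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 40.0],
                   "income": [50.0, 42.0, np.nan, 58.0]})

# Listwise deletion: drop every row that contains any missing value.
listwise = df.dropna()

# Mean substitution: replace each missing value with its column mean.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)

# KNN imputation: estimate missing values from the k most similar rows.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)
```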
Noise in data describes random error in the raw data. There are two primary categories of noise: attribute noise and class noise. The former refers to distortions in the values of attributes, including insufficient, missing, or unknown attribute values, whereas the latter refers to data that is incorrectly labelled, which can be introduced by contradictory labelling or misclassification. Model performance is closely tied to the quality of the dataset and to the resilience of the model itself to noise. Hence, one of the main tasks in data cleaning is identifying noise in a data set. Numerous strategies have been investigated for handling noisy data; the most important are noise filters, data polishing methods, and robust learners. Robust learners are designed to be less affected by noisy data, for instance by using pruning techniques to prevent overfitting; however, even robust learners may perform poorly if the noise level is excessive. Data polishing methods aim to correct or eliminate noisy examples prior to training, but they are most effective for small datasets because of their time-consuming nature; according to research, performance can be enhanced by correcting noise in the training data while leaving the test data noisy. Noise filters, finally, detect and remove noisy instances from the training data and are particularly helpful for learners that are sensitive to noise.
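As one simple, hedged illustration of the noise-filter idea (not one of the specific filters from the literature), an unsupervised outlier detector such as Isolation Forest can flag suspicious training instances for removal before training; the data below are synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_train[:5] += 8.0   # inject a few grossly distorted rows (attribute noise)

# Flag likely-noisy instances; keep only those labelled as inliers (+1).
flags = IsolationForest(contamination=0.05, random_state=0).fit_predict(X_train)
X_filtered = X_train[flags == 1]
```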
2.3 Data Normalization
Data normalization is a preprocessing step in which the data is scaled or transformed so that each feature contributes equally. This improves data quality and, in turn, the performance of the machine learning algorithm. The normalization step transforms features into a common range so that features with larger numeric values cannot dominate smaller ones (Singh & Singh, 2020). However, be aware
that this process does not imply equal importance of
the features. Some features are tightly correlated to
others, whereas others are superfluous and
completely irrelevant. This is a problem to be solved
in data reduction.
As mentioned, there are two main tasks in data normalization: re-scaling dominant features and handling outliers. Mean and standard deviation normalization and min-max normalization are the two most broadly used methods, since they are simple and effective in most cases; the former performs particularly well when outliers are present. Other options include decimal scaling normalization, median and median absolute deviation normalization, and tanh-based normalization.
In mean and standard deviation normalization, the statistical mean and standard deviation are used to rescale the data. For example, z-score
normalization is where the raw data is rescaled such
that zero mean and unit variance characterise the
resulting features. These techniques aid in minimising
the impact of data outliers. Moreover, for Min-Max
value based normalization, typically, the data is re-
scaled to fall between 0 and 1 or -1 and 1. These
methods preserve the relationships among the
original input data, unlike mean and standard
deviation methods that could change over time.
However, the biggest issue with these methods is their
vulnerability to extreme values and outliers.
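The short scikit-learn sketch below applies the two families just described, z-score (mean and standard deviation) scaling and min-max scaling to [0, 1], to a small illustrative matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]])

# Z-score normalization: zero mean and unit variance per feature.
X_z = StandardScaler().fit_transform(X)

# Min-max normalization: rescale each feature into the range [0, 1].
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
```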
2.4 Data Reduction
Data reduction is the process of minimizing the size
of datasets while preserving the essential information
by reducing the redundant and irrelevant data, or by
summarizing the data into more concise form. The
definition sounds very similar to data cleaning.
However, data cleaning focuses more on fixing
errors, inconsistencies and inaccuracies, while data
reduction aims to decrease the size and complexity of
data. In short, data cleaning improves data
quality, whereas data reduction simplifies analysis
and storage. Dimensionality reduction, numerosity
reduction, and cardinality reduction are the three
primary categories into which basic data reduction
techniques can be divided. Examples of each
technique are PCA and t-SNE, sampling and
clustering, and encoding and discretization
respectively.
Dimensionality reduction focuses on reducing the
number of features or random variables in the data
set. The main algorithms are divided into feature selection and feature extraction. Feature selection
identifies and removes irrelevant and redundant
information to obtain a subset of features. This
algorithm reduces the risk of overfitting (García et al.,
2016). Feature extraction reduces the number of
dimensions by generating a whole new set of features
by combining the original ones. One of the baseline
approaches is Principal Components Analysis (PCA).
Sample numerosity reduction extracts an alternative
smaller data representation for the original data.
There are either parametric or non-parametric
methods. Cardinality reduction applies
transformations to obtain a reduced representation of
the original data. Discretization is one of the broadly
used techniques in ML. This process transforms
quantitative data into qualitative data, i.e. numerical
attributes into discrete attributes. Discretization, and data reduction more generally, is necessary because the majority of machine learning algorithms in use today are designed to learn from categorical data, such as nominal attributes, whereas real-world applications typically involve continuous features; this transition is carried out within data reduction.
overlap between data reduction and data
transformation since the idea of constructing a new
set of components is similar.
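To make the two flavours of reduction concrete, the sketch below applies PCA (feature extraction) and equal-width discretization to synthetic data; the numbers of components and bins are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))

# Dimensionality reduction: project 10 features onto 3 principal components.
X_pca = PCA(n_components=3).fit_transform(X)

# Cardinality reduction: discretize continuous values into 5 ordinal bins.
X_binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                            strategy="uniform").fit_transform(X)
```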
2.5 Data Transformation
Data transformation is a critical process in data
management and analytics, involving the conversion
of data from one format or structure into another to
enhance its usability and compatibility. Therefore, it is no surprise that data transformation is intertwined with the other processes mentioned earlier. This
process is essential for ensuring data quality,
consistency, and readiness for analysis, particularly in
environments where raw data is often unstructured or
inconsistent. According to Kimball and Ross, data
transformation is a fundamental step in data
warehousing, where extracted data undergoes
cleaning, normalization, and aggregation before
being loaded into a target system (Kimball & Ross,
2013). Modern businesses rely on data transformation
to integrate disparate data sources, enabling
comprehensive insights through business intelligence
(BI) tools and machine learning models. Common
transformation techniques include data cleansing
(removing duplicates or errors), standardization
(converting data into a uniform format), enrichment
(augmenting data with additional information), and
aggregation (summarizing data for reporting). With
the rise of big data, technologies such as Apache
Spark, SQL-based ETL (Extract, Transform, Load)
tools, and cloud-based data pipelines have become
instrumental in automating and scaling
transformation processes. Effective data
transformation not only improves analytical accuracy
but also supports regulatory compliance by ensuring
data adheres to organizational and legal standards. As
organizations increasingly adopt data-driven
decision-making, the role of data transformation
continues to grow, making it a cornerstone of
successful data strategies (Inmon, 2005).
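As a small, hedged illustration of the cleansing, standardization, and aggregation steps listed above, the pandas sketch below transforms an invented transactions table; the column names and values are assumptions.

```python
import pandas as pd

tx = pd.DataFrame({
    "customer": ["Ann", "ann", "Bob", "Bob"],
    "country":  ["uk", "UK", "US", "US"],
    "amount":   [10.0, 12.0, 25.0, 25.0],
})

tx = tx.drop_duplicates()                    # cleansing: remove exact duplicate rows
tx["customer"] = tx["customer"].str.title()  # standardization: uniform name format
tx["country"] = tx["country"].str.upper()    # standardization: uniform country codes
report = tx.groupby("country", as_index=False)["amount"].sum()  # aggregation for reporting
```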
3 NON-NUMERICAL DATA
PREPROCESSING
3.1 Image
Numerical data is typically structured in formats such
as CSV and Excel, where each row indicates an
observation and each column a feature. Image data, by contrast, are normally defined in two dimensions as grids of pixels.
Unlike structured numerical data, images require
specialized techniques due to their high
dimensionality, spatial relationships, and pixel-level
variations. The first step typically involves resizing
and cropping to standardize dimensions, ensuring
uniformity across the dataset, which is crucial for
neural networks that require fixed input sizes. Color
space conversion is another common preprocessing
step; for instance, converting RGB images to
grayscale reduces the computational load when color
information is unnecessary, while transformations to
HSV or LAB spaces can improve feature extraction
in certain tasks. Normalization is essential to scale
pixel values (typically 0-255) to a range of [0,1] or [-
1,1] to facilitate faster convergence during training.
Noise reduction techniques, such as Gaussian
blurring or median filtering, help mitigate artifacts
caused by sensor imperfections or compression,
enhancing image clarity.
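A minimal Pillow/NumPy sketch of the resizing, grayscale conversion, and pixel-value normalization steps described above; the input file name is hypothetical.

```python
import numpy as np
from PIL import Image

img = Image.open("sample.jpg")      # hypothetical input image

img = img.resize((224, 224))        # standardize dimensions for a fixed-size network
gray = img.convert("L")             # RGB to grayscale when colour is unnecessary

pixels = np.asarray(gray, dtype=np.float32)
pixels = pixels / 255.0             # scale 0-255 values into [0, 1]
```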
Data augmentation artificially expands the
training dataset by applying transformations like
rotation, flipping, zooming, and brightness
adjustments. This technique helps prevent overfitting
and improves model generalization, especially in
scenarios with limited labeled data. For deep learning
applications, contrast enhancement methods like
histogram equalization or adaptive thresholding can
improve feature visibility, particularly in medical
imaging or low-light conditions. Additionally, edge
detection, e.g., using Sobel or Canny filters may be
employed to highlight structural features before
feeding images into traditional machine learning
models. In cases where images contain irrelevant
backgrounds, segmentation or masking techniques
isolate regions of interest, reducing noise and
focusing the model on relevant features.
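For deep learning pipelines, a library such as torchvision can express the augmentations mentioned above declaratively; the composition below is one possible sketch, with parameter values chosen arbitrarily for illustration.

```python
from torchvision import transforms

# Random augmentations applied on the fly to each training image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.RandomRotation(degrees=15),                # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # zooming via random crops
    transforms.ColorJitter(brightness=0.2),               # brightness adjustment
    transforms.ToTensor(),                                # convert to a [0, 1] tensor
])
```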
The particular task determines which
preprocessing procedures are used, for instance,
facial recognition systems may prioritize alignment
and illumination correction, while autonomous
vehicles rely on robust noise reduction and edge
detection. Proper preprocessing not only optimizes
model accuracy but also reduces training time and
resource consumption, making it indispensable in
modern computer vision workflows.
3.2 Text
Text preprocessing is a required step in natural language processing (NLP) that transforms raw textual data
into a structured format suitable for machine learning
models. Unlike numerical or image data, text requires
specialized techniques to handle linguistic variability,
noise, and high dimensionality. The first step
typically involves noise removal, where irrelevant
characters (e.g., HTML tags, punctuation, or special
symbols) are eliminated to retain only meaningful
content. Tokenisation then allows for more detailed
analysis by breaking the text down into smaller
components like words or subwords; this can be
language-specific, as in the case of compound words
in German or morphologically rich languages like
Arabic. Lowercasing is often applied to ensure
uniformity, though it may discard meaningful
capitalization (e.g., "Apple" as a company vs. the
fruit). For languages with inflectional morphology,
lemmatization (reducing words to their base forms
using dictionaries) or stemming helps normalize
variations like "running" to "run". Stop word removal
filters out high-frequency but low-information words
(e.g., "the," "and") to reduce noise, though this may
harm tasks like sentiment analysis where context
matters. Vectorization converts text into numerical
representations, with traditional methods like bag-of-
words (BoW) or TF-IDF (term frequency-inverse
document frequency) capturing word importance.
Modern approaches leverage word embeddings (e.g.,
Word2Vec, GloVe) or contextual embeddings (e.g.,
BERT, ELMo) to encode semantic relationships. For
low-resource languages, transliteration or machine
translation may bridge preprocessing gaps.
Dimensionality reduction (e.g., PCA, LSA) can
further refine feature spaces, especially for high-
dimensional BoW models.
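As a compact example of the vectorization step, scikit-learn's TfidfVectorizer combines lowercasing, tokenization, stop-word removal, and TF-IDF weighting; the two-sentence corpus below is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown dog outpaces a lazy fox.",
]

# Lowercase, tokenize, drop English stop words, then weight terms by TF-IDF.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)         # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```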
4 FEATURE ENGINEERING AND
RESTRICTIONS IN THIS AREA
Preprocessing, while essential for handling noise and
inconsistencies, introduces irreversible
transformations—normalizing or scaling data, for
instance, may obscure interpretability, particularly in
linear models where coefficients rely on original
units. Techniques like imputation for missing values
or outlier removal can inadvertently introduce bias, as
assumptions (e.g., mean imputation) may not reflect
true data distributions. For text and image data,
preprocessing choices (e.g., stop-word removal or
resizing) may discard semantically or structurally
meaningful information, limiting model adaptability.
Computational constraints also pose restrictions;
complex FE pipelines or high-dimensional
transformations (e.g., kernel methods) may be
infeasible for large-scale datasets without significant
resources. Ultimately, practitioners must balance
creativity in FE with caution in preprocessing to
avoid compromising model robustness, fairness, or
deployability.
5 CONCLUSIONS
The practice of data preprocessing is as old as data
collection, although its formalisation as a distinct
field is a more recent development. Its history is not
one of a single invention but an evolution shaped by
the ever-growing scale and complexity of data, from
hand-checked paper ledgers to the automated
pipelines of machine learning. In conclusion, data
preprocessing is a fundamentally critical and non-
negotiable stage in any machine learning pipeline. Its
core purpose remains unchanged across all data types: to clean, transform, and structure raw, chaotic data into a high-quality, efficient format that machine learning algorithms can interpret. For images, preprocessing goes beyond simple cleaning to spatial transformations such as resizing and cropping.
For text, preprocessing focuses on deconstructing language into numerical representations through steps such as tokenization, stop-word removal, and vectorization. Preprocessing is the crucial preparatory work
that directly impacts models' accuracy, efficiency and
reliability. Data preprocessing is the bedrock upon
which effective data-driven insights are built.
REFERENCES
Alimuddin, A. A., Niswar, M., Indrabayu, & Rijal, M.
(2020). Design and implementation of REST API for
academic information system. IOP Conference Series:
Materials Science and Engineering, 875(1), Article
012047.
Bansal, S. K. (2014, June). Towards a semantic extract-
transform-load (ETL) framework for big data
integration. In 2014 IEEE International Congress on
Big Data (pp. 522–529). IEEE.
Fellegi, I. P., & Holt, D. (1976). A systematic approach to
automatic edit and imputation. Journal of the American
Statistical Association, 71(353), 17–35.
García, S., Luengo, J., & Herrera, F. (2015). Data
preprocessing in data mining (Vol. 72, pp. 59–139).
Cham, Switzerland: Springer International Publishing.
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M.,
& Herrera, F. (2016). Big data preprocessing: methods
and prospects. Big Data Analytics, 1, 1–22.
Han, J., Kamber, M., & Pei, J. (2012). Data mining:
Concepts and techniques. Waltham: Morgan Kaufmann
Publishers.
Inmon, W. H. (2005). Building the data warehouse. John
Wiley & Sons, Incorporated.
Kimball, R., & Ross, M. (2013). The data warehouse toolkit:
The definitive guide to dimensional modeling. John
Wiley & Sons.
Singh, D., & Singh, B. (2020). Investigating the impact of
data normalization on classification performance.
Applied Soft Computing, 97, 105524.
Taniar, D., & Rahayu, W. (2022). Data warehousing and
analytics: Fueling the data engine. Poland: Springer
International Publishing.
Dominte, I. (2023). Web API development for the absolute beginner.