Analysis of Blind Validation for Learning Models
Abhijeet Jadhav, Poonam M Shettar, Divya Karoshi, Aishwarya G, Akash Kulkarni
and Basawaraj Patil
School of Electronics and Communication Engineering, KLE Technological University, Hubli - 580031, India
Keywords:
Machine Learning, Deep Learning, Convolutional Neural Networks (CNN), Image Classification, Blind
Validation, MNIST, USPS, EMNIST, Model Generalization.
Abstract:
Machine learning, deep learning, and image processing have gained significant traction and are widely utilized
across various fields for many applications, contributing significantly to advancements in medicine, security,
robotics, automation, and beyond. With the expanding range of applications, it is crucial to comprehend
how models perform on unseen data and assess their reliability for deployment in real-time scenarios. In
this context, this study employs Convolutional Neural Network (CNN) models to provide insights into blind validation. CNN models have emerged as powerful tools for image classification tasks. This study employs a CNN model for digit classification using the well-known MNIST dataset. Subsequently, the model's performance
is evaluated on two additional datasets: USPS and EMNIST. The evaluation aims to understand how the
model generalizes across different datasets with varying characteristics and to assess its robustness in real-
world applications. Blind validation is conducted by training the model on the MNIST dataset and testing
it on itself and the other datasets to observe potential biases and inconsistencies in the model’s behaviour
across diverse datasets. This analysis provides valuable insights into the model’s adaptability and reliability
for deployment in practical scenarios beyond the training dataset’s domain.
1 INTRODUCTION
In machine learning, deep learning, and predic-
tive modeling, the imperative to evaluate the model
performance extends far beyond theoretical con-
structs. As models transition from development en-
vironments to real-world applications, their effective-
ness in handling unseen data becomes paramount.
This makes blind validation crucial for determining
whether learning models can be deployed in real-
world situations.
Convolutional Neural Networks (CNNs) are
among the most widely used classification techniques
in modern machine learning. Although CNNs achieve high accuracy when trained and evaluated on the same dataset, they often perform poorly when evaluated on other datasets. One of the
significant challenges in deep learning is understand-
ing and identifying potential problems within a pre-
trained CNN model for predicting image features.
Without a clear theoretical explanation, it is challeng-
ing to ascertain why a CNN model performs well
when tested on the same dataset but noticeably per-
forms worse when tested on unseen data. Typically,
CNN performance is assessed by testing its accuracy
with sample images to evaluate its effectiveness. This gap calls for robust approaches to verify the reliability of CNNs in real-world situations.
The primary goal of blind validation is to evaluate
the performance and reliability of a machine-learning
model. This validation method has been meticulously
developed to examine how the model behaves and
performs when subjected to unseen datasets, irrespec-
tive of the specific characteristics of these datasets.
Blind validation thoroughly evaluates the model’s
generalization ability outside of the training dataset
by exposing it to a wide range of novel and varied
data.
The MNIST dataset is among the most widely used datasets in machine learning for image classification tasks. It consists of grayscale images of handwritten digits (0-9) and corresponding labels.
The MNIST dataset serves as a standard dataset for
evaluating the performance of machine learning al-
gorithms, especially for CNN, due to its simplicity
and relevance to real-world applications. MNIST pro-
vides standard and easily accessible training, testing,
and validation images. This study focuses on training
a CNN model using the MNIST dataset and assess-
ing its performance on two distinct datasets (USPS
and EMNIST), progressively diverging in similarity
from MNIST. These datasets serve as unseen data for
the model, facilitating a comprehensive evaluation of
its ability to generalize beyond the familiar MNIST
dataset.
The fundamental components of the CNN model
consist of convolutional layers, pooling layers, and
dense layers (Dai, 2021). Additionally, the paper ex-
amines how the model’s accuracy fluctuates with al-
terations in CNN design, particularly in modifying the
convolutional layers. Furthermore, the study investi-
gates the impact on accuracy when adjusting various
learning parameters, elucidating how these modifica-
tions influence the overall performance of the CNN
model for an unseen dataset.
The remainder of this paper is organized as follows. Section 2 presents a survey of related work on blind validation. Section 3 describes the methodology and datasets. Results and discussions are detailed in Section 4, and Section 5 concludes the paper.
2 LITERATURE SURVEY
A study investigated a Convolutional Neural Network (CNN) model for recognizing handwritten digits. The model was trained on the well-known MNIST dataset, which consists of grayscale images of handwritten digits, and achieved an impressive validation accuracy of 98.45% on MNIST. To test its ability to handle unknown data, the researchers evaluated it on a set of random images containing handwritten and printed digits, where it achieved a reasonable accuracy of 68.57%. However, it showed limitations in recognizing digits that differed from the training data, particularly those in non-standard formats (Garg et al., 2019).
The paper delves deeper into the model’s archi-
tecture, revealing that it includes four convolutional
layers, ReLU activation, and max-pooling layers, a standard arrangement for image classification tasks. This study highlights that CNNs are highly effective at recognizing handwritten digits and can generalize to previously unseen data. However, the model's
performance deteriorates when it encounters data that
considerably differs from the training set.
Another paper explores EEG-based emotion recognition, leveraging Convolutional Neural Network (CNN) architectures to enhance subject-independent accuracy. Unlike conventional methods relying on spectral band power features, raw EEG data is utilized after windowing, pre-adjustments, and normalization, removing manual feature extraction and harnessing the CNN's capacity to uncover hidden features (Cimtay and Ekmekcioglu, 2020). A median filter further improves classification accuracy. The approach achieves mean cross-subject accuracies of 86.56% and 78.34% on the SEED dataset for two and three emotion classes, respectively. Testing the SEED-trained model on the DEAP dataset yields a mean accuracy of 58.1%.
Another study extensively evaluates CNN models in AI-assisted COVID-19 diagnostics, spotlighting
ResNet-50 as the top performer. Through itera-
tive rounds of training and testing across diverse
datasets, the study underscores the critical impor-
tance of achieving subject-independent accuracy and
the potential of enriching training datasets to bolster
model performance. Leveraging heatmaps and ac-
tivation features provides deeper insights into CNN
model learning dynamics, guiding future advance-
ments in COVID-19 and pneumonia detection diag-
nostic systems. During the initial evaluation round,
CNN models exhibited high accuracy rates of 95.2% to 99.2% on the Level 1 testing dataset, sourced from the same clinic but designated solely for testing. However, model performance declined significantly on the Level 3 dataset, characterized by outlier images, reducing mean sensitivity from 99% to 36%. These findings emphasize the challenges posed by outlier data and the need for strategies to mitigate their impact on diagnostic model performance (Talaat et al., 2023).
Finally, a method has been proposed for detecting biases in image attribute estimations learned by convolutional neural networks (CNNs) (Zhang et al., 2018). Even with high overall accuracy, such biases might lead to erroneous findings. The method examines the CNN's internal representation of attributes to detect probable blind spots (missing associations) and failure modes (incorrect relationships) induced by biases in the training data. It does not require additional labeled data and provides a more thorough analysis than standard approaches. Experiments show that the strategy successfully detects biases and outperforms other ways of identifying problems with the CNN's learned representations.
3 METHODOLOGY
Three datasets are chosen for experimentation,
consisting of grayscale images depicting handwritten
digits and characters, with dimensions of 28x28x1.
Only images containing digits 0-9 have been selected
for experimentation.
MNIST Dataset: This dataset is a collection of handwritten digits (0-9). It contains 60,000 training images and 10,000 test images. The images are grayscale and have a resolution of 28x28 pixels. MNIST is often used as a benchmark to evaluate the performance of different machine learning models, particularly in deep learning and neural networks (LeCun et al., 2010).
Figure 1: Samples from MNIST dataset
USPS Dataset: The United States Postal Service
dataset is similar to the MNIST dataset but con-
sists of handwritten digits from the United States
Postal Service. The USPS dataset may exhibit
more variability in writing styles and quality com-
pared to the MNIST dataset. This is because the
images are scanned from real-world postal mail,
which can contain a wide range of handwriting
styles, variations in stroke thickness, and other
factors not present in carefully collected datasets
like MNIST (Bagul).
Figure 2: Samples from USPS dataset
EMNIST Dataset: An extension of the MNIST dataset, providing a broader and more varied collection of handwritten samples. The EMNIST dataset contains handwritten characters from the English alphabet (uppercase and lowercase) and digits (0-9). It consists of 814,255 characters in total, of which 81,406 are designated as test images. Each image is grayscale with a resolution of 28x28 pixels (Cohen et al., 2017).
Figure 3: Samples from EMNIST dataset
The naming convention utilized for the datasets in our study is illustrated in Figure 4. This study employs Dataset A for training purposes, while all three datasets are used to validate the trained model. Dataset B is correlated with and resembles Dataset A, whereas Dataset C is uncorrelated and notably different from Dataset A.
Figure 4: Naming convention used for datasets
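For illustration, a minimal sketch of the data preparation assumed in this setup is given below. MNIST is loaded through Keras, while the USPS and EMNIST digit images are assumed to be available already as NumPy arrays (for example, from the sources cited above); the array names and the resizing step are illustrative assumptions rather than the original experimental code.

import numpy as np
import tensorflow as tf

# Dataset A: MNIST, loaded directly through Keras.
(x_a_train, y_a_train), (x_a_test, y_a_test) = tf.keras.datasets.mnist.load_data()

def prepare(images):
    # Normalize pixel values to [0, 1] and add a channel axis -> (N, 28, 28, 1).
    images = images.astype("float32") / 255.0
    return images[..., np.newaxis]

x_a_train, x_a_test = prepare(x_a_train), prepare(x_a_test)

# Dataset B: USPS digits (assumed already loaded from the cited source).
# USPS images are natively 16x16, so they are resized to 28x28 before evaluation.
# usps_images, usps_labels = ...                      # hypothetical arrays
# x_b = tf.image.resize(prepare(usps_images), (28, 28)).numpy()

# Dataset C: EMNIST (assumed already loaded); only digit labels 0-9 are retained.
# emnist_images, emnist_labels = ...                  # hypothetical arrays
# digit_mask = emnist_labels < 10
# x_c, y_c = prepare(emnist_images[digit_mask]), emnist_labels[digit_mask]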
Convolutional Neural Networks (CNNs) are
highly effective for multi-class classification tasks,
particularly on image datasets like MNIST. CNNs
have many adjustable parameters that can influence
the model's performance, such as the number of layers, the number and dimensions of filters, and the learning rate.
CNN models are crafted by adjusting various pa-
rameters, then trained on Dataset A and validated on
all three datasets. This methodology seeks to un-
derstand how these models perform when confronted
with different combinations of datasets for the same
task. This aids in gaining insight into the model’s be-
haviour when encountering datasets with varying de-
grees of similarity.
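A minimal sketch of this blind-validation procedure is given below, assuming the preprocessed arrays from the earlier sketch and a compiled Keras model such as the one outlined in the following paragraphs; all names are illustrative.

# Train only on Dataset A (MNIST), then evaluate the same model on all three datasets.
model.fit(x_a_train, y_a_train, epochs=10, validation_split=0.1)

eval_sets = {
    "Dataset A (MNIST)": (x_a_test, y_a_test),
    "Dataset B (USPS)": (x_b, usps_labels),      # preprocessed as in the earlier sketch
    "Dataset C (EMNIST digits)": (x_c, y_c),
}
for name, (images, labels) in eval_sets.items():
    loss, acc = model.evaluate(images, labels, verbose=0)
    print(f"{name}: validation accuracy = {acc * 100:.2f}%")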
The initial model designed for observation consists of two convolutional layers, a max-pooling layer, two additional convolutional layers, and another max-pooling layer, concluding with a fully connected layer, as depicted in Figure 5. Additionally, the model includes the following parameters:
Learning rate set to 0.001.
Two dense layers with 64 units followed by 10 units.
32 filters in each convolution layer with a dimension of 3x3.
ReLU activation function.
Training spanned 10 epochs.
A minimal code sketch of this configuration is given after Figure 5.
Figure 5: Model designed for experimentation
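The following is a minimal Keras sketch of this baseline configuration, reconstructed from the description above rather than taken from the original implementation; the padding scheme, Adam optimizer, and loss function are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

baseline = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
baseline.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
# The model is then trained for 10 epochs on Dataset A, as described above.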
A range of CNN models was developed and examined by modifying these parameters to ascertain how such modifications affect the models' performance.
3.0.1 Number of Dense Layers and Convolution Layers
The number of units in the dense layer is varied over 32, 64, 128, and 256, with each dense-layer size paired with either one or two convolution layers at a fixed learning rate of 0.001. Each convolution layer comprises 32 filters with dimensions of 3x3 each. This results in various combinations of dense and convolution layers; a sketch of this sweep is given below.
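A hedged sketch of this sweep follows; build_model is a hypothetical helper reconstructed from the description above, not the authors' code.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(conv_layers, dense_units, filters=32, kernel=3, lr=0.001):
    # Stack `conv_layers` convolution layers, one pooling layer, a dense layer
    # of `dense_units` units, and a 10-way softmax output.
    model = keras.Sequential([layers.Input(shape=(28, 28, 1))])
    for _ in range(conv_layers):
        model.add(layers.Conv2D(filters, (kernel, kernel),
                                activation="relu", padding="same"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation="relu"))
    model.add(layers.Dense(10, activation="softmax"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Sweep the dense-layer width against one or two convolution layers.
for dense_units in (32, 64, 128, 256):
    for conv_layers in (1, 2):
        model = build_model(conv_layers, dense_units)
        # model.fit(...) on Dataset A, then evaluate on Datasets A, B, and C.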
3.0.2 Number of Convolution Layers
The experimentation involves varying only the con-
volution layers from 1 to 10 while keeping the dense
layers constant at 64. Each convolution layer com-
prises 32 filters with dimensions of 3x3 each. This
aims to observe how the model’s behaviour evolves
with increasing convolution layers.
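With the hypothetical build_model helper from the previous sketch, this variation reduces to a single loop (illustrative only).

# Vary only the number of convolution layers (1-10) with 64 dense units fixed.
results = {}
for n in range(1, 11):
    model = build_model(conv_layers=n, dense_units=64)
    model.fit(x_a_train, y_a_train, epochs=10, verbose=0)
    _, acc_b = model.evaluate(x_b, usps_labels, verbose=0)  # blind validation on Dataset B
    results[n] = acc_b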
3.0.3 Learning Rate
The learning rate is a critical factor in model training.
It is varied from 0.05 down to 0.001 to determine its impact on blind validation accuracy, and the most suitable learning rate is selected based on the results obtained.
3.0.4 Number of Filters and their Dimension
The filter dimensions are varied across 3x3, 5x5, and
7x7, each utilizing 16, 32, and 64 filters, respectively.
This is carried out at a constant learning rate and with a fixed dense-layer configuration; a combined sketch of the learning-rate and filter sweeps is given below.
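Both of these variations can be expressed with the same hypothetical build_model helper; the intermediate learning-rate values and the pairing of kernel sizes with filter counts follow the description above and are otherwise assumptions.

# Learning-rate sweep (Section 3.0.3); intermediate values are illustrative.
for lr in (0.05, 0.01, 0.005, 0.001):
    model = build_model(conv_layers=2, dense_units=64, lr=lr)
    # train on Dataset A, then evaluate on Datasets A, B, and C as before

# Filter dimension / count sweep (Section 3.0.4): 3x3/16, 5x5/32, 7x7/64.
for kernel, filters in ((3, 16), (5, 32), (7, 64)):
    model = build_model(conv_layers=2, dense_units=64,
                        filters=filters, kernel=kernel)
    # train on Dataset A, then evaluate on Datasets A, B, and C as before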
Across all these variations, the model's performance tended to deteriorate on unseen datasets, as described in the results section.
4 RESULTS AND DISCUSSIONS
The outcomes obtained from the initial model
design indicate that the accuracy was significantly
higher when Dataset A was trained and tested on itself
compared to when tested on Dataset B and Dataset C,
as depicted in Table 1.
Table 1: Cross-Dataset Performance of Model Trained on
Dataset A
Train Test Validation Accuracy (%)
Dataset A Dataset A 98.43
Dataset A Dataset B 60.38
Dataset A Dataset C 14.29
The following illustrates the outcomes of the dif-
ferent CNN models designed.
4.1 Varying Number of Dense Layers
Increasing the number of units in the dense layer (32, 64, 128, and 256) did not significantly enhance validation accuracy.
Nevertheless, as the number of convolution layers increased from one to two, there was a corresponding improvement in accuracy. Figure 6 illustrates that increasing the dense-layer width has minimal impact on accuracy, whereas increasing the number of convolution layers positively correlates with higher accuracy.
Figure 6: Performance with change in number of Dense
Layers
4.2 Varying Number of Convolution Layers
The validation accuracy shows a noticeable rise with
an increase in convolution layers, as illustrated in Fig-
ure 7. The mean accuracy rises from approximately 60% to 70%.
Figure 7: Performance with change in number of Convolu-
tion Layers
4.3 Varying Learning Rate
Figure 8 demonstrates that lower learning rates, like
0.001, consistently yield superior accuracy compared
to higher and excessively lower rates. This observa-
tion underscores the critical significance of meticu-
lously selecting an appropriate learning rate to en-
hance model performance effectively. However, despite varying the learning rate, the maximum blind validation accuracy obtained remains below 75%.
Figure 8: Performance with change in Learning rate
4.4 Varying Number of Filters and
Dimension of Filters
Figure 9 indicates that despite experimenting with
various filter combinations, there was no improve-
ment in blind validation accuracy. Furthermore, the
validation accuracy achieved by training and testing
on Dataset A was notably higher than that achieved
by training on Dataset A and testing on Dataset B.
Figure 9: Performance with different combinations of Filters
Based on the observed trends in validation accuracy, the number of convolution layers was systematically increased with a consistent learning rate of 0.001 while maintaining a dense layer of 64 units. This was in response to the observed trend of higher blind validation accuracy associated with an increased number of convolution layers, with the results presented in Table 2 and Figure 10.
Table 2 provides insights into the validation accuracy as the number of convolution layers increases. The "Layers" column indicates the number of convolution layers, "MNIST" represents the validation accuracy on Dataset A, "USPS" indicates the validation accuracy on Dataset B, "EMNIST" signifies the validation accuracy on Dataset C, and "Train Params" denotes the total trainable parameters for each respective number of convolution layers.
Figure 10 represents the convolution layers’ cor-
responding validation accuracy. The blue line shows
the training and testing done on Dataset A, while the
orange line shows the training done on Dataset A and
testing done on Dataset B. Similarly, the green line
represents the training done on Dataset A and testing
done on Dataset C.
The blind validation accuracy did not increase beyond approximately 75% when tested on Dataset B and not beyond approximately 15% on Dataset C. Conversely, the model achieved an accuracy of over 95% when trained and tested on Dataset A, indicating a decline in performance when exposed to different unseen datasets, despite adjustments to its parameters.
Table 2: Summary of experiment results
Layers MNIST USPS EMNIST Train Params
1 98.86% 56.73% 18.78% 1.61M
2 98.75% 62.03% 17.13% 1.64M
3 99.01% 67.51% 17.33% 1.68M
4 99.21% 68.46% 16.73% 1.72M
5 99.16% 66.73% 14.71% 1.76M
6 99.10% 65.63% 17.52% 1.79M
7 98.90% 67.66% 16.70% 1.83M
8 99.14% 68.92% 18.98% 1.83M
9 99.07% 69.07% 17.23% 1.90M
10 98.79% 67.89% 16.73% 1.94M
Figure 10: Performance with change in number of Convo-
lution Layers for validating across the Datasets
5 CONCLUSIONS
The research addresses the crucial challenge of gen-
eralization in CNN models, revealing that although
they achieve high accuracy on their training dataset
(98.43% on MNIST), their performance significantly
drops on unseen datasets (60.38% on USPS and
14.29% on EMNIST). This finding highlights the vi-
tal role of blind validation in evaluating how well ma-
chine learning models can adapt to new data, which
is essential for their deployment in real-world sce-
narios. The study exposes the limitations of CNNs
when faced with diverse datasets, stressing the ne-
cessity for robust validation techniques to ensure the
models are reliable and effective outside of controlled
training settings. The significance of this study lies in
illustrating the importance of thoroughly evaluating
any pre-trained model before it is deployed in real-
world applications, where adaptability and robustness
are crucial for maintaining consistent and reliable per-
formance.
REFERENCES
Bagul, D. USPS Digit Classification. https://github.com/darshanbagul/USPS_Digit_Classification.
Cimtay, Y. and Ekmekcioglu, E. (2020). Investigating
the use of pretrained convolutional neural network on
cross-subject and cross-dataset eeg emotion recogni-
tion. Sensors, 20(7):2034.
Cohen, G., Afshar, S., Tapson, J., and van Schaik, A.
(2017). Emnist: an extension of mnist to handwrit-
ten letters.
Dai, D. (2021). An introduction of cnn: models and training
on neural network models. In 2021 International Con-
ference on Big Data, Artificial Intelligence and Risk
Management (ICBAR), pages 135–138. IEEE.
Garg, A., Gupta, D., Saxena, S., and Sahadev, P. P. (2019).
Validation of random dataset using an efficient cnn
model trained on mnist handwritten dataset. In 2019
6th international conference on signal processing and
integrated networks (SPIN), pages 602–606. IEEE.
LeCun, Y., Cortes, C., and Burges, C. (2010). Mnist hand-
written digit database. ATT Labs [Online]. Available:
http://yann.lecun.com/exdb/mnist, 2.
Talaat, M., Si, X., and Xi, J. (2023). Multi-level training
and testing of cnn models in diagnosing multi-center
covid-19 and pneumonia x-ray images. Applied Sci-
ences, 13(18):10270.
Zhang, Q., Wang, W., and Zhu, S.-C. (2018). Examining
cnn representations with respect to dataset bias. In
Proceedings of the AAAI conference on artificial in-
telligence, volume 32.