which introduced the idea of residual blocks, an idea later extended by more recent architectures such as WideResNet and ResNeXt.
Despite many years of research on CNNs, designing and optimizing these models to achieve state-of-the-art accuracy and computational efficiency remains an ongoing challenge.
This paper explores the application of classic convolutional neural network architectures to build an efficient model for classifying images in the CIFAR-10 dataset. The proposed approach involves an in-depth analysis of various neural network configurations, data augmentation (Taylor & Nitschke, 2017), and optimizations such as dropout (Cai et al., 2019), early stopping (Prechelt, 2002), and the Adam optimizer (Kingma & Ba, 2014), with the ultimate goal of presenting a high-performing model. The paper discusses the preparation and augmentation of the CIFAR-10 data, the architecture of the neural network, and the optimizations applied to the model. It then analyses the model's loss and accuracy after building and training it with the Keras library from TensorFlow, and discusses improvements that could be made.
2 METHOD
2.1 Dataset and data augmentation
The model is trained on CIFAR-10, a dataset consisting of 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6,000 labeled 32 by 32 pixel color images per class. Each image contains one main object and belongs to exactly one class, meaning the classes are mutually exclusive.
Data augmentation is performed on the images from the CIFAR-10 dataset to create a more diverse range of data. Data augmentation is the process of generating new data from existing data through transformations, increasing the variety of the final training set.
In the model, data augmentation was performed using geometric transformations: horizontal flipping, rotation of the image by up to 15 degrees to either side, resizing the image by zooming in and out by up to 10 percent, and shifting images horizontally and vertically by up to 10 percent. The model also applies color-based transformations, namely brightness adjustment, varying the brightness up and down by 10 percent, and noise injection, applying random Gaussian noise to the image.
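As a concrete illustration, the following is a minimal sketch of such a pipeline using Keras' ImageDataGenerator. The parameter values mirror the percentages stated above; the noise standard deviation is an assumed value not specified here.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def add_gaussian_noise(img):
    """Inject random Gaussian noise; the stddev of 0.05 (relative to the
    image's pixel scale) is an assumed value."""
    return img + np.random.normal(0.0, 0.05, img.shape)

datagen = ImageDataGenerator(
    horizontal_flip=True,         # random horizontal flips
    rotation_range=15,            # rotate up to 15 degrees to either side
    zoom_range=0.1,               # zoom in and out by up to 10 percent
    width_shift_range=0.1,        # shift horizontally by up to 10 percent
    height_shift_range=0.1,       # shift vertically by up to 10 percent
    brightness_range=(0.9, 1.1),  # brightness up and down by 10 percent
    preprocessing_function=add_gaussian_noise,
)

# Typical usage: generate augmented batches on the fly during training, e.g.
# model.fit(datagen.flow(x_train, y_train, batch_size=64), epochs=50)
```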
2.2 Architecture
The architecture of the convolutional neural network contains 26 layers, consisting mainly of convolutional, pooling, normalization, flatten, and dense layers, as shown in Figure 1. The architecture repeats a block of a convolutional layer with a 3 by 3 kernel followed by a normalization layer eight times, with a max pooling layer and a dropout layer inserted after every two such blocks. The model then uses a Flatten layer to turn the input into a 1-dimensional vector for the dense layers, which classify images into their respective classes.
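The following is a minimal sketch of this layout in Keras, yielding the 26 layers described above (8 convolutional, 8 batch normalization, 4 max pooling, 4 dropout, 1 flatten, 1 dense). The filter counts and dropout rate are assumptions for illustration, not values taken from the paper.

```python
from tensorflow.keras import layers, models

def build_model(num_classes=10):
    model = models.Sequential()
    model.add(layers.Input(shape=(32, 32, 3)))
    for filters in (32, 64, 128, 256):          # assumed filter progression
        for _ in range(2):                      # two conv + batch-norm blocks
            model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                    activation="relu"))
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))  # halve the spatial dimensions
        model.add(layers.Dropout(0.25))         # assumed dropout rate
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```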
The model architecture consists of eight convolutional layers, all with a kernel size of 3 by 3, to extract key information and find similarities in the data. During each convolution, a kernel traverses the input, and for each 3 by 3 patch of pixels the kernel computes a dot product (multiplying corresponding elements and summing them up); the results are written into a feature map. Each convolutional layer is followed by the ReLU activation function, which introduces non-linearity into the computation.
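As a toy illustration of this dot product, consider a single 3 by 3 patch and kernel (values are arbitrary):

```python
import numpy as np

patch = np.array([[1, 2, 0],      # one 3x3 patch of pixel values
                  [0, 1, 3],
                  [2, 1, 1]])
kernel = np.array([[1, 0, -1],    # a 3x3 edge-detecting kernel
                   [1, 0, -1],
                   [1, 0, -1]])
# Multiply corresponding elements and sum them up:
value = np.sum(patch * kernel)    # (1 + 0 + 2) - (0 + 3 + 1) = -1
```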
In order to stabilize and speed up training, a batch normalization layer normalizes the convolutional layer's output. Rather than mapping values into a fixed range, batch normalization standardizes each activation to approximately zero mean and unit variance over the batch, followed by a learnable scale and shift.
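For reference, batch normalization standardizes each activation $x_i$ using the batch mean $\mu_B$ and variance $\sigma_B^2$, then applies a learnable scale $\gamma$ and shift $\beta$ ($\epsilon$ is a small constant for numerical stability):

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$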
Pooling layers reduce the dimensions of their input by applying pooling operations such as maximum pooling or average pooling. The model uses maximum pooling with a 2 by 2 filter that slides across the input: the operation finds the maximum value in each 2 by 2 window of the image and outputs a map of these maxima.
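A small NumPy sketch of 2 by 2 maximum pooling with stride 2 (the stride is an assumption; the text does not state it):

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 4, 3, 8]])
# 2x2 maximum pooling with stride 2: keep the maximum of each window.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
# pooled == [[6, 4],
#            [7, 9]]
```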
Next is the Flatten layer, which converts the 3-dimensional input into a 1-dimensional vector so the dense layers can process it. This is done purely by reshaping, so only the shape of the data changes and no information is lost.
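For example (the shapes here are illustrative, not the model's actual dimensions):

```python
import numpy as np

feature_maps = np.zeros((4, 4, 8))   # 4x4 spatial grid with 8 channels
flat = feature_maps.reshape(-1)      # 1-D vector of length 4 * 4 * 8 = 128
```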
Dense layers are fully connected layers in which every neuron is linked to every activation from the preceding layer. The final dense layer has ten output units, one for each class. To produce the required number of outputs, a dense layer uses the dot product: it takes the input, multiplies it by a weight matrix, and adds a bias.
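A toy NumPy illustration of this computation (values are arbitrary):

```python
import numpy as np

inputs = np.array([0.5, -1.0, 2.0])      # flattened activations
weights = np.array([[0.1, 0.4, -0.2],    # one row of weights per output unit
                    [0.3, -0.1, 0.5]])
bias = np.array([0.05, -0.2])
outputs = weights @ inputs + bias        # dot product plus bias
# outputs == [-0.7, 1.05]
```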