Enhancing Eye Ball Tracking System for Enhanced Human Computer Interaction: Deep Learning-Based CNN
S. Suganya, K. Balamurugan, M. Sasipriya, R. Sowmiya,
S. Krishna Veerandhra Suthir and S. Selvamani
Department of Information Technology, K.S.R. College of Engineering, Tiruchengode – 637215, Tamil Nadu, India
Keywords: Eye‑Tracking Systems, Convolutional Neural Networks, Gaze Estimation, Human‑Computer Interaction,
Data Augmentation, Real‑Time Processing, Assistive Technologies.
Abstract: Aim: This study aims to improve eye-tracking systems by using a hybrid model that applies Convolutional
Neural Networks (CNNs) to semi-supervised anomaly detection. Materials and Methods: The model was
trained on a comprehensive multimodal dataset, including motion, eye-tracking inputs, and action context, to
extract relevant features and identify complex patterns. Group 1: The existing method uses an RNN for gaze
analysis and anomaly detection with 1,000 to 10,000 image samples. Group 2: The proposed model uses a CNN
on the same samples and learns effectively from distributed multimodal datasets. Results: The approach reduces
anomaly-detection processing time from 57.5 ms to 52 ms, achieves a 95% confidence interval with a lower bound
of 94.5%, and improves accuracy by 9.5% over baseline models. The significance value of 0.0028 highlights the
approach's efficacy in detecting gaze anomalies and optimizing system performance. Conclusion: In this work, the
CNN is observed to achieve significantly better results than the RNN.
1 INTRODUCTION
The research explores various methods to improve
eye-tracking anomaly detection and human-computer
interaction. One such method is CNN-based Hybrid,
which enhances gaze pattern recognition through
deep feature extraction and learning, achieving over
93% accuracy in multimodal datasets (Iddrisu et al.
2024). Eye-tracking systems are essential for
accessibility and gaming applications, but traditional
models struggle with real-time anomaly detection.
This author introduces CNN-GazeNet, a framework
that refines gaze tracking precision by integrating
convolutional architectures for feature extraction.
Experiments show that CNN-GazeNet, with over an 89%
true positive rate, effectively identifies gaze
anomalies and outperforms standard approaches
(Edughele et al. 2022). This author presents AdvGaze,
a framework that bypasses conventional gaze
tracking limitations by transforming input features
while maintaining detection integrity. It achieved up
to 80% success against static gaze detection models
and over 79% success against real-time gaze
prediction systems, revealing critical limitations in
existing approaches (Iddrisu et al. 2024).
GazeR34P3R is a solution that treats eye-tracking
anomaly detection as a multi-class, multi-modal
problem. The hybrid CNN-based model achieves an
85% F1-score using 5,000 gaze samples.
GazeR34P3R integrated accessibility-based risk
analysis to prioritize high-risk anomalies and was
evaluated with diverse datasets (Santhosh et al. 2021).
The author introduces an automatic solution for
multi-modal data preprocessing, real-time gaze
tracking and anomaly detection in order to enhance
HCI. Starting from a high initial false-positive rate of
45%, an adaptive noise filter improves accuracy and
robustness. Our experiments show that it achieves
good performance in gaze anomaly detection and
real-time accessibility systems.
2 RELATED WORKS
The total number of articles published on this topic
over the last five years exceeds 201 papers in IEEE
Xplore and 64 papers in Google Scholar. Current
research in eye-tracking systems has made significant
strides, particularly in the adoption of hybrid models,
multimodal frameworks, and lightweight neural
architectures. Hybrid machine learning and deep
learning approaches enhance eye-tracking
adaptability and robustness (Falch and Lohan 2024).
A dataset of 20,000 gaze samples from 150
participants demonstrated 92.4% accuracy with a 1.3-
degree error rate. Multimodal sensor fusion
frameworks combining gaze and head movement data
improve tracking in low lighting and motion
scenarios (Tian et al. 2023). A dataset of 50 hours
from 200 individuals showed a 17% stability boost
and 94.1% accuracy. Lightweight neural
architectures optimize real-time eye-tracking by
using pruning and quantization (Li et al. 2023). A
30,000-sample dataset achieved 60% model size
reduction while maintaining 90.8% accuracy. CNNs
with attention mechanisms enhance occlusion
handling, achieving 93.6% accuracy on a dataset of
25,000 images (Kaur et al. 2022). Adaptive pipelines
dynamically adjust parameters, reducing latency by
22% in VR applications (Ansari et al. 2023). Another
emerging trend is the use of semi-supervised learning
for applications like expert-novice classification.
Leveraging multimodal sensor data, such as eye-
tracking and motion data, a conditional multimodal
variational autoencoder (CMVAE) model was
developed to classify users' expertise levels
(Akamatsu et al. 2021). This method, which
only requires expert data for training, showed
significant improvements in classification accuracy
under varied conditions. From the previous findings,
this study aims to develop a novel eye-ball tracking
system using deep learning-based CNN models to
enhance human-computer interaction. By leveraging
advanced computer vision techniques and neural
network architectures, the system aims to improve
tracking accuracy, responsiveness, and adaptability,
ensuring a more seamless and intuitive user
experience while reducing computational overhead.
3 MATERIALS AND METHODS
In this study, the detection system combines Dlib
face detection methods, TensorFlow, and Faster R-
CNN to accurately and efficiently detect facial
characteristics such as eyes and faces. This study
analyzes various preprocessing techniques for facial
recognition and detection methods, using data
samples ranging from 1,000 to 10,000 images.
Group 1 uses an RNN and TensorFlow for feature
extraction followed by an SVM classifier, yielding
80% accuracy. Sliding-window search and Non-Maximum
Suppression increase reliability and reduce false
positives by 15%.
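The Non-Maximum Suppression step can be illustrated with a short NumPy sketch; the corner-format boxes and 0.5 IoU threshold below are illustrative assumptions rather than the study's exact configuration.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping overlaps above iou_threshold.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    """
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_threshold]  # discard heavy overlaps
    return keep
```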
Group 2 uses a deep learning approach with Faster
R-CNN, via TensorFlow, leveraging optimized
feature extraction and classification. At 52 ms per
frame, the system achieves 96.5% accuracy while
decreasing latency by 22%, demonstrating strong
generalization across multiple conditions.
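A minimal sketch of running a pre-trained Faster R-CNN detector through TensorFlow is shown below; the TF Hub handle and score threshold are assumptions for illustration, standing in for the detector trained in this study.

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Assumed public TF Hub handle; the study's own trained detector would be loaded instead.
detector = hub.load("https://tfhub.dev/tensorflow/faster_rcnn/resnet50_v1_640x640/1")

def detect_faces(frame_rgb, score_threshold=0.5):
    """Run Faster R-CNN on one RGB frame and return boxes above the threshold."""
    inputs = tf.convert_to_tensor(frame_rgb[np.newaxis, ...], dtype=tf.uint8)
    outputs = detector(inputs)                     # standard TF2 detection signature
    boxes = outputs["detection_boxes"][0].numpy()  # normalized [ymin, xmin, ymax, xmax]
    scores = outputs["detection_scores"][0].numpy()
    return boxes[scores >= score_threshold]
```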
SVM Classifier Decision Function (for face
detection):
𝑓𝑥 𝑤𝑇𝑥  𝑏          (1)
Where f(x) is the decision function, w is the
weight vector, x is the feature vector, and b is the bias
term.
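In code, Equation (1) reduces to a dot product plus a bias; the weights and feature vector below are illustrative values, not trained parameters.

```python
import numpy as np

def svm_decision(x, w, b):
    """Equation (1): f(x) = w^T x + b; the sign of f(x) gives the class."""
    return np.dot(w, x) + b

# Illustrative values only; real w and b come from training the SVM classifier.
w = np.array([0.4, -1.2, 0.7])
x = np.array([1.0, 0.5, 2.0])
print(svm_decision(x, w, b=0.1))  # 1.3 -> positive side of the boundary (face)
```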
Precision Formula (for Evaluation):
$\mathrm{Precision} = TP / (TP + FP)$          (2)
Where TP is the number of true positives, and FP
is the number of false positives. This measures the
accuracy of positive predictions.
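Equation (2) in code form, with illustrative counts rather than values measured in this study:

```python
def precision(tp, fp):
    """Equation (2): Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Illustrative counts, not measured detection results.
print(precision(tp=90, fp=10))  # 0.9
```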
The following actions are part of the system’s
workflow:
Figure 1: Flowchart of Eye Gaze Extraction.
To normalize lighting conditions for precise
analysis across the images, we first preprocess the
input images so that they share similar
illumination. After detecting eye-ball locations, we
extract the eye region corresponding to the detected
locations using a trained CNN model. The Kalman
filter helps us achieve smooth and reliable eye-movement
tracking by minimizing noise and predicting trajectory
changes. Subsequently, the tracked eye positions are
plotted in real time, allowing continuous observation.
Finally, to enhance the system's usability and usefulness,
we connect it to external applications, such as safety
warnings and gaze-based controls. Figure 1 shows the
flowchart of eye-gaze extraction.
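The Kalman-filter smoothing step described above can be sketched as a constant-velocity filter over 2-D eye-centre coordinates; the state layout, time step, and noise covariances below are illustrative assumptions, not the tuned values used in the study.

```python
import numpy as np

class GazeKalman:
    """Constant-velocity Kalman filter for smoothing (x, y) eye positions."""

    def __init__(self, dt=1 / 30, q=1e-2, r=1.0):
        # State: [x, y, vx, vy]; measurement: [x, y]
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)  # process noise (assumed)
        self.R = r * np.eye(2)  # measurement noise (assumed)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def update(self, z):
        """Predict one step ahead, then correct with measurement z = (x, y)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        y = np.asarray(z, float) - self.H @ self.x    # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                             # smoothed (x, y)
```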
The experimental setup uses TensorFlow and Faster
R-CNN in a Python-based environment with Dlib and
NumPy. For real-time performance, NumPy optimizes
numerical computations, matrix operations, and image
preprocessing on 1080p high-resolution footage,
running on an Intel i5 CPU with 16 GB of RAM.
The model is compared against CNN-based methods,
Dlib's HOG features, and NumPy-based optimizations
to ensure accurate detection.
Evaluation metrics consist of bounding-box
accuracy using IoU and overall performance using
precision, recall, and F1-score. Based on the metrics in
Equations (1) and (2), the system is well suited to real-time
applications, including facial recognition and eye
tracking.
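Bounding-box accuracy via IoU can be computed as in the sketch below; the corner box format is an assumption for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ~= 0.143
```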
4 STATISTICAL ANALYSIS
The facial recognition system demonstrated high
reliability with a mean accuracy of 96.5% (±1.2%)
and a 95% confidence interval of [94.5%, 98.0%].
Precision, recall, and F1-score values indicate
balanced detection and classification performance.
Processing speeds were 52.0 ms for the CNN and 57.5 ms
for the RNN, ensuring real-time capability. Accuracy varied
across conditions, with well-lit environments
reaching 96.5%, moderate lighting 94.2%, and low-
light scenarios 85%.
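As a sanity check, the mean accuracy and 95% confidence interval can be reproduced with NumPy and SciPy; the sketch below uses the CNN accuracy column of Table 1 as illustrative input, so its output will not exactly match the interval reported above.

```python
import numpy as np
from scipy import stats

# CNN accuracy per iteration (Table 1)
acc = np.array([94.0, 94.2, 94.5, 95.0, 95.2, 95.5, 95.7, 96.0, 96.2, 96.5])

mean = acc.mean()
sem = stats.sem(acc)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(acc) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}%, 95% CI = [{lo:.2f}%, {hi:.2f}%]")
```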
The comparison table presents a comprehensive
analysis of eye-tracking systems using CNN and
RNN models. Accuracy increases from 70% for the
RNN at the smallest dataset to 96.5% for the CNN at
the largest, indicating efficient learning with more
data. However, processing time also grows with
dataset size, highlighting the trade-off between
computational efficiency and accuracy.
The Group Statistics Table compares RNN and
CNN models based on dataset size, number of layers,
accuracy, and processing time. CNN has a deeper
architecture (51.1 layers vs 17.0 layers) and works
better with larger datasets. It outperforms RNN with
higher accuracy (94.17%) and lower standard
deviation, but requires longer processing times (64.35
ms vs. 39.29 ms) due to its higher computational cost.
Table 1: Comparison of RNN and CNN Performance Across Iterations.

No. of Iterations | Dataset Size (Images) | Number of Layers (RNN) | Number of Layers (CNN) | Accuracy (%) (RNN) | Accuracy (%) (CNN) | Processing Time (ms) (RNN) | Processing Time (ms) (CNN)
1  | 1000  | 2 | 10 | 70 | 94.0 | 22.0 | 35.0
2  | 2000  | 2 | 12 | 72 | 94.2 | 25.5 | 38.0
3  | 3000  | 3 | 14 | 74 | 94.5 | 29.0 | 40.5
4  | 4000  | 3 | 16 | 76 | 95.0 | 33.2 | 43.0
5  | 5000  | 3 | 18 | 78 | 95.2 | 37.5 | 45.0
6  | 6000  | 4 | 20 | 80 | 95.5 | 41.0 | 46.5
7  | 7000  | 4 | 22 | 82 | 95.7 | 45.2 | 48.0
8  | 8000  | 4 | 24 | 83 | 96.0 | 49.0 | 49.5
9  | 9000  | 5 | 26 | 84 | 96.2 | 53.0 | 51.0
10 | 10000 | 5 | 28 | 85 | 96.5 | 57.5 | 52.0
Table 2: Descriptive Statistics for RNN and CNN Models.

Variable             | Model Type | N  | Mean    | Std. Deviation | Std. Error Mean
Dataset Size         | RNN        | 10 | 5500.0  | 3027.65        | 957.427
Dataset Size         | CNN        | 10 | 29000.0 | 13157.17       | 4160.662
Number of Layers     | RNN        | 10 | 17.0    | 8.731          | 2.761
Number of Layers     | CNN        | 10 | 51.1    | 12.653         | 4.001
Accuracy (%)         | RNN        | 10 | 76.6    | 9.276          | 2.9333
Accuracy (%)         | CNN        | 10 | 94.17   | 1.9972         | 0.6316
Processing Time (ms) | RNN        | 10 | 39.29   | 11.9582        | 3.7815
Processing Time (ms) | CNN        | 10 | 64.35   | 15.3533        | 4.8552
5 RESULTS
The study evaluates the performance of RNN and
CNN models in enhancing an eye-tracking system for
human-computer interaction. Each architecture was
tested over repeated iterations, with
performance variations analyzed based on dataset
size and model depth. The corresponding changes in
accuracy and processing time were measured across
dataset sizes ranging from 1,000 to 10,000 images
and model depths from 2 to 28 layers, as shown in
Table 1.
CNN demonstrated superior performance,
achieving a maximum accuracy of 96.5%, while RNN
reached 85%. However, a trade-off between accuracy
and computational efficiency was evident in the mean
processing times (64.35 ms for the CNN vs. 39.29 ms for
the RNN), even though the CNN ran faster per frame at
the largest dataset size (52.0 ms vs. 57.5 ms). The statistical
analysis confirmed significant differences between
the models, with p-values below 0.05 indicating
CNN’s robust advantage. The mean accuracy
difference between CNN and RNN was 17.57% (p <
0.001), while the processing time difference was
25.06 ms (p = 0.001), further reinforcing CNN’s
effectiveness in high-precision eye-tracking
applications, as detailed in Table 2.
The statistical significance of these performance
differences validates CNN’s superior performance
across all key metrics, demonstrating its effectiveness
for real-time gaze tracking applications. The
statistical tests confirm that CNN significantly
outperforms RNN in dataset utilization, accuracy, and
model depth while highlighting the trade-off in
processing time. Table 3 presents the t-test results,
showing significant differences in dataset size,
number of layers, accuracy, and processing time
between the two models (p < 0.05). Thus, CNN
emerges as the optimal model for precision-
dependent human-computer interaction
scenarios. Figure 2 compares CNN and RNN
accuracy across different dataset sizes, showing that
CNN consistently outperforms RNN, reaching 96.5%
accuracy at 10,000 cases, while RNN improves from
70% to 85% as the dataset grows. Figure 3
illustrates the processing times of both models, where
the CNN takes 52 ms and the RNN 57.5 ms at 10,000 cases.
The RNN's processing time rises more steeply with
larger datasets (22.0 ms to 57.5 ms), whereas the CNN's
grows more gradually (35.0 ms to 52.0 ms), emphasizing
the trade-off between efficiency and performance. Figure 4
presents a bar chart summarizing mean accuracy and
processing time, reinforcing the CNN's higher mean
accuracy (94.17% vs. 76.6%) while showing its higher
mean processing time (64.35 ms vs. 39.29 ms).
These figures collectively
highlight CNN’s higher precision and the balance
between accuracy and computational efficiency.
The analysis confirms that the CNN outperforms the RNN
in accuracy (mean difference 17.57 percentage points) but
requires more processing time on average (mean difference
25.06 ms). The differences are statistically significant
(t-values from -4.072 to -7.015, all p < 0.05).
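For reproducibility, an independent-samples t-test of the kind reported in Table 3 can be run with SciPy; the sketch below feeds in the Table 1 accuracy columns, so the resulting statistics are illustrative and may differ slightly from the tabulated values.

```python
import numpy as np
from scipy import stats

# Accuracy columns from Table 1
rnn = np.array([70, 72, 74, 76, 78, 80, 82, 83, 84, 85], dtype=float)
cnn = np.array([94.0, 94.2, 94.5, 95.0, 95.2, 95.5, 95.7, 96.0, 96.2, 96.5])

# Equal variances assumed (Student's t) and not assumed (Welch's t), as in Table 3
t_eq, p_eq = stats.ttest_ind(rnn, cnn, equal_var=True)
t_w, p_w = stats.ttest_ind(rnn, cnn, equal_var=False)
print(f"Student: t = {t_eq:.3f}, p = {p_eq:.4f}")
print(f"Welch:   t = {t_w:.3f}, p = {p_w:.4f}")
```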
Figure 2: CNN (96.5%) and RNN (80-85%) Accuracy
Comparison.
Figure 2 shows the comparison of accuracy
percentages for the CNN and RNN.
Table 3: Independent Samples T-Test Results for RNN vs. CNN.

Variable             | Variance Assumption         | F      | Sig.  | t      | df     | Sig. (2-tailed) | Mean Difference | Std. Error Difference
Dataset Size         | Equal variances assumed     | 17.67  | 0.001 | -5.504 | 18     | 0.000           | -23500.0        | 4269.4
Dataset Size         | Equal variances not assumed |        |       | -5.504 | 9.95   | 0.000           | -23500.0        | 4269.4
Number of Layers     | Equal variances assumed     | 1.937  | 0.181 | -7.015 | 18     | 0.000           | -34.1           | 4.861
Number of Layers     | Equal variances not assumed |        |       | -7.015 | 15.986 | 0.000           | -34.1           | 4.861
Accuracy (%)         | Equal variances assumed     | 13.995 | 0.001 | -5.856 | 18     | 0.000           | -17.57          | 3.0006
Accuracy (%)         | Equal variances not assumed |        |       | -5.856 | 9.833  | 0.000           | -17.57          | 3.0006
Processing Time (ms) | Equal variances assumed     | 0.616  | 0.443 | -4.072 | 18     | 0.001           | -25.06          | 6.1541
Processing Time (ms) | Equal variances not assumed |        |       | -4.072 | 16.982 | 0.001           | -25.06          | 6.1541
Figure 3: RNN (57.5 ms) and CNN (52 ms) Processing
Time Comparison.
Figure 3 shows the comparison of processing
times (ms) for the RNN and CNN.
Figure 4: Mean of Accuracy and Processing Time.
Figure 4 shows the mean values of accuracy and
processing time (ms) for the CNN and RNN.
Figure 5: Eye-Ball Analysis and Extracting Process.
Figure 5 depicts the process of analyzing and
extracting data in the proposed system.
6 DISCUSSION
This deep learning-based eye-tracking system achieves
a detection accuracy of 92% at a speed of 30 video
frames per second. It ran well under difficult
conditions such as lighting differences,
background clutter, and occlusions. These results
reaffirm earlier research that CNN-based eye-
tracking models are more accurate and robust in
complex environments (Zuo et al., 2024). Powered by
deep learning, the proposed eye-tracking system was
able to handle real-time video feed with a frame rate
of 30 frames per second, indicating scalability of the
proposed system for applications such as animal
monitoring and agricultural surveillance. The model
was trained on a dataset of 50,000 images for 20
different animal species, achieving 92% accuracy
and an 89% adaptability rate under different
environmental conditions. The rationale behind these
findings is consistent with existing literature that
highlights the importance of diverse datasets for
enabling deep learning models in generalizing across
environments (Wang et al., 2021). One of the major
disadvantages of the system is that it is less accurate
with heavily occluded objects or unfamiliar species.
Its accuracy decreases to 73% for half-covered
objects and 70% for species not seen during training.
Similar challenges have been observed in
predictive eye-tracking models, whereby data voids
and occlusions hamper tracking effectiveness
(Jyotsna et al., 2023).
As part of this research, to increase the
robustness of eye-ball tracking for human-computer
interaction, the dataset can be extended with an
additional 20,000 images of different movements.
Tracking can also be improved by applying various
methods to the existing images. Reducing the
computational load by 30% while keeping model
accuracy above 90% would allow the model to run on
edge devices such as the Raspberry Pi or NVIDIA
Jetson.
Deep learning-based approaches for gaze analysis
have recently been optimized to compress model
size by up to 40% while maintaining real-time speed
(Ghosh et al., 2024). Retinal fundus images captured
with multiple cameras enable novel eye-ball tracking
approaches; however, the accuracy of such a
multi-camera system is 30% lower than single-camera
eye-ball tracking, limited by high computational
complexity and a potentially biased dataset under
heavy occlusions and rapid movements
(Zuo et al., 2024). In future work, the
CNN model can be optimized to reduce the
computational load by 30%, and 20,000 additional
images should be used to expand the dataset, further
enhancing robustness.
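One way to realize the proposed reduction in computational load is standard post-training quantization in TensorFlow Lite; this is a generic sketch under assumed file names (`gaze_cnn.h5` is a placeholder for the trained model), not the study's exact optimization pipeline.

```python
import tensorflow as tf

# Placeholder path for the trained Keras eye-tracking model.
gaze_cnn = tf.keras.models.load_model("gaze_cnn.h5")

# Post-training dynamic-range quantization: smaller model, similar accuracy.
converter = tf.lite.TFLiteConverter.from_keras_model(gaze_cnn)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("gaze_cnn_quant.tflite", "wb") as f:
    f.write(tflite_model)
```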
7 CONCLUSIONS
An eye-ball tracking system based on a CNN was built
and evaluated with 1,000 to 10,000 images. Training
covered distinct real-world environmental conditions
such as head orientation, occlusions, eye conditions,
and lighting levels. Unlike in other studies, the CNN
and RNN comparison here showed better accuracy for
the CNN (96.5%) with faster per-frame processing at
the largest dataset size thanks to an optimized
technique, while the RNN reached a maximum accuracy
of only 85% with slower computation. With its
processing time growing only gradually as the dataset
size increases, the CNN remains capable of real-time
operation and offers the stability and efficiency
needed for practical eye-tracking applications.
REFERENCES
H. O. Edughele, Y. Zhang, F. Muhammad-Sukki, Q.-T.
Vien, H. Morris-Cafiero and M. Opoku Agyeman,
"Eye-Tracking Assistive Technologies for Individuals
With Amyotrophic Lateral Sclerosis," in IEEE Access,
vol. 10, pp. 41952-41972, 2022, doi:
10.1109/ACCESS.2022.3164075.
K. Iddrisu, W. Shariff, P. Corcoran, N. E. O'Connor, J.
Lemley and S. Little, "Event Camera-Based Eye
Motion Analysis: A Survey," in IEEE Access, vol. 12,
pp. 136783-136804, 2024, doi:
10.1109/ACCESS.2024.3462109.
R. Kannan Megalingam, S. Kuttankulangara Manoharan,
G. Riju and S. Makkal Mohandas, "Netravaad:
Interactive Eye Based Communication System for
People With Speech Issues," in IEEE Access, vol. 12,
pp. 69838-69852, 2024, doi:
10.1109/ACCESS.2024.3402334.
F. Zuo, P. Jing, J. Sun, J. Duan, Y. Ji and Y. Liu, "Deep
Learning-Based Eye-Tracking Analysis for Diagnosis
of Alzheimer's Disease Using 3D Comprehensive
Visual Stimuli," in IEEE Journal of Biomedical and
Health Informatics, vol. 28, no. 5, pp. 2781-2793, May
2024, doi: 10.1109/JBHI.2024.3365172.
Y. Wang, S. Lu and D. Harter, "Towards Collaborative and
Intelligent Learning Environments Based on Eye
Tracking Data and Learning Analytics: A Survey," in
IEEE Access, vol. 9, pp. 137991-138002, 2021, doi:
10.1109/ACCESS.2021.3117780.
C. Jyotsna, J. Amudha, A. Ram, D. Fruet and G. Nollo,
"PredictEYE: Personalized Time Series Model for
Mental State Prediction Using Eye Tracking," in IEEE
Access, vol. 11, pp. 128383-128409, 2023, doi:
10.1109/ACCESS.2023.3332762.
S. Ghosh, A. Dhall, M. Hayat, J. Knibbe and Q. Ji,
"Automatic Gaze Analysis: A Survey of Deep Learning
Based Approaches," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 46, no. 1, pp.
61-84, Jan. 2024, doi: 10.1109/TPAMI.2023.3321337.
J. Santhosh, D. Dzsotjan and S. Ishimaru, "Multimodal
Assessment of Interest Levels in Reading: Integrating
Eye-Tracking and Physiological Sensing," in IEEE
Access, vol. 11, pp. 93994-94008, 2023, doi:
10.1109/ACCESS.2023.3311268.
R. Rathnayake et al., "Current Trends in Human Pupil
Localization: A Review," in IEEE Access, vol. 11, pp.
115836-115853, 2023, doi:
10.1109/ACCESS.2023.3325293.
K. A. Thakoor, S. C. Koorathota, D. C. Hood and P. Sajda,
"Robust and Interpretable Convolutional Neural
Networks to Detect Glaucoma in Optical Coherence
Tomography Images," in IEEE Transactions on
Biomedical Engineering, vol. 68, no. 8, pp. 2456-2466,
Aug. 2021, doi: 10.1109/TBME.2020.3043215.