The Current Research Status of Computer Vision
Yongtao Li
Mathematics Department, Santa Monica College, Santa Monica, U.S.A.
Keywords: Supervised Learning, Self-Supervised Learning, Computer Vision.
Abstract: This article explores two fundamental algorithmic techniques in the field of computer vision. In image processing, the goal of computer vision is to enable a system to interpret, understand, and extract useful information from visual input in a manner similar to human perception, and these capabilities rest on algorithmic techniques. Research on individual algorithmic techniques in computer vision has become increasingly in-depth, and the understanding of these technical concepts has become increasingly clear. This article provides an overview of two methods, supervised learning and self-supervised learning, briefly listing their key methods as well as their respective advantages and disadvantages. It also compares the two methods with respect to training efficiency, data utilization, and generalization ability. Finally, this article proposes that future research should integrate various algorithmic techniques to achieve a balance between efficiency and accuracy.
1 INTRODUCTION
Artificial intelligence (AI) is a tool designed to imitate human behavior. The foundation of AI is the use of machine learning methods and technologies to enable robots to perform tasks independently or with assistance, acting as though they possess specific cognitive abilities (Morandín-Ahuerma, 2022). Computer vision aims to enable AI
to "understand" the visual world (such as images and
videos). The main role of computer vision in image
processing is to enable machines to possess visual
understanding capabilities similar to those of humans,
allowing them to effectively interpret and respond to
visual information (Anjiali & Bappaditva, 2025). Its
research scope has evolved from simple data
collection to methods and concepts that combine
digital image processing, pattern recognition, machine
learning, and computer graphics. Computer vision
mainly focuses on image processing in numerous
application fields (Wiley & Lucas, 2018). It combines
a large amount of data analysis with technologies from
a wide range of application areas. Most computer
vision tasks are centered around extracting
information about events or descriptions from input
scenes (Kotappa et al, 2022). With the continuous development of AI, its essence, the learning method, is constantly changing, with the aim of improving process efficiency and better helping humans solve problems. The core of machine learning is that machines use algorithms to analyze massive amounts of data; by learning from the data, they mine the latent connections within it and train an effective model, which is then applied to decision-making or prediction (Kong, 2019).
At present, the classic learning paradigms of artificial intelligence include supervised learning and self-supervised learning. Supervised learning is the more traditional approach; as the name suggests, the machine learns under supervision. This paradigm learns a system's input-output relationship from a specific collection of paired input-output training examples (Nkemdilim et al, 2024).
Self-supervised learning (SSL) is a paradigm that enables models to learn rich and generalizable representations from unlabeled data by solving pretext tasks. SSL uses the natural structure of raw data to create pseudo-labels, removing the need for the human annotation required by classical supervised learning, which depends on labeled data for training. Self-supervised learning is already being applied in computer vision, and it has proven successful in adjacent fields: in natural language processing, for example, models enrich their semantic representations by analyzing large amounts of unlabeled text, which allows the pretext tasks to be accomplished (Blaire et al, 2025).
This paper finds that few existing articles provide a preliminary algorithmic summary of the field of computer vision. Therefore, this article gives a relatively basic introduction to and comparison of these two common learning methods that are currently in use or under research, drawing on representative case-study articles. In this way, both researchers and readers eager to acquire knowledge can obtain the basic concepts of these learning methods and their applications in artificial intelligence.
2 OVERVIEW OF MAINSTREAM
TECHNOLOGIES
2.1 Self-Supervised Learning
Self-supervised learning (SSL) is a method that
enables models to learn rich and generalizable
representations from unlabeled data, achieving this
goal by solving predefined tasks. Unlike traditional
supervised learning, which requires a large amount of
labeled data for training or a significant amount of
imitation of human behavior to achieve the goal, self-
supervised learning utilizes the inherent structure of
raw data to generate pseudo-labels, eliminating the
need for manual labeling. Essentially, self-supervised learning is a learning method that operates without human supervision. The following are the core goals of self-supervised learning (Blaire et al, 2025).
First, regarding representation learning, the goal of SSL is to acquire reliable, transferable feature representations that apply across applications and datasets. Second, regarding data efficiency, by using large volumes of unlabeled data, SSL improves the scalability of learning and reduces reliance on manually labeled datasets. Third, regarding generalization, self-supervised representations can capture domain-invariant and task-agnostic properties, allowing efficient generalization to other domains. Fourth, regarding annotation cost, SSL techniques significantly reduce the expense and labor associated with data annotation because they do not require explicit labels.
In self-supervised learning, the model is trained
by automatically generating supervisory signals from
unlabeled data, thereby avoiding the cost of manual
annotation. Since the core of this article is to compare
the advantages and disadvantages of computer vision
under different algorithms, we will not elaborate on
each paradigm in detail. Instead, we will only briefly
introduce the core idea of each one. Next, the paper
will extract representative paradigms from them to
illustrate the characteristics of self-supervised
learning. The following are five commonly used self-
supervised learning paradigms and their typical
methods.
2.1.1 Contrastive Learning
The core idea is to learn representations by pulling similar samples (positive pairs) closer together and pushing dissimilar samples (negative pairs) apart. There are three key methods. SimCLR generates positive pairs through data augmentation, draws negative samples from the other samples in the same batch, and uses the InfoNCE loss. MoCo (Momentum Contrast) maintains a dynamically updated queue as the negative-sample library and enhances consistency through a momentum encoder. CLIP (cross-modal contrastive learning) aligns the embeddings of images and texts and is used for multimodal tasks. The advantage of contrastive learning lies in its strong adaptability to data augmentation, making it suitable for fields such as vision and speech. Its disadvantage is that it depends on a large number of negative samples, which incurs a high computational cost (Hu et al, 2024). A minimal sketch of the InfoNCE objective follows.
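To make the contrastive objective concrete, below is a minimal sketch of an InfoNCE-style loss in the SimCLR setting, assuming PyTorch; the function name and tensor shapes are illustrative rather than taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    """Minimal InfoNCE loss for a batch of augmented view pairs.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Each (z1[i], z2[i]) is a positive pair; all other samples in the
    batch serve as negatives, as in SimCLR.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = z @ z.t() / temperature                  # (2N, 2N) similarity logits
    sim.fill_diagonal_(float('-inf'))              # a sample is never its own negative
    n = z1.size(0)
    # The positive for row i is column i+n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```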
2.1.2 Generative Model
The core idea is to learn representations by reconstructing input data or predicting missing parts. The typical methods are the Masked Autoencoder (MAE), BERT (in the NLP field), and VAE/GAN. The Masked Autoencoder (MAE) randomly masks parts of the input (such as image patches or text words) and trains the model to reconstruct the missing parts; it is widely used with the Vision Transformer. BERT predicts masked words through Masked Language Modeling (MLM). VAE/GAN indicates that generative models can also be regarded as a form of self-supervised learning. The advantages are that the process needs no negative samples and suits high-dimensional data such as text and images. The disadvantage is that the reconstruction task may be too simple to learn high-level semantics. A schematic sketch of the masked-reconstruction objective is given below.
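The following sketch illustrates the masked-reconstruction objective, assuming PyTorch; the encoder and decoder are placeholders, and real MAE implementations insert learned mask tokens before decoding rather than decoding the latent directly.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_step(patches, encoder, decoder, mask_ratio=0.75):
    """One masked-autoencoding step, sketched.

    patches: (N, L, D) image patches flattened to vectors.
    encoder/decoder: placeholder nn.Module pair; the decoder is assumed
    to predict all L patches from the visible-patch encoding.
    """
    N, L, D = patches.shape
    num_keep = int(L * (1 - mask_ratio))
    # Randomly choose which patches stay visible for each sample.
    perm = torch.rand(N, L).argsort(dim=1)
    keep = perm[:, :num_keep]                      # indices of visible patches
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                      # encode only visible patches
    recon = decoder(latent)                        # (N, L, D) predicted patches
    # Loss is computed on the masked positions only, as in MAE.
    mask = torch.ones(N, L, dtype=torch.bool)
    mask.scatter_(1, keep, False)
    return F.mse_loss(recon[mask], patches[mask])
```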
2.1.3 Based on Prediction Tasks
The core idea is to design auxiliary tasks (pretext tasks) and to use the inherent structure of the data as the supervisory signal. The common tasks are rotation prediction, puzzle (jigsaw) reconstruction, and temporal-sequence prediction. The advantage of this family is that it is simple and intuitive. The disadvantages are that task design requires domain knowledge and that the learned features may not be universally applicable. A sketch of the rotation-prediction pretext task follows.
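As an illustration of a pretext task, here is a minimal sketch of rotation prediction, assuming PyTorch; the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def rotation_pretext_batch(images):
    """Build a rotation-prediction pretext batch, sketched.

    images: (N, C, H, W). Each image is rotated by 0/90/180/270 degrees;
    the rotation index becomes the pseudo-label the model must predict.
    """
    rotated, labels = [], []
    for k in range(4):                             # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Usage: x, y = rotation_pretext_batch(batch); loss = F.cross_entropy(model(x), y)
```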
2.1.4 Based on Clustering
Its core idea is to generate pseudo-labels through online clustering (for example, k-means) and to iteratively optimize the representation. There are two typical methods. The first is DeepCluster, which alternates between clustering and classification tasks. The second is SwAV, which combines online clustering with contrastive learning, avoiding explicit negative samples. The advantage is that it reduces reliance on negative samples; the disadvantage is that the clustering process may be unstable. A minimal pseudo-labelling sketch follows.
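Below is a minimal sketch of DeepCluster-style pseudo-labelling, assuming scikit-learn's KMeans; in the actual method, clustering in feature space and classifier training alternate each epoch.

```python
from sklearn.cluster import KMeans

def pseudo_labels_from_features(features, num_clusters=100):
    """DeepCluster-style pseudo-labelling, sketched.

    features: (N, D) array of embeddings from the current encoder.
    Returns one cluster id per sample, which is then used as a
    classification target for the next training round.
    """
    kmeans = KMeans(n_clusters=num_clusters, n_init=10)
    return kmeans.fit_predict(features)            # (N,) pseudo-labels
```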
2.1.5 Based on Distillation
Its core idea is that different views of the same model (such as differently augmented samples) supervise each other. It has two typical methods. First, in DINO, the student network learns from the output of the teacher network, where the teacher is the exponential moving average (EMA) of the student. Second, BYOL eliminates negative samples and relies solely on positive samples and momentum encoders. The advantages are that it needs no negative samples and is well suited to training; the disadvantage is that it depends on design choices such as the momentum encoder. A sketch of the EMA teacher update is given below.
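The following is a sketch of the momentum (EMA) teacher update shared by BYOL and DINO, assuming PyTorch; the momentum value is illustrative, and real implementations also copy non-parameter buffers.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Momentum (EMA) teacher update, sketched.

    After each optimizer step on the student, the teacher's weights
    drift slowly toward the student's:
        theta_t <- m * theta_t + (1 - m) * theta_s
    """
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```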
2.2 Supervised Learning
Supervised learning is a machine learning paradigm that trains models using labeled data. It learns a mapping function (model) from known inputs (such as images) to corresponding outputs (such as class labels or bounding boxes), so that the model can make accurate predictions for new input data. A loss function (such as cross-entropy loss or mean squared error) quantifies the difference between the model's predictions and the true labels, and optimization methods such as gradient descent minimize this loss by gradually adjusting the model parameters. The ultimate goal is for the model to perform well on unseen test data while avoiding overfitting, which is achieved through techniques such as regularization and data augmentation. There are two typical methods (Nkemdilim et al, 2024), and a sketch of a single supervised training step follows.
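Here is a minimal sketch of one supervised training step, assuming PyTorch; the surrounding loop (data loading, epochs, evaluation) is omitted.

```python
import torch
import torch.nn.functional as F

def supervised_step(model, optimizer, images, labels):
    """One supervised training step, sketched: forward pass, loss
    against ground-truth labels, backpropagation, parameter update."""
    optimizer.zero_grad()
    logits = model(images)                         # (N, num_classes)
    loss = F.cross_entropy(logits, labels)         # compare to true labels
    loss.backward()                                # backpropagate the error
    optimizer.step()                               # gradient-descent update
    return loss.item()
```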
2.2.1 Convolutional Neural Network (CNN)
Its representative models are ResNet, EfficientNet, MobileNet, etc. It efficiently captures the hierarchical features of an image through local receptive fields, weight sharing, and pooling operations, as in the minimal example below.
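The tiny network below is a hedged illustration of those three ingredients in code, assuming PyTorch; it is not any of the named architectures, which add depth, skip connections, and other refinements.

```python
import torch.nn as nn

# Minimal CNN: local receptive fields (3x3 kernels), weight sharing
# (convolution), and pooling, followed by a linear classifier.
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),    # local receptive field, shared weights
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling halves spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                       # global average pooling
    nn.Flatten(),
    nn.Linear(32, 10),                             # class scores
)
```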
2.2.2 Visual Transformer (ViT)
Its representative models are the Vision Transformer and the Swin Transformer. The image is divided into a sequence of patches, and global dependency relationships are modeled through the self-attention mechanism, breaking through the locality limitation of CNNs. In short, it performs exceptionally well on large datasets but requires a larger volume of data. A sketch of the patch-sequence construction follows.
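Below is a sketch of the patch-sequence construction that precedes self-attention, assuming PyTorch; the patch size is illustrative, and position embeddings and the Transformer encoder itself are omitted.

```python
def patchify(images, patch_size=16):
    """Split images into a sequence of flattened patches, sketched.

    images: (N, C, H, W) -> (N, L, patch_size*patch_size*C), where
    L = (H/patch_size) * (W/patch_size). ViT feeds this sequence,
    plus position embeddings, into a stack of self-attention layers.
    """
    N, C, H, W = images.shape
    p = patch_size
    x = images.unfold(2, p, p).unfold(3, p, p)     # (N, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(N, -1, C * p * p)
    return x
```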
3 LITERATURE ANALYSIS
3.1 Research on Self-Supervised Learning
This paper uses the literature "Masked Autoencoders Are Scalable Vision Learners" to illustrate the characteristics of self-supervised learning. The MAE architecture was used in the experiment. During the pre-training stage, a large proportion of image patches (for example, 75%) were randomly masked out, and the few visible patches were encoded by the encoder. After the encoder, mask tokens were introduced, and a small decoder processed the complete set of encoded patches and mask tokens to reconstruct the original image in pixel form. After pre-training, the decoder was discarded and the encoder was applied to undisturbed images (the complete patch set) for recognition tasks. The article highlights the advantages of self-supervised learning by measuring and analyzing the performance of the MAE model. The analysis is conducted from three aspects: datasets, evaluation metrics, and experimental results (He et al, 2021).
3.1.1 Dataset
The core experiments of this paper are based on the ImageNet-1K dataset (1.28 million images) for pre-training and evaluation. During the pre-training stage, MAE adopted an innovative data-usage approach: the input data are the original ImageNet-1K images; the preprocessing step randomly masks 75% of the image patches; and the training mechanism encodes only the 25% of patches that remain visible.
3.1.2 Evaluation Indicators
Table 1 shows that MAE demonstrates outstanding
performance across multiple dimensions.
Table 1: Performance of the MAE framework across four key evaluation dimensions. Results are reported as training time (shorter is better), reconstruction accuracy (higher is better), Top-1 accuracy (higher is better), and model scalability (higher is better).

Evaluation dimension | Specific indicator | Performance
Pre-training efficiency | Training time (ViT-L) | 31 hours
Image reconstruction quality | Pixel-level reconstruction accuracy | Significantly better than the baseline
Transfer learning performance | ImageNet Top-1 accuracy | 85.9% (ViT-L)
Model scalability | ViT-Huge accuracy | 87.8%
3.1.3 Analysis of Experimental Results
Regarding the pre-training effect, the high mask rate (75%) has proven to be the optimal choice, as it both keeps the learning task challenging and maintains training efficiency. The asymmetric encoder-decoder design reduces the computational load by a factor of 3.3. After 1600 epochs of training, model performance continued to improve without reaching saturation.
Regarding transfer-learning performance, MAE achieved an accuracy of 85.9% on the ImageNet-1K classification task (ViT-L), an APbox of 53.3 on the COCO object detection task, and an mIoU of 53.6 on the ADE20K semantic segmentation task.
Regarding model scalability, the method stably supports the training of ultra-large models such as ViT-Huge, achieving 87.8% accuracy on ImageNet with ViT-Huge (448px). Computing resource requirements grow linearly with model size, demonstrating excellent scalability. Regarding robustness, the error rate on the corrupted-image test set ImageNet-C is only 33.8%, and MAE achieved an accuracy of 76.7% on the ImageNet-A adversarial-sample test.
3.2 Research on Supervised Learning
This paper analyzes "Multiscale Residual Learning of Graph Convolutional Sequence Chunks for Human Motion Prediction". The datasets used in that paper are CMU Mocap and Human3.6M, both standard 3D human motion-capture datasets containing various action categories (such as walking, running, and jumping). Supervised learning (SL) methods (such as Seq2Seq, TrajDep, and MSR-GCN) rely on large-scale labeled data and are trained by minimizing the error between predicted and actual values (such as the MSE).
3.2.1 Dataset
The paper uses two strictly annotated motion-capture datasets, which meet supervised learning's requirement for paired input-output data. The first is CMU Mocap. The original data contain the 3D coordinates of 38 joints, from which the paper selects 24 key joints. Each sequence is divided into input-output pairs through a sliding window (with a length of 3 seconds and a step size of 10 frames). As supervisory information, the input is the joint trajectories (represented by position or axis-angle) for 1 second of history (e.g., 50 frames), and the output is the actual motion sequence for the next second, strictly aligned with the input. Regarding action diversity, the dataset includes 8 types of actions (such as basketball, running, and jumping) to verify the model's generalization across different movement patterns.
The second is Human3.6M, whose data format comprises 32 joints; the paper selects 21 of them and represents the posture using the axis-angle method (D = 3). As supervisory information, a fixed split is used: subjects S5 and S11 serve as the test/validation set, while the remaining 5 subjects are used for training. The model takes 1 second of input and predicts the next 1 second, across a total of 15 action types such as walking and eating (Zand, Etemad & Greenspan, 2023). The sliding-window pairing is sketched below.
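To illustrate the input-output pairing described above, here is a minimal sliding-window sketch in Python with NumPy; the frame counts assume roughly 50 fps and are illustrative, not the paper's exact preprocessing code.

```python
import numpy as np

def make_motion_pairs(sequence, in_frames=50, out_frames=50, step=10):
    """Slice a motion-capture sequence into supervised input-output pairs.

    sequence: (T, J, 3) array of joint positions over T frames.
    Returns a list of (input, target) pairs: 1 second of history as
    input, the following 1 second as the prediction target.
    """
    pairs = []
    window = in_frames + out_frames
    for start in range(0, len(sequence) - window + 1, step):
        x = sequence[start:start + in_frames]          # observed history
        y = sequence[start + in_frames:start + window] # future to predict
        pairs.append((x, y))
    return pairs
```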
3.2.2 Evaluation Indicators
This paper uses MPJPE (Mean Per Joint Position Error) to measure the Euclidean distance (in mm) between the predicted joint positions and the ground-truth values, as sketched below.
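A minimal sketch of the MPJPE computation, assuming NumPy arrays of joint positions in millimetres:

```python
import numpy as np

def mpjpe(pred, target):
    """Mean Per Joint Position Error, sketched.

    pred, target: (T, J, 3) predicted and ground-truth joint positions
    in millimetres. Returns the Euclidean distance between predicted
    and true positions, averaged over all joints and frames.
    """
    return np.linalg.norm(pred - target, axis=-1).mean()
```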
For the evaluation scenario, both short-term and long-term prediction are used. Short-term prediction reports errors at 80 ms, 160 ms, 320 ms, and 400 ms to test local dynamic-modeling capability; long-term prediction uses the error at 1000 ms (1 second) to verify the model's ability to capture the movement trend.
An advantage of supervised learning here is the direct optimization objective: MPJPE clearly quantifies the deviation between predicted and actual values, and backpropagation can precisely adjust the model parameters. It also requires no complex reward design.
3.2.3 Experimental Result Analysis
First are the comparative experiments. In the baseline comparison, ResChunk is compared with other supervised learning methods (such as Seq2Seq, MSR-GCN, and SGSN), all on the same datasets and the MPJPE metric. The key results are as follows: for the 80 ms prediction error on the "running" action of CMU Mocap, ResChunk's 12.35 mm is lower than MSR-GCN's 12.84 mm; for the 1000 ms error on the "jump" action, ResChunk's 121.38 mm is still lower than MSR-GCN's 124.79 mm; and for the average 1000 ms error on Human3.6M, ResChunk's 81.95 mm is close to the optimal value (SGSN, 81.13 mm).
In the 1000 ms long-term prediction, ResChunk's error was significantly lower than that of traditional methods (for the "jumping" action in the CMU dataset, 121.38 mm vs. Seq2Seq's 162.84 mm). This indicates that supervised learning, by directly optimizing MPJPE, can effectively model the spatio-temporal continuity of motion sequences and avoid the pattern collapse common in reinforcement learning (where predictions degenerate into average postures).
4 CHALLENGES AND
PROSPECTS
4.1 Challenges
Regarding annotation dependency and cost, the performance of SL depends heavily on large-scale annotated data (such as ImageNet and Human3.6M). Annotation is expensive and difficult to scale to niche fields such as healthcare and industrial inspection. Moreover, SL models are prone to overfitting the labeled data distribution and show poor robustness against adversarial samples (ImageNet-A), occlusion, and distribution shift (ImageNet-C), where MAE's error rate is 7.5% lower. Furthermore, SL must process the entire input (such as 100% of the image patches for ViT-L), whereas MAE reduces FLOPs by a factor of 3.3 through its masking mechanism (25% visible patches), highlighting SL's computational-efficiency disadvantage.
On the SSL side, the pixel-reconstruction objective of MAE may not relate directly to high-level semantics (such as fine-grained classification), so more complex pre-training tasks (such as combinations with contrastive learning) need to be designed. Although MAE can be scaled to ViT-H, it requires hyperparameter tuning (such as the mask rate and loss weights), and training stability still relies on engineering techniques such as gradient clipping. MAE's masking strategies for sequential and 3D data such as videos and point clouds are not yet mature, whereas SL can be adapted directly through labeling (e.g., 3D human pose estimation).
4.2 Future Outlook
Regarding SSL + SL collaborative training, combining the general representation ability of MAE with the semantic-alignment advantage of SL (such as MAE pre-training followed by SL fine-tuning) can achieve higher accuracy in limited-sample scenarios. Regarding dynamic masking strategies, MAE can explore adaptive masking rates (such as adjusting to the complexity of the image) to further enhance pre-training efficiency. Regarding robustness enhancement, adversarial training or diffusion models can improve MAE's robustness to noise and occlusion, narrowing the generalization gap to human vision. Regarding a cross-modal general model, MAE can be extended to videos (spatio-temporal masking) and text-image pairs (multimodal reconstruction) to build a unified self-supervised pre-training framework. Regarding low-cost labeling alternatives, MAE can generate pseudo-labels to assist SL training (as in semi-supervised learning), reducing reliance on manual labeling; a cross-modal evaluation suite covering distribution shifts and adversarial attacks (such as ImageNet-C for SSL + SL) can also be established.
5 CONCLUSIONS
Supervised learning and self-supervised learning are
two core paradigms of machine learning. Supervised
learning relies on labeled data to directly optimize the
task objective, suitable for high-precision scenarios
but with high data costs; self-supervised learning uses
designed auxiliary tasks to pre-train with unlabeled
data, reducing the reliance on labeling but requiring
fine-tuning. The former performs well in
deterministic tasks, while the latter is good at learning
general features. The current trend is to combine the
advantages of both, first extracting general features
through self-supervised pre-training, and then fine-
tuning specific tasks with supervised learning to
achieve efficient and high-performance AI systems.
This hybrid paradigm is demonstrating strong
potential in fields such as healthcare and autonomous
driving.
Future research can explore the integration of SSL
and SL (such as self-supervised pre-training +
supervised fine-tuning) to balance efficiency and
accuracy; at the same time, improving the training
stability of RL (such as combining imitation learning)
may enable it to play a greater role in long-term
sequence tasks. Moreover, dynamic masking
strategies, cross-modal pre-training, and other
directions are expected to further enhance the
generalization ability of SSL. This study provides a
theoretical basis for the selection of different learning
paradigms, especially in scenarios with limited data
or high robustness requirements, where SSL
demonstrates stronger application potential.
REFERENCES
Anjiali, V., & Bappaditva, J. (2025). Aim of Computer
Vision in Image Processing. International Journal of
Scientific Research in Science and Technology, 12(1),
114–116.
Blaire, L., Oakleigh, H., Brycen, C., Kaitlynn, L., & Nadine,
J. (2025). Understanding Self-Supervised Learning and
Its Future Directions. hal-04964098.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2111.06377
Hu, H., Wang, X., Zhang, Y., Chen, Q., & Guan, Q. (2024).
A comprehensive survey on contrastive learning.
Neurocomputing, 610, 128645.
Kong, X. R. (2019). Overview of Machine Learning. Electronic Production, (24), 82-84+38.
Kotappa, Y. G., Krushika, M., Ravichandra, M. et al.
(2022). A Review Paper on Computer Vision and
Image Processing. International Journal of Advanced
Research in Science, Communication and Technology,
68–72.
Morandín-Ahuerma, F. (2022). What is Artificial
Intelligence? International Journal of Research
Publication and Reviews, 03(12), 1947–1951.
Nkemdilim, M. N., Uzoamaka, P. R., Daniel, U., & Chidi,
M. K. (2024). An Overview of Supervised Machine
Learning Paradigms and their Classifiers. International
Journal of Advanced Engineering, Management and
Science, 10(3), 24–32.
Wiley, V., & Lucas, T. (2018). Computer Vision and Image
Processing: A Paper Review. International Journal of
Artificial Intelligence Research, 2(1), 22.
Zand, M., Etemad, A., & Greenspan, M. (2023). Multiscale
Residual Learning of Graph Convolutional Sequence
Chunks for Human Motion Prediction (Version 1).
arXiv. https://doi.org/10.48550/ARXIV.2308.16801