The Current Research Status of Computer Vision
Yongtao Li
Mathematics Department, Santa Monica College, Santa Monica, U.S.A.
Keywords: Supervised Learning, Self-Supervised Learning, Computer Vision.
Abstract: This article explores two fundamental algorithmic techniques in the field of computer vision. In image processing, the goal of computer vision is to enable a system to interpret, understand, and extract useful information from visual input in a manner similar to human perception, and these capabilities rest on algorithmic techniques. Research on individual algorithmic techniques in computer vision has become increasingly in-depth, and the understanding of these technical concepts has become increasingly clear. This article provides an overview of two methods, supervised learning and self-supervised learning, briefly listing their key methods as well as their respective advantages and disadvantages. It also compares the two methods with respect to training efficiency, data utilization, and generalization ability. Finally, this article proposes that future research should integrate various algorithmic techniques to achieve a balance between efficiency and accuracy.
1 INTRODUCTION
Artificial intelligence (AI) is a tool designed to imitate human behavior. The foundation of AI is the use of machine learning methods and technologies to enable robots to perform tasks independently or with assistance, acting as though they possess specific cognitive abilities (Morandín-Ahuerma, 2022). Computer vision aims to enable AI
to "understand" the visual world (such as images and
videos). The main role of computer vision in image
processing is to enable machines to possess visual
understanding capabilities similar to those of humans,
allowing them to effectively interpret and respond to
visual information (Anjiali & Bappaditva, 2025). Its
research scope has evolved from simple data
collection to methods and concepts that combine
digital image processing, pattern recognition, machine
learning, and computer graphics. Computer vision
mainly focuses on image processing in numerous
application fields (Wiley & Lucas, 2018). It combines
a large amount of data analysis with technologies from
a wide range of application areas. Most computer
vision tasks are centered around extracting
information about events or descriptions from input
scenes (Kotappa et al, 2022). With the continuous development of AI, its essence, the learning method, is constantly changing, with the aim of improving process efficiency and better helping humans solve problems. The core of machine learning is that machines use algorithms to analyze massive amounts of data; by learning from the data, they mine the latent connections within it and train an effective model, which is then applied to decision-making or prediction (Kong, 2019).
At present, the classic learning paradigms of artificial intelligence include supervised learning and self-supervised learning. Supervised learning is the more traditional approach; as the name suggests, the machine learns under supervision. This paradigm learns a system's input-output relationship from a specific collection of paired input-output training examples (Nkemdilim et al, 2024).
Self-supervised learning (SSL) is a paradigm that enables models to learn rich and generalizable representations from unlabeled data by solving pretext tasks. SSL uses the natural structure of raw data to create pseudo-labels, removing the need for the human annotation required by classical supervised learning, which depends on labeled data for training. Self-supervised learning is already being applied in computer vision, and it has proven successful in adjacent fields: in natural language processing, for example, models enrich their semantic representations by analyzing large amounts of unlabeled text, which allows the pretext tasks to be accomplished (Blaire et al, 2025).
This paper finds that few existing articles provide a preliminary algorithmic summary of the field of computer vision. Therefore, this article gives a relatively basic introduction to and comparison of these two common learning methods that are currently in use or under research, drawing on representative case-study articles. In this way, both researchers and readers eager to acquire knowledge can obtain the basic concepts of these learning methods and their applications in artificial intelligence.
2 OVERVIEW OF MAINSTREAM
TECHNOLOGIES
2.1 Self-Supervised Learning
Self-supervised learning (SSL) is a method that
enables models to learn rich and generalizable
representations from unlabeled data, achieving this
goal by solving predefined tasks. Unlike traditional
supervised learning, which requires a large amount of
labeled data for training or a significant amount of
imitation of human behavior to achieve the goal, self-
supervised learning utilizes the inherent structure of
raw data to generate pseudo-labels, eliminating the
need for manual labeling. Essentially, self-supervised learning is a learning method that operates without human supervision. The following are the core goals of self-supervised learning (Blaire et al, 2025).
First, regarding representation learning, the goal of SSL is to acquire reliable, transferable feature representations that apply across applications and datasets. Second, regarding data efficiency, by using large volumes of unlabeled data, SSL improves the scalability of learning and reduces reliance on manually labeled datasets. Third, regarding generalization, self-supervised representations can capture domain-invariant and task-agnostic properties, allowing efficient generalization to other domains. Fourth, regarding annotation cost, SSL techniques significantly reduce the expense and labor associated with data annotation because they do not require explicit labels.
In self-supervised learning, the model is trained
by automatically generating supervisory signals from
unlabeled data, thereby avoiding the cost of manual
annotation. Since the core of this article is to compare
the advantages and disadvantages of computer vision
under different algorithms, we will not elaborate on
each paradigm in detail. Instead, we will only briefly
introduce the core idea of each one. Next, the paper
will extract representative paradigms from them to
illustrate the characteristics of self-supervised
learning. The following are five commonly used self-
supervised learning paradigms and their typical
methods.
2.1.1 Contrastive Learning
The core idea is to learn representations by pulling similar samples (positive pairs) closer together and pushing dissimilar samples (negative pairs) apart. There are three key methods. SimCLR generates positive pairs through data augmentation, draws negative samples from the other samples in the same batch, and uses the InfoNCE loss. MoCo (Momentum Contrast) maintains a dynamically updated queue as the negative-sample library and enhances consistency through a momentum encoder. CLIP (cross-modal contrastive learning) aligns the embeddings of images and texts and is used for multimodal tasks. The advantage of contrastive learning lies in its strong adaptability to data augmentation, making it suitable for fields such as vision and speech. Its disadvantage is that it depends on a large number of negative samples, which incurs a high computational cost (Hu et al, 2024). A minimal sketch of the InfoNCE objective follows.
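To make the contrastive objective concrete, below is a minimal sketch of an InfoNCE-style loss in the SimCLR setting, assuming PyTorch; the function name and tensor shapes are illustrative rather than taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    """Minimal InfoNCE loss for a batch of augmented view pairs.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Each (z1[i], z2[i]) is a positive pair; all other samples in the
    batch serve as negatives, as in SimCLR.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = z @ z.t() / temperature                  # (2N, 2N) similarity logits
    sim.fill_diagonal_(float('-inf'))              # a sample is never its own negative
    n = z1.size(0)
    # The positive for row i is column i+n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```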
2.1.2 Generative Model
The core idea is to learn representations by reconstructing input data or predicting missing parts. The typical methods are the Masked Autoencoder (MAE), BERT (in the NLP field), and VAE/GAN. The Masked Autoencoder (MAE) randomly masks parts of the input (such as image patches or text words) and trains the model to reconstruct the missing parts; it is widely used with the Vision Transformer. BERT predicts masked words through Masked Language Modeling (MLM). VAE/GAN indicates that generative models can also be regarded as a form of self-supervised learning. The advantages are that the process needs no negative samples and suits high-dimensional data such as text and images. The disadvantage is that the reconstruction task may be too simple to learn high-level semantics. A schematic sketch of the masked-reconstruction objective is given below.
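The following sketch illustrates the masked-reconstruction objective, assuming PyTorch; the encoder and decoder are placeholders, and real MAE implementations insert learned mask tokens before decoding rather than decoding the latent directly.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_step(patches, encoder, decoder, mask_ratio=0.75):
    """One masked-autoencoding step, sketched.

    patches: (N, L, D) image patches flattened to vectors.
    encoder/decoder: placeholder nn.Module pair; the decoder is assumed
    to predict all L patches from the visible-patch encoding.
    """
    N, L, D = patches.shape
    num_keep = int(L * (1 - mask_ratio))
    # Randomly choose which patches stay visible for each sample.
    perm = torch.rand(N, L).argsort(dim=1)
    keep = perm[:, :num_keep]                      # indices of visible patches
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                      # encode only visible patches
    recon = decoder(latent)                        # (N, L, D) predicted patches
    # Loss is computed on the masked positions only, as in MAE.
    mask = torch.ones(N, L, dtype=torch.bool)
    mask.scatter_(1, keep, False)
    return F.mse_loss(recon[mask], patches[mask])
```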
2.1.3 Based on Prediction Tasks
The core idea is to design auxiliary tasks (pretext tasks) and to use the inherent structure of the data as the supervisory signal. The common tasks are rotation prediction, puzzle (jigsaw) reconstruction, and temporal-sequence prediction. The advantage of this family is that it is simple and intuitive. The disadvantages are that task design requires domain knowledge and that the learned features may not be universally applicable. A sketch of the rotation-prediction pretext task follows.
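As an illustration of a pretext task, here is a minimal sketch of rotation prediction, assuming PyTorch; the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def rotation_pretext_batch(images):
    """Build a rotation-prediction pretext batch, sketched.

    images: (N, C, H, W). Each image is rotated by 0/90/180/270 degrees;
    the rotation index becomes the pseudo-label the model must predict.
    """
    rotated, labels = [], []
    for k in range(4):                             # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Usage: x, y = rotation_pretext_batch(batch); loss = F.cross_entropy(model(x), y)
```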
2.1.4 Based on Clustering
Its core idea is to generate pseudo-labels through online clustering (for example, k-means) and to iteratively optimize the representation. There are two typical methods. The first is DeepCluster, which alternates between clustering and classification tasks. The second is SwAV, which combines online clustering with contrastive learning, avoiding explicit negative samples. The advantage is that it reduces reliance on negative samples; the disadvantage is that the clustering process may be unstable. A minimal pseudo-labelling sketch follows.
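Below is a minimal sketch of DeepCluster-style pseudo-labelling, assuming scikit-learn's KMeans; in the actual method, clustering in feature space and classifier training alternate each epoch.

```python
from sklearn.cluster import KMeans

def pseudo_labels_from_features(features, num_clusters=100):
    """DeepCluster-style pseudo-labelling, sketched.

    features: (N, D) array of embeddings from the current encoder.
    Returns one cluster id per sample, which is then used as a
    classification target for the next training round.
    """
    kmeans = KMeans(n_clusters=num_clusters, n_init=10)
    return kmeans.fit_predict(features)            # (N,) pseudo-labels
```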
2.1.5 Based on Distillation
Its core idea is that different views of the same model (such as differently augmented samples) supervise each other. It has two typical methods. First, in DINO, the student network learns from the output of the teacher network, where the teacher is the exponential moving average (EMA) of the student. Second, BYOL eliminates negative samples and relies solely on positive samples and momentum encoders. The advantages are that it needs no negative samples and is well suited to training; the disadvantage is that it depends on design choices such as the momentum encoder. A sketch of the EMA teacher update is given below.
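The following is a sketch of the momentum (EMA) teacher update shared by BYOL and DINO, assuming PyTorch; the momentum value is illustrative, and real implementations also copy non-parameter buffers.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Momentum (EMA) teacher update, sketched.

    After each optimizer step on the student, the teacher's weights
    drift slowly toward the student's:
        theta_t <- m * theta_t + (1 - m) * theta_s
    """
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```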
2.2 Supervised Learning
Supervised learning is a machine learning paradigm that trains models using labeled data. It learns a mapping function (model) from known inputs (such as images) to corresponding outputs (such as class labels or bounding boxes), so that the model can make accurate predictions for new input data. A loss function (such as cross-entropy loss or mean squared error) quantifies the difference between the model's predictions and the true labels, and optimization methods such as gradient descent minimize this loss by gradually adjusting the model parameters. The ultimate goal is for the model to perform well on unseen test data while avoiding overfitting, which is achieved through techniques such as regularization and data augmentation. There are two typical methods (Nkemdilim et al, 2024), and a sketch of a single supervised training step follows.
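Here is a minimal sketch of one supervised training step, assuming PyTorch; the surrounding loop (data loading, epochs, evaluation) is omitted.

```python
import torch
import torch.nn.functional as F

def supervised_step(model, optimizer, images, labels):
    """One supervised training step, sketched: forward pass, loss
    against ground-truth labels, backpropagation, parameter update."""
    optimizer.zero_grad()
    logits = model(images)                         # (N, num_classes)
    loss = F.cross_entropy(logits, labels)         # compare to true labels
    loss.backward()                                # backpropagate the error
    optimizer.step()                               # gradient-descent update
    return loss.item()
```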
2.2.1 Convolutional Neural Network (CNN)
Its representative models are ResNet, EfficientNet, MobileNet, etc. It efficiently captures the hierarchical features of an image through local receptive fields, weight sharing, and pooling operations, as in the minimal example below.
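The tiny network below is a hedged illustration of those three ingredients in code, assuming PyTorch; it is not any of the named architectures, which add depth, skip connections, and other refinements.

```python
import torch.nn as nn

# Minimal CNN: local receptive fields (3x3 kernels), weight sharing
# (convolution), and pooling, followed by a linear classifier.
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),    # local receptive field, shared weights
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling halves spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                       # global average pooling
    nn.Flatten(),
    nn.Linear(32, 10),                             # class scores
)
```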
2.2.2 Visual Transformer (ViT)
Its representative models are the Vision Transformer and the Swin Transformer. The image is divided into a sequence of patches, and global dependency relationships are modeled through the self-attention mechanism, breaking through the locality limitation of CNNs. In short, it performs exceptionally well on large datasets but requires a larger volume of data. A sketch of the patch-sequence construction follows.
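Below is a sketch of the patch-sequence construction that precedes self-attention, assuming PyTorch; the patch size is illustrative, and position embeddings and the Transformer encoder itself are omitted.

```python
def patchify(images, patch_size=16):
    """Split images into a sequence of flattened patches, sketched.

    images: (N, C, H, W) -> (N, L, patch_size*patch_size*C), where
    L = (H/patch_size) * (W/patch_size). ViT feeds this sequence,
    plus position embeddings, into a stack of self-attention layers.
    """
    N, C, H, W = images.shape
    p = patch_size
    x = images.unfold(2, p, p).unfold(3, p, p)     # (N, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(N, -1, C * p * p)
    return x
```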
3 LITERATURE ANALYSIS
3.1 Research on Self-Supervised Learning
This paper uses the literature "Masked Autoencoders Are Scalable Vision Learners" to illustrate the characteristics of self-supervised learning. The MAE architecture was used in the experiment. During the pre-training stage, a large proportion of image patches (for example, 75%) were randomly masked out, and the few visible patches were encoded by the encoder. After the encoder, mask tokens were introduced, and a small decoder processed the complete set of encoded patches and mask tokens to reconstruct the original image in pixel form. After pre-training, the decoder was discarded and the encoder was applied to undisturbed images (the complete patch set) for recognition tasks. The article highlights the advantages of self-supervised learning by measuring and analyzing the performance of the MAE model. The analysis is conducted from three aspects: datasets, evaluation metrics, and experimental results (He et al, 2021).
3.1.1 Dataset
The core experiments of this paper are based on the ImageNet-1K dataset (1.28 million images) for pre-training and evaluation. During the pre-training stage, MAE adopted an innovative data-usage approach: the input data are the original ImageNet-1K images; the preprocessing step randomly masks 75% of the image patches; and the training mechanism encodes only the 25% of patches that remain visible.
3.1.2 Evaluation Indicators
Table 1 shows that MAE demonstrates outstanding
performance across multiple dimensions.
Table 1: Performance of the MAE framework across four key evaluation dimensions. Results are reported as training time (shorter is better), reconstruction accuracy (higher is better), Top-1 accuracy (higher is better), and model scalability (higher is better).

Evaluation dimension | Specific indicator | Performance
Pre-training efficiency | Training time (ViT-L) | 31 hours
Image reconstruction quality | Pixel-level reconstruction accuracy | Significantly better than the baseline
Transfer learning performance | ImageNet Top-1 accuracy | 85.9% (ViT-L)
Model scalability | ViT-Huge accuracy | 87.8%
3.1.3 Analysis of Experimental Results
Regarding the pre-training effect, the high mask rate (75%) has proven to be the optimal choice, as it both keeps the learning task challenging and maintains training efficiency. The asymmetric encoder-decoder design reduces the computational load by a factor of 3.3. After 1600 epochs of training, model performance continued to improve without reaching saturation.
Regarding transfer-learning performance, MAE achieved an accuracy of 85.9% on the ImageNet-1K classification task (ViT-L), an APbox of 53.3 on the COCO object detection task, and an mIoU of 53.6 on the ADE20K semantic segmentation task.
Regarding model scalability, the method stably supports the training of ultra-large models such as ViT-Huge, achieving 87.8% accuracy on ImageNet with ViT-Huge (448px). Computing resource requirements grow linearly with model size, demonstrating excellent scalability. Regarding robustness, the error rate on the corrupted-image test set ImageNet-C is only 33.8%, and MAE achieved an accuracy of 76.7% on the ImageNet-A adversarial-sample test.
3.2 Research on Supervised Learning
This paper analyzes "Multiscale Residual Learning of Graph Convolutional Sequence Chunks for Human Motion Prediction". The datasets used in that paper are CMU Mocap and Human3.6M, both standard 3D human motion-capture datasets containing various action categories (such as walking, running, and jumping). Supervised learning (SL) methods (such as Seq2Seq, TrajDep, and MSR-GCN) rely on large-scale labeled data and are trained by minimizing the error between predicted and actual values (such as the MSE).
3.2.1 Dataset
The paper uses two strictly annotated motion-capture datasets, which meet supervised learning's requirement for paired input-output data. The first is CMU Mocap. The original data contain the 3D coordinates of 38 joints, from which the paper selects 24 key joints. Each sequence is divided into input-output pairs through a sliding window (with a length of 3 seconds and a step size of 10 frames). As supervisory information, the input is the joint trajectories (represented by position or axis-angle) for 1 second of history (e.g., 50 frames), and the output is the actual motion sequence for the next second, strictly aligned with the input. Regarding action diversity, the dataset includes 8 types of actions (such as basketball, running, and jumping) to verify the model's generalization across different movement patterns.
The second is Human3.6M, whose data format comprises 32 joints; the paper selects 21 of them and represents the posture using the axis-angle method (D = 3). As supervisory information, a fixed split is used: subjects S5 and S11 serve as the test/validation set, while the remaining 5 subjects are used for training. The model takes 1 second of input and predicts the next 1 second, across a total of 15 action types such as walking and eating (Zand, Etemad & Greenspan, 2023). The sliding-window pairing is sketched below.
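To illustrate the input-output pairing described above, here is a minimal sliding-window sketch in Python with NumPy; the frame counts assume roughly 50 fps and are illustrative, not the paper's exact preprocessing code.

```python
import numpy as np

def make_motion_pairs(sequence, in_frames=50, out_frames=50, step=10):
    """Slice a motion-capture sequence into supervised input-output pairs.

    sequence: (T, J, 3) array of joint positions over T frames.
    Returns a list of (input, target) pairs: 1 second of history as
    input, the following 1 second as the prediction target.
    """
    pairs = []
    window = in_frames + out_frames
    for start in range(0, len(sequence) - window + 1, step):
        x = sequence[start:start + in_frames]          # observed history
        y = sequence[start + in_frames:start + window] # future to predict
        pairs.append((x, y))
    return pairs
```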
3.2.2 Evaluation Indicators
This paper uses MPJPE (Mean Per Joint Position Error) to measure the Euclidean distance (in mm) between the predicted joint positions and the ground-truth values, as sketched below.
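A minimal sketch of the MPJPE computation, assuming NumPy arrays of joint positions in millimetres:

```python
import numpy as np

def mpjpe(pred, target):
    """Mean Per Joint Position Error, sketched.

    pred, target: (T, J, 3) predicted and ground-truth joint positions
    in millimetres. Returns the Euclidean distance between predicted
    and true positions, averaged over all joints and frames.
    """
    return np.linalg.norm(pred - target, axis=-1).mean()
```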
For the evaluation scenario, both short-term and long-term prediction are used. Short-term prediction reports errors at 80 ms, 160 ms, 320 ms, and 400 ms to test local dynamic-modeling capability; long-term prediction uses the error at 1000 ms (1 second) to verify the model's ability to capture the movement trend.
An advantage of supervised learning here is the direct optimization objective: MPJPE clearly quantifies the deviation between predicted and actual values, and backpropagation can precisely adjust the model parameters. It also requires no complex reward design.
3.2.3 Experimental Result Analysis
First are the comparative experiments. In the baseline comparison, ResChunk is compared with other supervised learning methods (such as Seq2Seq, MSR-GCN, and SGSN), all on the same datasets and the MPJPE metric. The key results are as follows: for the 80 ms prediction error on the "running" action of CMU Mocap, ResChunk's 12.35 mm is lower than MSR-GCN's 12.84 mm; for the 1000 ms error on the "jump" action, ResChunk's 121.38 mm is still lower than MSR-GCN's 124.79 mm; and for the average 1000 ms error on Human3.6M, ResChunk's 81.95 mm is close to the optimal value (SGSN, 81.13 mm).
In the 1000 ms long-term prediction, ResChunk's error was significantly lower than that of traditional methods (for the "jumping" action in the CMU dataset, 121.38 mm vs. Seq2Seq's 162.84 mm). This indicates that supervised learning, by directly optimizing MPJPE, can effectively model the spatio-temporal continuity of motion sequences and avoid the pattern collapse common in reinforcement learning (where predictions degenerate into average postures).
4 CHALLENGES AND
PROSPECTS
4.1 Challenges
Regarding annotation dependency and cost, the performance of SL depends heavily on large-scale annotated data (such as ImageNet and Human3.6M). Annotation is expensive and difficult to scale to niche fields such as healthcare and industrial inspection. Moreover, SL models are prone to overfitting the labeled data distribution and show poor robustness against adversarial samples (ImageNet-A), occlusion, and distribution shift (ImageNet-C), where MAE's error rate is 7.5% lower. Furthermore, SL must process the entire input (such as 100% of the image patches for ViT-L), whereas MAE reduces FLOPs by a factor of 3.3 through its masking mechanism (25% visible patches), highlighting SL's computational-efficiency disadvantage.
On the SSL side, the pixel-reconstruction objective of MAE may not relate directly to high-level semantics (such as fine-grained classification), so more complex pre-training tasks (such as combinations with contrastive learning) need to be designed. Although MAE can be scaled to ViT-H, it requires hyperparameter tuning (such as the mask rate and loss weights), and training stability still relies on engineering techniques such as gradient clipping. MAE's masking strategies for sequential and 3D data such as videos and point clouds are not yet mature, whereas SL can be adapted directly through labeling (e.g., 3D human pose estimation).
4.2 Future Outlook
Regarding SSL + SL collaborative training, combining the general representation ability of MAE with the semantic-alignment advantage of SL (such as MAE pre-training followed by SL fine-tuning) can achieve higher accuracy in limited-sample scenarios. Regarding dynamic masking strategies, MAE can explore adaptive masking rates (such as adjusting to the complexity of the image) to further enhance pre-training efficiency. Regarding robustness enhancement, adversarial training or diffusion models can improve MAE's robustness to noise and occlusion, narrowing the generalization gap to human vision. Regarding a cross-modal general model, MAE can be extended to videos (spatio-temporal masking) and text-image pairs (multimodal reconstruction) to build a unified self-supervised pre-training framework. Regarding low-cost labeling alternatives, MAE can generate pseudo-labels to assist SL training (as in semi-supervised learning), reducing reliance on manual labeling; a cross-modal evaluation suite covering distribution shifts and adversarial attacks (such as ImageNet-C for SSL + SL) can also be established.
5 CONCLUSIONS
Supervised learning and self-supervised learning are
two core paradigms of machine learning. Supervised
learning relies on labeled data to directly optimize the
task objective, suitable for high-precision scenarios
but with high data costs; self-supervised learning uses
designed auxiliary tasks to pre-train with unlabeled
data, reducing the reliance on labeling but requiring
fine-tuning. The former performs well in
deterministic tasks, while the latter is good at learning
general features. The current trend is to combine the
advantages of both, first extracting general features
through self-supervised pre-training, and then fine-
tuning specific tasks with supervised learning to
achieve efficient and high-performance AI systems.
This hybrid paradigm is demonstrating strong
potential in fields such as healthcare and autonomous
driving.
Future research can explore the integration of SSL
and SL (such as self-supervised pre-training +
supervised fine-tuning) to balance efficiency and
accuracy; at the same time, improving the training
stability of RL (such as combining imitation learning)
may enable it to play a greater role in long-term
sequence tasks. Moreover, dynamic masking
strategies, cross-modal pre-training, and other
directions are expected to further enhance the
generalization ability of SSL. This study provides a
theoretical basis for the selection of different learning
paradigms, especially in scenarios with limited data
or high robustness requirements, where SSL
demonstrates stronger application potential.
REFERENCES
Anjiali, V., & Bappaditva, J. (2025). Aim of Computer
Vision in Image Processing. International Journal of
Scientific Research in Science and Technology, 12(1),
114–116.
Blaire, L., Oakleigh, H., Brycen, C., Kaitlynn, L., & Nadine,
J. (2025). Understanding Self-Supervised Learning and
Its Future Directions. hal-04964098.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2111.06377
Hu, H., Wang, X., Zhang, Y., Chen, Q., & Guan, Q. (2024).
A comprehensive survey on contrastive learning.
Neurocomputing, 610, 128645.
Kong, X. R. (2019). Overview of Machine Learning. Electronic Production, (24), 82-84+38.
Kotappa, Y. G., Krushika, M., Ravichandra, M. et al.
(2022). A Review Paper on Computer Vision and
Image Processing. International Journal of Advanced
Research in Science, Communication and Technology,
68–72.
Morandín-Ahuerma, F. (2022). What is Artificial
Intelligence? International Journal of Research
Publication and Reviews, 03(12), 1947–1951.
Nkemdilim, M. N., Uzoamaka, P. R., Daniel, U., & Chidi,
M. K. (2024). An Overview of Supervised Machine
Learning Paradigms and their Classifiers. International
Journal of Advanced Engineering, Management and
Science, 10(3), 24–32.
Wiley, V., & Lucas, T. (2018). Computer Vision and Image
Processing: A Paper Review. International Journal of
Artificial Intelligence Research, 2(1), 22.
Zand, M., Etemad, A., & Greenspan, M. (2023). Multiscale
Residual Learning of Graph Convolutional Sequence
Chunks for Human Motion Prediction (Version 1).
arXiv. https://doi.org/10.48550/ARXIV.2308.16801