In fact, very few existing articles provide even a preliminary algorithmic summary of the field of computer vision. Therefore, this article gives a relatively basic introduction to, and comparison of, these two common learning methods that are currently in use or under active research, and draws on several representative case-study models to illustrate them. In this way, both researchers and readers who simply wish to acquire the knowledge can grasp the basic concepts of these learning methods and their applications in artificial intelligence, and the readers' need to understand the related material is met.
2 OVERVIEW OF MAINSTREAM
TECHNOLOGIES
2.1 Self-Supervised Learning
Self-supervised learning (SSL) is a method that enables models to learn rich and generalizable representations from unlabeled data by solving predefined pretext tasks. Unlike traditional supervised learning, which requires a large amount of labeled data for training or extensive imitation of human behavior to achieve its goal, self-supervised learning exploits the inherent structure of raw data to generate pseudo-labels, eliminating the need for manual labeling. Essentially, self-supervised learning is a learning method that works without external supervision. The following are the core goals of self-supervised learning (Blaire et al., 2025).
First, regarding representation learning, the goal of SSL is to acquire reliable, transferable feature representations that apply across applications and datasets. Second, regarding data efficiency, by using large volumes of unlabeled data, SSL increases the scalability of learning and reduces reliance on manually labeled datasets. Third, regarding generalization, self-supervised representations capture domain-invariant and task-agnostic properties, which allows efficient generalization to other domains. Fourth, regarding annotation cost, SSL techniques significantly reduce the expense and labor associated with data annotation because they do not require explicit labels.
In self-supervised learning, the model is trained on supervisory signals generated automatically from unlabeled data, thereby avoiding the cost of manual annotation. Since the core of this article is to compare the advantages and disadvantages of different algorithms in computer vision, we will not elaborate on each paradigm in detail; instead, we will only briefly introduce the core idea of each one. The paper will then select representative paradigms from among them to illustrate the characteristics of self-supervised learning. The following are five commonly used self-supervised learning paradigms and their typical methods.
2.1.1 Contrastive Learning
The core idea is to learn representations by pulling similar samples (positive pairs) closer together and pushing dissimilar samples (negative pairs) apart. There are three key methods. SimCLR generates positive pairs through data augmentation, takes its negative samples from the other samples in the same batch, and uses the InfoNCE loss. MoCo (Momentum Contrast) maintains a dynamically updated queue as the negative-sample library and enhances consistency through a momentum encoder. CLIP (cross-modal contrastive learning) aligns the embeddings of images and texts and is used for multimodal tasks. The advantage of contrastive learning lies in its strong adaptability to data augmentation, making it suitable for fields such as vision and speech. Its disadvantage is that it depends on a large number of negative samples, which incurs a high computational cost (Hu et al., 2024).
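
To make the contrastive objective more concrete, the following is a minimal sketch of a SimCLR-style InfoNCE (NT-Xent) loss in PyTorch. The function name info_nce_loss, the temperature value, and the omission of the projection head are illustrative assumptions rather than the reference implementation.

    import torch
    import torch.nn.functional as F

    def info_nce_loss(z1, z2, temperature=0.5):
        # z1, z2: [N, D] embeddings of two augmented views of the same N images.
        # Positive pairs are (z1[i], z2[i]); every other sample in the batch
        # serves as a negative, so no explicit negative mining is needed.
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        z = torch.cat([z1, z2], dim=0)                  # [2N, D]
        sim = z @ z.t() / temperature                   # pairwise cosine similarities
        n = z1.size(0)
        mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(mask, float('-inf'))      # never contrast a sample with itself
        # For row i, the positive sits n positions away in the concatenated batch.
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)

In practice the two inputs would be the encoder outputs of two augmented views of the same image batch, e.g. loss = info_nce_loss(encoder(aug1(x)), encoder(aug2(x))).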
2.1.2 Generative Model
The core idea is to learn representations by reconstructing the input data or predicting missing parts. The typical methods are the Masked Autoencoder (MAE), BERT (in the NLP field), and VAEs/GANs. The Masked Autoencoder (MAE) randomly masks the input (such as image patches or text tokens) and trains the model to reconstruct the missing parts; it is widely used with the Vision Transformer. BERT (in the NLP field) predicts masked words through Masked Language Modeling (MLM). VAEs and GANs, as generative models, can also be regarded as a form of self-supervised learning. The advantages of this paradigm are that it does not need negative samples and is suitable for high-dimensional data such as text and images. Its disadvantage is that the reconstruction task may be too simple to learn high-level semantics.
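
As a sketch of the reconstruction idea, the following PyTorch snippet masks a random subset of image patches and computes the pixel reconstruction loss only on the masked positions, in the spirit of MAE. The tiny linear encoder/decoder, the 75% mask ratio, and the choice to zero out masked patches instead of dropping them from the encoder are simplifying assumptions and differ from the original architecture.

    import torch
    import torch.nn as nn

    class TinyMaskedAutoencoder(nn.Module):
        # Minimal MAE-style sketch: reconstruct randomly masked patches.
        def __init__(self, patch_dim=768, hidden_dim=256, mask_ratio=0.75):
            super().__init__()
            self.mask_ratio = mask_ratio
            self.encoder = nn.Sequential(nn.Linear(patch_dim, hidden_dim), nn.GELU())
            self.decoder = nn.Linear(hidden_dim, patch_dim)

        def forward(self, patches):                   # patches: [B, N, patch_dim]
            B, N, _ = patches.shape
            # Randomly choose which patches to hide; the data itself is the pseudo-label.
            mask = torch.rand(B, N, device=patches.device) < self.mask_ratio
            visible = patches * (~mask).unsqueeze(-1).float()  # zero out masked patches
            recon = self.decoder(self.encoder(visible))        # predict all patches
            # As in MAE, the loss is computed only on the masked positions.
            return ((recon - patches) ** 2)[mask].mean()

Training reduces to minimizing this reconstruction loss over batches of patchified, unlabeled images, after which the encoder can be reused as a feature extractor for downstream tasks.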
2.1.3 Based on Prediction Tasks
The core idea is to design auxiliary tasks (pretext
tasks), and utilize the inherent structure of the data as