Large Scale Graph Construction and Label Propagation

Z. Ibrahim¹, A. Bosaghzadeh³,ᵃ and F. Dornaika¹,²,ᵇ

¹ University of the Basque Country UPV/EHU, San Sebastian, Spain
² IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
³ Shahid Rajaee Teacher Training University, Tehran, Iran

ᵃ https://orcid.org/0000-0002-0372-6144
ᵇ https://orcid.org/0000-0001-6581-9680
Corresponding author: F. Dornaika
Keywords: Scalable Graph Construction, Semi-Supervised Learning, Topology Imbalance, Large Scale Databases, Reduced Flexible Manifold Embedding.
Abstract:
Despite the advances in semi-supervised learning methods, these algorithms face three limitations. The first
is the assumption of pre-constructed graphs and the second is their inability to process large databases. The
third limitation is that these methods ignore the topological imbalance of the data in a graph. In this paper,
we address these limitations and propose a new approach called Weighted Simultaneous Graph Construction
and Reduced Flexible Manifold Embedding (W-SGRFME). To overcome the first limitation, we construct the
affinity graph using an automatic algorithm within the learning process. The second limitation concerns the
ability of the model to handle a large number of unlabeled samples. To this end, the anchors are included in the
algorithm as data representatives, and an inductive algorithm is used to estimate the labeling of a large number
of unseen samples. To address the topological imbalance of the data samples, we introduce the ReNode
method to assign weights to the labeled samples. We evaluate the effectiveness of the proposed method through
experimental results on two large datasets commonly used in semi-supervised learning: Covtype and MNIST.
The results demonstrate the superiority of the W-SGRFME method over two recently proposed models and
emphasize its effectiveness in dealing with large datasets.
1 INTRODUCTION
In recent years, various machine learning systems
have integrated supervised learning methods, demon-
strating impressive outcomes across diverse tasks and
domains. However, the reliance of these methods on
substantial amounts of labeled data introduces signifi-
cant human involvement in the modeling process and
potentially high costs for data annotation. Address-
ing these challenges, graph-based semi-supervised
learning (GSSL) serves as a theoretical framework,
capitalizing on insights derived from unlabeled data.
This approach involves a dataset and a graph illus-
trating connections between labeled and unlabeled
elements. GSSL operates under two key assump-
tions: the clustering assumption, which pertains to the
data’s nature, and the manifold assumption, which re-
lates to its spatial distribution (Belkin et al., 2006).
The majority of graph-based semi-supervised learn-
ing (GSSL) approaches rely on a pre-existing graph,
treating graph construction and label propagation as
distinct tasks (Qiu et al., 2019; Sindhwani and Niyogi,
2005; Song et al., 2022). For instance, in the work
of (Bosaghzadeh et al., 2013), an adaptive KNN al-
gorithm is employed to establish the graph, while
in (Wang et al., 2010), the affinity matrix is formed
based on a data representation algorithm. However, in
more recent methodologies, these two tasks are inte-
grated to simultaneously create the graph and predict
labels (Tu et al., 2022; Wang et al., 2022; Wu et al.,
2019).
An additional significant concern pertains to the
issue of imbalanced data. While previous studies
have primarily addressed imbalances arising from un-
evenly distributed labeled examples across classes
(set imbalance) (Chen et al., 2021), we posit that
graph data introduce a unique form of imbalance due
to the asymmetric topological characteristics of la-
beled nodes. Specifically, labeled nodes differ in their
structural roles within the graph (topology imbalance)
(Chen et al., 2021). This phenomenon has been ex-
plored within the realm of data analysis, particularly
in the field of topological data analysis (TDA). The
mapper algorithm stands out as a prominent approach
in this domain, and various algorithms, such as the
fuzzy mapper algorithm (Bui et al., 2020) and Shape
Fuzzy C-Means (SFCM) (Bui et al., 2021), have been
proposed to address this aspect of topology imbal-
ance.
In tasks involving label propagation, labeled sam-
ples situated near the decision boundaries between
different classes are more prone to generating con-
flicts in information. Conversely, labeled samples po-
sitioned farther away from these boundaries do not
encounter such conflict issues (Chen et al., 2021;
Chen et al., 2019).
A significant limitation of Graph-Based Semi-
Supervised Learning (GSSL) is the scalability issue
(Collobert et al., 2006; Zhu and Lafferty, 2005). De-
spite notable advancements in semi-supervised meth-
ods, particularly for smaller datasets, many of these
approaches struggle to scale effectively when con-
fronted with large, unlabeled datasets common in
practical applications (Sindhwani et al., 2005; Wang
et al., 2019). The challenges in scalability primar-
ily manifest in the graph generation and label evalua-
tion phases of graph-based SSL solutions (Long et al., 2019; Qiu et al., 2019; Song et al., 2022).
Another challenge with Semi-Supervised Learn-
ing (SSL) methods lies in predicting the labels of test
samples. Transductive methods require the repetition
of the whole procedure, including graph construction
and label estimation, to predict labels for unseen test
samples. Conversely, inductive approaches define a
projection that maps test samples from the feature
space to the label space, enabling reliable label es-
timation for test samples (Qiu et al., 2019; Sindhwani
et al., 2005).
This article introduces the W-SGRFME model, an
inductive semi-supervised framework that addresses
challenges associated with large datasets through the
incorporation of anchor points. Additionally, it tack-
les topological imbalance in the data by assigning
weights to labeled nodes. The model can simulta-
neously predict the projection matrix, anchor affinity
matrix, and labels for unlabeled data. Furthermore, it
offers a method for estimating the labels of test sam-
ples through a linear transformation. The contribu-
tions of this work include:
• Expanding the idea of graph topology imbalance to large data sets.
• Incorporating weights of labeled samples into the unified scalable semi-supervised model.
• Showing the effectiveness of the proposed method through experimental results on two large datasets in the context of semi-supervised learning.
The subsequent sections of the paper are struc-
tured as follows: Section 2 provides an overview of
Graph-Based Semi-Supervised Learning (GSSL) ap-
proaches. Section 3 explains some fundamental con-
cepts along with the proposed algorithm. The exper-
imental results of the method are outlined in Section
4, and the paper concludes with Section 5.
2 RELATED WORK
The use of prefabricated graphs poses a significant
challenge in GSSL algorithms, as highlighted by (Cui
et al., 2018; Dornaika et al., 2021; Hamilton et al.,
2017; Nie et al., 2010). Prefabricated graphs can be impractical, especially for large datasets, and may contain inappropriate connections. This issue becomes
pronounced with very large datasets due to compu-
tational impracticality and memory space concerns,
given the quadratic scaling of the graph matrix with
the number of nodes.
Recent research emphasizes the interconnected
nature of graph building and learning tasks, advocat-
ing for their simultaneous consideration (Kang et al.,
2021; Nie et al., 2017; Yuan et al., 2021).
Weighting labeled samples has shown promise in improving classification accuracy (Aromal et al., 2021; Chen et al., 2021), suggesting that assigning lower weights to samples near class boundaries
is beneficial. Combining the node effect shift phe-
nomenon with label propagation, (Chen et al., 2021)
presents a unified approach to analyze quantitative
and topological imbalance problems. The ReNode
method (Chen et al., 2021) flexibly reweights the ef-
fects of labeled nodes based on their positions relative
to class boundaries, providing a model-neutral solu-
tion.
This paper presents an algorithm that combines la-
bel transfer and graph generation into a unified op-
eration. The incorporation of labels throughout the
process of graph generation contributes to a more
comprehensive evaluation of data diversity. The pro-
posed approach is inductive, capable of handling large
amounts of previously unseen data, and scalable,
managing extensive training databases using anchors.
Furthermore, assigning different weights to labeled
nodes using the ReNode algorithm enhances the ro-
bustness of the proposed model.
3 PROPOSED METHOD
In any semi-supervised classification problem, we have $N$ data samples in a data matrix $X = \{X_l, X_u\} = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_{l+u}\} \in \mathbb{R}^{d \times N}$, where $l$, $u$, and $N = l + u$ are the numbers of labeled, unlabeled, and total training samples, respectively, and $d$ is the dimensionality of each sample. For each labeled sample $x_i$, there exists a label vector $y_i$, where $y_{ij} = 1$ if sample $x_i$ belongs to the $j$-th class and $0$ otherwise. Consequently, we have a label matrix for the training data $Y = [Y_l, Y_u] = [y_1; y_2; \ldots; y_l; y_{l+1}; \ldots; y_{l+u}] \in \mathbb{R}^{N \times C}$, where $C$ is the number of classes. Moreover, there is the soft-label matrix $F = [f_1; f_2; \ldots; f_N] \in \mathbb{R}^{N \times C}$, where $F_{ij}$ gives the probability that sample $x_i$ belongs to the $j$-th class.
Also, we have a similarity graph $W \in \mathbb{R}^{N \times N}$, where $W_{ij}$ gives the similarity between $x_i$ and $x_j$. Furthermore, the Laplacian matrix is defined as $L = D - W$, $D$ being the degree matrix of the graph. Moreover, we have the matrix of anchors $Z = [z_1, z_2, \ldots, z_m] \in \mathbb{R}^{d \times m}$, where $m$ is the number of anchors and $m \ll N$. The affinity matrix $B \in \mathbb{R}^{N \times m}$ represents the similarity between the anchors and the training samples. It is worth noting that, using the labels of the anchors (i.e., $A \in \mathbb{R}^{m \times C}$), one can predict the labels of the training data using Eq. (1):

$$F = B A \qquad (1)$$
3.1 Brief Description of SGRFME
The reduced FME (r-FME) technique (Qiu et al., 2019) was proposed to overcome the main limitation of the original FME algorithm, namely its inability to handle large databases. r-FME solves this problem by using anchors that serve as representative samples for a collection of nodes (Qiu et al., 2019).

However, the r-FME method treats graph construction and label propagation as two separate tasks. Hence, in (Ibrahim et al., 2023), we proposed the Simultaneous Graph construction and Reduced FME (SGRFME) method, which jointly estimates the r-FME unknowns and the anchor-to-anchor graph similarity matrix. In other words, SGRFME simultaneously estimates the anchor-to-anchor graph matrix and the r-FME model variables (i.e., the soft-label matrix, the projection matrix, and the bias vector). Hence, the anchor-to-anchor graph is not fixed a priori as in r-FME (Qiu et al., 2019). Moreover, it uses both the features of the anchors and the online predicted labels of the unlabeled samples.

One of the main issues with the SGRFME method is that it treats all samples equally and does not consider the topological importance of the nodes. The main obstacle in solving the topology imbalance problem is how to evaluate the relative topological position of a labeled node with respect to its class.
3.2 Node Weighting in Large Scale Databases
The ReNode algorithm (Chen et al., 2021) was proposed to exploit node topology information and calculate weights for the available labeled samples. The calculated weights are based on the Topology Relative Location measure (Totoro). The algorithm derives the weight of each labeled sample from the available labels and the graph topology seen by that sample. These weights are then used in the loss function of some deep semi-supervised classifiers.
However, the ReNode algorithm requires an $N \times N$ affinity graph, which is not feasible to construct for large-scale databases due to memory limitations. Hence, our solution is to adapt it to large-scale databases using anchor nodes. Instead of using the whole training database to calculate the weights of the labeled nodes, we use the anchors as representatives of the unlabeled data (which constitute a large portion of the training data).
First, we select $m$ anchors from the unlabeled data (i.e., $Z \in \mathbb{R}^{d \times m}$). We place these anchors, which serve as representatives of the unlabeled data, alongside the labeled data and construct a new data matrix $X_T = [X_l, Z] \in \mathbb{R}^{d \times (l+m)}$. Then, we build the affinity matrix $O \in \mathbb{R}^{(l+m) \times (l+m)}$, which encodes the similarity between the $l + m$ samples, where $l + m \ll N$. It is worth mentioning that the affinity matrix $O$ can be efficiently computed using any graph construction method. In this paper, we use the well-known KNN method with $K = 10$ to find the neighbors of each node, and the similarity between nodes is calculated using the Gaussian function.
We then feed this affinity matrix into the ReNode algorithm and calculate the weights of the labeled samples (i.e., $w_k$, $k = 1, \ldots, l$). The calculated weights reflect the topological location and importance of the labeled samples.
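To make this concrete, here is a minimal Python sketch of the anchor-augmented affinity construction, assuming scikit-learn is available; `renode_weights` stands for the ReNode/Totoro computation of (Chen et al., 2021), which we treat as a black box, and all function names and defaults here are illustrative rather than taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def anchor_affinity(X_l, X_u, m=1000, K=10, sigma=1.0):
    """Build the (l+m) x (l+m) affinity matrix O of Section 3.2.

    X_l: labeled samples, shape (l, d); X_u: unlabeled samples, shape (u, d).
    Anchors are the m K-means centroids of the unlabeled data.
    """
    Z = KMeans(n_clusters=m, n_init=10).fit(X_u).cluster_centers_  # anchors
    X_T = np.vstack([X_l, Z])                   # X_T = [X_l, Z], shape (l+m, d)
    n = X_T.shape[0]
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X_T)
    dist, idx = nn.kneighbors(X_T)              # K+1 since each point is its own neighbor
    O = np.zeros((n, n))
    for i in range(n):
        for d_ij, j in zip(dist[i, 1:], idx[i, 1:]):
            O[i, j] = np.exp(-d_ij**2 / (2 * sigma**2))  # Gaussian similarity
    O = np.maximum(O, O.T)                      # symmetrize the KNN graph
    return O, Z
```

The labeled-sample weights would then be obtained as `w = renode_weights(O, y_l)`, where the first $l$ rows of `O` correspond to the labeled samples.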
3.3 Proposed Weighted SGRFME
As explained before, the SGRFME algorithm (Ibrahim et al., 2023) has two drawbacks: first, it treats the anchors equally and, second, it does not weight the labeled samples.
The objective function of SGRFME is

$$\min_{A,Q,b,S} \; \mathrm{Tr}(A^{T} L A) + \lambda \, \mathrm{Tr}(Z L Z^{T}) + \mathrm{Tr}\big((BA - Y)^{T} U (BA - Y)\big) + \frac{\rho}{2}\,\|S\|_{2}^{2} + \mu\,\big(\|Q\|^{2} + \gamma\,\|Z^{T} Q + \mathbf{1}b^{T} - A\|^{2}\big) \qquad (2)$$
where $S$ is the anchor-to-anchor matrix of the graph and $L$ is the Laplacian matrix of this graph. The first term enforces the smoothness of the anchors' labels, the second term measures the smoothness of the anchors' features, the third term is the error of the weighted label estimate over the labeled samples, the fourth term is an $\ell_2$ regularization of the graph $S$, and the fifth term regularizes the projection matrix and the fitting error of the anchors over the projection matrix (regression error). $\lambda$, $\rho$, $\mu$, and $\gamma$ are balance parameters.
Our solution for inserting the calculated weights (i.e., $w_k$) into the SGRFME objective function is to replace the third term of Eq. (2), the labeling error over the labeled samples, with a weighted label fitting term.
In other words, we extend our previous work (Ibrahim et al., 2023) and present a weighted simultaneous graph construction and reduced flexible manifold embedding model that can assign appropriate weights to nodes and also handle large datasets using anchor points. This method dynamically calculates the weights of the labeled nodes using the ReNode algorithm and then uses those weights to improve the model.
Without loss of generality, we assume that the first $l$ rows of the $B$ matrix correspond to the labeled samples. Hence, the first $l$ diagonal elements of the $U$ matrix hold a fixed value and the rest are zero.
We define a new diagonal matrix $V$, given by Eq. (3), which indicates the importance of the labeled samples:

$$V = \mathrm{diag}(w_1, \ldots, w_l, 0, \ldots, 0) \qquad (3)$$
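As a small illustration, $V$ can be built in sparse form (it is $N \times N$, so a dense diagonal is wasteful at scale); a minimal sketch, assuming the ReNode weights `w` have already been computed:

```python
import numpy as np
from scipy.sparse import diags

def build_V(w, N):
    """Diagonal weight matrix V of Eq. (3): ReNode weights for the
    first l (labeled) entries, zeros for the remaining N - l entries."""
    l = len(w)
    return diags(np.concatenate([np.asarray(w, dtype=float), np.zeros(N - l)]))
```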
Thus, in any minimization problem that aims to recover the unknowns of the model, the important or relevant nodes with large weights will receive more importance than the labeled nodes with low weights. Hence, our proposed objective function becomes

$$\min_{A,Q,b,S} \; \mathrm{Tr}(A^{T} L A) + \lambda \, \mathrm{Tr}(Z L Z^{T}) + \mathrm{Tr}\big((BA - Y)^{T} V (BA - Y)\big) + \frac{\rho}{2}\,\|S\|_{2}^{2} + \mu\,\big(\|Q\|^{2} + \gamma\,\|Z^{T} Q + \mathbf{1}b^{T} - A\|^{2}\big) \qquad (4)$$

where $V$ is the diagonal matrix of the labeled sample weights (Eq. (3)) and the unknown variables are those of SGRFME.

In the next subsection, we explain how to solve the proposed objective function.
3.4 Optimization
The proposed method consists of an initialization step followed by an iterative solution for the unknowns.

In the initialization step, we determine the anchors and calculate the weights of the labeled samples. To determine the anchors, we use the well-known K-means clustering method, compute $m$ centroids, and set them as anchors (i.e., $Z = [z_1, z_2, \ldots, z_m]$). These anchors allow us to estimate the weights of the labeled samples; moreover, we use them in the semi-supervised model described by the objective function (4).
To calculate the weights of the labeled nodes, we follow the procedure explained in Section 3.2. We build a new data matrix $X_T = [X_l, Z]$ by concatenating the labeled samples and the anchors. Then, we construct the affinity matrix of $X_T$ using the well-known KNN method. We then use the algorithm presented in (Wang et al., 2016) to construct the $B$ matrix and compute the initial anchor-to-anchor graph by setting $S = B^{T} B$.
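For illustration, a simple Gaussian-kernel, k-nearest-anchor construction can stand in for the anchor-graph method of (Wang et al., 2016), which we do not reproduce here; the function below is a sketch under that substitution, with names and defaults of our own choosing:

```python
import numpy as np

def init_anchor_graph(X, Z, k=5, sigma=1.0):
    """Initialize B (N x m) and S = B^T B.

    Gaussian-kernel stand-in for the anchor-graph construction of
    (Wang et al., 2016): each sample is connected to its k nearest
    anchors and the rows of B are normalized to sum to one.
    """
    # Squared distances, N x m (fine for a sketch; chunk over samples for large N).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    B = np.zeros_like(d2)
    nearest = np.argsort(d2, axis=1)[:, :k]        # k closest anchors per sample
    rows = np.arange(X.shape[0])[:, None]
    B[rows, nearest] = np.exp(-d2[rows, nearest] / (2 * sigma ** 2))
    B /= B.sum(axis=1, keepdims=True)              # row-stochastic
    S = B.T @ B                                    # initial anchor-to-anchor graph
    return B, S
```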
The next step is to solve the optimization problem introduced in Eq. (4) for the unknowns (i.e., $S$, $A$, $Q$, $b$). Since the proposed objective function has no closed-form solution, we adopt an iterative algorithm: we fix some variables and solve for the others.
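Before detailing the two steps, the following Python sketch summarizes the alternating scheme; `update_A_Q_b` and `update_S_row` refer to the closed-form steps sketched later in this section, and choices such as symmetrizing $S$ before forming its Laplacian are our assumptions rather than details stated in the paper:

```python
import numpy as np

def w_sgrfme(L0, B, V, Y, Z, mu, gamma, lam, K=10, max_iter=10, tol=1e-4):
    """Alternating optimization skeleton for Eq. (4).

    L0 is the Laplacian of the initial anchor graph S = B^T B.
    """
    L, S_prev = L0, None
    for _ in range(max_iter):
        A, Q, b = update_A_Q_b(L, B, V, Y, Z, mu, gamma)        # fix S, solve Eq. (5)
        # g_ij = ||A_i,. - A_j,.||^2 + lam * ||Z_.,i - Z_.,j||^2
        G = (np.square(A[:, None] - A[None, :]).sum(-1)
             + lam * np.square(Z.T[:, None] - Z.T[None, :]).sum(-1))
        S = np.stack([update_S_row(G[i], K) for i in range(G.shape[0])])
        S_sym = (S + S.T) / 2                # symmetrize before the Laplacian (assumption)
        L = np.diag(S_sym.sum(1)) - S_sym
        if S_prev is not None and np.abs(S - S_prev).max() < tol:
            break                            # stop when S stabilizes, or after 10 iterations
        S_prev = S
    return A, Q, b, S
```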
Fix S and estimate A, Q, b

By fixing the $S$ matrix, the objective function reduces to

$$\min_{A,Q,b} \; \mathrm{Tr}(A^{T} L A) + \mathrm{Tr}\big((BA - Y)^{T} V (BA - Y)\big) + \mu\,\big(\|Q\|^{2} + \gamma\,\|Z^{T} Q + \mathbf{1}b^{T} - A\|^{2}\big) \qquad (5)$$

which is similar to the objective function of the r-FME method (Qiu et al., 2019); hence, the solution is similar to that of the r-FME variables:

$$A = \big[L + B^{T} V B + \mu H_a - \mu H_a Z^{T} (Z H_a Z^{T} + \gamma I)^{-1} Z H_a\big]^{-1} B^{T} V Y \qquad (6)$$

$$Q = (Z H_a Z^{T} + \gamma I)^{-1} Z H_a A \qquad (7)$$

$$b = \frac{1}{m}\big(A^{T} \mathbf{1} - Q^{T} Z \mathbf{1}\big) \qquad (8)$$

where $H_a$ is the centering matrix defined in (Qiu et al., 2019).
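As an illustration, here is a direct numpy transcription of the updates (6)-(8); reading $H_a$ as the centering matrix $I - \frac{1}{m}\mathbf{1}\mathbf{1}^{T}$ is our interpretation of (Qiu et al., 2019), and the code is a sketch rather than an optimized implementation:

```python
import numpy as np

def update_A_Q_b(L, B, V, Y, Z, mu, gamma):
    """Closed-form updates of Eqs. (6)-(8) with S (hence L) fixed.

    Shapes: L (m x m), B (N x m), V (N x N, may be sparse), Y (N x C), Z (d x m).
    """
    m, d = Z.shape[1], Z.shape[0]
    Ha = np.eye(m) - np.ones((m, m)) / m             # centering matrix (assumption: as in r-FME)
    ZHa = Z @ Ha                                     # d x m
    inv_core = np.linalg.inv(ZHa @ Z.T + gamma * np.eye(d))
    M = L + B.T @ V @ B + mu * Ha - mu * Ha @ Z.T @ inv_core @ ZHa
    A = np.linalg.solve(M, B.T @ V @ Y)              # Eq. (6)
    Q = inv_core @ ZHa @ A                           # Eq. (7)
    b = (A.T @ np.ones(m) - Q.T @ Z @ np.ones(m)) / m  # Eq. (8)
    return A, Q, b
```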
Fix A, Q, b and estimate S

In the next step, we fix all the other variables and solve the objective function for $S$. The objective function then reduces to

$$\min_{S} \; \mathrm{Tr}(A^{T} L A) + \lambda\,\mathrm{Tr}(Z L Z^{T}) + \frac{\rho}{2}\,\|S\|_{2}^{2} \qquad (9)$$

From spectral analysis, we have

$$\mathrm{Tr}(A^{T} L A) = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \|A_{i,\cdot} - A_{j,\cdot}\|_{2}^{2}\, S_{ij} \qquad (10)$$

where $A_{i,\cdot}$ is the $i$-th row of the matrix $A$. Thus, by expanding the two trace terms, the minimization problem of Eq. (9) can be written as

$$\min_{S} \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \big(\|A_{i,\cdot} - A_{j,\cdot}\|_{2}^{2} + \lambda\,\|Z_{\cdot,i} - Z_{\cdot,j}\|_{2}^{2}\big)\, S_{ij} + \frac{\rho}{2}\sum_{i=1}^{m}\|s_{i}\|_{2}^{2} \qquad (11)$$

$$\Longleftrightarrow \quad \min_{S} \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} g_{ij}\, S_{ij} + \frac{\rho}{2}\sum_{i=1}^{m}\|s_{i}\|_{2}^{2}$$

where $g_{ij} = \|A_{i,\cdot} - A_{j,\cdot}\|_{2}^{2} + \lambda\,\|Z_{\cdot,i} - Z_{\cdot,j}\|_{2}^{2}$ and $Z_{\cdot,i}$ is the $i$-th column of the matrix $Z$.
Eq. (11) can be subdivided into $m$ sub-problems ($i = 1, \ldots, m$), each of which estimates one row $s_{i}$ of the similarity matrix:

$$\min_{s_{i}} \; \sum_{j=1}^{m} g_{ij}\, S_{ij} + \rho\,\|s_{i}\|_{2}^{2}, \qquad i = 1, \ldots, m \qquad (12)$$
A closed-form solution of Eq. (12) was introduced in (Nie et al., 2016; Nie et al., 2017) by imposing three constraints on the $m$ sub-problems. First, the solution is non-negative (i.e., $s_{ij} \geq 0$). Second, each row sums to one ($\sum_{j=1}^{m} s_{ij} = 1$). Third, the optimal solution $s_{i}$ has exactly $K$ nonzero values, with $K = 10$.
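For illustration, here is a sketch of this row-wise closed-form update in the style of the adaptive-neighbor solutions of (Nie et al., 2016; Nie et al., 2017); absorbing $\rho$ into the K-sparse simplex projection is our simplification rather than the exact derivation of the cited works:

```python
import numpy as np

def update_S_row(g_i, K=10):
    """Closed-form update of one row s_i of S (cf. Eq. (12)).

    Keep the K smallest-cost anchors and project onto the simplex,
    so that s_i >= 0, sum(s_i) = 1, and s_i has exactly K nonzeros.
    """
    m = len(g_i)
    order = np.argsort(g_i)                  # anchors by increasing cost g_ij
    g_sorted = g_i[order]
    num = g_sorted[K] - g_sorted[:K]         # d_{K+1} - d_j >= 0
    den = K * g_sorted[K] - g_sorted[:K].sum()
    s = np.zeros(m)
    s[order[:K]] = num / max(den, 1e-12)     # guard against a degenerate denominator
    return s
```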
This completes one iteration of the algorithm. We repeat the two steps, computing $A$, $Q$, $b$ and then $S$, until the difference between the $S$ matrices of two subsequent iterations falls below a threshold, or until 10 iterations are reached.

After convergence, it remains to estimate the labels of the unlabeled samples in the training set and of the test samples. For the unlabeled samples, one can use Eq. (1); for a test sample $x_{test}$, the soft labels are estimated using

$$f = Q^{T} x_{test} + b \qquad (13)$$

For the estimated soft-label vector $f$, the class label is obtained by

$$c = \arg\max_{k}\, f_{k}, \qquad k = 1, \ldots, C \qquad (14)$$
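Putting the inference rules together, a minimal sketch of label prediction after convergence (function and variable names are ours):

```python
import numpy as np

def predict(Q, b, B, A, X_test):
    """Label inference after convergence.

    Unlabeled training samples: Eq. (1), F = B A.
    Unseen test samples: Eqs. (13)-(14), f = Q^T x + b.
    """
    F_train = B @ A                       # soft labels of the training samples, Eq. (1)
    y_train = F_train.argmax(axis=1)      # hard labels, Eq. (14)
    F_test = X_test @ Q + b               # Eq. (13) applied row-wise (X_test: n x d)
    y_test = F_test.argmax(axis=1)
    return y_train, y_test
```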
4 EXPERIMENTAL RESULTS
In this section, we evaluate the performance of the proposed method against recently proposed methods. For this purpose, two algorithms are adopted, namely r-FME (Qiu et al., 2019) and SGRFME (Ibrahim et al., 2023).

As databases, we select two large-scale databases, namely Covtype and MNIST.
MNIST. This database has 60,000 images; we randomly select 1,000 samples from each class for training and use the rest for testing. We feed the images into the ResNet-50 network (He et al., 2015) and extract the activations of the average-pooling layer as the image descriptor, which forms a 2048-dimensional vector.
Covtype. This database (http://archive.ics.uci.edu/ml/datasets/Covertype) contains the forest cover type of 30 x 30 meter cells, obtained from US Forest Service data. It contains 581,012 instances and 54 attributes. We randomly selected 80% of the data for training and the remaining 20% for testing. To reduce the dimensionality of the data, we applied PCA and kept the top 50 dimensions.
For both databases, $o$ samples of each class in the training set are selected as labeled and the rest as unlabeled. With $C$ classes, this gives $l = o \times C$. To reduce the dependency of the results on a specific set of labeled data, we created 20 random splits of labeled and unlabeled sets and report the average results. We use Matlab R2018a on a PC with an i9-7960 CPU @ 2.80 GHz and 128 GB of RAM.
4.1 Parameter Evaluation
The proposed method has several parameters, i.e., $\mu$, $\gamma$, $\lambda$, $\rho$, $\alpha$, and $w_{max}$. The values adopted for these parameters can have a high impact on the accuracy of the proposed method, hence they should be selected carefully. To evaluate the effect of these parameters, we take five labeled samples per class from the MNIST database, vary the parameters, and report the accuracy.
In the first experiment, we evaluate the effect of the parameter $\alpha$. Fig. 1 shows the variation of accuracy with respect to $\alpha$. As we observe, the accuracy on the test samples is relatively stable as $\alpha$ varies, whereas the accuracy on the unlabeled data has a higher variance. The accuracy reaches its best values when $\alpha$ is 0.7.
In the second experiment, the effect of $w_{max}$ is evaluated. Fig. 2 shows the effect of varying this parameter on the accuracy.
Figure 1: Performance of the proposed method versus the α parameter on the MNIST database with five labeled samples.
Figure 2: Performance of the proposed method versus the w_max parameter on the MNIST database with five labeled samples.
The accuracy has its highest value when $w_{max}$ is 300. Moreover, we observe that the accuracy on the unlabeled and test sets does not follow a consistent trend, especially when $w_{max}$ is lower than 250.

Thirdly, we evaluate the effect of varying the number of anchors. In Fig. 3, we plot the accuracy as the number of anchors varies over {10, 20, 50, 100, 200, 500, 1000, 2000}. As we can see, the accuracy increases with the number of anchors; however, it drops suddenly when more than 1000 anchors are used.
Figure 3: Performance of the proposed method versus the number of anchors on the MNIST database with five labeled samples.
4.2 Comparison with Other Methods
For comparison, we use two recently proposed methods, r-FME (Qiu et al., 2019) and SGRFME (Ibrahim et al., 2023). Since we are in a semi-supervised context, the training data contain both labeled and unlabeled samples. We set the number of labeled samples per class to 30, 50, and 70 for the Covtype database and to 5 and 20 for MNIST.
In these experiments, we fixed $w_{min}$ to one, $\alpha$ to 0.7, and $w_{max}$ to 300. For the diagonal matrix $V$ (Eq. (3)), we set the first $l$ diagonal elements to the weights obtained from the Totoro values (i.e., $w_k$) and the rest to zero. For $\mu$, $\gamma$, $\lambda$, and $\rho$, we select one split of labeled and unlabeled data and scan the parameters to find the best combinations. We then fix the obtained parameters for the rest of the experiments.
Table 1 shows the average and standard deviation over 20 random combinations of labeled and unlabeled samples on the Covtype and MNIST databases; the highest accuracy in each setting is achieved by the proposed W-SGRFME. As we observe, the average accuracy of the proposed method is higher than that of the competing methods, including SGRFME, which shows the effect of adding adaptive weighting to the SGRFME algorithm. The standard deviation of the proposed method is also lower, indicating a more stable accuracy. We observe this behavior on both databases, which shows that the outperformance of the proposed method is not biased toward a specific database. Furthermore, even though the training database is the same, increasing the number of labeled samples increases the accuracy of all methods; the proposed method maintains its advantage across the different numbers of labeled samples.
4.3 Confusion Matrix
We calculate the confusion matrix of the proposed method. We select one split of the MNIST database with 5 labeled samples per class and estimate the labels of the unlabeled set and the test set. We then report the confusion matrices along with the precision, recall, and F1 score of each class in Table 2. Moreover, we report the macro precision, macro recall, and macro F1 score, which are calculated by averaging the respective metrics across all classes.
5 CONCLUSION
In this article, we tackled the problem of topology imbalance in graphs. We adopted the ReNode technique to assign weights to the labeled samples. To do so,
Table 1: Average accuracy with standard deviation for the proposed method and two recent methods (r-FME and SGRFME), obtained on 20 random combinations of labeled and unlabeled samples. The tuple under each result gives the selected parameter combination as powers of 10: (µ, γ) for r-FME and (µ, γ, λ, ρ) for SGRFME and W-SGRFME.

Covtype

r-FME (Qiu et al., 2019)
  30 labeled: Unlabeled 47.70 ± 3.20 (10^15, 10^0)   | Test 45.88 ± 3.87 (10^9, 10^3)
  50 labeled: Unlabeled 49.54 ± 1.78 (10^24, 10^6)   | Test 50.01 ± 3.14 (10^9, 10^3)
  70 labeled: Unlabeled 51.89 ± 2.08 (10^9, 10^3)    | Test 53.36 ± 2.74 (10^9, 10^3)

SGRFME (Ibrahim et al., 2023)
  30 labeled: Unlabeled 51.00 ± 2.02 (10^24, 10^12, 10^18, 10^12) | Test 49.62 ± 2.39 (10^24, 10^12, 10^12, 10^18)
  50 labeled: Unlabeled 52.39 ± 1.82 (10^12, 10^24, 10^24, 10^12) | Test 52.23 ± 1.97 (10^6, 10^3, 10^12, 10^0)
  70 labeled: Unlabeled 54.62 ± 0.95 (10^18, 10^6, 10^3, 10^24)   | Test 54.11 ± 1.24 (10^6, 10^3, 10^12, 10^24)

W-SGRFME
  30 labeled: Unlabeled 52.70 ± 2.58 (10^24, 10^12, 10^12, 10^18) | Test 52.91 ± 2.26 (10^24, 10^12, 10^12, 10^18)
  50 labeled: Unlabeled 53.38 ± 0.84 (10^12, 10^24, 10^24, 10^12) | Test 53.41 ± 1.74 (10^6, 10^3, 10^12, 10^0)
  70 labeled: Unlabeled 55.16 ± 0.76 (10^18, 10^6, 10^3, 10^24)   | Test 54.77 ± 0.66 (10^6, 10^3, 10^12, 10^24)

MNIST

r-FME (Qiu et al., 2019)
  5 labeled:  Unlabeled 64.47 ± 2.24 (10^21, 10^12)  | Test 57.82 ± 4.31 (10^0, 10^3)
  20 labeled: Unlabeled 70.39 ± 1.05 (10^21, 10^12)  | Test 67.61 ± 3.07 (10^9, 10^0)

SGRFME (Ibrahim et al., 2023)
  5 labeled:  Unlabeled 65.29 ± 1.82 (10^9, 10^0, 10^24, 10^15)  | Test 58.27 ± 4.42 (10^24, 10^3, 10^12, 10^9)
  20 labeled: Unlabeled 71.22 ± 0.83 (10^9, 10^0, 10^24, 10^15)  | Test 68.08 ± 2.95 (10^24, 10^3, 10^12, 10^9)

W-SGRFME
  5 labeled:  Unlabeled 66.09 ± 1.42 (10^9, 10^0, 10^24, 10^15)  | Test 59.17 ± 4.12 (10^24, 10^3, 10^12, 10^9)
  20 labeled: Unlabeled 71.72 ± 1.01 (10^9, 10^0, 10^24, 10^15)  | Test 69.17 ± 2.89 (10^24, 10^3, 10^12, 10^9)
Table 2: Confusion matrix, precision, recall, and F1 score of the proposed method on the MNIST database (5 labeled samples). Rows give the actual class; the first ten columns give the counts per predicted class (1-10), followed by precision, recall, and F1.

Unlabeled samples

Class 1 (Actual):   846    0    6    4    0    1   33    2   14   59 | 0.85 0.87 0.86
Class 2 (Actual):     0 1050    0    0    2    0    3   28    0    2 | 0.93 0.96 0.95
Class 3 (Actual):     9    1  547   95   15  146   40   22   74  120 | 0.55 0.51 0.53
Class 4 (Actual):     3    0   48  307    1  170    3    2    7   12 | 0.30 0.55 0.38
Class 5 (Actual):     7   43   92    9  900   31   28  306   14   29 | 0.92 0.61 0.73
Class 6 (Actual):     2    0   71  552    4  498    6    3    8   27 | 0.55 0.42 0.48
Class 7 (Actual):    84    0  101   18    2   34  779    9   71  114 | 0.79 0.64 0.70
Class 8 (Actual):     3   29   28    0   25    5   13  661    2   11 | 0.63 0.85 0.72
Class 9 (Actual):    10    0   24   18    0    3    7    0  645   50 | 0.66 0.85 0.74
Class 10 (Actual):   23    1   76   19   25   16   74   11  140  568 | 0.57 0.59 0.58
Macro: 0.69 0.68 0.67

Test samples

Class 1 (Actual):  4679    1   60   21   14   40  805   71   18  581 | 0.94 0.74 0.83
Class 2 (Actual):     7 4981   37    0   41   19  198  884   33   40 | 0.88 0.79 0.84
Class 3 (Actual):     2    4 1184   37   35   63   56   54   21  195 | 0.23 0.71 0.35
Class 4 (Actual):    59    1 1168 4070   56 2827  178   75   80  752 | 0.79 0.43 0.56
Class 5 (Actual):    37  620  683   49 4213   87  175 1072   81  264 | 0.86 0.57 0.69
Class 6 (Actual):    15    0  499  696    7 1092  296   10   24   89 | 0.24 0.40 0.30
Class 7 (Actual):    33    0   67   21    0   39 1920    8    4   31 | 0.38 0.90 0.54
Class 8 (Actual):    16   11  409   27  454  126   82 3035   10  123 | 0.58 0.70 0.63
Class 9 (Actual):    87    0  818  148   36   76 1154   10 4598 1484 | 0.94 0.54 0.69
Class 10 (Actual):    1    0   40   40   12  148   68    2    7 1398 | 0.28 0.81 0.41
Macro: 0.61 0.66 0.58
we used anchors as data representatives and modified the ReNode method to extend the idea of topological imbalance to large-scale databases. We then adapted the label estimation error term to insert these weights into our objective function. Our experimental results on two large databases show that the proposed method has a higher average accuracy and a lower standard deviation than two recently proposed methods, namely r-FME and SGRFME. On the other hand, due to its iterative nature, the proposed method has a higher computational complexity than the competing methods. Moreover, the relatively large number of parameters to be tuned is another weakness of the proposed method. Hence, as future work, we will focus on the automatic tuning of these parameters. Reducing the running time is another track to follow.
REFERENCES
Aromal, A., M. Rasool, A., Dubey, A., and Roy, B. N. (2021). Optimized weighted samples based semi-supervised learning. In 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), pages 1311-1318.
Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399-2434.
Bosaghzadeh, A., Moujahid, A., and Dornaika, F. (2013). Parameterless local discriminant embedding. Neural Processing Letters, 38.
Bui, Q.-T., Vo, B., Do, H.-A. N., Hung, N. Q. V., and Snasel, V. (2020). F-mapper: A fuzzy mapper clustering algorithm. Knowledge-Based Systems, 189:105107.
Bui, Q.-T., Vo, B., Snasel, V., Pedrycz, W., Hong, T.-P., Nguyen, N.-T., and Chen, M.-Y. (2021). SFCM: A fuzzy clustering algorithm of extracting the shape information of data. IEEE Transactions on Fuzzy Systems, 29(1):75-89.
Chen, D., Lin, Y., Zhao, G., Ren, X., Li, P., Zhou, J., and Sun, X. (2021). Topology-imbalance learning for semi-supervised node classification. Advances in Neural Information Processing Systems, 34:29885-29897.
Chen, X., Yu, G.-X., Tan, Q., and Wang, J. (2019). Weighted samples based semi-supervised classification. Applied Soft Computing, 79:46-58.
Collobert, R., Sinz, F., Weston, J., and Bottou, L. (2006). Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687-1712.
Cui, B., Xie, X., Hao, S., Cui, J., and Lu, Y. (2018). Semi-supervised classification of hyperspectral images based on extended label propagation and rolling guidance filtering. Remote Sensing, 10(4).
Dornaika, F., Baradaaji, A., and El Traboulsi, Y. (2021). Semi-supervised classification via simultaneous label and discriminant embedding estimation. Information Sciences, 546:146-165.
Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 1025-1035, Red Hook, NY, USA. Curran Associates Inc.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. CoRR, abs/1512.03385.
Ibrahim, Z., Bosaghzadeh, A., and Dornaika, F. (2023). Joint graph and reduced flexible manifold embedding for scalable semi-supervised learning. Artificial Intelligence Review, 56:9471-9495.
Kang, Z., Peng, C., Cheng, Q., Liu, X., Peng, X., Xu, Z., and Tian, L. (2021). Structured graph learning for clustering and semi-supervised classification. Pattern Recognition, 110:107627.
Long, Y., Li, Y., Wei, S., Zhang, Q., and Yang, C. (2019). Large-scale semi-supervised training in deep learning acoustic model for ASR. IEEE Access, 7:133615-133627.
Nie, F., Cai, G., and Li, X. (2017). Multi-view clustering and semi-supervised classification with adaptive neighbours. In Thirty-First AAAI Conference on Artificial Intelligence.
Nie, F., Wang, X., Jordan, M. I., and Huang, H. (2016). The constrained Laplacian rank algorithm for graph-based clustering. In AAAI Conference on Artificial Intelligence.
Nie, F., Xu, D., Tsang, I. W., and Zhang, C. (2010). Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction. IEEE Transactions on Image Processing, 19(7):1921-1932.
Qiu, S., Nie, F., Xu, X., Qing, C., and Xu, D. (2019). Accelerating flexible manifold embedding for scalable semi-supervised learning. IEEE Transactions on Circuits and Systems for Video Technology, 29(9):2786-2795.
Sindhwani, V. and Niyogi, P. (2005). Linear manifold regularization for large scale semi-supervised learning. In Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data.
Sindhwani, V., Niyogi, P., Belkin, M., and Keerthi, S. (2005). Linear manifold regularization for large scale semi-supervised learning. In Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data.
Song, Z., Yang, X., Xu, Z., and King, I. (2022). Graph-based semi-supervised learning: A comprehensive review. IEEE Transactions on Neural Networks and Learning Systems, pages 1-21.
Tu, E., Wang, Z., Yang, J., and Kasabov, N. (2022). Deep semi-supervised learning via dynamic anchor graph embedding in latent space. Neural Networks, 146:350-360.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. (2010). Locality-constrained linear coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition.
Wang, M., Fu, W., Hao, S., Tao, D., and Wu, X. (2016). Scalable semi-supervised learning by efficient anchor graph regularization. IEEE Transactions on Knowledge and Data Engineering, 28(7):1864-1877.
Wang, Z., Wang, L., Chan, R. H., and Zeng, T. (2019). Large-scale semi-supervised learning via graph structure learning over high-dense points.
Wang, Z., Zhang, L., Wang, R., Nie, F., and Li, X. (2022). Semi-supervised learning via bipartite graph construction with adaptive neighbors. IEEE Transactions on Knowledge and Data Engineering, pages 1-1.
Wu, X., Zhao, L., and Akoglu, L. (2019). A quest for structure: Jointly learning the graph structure and semi-supervised classification.
Yuan, Y., Li, X., Wang, Q., and Nie, F. (2021). A semi-supervised learning algorithm via adaptive Laplacian graph. Neurocomputing, 426:162-173.
Zhu, X. and Lafferty, J. (2005). Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005.