characterize the collective innovative features of the
entire cluster across multiple dimensions.
Considering that nodes may differ in their
structural positions and representativeness within a
cluster, this study introduces an embedding-distance-
based weighting mechanism to linearly aggregate the
individual indicators, thereby constructing a unified
cluster-level indicator system.
To quantify the representativeness of each node
within its cluster, we compute the Euclidean distance
between the node’s embedding—generated by the
graph neural network—and the centroid of its
corresponding cluster in the embedding space. A
smaller distance indicates that the node is closer to the
semantic center of the cluster, implying higher
representativeness and centrality within the
knowledge community.
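The weighting scheme described above can be sketched in a few lines. The inverse-distance form of the weights and the function names are illustrative assumptions; the paper specifies only the Euclidean distance to the cluster centroid and a linear aggregation of indicators:

```python
import numpy as np

def representativeness_weights(embeddings, eps=1e-8):
    """Weight each node by inverse distance to its cluster centroid.

    `embeddings` is an (n_nodes, dim) array of GNN embeddings for the
    nodes of one cluster. Normalizing inverse distances to sum to 1 is
    an illustrative choice, not the paper's exact formula.
    """
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    w = 1.0 / (dists + eps)          # closer to the centroid => larger weight
    return w / w.sum()

def cluster_indicator(node_indicators, weights):
    """Linearly aggregate per-node indicators into cluster-level values."""
    return weights @ node_indicators  # (n_nodes, k) -> (k,)
```

Nodes far from the semantic center (e.g. boundary nodes of the community) thus contribute less to the cluster-level indicator.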
3 EXPERIMENTAL FRAMEWORK
To comprehensively evaluate the effectiveness of the
proposed integrated graph construction mechanism
and the graph neural network-based approach for
identifying innovation frontiers, we conducted
systematic experimental studies on the constructed
academic paper–technical patent integrated citation
network.
3.1 Dataset Overview
The dataset for this study comprises 91,360
academic papers and 92,337 technical patents,
constituting a heterogeneous citation network with
183,697 nodes. The paper data is sourced from the
Web of Science Core Collection, while the patent
data is drawn from the USPTO and EPO databases.
The dataset spans the years 2010 to 2023 and primarily covers disciplines such as biomedicine. During the data preprocessing phase,
we extracted the titles, abstracts, publication years,
and citation relationships of academic papers, as
well as the titles, abstracts, application years, and
citation information of patents. For isolated nodes in
the network, we employed an indirect citation
inference mechanism that combines semantic and
citation relationships to construct new connections
for these nodes, effectively enhancing network
connectivity. The training, validation, and test sets
were randomly split in a ratio of 7:1:2.
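One plausible reading of the indirect citation inference for isolated nodes can be sketched as follows; the cosine-similarity criterion, the threshold, and the top-k candidate limit are assumptions for illustration, not the paper's exact mechanism:

```python
import numpy as np

def link_isolated_nodes(text_emb, adjacency, sim_threshold=0.8, k=3):
    """Attach isolated nodes to semantically similar neighbors.

    `text_emb` holds title/abstract embeddings, `adjacency` is a dense
    0/1 citation matrix. A new edge is proposed from each zero-degree
    node to its most similar candidates above the threshold; threshold
    and k are illustrative values.
    """
    norms = np.linalg.norm(text_emb, axis=1, keepdims=True)
    unit = text_emb / np.clip(norms, 1e-8, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -1.0)                 # ignore self-similarity
    degree = adjacency.sum(axis=1)
    new_edges = []
    for i in np.where(degree == 0)[0]:
        for j in np.argsort(sim[i])[::-1][:k]:  # k most similar candidates
            if sim[i, j] >= sim_threshold:
                new_edges.append((int(i), int(j)))
    return new_edges
```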
Based on the GraphSAGE framework, a three-
layer graph neural network model is constructed. The
embedding dimension is set to 512 to fully capture the semantic and structural information in the network, while the hidden layer size is set to 128 to balance the model's expressive power against computational efficiency. With three layers, the model effectively aggregates information from three-hop neighbors. The number of
sampled neighbors per layer is 25, 20, and 15 nodes,
respectively. In terms of activation functions, the
hidden layer employs the ReLU function, while the
output layer utilizes the Sigmoid function. To
enhance the model's generalization capability,
Dropout regularization is added after each layer, with
a dropout rate set to 0.3.
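A single layer of this architecture can be illustrated with a plain-NumPy mean-aggregation sketch. The weight shapes mirror the stated 512-dimensional inputs and 128-dimensional hidden size, but this forward pass is a simplified stand-in for the actual GraphSAGE implementation, not the authors' code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sage_layer(h, neighbors, W_self, W_neigh, activation=relu):
    """One GraphSAGE layer with mean aggregation (forward pass only).

    `h` is an (n, d_in) feature matrix and `neighbors[i]` lists the
    sampled neighbor ids of node i (e.g. 25 at the first layer). Each
    node combines its own features with the mean of its neighbors'.
    """
    agg = np.stack([
        h[nbrs].mean(axis=0) if len(nbrs) else np.zeros(h.shape[1])
        for nbrs in neighbors
    ])
    return activation(h @ W_self + agg @ W_neigh)
```

Stacking three such layers (with dropout between them, omitted here) yields the stated three-hop receptive field.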
Model training adopts a supervised learning paradigm, casting link prediction as a binary classification task. AdamW is used as the optimizer for its convergence stability, with the learning rate set to 1×10⁻⁴ to ensure stable training on the large-scale network. The loss function employs Mean
Squared Error Loss (MSELoss) to provide a smooth
gradient signal. Training runs for up to 1000 epochs, with an early stopping strategy (patience=50) to prevent overfitting. The batch size is set to
2048 node pairs, and weight decay is set at 1×10⁻⁵ for L2 regularization. The learning rate schedule employs
the ReduceLROnPlateau strategy, reducing the
learning rate to 80% of its current value when the validation set AUC fails to improve for 10
consecutive epochs. The negative sampling strategy
samples one negative edge for every positive edge,
maintaining node type matching to avoid sampling
bias.
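The interaction of the early-stopping and learning-rate schedules can be sketched as a control loop; `run_epoch` and `validate_auc` are hypothetical stand-ins for the actual batch training (MSE loss over 2048 node pairs per batch) and the validation-AUC computation:

```python
def train_loop(run_epoch, validate_auc, num_epochs=1000, patience=50,
               lr=1e-4, lr_patience=10, lr_factor=0.8):
    """Training control flow with ReduceLROnPlateau and early stopping.

    Mirrors the stated settings: decay the learning rate to 80% after
    10 epochs without validation-AUC improvement, and stop entirely
    after 50 such epochs.
    """
    best_auc, since_best, since_drop = 0.0, 0, 0
    for _ in range(num_epochs):
        run_epoch(lr)
        auc = validate_auc()
        if auc > best_auc:
            best_auc, since_best, since_drop = auc, 0, 0
        else:
            since_best += 1
            since_drop += 1
            if since_drop >= lr_patience:   # plateau: decay LR to 80%
                lr *= lr_factor
                since_drop = 0
            if since_best >= patience:      # early stopping
                break
    return best_auc, lr
```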
3.2 Comparative Analysis of Link
Prediction Performance
First, we compared the performance of the proposed model with that of several graph models on the link prediction task. The experiments selected classic
graph neural network models and attention models for
comparison, including Graph Convolutional
Networks (GCN), GraphSAGE, Graph Attention
Networks (GAT), Graph Embedding Network
(GEN), graph network models based on Transformer,
and Graph Neural Network for Tag Ranking
(GraphTR), totaling six models. The evaluation metrics AUC, F1, Precision, Recall, and Accuracy were employed to assess link prediction performance, as presented in Table 1 below.
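For reference, the five reported metrics can be computed as follows; the 0.5 decision threshold applied to the model's Sigmoid scores is an assumption:

```python
import numpy as np

def link_metrics(y_true, scores, threshold=0.5):
    """Compute AUC, F1, Precision, Recall, and Accuracy for link prediction."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    accuracy = np.mean(pred == y_true)
    # AUC: probability that a random positive edge outscores a random negative one
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    auc = (np.mean(pos[:, None] > neg[None, :])
           + 0.5 * np.mean(pos[:, None] == neg[None, :]))
    return {"AUC": auc, "F1": f1, "Precision": precision,
            "Recall": recall, "Accuracy": accuracy}
```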