The vertical red lines mark the boundary
between clusters. The partitions are nearly the
same in all cases, which may indicate the
presence of two subpopulations in the edge
community.
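The clustering procedure behind the red boundaries is not restated in this section. As a minimal sketch, assuming a one-dimensional two-means split of the reconstruction scores (the function name `two_means_boundary` and the sample scores are hypothetical), such a boundary could be located as follows:

```python
def two_means_boundary(scores, max_iterations=100):
    """Split 1-D scores into two clusters and return the boundary value.

    Assumption: a plain two-means (Lloyd) iteration started from the minimum
    and maximum score; the boundary is the midpoint of the two centroids.
    """
    lowest, highest = min(scores), max(scores)
    if lowest == highest:
        raise ValueError("all scores are identical; no boundary exists")
    c_low, c_high = lowest, highest
    for _ in range(max_iterations):
        boundary = (c_low + c_high) / 2
        low = [s for s in scores if s <= boundary]
        high = [s for s in scores if s > boundary]
        new_low = sum(low) / len(low)
        new_high = sum(high) / len(high)
        if (new_low, new_high) == (c_low, c_high):
            break  # centroids stopped moving
        c_low, c_high = new_low, new_high
    return (c_low + c_high) / 2


# Hypothetical reconstruction scores with two visible groups.
example_scores = [0.1, 0.2, 0.15, 0.9, 0.8, 0.85]
example_boundary = two_means_boundary(example_scores)
```

Running the same split on each embedding variant and comparing the returned boundaries is one way to check whether the partitions agree across cases.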
5 SUMMARY AND CONCLUSIONS
This research introduces a new methodology for
detecting suspicious citations in scientific literature
using the GraphSAGE algorithm and enhanced
citation graph embeddings. The method has shown
effectiveness in uncovering citation anomalies
through extensive testing. However, challenges arise
in handling interdisciplinary research and "sleeping
beauties"—articles initially overlooked but later
recognized due to delayed breakthroughs—making it
difficult for the model to differentiate genuine citation
dynamics from anomalies.
Approximately 80% of the citation edges examined in
this study are identified as vulnerable to distortion,
revealing their lack of robustness within the citation
graph. These edges are flagged as potentially
manipulated, highlighting the fragile nature of
citation datasets and the significant impact that
individual edges can have on network stability and
reliability. Despite structural differences between
datasets, shared characteristics are identified,
suggesting universal tendencies within citation
systems. The Cora dataset displays a homogeneous
structure with a higher proportion of suspicious
citations, while an analysis of the larger and more
heterogeneous PubMed dataset reveals two distinct
citation groups: one associated with suspicious edges
and another with more stable, well-reconstructed
citations.
All datasets considered exhibit a stable core of
reliable connections, reflecting the gradual
accumulation of trustworthy citations over time.
Nonetheless, even in datasets regularly updated with
new publications, a substantial number of edges are
found to be unstable or irrelevant, suggesting that
citation datasets inherently include connections
prone to manipulation or unreliability.
Reconstruction score distributions show a
unimodal, positively skewed pattern: most
citations cluster around lower scores, with a long
right tail formed by the higher scores. This
distribution implies that a significant portion of
citations may lack reliability, raising concerns about
potential manipulation.
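The positive skew described above can be quantified with the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. The sketch below is illustrative only: the function name `sample_skewness` is hypothetical, and the scores are synthetic exponential draws standing in for real reconstruction scores.

```python
import random


def sample_skewness(values):
    """Return the moment coefficient of skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5


# Synthetic stand-in for reconstruction scores (assumption, not study data):
# an exponential sample has most mass at low values and a long right tail.
rng = random.Random(0)
synthetic_scores = [rng.expovariate(5.0) for _ in range(5000)]
skew = sample_skewness(synthetic_scores)
```

A value of g1 above zero indicates a longer right tail, which matches the shape reported for the score histograms.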
To validate the proposed approach, an experiment
is conducted on citation graphs artificially
augmented with noise in the form of randomly added
edges. The results confirm the model's
effectiveness in detecting such anomalies, further
reinforcing its value as a reliable tool for identifying
citation manipulation. The proposed method provides
a framework for dynamically monitoring research
trends and integrating new articles into citation
graphs, leveraging a stable core of knowledge to
evaluate individual links. An edge's position within
the recovery histogram offers insight into its
reliability and susceptibility to manipulation.
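The augmentation step of the validation experiment can be sketched as follows. This is an assumption-laden illustration rather than the procedure used in the study: the function name `add_random_edges`, the integer node labelling, and the toy graph are all hypothetical, and the edges are treated as directed (citing, cited) pairs without self-loops.

```python
import random


def add_random_edges(num_nodes, edges, num_fake, seed=0):
    """Return (augmented_edges, fake_edges) for a directed graph.

    Assumes nodes are labelled 0..num_nodes-1 and `edges` holds
    (citing, cited) pairs without self-loops. Fake edges are drawn by
    rejection sampling from node pairs that are not already connected.
    """
    rng = random.Random(seed)
    existing = set(edges)
    free_slots = num_nodes * (num_nodes - 1) - len(existing)
    if num_fake > free_slots:
        raise ValueError("not enough absent node pairs to add that many edges")
    fake = set()
    while len(fake) < num_fake:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in existing:
            fake.add((u, v))  # set membership also prevents duplicates
    return sorted(existing | fake), fake


# Toy citation graph (hypothetical): a directed 4-cycle plus one isolated node.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
augmented, fake = add_random_edges(num_nodes=5, edges=toy_edges,
                                   num_fake=3, seed=42)
```

Keeping the set `fake` separate makes it possible to score the augmented graph with the trained model and then check where the injected edges fall in the score distribution.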
This research opens new avenues for
understanding citation dynamics, emphasizing the
role of stable reconstructed edge clusters in
maintaining citation network integrity. It also
highlights universal patterns within citation systems,
offering valuable insights for developing robust tools
for citation analysis and anomaly detection.
REFERENCES
Avros, R., Haim, M. B., Madar, A., Ravve, E., &
Volkovich, Z. (2024). Spotting suspicious academic
citations using self-learning graph transformers.
Mathematics, 12(6), 814.
https://doi.org/10.3390/math12060814
Avros, R., Keshet, S., Kitai, D. T., Vexler, E., & Volkovich,
Z. (2023). Detecting pseudo-manipulated citations in
scientific literature through perturbations of the citation
graph. Mathematics, 11(18), 3820.
https://doi.org/10.3390/math11183820
Avros, R., Keshet, S., Kitai, D. T., Vexler, E., & Volkovich,
Z. (2023). Detecting manipulated citations through
disturbed node2vec embedding. In Proceedings of the
25th International Symposium on Symbolic and
Numeric Algorithms for Scientific Computing
(SYNASC), Nancy, France, 2023 (pp. 274–278). IEEE.
https://doi.org/10.1109/SYNASC61333.2023.00047
Falagas, M. E., & Alexiou, V. G. (2008). The top-ten in
journal impact factor manipulation. Archivum
Immunologiae et Therapiae Experimentalis,
56(4), 223–226.
https://doi.org/10.1007/s00005-008-0024-5
Fong, E. A., & Wilhite, A. W. (2017). Authorship and
citation manipulation in academic research. PLOS
ONE, 12(12), e0187394.
https://doi.org/10.1371/journal.pone.0187394
Grover, A., & Leskovec, J. (2016). Node2vec: Scalable
feature learning for networks. In Proceedings of the
22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD '16),
San Francisco, CA, USA, 13–17 August 2016 (pp. 855–
864). ACM. https://doi.org/10.1145/2939672.2939754
Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive
representation learning on large graphs. Advances in
Neural Information Processing Systems, 30, 1024–
1034.