Which Is More Helpful in Finding Scientific Papers to Be Top-cited in the Future: Content or Citations? Case Analysis in the Field of Solar Cells 2009

Masanao Ochi¹, Masanori Shiro², Jun'ichiro Mori¹ and Ichiro Sakata¹
¹Department of Technology Management for Innovation, Graduate School of Engineering, The University of Tokyo, Hongo 7-3-1, Bunkyo, Tokyo, Japan
²HIRI, National Institute of Advanced Industrial Science and Technology, Umezono 1-1-1, Tsukuba, Ibaraki, Japan
Keywords: Citation Analysis, Scientific Impact, Graph Neural Network, BERT.
Abstract: With the increasing digital publication of scientific literature and the fragmentation of research, it is becoming more and more difficult to find promising papers. One can examine the contents of a large number of papers, but it is easier to look at the references they cite. We therefore want to know whether a paper's promise can be judged from its content or from its citation information alone. This paper proposes a method that extracts both the content and the citations of papers as distributed representations, clusters them, and compares them under the same criteria. The method clarifies whether the future promising papers are more concentrated with respect to content or to citations. We evaluated the proposed method by comparing the distributions, among the papers published in 2009, of the papers that became top-cited three years later. As a result, we found that citation information makes the future top-cited papers 39.9% easier to identify (in terms of entropy) than content information. This analysis provides a basis for developing more general models for early prediction of the impact of scientific research and of trends in science and technology.
1 INTRODUCTION
In order to identify research worthy of investment, it is essential to identify promising research at an early stage. In addition, with the increase in the digital publication of scientific literature and the increasing fragmentation of research, there is a need to develop techniques that automatically predict future research trends. Previous research on predicting the impact of scientific research has been conducted using specially designed features for each indicator. On the other hand, recent advances in deep learning technology have facilitated integrating different individual models and constructing more general-purpose models. However, the possibility of using deep learning techniques to predict the impact indicators of scientific research has not been sufficiently explored. In this paper, we extract the number of citations after publication, one of the typical impact indicators of scientific research, together with the corresponding information in the academic literature, as distributed representations, and analyze the possibility of identifying papers with high impact.
The analysis results show that the distributed representations obtained from the linguistic information of the academic literature and from its network information behave differently. The results of this paper may provide a fundamental analysis for the development of a more general model for early prediction of the impact of scientific research and of trends in science and technology.
2 RELATED WORKS
Research on the impact of science and technology has focused on developing indicators and on projecting them into the future. The development of indices mainly aims at quantifying the influence of an individual subject. For example, the number of citations for papers, the h-index (Hirsch, 2005) for authors, the Journal Impact Factor (JIF) (Garfield and Sher, 1963) for journals, and the Nature Index (NI) for research institutions are typical examples. Various other indices have been developed, but most of them focus on papers and authors. On the other hand, some studies have reported predicting these indices.
Table 1: Rank of the number of citations of the papers in the dataset (published in 2009) until 2012.

| Ranking @2012 | Authors | Title | Journal | Citations @2012 |
|---|---|---|---|---|
| 1 | Park, S. H., et al. | Bulk heterojunction solar cells with internal quantum efficiency approaching 100%. | Nature Photonics | 1,126 |
| 2 | Chen, H. Y., et al. | Polymer solar cells with enhanced open-circuit voltage and efficiency. | Nature Photonics | 930 |
| 3 | Dennler, G., et al. | Polymer-fullerene bulk-heterojunction solar cells. | Advanced Materials | 747 |
| 4 | Krebs, F. C., et al. | Fabrication and processing of polymer solar cells: A review of printing and coating techniques. | Solar Energy Materials and Solar Cells | 495 |
| 5 | Grätzel, M., et al. | Recent advances in sensitized mesoscopic solar cells. | Accounts of Chemical Research | 465 |
Some studies predict the future h-index of researchers (Ayaz et al., 2018; Miró et al., 2017; Schreiber, 2013; Acuna et al., 2012), and others predict the number of citations after publication (Bai et al., 2019; Sasaki et al., 2016; Stegehuis et al., 2015; Cao et al., 2016). Among these, Stegehuis et al. and Cao et al. use the number of citations one to three years after publication to predict the number of citations in the reasonably distant future. In comparison, Sasaki et al. predict the number of citations three years later without using any citations accrued after publication.
Recently, the application of deep learning techniques to academic literature data has advanced. The SPECTER model (Cohan et al., 2020), evaluated on the SciDocs benchmark, is a representative example of applying such techniques to the text of academic literature. However, the SPECTER model uses the citation information of the articles, so it does not obtain the distributed representation of each article from linguistic information alone. In this study, we instead used the Sentence-BERT model (Reimers and Gurevych, 2019) trained on the SNLI corpus (Bowman et al., 2015) to obtain the distributed representation of each article.
On the other hand, there are attempts to capture the citation information of academic literature data as one huge graph and to use it for task evaluation such as link prediction. The SEAL model (Zhang and Chen, 2018) is, as of February 2021, the top-ranked model on #ogbl-citation2, the citation prediction task in the academic literature dataset of the Open Graph Benchmark (OGB) (Hu et al., 2020), one of the benchmark datasets for graph data.¹ The SEAL model learns by sampling a pair of nodes in a graph and using a subgraph containing the two nodes to predict a link between the sampled nodes. The SEAL model does not use the entire graph as input but rather a large number of small subgraphs, which has the advantage of being relatively easy to parallelize and to apply to large graphs.

¹ OGB Leaderboards for Link Property Prediction: https://ogb.stanford.edu/docs/leader_linkprop/#ogbl-citation2
3 METHODOLOGY
The purpose of this paper is to analyze the possibility of identifying papers with high impact by extracting the number of citations after publication, one of the representative impact indicators of scientific research, together with the corresponding information on the academic literature, as distributed representations. To do so, we use two methods to obtain the distributed representation of each paper: one for linguistic information (title and abstract) and the other for citation information. On the obtained distributed representations, we compare the distribution of the papers with the highest citations three years after publication. The likelihood of identifying such papers is high if they are skewed toward a particular region and low otherwise. This paper compares, on a relatively small dataset, the likelihood of identifying the papers with the highest citations by the method using linguistic information and by the method using citation information.
The method of comparison is as follows. We obtain the distributed representation of each article by two methods: an embedding method for linguistic information and an embedding method for citation information. After obtaining these two distributed representations, we apply a clustering method with the same number of clusters k to each. We then calculate an entropy over the entire dataset from the percentage of papers in each cluster that become the most cited papers n years after publication. The entropy is calculated by the following formula:
H(P_t) = -\sum_{c \in C} P_t(c) \ln P_t(c)    (1)
where the symbols in the equation are defined as follows:
- N(c): the number of papers belonging to cluster c
- N_t(c): the number of papers in cluster c that are among the top-cited papers
- P_t(c) = N_t(c) / N(c): the percentage of papers with the highest citations in cluster c
The lower the value of the entropy, the more the papers with the highest number of citations concentrate in a particular cluster.
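The entropy index of Eq. (1) is straightforward to compute once each paper has a cluster label and a top-cited flag. The following is a minimal sketch, assuming NumPy; the function and variable names are illustrative, not from the authors' code.

```python
import numpy as np

def topcited_entropy(cluster_labels, is_top_cited):
    """H(P_t) = -sum_c P_t(c) * ln(P_t(c)), with P_t(c) = N_t(c) / N(c)."""
    labels = np.asarray(cluster_labels)
    top = np.asarray(is_top_cited, dtype=bool)
    h = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        p = top[in_c].sum() / in_c.sum()  # P_t(c): share of top-cited papers in cluster c
        if p > 0:                         # treat 0 * ln(0) as 0
            h -= p * np.log(p)
    return h

# Concentrated: both top-cited papers fall in cluster 0 -> entropy 0.0
print(topcited_entropy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 0, 0]))
# Spread out: one top-cited paper in each of two clusters -> entropy ~0.69
print(topcited_entropy([0, 0, 1, 1, 2, 2], [1, 0, 1, 0, 0, 0]))
```

The two toy calls illustrate the extremes this index measures: a perfectly concentrated distribution yields zero entropy, while spreading the top-cited papers across clusters raises it.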
4 EXPERIMENT
In this section, we describe the experiment. First, we describe the scientific and technical literature data used in the experiment. Next, we explain the parameters and conditions we set for extracting the distributed representations, including how we visualized the data in two dimensions.
4.1 Scientific and Technical Literature Dataset
We received the data from Elsevier, an international publisher of many journals. They ran the query "(TITLE-ABS-KEY(nano AND carbon) OR TITLE-ABS-KEY(gan) OR TITLE-ABS-KEY(solar AND cell) OR TITLE-ABS-KEY(complex AND networks)) AND PUBYEAR AFT 2006" on Scopus and provided us with the retrieval results.
In this paper, we focus on the 57,935 papers published between 2006 and 2009 that have abstract information; the top-cited papers are the 66 papers published in 2009 that had been cited more than 100 times by 2012 (n = 3). We show some of the top-cited papers in Table 1.
For the method based on linguistic information, we combine the title and abstract of each paper as input. For the method based on citation information, we create an undirected graph from the citation information of this period, where the nodes are the papers and the edges are the citation relations. This graph has 921,454 nodes and 1,348,424 edges.
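Building such a graph is a single pass over the citation pairs. A minimal sketch, assuming networkx and that the Scopus export has been flattened into (citing, cited) paper-ID pairs in a CSV file; the file name and column layout are our assumptions, not the paper's:

```python
import csv
import networkx as nx

# Undirected citation graph: nodes are papers, edges are citation relations.
G = nx.Graph()
with open("citations.csv", newline="") as f:
    for citing, cited in csv.reader(f):
        G.add_edge(citing, cited)

# For the paper's dataset this yields 921,454 nodes and 1,348,424 edges.
print(G.number_of_nodes(), G.number_of_edges())
```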
4.2 Conditions for Distributed Representation Extraction
For the method using linguistic information, we use the trained Sentence-BERT (Reimers and Gurevych, 2019) model "nli-bert-large". For the method using citation information, we use SEAL (Zhang and Chen, 2018) with the created network as input. We set the parameter h = 1, the sampling range of nodes used to create each subgraph, and we hold out 10% of the edges as test data to evaluate the accuracy of the trained model. SEAL acquires the distributed representation by learning the presence or absence of an edge between two sampled nodes. For this purpose, we take the distributed representation of the target node from the output of the MLP layer and then average it with the distributed representations of the neighbouring nodes.
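As a rough illustration of the subgraph sampling with h = 1, the following sketch extracts the enclosing subgraph for a candidate node pair with networkx. The function name is our own; the full SEAL pipeline (node labeling, GNN scoring, and training) is omitted.

```python
import networkx as nx

def enclosing_subgraph(G, u, v, h=1):
    """Subgraph induced by all nodes within h hops of u or v (SEAL-style sampling)."""
    nodes = {u, v}
    frontier = {u, v}
    for _ in range(h):
        # expand one hop outward from the current frontier
        frontier = {n for f in frontier for n in G.neighbors(f)} - nodes
        nodes |= frontier
    return G.subgraph(nodes)

# Example on a small graph: the 1-hop subgraph around the pair (0, 3).
G = nx.path_graph(6)
print(enclosing_subgraph(G, 0, 3, h=1).nodes())
```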
We apply the K-means method for clustering the extracted distributed representations, and we set the number of clusters to k = 20. For visualization, we use the UMAP method (McInnes et al., 2018) to reduce the dimensionality to two dimensions.
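Putting the linguistic side of this pipeline together, the following sketch embeds "title + abstract" strings with the pre-trained "nli-bert-large" Sentence-BERT model, clusters them with K-means (k = 20), and projects them with UMAP. It assumes the sentence-transformers, scikit-learn, and umap-learn packages; the stand-in corpus is illustrative, and the SEAL side, which requires its own training, is omitted.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap

# Illustrative stand-in corpus; in the experiment this is the 57,935
# "title + abstract" strings from the Scopus retrieval.
texts = [f"solar cell paper {i}: example title and abstract text" for i in range(100)]

model = SentenceTransformer("nli-bert-large")   # pre-trained Sentence-BERT
embeddings = model.encode(texts)                # one vector per paper

clusters = KMeans(n_clusters=20, random_state=0).fit_predict(embeddings)  # k = 20
coords = umap.UMAP(n_components=2).fit_transform(embeddings)              # 2-D for plotting
```

The cluster labels feed the entropy index of Eq. (1), and the 2-D coordinates produce visualizations like Figure 2.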
5 RESULTS
In this section, we explain the results of our experiments. In the experiment, we use a pre-trained model for embedding the linguistic information, while the model for embedding the citation information must be trained on our dataset. For this reason, we first explain the training results of the SEAL model, which we selected as the method using citation information. After confirming that both models are sufficiently trained, we show the comparison of the distributions of the top-cited papers.
5.1 Training Results of SEAL Model
We show the Precision-Recall curve of the link-prediction results on the test data in Figure 1, and the Precision and Recall at the threshold that maximizes the F-value in Table 2. We observe that the Precision-Recall curve has a stable shape and that the model is not sensitive to the output threshold. In addition, the F-value is 0.835 at the maximizing threshold P_th = 0.960, indicating that the learned model has high accuracy on the test data.
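For reference, the Precision-Recall curve and the F-value-maximizing threshold P_th can be computed as in the following sketch, assuming scikit-learn; y_true and y_score are illustrative stand-ins for the held-out edge labels and SEAL's output scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # held-out edge / non-edge labels
y_score = np.array([0.97, 0.40, 0.99, 0.95, 0.55, 0.90, 0.65, 0.30])  # link probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f_values = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f_values[:-1])               # last PR point has no threshold
print(f"P={precision[best]:.3f} R={recall[best]:.3f} "
      f"F={f_values[best]:.3f} P_th={thresholds[best]:.3f}")
```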
Table 2: Accuracy of Link Prediction.

| Precision | Recall | F-value (Max) | P_th |
|---|---|---|---|
| 0.916 | 0.768 | 0.835 | 0.960 |
Table 3: Distribution results of the top-cited papers by entropy.

| Model | Entropy |
|---|---|
| Sentence-BERT | 2.900 |
| SEAL | 1.742 |
Figure 1: Precision-Recall curve for link prediction.
5.2 Results of the Distribution of the Top-cited Papers
We show the visualization results of the extracted distributed representations and the distributions of the top-cited papers obtained by the UMAP method in Figure 2, and the entropies of the distributions of the top-cited papers over the clusters in Table 3. In the visualization in Figure 2, the colour coding indicates the result of clustering, and the red plots sparsely shown with the titles of the papers are the top-cited papers. Comparing the visualization results of Sentence-BERT and SEAL, we can observe that the top-cited papers are more concentrated under SEAL. Table 3 shows that the entropy of the top-cited papers is 2.900 for the Sentence-BERT model and 1.742 for SEAL. In other words, the distribution of the papers with the highest citations is more than 1.1 points (of entropy) more biased under the SEAL model than under the Sentence-BERT model.
6 DISCUSSION
In this section, we discuss the results presented in Section 5.
First, the SEAL model shows an MRR (Mean Reciprocal Rank) of 0.8767 on the OGB (#ogbl-citation2) leaderboard². This indicates that the target node is, on average, roughly the 1.1th candidate, since 1/0.8767 ≈ 1.14. Although our link-prediction results are not as accurate as this, they were obtained on a network smaller than #ogbl-citation2, and they are comparable enough to indicate that the training is sufficient.

² https://ogb.stanford.edu/docs/leader_linkprop/#ogbl-citation2
Next, the papers with the highest citations are more skewed under SEAL than under Sentence-BERT, indicating that they concentrate in specific clusters. This result indicates that the citation relationships concentrate the papers whose citations are likely to increase more strongly than the content of the title or abstract does. Note that the effect of the citations themselves on SEAL's training is limited, since the present analysis covers only papers published in 2009 and marks the top-cited papers three years later.
7 CONCLUSION
In this paper, we conducted an identifiability analysis, using distributed representations extracted from academic literature information, for predicting the impact of scientific research. Specifically, we used the trained Sentence-BERT model, a method for obtaining distributed representations of linguistic information, and the SEAL model, a method for obtaining distributed representations of citation information. We applied these models to identify the top-cited papers three years after publication using only the linguistic information and the citation information available at the time of publication, and we evaluated the results with the entropy index.
The results show that the SEAL model concentrates the top-cited papers in specific clusters about 1.1 points (of entropy) more strongly than the Sentence-BERT model. This result indicates that citation information identifies the top-cited papers three years after publication more readily than linguistic information.
On the other hand, our results have some limitations. The trained Sentence-BERT model used in this study was not trained on academic literature data; it might show different results if trained on an academic literature corpus alone. In addition, the analysis is a single case study of the technological field related to solar cells, as of 2009. The result may therefore reflect the high specialization of the solar-cell field or a phenomenon specific to a particular year. In the future, we may obtain different results if the analysis is carried out for different periods and different fields, especially fields that develop in an interdisciplinary manner.
We will need to discuss further the possibility of
identifying studies that will be heavily cited in the future by analyzing more models and examples.

Figure 2: Visualization results of the acquired distributed representation (panels: Sentence-BERT and SEAL). Color coding is the result of the K-means method.
ACKNOWLEDGEMENT
This article is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
REFERENCES
Acuna, D. E., Allesina, S., and Kording, K. P. (2012). Predicting scientific success. Nature, 489(7415):201–202.

Ayaz, S., Masood, N., and Islam, M. A. (2018). Predicting scientific impact based on h-index. Scientometrics, 114(3):993–1010.

Bai, X., Zhang, F., and Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, 13(1):407–418.

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Cao, X., Chen, Y., and Liu, K. R. (2016). A data analytic approach to quantifying scientific impact. Journal of Informetrics, 10(2):471–484.

Cohan, A., Feldman, S., Beltagy, I., Downey, D., and Weld, D. S. (2020). SPECTER: Document-level representation learning using citation-informed transformers. In ACL.

Garfield, E. and Sher, I. H. (1963). New factors in the evaluation of scientific literature through citation indexing. American Documentation, 14(3):195–201.

Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102(46):16569–16572.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. (2020). Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687.

McInnes, L., Healy, J., Saul, N., and Grossberger, L. (2018). UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861.

Miró, Ò., Burbano, P., Graham, C. A., Cone, D. C., Ducharme, J., Brown, A. F. T., and Martín-Sánchez, F. J. (2017). Analysis of h-index and other bibliometric markers of productivity and repercussion of a selected sample of worldwide emergency medicine researchers. Emergency Medicine Journal, 34(3):175–181.

Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Sasaki, H., Hara, T., and Sakata, I. (2016). Identifying emerging research related to solar cells field using a machine learning approach. Journal of Sustainable Development of Energy, Water and Environment Systems, 4:418–429.

Schreiber, M. (2013). How relevant is the predictive power of the h-index? A case study of the time-dependent Hirsch index. Journal of Informetrics, 7(2):325–329.

Stegehuis, C., Litvak, N., and Waltman, L. (2015). Predicting the long-term citation impact of recent publications. Journal of Informetrics, 9.

Zhang, M. and Chen, Y. (2018). Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165–5175.