Innovative Sentence Classiﬁcation in Scientiﬁc Literature: A Two-Phase

Approach with Time Mixing Attention and Mixture of Experts

Meng Wang

1 a

, Mengting Zhang

1,2 b

, Hanyu Li

1,2 c

Jing Xie

1 d

, Zhixiong Zhang

1,2 e

, Yang Li

1,2 f

and Gaihong Yu

1 g

National Science Library, Chinese Academy of Sciences, Beijing 100190, China

School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China

Keywords:

Innovative Sentence Identiﬁcation, Multi-Class Text Classiﬁcation, Time Mixing Attention, Mixture of

Experts, Generative Semantic Data Augmentation.

Abstract:

Accurately classifying innovative sentences in scientiﬁc literature is essential for understanding research con-

tributions. This paper proposes a two-phase classiﬁcation framework that integrates a Time Mixing Attention

(TMA) mechanism and a Mixture of Experts (MoE) system to enhance multi-class innovation classiﬁcation. In

the ﬁrst phase, TMA improves long-range dependency modeling through temporal shift padding and sequence

slice reorganization. The second phase employs an MoE-based approach to classify theoretical, methodologi-

cal, and applied innovations. To mitigate class imbalance, a generative semantic data augmentation method is

introduced, improving model performance across different innovation categories. Experimental results demon-

strate that the proposed two-phase SciBERT+TMA model achieves the highest performance, with a macro-

averaged F1-score of 90.8%, including 95.1% for theoretical innovation, 90.8% for methodological innova-

tion, and 86.6% for applied innovation. Compared to the one-phase SciBERT+TMA model, the two-phase

approach signiﬁcantly improves precision and recall, highlighting the beneﬁts of progressive classiﬁcation

reﬁnement. In contrast, the best-performing LLM baseline, Ministral-8B-Instruct, achieves a macro-averaged

F1-score of 85.2%, demonstrating the limitations of prompt-based inference in structured classiﬁcation tasks.

The results underscore the advantage of a domain-adapted approach in capturing ﬁne-grained distinctions in

innovation classiﬁcation. The proposed framework provides a scalable solution for multi-class sentence classi-

ﬁcation and can be extended to broader academic classiﬁcation tasks. Model weights and details are available

at https://huggingface.co/wmsr22/Research Value Generation/tree/main.

1 INTRODUCTION

Text classiﬁcation, a fundamental natural language

processing (NLP) task, involves categorizing textual

units—ranging from documents to sentences—into

predeﬁned categories based on their meaning or func-

tion. (You et al., 2019). This task plays a critical

role in the processing of scientiﬁc literature, as it fa-

cilitates the capture and analysis of complex seman-

tic structures, the understanding of intricate linguistic

patterns, and the extraction of key information (Wang

https://orcid.org/0009-0009-7780-6516

https://orcid.org/0000-0002-9941-7548

https://orcid.org/0000-0003-1426-3242

https://orcid.org/0000-0001-6698-1786

https://orcid.org/0000-0003-1596-7487

https://orcid.org/0009-0008-8451-1452

https://orcid.org/0000-0003-1301-2871

et al., 2025). Consequently, optimizing and improv-

ing text classiﬁcation models, particularly in the con-

text of scientiﬁc literature, has become one of the cen-

tral research topics in the ﬁeld of NLP (Shang et al.,

2024).

The self-attention mechanism, as a core compo-

nent of state-of-the-art neural architectures in text

classiﬁcation, allows the model to learn the internal

structure of input data and focus on the most rele-

vant parts during information processing. This capa-

bility has proven to be highly effective in a variety of

text classiﬁcation tasks (Guo et al., 2020). However,

with the rapid growth of scientiﬁc literature and the

acceleration of interdisciplinary research, the variety

of sentence types within the literature has increased,

often exhibiting signiﬁcant class imbalance. On one

hand, certain sentence types are relatively rare and

dispersed throughout the text, which places higher de-

mands on the model’s ability to capture long-distance

382

Wang, M., Zhang, M., Li, H., Xie, J., Zhang, Z., Li, Y., Yu and G.

Innovative Sentence Classiﬁcation in Scientiﬁc Literature: A Two-Phase Approach with Time Mixing Attention and Mixture of Experts.

DOI: 10.5220/0013513200003967

In Proceedings of the 14th International Conference on Data Science, Technology and Applications (DATA 2025), pages 382-389

ISBN: 978-989-758-758-0; ISSN: 2184-285X

dependencies. On the other hand, even within the

same broad category, there is considerable variation

in the number of sentences across different subtypes,

further exacerbating the class imbalance and reduc-

ing the accuracy of multi-class classiﬁcation models

(Oida-Onesa and Ballera, 2024). Therefore, integrat-

ing temporal information into the self-attention mech-

anism and optimizing the overall classiﬁcation archi-

tecture are critical for improving performance in these

highly imbalanced and complex tasks.

To address these issues, this paper proposes a

two-phase classiﬁcation framework based on an im-

proved self-attention mechanism that incorporates

time-mixing information to enhance the model’s ca-

pability in handling long-range dependencies. Ad-

ditionally, by designing a hierarchical classiﬁcation

architecture and incorporating a generative semantic-

based data augmentation method, we aim to improve

the accuracy and generalization ability of the multi-

classiﬁcation model.

The main contributions of this paper are summa-

rized as follows:

• To improve the accuracy of multi-classiﬁcation

models, a two-phase classiﬁcation architecture is

designed, which integrates the Time Mixing At-

tention (TMA) mechanism and a mixture of ex-

perts (MoE) system. This architecture enhances

the model’s ability to recognize different types of

sentences through hierarchical feature extraction

and classiﬁcation decisions.

• To capture long-range dependencies and con-

textual semantic information, the self-attention

mechanism is introduced into the model used in

the ﬁrst phase, with dynamic temporal encoding

of context, thereby improving the model’s ability

to capture sequential dynamic behaviors.

• To address the issue of data imbalance, a gener-

ative semantic-based data augmentation method

is proposed. This method expands the training

samples of rare categories by generating data that

maintain semantic consistency, thereby improving

the model’s classiﬁcation performance across all

categories.

• To validate the effectiveness of the proposed

method, comparative experiments are conducted

on innovative sentences in scientiﬁc literature, in-

cluding both pre-trained language models (PLMs)

and large language models (LLMs). The exper-

imental results demonstrate signiﬁcant improve-

ments on macro-average metrics.

2 RELATED WORK

In this paper, we conduct a literature review on multi-

class text classiﬁcation, the self-attention mechanism,

and data augmentation approaches, as these areas pro-

vide the essential theoretical and technical founda-

tion for the development of efﬁcient text classiﬁcation

models and the mitigation of data imbalance chal-

lenges.

2.1 Multi-Class Text Classiﬁcation

Multi-class text classiﬁcation is the task of categoriz-

ing a given text into one of several predeﬁned cat-

egories, with each text belonging to only one class

from a set of possible labels (Wang et al., 2024).

Given the importance of this task, numerous ap-

proaches have been proposed to improve classiﬁ-

cation accuracy and efﬁciency. Jain et al. (Jain

et al., 2024) developed a hierarchical text classiﬁca-

tion framework that encodes dynamic text represen-

tations using language models and introduces a hori-

zontal guidance loss function to capture relationships

between text and label semantics, thus adapting lan-

guage models to domain knowledge. Afzal et al.

(Afzal et al., 2024) proposed a Transformer-based ac-

tive learning approach for multi-class text annotation

and classiﬁcation, which utilizes deep learning tech-

niques to enhance annotation efﬁciency and improve

classiﬁcation performance, particularly for unstruc-

tured medical data. Similarly, Le et al. (Le et al.,

2024) introduced CoLAL, an active learning algo-

rithm that combines noisy labels and predictions from

the primary model to select diverse and representative

samples, signiﬁcantly enhancing performance in text-

based active learning.

2.2 Self-Attention Mechanism

To address long-range dependency issues, researchers

have primarily focused on enhancing the self-

attention mechanism, particularly by improving its

ability to capture relationships across positions in a

sequence. Vaswani et al. (Vaswani, 2017) intro-

duced positional embeddings and integrated the at-

tention mechanism into the Transformer architecture,

thus obviating the need for recurrence and convolu-

tion, while effectively capturing long-range depen-

dencies and contextual relationships. Yu et al. (Yu

et al., 2024) proposed a dual attention module de-

signed to capture dependencies between different se-

quences and variations within individual sequences,

using a learnable decomposition strategy for multi-

variate time series prediction, which improved the

Innovative Sentence Classiﬁcation in Scientiﬁc Literature: A Two-Phase Approach with Time Mixing Attention and Mixture of Experts

383

capture of dynamic trend information for more accu-

rate time series forecasting.

Data augmentation is a method to generate addi-

tional data by manipulating original samples to in-

crease diversity and quantity while preserving their

core characteristics (Bayer et al., 2022). In the con-

text of text, data augmentation involves creating new

training samples through semantic-preserving trans-

formations, rule-based replacements, or generative

models, thereby expanding the scale and diversity of

datasets to improve model generalization and robust-

ness.

3 OVERVIEW

This section introduces the task of identifying and

classifying innovative sentences in scientiﬁc litera-

ture. The primary objective of this task is to cat-

egorize sentences into two main groups: innovative

and non-innovative. Additionally, for sentences iden-

tiﬁed as innovative, the task further classiﬁes them

into three distinct categories: theoretical innovation,

methodological innovation, and applied innovation.

The task is structured into two layers, each employ-

ing specialized techniques to address different aspects

of the classiﬁcation problem, ensuring a more precise

and robust categorization process.

3.1 Task Objectives

The task involves two main objectives: the identiﬁ-

cation of innovative sentences and their subsequent

classiﬁcation into speciﬁc categories.

3.1.1 Innovative Sentence Identiﬁcation

The ﬁrst objective is to identify innovative sentences.

This step focuses on detecting sentences that intro-

duce new ideas, theories, or advancements, which are

classiﬁed as innovative sentences. In contrast, non-

innovative sentences either restate established knowl-

edge or provide general descriptions without offering

novel insights or perspectives.

3.1.2 Innovation Classiﬁcation

Once innovative sentences are identiﬁed, the next step

is to classify them into one of three types of innova-

tion:

• Theoretical Innovation: Sentences that propose

new theories or frameworks, offering fresh per-

spectives on existing problems and advancing the-

oretical understanding.

• Methodological Innovation: Sentences that in-

troduce novel research methods, techniques, or

tools, enhancing the way research is conducted

through improved experimental designs or data

analysis approaches.

• Applied Innovation: Sentences that demonstrate

the practical application of existing theories or

methodologies in new contexts, addressing real-

world problems or societal needs.

3.2 Approach Design

3.2.1 First Phase: Innovative Sentence

Identiﬁcation

The ﬁrst phase is dedicated to identifying innovative

sentences within a given text. To achieve this, we be-

gin by utilizing a pre-trained model to encode each to-

ken in the sentence into high-dimensional embedding

vectors. These embeddings capture rich semantic in-

formation and contextual dependencies for each token

within the sentence. Following the encoding step, the

Time Mixing Attention (TMA) mechanism, as de-

tailed in Section 4.1, is applied to model inter-token

dependencies and capture the contextual relationships

between the token representations. Subsequently, the

resulting feature representations are passed through a

fully connected layer, which maps them into a two-

dimensional space. This projection enables the ﬁnal

classiﬁcation of sentences into two categories: inno-

vative and non-innovative, based on the aggregated

representation of the sentence.

3.2.2 Second Phase: Innovation Classiﬁcation

The second phase involves classifying the identiﬁed

innovative sentences into one of three categories: the-

oretical, methodological, or applied innovation. This

classiﬁcation process is facilitated through the inte-

gration of Mixture of Experts (MoE), as outlined

in Section 4.2, and Generative Semantic Data Aug-

mentation, discussed in Section 4.3. The MoE sys-

tem enables the dynamic selection of specialized ex-

perts based on the content of each sentence, ensuring

precise categorization. Simultaneously, the genera-

tive augmentation techniques leverage large language

models to address issues of class imbalance and data

scarcity by enriching the training dataset with diverse

sentence variations, thereby increasing the model’s

ability to handle a wide range of sentence structures

and enhancing its overall robustness.

DATA 2025 - 14th International Conference on Data Science, Technology and Applications

384

4 METHODOLOGY

4.1 Time Mixing Attention Mechanism

We propose the Time Mixing Attention (TMA) mech-

anism, which integrates time-shift padding, embed-

ding dimension slicing and recombination, and self-

attention to effectively capture both local subspace

features and global temporal dependencies.

4.1.1 Time-Shift Padding

To provide contextual support for boundary positions

in time series data, TMA employs time-shift padding

as a foundational step. In time series sequences, the

initial and ﬁnal positions often lack sufﬁcient contex-

tual information due to the absence of preceding or

succeeding elements. Time-shift padding addresses

this by symmetrically appending zero vectors to both

ends of the embedding sequence, providing additional

contextual “buffers”.

Let the input sequence be denoted as S =

,...,s

}, and its corresponding embedding

matrix, generated by a pre-trained language model,

as X = [x

,...,x

] , where X ∈ R

T ×d

, and x

represents the d-dimensional embedding vector of the

i-th element of S. Zero-padding along the temporal

dimension produces the extended embedding matrix

′

= {x

,...,x

T +1

}, where x

and x

T +1

are zero vectors acting as padding for the start and

end positions respectively, as illustrated in Figure 1.

This operation ensures sufﬁcient contextual support

for boundary positions while maintaining the integrity

of the original sequence data by using zero-padding,

which introduces no additional semantic or structural

information and preserves the neutrality of the origi-

nal embeddings.

Figure 1: Time-shift Padding.

4.1.2 Embedding Dimension Slicing and

Recombination

After applying time-shift padding, the extended em-

bedding matrix X

′

∈ R

(T +2)×d

, where each timestep

∈ R

(with i ∈ {1, 2, ..., T + 1}) is a d-dimensional

embedding vector, undergoes embedding dimension

slicing and recombination to capture local subspace

information. Each embedding vector x

is partitioned

into three equal subspaces along its embedding di-

mension (d is typically 768), as follows:

• The starting part x

i,start

= x

[: d/3], corresponding

to the ﬁrst third of the embedding vector;

• The middle part x

i,middle

= x

[d/3 : 2d/3], corre-

sponding to the middle third of the embedding

vector;

• The ending part x

i,end

= x

[2d/3 :], corresponding

to the last third of the embedding vector.

These slices are then recombined in a shifted manner

to form the ﬁnal vector. Speciﬁcally, the slices from

consecutive timesteps are shifted as shown in Figure

2, and the formula is as follows:

′

= Concat(x

i−1,start

i,middle

i+1,end

) (1)

where i ∈ {1, 2, ..., T }. This operation ensures that

each timestep’s embedding vector x

′

incorporates in-

formation from neighboring time spans. The starting

part x

i−1,start

comes from the previous timestep, the

middle part x

i,middle

comes from the current timestep,

and the ending part x

i+1,end

comes from the subse-

quent timestep.

Figure 2: Embedding Dimension Slicing and Recombina-

tion.

By shifting the slices in this manner, the recom-

bination process captures the temporal dependencies

between consecutive timesteps, allowing the model

to integrate information from neighboring time inter-

vals. The result is a new embedding matrix X

′′

∈

T ×d

, where each timestep’s embedding vector x

′

now reﬂects a broader context by incorporating lo-

cal and adjacent temporal information. This approach

helps the model better capture both short- and long-

range dependencies within the time series data.

4.1.3 Self-Attention Mechanism

Building upon the embedding matrix obtained from

the previous step, the self-attention mechanism is em-

ployed to model the dynamic relationships between

timesteps. First, the embedding matrix X

′′

is pro-

jected into three matrices: Query (Q), Key (K), and

Value (V ), as follows:

Q = X

′′

, K = X

′′

, V = X

′′

(2)

Innovative Sentence Classiﬁcation in Scientiﬁc Literature: A Two-Phase Approach with Time Mixing Attention and Mixture of Experts

385

where W

∈ R

d×d

are learned projection

matrices. Next, the attention weights µ

i, j

between

timesteps are computed using the vectors derived

from the matrices Q and K. Speciﬁcally, the attention

score between timesteps i and j is given by:

i, j

√

(3)

where q

and k

are the i-th and j-th rows of the matri-

ces Q and K, respectively. The attention weight µ

i, j

then computed using the softmax function as follows:

i, j

exp(ω

i, j

)

∑

j=1

exp(ω

i, j

)

(4)

These attention weights µ

i, j

reﬂect the degree of rele-

vance between timestep x

′

and timestep x

′

, with larger

values indicating stronger temporal dependencies.

Finally, the weighted sum of the Value vectors is

computed to generate the new representation, as fol-

lows:

∑

j=1

i, j

(5)

where v

is the j-th row of the Value matrix V , and

represents the ﬁnal feature vector for timestep i.

This vector integrates information from all timesteps,

weighted by their respective attention scores µ

i, j

. This

mechanism allows the model to dynamically focus on

relevant parts of the sequence, effectively capturing

both short- and long-range temporal dependencies.

4.2 Mixture of Experts

In this paper, we construct independent expert models

for each type of innovative sentence: theoretical inno-

vation, methodological innovation, and applied inno-

vation. These expert models are the same ones used in

the innovative sentence identiﬁcation task, which in-

corporates the Time Mixing Attention (TMA) mech-

anism. To ensure that each expert model effectively

focuses on its designated innovation dimension, we

create three distinct annotated datasets, each corre-

sponding to one of the innovation types (theoretical,

methodological, or applied innovation). Speciﬁcally,

the annotated datasets are denoted as (x

( j)

)

i=1

with j ∈ {1, 2, 3}, where x

( j)

represents the input

texts, y

( j)

represents the corresponding true labels,

and N

is the total number of samples for the j-th in-

novation type.

The training objective for each expert model is

to minimize the binary cross-entropy loss (BCE) be-

tween the model’s predicted output and the true label.

The objective function for the j-th expert model is de-

ﬁned as:

∑

i=1

BCE( f

( j)

),y

( j)

) (6)

where f

( j)

) denotes the output of the j-th expert

model for the input x

( j)

After training the expert models, we perform in-

ference by selecting the most appropriate expert for

each input sentence. During inference, a hard routing

mechanism is employed, which is based on the prob-

ability distribution computed as follows:

= σ( f

(x)) (7)

where σ(·) denotes the sigmoid function, and f

(x) is

the output of the j-th expert model for the input x. The

probability p

represents the likelihood that the j-th

expert is the most suitable for making the prediction

for the given input. The expert model with the highest

probability is selected for ﬁnal inference, as indicated

by:

max

= argmax

(8)

This hard routing mechanism ensures that only the

most appropriate expert contributes to the ﬁnal pre-

diction, thereby reducing computational complexity

and improving the model’s efﬁciency. The ﬁnal pre-

diction y

ﬁnal

is made using the selected expert model:

ﬁnal

= f

max

(x) (9)

4.3 Generative Semantic Data

Augmentation

In the ﬁne-grained Innovation Classiﬁcation task,

class imbalance and data scarcity pose signiﬁcant

challenges across the three categories of innovative

sentences. To mitigate these issues, this study pro-

poses a generative semantic data augmentation ap-

proach leveraging large language models, which aims

to enhance the diversity and quality of training data,

thereby improving the model’s classiﬁcation perfor-

mance.

In the data augmentation process, we begin by

combining labeled innovative sentences with speciﬁc

prompts to create the instructions required for gen-

erating sentences using the large language model.

These instructions consist of three main components:

task description, seed sentences, and generation re-

quirements. The prompt we constructed is as follows:

Generate semantically related sentences for each

seed sentence, with varied expressions, to enrich the

dataset: Seed 1: [We designed a novel experimental

DATA 2025 - 14th International Conference on Data Science, Technology and Applications

386

Table 1: Experimental Results for Innovative Sentence Identiﬁcation.

Model Acc% P% R% F1%

BERT 83.2 82.0 90.8 86.1

RoBERTa 83.9 84.9 87.6 86.2

SciBERT 84.3 83.4 90.8 86.9

BERT + TMA 83.3 85.5 85.5 85.5

RoBERTa + TMA 83.8 83.0 90.4 86.6

SciBERT + TMA 85.3 86.1 88.8 87.4

approach combining ﬂow cytometry and mass spec-

trometry to analyze cellular responses under differ-

ent conditions.] Seed 2: [The developed system has

immediate applications in remote patient monitoring,

enabling real-time health status tracking in rural ar-

eas.]. Requirements: 1) Semantically related to seeds,

reﬂecting methodological innovation and application

innovation. 2) Follow application scenario descrip-

tion style and norms. 3) Vary in expression, similar

in length and complexity. 4) Demonstrate practical

implementation and impact. 5) Generate sentences in

English, enclosed in square brackets [].

5 EXPERIMENTS

The experiments consist of two parts: the ﬁrst part fo-

cuses on identifying innovative sentences, where the

model is trained to recognize and extract innovation-

related sentences from scientiﬁc literature. The sec-

ond part addresses innovation classiﬁcation, where

the identiﬁed sentences are categorized into three dis-

tinct types: theoretical, methodological, and applied

innovations.

5.1 Experiment for Innovative Sentence

Identiﬁcation

The objective of this experiment was to validate the

effectiveness of the Time Mixing Attention (TMA)

mechanism in identifying innovative sentences from

scientiﬁc literature. To construct the dataset, sen-

tences were extracted from relevant sections, resulting

in a total of 23,912 annotated sentences, with 10,036

labeled as non-innovative sentences and 13,876 as in-

novative sentences. The dataset was split into train-

ing, validation, and test sets in an 8:1:1 ratio while

ensuring balanced distributions across subsets. The

experiment utilized multiple models, including BERT

(Devlin, 2018), RoBERTa (Liu, 2019), and SciBERT

(Beltagy et al., 2019), both with and without the in-

tegration of TMA, to assess its impact on classify-

ing innovation and non-innovative sentences based

on their semantic content. The training conﬁguration

employed a learning rate of 1 ×10

−5

, a batch size of

5, and a single training epoch. Performance evalu-

ation was conducted using macro-average precision

(P), recall (R), F1-score (F1) and accuracy (Acc) to

comprehensively assess the effectiveness of the mod-

els and the impact of TMA.

As shown in Table 1, the integration of the TMA

mechanism led to signiﬁcant improvements in model

performance. SciBERT with TMA achieved the best

results, with a test accuracy of 85.3% and an F1-score

of 87.4%, outperforming all other models. In addi-

tion, All models incorporating TMA achieved an av-

erage increase of 0.1% in F1-score compared to their

counterparts without the mechanism.

5.2 Experiment for Innovation

Classiﬁcation

The goal of this experiment was to evaluate the effec-

tiveness of our proposed two-phase innovation classi-

ﬁcation approach in distinguishing between different

types of innovative sentences in scientiﬁc literature.

The dataset was initially constructed through man-

ual annotation, yielding a total of 13,876 labeled in-

novative sentences, categorized into three innovation

types: 11,353 theoretical innovative sentences, 1,171

methodological innovative sentences, and 1,352 ap-

plied innovative sentences. Given the signiﬁcant class

imbalance, where methodological and applied inno-

vative sentences were substantially fewer in number,

we applied our proposed LLM-based generative se-

mantic data augmentation to expand these categories.

Speciﬁcally, we utilized Claude-3.5-Sonnet to gener-

ate additional innovative sentences while preserving

linguistic diversity and semantic consistency. For the-

oretical innovative sentences, we selected 6,000 high-

quality instances from the manually labeled data. For

methodological innovative sentences, we expanded

the dataset to 5,500 sentences, combining the orig-

inal annotations with generated samples. Similarly,

the applied innovation category was augmented to

5,000 sentences using a mix of original and generated

Innovative Sentence Classiﬁcation in Scientiﬁc Literature: A Two-Phase Approach with Time Mixing Attention and Mixture of Experts

387

Table 2: Experimental Results for Two-phase Innovation Classiﬁcation.

Model

Theoretical Methodological Applied Macro avg

Acc% P% R% F1% Acc% P% R% F1% Acc% P% R% F1% Acc% P% R% F1%

Ministral-8B-Instruct 86.7 86.3 86.9 86.6 85.2 84.8 85.5 85.1 84.1 83.8 84.3 84.0 85.3 85.0 85.6 85.2

LLAMA 3.1 8B-instruct 84.3 83.8 83.1 83.4 85.2 85.9 84.7 85.3 83.6 82.9 84.1 83.5 84.4 84.2 84.0 84.1

Phi 4-14B 85.8 86.2 85.5 85.8 84.5 84.2 84.7 84.4 83.2 83.5 83.0 83.2 84.5 84.6 84.4 84.5

Qwen2.5 -14B 85.6 84.9 85.3 85.1 84.2 83.8 84.5 84.1 83.1 82.8 83.3 83.0 84.3 83.8 84.4 84.1

SciBERT+TMA (one-phase) 91.2 96.1 85.2 90.3 91.2 90.4 90.3 90.4 91.2 83.7 74.6 78.9 91.2 90.1 83.4 86.5

SciBERT+TMA (two-phase) 95.2 96.7 93.6 95.1 91.5 98.3 84.3 90.8 86.3 85.0 88.3 86.6 91.0 93.3 88.7 90.8

data. To balance the dataset and reduce model bias, an

equal number of non-innovative sentences was incor-

porated for each category, maintaining a 1:1 positive-

to-negative sample ratio. The dataset was then strat-

iﬁed into training, validation, and test sets using an

8:1:1 split, ensuring a consistent distribution of posi-

tive and negative samples across all subsets.

The experiment evaluated multiple mainstream

LLMs, including Ministral-8B-Instruct (Para-

manayakam et al., 2024), LLAMA 3.1 8B-Instruct

(Dubey et al., 2024), Phi 4-14B (Abdin et al.,

2024), and Qwen2.5-14B (Yang et al., 2024), us-

ing a prompt-based classiﬁcation approach, where

sentences were categorized based on the following

prompt:

Your task is to classify each sentence in a given

text into one of four categories:

1.Theoretical Sentences: Sentences that discuss

theoretical frameworks, conceptual deﬁnitions, theo-

retical models, or hypotheses

2.Methodological Sentences: Sentences that de-

scribe research methods, data collection processes,

experimental designs, or analytical techniques

3.Applied Sentences: Sentences that discuss prac-

tical applications, implications, solutions, or recom-

mendations

4.Other Sentences: Sentences that don’t ﬁt into the

above categories

The following is an example:

Input: “Recent advances in cognitive psychology

have suggested that working memory capacity is not

ﬁxed but can be enhanced through training. To test

this hypothesis, we recruited 150 undergraduate stu-

dents and randomly assigned them to experimental

and control groups. The experimental group under-

went an 8-week computerized working memory train-

ing program, while the control group played casual

computer games. Results showed that students who

completed the training demonstrated signiﬁcant im-

provements in both working memory tasks and aca-

demic performance, suggesting that such interven-

tions could be valuable for educational programs.”

Output: { “sentences”: [ { “text”: ”Recent ad-

vances in cognitive psychology have suggested that

working memory capacity is not ﬁxed but can be

enhanced through training.”, “classiﬁcation”: “the-

oretical” }, { “text”: “To test this hypothesis,

we recruited 150 undergraduate students and ran-

domly assigned them to experimental and control

groups.”, “classiﬁcation”: “methodological” }, {

“text”: “The experimental group underwent an 8-

week computerized working memory training pro-

gram, while the control group played casual computer

games.”, “classiﬁcation”: “methodological” }, {

“text”: “Results showed that students who completed

the training demonstrated signiﬁcant improvements

in both working memory tasks and academic per-

formance, suggesting that such interventions could

be valuable for educational programs.”, “classiﬁca-

tion”: “applied” } ]

Now, please extract the information from the fol-

lowing text:

Input:

In addition, we conducted experiments using

SciBERT+TMA, the best-performing model from our

ﬁrst phase of innovative sentence identiﬁcation. The

classiﬁcation was performed under two settings: (1)

a one-phase classiﬁcation approach, where sentences

were directly categorized into non-innovation, the-

oretical innovation, methodological innovation, and

applied innovation; and (2) our proposed two-phase

model, which ﬁrst identiﬁed innovative sentences and

then further classiﬁed them into subcategories using

our MoE-based approach.

As shown in Table 2, the proposed two-phase

SciBERT+TMA model achieved the highest perfor-

mance, with a macro-averaged F1-score of 90.8%.

Speciﬁcally, it attained F1-scores of 95.1% for theo-

retical innovation, 90.8% for methodological innova-

tion, and 86.6% for applied innovation. Compared to

the one-phase SciBERT+TMA model, the two-phase

approach demonstrated notable improvements, partic-

ularly in precision and recall, highlighting the beneﬁts

of progressive classiﬁcation reﬁnement.

DATA 2025 - 14th International Conference on Data Science, Technology and Applications

388

6 CONCLUSIONS

This paper presents a two-phase classiﬁcation frame-

work for identifying innovative sentences in scien-

tiﬁc literature, integrating a Time Mixing Attention

(TMA) mechanism and a Mixture of Experts (MoE)

model. The ﬁrst phase enhances long-range depen-

dency modeling using TMA, while the second phase

employs MoE to classify sentences into theoretical,

methodological, and applied innovation categories.

Additionally, a generative semantic data augmenta-

tion method is introduced to address class imbalance

and improve model performance.

Experimental results demonstrate that the pro-

posed two-phase SciBERT+TMA model achieves su-

perior performance, with a macro-averaged F1-score

of 90.8%, outperforming the one-phase approach and

all LLM baselines. Speciﬁcally, the F1-scores for

theoretical, methodological, and applied innovation

categories reach 95.1%, 90.8%, and 86.6%, respec-

tively, highlighting the effectiveness of progressive

classiﬁcation reﬁnement. Compared to direct clas-

siﬁcation, the MoE-based approach signiﬁcantly im-

proves precision and recall. Among LLMs, Ministral-

8B achieves the best performance in prompt-based

classiﬁcation, with a macro-averaged F1-score of

85.2%, reinforcing the advantages of a domain-

adapted framework over general-purpose LLM infer-

ence.

ACKNOWLEDGEMENTS

This study was funded by National Social Science

Foundation of China (Grant No. 21&ZD329).

REFERENCES

Abdin, M., Aneja, J., Awadalla, H., and et al. (2024).

Phi-3 technical report: A highly capable language

model locally on your phone. arXiv preprint,

arXiv:2404.14219.

Afzal, M., Hussain, J., Abbas, A., and et al. (2024).

Transformer-based active learning for multi-class

text annotation and classiﬁcation. Digital Health,

10:20552076241287357.

Bayer, M., Kaufhold, M. A., and Reuter, C. (2022). A sur-

vey on data augmentation for text classiﬁcation. ACM

Computing Surveys, 55(7):1–39.

Beltagy, I., Lo, K., and Cohan, A. (2019). Scibert: A

pretrained language model for scientiﬁc text. arXiv

preprint arXiv:1903.10676.

Devlin, J. (2018). Bert: Pre-training of deep bidirec-

tional transformers for language understanding. arXiv

preprint arXiv:1810.04805.

Dubey, A., Jauhri, A., Pandey, A., and et al. (2024).

The llama 3 herd of models. arXiv preprint,

arXiv:2407.21783.

Guo, Q., Qiu, X., Liu, P., and et al. (2020). Multi-scale

self-attention for text classiﬁcation. In Proceedings

of the AAAI Conference on Artiﬁcial Intelligence, vol-

ume 34, pages 7847–7854.

Jain, V., Rungta, M., Zhuang, Y., and et al. (2024). Higen:

Hierarchy-aware sequence generation for hierarchical

text classiﬁcation. arXiv preprint arXiv:2402.01696.

Le, L., Zhao, G., Zhang, X., and et al. (2024). Colal: Co-

learning active learning for text classiﬁcation. In Pro-

ceedings of the AAAI Conference on Artiﬁcial Intelli-

gence, volume 38, pages 13337–13345.

Liu, Y. (2019). Roberta: A robustly optimized bert pretrain-

ing approach. arXiv preprint arXiv:1907.11692, 364.

Oida-Onesa, R. and Ballera, M. A. (2024). Fine tuning lan-

guage models: A tale of two low-resource languages.

Data Intelligence, 6(4):946–967.

Paramanayakam, V., Karatzas, A., and Anagnostopoulos, I,

e. a. (2024). Less is more: Optimizing function call-

ing for llm execution on edge devices. arXiv preprint,

arXiv:2411.15399.

Shang, S., Jiang, R., Shibasaki, R., and Yan, R. (2024).

Foundation models for information retrieval and

knowledge processing. Data Intelligence, 6(4):891–

892.

Vaswani, A. (2017). Attention is all you need. In Advances

in Neural Information Processing Systems.

Wang, M., Kim, J., and Yan, Y. (2025). Syntactic-aware text

classiﬁcation method embedding the weight vectors of

feature words. IEEE Access, 13:37572–37590.

Wang, M., Zhang, Z., Li, H., and Zhang, G. (2024). An im-

proved meta-knowledge prompt engineering approach

for generating research questions in scientiﬁc litera-

ture. In Proceedings of the 16th International Joint

Conference on Knowledge Discovery, Knowledge En-

gineering and Knowledge Management (KDIR), vol-

ume 1, pages 457–464.

Yang, A., Yang, B., Zhang, B., and et al. (2024). Qwen2.5

technical report. arXiv preprint, arXiv:2412.15115.

You, R., Zhang, Z., Dai, S., and et al. (2019). Haxmlnet:

Hierarchical attention network for extreme multi-label

text classiﬁcation. arXiv preprint arXiv:1904.12578.

Yu, G., Zou, J., Hu, X., and et al. (2024). Revital-

izing multivariate time series forecasting: Learn-

able decomposition with inter-series dependencies

and intra-series variations modeling. arXiv preprint

arXiv:2402.12694.

Innovative Sentence Classiﬁcation in Scientiﬁc Literature: A Two-Phase Approach with Time Mixing Attention and Mixture of Experts

389