Harnessing Mixture of Experts for Enhanced Abstractive Text
Summarization: A Leap Towards Scalable and Efficient NLP Models
Pramod Patil and Akanksha Songire
Department of Computer Engineering, Dr. D. Y. Patil Institute of Technology, Pimpri, Pune, India
Keywords: Abstractive Text Summarization, Mixture of Experts (MoE), Large Language Model, Transformer
Architecture, Generative AI, Natural Language Processing (NLP), Deep Learning, Recurrent Neural Network
(RNN), Feed Forward Neural Networks (FFNNs).
Abstract: The rapidly growing area of Abstractive Text Summarization (ATS) in Natural Language Processing (NLP) marks
a shift from traditional extractive methods, providing more coherent and human-like summaries by creating
new phrases and sentences. While there have been many advancements in ATS in recent years, they also
bring unique challenges and opportunities in NLP. Present models still face concerns such as content
preservation, factual inconsistency, and limited semantic understanding. This paper outlines an implementation of
ATS that adopts the Mixture of Experts (MoE) model, which improves the efficiency of complex tasks by using
multiple small models and activating only the necessary ones while processing data. This method enhances the
quality of the content and the relevance of the produced output. The experiments show that implementing the MoE
approach within the framework of ATS improves the accuracy of the content and expands the horizons for
developing more effective and efficient NLP models.
1 INTRODUCTION
We are surrounded by a vast amount of information today: knowledge flows from articles, news, social
media posts, blogs, and scientific papers. To understand this information and make decisions from it, we
need to process it and extract insights. However, no human can digest such a huge volume of data, which
is where text summarization becomes essential.
Summarization is mainly done in two ways:
extractive and abstractive. Extractive summarization, as the name suggests, selects the most relevant and
important sentences from the source and assembles them into a summary. In contrast, abstractive
summaries are created by rephrasing and rewriting the content of the original text, producing a condensed
new version in fresh wording. Sometimes it is desirable to present facts in a form that differs from the
source; the summary should capture what the author intended rather than merely quote specific parts of
the text. Alongside many other problems in natural language processing (NLP), researchers have therefore
taken up the task of generating abstractive summaries. Over the years, various approaches have evolved to
address the problem of summarizing text.
Rule-based approaches, the oldest NLP
methods, apply sets of heuristic rules and hand-constructed features to extract information or capture
structure. Statistical approaches followed, in which algorithms such as Term Frequency-Inverse Document
Frequency (TF-IDF) and Latent Semantic Analysis were used for summarization. Graph-based methods
advanced as well: sentences were treated as nodes connected according to their similarity, and the
sentences with the highest scores were selected. In recent years, machine learning approaches have
increasingly been used to overcome issues such as capturing long-term dependencies and global content,
progressing from supervised and unsupervised models to the latest sequence-to-sequence models. With the
introduction of sequence-to-sequence models (Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014), the
encoder-decoder architecture emerged, in which a variable-length input is passed to an encoder block that
turns it into a fixed-length representation, which is then sent to the decoder. The power of Recurrent
Neural Networks (RNNs) was leveraged, which made summarization tasks more sophisticated, with
masking models producing smoother and
better summaries. The limitations of the seq-to-seq model, such as word-by-word processing and vanishing
gradients, were overcome by Transformers, whose attention mechanism processes complete sentences in
parallel and uses additional attention layers to capture dependencies.
An approach still largely unexplored in this domain is the Mixture of Experts (MoE). The term was
first introduced in 1991 (Adaptive Mixtures of Local Experts) as a supervised technique for systems with
multiple networks, each handling a different region of the input space. Between 2010 and 2015, several
lines of research contributed to the field: MoEs, previously thought of as complete systems composed of
expert layers and routers, came to be treated as components of deep networks, making those networks
larger and more efficient. A second line investigated conditional computation, in which parts of the
network are dynamically activated and deactivated. MoE is a type of dynamic neural network architecture
that incorporates a set of 'experts', or sub-models, each performing a specific task depending on the input
data. A gating mechanism allows the model to allocate processing power by directing the relevant parts of
the input to the most appropriate experts.
2 RELATED WORKS
A comprehensive study was conducted to learn about existing
abstractive text summarization systems, identify
research gaps, and determine the need for developing
an effective and efficient ATS model with high
accuracy.
The paper (Zixiang Chen, Yihe Deng, 2022) details the MoE framework, a sparsely connected model
that has achieved notable success and opened a new direction in neural networks. The work explains why
the mixture model does not collapse into a single dominant expert and how the MoE layer improves
learning performance in neural networks. Its main empirical conclusion is that the effectiveness of MoE
depends mainly on the structure of the underlying problem and the nonlinearity of the experts. Two
scenarios are compared: (1) a single expert (i.e., the base model) versus (2) a mixture of experts on
particular tasks. After conducting tests on toy datasets, the authors concluded that the single-expert
model reached its highest precision at 87.5%, while the mixture-of-experts model outperformed it and
showed increased efficiency. The work also found that the router can learn cluster-centric features and
divide a complex task into sub-tasks that the experts can solve easily.
More recently, the paper (Weilin Cai, Juyong Jiang, 2024) presents a detailed survey of advancements
in Mixture of Experts (MoE) architectures from 2018 to 2024. The two types of experts, sparse MoE and
dense MoE, are elucidated, and the working and formulation of their gating mechanisms are demonstrated.
The operation of routers, the distribution of inputs among the available experts, and the training of routers
to divide and allocate sub-tasks to the experts are discussed along with methods such as auxiliary losses
and load balancing. This survey closes a gap and has become a vital tool for researchers inspecting the
complexities of MoE. After a brief review of the structure of the MoE layer, a new MoE taxonomy is
presented, and the pre-trained models and core design variants available at the time of writing are
compared on both algorithmic and systemic aspects.
In traditional Transformers, FFNNs serve as an internal layer that captures intrinsic patterns of the
data: the layer expands the representation to a larger internal dimension and then contracts it back to the
original size. The paper (Xu Owen He, 2024) tries to overcome the drawback that the compute of this
layer grows linearly with the width of the hidden layers. The proposed method, PEER (Parameter
Efficient Expert Retrieval), retrieves a small set of experts from a very large pool. The architecture
decouples model size from computational cost by using a sparse expert architecture that can effectively
exploit more than a million experts. Regarding the performance-compute trade-off, experiments on
language modeling tasks suggest that PEER layers outperform coarse-grained MoEs and dense
feedforward layers.
The paper (Gospel Ozioma Nnadi, Favio Bertini, 2024) serves as a base for recent work on abstractive
text summarization, specifically work using neural networks. It is divided into five sections in which the
authors discuss seq-to-seq models, mechanisms, training techniques, and how to optimize existing models.
Detailed descriptions of encoder-decoder models, the datasets commonly used for summarization tasks,
and the evaluation metrics are given, which helps in understanding models based on artificial and
recurrent neural networks. Mechanisms such as attention, copying, distraction, and coverage are used in
these architectures for summary
generation with neural networks. The survey (Hassan Shakil,
Ahmad Farooq, Jugal Kalita, 2024) details state-of-the-art architectures and the progression from
traditionally used architectures to recent Transformer models. The evaluation methods used in recent
summarization architectures are discussed in detail, along with possible future improvements and a path
for researchers to delve deeper into abstractive methods for summary creation.
The paper (Mike Lewis, Yinhan Liu, 2019) introduces BART, a denoising autoencoder for pretraining
sequence-to-sequence models. Text is first corrupted with a noising function and the model is trained to
recover the original text. BART follows a conventional Transformer-based design, generalizing BERT
(bidirectional encoder) and GPT (left-to-right decoder). On a range of tasks, including summarization, the
model produces state-of-the-art results with gains of up to 3.5 ROUGE. It is effective at tasks related to
text generation, translation, and comprehension, although implementing the architecture can be difficult
and computationally demanding.
The hybrid pointer-generator network (Abigail See, Peter J. Liu, Christopher D. Manning, 2017)
combines abstractive and extractive methods: it can copy words directly from the source text and generate
new words in a single pass. The technique also incorporates a coverage mechanism so that the model does
not copy the same information repeatedly. On the CNN/Daily Mail summarization task, it outperformed
the previous state of the art by at least two ROUGE points. It improves summaries by combining
abstractive generation with pure extraction; copying from the source text keeps the original facts word for
word, and over-repetition in the summaries is avoided by the coverage technique. The model nevertheless
raises two concerns:
a. The architecture is relatively difficult to optimize and implement.
b. It requires substantial processing power to train.
Based on the BART objective, the paper (Yinhan Liu, Jiatao Gu, 2020) proposes mBART, a denoising
sequence-to-sequence autoencoder pre-trained on large-scale monolingual corpora in many languages.
mBART is one of the first pre-training strategies that denoises full texts in multiple languages with a
complete sequence-to-sequence model. It can be fine-tuned directly for supervised machine translation at
both sentence and document level as well as for unsupervised translation, and it shows significant
performance gains across most translation tasks.
3 PROBLEM STATEMENT
With the ever-increasing amount of information around the
world, summarized content has become a necessity. We need to be time-efficient and precise
about the content we need versus the data we consume. The MoE framework is analyzed for
abstractive text summarization to overcome the word-by-word processing and long-range memory
issues of the Transformer architecture, the most widely used technique for summarization. The model
targets the quality and coherence of the generated content by dynamically directing the input to the
network's experts, thereby addressing the problems of preserving semantic integrity and managing
content intelligently. The proposed approach enhances the efficiency and scalability of summarization
with fewer computational requirements.
4 OBJECTIVES
The main aims of the current research are stated below:
1 To build an understanding of abstractive text summarization.
2 To study Mixture of Experts (MoE) approaches.
3 To design strategies for text summarization using the selective activation of experts.
4 To assess scalability and efficiency.
5 To outline areas for enhancement in future studies in this field.
5 METHODOLOGY
5.1 Normal LLM
A normal LLM forwards every input through all of its parameters
(weights and biases). The chosen base model, such as T5, BART, or another Transformer, contains
FFNNs in each layer that expand the internal representation to roughly twice its size. For example, if an
input of size 512 is passed in, the FFNN's internal layer expands it to 1024 to capture the intricate
relationships between the tokens and then contracts it back to 512. The underlying model is pre-trained or
fine-tuned to fit the particular task.
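
As a concrete illustration, the sketch below shows a dense feed-forward block of the kind described above; the 512 -> 1024 -> 512 sizes follow the example in the text and are illustrative rather than the exact dimensions of any particular base model.

import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Standard Transformer feed-forward block: expand, apply a
    non-linearity, then project back to the model dimension.
    Every token is processed by all of these parameters."""
    def __init__(self, d_model: int = 512, d_hidden: int = 1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # 512 -> 1024
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)  # 1024 -> 512

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))

# Example: a batch of 2 sequences, 16 tokens each, model width 512.
x = torch.randn(2, 16, 512)
print(DenseFFN()(x).shape)  # torch.Size([2, 16, 512])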
5.2 Mixture of Experts
The Mixture of Experts (MoE) is a type of neural network architecture
designed to increase the effectiveness and efficiency of machine learning tasks. This is achieved by
breaking the problem down into smaller jobs that are handled by highly specialized sub-models, the
experts, each trained to specialize in a particular area of the overall task. The architecture improves the
overall performance and efficiency of the model by dynamically choosing the most relevant experts for
each input through a gating mechanism.
Figure 1: Mixture of Experts
5.2.1 Dense Mix of Experts
A dense Mixture of Experts uses all the experts to process every input: since all parameters (weights
and biases) are active, the gating mechanism is used only to weight how the experts' outputs are combined.
Every expert considers the whole input before its output is combined with that of the other experts. This
ensures that all specialists contribute to the final result, although the high computational cost can be
prohibitive.
Figure 2: Dense Mix of Experts
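
A minimal sketch of a dense MoE layer follows, assuming the gate is a single linear layer followed by a softmax and that the expert and model widths are purely illustrative:

import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    """Dense MoE: every expert processes every input; the gate only
    decides how much weight each expert's output receives."""
    def __init__(self, d_model: int = 512, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 2 * d_model),
                           nn.GELU(),
                           nn.Linear(2 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)             # (..., n_experts)
        outputs = torch.stack([e(x) for e in self.experts], -1)   # (..., d_model, n_experts)
        return (outputs * weights.unsqueeze(-2)).sum(-1)          # weighted combination

x = torch.randn(2, 16, 512)
print(DenseMoE()(x).shape)  # torch.Size([2, 16, 512])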
5.2.2 Sparse Mix of Experts
The sparse MoE framework uses the concept of conditional computation: unlike dense models, which
use all parameters for every input, sparse models activate only a subset of the parameters, namely the few
experts selected by the router. This allows the model size to scale, making it possible to integrate
thousands of experts.
Figure 3: Sparse Mix of Experts
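
A minimal sketch of sparse, top-k routing under the same illustrative sizes: the router scores all experts, but only the k experts selected for each token are actually evaluated.

import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sparse MoE: the router picks the top-k experts per token and
    only those experts are evaluated (conditional computation)."""
    def __init__(self, d_model: int = 512, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 2 * d_model),
                           nn.GELU(),
                           nn.Linear(2 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.shape[-1])                # (tokens, d_model)
        logits = self.router(flat)                       # (tokens, n_experts)
        top_val, top_idx = logits.topk(self.k, dim=-1)   # keep top-k per token
        probs = torch.softmax(top_val, dim=-1)           # renormalise over the k
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = top_idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += probs[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(x.shape)

x = torch.randn(2, 16, 512)
print(SparseMoE()(x).shape)  # torch.Size([2, 16, 512])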
6 PROPOSED MODEL
The Mixture of Experts uses different sub-models, or experts, to improve the quality of LLMs. MoE is
defined by two main components: the experts and the router, or gating mechanism. LLMs use FFNNs in
each layer; an MoE model instead uses a set of experts in each layer, which are themselves FFNNs. The
router decides which token should be sent to which expert, so experts learn fine-grained information rather
than the whole domain. Since most LLMs stack many decoder layers, the input passes through multiple
experts before the actual text generation (each decoder layer has its own experts). The router is trained to
choose an expert for a specific input: its output is a probability distribution used to select the best experts,
and the output of each selected expert is multiplied by its gate probability. The experts together with the
router make up the MoE layer. The gating mechanism is the most important part, since it makes the
routing decision during both training and inference. In its basic form, the router score is the input
multiplied by the router weight matrix W:
H(x) = x * W (1)
A KeepTopK operation then sets the router weights of all but the top 2 experts to -infinity. Taking the
softmax over these weights maps -infinity to a probability of 0, so only the two selected experts receive
non-zero gate values:
G(x) = Softmax(KeepTopK(H(x), 2)) (2)
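
A minimal sketch of this top-2 gating rule, assuming a plain linear router with weight matrix W and no noise or load-balancing terms (the token and expert counts are illustrative):

import torch

def top2_gate(x: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Top-2 gate: compute H(x) = x @ W, set every score outside the
    top 2 to -inf, then softmax so masked experts get probability 0."""
    h = x @ W                                       # router scores, one per expert
    second_best = h.topk(2, dim=-1).values[..., -1:]
    masked = h.masked_fill(h < second_best, float("-inf"))
    return torch.softmax(masked, dim=-1)            # non-zero only for the top 2

x = torch.randn(4, 512)   # 4 token embeddings
W = torch.randn(512, 8)   # router weights for 8 experts
gates = top2_gate(x, W)
print(gates)              # each row: two non-zero probabilities summing to 1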
6.1 Architecture Diagram
1 Input Layer: Transforms the input text into
embeddings to enable further processing by the
model.
2 PEGASUS Encoder: The PEGASUS encoder is
essentially a multi-layered version of the Transformer
that sequentially processes the embedded text. This is
where the contextual meaning of the input text is
realized.
3 Gating Mechanism: The gating mechanism selects the most relevant set of experts from a pool of
specialized experts based on the encoded input.
4 Experts: Each expert analyzes the input it is given, focusing on a slightly different aspect of the
summarization task, such as contextual understanding, coherence, or factuality.
5 Aggregation of Outputs: The active experts' outputs are aggregated to obtain an overall coherent
representation of the summary text.
Figure 4: Proposed Model
6 Decoder (PEGASUS): The aggregated representation is fed into the PEGASUS decoder to produce the
final abstractive summary.
7 Output Layer: The output of the decoder is the final condensed text. The complete pipeline is
sketched below.
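
A hypothetical end-to-end sketch of this pipeline is given next. It reuses the SparseMoE sketch from Section 5.2.2, loads a public PEGASUS checkpoint from Hugging Face, and inserts the (untrained) MoE layer between the encoder and the decoder purely to illustrate the data flow; the checkpoint name, the expert count, and the placement of the MoE layer are assumptions for illustration, not the exact configuration used in the experiments.

# Hypothetical sketch: PEGASUS encoder -> sparse MoE over the encoder
# states -> PEGASUS decoder. SparseMoE is the sketch from Section 5.2.2
# and is untrained here; the block only demonstrates the data flow.
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

model_name = "google/pegasus-xsum"          # any public PEGASUS checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
moe = SparseMoE(d_model=model.config.d_model, n_experts=4, k=2)

@torch.no_grad()
def summarize(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Steps 1-2: embed and encode the input with the PEGASUS encoder.
    enc = model.get_encoder()(**inputs).last_hidden_state
    # Steps 3-5: route encoder states through the experts and aggregate
    # their outputs (residual connection around the MoE layer).
    enc = enc + moe(enc)
    # Steps 6-7: decode the aggregated representation into the summary.
    ids = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=enc),
        attention_mask=inputs["attention_mask"],
        num_beams=4, max_length=64,
    )
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(summarize("Your long input document goes here."))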
6.2 PEGASUS Base Model
1 Summarization-Specific Design: PEGASUS is designed specifically for summarization tasks. It lends
itself to the task much better than general-purpose language models because its pre-training objective is
built around generating summaries.
2 Gap Sentence Generation (GSG) Pre-Training Method: PEGASUS employs a novel pre-training method
called Gap-Sentence Generation, in which entire sentences are masked out and the model is trained to
generate the masked content. Because this closely resembles the summarization task itself, it improves the
model's understanding and summarization abilities (a toy illustration of this masking is given after this
list).
3 State-of-the-Art Performance: PEGASUS delivers significant improvements over a range of
summarization benchmarks, including CNN/Daily Mail and XSum.
4 Data Efficiency: PEGASUS makes efficient use of training data; the GSG pre-training objective allows
it to achieve strong results with less data than other models.
5 Domain Flexibility: PEGASUS has proven effective across a variety of domains, which makes it
suitable for summarizing varied content, from research publications to news articles. This versatility is
crucial for the present research application.
6 Compatibility with MoE: By integrating a Mixture of Experts, PEGASUS's effectiveness in
understanding and generating summaries can be optimized further. The specialization and efficiency
improvements of MoE allow PEGASUS to handle larger and more complex summarization jobs
efficiently.
7 Transformer Backbone: PEGASUS is built on the Transformer architecture and relies on the scalability
and robust attention mechanisms intrinsic to Transformers, both of which are critical to producing quality
summaries.
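
To make the GSG objective concrete, the toy sketch below masks one 'gap' sentence out of a short document and forms the (input, target) pair the model would be trained on. The sentence-selection rule used here (pick the longest sentence) is a simplifying assumption; the actual PEGASUS recipe scores sentences by importance, for example by their ROUGE overlap with the rest of the document.

# Toy illustration of Gap-Sentence Generation (GSG) pre-training data.
# Assumption: the "gap" sentence is chosen by length; real PEGASUS uses
# an importance score (e.g., ROUGE of each sentence vs. the remainder).
MASK_TOKEN = "<mask_1>"

def make_gsg_example(document: str) -> tuple[str, str]:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    gap = max(sentences, key=len)                  # placeholder selection rule
    encoder_input = ". ".join(MASK_TOKEN if s == gap else s
                              for s in sentences) + "."
    decoder_target = gap + "."
    return encoder_input, decoder_target

doc = ("The river flooded the town overnight. Hundreds of residents were "
       "evacuated to nearby shelters by emergency services. Water levels "
       "are expected to drop by Friday.")
src, tgt = make_gsg_example(doc)
print("Encoder input :", src)
print("Decoder target:", tgt)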
MoE techniques significantly enhance the performance and efficiency of the abstractive text
summarization process. The benefits of applying specialized summarization experts are as follows. In the
MoE model, the experts possess
specific knowledge of different aspects of summarization, such as factual accuracy, sentence structure,
and contextual meaning. The gate dynamically decides which experts should be called, activating only
those needed for the particular input at hand. The model can therefore save computation cost while
maintaining good performance by using very few experts, and the experts working collectively deliver
better accuracy and coherence in the produced summary.
7 RESULTS AND DISCUSSION
Figure 5 and Figure 6 display the performance metrics for abstractive summarization with a normal
LLM versus summarization performed by integrating the base model with a Mixture of Experts. The
Mixture of Experts performs the tasks better at a lower computational cost, since sub-experts are selected
based on the specialist knowledge they possess. Integrating a Mixture of Experts into PEGASUS, which is
the best base model for text summarization tasks, also outperforms plain PEGASUS.
Figure 5: Normal LLM versus MOE
Figure 6: PEGASUS versus PEGASUS with MOE
8 CONCLUSION
The MoE approach, which combines the strengths of special-purpose expert models, appears to be a
promising way to extend abstractive text summarization. Summaries generated by MoE models become
not only more accurate and coherent but also more contextually relevant because they combine outputs
from multiple experts. The approach helps avoid several serious drawbacks of traditional summarization
techniques, such as exposure bias and difficulties arising from large search spaces. At the same time, the
MoE model comes with its own set of challenges, including high complexity, increased resource
requirements, and the danger of overfitting. Nevertheless, the Mixture of Experts technique is a valuable
tool in the research and practice of natural language processing, as it brings specific advantages in
performance and flexibility. The article concludes by highlighting MoE's dual role in abstractive text
summarization: while it offers clear performance and adaptability benefits, it also raises practical issues
that need to be addressed effectively. The core insight of this paper is its balanced treatment of the
strengths and weaknesses of MoE, imparting a well-rounded understanding of the practicality of this
model in real use.
REFERENCES
Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li, 2022. Towards understanding the
Mixture-of-Experts layer in deep learning. arXiv preprint.
Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, 2024. A survey on Mixture
of Experts. arXiv preprint.
Xu Owen He, 2024. Mixture of A Million Experts. arXiv preprint.
Gospel Ozioma Nnadi, Favio Bertini, 2024. arXiv preprint (Artificial Intelligence).
Hassan Shakil, Ahmad Farooq, Jugal Kalita, 2024. arXiv preprint (Computation and Language).
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, Luke Zettlemoyer, 2019. BART: Denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehension.
Abigail See, Peter J. Liu, Christopher D. Manning, 2017. Get to the point: Summarization with
pointer-generator networks.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke
Zettlemoyer, 2020. Multilingual denoising pre-training for neural machine translation (mBART).
Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014. Sequence to sequence learning with neural networks.
David Eigen, Marc'Aurelio Ranzato, Ilya Sutskever, 2014. Learning factored representations in a deep
mixture of experts. arXiv preprint.
Kumar, S., Solanki, 2023. Springer.
Shubham Dhapola, Siddhant Goel, Daksh Rawat, Satvik Vats, Vikrant Sharma, 2024. IEEE 3rd World
Conference on Applied Intelligence and Computing (AIC).