better summaries. The limitations of seq-to-seq models, such as word-by-word processing and vanishing gradients, were overcome in transformers, whose attention mechanism processes complete sentences in parallel and uses additional attention layers to capture long-range dependencies.
A new approach, so far unexplored in this domain, is the use of a Mixture of Experts (MoE). The term was first introduced in 1991 (Adaptive Mixture of Local Experts) as a supervised technique for systems with multiple networks, each handling a different region of the input space. Between 2010 and 2015, several areas of research contributed to this field: whereas MoEs were originally thought of as complete systems with expert layers and routers, later work recast MoE layers as components of deep networks, making them larger and more efficient. A second line of work investigated conditional computation, in which parts of the network are dynamically activated and deactivated. MoE, a type of dynamic neural network architecture, incorporates a set of 'experts', or sub-models, each handling a specific task depending on the input data. A gating mechanism directs the relevant parts of the input to the most appropriate experts, allowing the model to allocate processing power where it is needed and to relieve the remaining experts.
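As a concrete illustration of this gating idea, the following minimal sketch (written in PyTorch; the layer sizes, expert count, and top-k value are assumptions chosen for clarity, not the configuration of any work cited below) scores every token with a learned gate and dispatches it only to its top-k experts:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal sparse Mixture-of-Experts layer with top-k gating."""
    def __init__(self, d_model=512, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward sub-model.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gate produces one score per expert for every token.
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Route each token only to its top-k experts, weighted by the gate scores.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot:slot + 1] * expert(x[mask])
        return out

A token whose gate places most of its probability mass on one expert is processed almost entirely by that expert, which is how compute is steered toward the most relevant sub-model.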
2 RELATED WORKS
A comprehensive study was conducted to learn about existing abstractive text summarization systems, identify research gaps, and determine the need for developing an effective and efficient ATS model with high accuracy.
The paper (Zixiang Chen, Yihe Deng, 2022) details the MoE framework, a sparsely activated model that has achieved notable success and opened a new direction in neural network design. We delve into this paper to explain why the mixture model does not collapse into a single dominant expert, and how the MoE layer improves learning performance in neural networks. The main conclusion of their empirical results is that the effectiveness of MoE depends mainly on the structure of the underlying problem and the nonlinearity of the experts. Two scenarios are compared in this work: (1) a single expert (i.e., the base model) and (2) a mixture of experts, applied to the same tasks. After conducting tests on toy datasets, the authors concluded that the single-expert model reached its highest precision at 87.5%, which the mixture-of-experts model outperformed while also showing increased efficiency. The work also found that the router can learn cluster-centric features and divide a complex task into sub-tasks that the experts can solve more easily.
More recently, the paper (Weilin Cai, Juyong Jiang, 2024) presents a detailed survey of advancements in Mixture of Experts (MoE) architectures from 2018 to 2024. The two main variants, sparse MoE and dense MoE, are elucidated, and the working and formulation of their gating mechanisms are demonstrated. The operation of routers, the distribution of the input across the available experts, and the training of routers to divide the work and allocate sub-tasks to experts are discussed, along with techniques such as auxiliary losses and load balancing. This survey closes a gap in the literature and has been a vital tool for researchers inspecting the complexities of MoE. After a brief review of the structure of the MoE layer, a new MoE taxonomy is presented. The pre-trained models and the core design variants available at the time of the survey are reviewed and compared, in terms of both algorithmic and systemic elements.
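To give a flavour of the load-balancing techniques such surveys cover, the sketch below implements one common auxiliary loss (in the style popularised by sparsely gated transformer MoEs; this is an illustrative choice rather than the exact formulation discussed in the survey). The router is penalised when both the fraction of tokens dispatched to each expert and the average gate probability for each expert concentrate on a few experts:

import torch

def load_balancing_loss(gate_probs: torch.Tensor,
                        expert_assignment: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Auxiliary loss that encourages a uniform spread of tokens over experts.

    gate_probs:        (tokens, num_experts) softmax outputs of the router.
    expert_assignment: (tokens,) long tensor with the expert index chosen per token.
    """
    # Fraction of tokens actually dispatched to each expert.
    dispatch_frac = torch.bincount(expert_assignment, minlength=num_experts).float()
    dispatch_frac = dispatch_frac / expert_assignment.numel()
    # Mean gate probability assigned to each expert.
    prob_frac = gate_probs.mean(dim=0)
    # The scaled dot product is minimised when both distributions are uniform.
    return num_experts * torch.sum(dispatch_frac * prob_frac)

Adding this term to the training objective discourages the router from collapsing onto a handful of experts, which is the failure mode load balancing is meant to prevent.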
In traditional transformers, feed-forward neural networks (FFNNs) are used as an internal layer to capture intrinsic patterns in the data: the hidden representation is expanded to roughly twice the input width and then projected back to the original number of dimensions. The paper (Xu Owen He, 2024) tries to overcome the disadvantage of this original architecture, whose cost grows linearly with the width of the hidden layers. The proposed method, PEER (Parameter Efficient Expert Retrieval), retrieves experts from a very large pool. The architecture decouples model size from computing cost by using a sparse expert design that can effectively exploit more than a million experts. Regarding the performance-compute trade-off, experiments on language modeling tasks suggest that PEER layers outperform both coarse-grained MoEs and dense feed-forward layers.
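For reference, the dense feed-forward block that such retrieval-based approaches aim to replace can be sketched as follows (a minimal illustration; the expansion factor of two mirrors the description above and is not taken from the PEER paper itself):

import torch.nn as nn

class TransformerFFN(nn.Module):
    """Standard dense feed-forward block inside a transformer layer."""
    def __init__(self, d_model=512, expansion=2):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)    # expand the hidden width
        self.act = nn.GELU()
        self.down = nn.Linear(expansion * d_model, d_model)  # project back to d_model

    def forward(self, x):
        return self.down(self.act(self.up(x)))

The parameter count and per-token compute of this block grow linearly with the expansion width, which is exactly the cost that PEER sidesteps by retrieving only a few tiny experts per token from its large pool.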
The paper (Gospel Ozioma Nnadi, Favio Bertini, 2024) serves as a base for the work done on abstractive text summarization and its recent advancements, specifically those using neural networks. The work is divided into five sections in which the authors discuss seq-to-seq models, mechanisms, training techniques, and ways to optimize existing models. Detailed descriptions of encoder-decoder models, the datasets commonly used for summarization tasks, and the evaluation metrics are provided, which helps in understanding models based on artificial neural networks and recurrent neural networks. Mechanisms such as attention, copying, distraction, and coverage are used in these architectures for summaries