
configuration. This paper addresses these challenges
by fine-tuning a recent large language model (LLM),
T5-large, specifically for DBA-related SQL queries.
As no dataset covering all types of DBA queries is publicly available, we created a customized “DBASQL” (Database Administrator Structured Query Language) dataset containing natural language queries commonly used by database administrators. This dataset includes natural language questions such as “Modify datatype of custid column from integer to number”, “Allow insert operation on user table to the manager”, and “Create table employee having id, name, age”. These questions are paired
with corresponding SQL commands, forming a robust
dataset for training and fine-tuning advanced LLMs to
achieve accurate and reliable translation. This dataset can be combined with other existing heterogeneous datasets, such as single-domain and cross-domain datasets, to cover the full range of SQL operations expressed through natural language questions. The proposed DBASQL dataset is publicly available at https://www.kaggle.com/datasets/pradnyasawant/dbasql. The proposed model can also handle complex tasks such as referencing schemas, updating database content and table schemas, and managing user permissions on database objects.
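To make the pairing concrete, the following sketch (in Python) shows how such question/SQL entries might be represented; the SQL targets and the table name customer are plausible assumptions for the example questions above, not necessarily the exact commands stored in DBASQL.

# Illustrative (question, SQL) pairs in the style of DBASQL.
# The SQL targets and the "customer" table name are assumptions for this sketch.
dbasql_examples = [
    {   # DDL: alter a column's datatype
        "question": "Modify datatype of custid column from integer to number",
        "sql": "ALTER TABLE customer MODIFY custid NUMBER;",
    },
    {   # DCL: grant a privilege on a table to a user or role
        "question": "Allow insert operation on user table to the manager",
        "sql": "GRANT INSERT ON user TO manager;",
    },
    {   # DDL: create a new table
        "question": "Create table employee having id, name, age",
        "sql": "CREATE TABLE employee (id INT, name VARCHAR(50), age INT);",
    },
]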
Through experiments, we show that our model performs with high accuracy, highlighting the potential of T5-large in automating database administration tasks. By minimizing the need for deep SQL knowledge, this research also helps create a smarter Natural Language Interface to Databases (NLIDB) for managing DBA operations.
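As a rough sketch of the fine-tuning setup (not our exact training code), the following shows how T5-large could be fine-tuned on such question/SQL pairs using the Hugging Face transformers and datasets libraries; the task prefix, sequence lengths, hyperparameters, and toy training pairs are illustrative assumptions.

from datasets import Dataset
from transformers import (T5ForConditionalGeneration, T5TokenizerFast,
                          Trainer, TrainingArguments)

# Minimal sketch: fine-tune T5-large on (question, sql) pairs such as those in DBASQL.
pairs = [
    {"question": "Create table employee having id, name, age",
     "sql": "CREATE TABLE employee (id INT, name VARCHAR(50), age INT);"},
    {"question": "Allow insert operation on user table to the manager",
     "sql": "GRANT INSERT ON user TO manager;"},
]

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def preprocess(batch):
    # A task prefix lets T5 treat the input as a translation problem.
    inputs = ["translate English to SQL: " + q for q in batch["question"]]
    enc = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(batch["sql"], max_length=128, truncation=True, padding="max_length")
    # For simplicity, pad tokens in the labels are not masked to -100 here.
    enc["labels"] = labels["input_ids"]
    return enc

train_ds = Dataset.from_list(pairs).map(preprocess, batched=True,
                                        remove_columns=["question", "sql"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="t5-dbasql", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=3e-4),
    train_dataset=train_ds,
)
trainer.train()

# Inference: generate SQL for a new DBA-style question.
question = "Drop the index on the salary column of employee table"
ids = tokenizer("translate English to SQL: " + question, return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_length=128)[0], skip_special_tokens=True))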
2 RELATED WORK
Converting natural language (NL) queries into SQL has been studied for a long time. Early methods relied on rule-based systems, but these were limited because they were rigid and could not handle a wide variety of queries or complex database structures (Kumar, 2014).
Neural network-based models brought significant improvements to converting natural language into SQL, especially with sequence-to-sequence (Seq2Seq) models. For example, Seq2SQL used reinforcement learning to generate SQL commands, addressing issues of query structure and accuracy (V. Zhong and Socher, 2017). Later models such as SyntaxSQLNet (T. Yu, 2018b), F-SemtoSql (Q. Li and Zhong, 2020), and TypeSQL (T. Yu and Radev, 2018) became even more accurate by adding rules and using information about data
types. They also introduced attention mechanisms,
which helped the models focus on the important parts
of the input, making it easier to handle more compli-
cated queries.
Along with NLIDB, NL2VIS (Natural Language to Visualizations) systems such as NL4DV (A. Narechania and Stasko, 2021), Advisor (C. Liu and Yuan, 2021), and ncNet (Y. Luo and Qin, 2022) are becoming popular because they allow non-technical users to generate business insights as charts, graphs, and other visualizations from the underlying database. Several benchmarks are available for generating visualizations from natural language questions (K. Z. Hu et al., 2019; Y. Luo and Qin, 2021).
The Transformer architecture completely changed NLP-to-SQL tasks, with models such as BERT (J. Devlin and Toutanova, 2018), RoBERTa (K. Ahkouk and Ennaji, 2021), XLNet (Q. Li and Zhong, 2020), T5 (Y. Li and Zhang, 2023), and Codex (Trummer, 2022) providing better context understanding, which improved performance in generating SQL queries.
Fine-tuning models like T5 on SQL datasets made
NL-to-SQL translation more reliable and adaptable,
while OpenAI’s Codex model showed strong ability to generate SQL commands across many different types of queries. (T. Yu, 2018a) introduced Spider, a large, cross-domain dataset with complex multi-table SQL queries, which became an important benchmark for evaluating recent large language models (LLMs) (M. A. Khan and Azam, 2024; N. T. K. Le and Teshebaev, 2023; C. Raffel and Liu, 2020).
This dataset has driven the development of advanced models that generalize effectively. The Spider dataset is a popular benchmark in NLP-to-SQL research. It contains natural language questions linked to complex SQL queries over many different database schemas. Spider is well suited for testing how well models adapt to new database structures and handle multi-table joins and nested queries.
However, it mainly covers DML queries and does not include database administration-related DDL and DCL queries. WikiSQL is another well-known, simpler dataset that focuses on single-table queries created from Wikipedia tables. However, it only includes SELECT queries and does not cover DDL or DCL queries (T. Yu, 2018a).
CoSQL (T. Yu and Su, 2019a) and SParC (T. Yu and Su, 2019b) are based on the Spider dataset but add conversational, multi-step queries in which users refine their queries gradually. However, like Spider, they mostly focus on DML queries and do not cover DDL or DCL queries in detail. While current NLP-to-SQL datasets provide a strong foundation for developing models that handle DML queries, there is a clear gap in datasets representing DBA-related DDL and DCL queries. Addressing this gap is important