
configuration. This paper addresses these challenges
by fine-tuning a recent large language model (LLM),
T5-large, specifically for DBA-related SQL queries.
As no dataset covering all types of DBA queries is publicly available, we created a customized “DBASQL” (Database Administrator Structured Query Language) dataset containing natural language queries commonly used by database administrators. This dataset includes natural language questions such as “Modify datatype of custid column from integer to number”, “Allow insert operation on user table to the manager”, and “Create table employee having id, name, age”. These questions are paired
with corresponding SQL commands, forming a robust
dataset for training and fine-tuning advanced LLMs to
achieve accurate and reliable translation. This dataset can be combined with other existing heterogeneous datasets, such as single-domain and cross-domain datasets, to cover the full range of SQL operations expressed through natural language questions. The proposed DBASQL dataset is publicly available at https://www.kaggle.com/datasets/pradnyasawant/dbasql. The proposed model can also handle complex tasks such as referencing schemas, updating database content and table schemas, and managing user permissions on database objects.
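To make the pairing concrete, the following sketch (in Python) shows how such question/SQL entries might be represented; the SQL targets and the table name customer are plausible assumptions for the example questions above, not necessarily the exact commands stored in DBASQL.

# Illustrative (question, SQL) pairs in the style of DBASQL.
# The SQL targets and the "customer" table name are assumptions for this sketch.
dbasql_examples = [
    {   # DDL: alter a column's datatype
        "question": "Modify datatype of custid column from integer to number",
        "sql": "ALTER TABLE customer MODIFY custid NUMBER;",
    },
    {   # DCL: grant a privilege on a table to a user or role
        "question": "Allow insert operation on user table to the manager",
        "sql": "GRANT INSERT ON user TO manager;",
    },
    {   # DDL: create a new table
        "question": "Create table employee having id, name, age",
        "sql": "CREATE TABLE employee (id INT, name VARCHAR(50), age INT);",
    },
]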
Through experiments, we show that our model performs with high accuracy, highlighting the potential of T5-large in automating database administration tasks. By minimizing the need for deep SQL knowledge, this research also helps create a smarter Natural Language Interface to Databases (NLIDB) for managing DBA operations.
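As a rough sketch of the fine-tuning setup (not our exact training code), the following shows how T5-large could be fine-tuned on such question/SQL pairs using the Hugging Face transformers and datasets libraries; the task prefix, sequence lengths, hyperparameters, and toy training pairs are illustrative assumptions.

from datasets import Dataset
from transformers import (T5ForConditionalGeneration, T5TokenizerFast,
                          Trainer, TrainingArguments)

# Minimal sketch: fine-tune T5-large on (question, sql) pairs such as those in DBASQL.
pairs = [
    {"question": "Create table employee having id, name, age",
     "sql": "CREATE TABLE employee (id INT, name VARCHAR(50), age INT);"},
    {"question": "Allow insert operation on user table to the manager",
     "sql": "GRANT INSERT ON user TO manager;"},
]

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def preprocess(batch):
    # A task prefix lets T5 treat the input as a translation problem.
    inputs = ["translate English to SQL: " + q for q in batch["question"]]
    enc = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(batch["sql"], max_length=128, truncation=True, padding="max_length")
    # For simplicity, pad tokens in the labels are not masked to -100 here.
    enc["labels"] = labels["input_ids"]
    return enc

train_ds = Dataset.from_list(pairs).map(preprocess, batched=True,
                                        remove_columns=["question", "sql"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="t5-dbasql", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=3e-4),
    train_dataset=train_ds,
)
trainer.train()

# Inference: generate SQL for a new DBA-style question.
question = "Drop the index on the salary column of employee table"
ids = tokenizer("translate English to SQL: " + question, return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_length=128)[0], skip_special_tokens=True))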
2 RELATED WORK
Converting natural language (NL) queries into SQL has been studied for a long time. Early methods relied on rule-based systems, but these were limited because they were rigid and could not handle a wide variety of queries or complex database structures (Kumar, 2014).
Neural network-based models brought significant improvements to converting natural language into SQL, especially with sequence-to-sequence (Seq2Seq) models. For example, Seq2SQL used reinforcement learning to generate SQL commands, addressing issues of query structure and accuracy (V. Zhong and Socher, 2017). Later models such as SyntaxSQLNet (T. Yu, 2018b), F-SemtoSql (Q. Li and Zhong, 2020), and TypeSQL (T. Yu and Radev, 2018) became even more accurate by adding rules and using information about data
types. They also introduced attention mechanisms,
which helped the models focus on the important parts
of the input, making it easier to handle more compli-
cated queries.
Along with NLIDB, NL2VIS (Natural Language to Visualizations) systems such as NL4DV (A. Narechania and Stasko, 2021), Advisor (C. Liu and Yuan, 2021), and ncNet (Y. Luo and Qin, 2022) are becoming popular because they allow non-technical users to generate business insights as charts, graphs, and other visualizations from the underlying database. Several benchmarks are available for generating visualizations from natural language questions (K. Z. Hu et al., 2019; Y. Luo and Qin, 2021).
The Transformer architecture completely changed NLP-to-SQL tasks, with models such as BERT (J. Devlin and Toutanova, 2018), RoBERTa (K. Ahkouk and Ennaji, 2021), XLNet (Q. Li and Zhong, 2020), T5 (Y. Li and Zhang, 2023), and Codex (Trummer, 2022) providing better context understanding, which improved performance in generating SQL queries.
Fine-tuning models like T5 on SQL datasets made
NL-to-SQL translation more reliable and adaptable,
while OpenAI’s Codex model showed strong ability to generate SQL commands across many different types of queries. (T. Yu, 2018a) introduced Spider, a large, cross-domain dataset with complex multi-table SQL queries, which became an important benchmark for evaluating recent large language models (LLMs) (M. A. Khan and Azam, 2024; N. T. K. Le and Teshebaev, 2023; C. Raffel and Liu, 2020).
This dataset has driven the development of advanced models that generalize effectively. The Spider dataset is a popular benchmark in NLP-to-SQL research. It contains natural language questions linked to complex SQL queries over many different database schemas. Spider is well suited for testing how well models adapt to new database structures and handle multi-table joins and nested queries.
However, it mainly covers DML queries and does not include database administration-related DDL and DCL queries. WikiSQL is another well-known, simpler dataset that focuses on single-table queries created from Wikipedia tables. However, it only includes SELECT queries and does not cover DDL or DCL queries (T. Yu, 2018a).
CoSQL (T. Yu and Su, 2019a) and SParC (T. Yu and Su, 2019b) are based on the Spider dataset but add conversational, multi-step queries in which users refine their queries gradually. However, like Spider, they mostly focus on DML queries and do not cover DDL or DCL queries in detail. While current NLP-to-SQL datasets provide a strong foundation for developing models that handle DML queries, there is a clear gap in datasets representing DBA-related DDL and DCL queries. Addressing this gap is important