Evaluating ChatGPT’s Ability to Detect Naming Bugs in Java Methods
Kinari Nishiura 1 (https://orcid.org/0009-0007-7168-1500), Atsuya Matsutomo 2 and Akito Monden 3 (https://orcid.org/0000-0003-4295-207X)
1 Faculty of Information and Human Sciences, Kyoto Institute of Technology, Kyoto, Japan
2 School of Engineering, Okayama University, Okayama, Japan
3 Graduate School, Okayama University, Okayama, Japan
k-nishiura@kit.ac.jp, pqll69dj@s.okayama-u.ac.jp, monden@okayama-u.ac.jp
Keywords: Java, Naming Bug, ChatGPT, Large Language Models, Machine Learning.
Abstract: In Java programming, large-scale and complex functions are realized by combining multiple methods. When method names do not match their functionality, readability decreases, making maintenance challenging. Although several machine learning models have been proposed to detect such naming bugs, they require extensive training data, limiting user accessibility. Recently, large language models (LLMs) like ChatGPT have gained popularity and show potential for code comprehension tasks. This study evaluates the performance of ChatGPT in detecting naming bugs using the same datasets as in previous machine learning studies. We evaluated detection accuracy through traditional methods, various prompt adjustments, and more direct approaches. The results indicate that, while ChatGPT does not surpass traditional models, it can match their accuracy with appropriately structured prompts, requiring no additional training.
1 INTRODUCTION
In software development, large-scale and complex functions are typically realized by combining multiple processing units. In Java programming specifically, methods and classes represent the smallest units that implement specific functionalities. Developers have the flexibility to name methods and classes freely; however, the names of methods are particularly expected to intuitively reflect their behavior (McConnell, 2004). While method names do not influence function execution, they are crucial for code readability and serve as valuable cues for understanding the code, especially when reviewed by other developers (Boswell and Foucher, 2011; Martin, 2008). When method names fail to align with their behavior, reducing readability, this issue is referred to as a "naming bug" (Høst and Østvold, 2009), and it requires correction.

Several methods for detecting naming bugs have been proposed. Liu et al. (Liu et al., 2019) introduced a technique utilizing Doc2Vec, Word2Vec, and CNN models. Minehisa et al. proposed methods using Doc2Vec and Sent2Vec (Minehisa et al., 2021), as well as approaches based on machine learning models leveraging Transformer architectures (Minehisa et al., 2022). However, these approaches require collecting appropriate training data and training models, which can pose challenges for easy implementation by users.
Recently, the proliferation of AI services powered by pre-trained large language models (LLMs), such as ChatGPT (https://openai.com/chatgpt), has attracted significant attention. These services enable users to interact with AI trained on a vast corpus of web-based knowledge through inputs known as prompts. The notable strengths of pre-trained models include their advanced comprehension and adaptability across various use cases. Leveraging these capabilities, such models can also interpret source code written in programming languages, suggesting the potential to detect naming bugs with prompts specifically designed for code analysis.
In this study, we evaluate ChatGPT's performance in detecting naming bugs using the same dataset employed in previous research focused on machine learning-based detection (Liu et al., 2019; Minehisa et al., 2021; Minehisa et al., 2022). Our experiments target method-level data extracted from 430 open-source software projects, utilizing GPT-3.5 Turbo and GPT-4o mini via a web API. We evaluate not only with the detection criteria established in prior studies but also by modifying the input information and employing more direct detection strategies. The results indicate that, while ChatGPT does not significantly outperform existing machine learning models, it can achieve comparable detection performance depending on how the input information is structured. This means that, by using ChatGPT, users can avoid the time and effort required to collect training data and train a machine learning model for the task of naming bug detection.
The structure of the paper is as follows. Section 2 reviews the related work referenced in this study. Section 3 details the experimental setup. Section 4 shows the experimental results. Section 5 conducts additional experiments based on the results and deepens the discussion. Section 6 discusses threats to validity. Finally, Section 7 concludes the paper.
2 RELATED WORK
Several methods utilizing machine learning models have been proposed for detecting naming bugs. Liu et al. (Liu et al., 2019) introduced an approach that combines natural language processing and machine learning techniques to automatically evaluate method names using Doc2Vec, Word2Vec, and CNN. Minehisa et al. (Minehisa et al., 2021) reported that replacing the combination of Word2Vec and CNN with the more lightweight Sent2Vec model could achieve nearly equivalent performance. Additionally, Minehisa et al. (Minehisa et al., 2022) proposed a method in which a Transformer-based estimation model predicts method names based on the method's content and verifies whether the predicted names match the actual ones.
In these methods, the detection of naming bugs is not performed as a direct binary classification task to determine whether a method name is appropriate. Instead, the models infer a method name from the method's Abstract Syntax Tree (AST) and assess whether the inferred name's prefix matches the actual method name's prefix. For example, if a method named getReadTimeout has its prefix inferred as get by the model based on the given AST, the method name getReadTimeout is considered free of naming bugs. Conversely, if the model infers a different prefix (e.g., set or write), getReadTimeout would be identified as having a naming bug.

Our experiment partially follows this method of judgement.
3 EXPERIMENT DESIGN
3.1 Overview
Through evaluation experiments using a dataset containing actual naming bugs, we compared the accuracy of naming bug detection using ChatGPT with that of traditional detection methods. The comparative methods include three prior approaches: the method by Liu et al. using Doc2Vec, Word2Vec, and CNN (Liu et al., 2019); the approach by Minehisa et al. using Sent2Vec (Minehisa et al., 2021); and the Transformer-based method by Minehisa et al. (Minehisa et al., 2022). For naming bug detection using ChatGPT, in addition to employing the same detection approach as the traditional methods, a total of six different detection patterns were tested.

This experiment utilized the same dataset and evaluation metrics as in previous research (Minehisa et al., 2022). Consequently, the detection accuracies of the traditional methods were taken directly from the results reported there. Details on the frameworks and parameters of these machine learning models can be found in the original papers. The dataset consisted of Java method data collected from 430 open-source software projects. Detailed descriptions of the detection patterns using ChatGPT and the dataset are provided in Sections 3.2 and 3.3, respectively.
Two versions of ChatGPT models were used: GPT-3.5 Turbo and GPT-4o mini. GPT-3.5 Turbo is accessible to general users for free via ChatGPT, while GPT-4o mini, although more advanced than GPT-3.5, offers higher speed and lower costs, making it suitable for this experiment. The experiments were conducted in October 2024, specifically referencing gpt-3.5-turbo-0125 and gpt-4o-mini-2024-07-18. Interactions with ChatGPT were automated through the Web API using the Python library openai 1.52.1.
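For illustration, one automated query can be sketched as follows. This is a minimal sketch under our setup rather than the exact experiment script; the model identifier is one of those listed above.

# Minimal sketch of one query through the Web API, assuming the
# openai>=1.x Python client and an API key in the OPENAI_API_KEY
# environment variable.
from openai import OpenAI

client = OpenAI()

def ask_chatgpt(prompt: str, model: str = "gpt-4o-mini-2024-07-18") -> str:
    """Send a single prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()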
3.2 Use of ChatGPT
In this experiment, we evaluate not only the case where the pre-trained model from previous studies is simply replaced with ChatGPT, but also scenarios where the input information to the model is modified or where the evaluation method itself is changed. The six patterns for using ChatGPT to detect naming bugs in this experiment are outlined below, along with the abbreviation used for each pattern in the experiments.
ex1A: The method content is represented by an AST-analyzed token sequence, and the original method name's prefix is concealed. ChatGPT is tasked with suggesting the most appropriate method name, and a naming bug is detected if the suggested name's prefix does not match the original (a sketch of this prefix comparison is given after the list). This method follows the exact evaluation approach used in previous studies.
ex1J: In addition to hiding the original method name's prefix, the method content is provided as Java source code instead of an AST-analyzed token sequence. The method name in the source code is concealed.

ex2A: Only the AST-analyzed token sequence of the method content is provided, and ChatGPT is tasked with suggesting the appropriate method name, without being given the original method name.

ex2J: Only the Java source code of the method content is given, and ChatGPT is tasked with suggesting the appropriate method name. The method name in the source code is concealed.

ex3A: Instead of suggesting a method name, both the original method name and the AST-analyzed token sequence of the method content are provided, and ChatGPT is asked to directly assess whether the method name is appropriate.

ex3J: Both the original method name and the Java source code of the method content are provided, and ChatGPT is asked to directly assess whether the method name is appropriate.
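The prefix comparison underlying ex1 and ex2 can be sketched as below; splitting camelCase names on word boundaries is our assumption about how prefixes are obtained, not a detail reported by the prior studies.

import re

def prefix(method_name: str) -> str:
    """Return the first camelCase word of a method name, lowercased."""
    # "getReadTimeout" -> ["get", "Read", "Timeout"] -> "get"
    words = re.findall(r"[a-z0-9]+|[A-Z][a-z0-9]*", method_name)
    return words[0].lower() if words else ""

def is_naming_bug(suggested_name: str, original_name: str) -> bool:
    """Flag a naming bug when the suggested and original prefixes differ."""
    return prefix(suggested_name) != prefix(original_name)

# Example: suggesting "setReadTimeout" for "getReadTimeout" flags a bug,
# while suggesting "getTimeout" does not, because only prefixes are compared.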
In this study, the instructions given to ChatGPT are referred to as "prompts". An example of the prompt used in ex1A is shown in Figure 1. This prompt consists of three paragraphs followed by a method name and its corresponding Abstract Syntax Tree (AST). The first paragraph specifies the expected behavior from ChatGPT. It has been reported that providing clear and specific roles and instructions in prompts generally leads to better results (Bsharat et al., 2024), which we followed in this study. The second paragraph describes the method information given to ChatGPT, while the third paragraph requests the output from ChatGPT.

Prompt:
You are a genius mastering the Java language. Please output only a one-line answer to the question below with no introduction, no explanation, no embellishments, only code.
I will show you the flattened token sequence of a Java method's AST with node information, and the method name with the first word hidden.
Please propose a perfect method name that anyone can understand, keeping in mind "what kind of processing is performed" and "what kind of value is returned". Please output only your proposed method name. No reason is required.
[Method name]
***ReadTimeout
[AST]
ReturnStatement return ThisExpression this VariableName integerVar

Figure 1: Example of prompt for ex1A.

The second and third paragraphs of the prompt are modified for each experiment, from ex1A to ex3J, as needed. Variations of the second paragraph are summarized in Table 1, while the variations of the third paragraph are shown in Table 2. Additionally, the method information provided after these sentences must be appropriately adjusted for each experiment.

Table 1: Variations in the second paragraph of the prompt used in each experiment.

ex1A: I will show you the flattened token sequence of a Java method's AST with node information, and the method name with the first word hidden.
ex1J: I will show you a Java method source code without method name, and the method name with the first word hidden.
ex2A: I will show you only the flattened token sequence of a Java method's AST with node information.
ex2J: I will show you a Java method source code. The method name is hidden.
ex3A: I will show you the method name, and the flattened token sequence of a Java method's AST with node information.
ex3J: I will show you the method name, and the Java method source code without method name.

Table 2: Variations in the third paragraph of the prompt used in each experiment.

ex1, ex2: Please propose a perfect method name that anyone can understand, keeping in mind "what kind of processing is performed" and "what kind of value is returned". Please output only your proposed method name. No reason is required.
ex3: Please determine if the method name is clear enough for everyone to understand, keeping in mind "what kind of processing is performed" and "what kind of value is returned". Please output only "Yes" or "No". No reason is required.
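To make the per-experiment variation concrete, the three paragraphs can be assembled per pattern. The sketch below is populated from Tables 1 and 2 (only ex1A and ex3J shown); the constant and function names are ours, not from the original scripts.

# Sketch of assembling the three-paragraph prompt for each pattern.
FIRST = ("You are a genius mastering the Java language. Please output only a "
         "one-line answer to the question below with no introduction, no "
         "explanation, no embellishments, only code.")

SECOND = {
    "ex1A": ("I will show you the flattened token sequence of a Java method's "
             "AST with node information, and the method name with the first "
             "word hidden."),
    "ex3J": ("I will show you the method name, and the Java method source "
             "code without method name."),
}

THIRD = {
    "ex1": ('Please propose a perfect method name that anyone can understand, '
            'keeping in mind "what kind of processing is performed" and '
            '"what kind of value is returned". Please output only your '
            'proposed method name. No reason is required.'),
    "ex3": ('Please determine if the method name is clear enough for everyone '
            'to understand, keeping in mind "what kind of processing is '
            'performed" and "what kind of value is returned". Please output '
            'only "Yes" or "No". No reason is required.'),
}

def build_prompt(pattern: str, method_info: str) -> str:
    # "ex1A"[:3] == "ex1" selects the third paragraph shared by A/J variants.
    return "\n\n".join([FIRST, SECOND[pattern], THIRD[pattern[:3]], method_info])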
3.3 Datasets
To evaluate detection accuracy, we used the same dataset (https://github.com/TruX-DTF/debug-method-name/tree/master/Data/TestData) as in Liu et al.'s prior research (Liu et al., 2019). This dataset was also used in the studies by Minehisa et al. (Minehisa et al., 2021; Minehisa et al., 2022).
The dataset comprises a total of 2,119,218 methods extracted using a proprietary algorithm from 430 open-source Java projects. In previous studies, 2,116,413 of these methods were used as training data, while 2,805 were designated as test data. For our experiment, since we leveraged ChatGPT, which has already been pre-trained on extensive web-based data, we only used the test data for evaluation. Out of this test data, only 1,268 methods could be matched correctly with their method names and corresponding Java source code from the published dataset. Among these, 635 were labeled as containing naming bugs, and 633 were labeled as bug-free.
The method for constructing the dataset, as described in Liu et al.'s study (Liu et al., 2019), is outlined below. First, all methods from the 430 open-source projects were collected, excluding: (1) main methods, (2) constructors, (3) empty methods (abstract methods or methods with no operations), (4) methods with names containing "example", "sample", or "template", and (5) methods whose names do not include alphabetic characters (e.g., methods using only underscores).
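A sketch of these exclusion rules follows; it assumes the main/constructor/empty flags have been determined beforehand from the method's AST, which is our simplification.

import re

def is_excluded(name: str, is_main: bool, is_constructor: bool,
                is_empty: bool) -> bool:
    """Apply exclusion rules (1)-(5) to one collected method."""
    if is_main or is_constructor or is_empty:  # rules (1)-(3)
        return True
    lowered = name.lower()
    if any(w in lowered for w in ("example", "sample", "template")):  # rule (4)
        return True
    if not re.search(r"[A-Za-z]", name):  # rule (5): no alphabetic characters
        return True
    return False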
Methods with more than 95 tokens after AST transformation were excluded due to model limitations, resulting in the final dataset of 2,119,218 methods. Subsequently, GitHub commit histories of the projects were analyzed to identify methods meeting the following criteria as candidates for naming bugs:
Criterion 1: The commit involved a change where only the method name was modified, without altering the method's body.
Criterion 2: The reason for the name change was not to correct a typographical error.
Criterion 3: The method's content and name remained unchanged after the renaming commit until the time of data collection.
Methods meeting Criterion 1 were expected to represent cases where a suboptimal name was corrected to an appropriate one. Criterion 2 aimed to remove noise, while Criterion 3 ensured the consistency of the data. This process yielded 2,805 methods as test data. Commit history analysis provided the names before and after the renaming, and visual inspection by the authors of the original study (Liu et al., 2019) confirmed whether the initial names constituted naming bugs. As a result, 1,403 methods were labeled as containing naming bugs, while 1,402 were labeled as bug-free.
Next, we describe the matching process between method names and Java source code. While the published dataset maintained sequential order for the original and revised method names, tokenized ASTs, and labels, it did not do so for the Java source code. Given that the source code contained the updated method names, we performed a straightforward matching procedure, restricting the linkage to methods with unique names within the dataset. This approach resulted in 1,268 matched test data points, which were used in our experiment.
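The unique-name matching can be sketched as follows; the record field names are hypothetical, since the published dataset's exact schema is not reproduced here.

from collections import Counter

def match_names_to_sources(records, sources):
    """Link test records to Java source code via unique method names.

    records: list of dicts with a (hypothetical) "new_name" key;
    sources: dict mapping each method name found in the source files
             to its Java text. Only names occurring exactly once are linked.
    """
    name_counts = Counter(r["new_name"] for r in records)
    matched = []
    for r in records:
        name = r["new_name"]
        if name_counts[name] == 1 and name in sources:
            matched.append({**r, "source": sources[name]})
    return matched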
3.4 Evaluation Metrics
Following prior research, the evaluation metrics for naming bug detection include accuracy, precision, recall, and F1-score. The formulas for these metrics are shown in Table 3. The confusion matrix used for deriving these metrics is presented in Table 4.
The criteria for determining naming bugs vary depending on the specific approach used with ChatGPT. When employing the method similar to previous studies, where an appropriate method name is predicted and evaluated based on prefix matching, the comparison baseline depends on the labels in the test data. For test data labeled as containing naming bugs, the predicted method name is compared to the pre-correction name. For data labeled as bug-free, the comparison is made with the post-correction name. This approach aligns with the methodology of existing research for consistency in evaluation. In the method where ChatGPT directly judges the presence of naming bugs, the results provided by ChatGPT are used without further transformation.
Given that ChatGPT's responses can exhibit variability, its heuristic nature must be considered. To mitigate this variability, naming bug detection for each method was conducted independently three times, with the majority outcome taken as the final result.
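The three-trial majority vote amounts to the following sketch:

from collections import Counter

def majority_vote(trials):
    """Return the most frequent of three independent judgements,
    e.g. majority_vote([True, False, True]) -> True."""
    return Counter(trials).most_common(1)[0][0]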
Table 3: Evaluation metrics formulas.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (1)
Precision = TP / (TP + FP)    (2)
Recall = TP / (TP + FN)    (3)
F1 = (2 × Recall × Precision) / (Recall + Precision)    (4)
Table 4: Confusion matrix.

               | Judged as naming bug | Judged as not naming bug
Naming bug     | True Positive (TP)   | False Negative (FN)
Not naming bug | False Positive (FP)  | True Negative (TN)
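For reference, a small sketch computing the four metrics of Table 3 from the confusion counts of Table 4:

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}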
4 EXPERIMENT RESULTS
The results of the experiment are shown in Table 5, with the best result for each metric marked with an asterisk. The highest scores were observed as follows: accuracy and precision with GPT-4o mini in ex2J, recall with GPT-3.5 Turbo in ex1A, and F1-score with GPT-3.5 Turbo in ex2A.
Firstly, there was minimal difference in F1-scores between GPT-3.5 Turbo and GPT-4o mini for most of the ChatGPT usage methods. The exception is when determining naming bugs directly from a given method name and AST token sequence, where GPT-3.5 Turbo showed significantly lower performance. This was primarily because GPT-3.5 Turbo in this case often predicted almost all cases as bug-free, resulting in very low recall. In contrast, GPT-4o mini exhibited consistent performance whether given AST tokens or Java source code, suggesting greater robustness.
Next, the method of directly asking ChatGPT to detect naming bugs (ex3) yielded a lower F1-score compared to the method of generating a suggested method name and comparing its prefix to the actual name. This indicates that using ChatGPT for naming bug detection is more effective when following the methodology of previous research, where an appropriate method name is proposed based on the method content and compared for prefix differences.
When comparing the information provided to ChatGPT, there was little difference between using AST token sequences and Java source code (except for ex3). Given the ease of providing raw Java source code, this approach is more practical. Additionally, there was no significant difference between providing the actual method name with a hidden prefix (ex1) and not providing any method name information (ex2). This result implies that partially given method names do not introduce noise, but also that hints about the method name are not particularly necessary.
In summary, using ChatGPT to directly judge naming bugs is less effective. Using GPT-3.5 Turbo for ex1A (providing AST token sequences and a hidden prefix) slightly lowered the F1-score compared to other cases, but overall, most approaches yielded similar performances.
Next, we compare the accuracy of the ChatGPT-based approaches with that of previous studies. Among previous studies, the highest accuracy reported was by Minehisa et al.'s Transformer-based approach, with an F1-score of 0.679 (Minehisa et al., 2022). The direct judgment method (ex3) performed worse than this, while the prefix comparison methods (ex1 and ex2) occasionally surpassed this score but did not show significant overall differences. ChatGPT outperformed non-Transformer prior methods, indicating that while it may not exceed the most advanced models without fine-tuning, it can achieve comparable results with straightforward data input.
This suggests that using ChatGPT for naming bug detection offers substantial convenience, as it does not require specialized training, skillsets, or extensive data collection by each user. Ordinary programmers can utilize ChatGPT for effective naming bug detection without deep expertise.
As described in Section 3.4, three independent trials were conducted for each method, with the majority result used as the final outcome. An analysis of the consistency across trials, shown in Table 6, revealed that GPT-4o mini exhibited higher consistency compared to GPT-3.5 Turbo. Consistency was lowest for ex3, while ex1 and ex2 with GPT-4o mini had approximately 90% consistency. This indicates that while ChatGPT responses have inherent randomness, GPT-4o mini used with prefix comparison shows strong result consistency, reducing the need for multiple trials.
Table 5: Experiment results (* marks the best result for each metric).

Method                                        | Accuracy | Precision | Recall | F1-score
Doc2Vec, Word2Vec, and CNN (Liu et al., 2019) | 0.482    | 0.489     | 0.775  | 0.599
Sent2Vec (Minehisa et al., 2021)              | 0.531    | 0.519     | 0.843  | 0.643
Transformer (Minehisa et al., 2022)           | 0.582    | 0.551     | 0.884  | 0.679
ex1A (GPT-3.5 Turbo)                          | 0.506    | 0.502     | 0.967* | 0.661
ex1A (GPT-4o mini)                            | 0.584    | 0.553     | 0.863  | 0.674
ex1J (GPT-3.5 Turbo)                          | 0.599    | 0.563     | 0.878  | 0.686
ex1J (GPT-4o mini)                            | 0.605    | 0.574     | 0.809  | 0.671
ex2A (GPT-3.5 Turbo)                          | 0.573    | 0.541     | 0.940  | 0.687*
ex2A (GPT-4o mini)                            | 0.603    | 0.567     | 0.869  | 0.686
ex2J (GPT-3.5 Turbo)                          | 0.618    | 0.582     | 0.826  | 0.683
ex2J (GPT-4o mini)                            | 0.619*   | 0.584*    | 0.821  | 0.683
ex3A (GPT-3.5 Turbo)                          | 0.498    | 0.492     | 0.152  | 0.232
ex3A (GPT-4o mini)                            | 0.509    | 0.507     | 0.594  | 0.547
ex3J (GPT-3.5 Turbo)                          | 0.501    | 0.500     | 0.423  | 0.459
ex3J (GPT-4o mini)                            | 0.512    | 0.509     | 0.646  | 0.569
Table 6: The percentage of times ChatGPT gave the same answer in all three trials.

     | GPT-3.5 Turbo | GPT-4o mini
ex1A | 90.9%         | 92.4%
ex1J | 88.2%         | 94.0%
ex2A | 89.4%         | 89.7%
ex2J | 93.3%         | 95.2%
ex3A | 64.6%         | 80.4%
ex3J | 63.6%         | 85.3%
5 ADDITIONAL EXPERIMENTS AND DISCUSSIONS
This section explores additional analyses and experiments conducted following the previous experimental results.
5.1 Detection Accuracy Evaluation Limited to Methods with Verb Prefixes
A manual inspection of the method names suggested by ChatGPT revealed that nearly all of them had prefixes that were verbs. This aligns with the common best practice that method names should start with verbs. However, upon reviewing the original method names in the dataset, there was a noticeable number of cases where the prefixes were not verbs. This discrepancy in prefix type between the suggested and original method names could potentially lower the detection accuracy for ex1 and ex2. Therefore, we evaluated the accuracy specifically for methods whose original prefixes were verbs.

Table 7: Naming bug detection accuracy by GPT-4o mini for methods with verb-based prefixes.

     | Accuracy | Precision | Recall | F1-score
ex1A | 0.544    | 0.460     | 0.726  | 0.563
ex1J | 0.586    | 0.491     | 0.634  | 0.553
ex2A | 0.574    | 0.483     | 0.743  | 0.585
ex2J | 0.603    | 0.508     | 0.653  | 0.571
ex3A | 0.451    | 0.364     | 0.479  | 0.414
ex3J | 0.436    | 0.364     | 0.528  | 0.431
To identify the part of speech, we used the natural language processing library spaCy (https://spacy.io). We filtered out methods whose prefixes were tagged neither as verbs ("VERB") nor as auxiliary verbs ("AUX"). This left us with 748 methods, of which 303 were labeled as containing naming bugs and 445 were labeled as bug-free.
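This filtering step can be sketched as below. Tagging an isolated word with spaCy is a simplification of the analysis, and the camelCase splitter mirrors the prefix extraction assumed in Section 3.2.

import re
import spacy

# Assumes the small English pipeline has been installed beforehand
# (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def has_verb_prefix(method_name: str) -> bool:
    """True if the first camelCase word is tagged VERB or AUX by spaCy."""
    words = re.findall(r"[a-z0-9]+|[A-Z][a-z0-9]*", method_name)
    if not words:
        return False
    token = nlp(words[0].lower())[0]
    return token.pos_ in {"VERB", "AUX"}

# Example: "getReadTimeout" would be kept, "timeoutValue" filtered out.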
The naming bug detection accuracy for these methods only, using GPT-4o mini, is shown in Table 7. Compared to Table 5, and contrary to our expectations, restricting the evaluation to methods with verb prefixes led to a decline in detection performance. The main reason appears to be that, for methods with verb prefixes, ChatGPT's predictions were split more evenly between positive (naming bug present) and negative (no naming bug) outcomes. In contrast, when evaluating other methods, most were predicted as positive, leading to very high recall and consequently higher F1-scores. This analysis suggests that even for ChatGPT, detecting naming bugs in method names with verb prefixes is more challenging.
5.2 Evaluation of Detection Accuracy Based on Semantic Similarity of Prefixes
In the experiments described in Section 3, ex1 and ex2 used exact matches of method name prefixes to determine naming bugs. However, due to the existence of synonyms in English, prefixes that do not match exactly might still be semantically similar, potentially leading to false positives in naming bug detection. Examples of such synonymous prefixes include get and obtain, start and begin, or run and execute.
To address this, we evaluated detection accuracy considering semantic similarity in prefixes. While several natural language processing techniques can be employed for synonym detection, we chose a straightforward approach using ChatGPT for this task. Specifically, we used the GPT-4o mini model and performed a single judgment for each comparison.
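As an illustration, the single synonym judgement could be issued as below, reusing the ask_chatgpt helper sketched in Section 3.1; the exact prompt wording here is our assumption.

def prefixes_are_synonyms(p1: str, p2: str) -> bool:
    """Ask GPT-4o mini once whether two prefixes are synonymous."""
    prompt = (f'Are the English words "{p1}" and "{p2}" synonyms in the '
              'context of Java method name prefixes? '
              'Please output only "Yes" or "No".')
    answer = ask_chatgpt(prompt, model="gpt-4o-mini-2024-07-18")
    return answer.strip().lower().startswith("yes")

# Example: get/obtain, start/begin, and run/execute would be expected
# to be judged synonymous.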
The results of using semantic synonym detection for ex1 and ex2 (where GPT-4o mini was used to propose method names) are shown in Table 8.

Table 8: Naming bug detection accuracy using semantic similarity of prefixes by GPT-4o mini.

     | Accuracy | Precision | Recall | F1-score
ex1A | 0.606    | 0.580     | 0.760  | 0.658
ex1J | 0.602    | 0.588     | 0.673  | 0.628
ex2A | 0.615    | 0.588     | 0.765  | 0.665
ex2J | 0.614    | 0.598     | 0.695  | 0.643

Compared to Table 5, precision slightly increased, but recall and F1-score decreased slightly, resulting in detection performance significantly lower than that of Minehisa et al.'s method.
This suggests that considering semantic similarity when detecting naming bugs via prefix matching does not improve accuracy. One possible explanation is that ChatGPT's synonym detection may not be entirely reliable, which calls for more robust synonym detection techniques in future research.
5.3 Evaluation of Detection Accuracy with Stricter Criteria for ChatGPT
In the experiments of Section 3, ex3 involved asking ChatGPT whether a method name was "clear enough for everyone to understand", and naming bugs were identified based on negative responses. While this criterion was somewhat suitable, it included a subjective element that could make the guidance ambiguous, particularly when assessing edge cases in the "middle layer".
To address this, we refined the prompt to enforce stricter criteria for determining naming bugs, focusing on detecting method names that were clearly inappropriate. For this purpose, the third paragraph of the prompts for ex3A and ex3J was modified as follows: "Please judge whether the method name given is clearly inappropriate. Please output only 'Yes' or 'No' in one line. No reason is required." This was conducted using the GPT-4o mini model with a single assessment per method.
The results are shown in Table 9.

Table 9: Naming bug detection accuracy of GPT-4o mini with modified prompt criteria.

     | Accuracy | Precision | Recall | F1-score
ex3A | 0.539    | 0.647     | 0.171  | 0.270
ex3J | 0.735    | 0.874     | 0.548  | 0.674

The data reveal a significant decrease in accuracy when AST token sequences were provided, whereas accuracy improved when Java source code was used as input. Compared to the method by Minehisa et al., the F1-score became comparable, while accuracy and precision were considerably higher.
This indicates that, although ex3 showed significantly lower accuracy than the existing method prior to the prompt modification, adjusting the prompt to detect only clearly inappropriate names achieved accuracy comparable to established approaches. This suggests that users can rely on ChatGPT for naming bug detection without performing indirect and labor-intensive analyses such as proposing method names and comparing prefixes, enhancing convenience.
One possible reason for the improvement in accuracy is that ChatGPT's judgement of the "middle layer", names that are neither completely appropriate nor completely inappropriate, previously did not match the labelling by the authors of the prior study. They judged many of the "middle layer" names not to be naming bugs, and the improved prompt encodes the same judgement criterion.
However, recall was still much lower than that of existing methods, implying that if comprehensive detection is a priority, existing approaches remain more suitable. The reason for the drastic drop in accuracy when AST token sequences were used remains unclear and may warrant further investigation.
6 THREATS TO VALIDITY
Potential threats to the validity of this study include the following:
First, we compared our results with the detection accuracy reported in previous work (Minehisa et al., 2022) without conducting replication experiments of prior methods, due to time constraints. While the previous research evaluated a total of 2,805 test data points, our study was limited to the 1,268 test cases that could be aligned with corresponding source code. This discrepancy in the evaluation dataset could potentially affect the detection accuracy. Future work should include replicating the prior methods and evaluating them on exactly the same dataset to ensure a fair comparison.
Second, the test data used in this study were labeled by the authors of prior research (Liu et al., 2019) to indicate the presence or absence of naming bugs. The accuracy of these labels cannot be guaranteed. Additionally, the criteria for determining naming bugs that we provided to ChatGPT may differ from those used in the original labeling. Re-labeling the data ourselves and aligning the labeling criteria with those given to ChatGPT would enable a more equitable comparison.
Lastly, due to time and resource constraints, only GPT-3.5 Turbo and GPT-4o mini were used as ChatGPT models. Higher-performing models, such as more advanced versions of GPT-4, may yield improved results. Exploring other generative AI services could also be beneficial. However, given the simplicity of the tasks assigned in this study, it is possible that the use of more advanced models would not significantly affect the accuracy.
7 CONCLUSION
In this study, we used ChatGPT to detect naming bugs and compared its detection accuracy with that of conventional methods. In the experiments, the same evaluation data as in previous research were used, and in addition to the detection method of the conventional approaches, we also evaluated cases where the input was changed and cases where direct binary classification was performed. Both the GPT-3.5 Turbo and GPT-4o mini ChatGPT models were used. As a result, we found that ChatGPT does not greatly exceed the detection performance of existing machine learning models, but that, depending on how the information is given, it achieves detection performance equivalent to existing methods without special training. There was no significant difference between the ChatGPT models used, and also no significant difference between giving ChatGPT token sequences obtained by AST analysis, as in the previous study, and giving the Java source code as is. When ChatGPT was given method names and processing content to directly detect naming bugs, it performed at the same level as previous research when given Java source code and prompted with stricter detection criteria.
Overall, using ChatGPT to detect naming bugs is effective in increasing developer convenience, making it possible to point out and fix naming bugs more easily.
Future work includes a more detailed evaluation of existing methods and expanding the naming bug data used for evaluation. More specifically, comparisons with recent related research (Wang et al., 2024) that was not included in this experiment, and comparisons with other methods that focus on the generation of method names, are important future issues. Evaluating other LLMs, or extending the evaluation to programming languages other than Java, could also be considered.
ACKNOWLEDGEMENTS
This work was supported in part by JSPS KAKENHI Grant numbers JP23K16863 and JP20H05706.
REFERENCES
Boswell, D. and Foucher, T. (2011). The Art of Readable Code: Simple and Practical Techniques for Writing Better Code. O'Reilly & Associates.
Bsharat, S. M., Myrzakhan, A., and Shen, Z. (2024). Principled instructions are all you need for questioning LLaMA-1/2, GPT-3.5/4. arXiv preprint arXiv:2312.16171.
Høst, E. W. and Østvold, B. M. (2009). Debugging method names. In Proceedings of the 23rd European Conference on Object-Oriented Programming (ECOOP 2009), Genoa, pages 294–317, Berlin, Heidelberg. Springer-Verlag.
Liu, K., Kim, D., Bissyandé, T., Kim, T., Kim, K., Koyuncu, A., Kim, S., and Le Traon, Y. (2019). Learning to spot and refactor inconsistent method names. In Proc. 41st Int'l Conf. Softw. Eng., pages 1–12.
Martin, R. C. (2008). Clean Code: A Handbook of Agile
Software Craftsmanship. Prentice Hall.
McConnell, S. (2004). Code Complete, Second Edition. Microsoft Press, USA.
Minehisa, T., Aman, H., and Kawahara, M. (2022). Naming bug detection using transformer-based method name suggestion. Computer Software (in Japanese), pages 17–23.
Minehisa, T., Aman, H., Yokogawa, T., and Kawahara, M. (2021). A comparative study of vectorization approaches for detecting inconsistent method names. Computer and Information Science 2021—Summer, 985:125–144.
Wang, T., Zhang, Y., Jiang, L., Tang, Y., Li, G., and Liu, H. (2024). Deep learning based identification of inconsistent method names: How far are we? Empirical Software Engineering, 30.