A Data-driven Methodology towards Interpreting Readability
against Software Properties
Thomas Karanikiotis, Michail D. Papamichail, Ioannis Gonidelis, Dimitra Karatza
and Andreas L. Symeonidis
Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki,
Intelligent Systems & Software Engineering Labgroup, Information Processing Laboratory, Thessaloniki, Greece
Keywords:
Developer-perceived Readability, Readability Interpretation, Size-based Clustering, Support Vector
Regression.
Abstract:
In the context of collaborative, agile software development, where effective and efficient software maintenance
is of utmost importance, the need to produce readable source code is evident. Towards this direction, several
approaches aspire to assess the extent to which a software component is readable. Most of them rely on experts
who are responsible for determining the ground truth and/or setting custom evaluation criteria, leading to results
that are context-dependent and subjective. In this work, we employ a large set of static analysis metrics along
with various coding violations towards interpreting readability as perceived by developers. In an effort to
provide a fully automated and extendible methodology, we refrain from using experts; rather we harness data
residing in online code hosting facilities towards constructing a dataset that includes more than one million
methods that cover diverse development scenarios. After performing clustering based on source code size, we
employ Support Vector Regression in order to interpret the extent to which a software component is readable
on three axes: complexity, coupling, and documentation. Preliminary evaluation on several axes indicates that
our approach effectively interprets readability as perceived by developers against the aforementioned three
primary source code properties.
1 INTRODUCTION
The term readability can be described as “the ease of
a reader to understand a written text”. In the case of
typical text, this definition is straightforward; however, from a software engineering point of view, and specifically when we refer to source code, readability is a complex concept linked to several factors beyond understanding the specifics of each programming language. These factors include comprehending the purpose, the control flow, and the functionality that the source code serves, aggregated at the level of code block, method, class, component, and/or system.
The vital importance of readability as a software
quality attribute is more than evident given the fact
that it is closely related to maintainability, which is
one of the most important quality characteristics ac-
cording to ISO/IEC 25010:2011 (ISO, 2020). In this
context, where software maintenance involves fixing
bugs as well as evolving the source code so as to
cover future requirements (both functional and non-
functional), several studies suggest that reading code
is one of the most time and effort-consuming tasks
while maintaining software (Rugaber, 2000; Ray-
mond, 1991). On top of the above, according to
Knight and Myers, checking for readability issues
has a positive impact on several quality attributes
such as portability, maintainability, and reusability
and should thus constitute a special part of the soft-
ware inspection procedure (Knight and Myers, 1993).
Moreover, given the continuously increasing shift towards the reuse-oriented software development paradigm, the need to produce readable software keeps growing.
Several research efforts are directed towards as-
sessing the extent to which software components are
readable (Buse and Weimer, 2010; Posnett et al.,
2011; Dorn, 2012; Scalabrino et al., 2016). The ma-
jority of the proposed approaches employ static anal-
ysis metrics, such as the widely used Halstead met-
rics (Halstead, 1977), in an effort to build readability
evaluation models. These approaches are in essence effective; however, they exhibit certain inherent weaknesses. First, readability evaluation in the majority of the proposed methodologies depends on quality
experts who are responsible for defining the readabil-
ity degree of each software component under evalu-
ation and/or determining the appropriate thresholds
of metrics that result in higher readability. In addi-
tion, given that expert-aided evaluation is a proce-
dure that requires a significant amount of time and hu-
man resources, the size of the used datasets is small
and thus covers only a few use cases. As a result,
the proposed approaches provide a somewhat subjec-
tive evaluation and are restricted to certain develop-
ment scenarios. Finally, providing a single readability
score without actionable recommendations regarding
the specific axes that need improvement makes it difficult for developers to perform targeted audits towards
readability improvements.
In this work, we aspire to overcome the aforemen-
tioned limitations. We employ data residing in online
code hosting facilities (i.e. GitHub) in order to build a
fully-automated and interpretable readability evalua-
tion methodology that expresses the extent to which a
software component is readable as perceived by de-
velopers. Upon performing static analysis on more than 1 million methods of the most popular and reused
GitHub Java projects, we define a readability score
at the method level based on the compliance of the
source code with the widely accepted code writing
practices as reflected in the number of identified vi-
olations. In order to cover various assessment scenar-
ios, we employ clustering for segmenting our dataset
into coherent groups that share similar (within clus-
ter) characteristics. Subsequently, for each cluster,
we employ Support Vector Regression and construct
three models that enable a comprehensive and inter-
pretable evaluation of the readability degree on three
axes, each corresponding to a primary source code
property: complexity, coupling, and documentation.
The rest of this paper is organized as follows. Section 2 provides background information on static analysis metrics and reviews current approaches to readability estimation, while Section 3 describes our benchmark dataset and the design of a scoring mechanism for the readability degree of source code components. The developed models are discussed in Section 4, while Section 5 evaluates the efficiency of our readability interpretation methodology against different axes. Section 6 discusses threats to validity, and, finally, Section 7 concludes this paper and provides insights for further research.
2 RELATED WORK
The constantly increasing demand for producing bet-
ter software products that can live up to the expec-
tations of end-users, while at the same time reduc-
ing time-to-market and staying on budget, has pro-
moted the assessment of software quality aspects to
a key enabler for success. Quality aspects related to
maintainability have attracted a strong research fo-
cus, given the importance of software health and evo-
lution, and readability is one of the software char-
acteristics strongly related to maintainability. Sev-
eral quality metrics have been proposed for various
purposes, such as the recommendation of appropriate
code refactorings (Mkaouer et al., 2015) or the detec-
tion of code smells (Moha et al., 2010). However, the
metrics that are widely used in the above approaches
are often not able to quantify the quality improve-
ments as perceived by the community of developers
(Pantiuchina et al., 2018). Therefore, there is a need for models that can identify the specific code quality metrics which quantify and measure the quality of the source code, as well as its improvements in practice, as perceived by developers.
One of the first approaches towards evaluating readability was made by Buse and Weimer (Buse and Weimer, 2010), who built a descriptive model to classify a given piece of code as “more readable” or “less readable”. The metrics used in the model are mostly related to the structure, the documentation, and the logical complexity of the code, and they were intuitively selected by the authors. The authors recruited 120 human annotators in order to create the ground truth against which their approach was evaluated, correctly classifying almost 80% of the annotated samples. An analysis of the predictive power of the features was also conducted, concluding that 95% of the total variability can be explained by only the first 8 principal components of the features used.
Several subsequent approaches built on top of the work of Buse and Weimer. Posnett et al. (Posnett et al.,
2011) extended the above approach by arguing that
the code size should explicitly be included in a model
that quantifies readability, in order to distinguish the
size dependency from the rest of the features. The
authors showed that the majority of the metrics used by Buse and Weimer were not independent of code size, while code size itself cannot fully explain readability. The metrics analysis also showed that Halstead's Volume (Halstead, 1977) carries considerable explanatory power and, when combined with size metrics, can easily outperform the model proposed by Buse and Weimer. The
model proposed by Posnett et al. formed the basis
for a study (Mannan et al., 2018) that quantifies the
readability score of open source projects, as well as
its evolution over the project lifetime. The study con-
cluded that projects tend to achieve high readability
scores, while they maintain these high scores over
time, despite the many changes that are made.
Contrary to Posnett et al., Dorn (Dorn, 2012) took into account mainly structural and visual perception features, quantifying the changes in code indentation, line length, and comment length with the use of the discrete Fourier transform (DFT) (Bergland, 1969). Dorn evaluated his model based on a human study with 5,000 participants, arguing that it correlates with human judgements 2.3 times better than any previous approach.
Scalabrino et al. (Scalabrino et al., 2016), in an
attempt to improve the previous readability models,
stated that textual features should also be taken into
account. The authors proposed some features that
could extract useful information from the source code
lexicon, such as the number of terms that are simul-
taneously used both in comments and in identifiers,
the number of full-word identifiers, the number of hyponyms (i.e. terms with a more specific meaning), and the readability of the comments (natural language text readability). The proposed features were evaluated through an empirical study, indicating that the model based on both structural and textual features outperforms previously proposed approaches and that a considerable number of code snippets could be correctly classified using the textual features alone. Scalabrino et al. (Scalabrino et al., 2018) extended their previous work by adding new textual features and conducting a large empirical study. The results indicated that the new model achieves slightly higher accuracy, outperforming the state-of-the-art models at the time. Moreover, the readability model achieving higher accuracy was also shown to correlate better with FindBugs warnings.
On the other hand, Choi et al. (Choi et al., 2020)
attempted to build a model based only on structural
features. The evaluation on human-annotated data showed that the model makes the right prediction for more than 70% of the data. The development of a tool, named Instant R. Gauge, which calculates code readability on the fly and helps the developer make the appropriate improvements, is also part of the authors' contribution.
Fakhoury et al. (Fakhoury et al., 2019) conducted
an interesting study, in an attempt to examine how the approaches proposed in the literature perform on code improvements made in readability commits. The results were quite revealing: the proposed readability models are not able to capture readability improvements, while additional metrics, such as the number of incoming invocations, seem to change significantly in readability commits.
In this work, we propose a generic methodology
that evaluates software readability at a method level.
Unlike previous approaches, and in an effort to confront the challenges that originate from the observations made by Fakhoury et al., as well as the limitations of the aforementioned readability evaluation methodologies, we employ information residing in online code hosting
facilities. Upon formulating the ground truth using
a systematic methodology based on the compliance
of the source code with widely accepted code writing
practices (as reflected in the number of coding viola-
tions), we refrain from the limitations imposed by the
use of experts and design a fully automated evalua-
tion methodology. In the context of this methodology
and in an effort to provide interpretable results and
thus actionable recommendations, we employ Sup-
port Vector Regression and analyze the readability de-
gree of a given method on three different axes, each
corresponding to a primary code property.
3 TOWARDS DEFINING
READABILITY
3.1 Benchmark Dataset
In an effort to define readability as perceived by de-
velopers, our primary design choice is harnessing the
deluge of the available data residing in online code
hosting facilities so as to formulate a ground truth that
expresses the extent to which a software component is
readable. Specifically, our dataset contains more than
1 million methods included in the most popular (as
reflected in the number of GitHub stars) and reused
(as reflected in the number of GitHub forks) GitHub
Java projects. We performed static analysis at the method level in order to extract two kinds of information: a) a large set of static analysis metrics that quantify four major source code properties (complexity, coupling, documentation, and size), and b) various identified coding violations regarding widely accepted code writing practices. Given their scope and impact,
these violations are categorized into eight categories
(Best Practices, Documentation, Design, Code Style,
Error Prone, Performance, Multithreading, and Secu-
rity) and three levels of severity (Minor, Major, and
Critical). In order to keep only the violations that are related to readability, we eliminate those in the Performance, Multithreading, and Security categories.
Certain statistics regarding the benchmark dataset
are given in Table 1, while Table 2 presents the cal-
culated static analysis metrics along with their asso-
ciated property. The static analysis metrics were calculated using the SourceMeter tool (sourcemeter, 2020), while the identification of coding violations was performed using the PMD tool (PMD, 2020).
Table 2: Overview of the Computed Static Analysis Metrics.

Property | Metric Name | Metric Description
Complexity | NL | Nesting Level
Complexity | WMC | Weighted Methods per Class
Complexity | HDIF | Halstead Difficulty
Complexity | HEFF | Halstead Effort
Complexity | HNDB | Halstead Number of Delivered Bugs
Complexity | HPL | Halstead Program Length
Complexity | HPV | Halstead Program Vocabulary
Complexity | HTRP | Halstead Time Required to Program
Complexity | HVOL | Halstead Volume
Complexity | McCC | McCabe's Cyclomatic Complexity
Complexity | MI | Maintainability Index
Coupling | NII | Number of Incoming Invocations
Coupling | NOI | Number of Outgoing Invocations
Documentation | CD | Comment Density
Documentation | CLOC | Comment Lines of Code
Documentation | DLOC | Documentation Lines of Code
Documentation | TCD | Total Comment Density
Documentation | TCLOC | Total Comment Lines of Code
Size | LOC | Lines of Code
Size | LLOC | Logical Lines of Code
Size | NOS | Number of Statements
Table 1: Dataset Statistics.

Metric | Value
Number of GitHub projects | 308
Number of Methods | 1,004,589
Number of Metrics | 21
Number of Code Properties | 4
Number of Violations | 193
Number of Violation Categories | 5
Lines of Code Analyzed | 9,003,547
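For illustration purposes only, the following minimal Python sketch shows how a PMD report could be aggregated per file and severity before computing the ground truth described in Section 3.3. The column names and the priority-to-severity mapping are assumptions rather than the actual setup of this work, and mapping violations to individual methods via their line ranges is omitted for brevity.

# Minimal sketch (not the authors' implementation): tallying PMD violations
# per file and severity. Column names and the priority-to-severity mapping
# are assumptions; adjust them to the actual PMD report format in use.
import csv
from collections import Counter, defaultdict

# Hypothetical mapping of PMD priorities (1 = highest) to the three
# severity levels used in this work.
PRIORITY_TO_SEVERITY = {1: "Critical", 2: "Major", 3: "Major", 4: "Minor", 5: "Minor"}

def tally_violations(report_csv):
    """Return, for every analyzed file, the number of violations per severity."""
    counts = defaultdict(Counter)
    with open(report_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            severity = PRIORITY_TO_SEVERITY.get(int(row["Priority"]), "Minor")
            counts[row["File"]][severity] += 1
    return counts

# Example usage (hypothetical file names):
# per_file = tally_violations("pmd_report.csv")
# print(per_file["src/main/java/Foo.java"]["Critical"])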
3.2 Clustering based on Size
Given that our analysis is performed at the method
level and involves more than 1 million methods that
exhibit high diversity both in terms of size and scope,
our first step involves applying clustering techniques
so as to split our dataset into a set of cohesive clusters
that share similar characteristics. This design choice
originates from the fact that in practice, methods of
different size usually serve different functionalities or
follow different architectures. For instance, meth-
ods with a small number of lines of code (< 5) are
mainly used as setters/getters or specific utilities (reading data from files, middleware functions, etc.), while
larger ones mainly provide more advanced functional-
ities. From a static analysis metrics perspective, they
should thus be handled accordingly.
Figure 1: Histogram (logarithmic scale) of the lines of code of the analyzed methods.
Figure 1 presents the histogram (logarithmic
scale) of the lines of code regarding the analyzed
methods; it is obvious that the dataset covers a wide
range of development scenarios.
Upon examining the data and in an effort to elim-
inate any introduced bias from the high frequency of
setters/getters and methods that provide no function-
ality (empty methods), our first step involves remov-
ing the methods that have less than 3 lines of code
combined with minimal complexity as reflected in the
values of McCabe's Cyclomatic Complexity (≤ 1).
These methods correspond to 27.12% of the dataset
(272,511 methods).
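As an illustration of this filtering step, a minimal sketch is given below (in Python, assuming the per-method metrics are stored in a pandas DataFrame; the column names are hypothetical, not the actual schema used here).

# Minimal sketch of the trivial-method filtering step, assuming the per-method
# metrics are available in a pandas DataFrame with "LOC" and "McCC" columns
# (hypothetical column names used for illustration only).
import pandas as pd

def drop_trivial_methods(df: pd.DataFrame) -> pd.DataFrame:
    """Remove setters/getters and empty methods: fewer than 3 lines of code
    combined with a McCabe Cyclomatic Complexity of at most 1."""
    trivial = (df["LOC"] < 3) & (df["McCC"] <= 1)
    return df[~trivial].copy()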
Our next step involves applying clustering using the k-Means algorithm. During this process, and in order to identify the optimal number of clusters, we calculated the cohesion, as expressed by the within-cluster sum of squares, for different clusterings. Figure 2 illustrates the calculated cohesion for the cases where the
number of clusters varies from 2 to 8. Given the pro-
vided results, we selected five as the optimal number
of clusters.
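The following minimal sketch illustrates how such a cohesion curve could be computed with scikit-learn; clustering on lines of code alone is an assumption made here for illustration.

# Minimal sketch of the cluster-number selection step: k-Means on method size
# for k = 2..8, recording the within-cluster sum of squares (inertia).
import numpy as np
from sklearn.cluster import KMeans

def cohesion_per_k(loc_values, k_range=range(2, 9)):
    X = np.asarray(loc_values, dtype=float).reshape(-1, 1)
    cohesion = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        cohesion[k] = km.inertia_  # within-cluster sum of squares
    return cohesion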
Figure 2: Overview of Cohesion for Different Clusterings.
Table 3: Overview of the Formulated Clusters.

Cluster | Number of Methods | LOC Range | Mean Silhouette
#1 | 499,858 (68.29%) | [1, 10] | 0.76
#2 | 166,496 (22.74%) | [11, 24] | 0.52
#3 | 51,964 (7.09%) | [25, 51] | 0.51
#4 | 11,925 (1.62%) | [52, 112] | 0.51
#5 | 1,718 (0.23%) | > 112 | 0.69
The formulated clusters are presented in Table 3.
For assessing the results of the clustering procedure,
we used the mean silhouette value (Rousseeuw, 1987),
which combines the criteria of both cohesion and sep-
aration and is given by the following equations:
s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}    (1)

a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i, j \neq i} d(i, j)    (2)

b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)    (3)
In the above equations, a(i) refers to the mean Euclidean distance between i and all other data points in the same cluster C_i, where d(i, j) is the Euclidean distance between data points i and j. On the other hand, b(i) represents the smallest mean Euclidean distance of i to all points in any other cluster,
of which i is not a member. As shown in Table 3, the
mean silhouette value regarding the five formulated
clusters ranges from 0.51 to 0.76, while the value for
the whole clustering is 0.7.
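For reference, the mean silhouette values of Equations 1-3 can be computed with off-the-shelf tooling, as in the following illustrative sketch; working on a random sample is an assumption made here because the silhouette computation is quadratic in the number of methods.

# Minimal sketch of the cluster-quality check using the mean silhouette value
# (Equations 1-3), computed here with scikit-learn for convenience.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

def clustering_silhouette(loc_values, n_clusters=5, sample_size=20000):
    # Work on a random sample (the sample size is an illustrative assumption).
    X = np.asarray(loc_values, dtype=float).reshape(-1, 1)
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    X = X[idx]
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    per_point = silhouette_samples(X, labels)                   # s(i) per method
    per_cluster = {c: per_point[labels == c].mean() for c in range(n_clusters)}
    return silhouette_score(X, labels), per_cluster             # overall mean, per cluster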
Finally, in an effort to refrain from having clusters
that exhibit high similarities in terms of the behaviour
of the static analysis metrics and thus facilitate the
modelling procedure, we merge clusters #2 and #3
into a single cluster representing medium size methods, and clusters #4 and #5 into one representing large size methods. These two clusters, along with cluster #1 that represents small size methods, are used during modelling.
3.3 Defining Ground Truth
After having constructed our final clusters, each cor-
responding to a different size category, the next
step involves the formulation of the readability score
which will be used as the information basis for build-
ing our readability evaluation models. To that end,
we use the number of identified violations along with
their impact as reflected in their severity degree (Mi-
nor, Major, and Critical), according to the following equations:
ViolPerLoc(i) = \frac{IdentifiedViolations(i)}{LLOC(i)}    (4)

IdentifiedViolations(i) = w_1 \cdot N_{Minor} + w_2 \cdot N_{Major} + w_3 \cdot N_{Critical}    (5)
In the above equations, ViolPerLoc(i) refers to the number of identified violations per line of code for the i-th method included in the dataset, while LLOC(i) refers to its number of logical lines of code. As shown in Equation 5, and given the fact that each violation has a different significance and thus a different impact on the readability degree, the number of identified violations is computed using a different weight for each severity level. The weight for Minor violations is 1 (w_1), while the weights for Major and Critical violations are 2 (w_2) and 4 (w_3), respectively. Having calculated the ViolPerLoc metric for the methods included in the three formulated clusters, we normalize its values in the range [0, 1], and the final readability score is given by the following equation:
R_{Score}(i) = 1 - Normed\{ViolPerLoc(i)\}    (6)
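For illustration, the following minimal sketch (in Python, not part of the original implementation) computes the ground-truth score of Equations 4-6; min-max normalization within each cluster and the input structures are assumptions made for the example.

# Minimal sketch of the ground-truth computation (Equations 4-6): a severity-
# weighted violation count per logical line of code, normalized to [0, 1]
# within a cluster and inverted into a readability score.
import numpy as np

WEIGHTS = {"Minor": 1, "Major": 2, "Critical": 4}  # w1, w2, w3

def readability_scores(violations, lloc):
    """violations: list of dicts with Minor/Major/Critical counts per method;
    lloc: list with the logical lines of code of each method."""
    viol_per_loc = np.array([
        sum(WEIGHTS[s] * v.get(s, 0) for s in WEIGHTS) / max(l, 1)
        for v, l in zip(violations, lloc)
    ], dtype=float)
    # Min-max normalization to [0, 1] (assumed), then inversion (Equation 6).
    span = viol_per_loc.max() - viol_per_loc.min()
    normed = (viol_per_loc - viol_per_loc.min()) / span if span > 0 else np.zeros_like(viol_per_loc)
    return 1.0 - normed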
Figure 3: Overview of the Readability Evaluation System.

Figure 4 depicts the boxplots of the readability scores for the three formulated clusters, where it is obvious that in all clusters the majority of the scores is distributed over a large interval and thus covers a wide range of evaluation scenarios. Given the boxplots, the cluster of “small methods” appears to have the widest range, which comes as no surprise given that it contains almost 70% of the dataset and thus includes methods that exhibit significant differences in terms of adopting certain coding practices. Finally, it is worth noting that the “large methods” cluster appears to have the highest mean readability score. Although this may seem surprising, it is reasonable from a software engineering point of view, given that our dataset originates from the “best” GitHub Java projects, as reflected in their adoption by the community of developers. These projects have hundreds of contributors and thus need to comply with certain code writing practices in order to ensure efficient collaboration, especially in the more complex parts of the source code.

Figure 4: Distribution of the Readability Scores.
4 SYSTEM DESIGN
In this section we design our readability evaluation
system (shown in Figure 3) based on the values of a
large set of static analysis metrics that quantify three
major source code properties: complexity, coupling,
and documentation.
4.1 Data Preprocessing
The preprocessing stage is used to examine the set
of available metrics, detecting the overlaps between
them, in order to reduce the dimensions of the dataset
and form the final set of metrics that will be used in
our model. Specifically, we compute the pairwise cor-
relations among all metrics to eliminate metrics that
appear to be interdependent. Figure 5 illustrates the
heatmap with the results of the correlation analysis.
From the heatmap representing the correlation analysis, we can easily notice the high correlations between metrics that belong to the same category (i.e. Complexity, Coupling, and Documentation), while metrics from different categories appear to have lower correlations. Thus, our decision to evaluate the readability degree on three independent axes is fully justified. The results also seem quite reasonable from a software quality perspective. For instance, a method with a high Halstead Effort (HEFF) is highly likely to also exhibit a high Halstead Time Required to Program (HTRP) (with a correlation value of 1), while there is no clue about the Number of Incoming Invocations (NII) or the Number of Outgoing Invocations (NOI) (with correlation values of 0.00027 and 0.16, respectively).
Table 4: The final metrics used in our model.

Property | Metrics
Complexity | NL, HDIF, HPV, McCC, MI
Coupling | NII, NOI
Documentation | CD, CLOC, DLOC
The correlation analysis showed that many metrics from the same category are highly correlated. For each metric category, upon examining the highly correlated metrics and keeping one metric from each such group, the final dataset consists of the metrics depicted in Table 4.
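A minimal sketch of this pruning step is given below; the correlation threshold of 0.9 and the greedy selection strategy are illustrative assumptions, not the exact procedure followed here.

# Minimal sketch of the preprocessing step: pairwise correlations among the
# computed metrics and a simple greedy pruning of highly correlated ones.
import pandas as pd

def prune_correlated_metrics(metrics: pd.DataFrame, threshold: float = 0.9):
    corr = metrics.corr().abs()          # heatmap input (cf. Figure 5)
    keep = []
    for column in corr.columns:
        if all(corr.loc[column, kept] < threshold for kept in keep):
            keep.append(column)          # keep one representative per group
    return metrics[keep], corr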
Figure 5: Heatmap representation of correlation analysis.
4.2 Model Construction
As already mentioned, we calculate one readability score per metric category, i.e. a readability score concerning the Complexity metrics, the Coupling metrics, and the Documentation metrics, evaluating the readability of each method from the perspective of each axis separately. These three values are
then aggregated to form the final readability score of
the source code.
For the evaluation of the readability score of a method for each metrics category, a well-known regression model was used: Support Vector Regression (SVR) (Drucker et al., 1997). In our approach, nine independent SVR models were built, corresponding to the three size clusters and the three metrics categories within each cluster. The readability score as formulated in the previous section was used as the target of the three SVR models (i.e. the Complexity SVR, the Coupling SVR, and the Documentation SVR). The parameters of each model are depicted in Table 5, where g stands for the gamma parameter, tol for the tolerance of the stopping criterion, and C for the regularization parameter.
For the training process of each model, we follow an 80/20 training-testing split, while we validate each model using 10-fold cross-validation. The training and testing errors for every model after cross-validation are depicted in Table 6.
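The following minimal sketch outlines how one of the nine models could be trained with scikit-learn; the RBF kernel and the shown hyper-parameter values are illustrative assumptions (see Table 5 for the actual parameters per cluster and property).

# Minimal sketch of one per-cluster, per-property SVR model, using an 80/20
# split and 10-fold cross-validation as described above.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, cross_val_score

def train_property_model(X, y, gamma=0.001, tol=0.001, C=256):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = SVR(kernel="rbf", gamma=gamma, tol=tol, C=C)  # kernel choice is an assumption
    cv_mae = -cross_val_score(model, X_train, y_train, cv=10,
                              scoring="neg_mean_absolute_error")
    model.fit(X_train, y_train)
    test_mae = np.mean(np.abs(model.predict(X_test) - y_test))
    return model, cv_mae.mean(), test_mae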
The output of each of the three models repre-
sents the readability score of the method regarding the
Complexity, the Coupling and the Documentation re-
spectively. The final readability score of the method is
Table 5: The parameters of the regression models.

Cluster | Category | g | tol | C
Small | Complexity | 0.001 | 0.001 | 256
Small | Coupling | 0.001 | 0.0001 | 256
Small | Documentation | 0.001 | 0.01 | 256
Medium | Complexity | 0.01 | 0.1 | 256
Medium | Coupling | 0.01 | 0.01 | 256
Medium | Documentation | 0.001 | 0.01 | 64
Large | Complexity | 0.001 | 0.01 | 32
Large | Coupling | 0.2 | 0.1 | 64
Large | Documentation | 0.15 | 0.001 | 256
simply calculated as a weighted average of the three scores, with weights proportional to the number of metrics that make up each category. As already mentioned in the preprocessing stage, 5 metrics are included in Complexity and 2 in Coupling, while Documentation consists of 3 metrics. Thus, the final aggregation is given by the following equation:
RS = 0.5 \cdot S_{cmplx} + 0.2 \cdot S_{cpl} + 0.3 \cdot S_{doc}    (7)

where RS is the final readability score of the method, S_{cmplx} is the readability score regarding Complexity, S_{cpl} the readability score regarding Coupling, and S_{doc} the readability score regarding Documentation.
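The aggregation of Equation 7 is a simple weighted sum; the short sketch below reproduces, for instance, the overall score of the method discussed in Section 5.3.

# Minimal sketch of the aggregation of Equation 7: weights proportional to the
# number of metrics per property (5 Complexity, 2 Coupling, 3 Documentation).
def aggregate_readability(s_cmplx: float, s_cpl: float, s_doc: float) -> float:
    return 0.5 * s_cmplx + 0.2 * s_cpl + 0.3 * s_doc

# Example: aggregate_readability(0.637, 0.472, 0.052) gives approximately 0.43,
# consistent with the 0.428 reported in Section 5.3 (the reported property
# scores are themselves rounded).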
After the construction of the complete model, we calculate the errors on the training and testing sets, respectively, in order to evaluate its performance. Figure 6 illustrates the training and testing error histograms for each cluster. The models appear to be trained effectively, as the training and testing errors are low and lie mostly around 0. At the same time, the distributions of the two errors are quite similar and their differences are minimal, indicating that the models avoided overfitting.

Figure 6: Error histograms for the small, medium, and large size clusters.

Table 6: The cross-validation errors of the regression models.

Cluster | Category | Training MAE | Training MSE | Testing MAE | Testing MSE
Small | Complexity | 18.56% | 6.60% | 18.5% | 6.53%
Small | Coupling | 24.07% | 8.27% | 24.14% | 8.30%
Small | Documentation | 23.65% | 8.46% | 23.52% | 8.36%
Medium | Complexity | 19.51% | 5.52% | 21.84% | 7.11%
Medium | Coupling | 22.83% | 7.25% | 23.02% | 7.38%
Medium | Documentation | 22.77% | 7.09% | 22.99% | 7.19%
Large | Complexity | 18.48% | 6.53% | 21.02% | 8.68%
Large | Coupling | 21.24% | 9.29% | 22.54% | 10.17%
Large | Documentation | 19.08% | 7.12% | 20.34% | 8.94%
5 EVALUATION
In this section we evaluate our constructed methodology for estimating software readability along a set of diverse axes. First, in an effort to evaluate the effectiveness and efficiency of our system, we apply our methodology to a set of diverse projects that exhibit different characteristics. As for the second axis, and towards assessing whether the calculated readability scores are reasonable from a quality perspective, we perform manual inspection of the values of the static analysis metrics regarding methods that received low and high readability scores. Finally, in an attempt to evaluate the effectiveness of our approach in practice, we harness the readability evaluation results in order to improve the readability degree of a certain Java method.
5.1 Readability Estimation Evaluation
In the first step towards assessing the validity of our system, we evaluate its efficiency based on the readability scores computed for five randomly selected repositories (more than 20K methods and 350K lines of code in total) that exhibit significant differences in terms of size (number of methods and total lines of code) and scope. Table 7 presents certain statistics regarding the size and the readability evaluation of the five examined repositories. Specifically, the table contains the number of methods as well as the lines of code, along with the mean values of the actual and the predicted overall readability scores and the individual scores that target each of the evaluated source code properties. In addition, in an effort to further examine the readability interpretation results against the evaluated source code properties, Figure 7 illustrates the percentage of methods that received low, medium, and high readability scores. A low score refers to values below 0.33 (or 33%), medium refers to values in the interval (0.33, 0.66], while high refers to scores above 0.66 (or 66%).
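For illustration, the binning of the predicted scores into these three categories could be performed as follows (a sketch with the thresholds stated above, not the actual implementation).

# Minimal sketch of the score binning used for Figure 7.
import pandas as pd

def bin_scores(scores):
    return pd.cut(scores, bins=[0.0, 0.33, 0.66, 1.0],
                  labels=["low", "medium", "high"], include_lowest=True)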
Figure 7: Percentage of scores per category.

Table 7: The readability score interpretation on evaluation repositories.

Repository | Number of Methods | Total Lines of Code | Actual Score | Predicted Score | Complexity | Coupling | Documentation
#1 | 20,803 | 339,597 | 66.52% | 56.81% | 55.75% | 57.83% | 56.83%
#2 | 2,004 | 27,418 | 63.73% | 53.51% | 50.75% | 55.45% | 55.85%
#3 | 916 | 7,741 | 60.41% | 53.88% | 50.91% | 54.87% | 55.92%
#4 | 1,411 | 13,416 | 56.26% | 52.35% | 46.84% | 54.31% | 54.95%
#5 | 127 | 1,596 | 62.20% | 56.43% | 58.28% | 56.04% | 53.06%

According to the provided results, it is obvious that the overall predicted readability score, which is computed as an aggregation of the respective scores for the three source code properties, aligns with the score computed using the number of identified violations. The largest projects exhibit the highest differences (almost 10%), which is expected given that projects with thousands of methods often include outliers from a static analysis metrics point of view and may thus exhibit higher errors. Even in such cases, our methodology appears to be efficient. In addition, given the mean values of the readability score for the examined properties, the results denote that in all cases the mean value lies in the interval [40%, 60%]; thus, one may conclude that our models do not exhibit bias towards making predictions around a certain value. This is also reflected in the distribution of the scores, as illustrated in Figure 7.
Upon further examining the calculated readability
scores in terms of decomposing the final score into
the three different axes under evaluation and in an ef-
fort to assess whether the calculated scores are logical
from a quality perspective, we examined the variance
of the scores for each respective property. The re-
sults showed that the scores regarding documentation
exhibit the lowest variance, while the ones regarding
complexity appear to have the highest variance. This
comes as no surprise, given that the way of documenting source code in a certain project depends on the design choices made by the main contributors who drive the development process and thus refers to the project as a whole. As a result, the within-project variance of the documentation scores is expected to be low. This is
reflected in the percentage of methods receiving low,
medium, and high values regarding the five examined
projects. On the other hand, complexity and coupling
are properties that fully depend on the provided func-
tionality and thus methods with different scope and
target may exhibit high differences. This is also re-
flected in the percentage of methods receiving differ-
ent readability evaluation, where in the cases of cou-
pling and complexity this percentage is almost evenly distributed across all five projects. At this point, it is worth noting that in the case of coupling, projects #1 and #4 appear to have a large number of methods that receive a high score. This originates from the fact that these projects contain several totally decoupled methods, as reflected in the values of incoming and outgoing invocations.

Table 8: Overview of the Static Analysis Metrics per Property for Methods with different Quality Scores.

Category | Metric | Small: High Score (77.2%) | Small: Low Score (28.18%) | Medium: High Score (85.2%) | Medium: Low Score (15.43%) | Large: High Score (76.1%) | Large: Low Score (24.83%)
Complexity | NL | 0 | 1 | 1 | 9 | 4 | 13
Complexity | HDIF | 16.25 | 16 | 61.92 | 42.54 | 121.47 | 114.83
Complexity | HPV | 20 | 28 | 59 | 95 | 126 | 227
Complexity | McCC | 1 | 2 | 4 | 14 | 12 | 64
Complexity | MI | 105.3 | 122.9 | 88.02 | 72.37 | 60.12 | 20.56
Coupling | NII | 0 | 0 | 1 | 2 | 0 | 1
Coupling | NOI | 0 | 6 | 1 | 20 | 23 | 30
Documentation | CD | 0.23 | 0.00 | 0.34 | 0.08 | 0.27 | 0.0
Documentation | CLOC | 3 | 0 | 17 | 0 | 23 | 0
Documentation | DLOC | 3 | 0 | 23 | 5 | 15 | 0
5.2 Example Readability Estimation
In order to further assess the effectiveness of our mod-
els and evaluate them from a software quality perspective,
we examined the methods that received high or low
readability score for each size cluster, along with the
values of the related static analysis metrics that led to
the predicted score. Table 8 presents these values re-
garding six different methods (two for each cluster)
that received low and high readability score, respec-
tively.
As for the methods of small size, it is obvious that the method that received a low readability score appears to have no documentation, as reflected in the zero value of the Comment Density (CD) metric. On top of that, and given the number of outgoing invocations (6), it appears to be highly coupled, as it calls six other methods during its execution. As a result, the low readability score is logical from a quality perspective. The same applies to the method which received a high score, given that it appears to exhibit no coupling and has an average documentation level. It is worth noting that both methods exhibit high scores in terms of complexity.
As for the methods of medium size, it is obvious
that the method that received low readability score ap-
pears to be more complex and coupled than the one
that received high score as reflected in the values of
McCabe Cyclomatic Complexity (McCC) and Nest-
ing Level (NL), as well as in the number of incoming
and outgoing invocations. In addition, the method which received a low score exhibits a significantly higher Halstead Program Vocabulary (HPV), i.e. a larger number of distinct operators and operands. The same
conclusions are drawn when inspecting the computed values of the static analysis metrics of the methods included in the large size cluster. For these methods, it is worth noting that, as size increases, the impact of complexity on the readability degree becomes even more
evident. This is reflected in the large difference in the values of the Maintainability Index (MI) between the two
methods of the large size cluster. Given all the above,
the readability evaluation in all six cases appears to
be logical and can be explained by the values of the
static analysis metrics.
5.3 Application of Readability
Enhancement in Practice
To further assess the effectiveness of our readability evaluation system in terms of providing actionable recommendations that can be used in practice during development, we explore a specific use case in which we harness the results of our system towards improving the readability degree of a certain method.
Figure 8 presents the initial source code of the
method under evaluation. This method is responsi-
ble for updating a certain database along with the
backup database and works in two different modes.
The first mode refers to the case where the variable isForceUpdate is true and involves updating the main database along with the backup database, while the second refers to the case where isForceUpdate is false and involves only updating the cache. At
this point, it is worth noting that no update operation
should be performed in the main database in cases
when isUpdateReady is false or synchronization is not
complete (isSynchCompleted is false). Upon evalu-
ating the respective method using our trained mod-
private static void updateDb(boolean isForceUpdate) {
if (isUpdateReady) {
if (isForceUpdate) {
if (isSynchCompleted) {
updateDbMain(true);
updateBackupDb(true);
} else {
updateDbMain(false);
updateBackupDb(true);
}
} else {
updateCache(!isCacheEnabled);
}
}
}
Figure 8: Initial version of method.
Upon evaluating this method using our trained models, the overall readability score is 0.428 (or 42.8%), while the scores for the three properties are as follows: 0.637 (or 63.7%) for Complexity, 0.472 (or 47.2%) for Coupling, and 0.052 (or 5.2%) for Documentation. Given these results, it is obvious that the method lacks proper documentation, while at the same time there is a relatively large nesting level, as reflected in the NL value of 3.
/**
 * Update mainDB, backupDB, and cache
 */
private static void updateDb(boolean isForceUpdate) {
// Do nothing in case the update is not ready
if (!isUpdateReady){
return;
}
// Update cache in case of non forced update
if (!isForceUpdate) {
updateCache(!isCacheEnabled);
return;
}
// General Update Pipeline (Backup and Main DB)
updateBackupDb(true);
updateDbMain(isSynchCompleted ? true : false);
}
Figure 9: Final version of method.
We try to optimize the method in two directions. First, we add detailed documentation explaining the different control flow paths in order to improve the comprehensibility of the code. Our second audit targets reducing complexity by refactoring the navigation through the different available control flow paths, thus improving clarity. Figure 9 presents the optimized version of the source code, which results from the aforementioned audits. Upon evaluating the optimized version, the overall readability score is 0.80 (or 80%), while the scores regarding the three properties are as follows: 0.833 (or 83.3%) for Complexity, 0.472 (or 47.2%) for Coupling, and 0.963 (or 96.3%) for Documentation. As shown by the comparison of the two code fragments, which are functionally equivalent, the performed audits had a significant impact on the readability degree, which is reflected in the scores. Given that the two code fragments have the same number of incoming and outgoing invocations, the score regarding the coupling property remains the same. Finally, there is still room for improvement by splitting the method into multiple methods, each responsible for a certain task; in that way, coupling can also be improved.
6 THREATS TO VALIDITY
Our approach towards readability evaluation and interpretation seems to achieve high internal validity, as demonstrated by the evaluation. The limitations and threats to the external validity of our approach span the following axes: a) limitations imposed by the definition of our ground truth, and b) the selection of our benchmark dataset.
Our design choice to quantify readability based on
the compliance of the source code with widely ac-
cepted coding practices as reflected in the number of
identified violations originates from the fact that the
primary target of coding violations is to set up a com-
mon ground within the development community in
terms of following certain code writing guidelines.
Apart from preventing the occurrence of various types
of errors (already known and documented), this com-
mon ground is crucial for improving the understand-
ability of the source code and thus influences read-
ability. Furthermore, given that we interpret readabil-
ity as perceived by developers, our benchmark dataset
is built upon harnessing crowdsourcing information
regarding the popularity and the degree of reuse for
a large number of GitHub Java projects. This in-
formation reflects the high adoption of the selected
projects among the community of developers and thus
was considered appropriate towards formulating our
benchmark dataset. Of course, our methodology can
be applied as-is using a different benchmark dataset
that covers the individual needs of specific evaluation
scenarios.
7 CONCLUSIONS AND FUTURE
WORK
In this work, we proposed an automated and inter-
pretable readability evaluation methodology, which is
based on a large set of static analysis metrics and cod-
ing violations. The evaluation of our approach along several axes indicates that our system can effectively assess readability on three axes, each corresponding to a primary source code property. By providing results that lead to actionable recommendations regarding the audits that can enhance the readability degree of the project under evaluation, our system can be a valuable tool for developers.
Future work lies in several directions. First, we can expand our dataset by adding projects with different characteristics and thus improve the ability of our models to generalize. In addition, the design of our target variable can be further investigated towards the incorporation of additional metrics other than violations.
ACKNOWLEDGEMENTS
This research has been co-financed by the European
Regional Development Fund of the European Union
and Greek national funds through the Operational
Program Competitiveness, Entrepreneurship and In-
novation, under the call RESEARCH - CREATE - IN-
NOVATE (project code: T1EDK-04045).
REFERENCES
Bergland, G. D. (1969). A guided tour of the fast Fourier transform. IEEE Spectrum, 6(7):41-52.
Buse, R. and Weimer, W. (2010). Learning a metric for code
readability. IEEE Transactions on Software Engineer-
ing, 36:546–558.
Choi, S., Kim, S., Kim, J., and Park, S. (2020). Met-
ric and tool support for instant feedback of source
code readability. Tehnički vjesnik - Technical Gazette, 27(1):221-228.
Dorn, J. (2012). A general software readability model.
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. J.,
and Vapnik, V. (1997). Support vector regression ma-
chines. In Mozer, M. C., Jordan, M. I., and Petsche,
T., editors, Advances in Neural Information Process-
ing Systems 9, pages 155–161. MIT Press.
Fakhoury, S., Roy, D., Hassan, S. A., and Arnaoudova,
V. (2019). Improving source code readability: The-
ory and practice. In Proceedings of the 27th Interna-
tional Conference on Program Comprehension, ICPC '19, pages 2-12. IEEE Press.
Halstead, M. H. (1977). Elements of software science.
ISO (2020). ISO/IEC 25010. https://iso25000.com/index.
php/en/iso-25000-standards/iso-25010. Accessed:
2020-03-20.
Knight, J. C. and Myers, E. A. (1993). An improved in-
spection technique. Communications of the ACM,
36(11):50–61.
Mannan, U. A., Ahmed, I., and Sarma, A. (2018). Towards
understanding code readability and its impact on de-
sign quality. pages 18–21.
Mkaouer, M. W., Kessentini, M., Bechikh, S., Ó Cinnéide,
M., and Deb, K. (2015). On the use of many quality
attributes for software refactoring: a many-objective
search-based software engineering approach. Empiri-
cal Software Engineering.
Moha, N., Guéhéneuc, Y.-G., Duchien, L., and Le Meur, A.-F.
(2010). Decor: A method for the specification and de-
tection of code and design smells. IEEE Transactions
on Software Engineering, 36(1):20–36.
Pantiuchina, J., Lanza, M., and Bavota, G. (2018). Im-
proving code: The (mis) perception of quality metrics.
pages 80–91.
PMD (2020). PMD static analysis tool. https://pmd.github.
io/. [Online; accessed March 2020].
Posnett, D., Hindle, A., and Devanbu, P. (2011). A simpler
model of software readability. In Proceedings of the
8th Working Conference on Mining Software Repos-
itories, MSR '11, pages 73-82, New York, NY, USA.
Association for Computing Machinery.
Raymond, D. R. (1991). Reading source code. In Proceed-
ings of the 1991 Conference on Centre for Advanced
Studies on Collaborative Research (CASCON), pages
3–16.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65.
Rugaber, S. (2000). The use of domain knowledge in pro-
gram understanding. Annals of Software Engineering,
9:143–192.
Scalabrino, S., Linares-Vásquez, M., Oliveto, R., and Poshy-
vanyk, D. (2018). A comprehensive model for code
readability. Journal of Software: Evolution and Process, 30(6):e1958.
Scalabrino, S., Linares-Vásquez, M., Poshyvanyk, D., and
Oliveto, R. (2016). Improving code readability mod-
els with textual features. In 2016 IEEE 24th In-
ternational Conference on Program Comprehension
(ICPC), pages 1–10.
sourcemeter (2020). SourceMeter static analysis tool. https:
//www.sourcemeter.com/. [Online; accessed March
2020].