A Mechanism for Automatically Extracting Reusable and Maintainable

Code Idioms from Software Repositories

Argyrios Papoudakis

, Thomas Karanikiotis

and Andreas L. Symeonidis

Dept. of Electrical and Computer Eng., Aristotle University of Thessaloniki, Thessaloniki, Greece

Keywords:

Code Idioms, Syntactic Fragment, Software Reusability, Software Maintainability, Software Engineering.

Abstract:

The importance of correct, qualitative and evolvable code is non-negotiable, when considering the maintain-

ability potential of software. At the same time, the deluge of software residing in code hosting platforms has

led to a new component-based software development paradigm, where reuse of suitable software components/

snippets is important for software projects to be implemented as fast as possible. However, ensuring accept-

able quality that will guarantee basic maintainability is also required. A condition for acceptable software

reusability and maintainability is the use of idiomatic code, based on syntactic fragments that recur frequently

across software projects and are characterized by high quality. In this work, we present a mechanism that

employs the top repositories from GitHub in order to automatically identify reusable and maintainable code

idioms. By extracting the Abstract Syntax Tree representation of each project we group code snippets that ap-

pear to have similar structural and semantic information. Preliminary evaluation of our methodology indicates

that our approach can identify commonly used, reusable and maintainable code idioms that can be effectively

given as actionable recommendations to the developers.

1 INTRODUCTION

Delivering software fast and with good quality has al-

ready been a quest for software engineering practi-

tioners. This quest has become more evident in our

era of rapid software prototyping and the establish-

ment of the agile, component-based software engi-

neering paradigm. The narrow time-to-market sched-

ules and the constantly changing features undermine

the quality of developed software and threaten the

maintainability of software projects. Abran et al.

(Abran and Nguyenkim, 1993) argue that the software

maintaining process consumes the largest percentage

of software costs. A poor software maintainability

”score” usually leads to increased bug ﬁxing times

and inability to properly plan features development,

this way derailing the software project time frame,

both in time as well as in budget context.

In order to speed up software development, re-

duce programming effort and minimize the risks of

defective software, developers constantly try to reuse

existing software components/libraries/code snippets,

https://orcid.org/0000-0001-7371-2277

https://orcid.org/0000-0001-6117-8222

https://orcid.org/0000-0003-0235-6046

beneﬁting from software that has already been imple-

mented and released and resides in some code hosting

facility. However, the quality of the components to be

reused is not granted and suboptimal searches may

lead to code snippets of poor quality that are tough to

integrate and could possibly introduce bugs. So, there

is a high need for software components that the devel-

opers can easily reuse and integrate into their software

and that can guarantee acceptable quality.

This requirement can be satisﬁed by using id-

iomatic code or code idioms. Code idioms are small

syntactic fragments that recur frequently across var-

ious software projects, with simple tasks to execute.

According to Allamanis et al. (Allamanis and Sut-

ton, 2014), code idioms appear to be signiﬁcantly

different from ”previous notions of textual patterns

in software”, as they involve syntactic constructs

(decision-making statements, loops and exception-

handling blocks). Idioms, which can improve the

overall quality of the software they are used into, are

characterized by high reusability and can make the

code signiﬁcantly easier to maintain (Hnatkowska and

Jaszczak, 2014). Popular Integrated Development En-

vironments (IDEs) already support idioms for various

programming languages. However, there are only few

approaches that aspire to mine code idioms automat-

Papoudakis, A., Karanikiotis, T. and Symeonidis, A.

A Mechanism for Automatically Extracting Reusable and Maintainable Code Idioms from Software Repositories.

DOI: 10.5220/0011279300003266

In Proceedings of the 17th International Conference on Software Technologies (ICSOFT 2022), pages 79-90

ISBN: 978-989-758-588-3; ISSN: 2184-2833

ically (Allamanis and Sutton, 2014). The few exist-

ing ones are mostly based on statistical methods and

are not able to capture idioms based also on their in-

creased usage among top-level and popular reposito-

ries.

In this work, we present a mechanism that can

harness data from the top-level, most popular open-

source projects, residing in the code hosting platform

GitHub, and identify reusable and maintainable code

idioms, in a completely unsupervised manner. This

mechanism extracts the most-used code blocks found

across different projects, compares both their struc-

tural and their semantic information and groups code

snippets with similar architecture and functionality.

After automatically processing the generated clusters

and discarding the ones that do not meet speciﬁc re-

quirements, our system produces a set of code id-

ioms that have been used considerably across top-

level projects. To sum up, the main contributions of

our approach lie in the following axes:

• We present an approach for automatically identi-

fying and extracting code idioms used in various

projects.

• We build our methodology on projects of high

maintainability and reusability, which is shown in

the projects’ stars and forks.

• The extraction of code idioms is separated and ac-

complished differently for the various code blocks

found in programming languages, such as if and

for, which provide different functionalities.

• We provide an abstract form of the ﬁnal idioms

extracted by our system, which a developer can

easily integrate into his/her project and modify ac-

cordingly.

• Having evaluated our approach and examined the

extracted idioms, we identify that these idioms

are used in various important programming tasks,

such as null checking, looping through the ele-

ments of an array and reading the content of an

object or ﬁle. Such tasks considerably affect soft-

ware maintainability.

The rest of this paper is organized as follows. Sec-

tion 2 reviews the current approaches in the literature

that aspire to extract small fragments of code that re-

cur frequently across projects and are characterized

by high quality. Our methodology, along with the

dataset we have created, the processing steps and the

models we have employed are depicted in Section 3,

while in Section 4 we evaluate the code idioms that

were extracted by our approach. In Section 5, we

analyze potential threats to the internal and external

validity of our approach and, ﬁnally in Section 6 we

provide insight for further research and conclude the

paper.

2 RELATED WORK

The importance of software maintainability and

reusability has gained increasing interest during the

recent years, where the time schedules for projects

are very tight, features are being constantly added or

modiﬁed and the maintenance costs are rising. To

address these issues, the component-based software

paradigm, which is easy to debug, maintain and reuse,

has gained ground. This paradigm relies on soft-

ware components and small fragments of code, that

execute speciﬁc and well-deﬁned programming tasks

and can be easily understood by the developers. Id-

iomatic code or code idioms are such small syntactic

blocks that recur frequently across different projects

and execute simple tasks. Idioms can signiﬁcantly im-

prove the software maintainability of a project, while

they provide small, compact and highly reusable code

blocks, which the developers can effortlessly compre-

hend and (re)use.

The process of extracting useful information from

semi-structured data is a well-known task, in which

the developers attempt to detect frequently encoun-

tered components. Even though mining frequent pat-

terns is an important task, it is not directly correlated

to the idioms mining challenge. A problem that is

highly connected to idioms mining is the identiﬁca-

tion of code clones. Code clones aim to detect sim-

ilar blocks of code across different projects. Zhang

and Wang (Zhang and Wang, 2021) proposed a tool,

named CCEyes, which is based on semantic vector

representations of big repositories and can identify

similar code fragments. Upon evaluating their ap-

proach, the authors argue that CCEyes outperforms

state-of-the-art approaches in code cloning. Similarly,

Ji et al. (Ji et al., 2021) are mainly focused on the hier-

archical structure of the Abstract Syntax Tree, where

they apply an attention mechanism, in order to exam-

ine the importance of different tree nodes. The results

of the experiments conducted by the authors show that

their proposed methodology achieves superior results

compared to the baseline methods.

Another area related to idioms mining is the API

mining problem. In this task, the main goal is the ex-

traction of sequences or graphs of API method calls.

While there have been a number of approaches that

aspire to mine API usage patterns, Wang et al. (Wang

et al., 2013) argued that they lack appropriate metrics

to evaluate the quality of their results. Thus, they pro-

posed two metrics, called succinctness and coverage

ICSOFT 2022 - 17th International Conference on Software Technologies

in order to measure the quality of the mined usage

patterns of API methods from the view of the devel-

opers. At the same time, they proposed UP-Miner,

a system that mines API usage patterns from source

code based on the similarity of the sequence and a

clustering algorithm. The experiments conducted on

the effectiveness of UP-Miner indicate that the UP-

Miner outperforms the existing approaches. More-

over, Fowkes and Sutton (Fowkes and Sutton, 2016)

aspire to address the limitation of the large parameter

tuning that is required in the most of the API min-

ing algorithms, by proposing the Probabilistic API

Miner - PAM, which is a near parameter-free prob-

abilistic algorithm for mining API call patterns. PAM

outperformed previously proposed systems, while the

authors argue also that the most of the existing API

calls are not well documented, which makes the API

usage patterns mining a demanding task.

While all these approaches are closely related to

idioms mining, signiﬁcant variations exist. Code

clones aspire to detect similar code that executes the

same functionality, but not identical one, which is pre-

requisite for the idioms mining. Similarly, API min-

ing is interested only in sequences or graphs mining

and not on the code itself. Additionally, while code

search approaches cover a much wider area, they are

usually not focused on code quality and maintainabil-

ity.

When it comes to idiom mining, research ap-

proaches try to model the problem from a statistical

point of view. Allamanis and Sutton (Allamanis and

Sutton, 2014) proposed HAGGIS, a system for mining

code idioms, which is mostly based on nonparametric

probabilistic tree substitution grammars. Upon apply-

ing HAGGIS to some of the most popular repositories,

the authors claim that the extracted idioms are being

used in a wide range of projects. However, the most

of the idioms that HAGGIS was able to identify per-

form only easy tasks, such as object creation and ex-

ception handling. In their next approach, Allamanis

et al. (Allamanis et al., 2018) focused only on loop

idioms. They were based again on probabilistic tree

substitution grammars, which they augmented with

important semantic, in order to identify loop blocks

that recur frequently and meet the basic requirements

of code idioms. The identiﬁed loop idioms seem to

appear frequently, while they are able to identify op-

portunities for new code features. However, a study

conducted in 2019 by Tanaka et al. (Tanaka et al.,

2019) indicated that a lot of the mined code idioms

are not being used frequently, while some types of id-

ioms, such as the Stream idioms that operate on the

elements of a collection, are only used at times. At

the same time, the lack of complete idioms or ex-

amples leads to bad idioms usage. Finally, Sivara-

man et al. (Sivaraman et al., 2021) proposed Jezero,

a system that uses nonparametric Bayesian methods,

in order to extract canonicalized dataﬂow trees. Even

though Jezero performs better than other baseline ap-

proaches, it appears to signiﬁcantly depend on the ex-

tent and the nature of the code it is applied to.

In this work, we aspire to tackle the main draw-

backs that are introduced in the approaches found in

the literature. We propose a mechanism to extract id-

ioms from the most popular and reused projects from

GitHub, characterized by high maintainability and

reusability (Papamichail et al., 2016a), split the soft-

ware into small meaningful code snippets and group

commonly used code blocks across different projects

that have similar structural and semantic information.

These groups are then ﬁltered to discard code snip-

pets that do not meet basic requirements of the id-

iomatic code and the ﬁnal set of code idioms is gener-

ated. We argue that code idioms can effectively help

the developers during both the development phase,

where components that accomplish simple tasks are

selected to be reused, and the maintenance phase. At

the same time, we argue that the extraction of the code

idioms from repositories that have a big number of

stars (users that can comprehend the main content and

functionality of the project) and forks (users that use

the whole project or project components), along with

their long lifespan, is a good indicator of idiomatic

code.

3 METHODOLOGY

In this section we present the architecture of our code

idioms mining system (shown in Figure 1).

3.1 Dataset Construction

The ﬁrst step towards creating our system is to gener-

ate a dataset that comprises of code snippets derived

from the most popular software projects. The pro-

gramming language selected for creating our corpus

is Java, as it is a well-structured language and is con-

sidered ideal for mining code patterns.

In order to build our dataset, we use the 1,500

most popular Java repositories from GitHub

, based

on their reputation with respect to their number of

stars and forks and their long lifespan, as it originates

from the date of their ﬁrst release. As our main target

is the identiﬁcation of reusable and maintainable code

idioms, the initial dataset used in our system needs to

https://github.com/

A Mechanism for Automatically Extracting Reusable and Maintainable Code Idioms from Software Repositories

switch blocks

if blocks

for blocks

while blocks

try blocks

do blocks

Dataset

Complexity

Clustering

Snippet

Filtering

Similarity

Scheme

Snippets

Filtering

Clustering

PQ - GRAMS

Cluster Selection

Generalized Form

Preprocessing

Figure 1: Overview of the Code Idioms Mining System.

include idiomatic source code that is popular among

the developers. This can be ensured by the reputation

metrics; forks measure how many times a software

repository has been cloned and, thus, a large number

of forks indicates great reusability, while the number

of stars reﬂects the attractiveness of the project and

hence a qualitative and maintainable software. Each

project should also satisfy speciﬁc criteria posed by

the idioms mining problem and the need for highly

maintainable and reusable code. Except from the high

number of stars and forks, each project needs to ex-

hibit a large number of different contributors as well

as steady and short release cycles (i.e. new features

constantly added). These requirements aspire to en-

sure maintainability and reusability, as they are re-

ﬂected by the project’s persistence in time and its high

acceptance from a lot of developers.

From the 1,500 software repositories, we ran-

domly select 1,000 repositories to be used in the ini-

tial stage of our approach, where the code idioms are

identiﬁed, and employ the remaining 500 repositories

to evaluate the extracted idioms. The source code

of these projects needs to be converted into a form

suitable for our models. For this purpose, we make

use of the Abstract Syntax Tree (AST) representation,

which maintains both the structural and the seman-

tic information of the source code. Speciﬁcally, the

execution order of the code statements is encoded in

the structure of the AST, while the semantic infor-

mation is stored in the leaves of the tree, where the

names of variables, methods and objects are found.

The transformation of the source code ﬁles to ASTs

is performed with the use of the ASTExtractor

tool.

As the code idioms are small syntactic fragments

with a single semantic role and execute speciﬁc and

well-deﬁned programming tasks, we can easily as-

sume that the most important code idioms concern a

small block of code (e.g. an if block or a for block)

and not a sequential snippet of code. Therefore, we

mainly focus on fragments of code that belong to

https://github.com/thdiaman/ASTExtractor

the Control Flow Statements (CFSs), which break up

the sequential execution of the code with looping,

decision-making or exception-handling statements.

The main CFSs used in nearly any programming lan-

guage are the If, For, Try, While, Do and Switch

blocks. After traversing the ASTs of the source code

ﬁles, we extract all the fragments of code that belong

to the CFSs categories. Each CFS category is han-

dled independently, as the snippets of each category

have totally different semantic content and there is no

similarity between them, that could lead to the identi-

ﬁcation of a code idiom. Table 1 depicts the number

of snippets that each category of the CFS contains,

where the Enhanced For statements refer to the it-

eration on the elements of an array or collection. It

should be noted that the number of the If Statements

was too large, so we narrowed it down using only 300

repositories and split them into 7 different ﬁles. The

data have been publicly available and can be accessed

at Zenodo

. The data can be accessed by any pro-

gramming language, as it completely follows the csv

standards.

Table 1: Number of Snippets per Category.

Type of CFS

# Snippets

Blocks

If 4,988,452

For 411,770

Enhanced For 495,256

Try 716,776

While 174,644

Switch 106,904

Do 10,739

https://doi.org/10.5281/zenodo.5391965

ICSOFT 2022 - 17th International Conference on Software Technologies

3.2 Preprocessing

Prior to calculating the similarity matrix between

code snippets, we have to adjust the initially created

dataset to the requirements of the mining idioms prob-

lem. Taking into account the large number of repos-

itories we employed, it is crucial that we reduce the

computational cost, while, at the same time, discard

data that may lead to deﬁcient results. Therefore, we

apply some preprocessing steps, which make the im-

plementation of the main clustering procedure feasi-

ble.

As it has been already mentioned, the code id-

ioms are small syntactic fragments with speciﬁc and

well-deﬁned programming tasks to execute. Thus, a

larger snippet of code would not be part of a code

idiom, as it contradicts with the main principles of

code idioms. In order to exclude larger blocks of

code from our dataset, we discard all the snippets that

are greater than seven logical lines of code, i.e. ex-

ecutable lines of code, as idioms need to be signif-

icantly small and execute simple tasks and they usu-

ally do not exceed the above threshold

. It should be

mentioned, however, that this is just a tunable param-

eter in our methodology and can easily be changed for

further experimentation. Table 2 depicts the number

of snippets each category contains, after removing the

large snippets of code.

Table 2: Number of Snippets per Category after Filtering.

Type of CFS

# Snippets

Blocks

If 1,049,729

For 194,299

Enhanced For 252,587

Try 246,632

While 55,438

Switch 10,482

Do 3,482

From the Table 2 we can notice that the ﬁrst four

categories (i.e. If, For, Enhanced For and Try state-

ments) include a large amount of code snippets, which

greatly impedes the effective clustering analysis of

the next steps, especially during the calculation of the

similarity matrices. In order to cope with this limi-

tation, we split the respective categories into smaller

parts, applying an initial clustering. The main goal

of this initial clustering is the ability to perform only

targeted comparisons when applying the main cluster-

https://programming-idioms.org/

https://www.nayuki.io/page/good-java-idioms

ing, avoiding the comparisons between code blocks

that are considerably different.

The initial clustering applied on this step of our

modelling procedure is based on the code complexity

of the snippets. Snippets with a quite large difference

in their complexity probably perform different tasks,

while, even if they execute the same programming

task, they are based on different approaches. Thus,

we apply a clustering algorithm making use of two

complexity metrics; the McCabe’s Cyclomatic Com-

plexity (McCabe, 1976), which counts the indepen-

dent execution paths of the source code, as well as the

total number of variables, methods and objects that

the code contains. These two metrics are used as fea-

tures for the clustering algorithm, in order to create

groups that contain snippets with the same character-

istics.

Applying this preprocessing step in the sets of If,

For, Enhanced For and Try statements, we create sev-

eral subsets that contain a signiﬁcantly smaller num-

ber of snippets. The main point of this modelling

step is that each of our sets contains only fragments

of code that appear to have similar code complexity

and hence they could lead to the identiﬁcation of a

code idiom. Table 3 depicts a summary of the results

of the initial clustering, including the number of dif-

ferent groups that are created and the average number

of snippets that are included in each group.

Table 3: Initial Clustering Results.

Type of CFS

# Groups

Average #

Blocks Snippets

If 21 49,987

For 8 24,287

Enhanced For 10 25,258

Try 10 24,663

3.3 Similarity Scheme

Before applying the main clustering algorithm, we

need to calculate a distance matrix for each one of the

subsets of our corpus, which reﬂects the similarity be-

tween two snippets of code. For the comparison of the

snippets, we use their AST representations and calcu-

late the Tree Edit Distance (TED). TED is deﬁned as

the minimum cost sequence of edit operations (node

insertion, node deletion and label change) that can

transform one tree into another (Tai, 1979). Over the

years, various methods have been introduced to cal-

culate the TED, in order to improve the runtime algo-

rithm’s complexity (Zhang and Shasha, 1989; Klein,

1998). It is remarkable, though, that the complexity in

A Mechanism for Automatically Extracting Reusable and Maintainable Code Idioms from Software Repositories

all these implementations is greater or equal to O(n

In order to avoid using such a computationally expen-

sive method, we approximate the TED by using the

pq-Grams algorithm (Augsten et al., 2005).

Towards using the pq-Grams algorithm, we ﬁrst

need to transform each AST into an ordered label tree

T, using the type of AST statements as labels and con-

necting the nodes so that the tree is traversed in a pre-

order manner. Then a pq-Extended-Tree T

is con-

structed by adding null nodes on the tree T. Precisely,

p − 1 ancestors are added to the root of the tree, q − 1

children are added before the ﬁrst and after the last

child of each non-leaf node and q children are inserted

to each leaf of T. In our methodology, we deﬁne p = 2

and q = 3, as these values have been used signiﬁcantly

in tree matching approaches with great results. Figure

2 illustrates the extended tree transformation of a sim-

ple labelled tree, where p = 2 and q = 3.

(a)

* *

* * *

* *

* * *

* *

(b)

Figure 2: (a) Example tree T . (b) T

2,3

Extended-Tree.

For each Extended-Tree, we need to calculate a

list of all the pq-Gram patterns it contains. A pq-Gram

pattern is deﬁned as a subtree of the extended tree T

that consists of an anchor node with p − 1 ancestors

and q children. These lists are called Proﬁles P(T )

and are used to calculate the distance between trees.

For instance, the proﬁle of the tree in Figure 2 is the

list [∗a∗∗b, ∗a∗bc, ∗abc∗, ∗ac ∗∗, ab∗∗∗, ac ∗∗d, ac∗

d∗, acd ∗ ∗, cd ∗ ∗∗]. Overall, the pq-Gram distance

between two trees T

and T

is deﬁned as follows:

distance(T

, T

) = 1 − 2 ∗

p,q

) ∩ P

p,q

) ∪ P

p,q

(1)

As it is depicted in Equation 1, the distance be-

tween two trees depends on the number of mutual

pq-Grams patterns contained in both their proﬁles di-

vided by the union of the two lists, which results

in a value between 0 and 0.5 (when the union con-

tains double the instances of the intersection) (Aug-

sten et al., 2008). It is obvious that the similarity be-

tween two trees can be easily calculated using the for-

mula 1 − distance(T

, T

). The ﬁnal similarity value

ranges from 0 to 1.

3.4 Snippets Filtering

The pq-Gram algorithm discussed in the previous sec-

tion outputs the distance matrices calculated between

code snippets in the same subsets presented above.

We, then, proceed to the next step of our modelling

procedure, where we remove duplicate instances from

the data. The term duplicate instances refers to two

different cases:

• snippets with zero pq-Gram distance that belong

to the same code repository, as it is quite common

for a fragment of code to be used numerous times

in the same repository.

• snippets with zero pq-Gram distance that derived

from different code repositories but, however,

originate from the same package or library. It

is a usual case, when developers need to make a

change to an existing library or package and, thus,

they include it into their project.

It should be noted that this processing step, which

removes the duplicate instances from the data, is im-

portant in order to ensure that the resulting clusters

will contain patterns used by many different software

projects and, therefore, many different developers.

3.5 Clustering

In order to group code snippets, we perform Agglom-

erative Hierarchical Clustering, which is a bottom-up

approach that initially considers each snippet as a sep-

arate cluster and iteratively merges the groups with

the lesser distance. As a method of measuring the

distance between clusters we use the average linkage,

which is deﬁned as the average distance between all

the snippets, while the optimal number of clusters is

identiﬁed using the average silhouette score. Silhou-

ette coefﬁcient ranges from -1 to 1, indicating whether

the point is assigned to the correct cluster (positive

value) or not (negative value). Equation 4 shows the

calculation form of this metric using the mean intra-

cluster distance (variable a) and the min nearest clus-

ter distance (variable b) as shown in Equations 2 and

3 respectively.

ICSOFT 2022 - 17th International Conference on Software Technologies

a(i) =

|−1

∑

j∈C

j̸=i

d(i, j) (2)

b(i) = min

k̸=i

∑

j∈C

d(i, j)) (3)

s(i) =

b(i) − a(i)

max(a(i), b(i))

(4)

Figure 3 illustrates a histogram of the cohesion

achieved on a set of different clusters for the If state-

ments. Table 4 depicts the average number of snip-

pets per cluster, the average cohesion of the clusters

and the average number of repositories found within

the cluster.

Clusters

Cohesion

0.70

0.75 0.80 0.85 0.90

Figure 3: A histogram of the cohesion achieved in a set of

different clusters.

Table 4: Clustering Results.

Type of CFS Cluster

Cohesion # Repos

Blocks Size

If 622.3 0.81 75.8

For 202.1 0.82 69

Enhanced For 527.9 0.78 79.3

Try 299 0.86 102.5

While 195.6 0.83 100.8

Switch 93 0.78 49

Do 26 0.87 5

3.6 Cluster Selection & Generalized

Form

After performing the clustering analysis, the ﬁnal step

of our approach examines the generated clusters and

selects only those that meet the requirements of the id-

ioms mining problem. The extracted idioms and clus-

ters need to satisfy two speciﬁc criteria. First of all,

the idioms have to be widely used by a lot of differ-

ent developers and, thus, it is important to examine

the number of different code repositories that contain

snippets within a cluster. Additionally, the clusters

derived from a clustering algorithm have to be cohe-

sive, so that the code snippets they carry are quite sim-

ilar both structurally and semantically.

Using these remarks, we deﬁne the three parame-

ters that designate an optimal cluster:

• The number of code snippets it contains

• The number of different repositories that contain

at least one snippet in the cluster

• The cluster cohesion, which is calculated as the

average similarity of the snippets from the cen-

troid. The centroid of each cluster is deﬁned as

the snippet with the lowest average distance from

all the other snippets of the group. Equations 5

and 6 depict these calculations:

m = min

(

|C|−1

∑

j∈C

j̸=i

d(i, j)) (5)

cohesion = 1 −

|C|−1

∑

x∈C

d(x, m) (6)

Table 5 depicts the minimum thresholds applied to

each category described above. The values of these

parameters were carefully selected, taking into ac-

count the number of the available snippets and the

clustering results. Discarding the generated clusters

that do not meet these thresholds, a total of 101 opti-

mal clusters have been produced. From each of these

optimal clusters, we extract the centroid of the cluster

as the representative one, i.e. the code idiom of the

respective cluster.

Table 5: Threshold Parameters per Category.

Type of CFS

# Snippets # Repos Cohesion

Blocks

If 100 50 0.7

For 100 40 0.7

Enhanced For 100 40 0.7

Try 100 30 0.7

While 50 30 0.7

Switch 80 20 0.7

Do 25 5 0.7

The last step in our approach is to transform the

generated idioms into a generalized form, that can be

easily recognized and used by any developer. The

main task of this step is the identiﬁcation of variables,

A Mechanism for Automatically Extracting Reusable and Maintainable Code Idioms from Software Repositories

functions and objects that are commonly used within

the idiom, which remain as is. On the other hand, all

the other names, that change according to the domain

the idiom is used on, are replaced with an abstract

naming convention. For this purpose, we compute the

frequency of each token of the centroid in the snip-

pets of the corresponding cluster. All the tokens of

the centroid, with a frequency lower than a speciﬁc

threshold, which in our case is set to 0.5 (i.e. less than

half of the snippets in the cluster contain the examined

token), are replaced by an abstract token. Figure 4 de-

picts an example idiom extracted by our approach, as

well as the generalized form it is transformed to.

try {

writer . close ();

}

catch (IOException e) {

throw new RuntimeException(e);

}

(a)

try {

$( object ). close ();

}

catch (IOException e) {

throw new $(method)(e);

}

(b)

Figure 4: (a) An example idiom extracted by our system.

(b) Generalized form of the pattern adding metavariables.

Figure 5 depicts some examples of the top idioms

that were extracted using our methodology, accord-

ing to the number of different repositories they are

found in. These abstract code snippets meet the ba-

sic requirements set for the code idioms, as they are

small syntactic fragments that perform well-deﬁned

programming tasks and that are used widely by a

number of different developers. Table 6 depicts the

number of idioms identiﬁed in each category of the

CFS.

Table 6: Number of idioms extracted per category.

Type of CFS

# Idioms

Blocks

If 44

For 20

Enhanced For 15

Try 15

While 5

Switch 1

Do 1

4 EVALUATION

In this section we evaluate our methodology for ex-

tracting code idioms from the most popular GitHub

repositories. The evaluation is performed in three di-

verse axes. At ﬁrst, we examine the extracted idioms

based on a set of repositories used for testing and as-

sess their association with top repositories. Addition-

ally, we evaluate our approach and compare our re-

sults with the respective idioms identiﬁed by Allama-

nis and Sutton (Allamanis and Sutton, 2014), using

the PROJECTS dataset as baseline. Finally, towards

the evaluation of the effectiveness of our approach in

practice, we investigate the applicability of our ex-

tracted idioms, by employing the most popular ques-

tions and answers from StackOverﬂow.

4.1 Evaluation of Extracted Idioms

In the ﬁrst step towards assessing the effectiveness

of our system and the applicability of the extracted

code idioms in the most popular and most (re)used

repositories, we inspect the appearance of our idioms

into these projects. Speciﬁcally, we examine the code

blocks of the generalized code idioms against code

snippets coming from a set of testing repositories, in

order to identify blocks of code in these projects that

are identical to the extracted idioms.

From the initial dataset coming from 1,500 of the

most popular repositories, we had already left out a

set of 500 repositories in the ﬁrst place. We use the

projects of these repositories as our main testing set,

in which we examine whether they make use of our

extracted idioms or not. Table 7 depicts some statis-

tics on these repositories.

Table 7: Testing Repositories Statistics.

Metric Value

Mean Stars 1,385.8

Mean Forks 499.2

Mean Commits 3,344.7

Mean Watches 620.7

The comparison of the code idioms identiﬁed

by our approach, with code blocks from the test-

ing repositories is performed in three steps. At ﬁrst,

code coming from the repositories is transformed into

ASTs, which are then traversed, in order to obtain

code blocks that belong to the CFSs. Next, the two

ASTs, the one of the code idiom and the one of the

tested code fragment, are edited and the leaves, which

contain the names of the variables, methods and ob-

jects, are removed, so that we can compare only the

ICSOFT 2022 - 17th International Conference on Software Technologies

try {

Thread. sleep ($( simpleVariable1 ));

}

catch ( InterruptedException e) {

}

(a)

if ( value == null ) {

$(method1)();

}

else {

$(method2)(( String )value );

}

(b)

for ( int i=0; i < $(object1 ). length ; i++) {

$( object1 )[ i]=$(method1)(i );

}

(c)

for (Thread thread : threads ) {

thread . $(method1)();

}

(d)

while (( line =$(object1 ). readLine ()) != null){

$( object2 ). $(method1)(line );

}

(e)

while ($( object1 ). hasNext()) {

$( object2 ). add($( object1 ). next ());

}

(f)

for (Map.Entry<String,String> entry : $( object1 ). entrySet ()){

$( object2 ). $(method1)(entry .getKey (), entry . getValue ());

}

(g)

Figure 5: Examples of idioms extracted using our approach (a) Suspends a thread for a period of time. (b) Checks if a value is

equal to null and calls the appropriate method. (c) Loops through an array and assigns its values using a method. (d) Iterates

over a sequence of threads calling a method (e) Reads the content of an object line by line. (f) Adds the values of iterator to

an object. (g) Iterates through the key-value pairs of a java map and calls the appropriate method of another java map object.

syntactical similarity of the two snippets. The pq-

Grams algorithm is then used to calculate the distance

between the two trees and, in case the resulting dis-

tance is zero, the snippets are syntactically similar

and the next step can be executed. In the ﬁnal step

of the comparison, we examine the original variables

included in the two code snippets and, speciﬁcally, we

check if all the variables, excluding the ones that were

replaced by meta-variables in the generalized form, of

the tested snippet are also contained in the generalized

idiom. In this case, the two code snippets are both

structurally and semantically similar and the idiom is

considered to be used by the project.

Using the aforementioned comparison approach

and the set of the 500 testing repositories, it emerged

that only 4 of our extracted idioms are not used at all

by any repository of the testing set. In fact, each code

idiom can be found at least once in 66 different repos-

itories on average. Figure 6 illustrates the number of

different code repositories where the extracted idioms

are used at least once.

Additionally, in an attempt to evaluate the extent

to which the extracted idioms can contribute to the

software development procedure, we calculated the

number of idioms identiﬁed by our methodology that

are employed by the repositories of the testing set.

From the results, it appears that each testing reposi-

tory uses about 10 code idioms from our set at least

once, while only 47 projects do not use any of our ex-

Percentage of Idioms

0.0%

0 50 100 150 200 250 300

2.5%

5.0%

7.5%

10.0%

12.5%

15.0%

17.5%

Number of Repositories

Figure 6: Number of different repositories in which our ex-

tracted idioms can be found in.

tracted idioms at all, which can be justiﬁed by the fact

that the selected projects span along a wide variety of

domains and functionalities. Figure 7 illustrates the

number of idioms extracted by our methodology that

are used at least once in the testing repositories.

4.2 Evaluation on the PROJECTS

Dataset

For the evaluation of Haggis, their code idioms min-

ing system, Allamanis and Sutton (Allamanis and

Sutton, 2014) created two evaluation datasets, which

contain some of the top open-source Java reposito-

A Mechanism for Automatically Extracting Reusable and Maintainable Code Idioms from Software Repositories

Percentage of Repositories

0.0%

0 10 20 30 40 50 60 70

2.5%

5.0%

7.5%

10.0%

12.5%

15.0%

17.5%

20.0%

Number of Idioms

Figure 7: Number of our extracted idioms used by the

repositories of the testing set.

ries from GitHub. The LIBRARY dataset is a se-

lection of Java classes that make use of 15 popu-

lar Java libraries (just import the libraries without

re-implementing them). The LIBRARY dataset was

used to mine code idioms that occur across differ-

ent libraries and is the main set of code blocks that

was used to extract idioms. On the other hand, the

PROJECTS dataset is a selection of the top 13 Java

GitHub projects (by the time the authors accessed the

code hosting platform), based on their z-score of forks

and watchers. These projects were used in order to

evaluate the mined idioms from the LIBRARY dataset.

Table 8 depicts these repositories, along with the sha

of the last commit that was used by the authors.

Table 8: The repositories of the PROJECTS dataset.

Name Forks Stars Commit SHA

arduino 2,633 1,533 2757691

atmosphere 1,606 370 a0262bf

bigbluebutton 1,018 1,761 e3b6172

elasticsearch 5,972 1,534 ad547eb

grails-core 936 492 15f9114

hadoop

756 742 f68ca74

hibernate 870 643 d28447e

libgdx 2,903 2,342 0c6a387

netty 2,639 1,090 3f53ba2

storm 1,534 7,928 cdb116e

vert.x 2,739 527 9f79416

voldemort 347 1,230 9ea2e95

wildﬂy 1,060 1,040 043d7d5

In an attempt to compare the usefulness and the

applicability of the code idioms extracted by our ap-

proach against the ones identiﬁed in (Allamanis and

Sutton, 2014), we evaluated our idioms also on the

PROJECTS dataset. After cloning the project reposi-

tories and checking out in the same commit ids used

in (Allamanis and Sutton, 2014), we performed the

comparison analysis that was described and employed

also in the previous subsection. Each code block from

these projects was compared to our extracted idioms

and the evaluation metric was calculated. The eval-

uation metric used in that case is Precision, deﬁned

as the percentage of the identiﬁed idioms that were

found at least once in the testing repositories.

Table 9 depicts the precision of our extracted id-

ioms against the precision achieved in (Allamanis and

Sutton, 2014). Our mined idioms achieve a precision

value of 58%, compared to the 50% precision value of

the Haggis’ idioms. Each code idiom has been found

in 3.5 repositories on average.

Table 9: Evaluation results on PROJECTS dataset.

Approach Precision

Haggis 50%

Ours 58%

4.3 Applicability Evaluation in Practice

Finally, in order to further assess the applicability of

our methodology in practice in providing actual and

useful recommendations during the development pro-

cess, we made use of the StackOverﬂow questions

dataset (Baltes et al., 2018). It is well-known that

questions and answers in StackOverﬂow usually con-

tain short code snippets that need to cover the re-

quested functionality, so they are traditionally con-

cise, cover only the essential code statements and

represent the best programming practices. Thus,

StackOverﬂow is considered to incorporate highly id-

iomatic code snippets.

From the StackOverﬂow questions dataset, we

created two different sets of data for our evalua-

tion. We extracted the answers referred to Java-tagged

questions that contain at least one block of code and

generated one set of code snippets using all code

blocks found in these answers. A second set of code

blocks was also build, that includes only the snippets

that belong to the answers marked as correct.

We make use of the comparison method men-

tioned in subsection 4.1, in order to examine whether

our code idioms can be found in code snippets that

usually constitute idiomatic code. Finally, we use two

different metrics; we calculate the precision of the ex-

tracted idioms, i.e. the percentage of the code idioms

that are found in the StackOverﬂow snippets, and the

The branch of the hadoop project that contained the

selected commit id has been dropped by the developers and,

thus, we did not use this project in our evaluation.

ICSOFT 2022 - 17th International Conference on Software Technologies

average number of times a code idiom is being used

in these snippets. Table 10 depicts these results. The

precision metric in the whole StackOverﬂow dataset

is 66% and in the accepted answers 62%, while our

idioms were found in 243 different posts and 30 cor-

rect answers on average. The reduced metrics on the

accepted answers dataset were expected, as this set

contains signiﬁcantly fewer code snippets. It should

be noted that Allamanis and Sutton achieved a pre-

cision of 67% on the StackOverﬂow answers dataset,

which is almost similar to ours.

Table 10: Evaluation results on StackOverﬂow dataset.

Dataset Precision

Average #

Occurrences

Answers 66% 243

Accepted Answers 62% 30

5 THREATS TO VALIDITY

In this work, we presented our approach towards iden-

tifying and extracting maintainable and reusable code

idioms from the most popular code hosting platforms.

Based on the evaluation performed, the only limita-

tion that applies to the internal validity of our system

is the selection of the repositories based on their num-

ber of stars and forks, as a reﬂection of their reusabil-

ity. At the same time, our design choice of select-

ing the most popular repositories from GitHub in or-

der to mine and extract our code idioms can possi-

bly be a threat to the external validity of our system.

However, the selection of the top repositories is not

random, as these projects have proven their quality

in practice (including maintainability and reusability)

and their popularity proves that they have the accep-

tance of a huge team of developers. Moreover, many

researchers have argued that projects that exhibit large

numbers of stars and forks, also exhibit high qual-

ity (Dimaridou et al., 2018; Dimaridou. et al., 2017;

Papamichail et al., 2016b), involve frequent mainte-

nance releases (Borges et al., 2016) and have sufﬁ-

cient documentation (Aggarwal et al., 2014; Weber

and Luo, 2014).

As far as the external validity of our approach is

concerned, one should take into account also the fol-

lowing limitation. We selected to look for code id-

ioms only in CFS and not on each possible code snip-

pet. Even though this fact can be considered as a limi-

tation of our approach, in fact it is proven in our eval-

uation section that code idioms extracted from CFS

are really useful.

6 CONCLUSIONS AND FUTURE

WORK

In this work, we proposed a mechanism that can au-

tomatically examine the most popular projects and

repositories and identify code idioms that are used

within those projects. These idioms are character-

ized by high reusability and can be easily used by

developers, in order to accelerate the software devel-

opment procedure, while improving the quality of the

project and ensuring an acceptable level of maintain-

ability. As argued by Allamanis et al. (Allamanis

et al., 2018), code idioms can help ”developers clearly

communicate the intention of their code” and, thus,

produce more readable and reusable software. Addi-

tionally, they can help developers avoid frequent er-

rors or, even, provide useful suggestions in the most

common ”how to” questions asked by developers and

”bootstrap” the quality of work of junior developers

by adopting the ”best practices”.

The evaluation of our approach in three diverse

axes indicates that the code idioms generated by our

system can be found in a large number of the top

repositories, execute useful and commonly asked pro-

gramming tasks, seem natural to developers and can

be actionable recommendations to the developers.

Moreover, our code idioms can be found in a large de-

gree in question-and-answers systems, such as Stack-

Overﬂow, which is the main point where developers

usually look for idiomatic code. Additionally, the

extraction of our code idioms from the most popu-

lar repositories in GitHub ensures that these code id-

ioms are characterized by high reusability, while, at

the same time, can help the developers achieve an ac-

ceptable level of maintainability.

Future work lies in various axes. First of all, our

methodology could be further expanded by including

snippets of code that do no belong to CFS, in order to

identify also general code idioms. Additionally, in the

last step of our methodology, which produces the gen-

eralized form of the code idioms, the initialization of

variables, methods and objects could also be included

into the idiom. Moreover, we would suggest the use

of various methods that can speed up the process of

identifying code idioms, such as the use of a GPU for

the distance calculations, the investigation of faster

comparison methods and the additional preprocessing

methods that could signiﬁcantly reduce the size of the

distance matrices. Finally, the usefulness of the gen-

erated idioms could be further evaluated on the basis

of a developer survey.

A Mechanism for Automatically Extracting Reusable and Maintainable Code Idioms from Software Repositories

REFERENCES

Abran, A. and Nguyenkim, H. (1993). Measurement of the

maintenance process from a demand-based perspec-

tive. J. Softw. Maintenance Res. Pract., 5:63–90.

Aggarwal, K., Hindle, A., and Stroulia, E. (2014). Co-

evolution of project documentation and popularity

within github. In Proceedings of the 11th Working

Conference on Mining Software Repositories, MSR

2014, page 360–363, New York, NY, USA. Associ-

ation for Computing Machinery.

Allamanis, M., Barr, E. T., Bird, C., Devanbu, P., Marron,

M., and Sutton, C. (2018). Mining semantic loop id-

ioms. IEEE Transactions on Software Engineering,

44(7):651–668.

Allamanis, M. and Sutton, C. (2014). Mining idioms from

source code. CoRR, abs/1404.0417.

Augsten, N., B

ohlen, M., and Gamper, J. (2008). The

¡i¿pq¡/i¿-gram distance between ordered labeled trees.

ACM Trans. Database Syst., 35(1).

Augsten, N., B

ohlen, M. H., and Gamper, J. (2005). Ap-

proximate matching of hierarchical data using pq-

grams. In VLDB.

Baltes, S., Dumani, L., Treude, C., and Diehl, S. (2018).

Sotorrent: reconstructing and analyzing the evolution

of stack overﬂow posts. In Zaidman, A., Kamei, Y.,

and Hill, E., editors, Proceedings of the 15th Interna-

tional Conference on Mining Software Repositories,

MSR 2018, Gothenburg, Sweden, May 28-29, 2018,

pages 319–330. ACM.

Borges, H., Hora, A., and Valente, M. T. (2016). Un-

derstanding the factors that impact the popularity of

github repositories. In 2016 IEEE International Con-

ference on Software Maintenance and Evolution (IC-

SME), pages 334–344.

Dimaridou., V., Kyprianidis., A., Papamichail., M., Dia-

mantopoulos., T., and Symeonidis., A. (2017). To-

wards modeling the user-perceived quality of source

code using static analysis metrics. In Proceedings

of the 12th International Conference on Software

Technologies - ICSOFT,, pages 73–84. INSTICC,

SciTePress.

Dimaridou, V., Kyprianidis, A.-C., Papamichail, M., Dia-

mantopoulos, T., and Symeonidis, A. (2018). Assess-

ing the user-perceived quality of source code compo-

nents using static analysis metrics. In Cabello, E.,

Cardoso, J., Maciaszek, L. A., and van Sinderen, M.,

editors, Software Technologies, pages 3–27, Cham.

Springer International Publishing.

Fowkes, J. and Sutton, C. (2016). Parameter-free proba-

bilistic api mining across github. In Proceedings of

the 2016 24th ACM SIGSOFT International Sympo-

sium on Foundations of Software Engineering, FSE

2016, page 254–265, New York, NY, USA. Associa-

tion for Computing Machinery.

Hnatkowska, B. and Jaszczak, A. (2014). Impact of se-

lected java idioms on source code maintainability –

empirical study. In Zamojski, W., Mazurkiewicz, J.,

Sugier, J., Walkowiak, T., and Kacprzyk, J., editors,

Proceedings of the Ninth International Conference

on Dependability and Complex Systems DepCoS-

RELCOMEX. June 30 – July 4, 2014, Brun

ow, Poland,

pages 243–254, Cham. Springer International Pub-

lishing.

Ji, X., Liu, L., and Zhu, J. (2021). Code clone detection with

hierarchical attentive graph embedding. International

Journal of Software Engineering and Knowledge En-

gineering, 31(6):837–861. cited By 0.

Klein, P. (1998). Computing the edit-distance between un-

rooted ordered trees. In ESA.

McCabe, T. (1976). A complexity measure. IEEE Transac-

tions on Software Engineering, SE-2:308–320.

Papamichail, M., Diamantopoulos, T., and Symeonidis, A.

(2016a). User-perceived source code quality estima-

tion based on static analysis metrics. pages 100–107.

Papamichail, M., Diamantopoulos, T., and Symeonidis, A.

(2016b). User-perceived source code quality estima-

tion based on static analysis metrics. In 2016 IEEE

International Conference on Software Quality, Relia-

bility and Security (QRS), pages 100–107.

Sivaraman, A., Abreu, R., Scott, A., Akomolede, T., and

Chandra, S. (2021). Mining idioms in the wild. CoRR,

abs/2107.06402.

Tai, K. (1979). The tree-to-tree correction problem. J. ACM,

26:422–433.

Tanaka, H., Matsumoto, S., and Kusumoto, S. (2019). A

study on the current status of functional idioms in

java. IEICE Transactions on Information and Systems,

E102.D:2414–2422.

Wang, J., Dang, Y., Zhang, H., Chen, K., Xie, T., and

Zhang, D. (2013). Mining succinct and high-coverage

api usage patterns from source code. In 2013 10th

Working Conference on Mining Software Repositories

(MSR), pages 319–328.

Weber, S. and Luo, J. (2014). What makes an open source

code popular on git hub? In 2014 IEEE Interna-

tional Conference on Data Mining Workshop, pages

851–855.

Zhang, K. and Shasha, D. (1989). Simple fast algorithms for

the editing distance between trees and related prob-

lems. SIAM J. Comput., 18:1245–1262.

Zhang, Y. and Wang, T. (2021). Cceyes: An effective tool

for code clone detection on large-scale open source

repositories. pages 61–70. cited By 0.

ICSOFT 2022 - 17th International Conference on Software Technologies