A Hybrid Approach to MVC Architectural Layers Analysis

Dragos¸ Dobrean

and Laura Dios¸an

Computer Science Department, Babes Bolyai University, Cluj Napoca, Romania

Keywords:

Mobile Applications Software Architecture Analyser, Automatic Analysis of Software Architectures,

Structural and Lexical Information, Software Clustering, Hybrid Approach.

Abstract:

Mobile applications have become one of the most important means of interacting with businesses, getting

information, or accessing entertainment and news for the vast majority of the people, especially for the young

generations. How those applications are being built, heavily inﬂuences their lifecycle, costs, and product

roadmap, that is why software architecture plays a very important role as it affects the maintainability and

extensibility of those products.

We are presenting a novel automatic approach for detecting MVC architectural layers from mobile codebases

that combines an unsupervised Machine Learning algorithm and a classic static analysis. Our proposal does

not require any prior training stage or datasets since it does not rely on apriori annotated codebases. As another

key of novelty, it uses the information obtained from the mobile SDKs for enhancing the detection process.

The validation of our proposal is done in eight different sized codebases that operate in various domains and

come from either open-source projects as well as closed-source ones. The performance of the detection quality

is measured by the accuracy of the system, as we compared to a manually constructed ground truth, achieving

an average accuracy of 85% on all the analysed codebases.

Our proposal provides a viable hybrid approach for detecting architectural layers from mobile codebases

achieving good results by providing the accurate detection of the layers using a deterministic step and great

ﬂexibility for being used on other architectural patterns via the non-deterministic step. Furthermore, we con-

sider our approach as being valuable to students or beginners because it could provide insightful information

on how the code should be structured and help them to respect architectural guidelines in real-world projects.

1 INTRODUCTION AND

CONTEXT

Mobile phones have become one of the most personal

devices, they are fully packed with applications that

help people manage their ﬁnances, health, exercises,

and social life. Those applications need to be ﬂexi-

ble to change, easily maintainable and they need to

run on a large variety of devices and OSs. With this

study, we are furthering our work (Dobrean, 2019)

towards creating an autonomous system for improv-

ing the architectural health of those projects since a

large number of the mobile codebases do not have a

well deﬁned architectural pattern in place or even if

they do, it is not consistently implemented all over

the codebase (DeLong, 2017).

Furthermore, identifying and examining the im-

plemented software architecture of mobile applica-

https://orcid.org/0000-0001-7521-7552

https://orcid.org/0000-0002-6339-1622

tions codebases could help the inexperienced devel-

opers or the students to better grasp or perceive the

architecture importance, an important premise in soft-

ware engineering education. These beginners could

beneﬁt from using an automatic detection system and

to advance in the difﬁcult task of learning architecting

(Galster and Angelov, 2016), (Li, 2020).

Such a system would also provide valuable in-

sights to teachers and mentors regarding the mistake

their students make by examining the history of the

analysis performed by the system on the student’s

codebase.

In (Dobrean and Dios¸an, 2019a) we have pre-

sented a deterministic approach for automatically de-

tecting architectural layers. We have furthered our re-

search in (Dobrean and Dios¸an, 2020) where we have

approached the same problem from a different per-

spective, we have used Machine Learning algorithms

for detecting those layers by considering the detec-

tion as a clustering problem (each layer representing

a cluster).

Dobrean, D. and Dio¸san, L.

A Hybrid Approach to MVC Architectural Layers Analysis.

DOI: 10.5220/0010326700360046

In Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2021), pages 36-46

ISBN: 978-989-758-508-1

Our newest proposal is a hybrid one, where we use

a combination of deterministic and non-deterministic

(Machine Learning) methods for solving the same

problem. By using this new approach, we are paving

the way for more specialized architecture detection

(such as MVVM, MVP, VIPER, etc.), we call this ap-

proach HyDe (Hybrid Detection).

Analyzing mobile codebases using a purely deter-

ministic approach can not yield great results on a large

variety of codebase, the main reason for this being

the fact that projects have their speciﬁc particularities,

such as coding standards, naming conventions, code-

base split (how external libraries and tools are being

integrated), data ﬂows, etc.. When using a Machine

Learning approach, some of the elements might be

wrongly categorized due to the features of the compo-

nents. Besides, there is not a lot of information on the

particularities of the layers that need to be detected.

Both developed detection systems have strong

points as well: the deterministic one is able to accu-

rately detect codebase elements and place them in the

right architectural layers, while the non-deterministic

one allows more ﬂexibility for the analysis part and

it can recognize clusters in more types of codebases.

With these thoughts in mind, we joined both ideas into

a hybrid one (HyDe) and we have used it to separate

the architectural layers from a codebase using a com-

bined approach. In this study, we are searching for

a viable way of combining the deterministic methods

(such as SDK inheritance) with the non-deterministic

ones (such as clustering) into a hybrid approach for

processing mobile codebases.

Research Challenges. The main challenges of these

research topics are:

• architecture detection using a combined approach

(deterministic + non-deterministic);

• a clustering process that works while combined

with the information obtained from the determin-

istic process;

• automatically inferring the architectural layers of

an MVC codebase that uses an SDK for building

the user interfaces — as those SDKs usually con-

tain more types of items than web SDKs.

Contributions. The main contributions of this study

are listed below:

• a new workﬂow for automatic detection of archi-

tectural layers using a hybrid system;

• a novel approach at combining a deterministic

method that uses SDK and lexical information for

detecting architectural layers in a codebase with

a non-deterministic one that uses a wide range

of features extracted from the codebase (such as

the number of common methods and properties

and components name similarities) for solving the

same problem;

• an innovative way of leveraging the informa-

tion from the deterministic step to aid the non-

deterministic one to increase the accuracy.

The paper is structed as follows. The next section in-

troduces the formalisation of the scientiﬁc problem

and presents details about other related work. Section

3 introduces our approach HyDe. The numerical ex-

periments and analysis of the method are presented in

sections 4 and 5. The end of the paper is composed

by Section 6 which presents the threats to validity, fol-

lowed by conclusions and some directions for further

work in Section 7.

2 SCIENTIFIC PROBLEM

A mobile codebase is made of components (classes,

structures, protocols, and extensions); we represent

a codebase in a formal matter by using the fol-

lowing notation A = {a

, a

, . . . , a

} where a

, i ∈

{1, 2, . . . , n} denotes a component.

The purpose of this study is to ﬁnd a way for split-

ting the codebase into architectural layers. An archi-

tectural layer is represented by a partition P of the

components P = {P

, P

, . . . , P

} in the codebase that

satisﬁes the following conditions:

• 1 ≤ m ≤ n;

• P

is a non-empty subset of A, ∀ j ∈ {1, 2, . . . , m};

• A = ∪

j=1

, P

∩ P

0, ∀ j

, j

∈ {1, 2, . . . , m}

with j

6= j

In this formulation, a P

( j ∈ {1, 2, . . . , m}) subset of

A represents an architectural layer.

Since we are using a hybrid approach, the detected

architectural layers (partitions) P

, P

, ... P

are de-

termined by applying a set of deterministic rules and

a clusterization process on some of the output parti-

tions.

In order to apply a clustering algorithm, each com-

ponent a

(i ∈ {1, 2, . . . , n}) has to be represented by

one or more features or characteristics. We suppose

to have k features and we denote them by F(a

) =

, F

, . . . , F

The purpose of the hybrid approach is to output

partitions of components that match as best as possi-

ble the ground truth (the partitions provided by human

experts).

The Model View Controller is one of the most

widespread presentational architectural patterns. It

is used extensively in all sorts of client applications,

A Hybrid Approach to MVC Architectural Layers Analysis

web, desktop and mobile. It provides a simple sepa-

ration of concerns between the components of a code-

base in 3 layers:

• Model: responsible for business logic.

• View: responsible for the user’s input and the out-

put of the application.

• Controller: keeps the state of the application, acts

like a mediator between the Model and the View

layer.

There are many ﬂavours of MVC in which the data-

ﬂow is different, but they all share the same model

of separation. MVC is also the precursor of more

specialised presentational architectural patterns such

as Model View View Model (MVVM) or Model

View Presenter (MVP) (Dobrean and Dios¸an, 2019b),

(Daoudi et al., 2019).

In this study, the focus is on analyzing MVC ar-

chitectures, therefore, the number of partitions m =

3. For a better understanding of the concepts, we

are going to substitute the notation of partitions P =

, P

}, with P = {M, V, C}, as this notation bet-

ter reﬂects the partition and the output layers we are

interested in ﬁnding, M representing the Model layer,

V representing the View layer, and C being the Con-

troller layer. For the rest of the paper, this notation

will be used to refer to the partitions for architectural

layers.

Other studies have focused on architecture recov-

ery by clusterization. For instance, Mancoridi (Man-

coridis et al., 1999), Mitchell (Mitchell and Man-

coridis, 2008) or Lutellier (Lutellier et al., 2015) from

a structural point of view, Anquetil (Anquetil and

Lethbridge, 1999) or Corazza (Corazza et al., 2016)

from a lexical point of view and Garcia (Garcia et al.,

2011) or Rathee (Rathee and Chhabra, 2017) from a

structural and lexical point of view. However, none

of those approaches speciﬁcally focused on mobile

codebase or codebases that use SDKs for building

their UI interfaces.

HyDe combines our previous work, regarding

splitting a mobile codebase into architectural layers,

a deterministic approach where information from the

mobile SDKs is being used for detecting the archi-

tectural layers for a certain component using a set

of rules (Dobrean and Dios¸an, 2019a) with a non-

deterministic one for which we previously paved the

way with a study where we have applied Machine

Learning techniques for solving the same problem

(Dobrean and Dios¸an, 2020).

3 HyDe: A TWO-STAGE

AUTOMATIC DETECTION OF

ARCHITECTURAL LAYERS IN

MOBILE CODEBASES

Mobile applications are clients, usually, they are built

as monoliths as they are self-contained.

In this study, we are interested in automatically

categorizing the component of a codebase into ar-

chitectural layers, based on the software architecture

used. We are focusing on MVC, as is one of the most

widely used architectural patterns, and is the precur-

sor of other, more specialized architectures. For the

inferring information from the codebases, we are do-

ing a static analysis of its components.

By component, in this work we understand, a

class, a struct, or a protocol, as building blocks of the

codebase. We are interested in all the public and pri-

vate properties, methods, and inherited types (if any)

of those components. Our interest does not revolve

around the body of the functions; we are only in-

terested in their signature (naming and parameters),

nor are we interested in the dependencies between the

components or how they interact.

When developing these kinds of applications,

SDKs are used for displaying information on the

screen, interacting with the user, and using the hard-

ware of those devices (camera, sensors, etc.). The

codebase of those products is split into architectural

layers, and some of those are closely related to the

SDK deﬁned elements (the View and Controller layer

from MVC).

Our proposal is categorized by two important fea-

tures, unsupervised — which means there is no prior

knowledge needed before analyzing a codebase —

and autonomous — no developer involvement needed

after the process is started. Furthermore, HyDe in-

volves a two-step process: ﬁrstly, we apply the de-

terministic approach on the entire codebase, and sec-

ondly, on some of the components we run a clustering

algorithm to enhance the categorization of the com-

ponents in architectural layers.

3.1 Pre-processing

To obtain information regarding the components of a

codebase, a precursor step needs to be made, the pre-

processing, where we extract the information required

as input by the proposed approach. This is done by

statically analyzing all the source ﬁles of the project;

we are only interested in code source ﬁles, so the re-

sources are ignored (a comprehensively presentation

of the pre-processing particularities can be found in

ENASE 2021 - 16th International Conference on Evaluation of Novel Approaches to Software Engineering

(Dobrean and Dios¸an, 2019a)).

As an outcome of this step, we have the set of

components A = {a

, a

, . . . , a

} where every a

, i ∈

{1, 2, . . . , n} is characterised by:

• name

• type (class, struct, protocol)

• path

• inherited types

• all the private, static, and non-static methods and

properties

3.2 Deterministic Step

In this part of the study, HyDe determines in which

layer should a component reside by looking at the re-

lationship between the components and the SDK. The

mechanism behind this deterministic step is actually

of MaCS (Dobrean, 2019): constructing the topolog-

ical structure of the codebase, analysing the relation-

ships among components and assigning them to the

architectural layers.

3.2.1 Extraction

After the codebase has been preprocessed, MaCS,

ﬁrstly, creates the topological structure of the code-

base (a directed graph) in which every node corre-

sponds to a codebase component. The links between

the nodes of the graph represent dependencies be-

tween the components, for instance, if component A

has a property of type B, then we are going to have

a dependency, a link from the component A to the

component B. Unlike a typical graph, the topological

structure created could have multiple links between

(eg. if component A has multiple properties or meth-

ods that have as parameters components of type B).

3.2.2 Categorization

In the categorization part, the codebase is split into ar-

chitectural layers: every component is analyzed and

placed into one of the 3 layers (Model, View, Con-

troller) – since we are only focusing on MVC.

We have applied the CoordController approach

(Dobrean, 2019), in which we also detect the Coordi-

nating Controller layer, because it paves the way for

analyzing more complex architectures and codebases.

The Coordinating Controller layer is responsible for

keeping the state of the application and deciding what

is the ﬂow of the application, it is a controller of the

state of the application. Those type of elements have

no direct correspondent in the development SDKs,

however, those should also reside in the Controller

layer of an MVC architecture (Dobrean, 2019).

The architectural layers are constructed in a deter-

ministic fashion by using the following rules:

• All Controller layer items should inherit from

Controller classes deﬁned in the used SDK;

• All View layer items should inherit from a UI

classes deﬁned in the used SDK;

• All the remaining items are treated as Model layer

items.

In addition to these base rules, we have also applied

the one meant to detect the Coordinating Controller

elements:

• All the components which have properties or

methods that manipulate Controller components

(including other Coordinating Controller compo-

nents, also) are marked as Coordinating Con-

trollers.

All the coordinating controller components are also

placed into the Controller layer.

The output of MaCS represents a partition P =

{M, V, C}, where the M, V , and C subsets are

constructed by applying the above rules over A =

, a

, . . . , a

To be able to formalise these decision rules, we

involve the concept of predicate as following: a predi-

cate of type X over two components as pred

(el

, el

)

is True if we can apply the action X over el

and we

obtain el

. The proposed checker system deﬁnes the

following predicates:

• pred

instanceO f

(component

, Type) = True when

component

is a variable of type Type;

• pred

inheritance

(component

, component

) = True

when component

(directly) inherits component

;

• pred

using

(component

, component

)

• pred

using

(component

, listComponent

)

• pred

inheritance

(component

, listComponent

) =

True

With these deﬁnitions in place we can represent the

layers using the above stated rules as sets:

simple

={a

|pred

, SDK’s Controller) = True,

where X ∈ {instanceO f , inheritance},

∈ A}

(1)

Coordinator s = {a

∈ A, ∃v ∈ a

. properties and

c ∈ C such as pred

instanceO f

(v, c) = True or

∃m ∈ a

.methods and c ∈ C such as

pred

using

(m, c) = True}

(2)

A Hybrid Approach to MVC Architectural Layers Analysis

C = C

simple

∪Coordinators

(3)

V ={a

|pred

, SDK’s View) = True,

where X ∈ {instanceO f , inheritance}, a

∈ A}

(4)

M = A \ (V ∪C) (5)

After the categorization process is ﬁnished, the sys-

tem yielded the ﬁrst assignment of the components

into architectural layers. By using MaCS approach,

the View layer is determined with high accuracy, the

Controller layer has a good accuracy as well, as the el-

ements which inherit from an SDK deﬁned Controller

are being detected.

The major downside of a pure deterministic detec-

tion is the fact that the heuristics used for determin-

ing the Model and Coordinating Controller compo-

nents do not always yield great results; it can happen

that certain elements from either the Controller or the

Model to be wrongly categorized.

3.3 Non-deterministic Step

Trying to improve the detection accuracy, HyDe in-

volves a second stage when a clustering algorithm is

run onto the output layers of the ﬁrst step in order to

better categorize the elements of the codebase.

We have already conducted some experiments on

a clustering approach only (from scratch) for solving

the same problem (splitting a codebase into architec-

tural layers) in (Dobrean and Dios¸an, 2020). While

the clustering algorithm is the same, the process used

in this study is rather different as we are focusing on

only a part of the codebase for applying the cluster-

ing algorithm and one of the most important parts in a

clustering process, the feature detection is completely

new and different than our previous approach CARL

(Dobrean and Dios¸an, 2020).

Clustering is one of the most commonly used Ma-

chine Learning techniques for ﬁnding similarities in

collections and splitting the items into groups that

have similar features.To better split the codebase into

architectural layers, a clustering process is applied on

the layers in which the accuracy obtained in the ﬁrst

step was not satisfactory – the Model and the Con-

troller layer.

What we try to achieve with this step of the pro-

cess is ﬁltering better the elements from the Model

and Controller layer, to lower the percentage of

wrongly categorized components. We know that Con-

troller elements and Model elements have certain par-

ticularities that were not caught using the previously

described rules. However, a clustering algorithm can

ﬁnd particularities in data that are less obvious.

HyDe is applying the clustering process only to

the Model and Controller layer (M and C partitions)

obtained from the deterministic approach, the View

(V ) being ignored. The View layer is ignored as the

detection precision for this layer is 100% (Dobrean

and Dios¸an, 2019a), see Table 2. Therefore the set of

analysed elements is represented by

E = M ∪C = {a

, a

, . . . , a

where a

is a component from set A (see problem def-

inition) and denotes a component that was identiﬁed

as either as a Model or Controller element by the De-

terministic step, and k represents the number of com-

ponents in E.

3.3.1 Feature Extraction

Feature extraction from raw input data is an essential

stage before applying the clustering algorithm. Based

on the chosen feature set, a clustering algorithm can

yield good or bad results, even if it is applied to the

same data collection.

From the pre-processing stage, we have informa-

tion regarding every component from the codebase.

In addition, due to the deterministic process, we also

have information regarding the initial categorization

of the components in architectural layers.

We have applied HyDe to one of the mid-sized

codebases (E-Commerce) for which we had the

ground-truth constructed and we have analyzed mul-

tiple features of the codebase such as:

• Levenshtein distance between the components

name

• Whether a component inherits from a Model com-

ponent (from the output of the deterministic step)

• Whether a component is using a Controller or

Model component (it has properties or methods

that use Controller or Model items from the deter-

ministic step output)

• Whether a component is using a Controller com-

ponent (it has properties or methods that use Con-

troller items from the deterministic step output)

• Whether a component is a class or another type of

programming language structure (such as structs,

protocols, extensions)

• Whether a component is included in the Con-

troller components (from the output of the deter-

ministic step)

• Number of common methods between the ana-

lyzed components

• Number of common properties between the ana-

lyzed components

ENASE 2021 - 16th International Conference on Evaluation of Novel Approaches to Software Engineering

We have applied a trial and error approach to the set of

the features above mentioned to ﬁnd out which con-

ﬁguration yields the best results on the benchmark ap-

plication. After different tries, we have concluded that

the best results on the benchmark application come

from the following set of features.

Two features take into account the already ob-

tained assignments in the ﬁrst step of our proposed

system:

• Whether a component is included in the set of

Controller components (from the output of the de-

terministic step). We associate a score θ if the

component is included in the Controller layer and

0 otherwise. The score value has no inﬂuence over

the detection process because, by normalisation, a

Boolean feature is created.

) =

(

θ, if a

∈ C

0, otherwise

(6)

• Whether a component is using a Controller com-

ponent (it has properties or methods that used

Controller items from the deterministic step out-

put). We associate a score of 1 if the component

uses another component from the Controller layer

and 0 otherwise.

) =

(

1, if pred

using

, C) = True

0, otherwise

(7)

In addition to those two rules, for every pair of com-

ponents from the analyzed set (E) we compute the fol-

lowing values for the feature set of a component:

• a feature that emphasises the similarity of two

components’ names

) = [NameDist(a

, a

NameDist(a

, a

. . . , NameDist(a

, a

)]

(8)

where NameDist(a

, a

) is the Levenshtein dis-

tance (Levenshtein, 1966) between the names of

component a

and component a

• a feature based on how many properties two com-

ponents have in common

) = [C ommProp(a

, a

CommProp(a

, a

), . . . , CommProp(a

, a

)]

(9)

where CommProp(a

, a

) is the number of com-

mon properties (name ant type) of component a

and component a

• a feature based on how many methods two com-

ponents have in common

) = [C ommMeth(a

, a

CommMeth(a

, a

), . . . , CommMeth(a

, a

)]

(10)

where CommProp(a

, a

) is the number of com-

mon methods (signature, name, and parameters)

of component a

and component a

The output of this feature extraction phase is repre-

sented by the matrix that encodes all ﬁve feature sets,

M = F

∪ F

. In fact, a (3 × k + 2) × k

matrix, that contains the feature set for every compo-

nent in the codebase is constructed.

3.3.2 Clusterization

For the clusterization algorithm, we have found out

that the Agglomerative Clustering (Murtagh, 1983)

works best for this type of problem – a hierarchical,

bottom-up approach. Since we are analyzing only the

Model and the Controller layer, the number of clusters

used for this step is set to two. One of the most im-

portant features of a clustering algorithm is the mea-

surement of similarities between the clusters, our ap-

proach uses the Euclidian distance (Huang, 2008).

The ﬁve computed features, described in the previous

section, are normalized and then they are fed to the

clustering algorithm.

The last step of the process is assigning respon-

sibilities to the output clusters. HyDe does that by

looking at the output clusters and deciding which one

is the Controller cluster based on the number of com-

ponents that inherit from an SDK deﬁned Controller

component.

With this ﬁnal step, HyDe has categorized the

codebase in the ﬁnal architectural layers. The View

layer is identical to the one obtained from the deter-

ministic step of the proposal, the Controller one is de-

termined in the last phase after the clustering has been

applied and the group which contained the most ele-

ments that inherited from a Controller type was identi-

ﬁed, while the Model is represented by the other clus-

ter from the clusterization process output. A sketch of

our system is depicted on Figure 1.

In case of more specialised architectures, where

there are multiple other architectural layers, some

heuristics should be used for assigning architectural

layers to the output clusters. Those heuristics are

closely related to the particularities of the analysed

software architecture as well as the purpose of its

composing architectural layers.

A Hybrid Approach to MVC Architectural Layers Analysis

Figure 1: Overview of the HyDe workﬂow.

4 PERFORMANCE EVALUATION

To validate the performance of our approach, we are

using the following metrics for analyzing the result of

the HyDe against a ground truth obtained by manu-

ally inspecting the codebase by two senior developers

(with a development experience of over ﬁve years):

• accuracy: Acc =

AllLayers

DetectedCorrectly

allComponents

• precision for the layer X: P

DetectedCorrectly

TotalDetected

• recall for the layer X: R

DetectedCorrectly

GroundTruth

where:

• N

DetectedCorrectly

: is the number of component de-

tected by the system which belong to the X layer

and are found in the ground truth for that layer;

• N

TotalDetected

: is the number of component de-

tected by the system as belonging to the X layer;

• N

GroundTruth

: is the number of component which

belong to the X layer in the ground truth.

In addition to those metrics, we are also interested in

the performance of the process not only from a soft-

ware engineering point of view but also from a Ma-

chine Learning perspective. To analyze the ML per-

formance of our approach we have used the following

metrics:

• Silhouette Coefﬁcient score (Rousseeuw, 1987)

and

• Davies-Bouldin Index (Davies and Bouldin,

1979).

The Silhouette Coefﬁcient measures how similar is

a component of a cluster compared to the other ele-

ments which reside in the same cluster compared to

the other clusters. The range of values for this score

is [-1;1] where a high value means that the compo-

nent is well matched in its own cluster and dissimilar

when compared to other clusters. We have computed

the mean Silhouette Coefﬁcient score over all compo-

nents of each codebase.

Davies-Bouldin Index is deﬁned as the average

similarity measure of each cluster with its most sim-

ilar cluster. The similarity is represented by the ra-

tio of within-cluster distances to between-cluster dis-

tances. The Davies-Bouldin Index indicates how well

was the clustering performed; the minimum value for

this metric is 0, the lower the value the better.

Both of these metrics provide insightful informa-

tion regarding the clustering performance: the Sil-

houette Coefﬁcient score indicates how well are the

components placed, while the Davies-Bouldin Index

expresses whether or not the clusters were correctly

constructed.

The Machine Learning perspective was applied to

the output data obtained from the entire HyDe pro-

cess, when computing these metrics we looked at the

ﬁnal result, and we’ve viewed the output architectural

layers as clusters, even some of them might come

from a deterministic step (View layer in our partic-

ular case). By computing these metrics, we can also

understand the structural health of the codebase, how

related are the components between them, and how

high is the degree of differences between architectural

layers.

5 NUMERICAL EXPERIMENTS

Our analysis focuses on MVC, a widely used architec-

tural pattern (Vewer, 2019) for mobile development.

Our experiments are run on the iOS platform; how-

ever, they can be replicated on other platforms that

use SDKs for building their user interfaces.

For validating our proposal, we have compiled a

list of questions for which we search answers with

our experiments:

• RQ1: How can the deterministic approach and the

non-deterministic one be combined?

• RQ2: How effective and performant is the HyDe

approach?

• RQ3: What are the downsides of using a hybrid

approach for architectural layers detection?

ENASE 2021 - 16th International Conference on Evaluation of Novel Approaches to Software Engineering

5.1 Analysed Codebases

We have conducted experiments on eight different

codebases, both open-source and private, of different

sizes:

• Firefox: the public mobile Web browser (Mozilla,

2018),

• Wikipedia: the public information application

(Wikimedia, 2018),

• Trust: the public cryptocurrency wallet (Trust,

2018),

• E-Commerce: a private application,

• Game: a private multiplayer game.

• Stock: a private trading application,

• Education: a private education application for par-

ents,

• Demo: the Apple’s example for AR/VR (Apple,

2019)

We were interested in MVC codebases, as this is one

of the most popular software architecture used on

client applications. iOS applications implement the

MVC pattern more consistently than Android, as Ap-

ple encourages developers to use it through examples

and documentation (Apple, 2012). To our knowledge,

there is not a selection of repositories used for analyz-

ing iOS applications, that is why we have manually

selected the codebases. The selection was performed

to include small, medium, and large codebases. In

addition, we have included both private source and

open source codebases as there might be differences

in how a development company and the community

write code. Companies respect internal coding stan-

dards and styles that might not work or are different

than the ones used by open-source projects.

Table 1: Short description of investigated applications.

Application Blank Comment Code No. of components

Firefox 23392 18648 100111 514

Wikipedia 6933 1473 35640 253

Trust 4772 3809 23919 403

E-Commerce 7861 3169 20525 433

Game 839 331 2113 37

Stock 1539 751 5502 96

Education 1868 922 4764 105

Demo 785 424 3364 27

Table 1 presents the sizes of the codebase, blank –

refers to empty lines, comment – represents com-

ments in the code, code states the number of code

lines, while the number of components represents the

total number of components in the codebase.

5.2 Empirical Evaluation

After the experiments were conducted on all of the

codebases, we have analyzed the results based on the

3 questions which represent the base for this study.

5.2.1 RQ1: How Can the Deterministic

Approach and the Non-deterministic One

be Combined?

For constructing a system that would yield good re-

sults and involve a hybrid approach, it is clear that the

only way to do it is to apply the non-deterministic ap-

proach after the deterministic one. These approaches

can work well together if we leverage the best parts

from both of them and try to improve the methods

where they lacked.

From (Dobrean and Dios¸an, 2019a) it is clear that

the deterministic approach works really well for the

layers for which there are rules strong enough to de-

termine the components with high accuracy. In the

case of the current study, MaCS was able to iden-

tify with high accuracy the View layer. When ana-

lyzing custom architectures, a deterministic approach

might yield great results on other architectural layers

as well, but for the purpose of this study, we have fo-

cused only on MVC.

The non-deterministic approach which was in-

spired by (Dobrean and Dios¸an, 2020) works well for

ﬁnding similarities between components without us-

ing heuristics.

With those ideas in mind, we have concluded that

the best way for those approaches to function well

together and achieve good results is to apply the de-

terministic method for identifying the layers, and af-

terwards to apply the non-deterministic approach to

those layers for which the deterministic approach did

not work that well. The second stage of HyDe could

be considered as a ﬁlter, enhancing the detection re-

sults.

To summarise, we have applied the determin-

istic approach ﬁrst and obtained the components

split into architectural layers based on the rules

used by MaCS, afterward we have applied the non-

deterministic method for the layers in which MaCS

results were not conﬁdent – the Model and Controller

layers.

5.2.2 RQ2: How Effective and Performant is the

HyDe Approach?

We have applied the HyDe approach to 8 codebases of

different sizes, to cover multiple types of applications

from a size perspective, but also from functionality

and the domain they operate in perspectives.

A Hybrid Approach to MVC Architectural Layers Analysis

The proposed approach achieved good results on

the analyzed codebases with three codebases that

achieved over 90% accuracy with the highest one at

97%. The average accuracy for the analyzed set of

applications was 85% as seen in Table 2.

The layer which came from the determinis-

tic approach and was left unaltered by the non-

deterministic step, the View layer, has achieved a per-

fect precision score on all the analyzed codebase, in-

dicating that HyDe does not produce false positives

(it does not label as View components those that, con-

form the ground-truth, they belong to other layers).

In respects to recall, the same layer achieved lower

scores on some of the codebases, this was mainly

caused by the fact that those codebases used exter-

nal libraries for building their UI interfaces, and they

did not rely solely on the SDK for implementing those

features.

In the case of the other layers which were altered

by the deterministic step, the precision and recall were

heavily inﬂuenced by the way the codebase was struc-

tured, naming conventions, and coding standards.

The proposed method worked best for the Game

codebase which was a medium-sized application. For

larger codebase which had higher entropy due to their

dimensions and used external libraries the method did

not work as well.

For some of the smallest codebases, the method

achieved the worst results, this is because the non-

deterministic step did not have enough data to make

accurate assumptions as the codebases were rather

small.

From a View layer precision point of view, HyDe

achieves perfect results, all the detected View layer el-

ements are indeed views, there are no false positives.

In respect to the recall of the View layer, the results

are not perfect, there are some false negatives, this is

because the analyzed codebases use external libraries

for implementing certain UI elements and those li-

braries were not included in the analysis process.

In respect to performance, we have computed the

Mean Silhouette Coefﬁcient, the Davies-Bouldin In-

dex as well as the Homogeneity and Completeness

scores.

The Mean Silhouette Coefﬁcient had the best val-

ues in the case of medium-sized codebases where the

distinction between the 3 output layers was more pro-

nounced as seen in Table 3. In the case of the large

codebases, the accuracy was worse as we had many

types of components in each architectural layer. The

value for the smallest codebase was also poor, this is

because it had small number of components.

As seen in Table 3, the results for the Davies-

Bouldin Index were hand in hand with the ones for

the Mean Silhouette Coefﬁcient: the best performing

codedbases were the medium-sized ones, the largest

and smallest ones achieved worst results due to the

same reasons that affected Silhouette metric.

In case of Homogeneity and Completeness, Table

3, HyDe did achieve good results for the medium-

sized and small-sized codebase. The scores were bet-

ter for the codebases which had a naming convention

and coding standards in place.

5.2.3 RQ3: What Are the Downsides of using a

Hybrid Approach for Architectural Layers

and Components Detection?

The most important downside of using a hybrid ap-

proach is that based on the analyzed codebase and the

architecture it implements, the workﬂow might need

to be adjusted in two places:

• the rules for the deterministic part of the process;

• the feature selection for the non-deterministic

step.

In addition to those, an analysis would have to be

conducted on the output of the deterministic step to

identify the layers which should be feed to the non-

deterministic step.

Another downside would be that this method only

works if the deterministic step yields really good re-

sults for at least some of the layers, otherwise this part

of the process becomes irrelevant.

From a computational performance point of view,

the proposed approach is also heavier on the process-

ing part as a simple singular process as it is composed

of two separate steps; this also applies to the run time,

as this is increased due to the same reasons.

In respect to the results, our proposal’s main

downside is the fact that for the output clusters from

the non-deterministic step a manual analysis of those

might be needed to match a cluster to an architectural

layer if no heuristics can be found in the case of more

specialized or custom architectural patterns.

Our proposed approach remains automatic in the

case of more specialized software architectures, as it

does not need the ground truth of the codebase. The

feature extraction process needs to be enriched with

information regarding the particularities of the ana-

lyzed architecture, in order for the process to yield

good results.

6 THREATS TO VALIDITY

Once the experiments were run and we’ve analyzed

the entire process we have discovered the following

threats to validity:

ENASE 2021 - 16th International Conference on Evaluation of Novel Approaches to Software Engineering

Table 2: Results of the process in terms of detection quality.

Codebase Model precision Model recall View precision View recall Ctrl precision Ctr recall Accuracy

Firefox 0,97 0,77 1,00 1,00 0,47 0,91 82,71

Wikipedia 0,79 0,74 1,00 0,66 0,74 0,95 80,00

Trust 0,86 0,89 1,00 0,70 0,66 0,72 82,86

E-commerce 1,00 0,80 1,00 1,00 0,81 1,00 90,97

Game 0,95 1,00 1,00 1,00 1,00 0,92 97,22

Stock 0,80 0,77 1,00 0,59 0,78 0,93 79,09

Education 0,87 0,95 1,00 1,00 0,92 0,90 93,60

Demo 0,91 0,77 1,00 1,00 0,25 0,67 78,79

Table 3: Results in terms of cohesion and coupling.

Codebase Mean Silhouette Coef. Davies-Bouldin Index

Firefox 0,64 0,61

Wikipedia 0,67 0,51

Trust 0,68 0,52

E-commerce 0,63 0,58

Game 0,86 0,20

Stock 0,81 0,28

Education 0,80 0,29

Demo 0,52 1,55

Table 4: Homogeneity and Completeness on the analyzed

codebases.

Codebase Homogeneity score Completeness score

Firefox 0,60 0,63

Wikipedia 0,50 0,56

Trust 0,20 0,18

E-commerce 0,66 0,73

Game 0,74 0,79

Stock 0,40 0,50

Education 0,17 0,31

Demo 0,80 0,90

• Internal: the selection of features for the non-

deterministic step was found using a trial and er-

ror approach, there might be another set of fea-

tures that yield a better result. The selection of

features can be improved by measuring the en-

tropy of a feature for ﬁnding out its importance.

Another reason for concern is the fact that in this

approach we are only looking at the codebase, ig-

noring the external libraries used, and that can

lead to wrongly detected components. This can

be improved by also running the process on the

libraries used by the analyzed codebase. We’ve

also applied ML speciﬁc metrics to the results, in-

cluding the View layer which was an output of the

deterministic step. This might represent a threat to

validity in respect to the results for the Silhouette

Coefﬁcient and Davies-Bouldin Index.

• External: Our study has focused only on Swift

codebases on the iOS platform; other platforms

and programming languages might come with

their particularities which we have not encoun-

tered in the current environment. Furthermore,

this study focuses on MVC alone. In the case

of more specialized architectures, the rules from

the deterministic step might need to be adjusted

as well as the features selection involved in the

non-deterministic phase.

• Conclusions: The experiments were run on a

small number of applications that might have

some bias, more experiments should be conducted

to strengthen the results.

7 CONCLUSIONS AND FURTHER

WORK

With HyDe we have shown that a hybrid approach be-

tween a deterministic method and a non-deterministic

one can be successfully applied in the domain of soft-

ware architecture recognition with great results on

medium-sized applications. Not only that it works

well on mobile codebases, but this work can also be

applied to other platforms and types of applications

which use SDKs for building UI interfaces.

HyDe leads the way of automatically splitting

codebase components into architectural layers by

only analyzing syntactical aspects of the codebase. It

represents the major piece into a software architecture

checker system, that can highlight the architectural is-

sues early in the development phase. This proposal

combines successfully a deterministic approach that

can detect certain architectural layers with a high de-

gree of conﬁdence with a non-deterministic approach

that offers it ﬂexibility for analyzing more specialized

architectural patterns or even custom ones.

We believe that the accuracy of the system can be

greatly improved by also taking into consideration the

external libraries, especially for large projects which

use external libraries developed by the same teams

using the same coding standards and conventions, as

well as conﬁguring the heuristics and features with re-

spects to a certain type of mobile application (for in-

stance medium-sized applications that are clients for

other backend services). It will be also investigated

if the detection performance could be improved by

A Hybrid Approach to MVC Architectural Layers Analysis

considering some features that reﬂect the speciﬁcities

of the body components, as for the current approach,

HyDe only took into consideration the signature of the

methods, properties, and inherited types.

We plan to integrate HyDe into a software archi-

tecture workﬂow that could be used by developers as

well as managers to have a detailed status of the ar-

chitectural health of a mobile codebase. In additions,

HyDe system would also be valuable to students or

beginners as it could provide insightful information

on how the code should be structured and help them to

respect architectural guidelines in real-world projects.

REFERENCES

Anquetil, N. and Lethbridge, T. C. (1999). Recovering soft-

ware architecture from the names of source ﬁles. Jour-

nal of Software Maintenance: Research and Practice,

11(3):201–221.

Apple (2012). Model-view-controller. https://apple.co/

3a5Aox9link.

Apple (2019). Placing objects and handling 3d interaction.

https://apple.co/3tJw8v2link.

Corazza, A., Di Martino, S., Maggio, V., and Scanniello,

G. (2016). Weighing lexical information for software

clustering in the context of architecture recovery. Em-

pirical Software Engineering, 21(1):72–103.

Daoudi, A., ElBoussaidi, G., Moha, N., and Kpodjedo, S.

(2019). An exploratory study of mvc-based architec-

tural patterns in android apps. In Proceedings of the

34th ACM/SIGAPP Symposium on Applied Comput-

ing, pages 1711–1720. ACM.

Davies, D. L. and Bouldin, D. W. (1979). A cluster separa-

tion measure. IEEE transactions on pattern analysis

and machine intelligence, (2):224–227.

DeLong, D. (2017). A better MVC.

https://davedelong.com/blog/2017/11/06/

a-better-mvc-part-1-the-problems/link.

Dobrean, D. (2019). Automatic examining of software ar-

chitectures on mobile applications codebases. In 2019

IEEE International Conference on Software Mainte-

nance and Evolution (ICSME), pages 595–599. IEEE.

Dobrean, D. and Dios¸an, L. (2019a). An analysis system

for mobile applications MVC software architectures.

pages 178–185. INSTICC, SciTePress.

Dobrean, D. and Dios¸an, L. (2019b). Model View Con-

troller in ios mobile applications development. pages

547–552. KSI Research Inc. and Knowledge Systems

Institute Graduate School.

Dobrean, D. and Dios¸an, L. (2020). Detecting model view

controller architectural layers using clustering in mo-

bile codebases. pages 196–203. INSTICC.

Galster, M. and Angelov, S. (2016). What makes teaching

software architecture difﬁcult? In 2016 IEEE/ACM

38th International Conference on Software Engineer-

ing Companion (ICSE-C), pages 356–359. IEEE.

Garcia, J., Popescu, D., Mattmann, C., Medvidovic, N., and

Cai, Y. (2011). Enhancing architectural recovery us-

ing concerns. In 2011 26th IEEE/ACM International

Conference on Automated Software Engineering (ASE

2011), pages 552–555. IEEE.

Huang, A. (2008). Similarity measures for text document

clustering. In Proceedings of the sixth new zealand

computer science research student conference (NZC-

SRSC2008), Christchurch, New Zealand, volume 4,

pages 9–56.

Levenshtein, V. I. (1966). Binary codes capable of cor-

recting deletions, insertions, and reversals. In Soviet

physics doklady, volume 10, pages 707–710.

Li, Z. (2020). Using public and free platform-as-a-service

(paas) based lightweight projects for software archi-

tecture education. In Proceedings of the ACM/IEEE

42nd International Conference on Software Engineer-

ing: Software Engineering Education and Training,

pages 1–11.

Lutellier, T., Chollak, D., Garcia, J., Tan, L., Rayside, D.,

Medvidovic, N., and Kroeger, R. (2015). Comparing

software architecture recovery techniques using accu-

rate dependencies. In Software Engineering (ICSE),

2015 IEEE/ACM 37th IEEE Int. Conf. on, volume 2,

pages 69–78. IEEE.

Mancoridis, S., Mitchell, B. S., Chen, Y., and Gansner, E. R.

(1999). Bunch: A clustering tool for the recovery and

maintenance of software system structures. In Pro-

ceedings IEEE International Conference on Software

Maintenance-1999 (ICSM’99).’Software Maintenance

for Business Change’(Cat. No. 99CB36360), pages

50–59. IEEE.

Mitchell, B. S. and Mancoridis, S. (2008). On the evaluation

of the bunch search-based software modularization al-

gorithm. Soft Computing, 12(1):77–93.

Mozilla (2018). Firefox iOS application. https://github.

com/mozilla-mobile/ﬁrefox-ioslink.

Murtagh, F. (1983). A survey of recent advances in hierar-

chical clustering algorithms. The Computer Journal,

26(4):354–359.

Rathee, A. and Chhabra, J. K. (2017). Software remodular-

ization by estimating structural and conceptual rela-

tions among classes and using hierarchical clustering.

In International Conference on Advanced Informatics

for Computing Research, pages 94–106. Springer.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to

the interpretation and validation of cluster analysis.

Journal of computational and applied mathematics,

20:53–65.

Trust (2018). Trust wallet iOS application. https://github.

com/TrustWallet/trust-wallet-ioslink.

Vewer, D. (2019). 2019 raport. https://iosdevsurvey.com/

2019/link.

Wikimedia (2018). Wikipedia ios application. https://

github.com/wikimedia/wikipedia-ios/tree/masterlink.

ENASE 2021 - 16th International Conference on Evaluation of Novel Approaches to Software Engineering