Conceptual Process Models and Quantitative Analysis of Classification
Problems in Scrum Software Development Practices
Leon Helwerda¹,², Frank Niessink² and Fons J. Verbeek¹
¹Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
²Stichting ICTU, The Hague, The Netherlands
Keywords:
Agile, Classification, Conceptual Frameworks, Prediction, Scrum, Software Development.
Abstract:
We propose a novel classification method that integrates into existing agile software development practices by
collecting data records generated by software and tools used in the development process. We extract features
from the collected data and create visualizations that provide insights, and feed the data into a prediction
framework consisting of a deep neural network. The features and results are validated against conceptual
frameworks that model the development methodologies as similar processes in other contexts. Initial results
show that the visualization and prediction techniques provide promising outcomes that may help development
teams and management gain better understanding of past events and future risks.
1 INTRODUCTION
Software development organizations have to take
many factors into account in order to stay dynamic
and innovative. The people who produce a deliverable product for an actively participating client
must have a diverse set of skills and knowledge about
their development platform and associated topics, in
order to collaborate with their peers and stakeholders.
We study the effectiveness of different practices
within software development processes. Specifically,
we investigate the use of the Scrum software devel-
opment method, and observe the effects of various
events and actions during the development process
upon the outcome of the process as well as the suc-
cessful release of the product. Moreover, we take
other development aids, such as software quality as-
sessment tools and continuous integration pipelines,
into account in this research.
The research takes place from multiple view-
points: we apply the principles from theoretical soft-
ware engineering, delve into the practical aspects by
following the actions made during a sprint, combine
our experiences with relevant work and conceptual
models from other fields, and apply machine learning
on features that are extracted according to the models
and definitions we have formed.
We specifically focus on the practice of the Scrum
software development process as it is applied at a
government-owned, non-profit organization based in
the Netherlands. This organization develops and
maintains specialized software for other governmen-
tal entities, and keeps close liaison contact with these
offices. In this paper, we set out to investigate how
Scrum manifests itself in this organization, what other
social and technical practices are involved, and how
these may be used as indicators that point toward the
success of the process and the end result, as detailed
in the research questions in Section 3.2.
The remainder of our paper has the following
structure: Section 2 presents our theoretical ground-
work as well as points toward related practical studies.
Section 3 provides insight into the problem statement
and theoretical backgrounds, and Section 4 shows the
analytical approach of finding solutions to some of
the problems. Section 5 discusses the solutions and
Section 6 concludes our findings thus far.
2 BACKGROUND
In this section, we introduce the foundations of the
Scrum framework which provides us with a model of
the interactions between the people, the code and the
support tools. This helps us understand what certain
properties in the collected data mean and how we can
apply them in other models, such as the conceptual
frameworks in Section 3.3. We show existing work
which is relevant to this approach in Section 2.2.
2.1 Concepts
Scrum is a lightweight framework which describes
a software development process. A self-organizing
software development team works in sprint iterations
of about two weeks to deliver increments of work-
ing software to the client. The client provides feed-
back on new features during a post-sprint review, and
prioritizes desired items on a product backlog. The
Scrum team commits itself as a whole to develop a certain number of the top items during the sprint, and in an optimal situation no stories are added or removed while the sprint is in progress.
The Scrum process is meant to have a flexible implementation, such as the definition of when a story is ‘done’. This definition can range from implementa-
tion to (automated) testing, documentation and client
acceptance. Rules can be added and removed within
the framework when the team agrees to do so dur-
ing a retrospective, where team members discuss prior
events and determine what practical problems they
need to overcome in the future.
Figure 1: Workflow of a sprint in the Scrum framework. (The Product Owner maintains a Product Backlog with stories; (pre-)refinement and sprint planning yield a Sprint Backlog; the Development Team works through a sprint of 2-3 weeks with Daily Scrums, producing a Potentially Shippable Product Increment, followed by the Sprint Review and Retrospective.)
Other events surrounding a Scrum sprint, outlined
in Figure 1, are the pre-refinement, where stories are
developed to become ready for selection in a sprint,
the refinement where the stories are picked, and the
pre-sprint planning, where the stories are outlined
once more. Every workday, the team holds a Daily Scrum stand-up meeting to discuss the status of the stories; team members share what they have done thus far, what they plan to do in the remainder of the sprint, and whether there are any (foreseeable) problems.
Scrum is an agile software development method,
which means that it adheres to principles that are set
out in the Agile Manifesto (Agile Alliance, 2001).
The manifesto assigns an ordering of value between
pairs of software development aspects, e.g., favoring
individuals and interactions over processes and tools.
Even though our research makes use of such systems
to collect data points, we do so to provide the team
with recommendations based on data regarding their
work (Highsmith, 2002). Potential conflicts resulting from putting the principles of the Agile Manifesto into practice are resolved by ensuring that ample attention is paid to higher-valued goals (Cockburn, 2007).
With regard to the individuals and their inter-
actions, the Scrum framework defines a number of
groups and roles. The shape of the organization is outside the scope of Scrum, as it may include managers,
technical leads, coaches and support teams. The main
Scrum roles are as follows:
Client: The organization that has procured the development of the product. The client may
be the end-user or the software maintainer. The
client expects the product to be delivered accord-
ing to their requirements. An actively involved
client provides regular feedback to the development
team and other stakeholders such that potentially
changing wishes are known.
Software development team: A group of people
that work together on a product or component.
The development team shares a work ethos which
drives them to not only successfully release their
product in the end, but also improve their working
method and the product quality.
Scrum master: A role, which might rotate between team members, that ensures that any impediments or other problems are taken care of rapidly, and verifies that the team commits to the same goals.
Product owner (PO): A middle-man between the
team and the client, who handles the bidirectional
communication surrounding a Scrum sprint. The
PO assesses requirements from the client and
molds them into stories, with the help of the
team. Additionally, the PO organizes meetings
and demonstrations between the stakeholders.
2.2 Related Work
One of the longest-outstanding questions in the field of software development is whether the use of so-
called methodologies yields a better product that is
delivered earlier than in the absence of them, and how
we can compare the different practices (Wynekoop
and Russo, 1995). Each in vivo study appears to dif-
fer in its scientific rigorousness (Dybå and Dingsøyr,
2008) and the topic of interest within the study. Meta-
analyses of the related topic of software fault predic-
tion using machine learning show that bias is a strong
factor in the obtained results of such classifiers (Shep-
perd et al., 2014).
A large number of studies deal with distributed
software development projects which use agile pro-
gramming or management solutions. While these
may provide relevant results (Paasivaara et al., 2009),
their use in on-site collaboration teams may be lim-
ited. Studies show the (successful) application of
Scrum in small teams (Rising and Janoff, 2000) and
in teams that have a requirement of communicating
with other teams as well as external stakeholders on
a frequent and documented basis (Pikkarainen et al.,
2008).
We distinguish the earlier case studies into two
segments: qualitative and quantitative. The qualita-
tive studies assess the application of Scrum or an-
other agile development practice through means of
interviews, developer experiences, and scoring sys-
tems. The empirical methods used this way still help lay down new foundations for practices and anti-patterns in Scrum (Eloranta et al., 2016) and shed light on new relevant factors (Lee, 2012), providing knowl-
edge models for others to build upon.
Recent quantitative research covers topics includ-
ing agile software development processes, or more
specifically Scrum practices. The analysis of data
from different sources is often combined with frame-
works and practices that have proven themselves in
other fields, such as multi-criteria optimization mod-
els (Almeida et al., 2011). There is an analysis of
the effectiveness of Scrum and Kanban on project re-
sources management (Lei et al., 2017), and an ethnographic case study on overtime and customer satisfaction after the introduction of Scrum in an organization (Mann and Maurer, 2005).
3 DESIGN
We formulate our goals and propose our research
questions related to the quantitative validation of soft-
ware development processes in this section.
3.1 Goals
Different types of goals exist in the context of an anal-
ysis of software development processes. We cate-
gorize these goals by level of detail, focus area and
stakeholder interest. For the benefit of the software
development organization, a corporate industrial goal
would be to reduce development and maintenance
costs. We study various factors that influence the re-
quired effort and sprint success, i.e., whether the esti-
mated effort is realized in time.
Tactical goals are usually high-level, with a focus
on the process itself. For example, we wish to im-
prove the software development process by means of
novel standards and best practices. A research goal is
then to recommend new norms based on analysis and
to verify that these norms boost the progress.
At a more detailed level, we have goals that
strengthen the measurable nature of the process. The
software development organization management may
only have a need for a single indicator of success, but
some stakeholders prefer insight into the underlying
factors. In a research context, we have measurable
domains (projects, teams, deliverable artifacts, and so
on) and we apply specific measurements to them.
Table 1: The goal, question, metric framework for Scrum software development research.

FIELD            VALUE
Object of study  Scrum board, issue tracker, version control
Purpose          Visualization, prediction, recommendation
Quality focus    Scrum sprint progress, code quality metrics, collaboration
Point of view    Team leaders, team members, management
Environment      Scrum software development organization
We summarize the purposes and context of our
goals in Table 1. From this summary, we build pre-
diction models that reduce bias toward individual do-
main samples, and may be generalized, applied and
inspected more broadly. We extract features from ar-
tifacts and records originating from the software de-
velopment process in order to better understand it and
provide recommendations for stakeholders. We pro-
vide a systematic mapping from conceptual frame-
works to the data set of features.
3.2 Research Questions
We wish to find out how we can significantly improve the software quality of products developed at software development organizations. We consider the use of
various kinds of analysis tools that accept collections
of measurable events as input. These events occur
during the development process; they may be based
upon attributes of a Scrum event, changes in the issue
tracker or code, or signals of changes in the quality of
the deliverable product.
From this research question, we can deduce sev-
eral subquestions which form the basis of our re-
search. Are we able to objectively determine best
practices or other quality norms by means of analysis
of data logs detailing the software development pro-
cess? We look for indicators that point toward a suc-
cessful or unsuccessful sprint period within Scrum.
We take into account the viewpoints of involved
stakeholders as semi-quantitative indicators.
Through this scientific analysis of process data,
we may be able to deduce new, predictive norms or
recommendations for software development projects.
This requires research into feature extraction and
model definition and validation, to support predic-
tion of success or failure of a current Scrum sprint
period. We make use of information about earlier
sprints, such that we can predict the probable outcome
before the sprint in question has started.
Finally, in what respects, to what extent, and using which kinds of measures can the effectiveness of novel software engineering methodologies be determined scientifically? After model validation, we will
apply the prediction to ongoing projects and deter-
mine the effects of recommendations on the devel-
opment process and its success. The recommenda-
tion model must integrate into the current software de-
velopment practices, for example by augmenting ex-
isting systems for quality reporting, project manage-
ment, logistics and human resources. Such an experi-
mental setup requires thorough verification and com-
parison with projects that lack this setup.
3.3 Conceptual Frameworks
We describe a Scrum sprint in terms of models which we will use to perform model validation. We present three
models that relate to the linear model of a Scrum
sprint, namely a factory process, a symbiotic learning
network, and a predator-prey system.
In the factory model, we start at some predetermined state with a concept for something a user may want to be able to do with the product; a release containing it is the eventual outcome. This leads to a use
case which can be expanded into a story. The story
may undergo multiple phases in which it is further
detailed in terms of design and scope, after which the
story is reviewed. The review determines whether the
story is ready to develop into an implemented fea-
ture. This step employs programming of source code
to handle the use cases. Again, this step can be re-
viewed to ensure code quality and agreement within
the team about how the code is supposed to function.
Aside from manual inspection, a test process allows the team to check if the implementation conforms to their expectations through the use of verification models (with a technical equivalent of automated regression tests and similar benchmarks).
A special twist of the Scrum factory is that the
client may be involved in the quality acceptance of
the product before it is released to them. This may
materialize in the form of acceptance testing in a test
environment, witness testing, or a demo near the end
of the sprint. This external testing process brings the
story closer to production. In the end, the stories
that are considered to be ‘done’ are released in a po-
tentially shippable increment. Again, this is slightly
different from conventional product launch strategies,
since not all desired functionality may have made it
into the increment, but those that did are working as
expected.
There are indicative moments at each step in this
process: before the entire process starts, in between
the subprocesses, and at the end of the production
line. These moments are shown in Figure 2 and may
occur during the Scrum sprint or before or after it in
the case of designing and reviewing the stories. At
any moment, we may determine how many of these
stories are at the current step as well as how many are
waiting to be pulled into the next step after a subtask
is done. Thus we have separate backlogs for stories
at any point of the development phase, not just before
they are pulled into a sprint.
Ideally, the factory pipeline is a one-way conveyor
belt with a stable speed such that the backlogs re-
main small and manageable. However, one additional
complication is that stories may be pulled back into
an earlier state, for example when review or testing
uncovers problems that require redesigning, fixes in
code, or other changes in an earlier process. Similarly
to the intermediate backlogs, the volume of such set-
backs should be limited. The practice of adding these
backward flows into the model yields a value stream
map, which stems from the Lean software develop-
ment principles (Abdulmalek and Rajgopal, 2007).
In another context, the Scrum sprint can be seen as a symbiotic environment that encourages stakeholders to learn from past mistakes, such that known problems can be prevented in the future.
One can define a time range, such as from the start of a sprint until its end, in which the team performs actions that may improve the product and themselves. At the start of this range, we have a number of
artifacts, such as code, components in the system ar-
chitecture, stories in the sprint and in the backlog, and
(reported) bugs. All of these artifacts may have some measurable indication of their quality: is the code readable, are the stories detailed enough (but not too implementation-specific), etc.
At the end of the sprint, these artifacts have the
same properties, but upon measuring them they may
have improved. We can detect if the solutions were
implemented in the code in such a way that it is
reusable for later features and is future-proof against
unknown bugs or regressions elsewhere in the code.
This includes checks for code duplications or other
code smells within or between components. The
structure of the architecture may improve, which is
more than just aligning it with the initial design con-
cept. Problems that were encountered with certain
stories should be used as a learning moment to ensure
the use case is clear enough before work commences, and to lead to fewer bugs in the future.

Figure 2: Factory model of the Scrum process, similar to value stream maps from Lean. (Pipeline stages: Design Story, Review, Code, Code Review, Test, Witness Review or Acceptance Test, Ship; intermediate artifacts between stages: use cases, stories, ready/approved stories, developed features, ready features, tested features, done stories, and the Potentially Shippable Product Increment.)
As a final model, the actions of software develop-
ers that complete work on a story, find and fix bugs, or
create unit tests can be seen as a predator that intends
to minimize the population size of a prey (Arcuri and
Yao, 2008). Every time the developers get more work
done within a sprint, their ‘prey’ should subsequently
subside. However, if the quality and quantity of the
actions are lower than expected, then the number of
prey grows again due to the rise of bugs and undesir-
able features.
We define the predator size as the amount of work that the team achieves, i.e., the velocity of the team. We map the prey to the volume of product backlog items that need to be acted upon, such as stories and
bugs. This makes the two population sizes more ab-
stract than in the biological process. The main similarities are that the two volumes are inversely related to each other, and the assumption that there is enough ‘food’ for the prey to live from, namely the influx of ideas to improve the product and the code in the product itself, which may hold as yet unknown bugs. Finally, we assume that the predator is geared toward solving these problems as the collective goal.
The powerful dynamics of predator-prey systems
have been studied in depth. In general, the predator
works best with a large population of prey (a defini-
tion which can additionally take into account the well-
orderedness of the backlog and clarity of the stories).
The predator often decreases the size of the prey to
an extent that it is almost extinct. This reduces the
work output of the predator, leading to a resurgence of the prey (stories and bugs). There are, however, stable versions of the predator-prey system, where neither of the two species changes its size based on the other, or they slightly oscillate around two mean points.
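As a minimal mathematical sketch of this analogy (our own illustration rather than an established Scrum formulation), the classical Lotka-Volterra equations describe such dynamics, with prey volume $x$ (backlog items) and predator size $y$ (team velocity):

\[
\frac{dx}{dt} = \alpha x - \beta x y, \qquad
\frac{dy}{dt} = \delta x y - \gamma y,
\]

where $\alpha$ models the influx of new stories and bugs, $\beta$ the rate at which team activity resolves them, $\gamma$ the decay of velocity when little work remains, and $\delta$ the growth of velocity given an ample backlog. The stable versions mentioned above correspond to the fixed point $x = \gamma/\delta$, $y = \alpha/\beta$, around which solutions oscillate.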
What we learn from these observations is that
software development processes work best when the
backlog size is large enough. More importantly, the system becomes stable when each cycle does not yield drastic changes to either volume. Thus, a stable
influx of (new) stories, as well as a stable velocity of
work done per time unit, are factors in the process that
help ensure that the project can continue onward. The
predator-prey system obviously does not include all
aspects of the development process, but it provides a
mathematical concept of the major relevant properties
of the Scrum cycle-based framework: input, changes,
and output of story units as well as the velocity of
the team itself. Responding to changes in the backlog
volume and scope allows the predator team to keep
the prey volume of issues and tasks at a manageable
level.
4 ANALYSIS
We collect data from distributed version control sys-
tems, issue trackers and other tools used by the
projects. This is a completely automated process that
works via a pipeline where data flows one way. After
the collection and processing steps, the data is stored
in a database. The pipeline takes into account the lat-
est state of the collection process such that only up-
dated data is retrieved. This way, we can perform
frequent analysis using the persistent database, for
example feature extraction as demonstrated in Sec-
tion 4.4. We do this every time a new sprint might
start, e.g., weekly, and predict the outcome of new
sprints as soon as possible.
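A minimal sketch of this incremental behavior follows; the pipeline tracks a per-source timestamp so that each run only retrieves updated data. The state file name and the fetch callables are hypothetical placeholders, not the actual gatherer code.

```python
import json
from datetime import datetime, timezone

STATE_FILE = "collection_state.json"  # hypothetical bookkeeping file

def load_state() -> dict:
    """Per-source timestamps recorded by the previous collection run."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def collect_incremental(sources: dict) -> None:
    """Fetch only the records changed since the previous run for each source."""
    state = load_state()
    for name, fetch_changes in sources.items():
        since = state.get(name, "1970-01-01T00:00:00+00:00")
        records = fetch_changes(since)        # source-specific API call
        with open(f"{name}.json", "w") as f:  # intermediate JSON for import
            json.dump(records, f)
        state[name] = datetime.now(timezone.utc).isoformat()
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
```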
4.1 Data Sources
Figure 3: Pipeline of the collection of data from projects and their purposes after feature extraction. (Define: source code repository, issue tracker, quality metrics; Gather: Python collection into JSON; Import: Java into a MonetDB database; Extract: R/SQL feeding a D3.js visualization and TensorFlow prediction.)

Each project has its own set of instances of tools used during software development, such as version control systems (VCS), quality reporting tools, build automation, documentation wikis and project management
systems. A project has an associated issue tracker
board, which in our case is JIRA. This software pro-
vides additional functionality for Scrum boards with
a backlog and sprint tracking. Projects use a VCS like
Git or Subversion. In the case of Git, several reposi-
tory managers with review tools are in use, in partic-
ular GitLab, GitHub and Team Foundation Server.
Quality control is achieved using SonarQube with
a diverse set of profiles. The results of a Sonar-
Qube check are made available to a quality dashboard
which holds current and previous values of metrics
based on code quality and other sources. A metric
may have details available at the source in question.
Because some of these sources are only available
to the team itself for security considerations, we make
use of Docker-based automated services that are de-
ployed in the development environment of the team.
These ‘agents’ register themselves at a central server,
regularly collect fresh data and send the data to the
server. Additionally, the agents perform health checks
to warn if there are problems with the environment.
We process the data and where possible, automat-
ically create relationships between data sources, such
as matching a code commit with the sprint or issue
it relates to. Next, user accounts in the JIRA issue
tracker and the commit authors in VCS repositories
are linked, with a hand-made filter when automatic
matches are insufficient. Finally, the data is imported
into a MonetDB database as shown in Figure 3.
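A minimal sketch of such automatic relationships, assuming JIRA-style issue keys in commit messages and sprints stored as date ranges (both the key format and the record layout are our assumptions):

```python
import re

# JIRA issue keys look like "PROJ-123"; the exact format is an assumption.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def issues_in_message(message: str) -> list:
    """Extract the issue keys that a commit message references, if any."""
    return ISSUE_KEY.findall(message or "")

def sprint_of_commit(commit_date, sprints: list):
    """Match a commit to the sprint whose date range contains its date."""
    for sprint in sprints:  # each sprint: {"id": ..., "start": ..., "end": ...}
        if sprint["start"] <= commit_date <= sprint["end"]:
            return sprint["id"]
    return None
```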
4.2 Threats to Validity
During our initial research, we validate the data col-
lected so far against other sources and findings during
a Scrum sprint. For example, we compare the actions
taken by team members during daily stand-up meet-
ings and find that many administrative actions in the
issue tracker take place around such meetings. We
also find this by comparing certain actions, such as
rank changes and story point changes, with meeting
reservation data from a self-service desk system.
This means that we cannot assume that an action, such as closing a story, actually took place at the moment the task was finished or a decision was made. Detailed addi-
tions to tasks are often done during lunch breaks or
near the end of the day, for some teams with up to
four times as many changes in such hours compared
to other moments during the sprint. This makes it
more difficult to connect changes to issues with code
changes made during the day, but does not immedi-
ately affect our method when aggregating data over
entire sprints. Knowledge about the existence of these
patterns may in fact help find other anomalies.
Sprints are administratively closed in the issue
tracker as well. By default, the end date is a projec-
tion from the start date, so if nothing is done the sprint
is closed automatically a few weeks later. For sprints
whose stories are done in time, a date of completion
is known, but it may suffer the same consequences of
(delayed) administrative actions. We may use it as a
middle ground in some cases, such as when sprints
seem to overlap or have dubious dates.
Teams use the functionality provided by the issue
tracker in different ways. Due to various definitions of
‘done’, inherent to the Scrum framework, an issue sta-
tus may have several meanings. As another example,
an impediment may indicate that the team is waiting
for feedback from the client, not that the team has a
problem that must be fixed by the Scrum master.
In other data sources, we may have problems with
missing data, such as when a quality metric source is
misconfigured. Version control systems allow team
members to describe their changes in short commit
messages. Quite often, developers do not make use of this, or they use an integrated development environment which fills in the latest message automatically. It is
considered good practice to mention the issue that
the commit relates to, but this only happens in up to
14% of all commits in our data set. About 6% of all
commits are merges, which is relatively low consider-
ing that in distributed development, features are often
implemented on a branch, tested and merged later on.
We intend to generalize our approach, and build
a feature extraction model where we create reusable
definitions of properties related to the Scrum process
whose realizations take into account the unexpected
patterns that exist in the data. Additionally, we decide whether to improve our coverage of certain properties across all fields, to avoid using a field directly for some feature, or to assume that we can interpolate or leave out a metric or event.
4.3 Reporting
We report our findings back to involved stakeholders,
including team members and management, through
various communication channels. We take into ac-
count that a bare number or classification for a sprint does not provide sufficient context. Many people wish to know how the report came to be and what else can be deduced from the data. For this reason, we provide as many details as possible from the steps that we take in the feature extraction and prediction process.
Aside from the prediction results, we separately
make all features available in a timeline visualization
which displays and compares Scrum sprints from dif-
ferent teams. The timeline includes significant events
that took place in each sprint. Additional visualiza-
tions of the collected data come in the form of a burn
down chart, a leaderboard with project statistics, a
calendar showing code commit volumes per day us-
ing a heat map and external data such as daily weather
temperatures, and a network graph showing collabo-
rations between team members on different projects
with time-lapse capabilities.
We hold a system usability scale (SUS) questionnaire, reachable from the visualization interface, which yields 17 responses. The respondents have various roles in the organization. We found
that none of the respondents disagreed with the state-
ment that the visualizations were well integrated, and
the general agreement is that the visualizations are
easy to use (only the timeline has two disagreements).
Most of the respondents are not yet inclined to use the
visualizations frequently. Comments seem to indicate
that this is due to the fact that the data shown does not
directly impact their current work progress.
The classifications for a current sprint are shown
on a distinct page, including a risk assessment as well
as metrics that indicate the performance of the predic-
tion algorithm and its configuration. All data is shared
with other tools, including a quality reporting tool that
is well-used by the teams.
Intermediate results are not only shared electron-
ically but also presented during various meetings,
which immediately provide the possibility for atten-
dees to provide comments and questions. Similar to a
Scrum review, we attempt to display an early version
of a visualization such that we can update it based on
feedback from these meetings.
4.4 Feature Extraction
In order to create a dataset of numerical features that describe certain properties of the Scrum sprints that have taken place, we perform feature extraction on the
collected record data. We use a combination of SQL
statements and R programs to aggregate the data. The
SQL statements may contain variables that define cer-
tain common properties, filters and formulas, such as
the actual end date of a sprint, types of issues related
to stories, or the calculation of the velocity in a sprint,
based on the number of story points divided by the
number of working days in the sprint.
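For instance, the velocity formula above can be sketched in Python as follows (the function name and date handling are ours; the actual implementation lives in parameterized SQL):

```python
import numpy as np

def sprint_velocity(points_done: float, start: str, end: str) -> float:
    """Velocity: story points completed per working day of the sprint.

    `start` and `end` are ISO dates, e.g. "2017-05-01"; np.busday_count
    counts the weekdays in the half-open interval [start, end).
    """
    workdays = np.busday_count(start, end)
    return points_done / workdays if workdays > 0 else 0.0
```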
This way, we define features of sprints in a generic
manner, taking into account inconsistencies in the
source data as mentioned in Section 4.2:
1. Sprint:
Sequential order of the sprint in the project
lifespan.
Number of weekdays during the sprint.
2. Team size:
Number of people that made a change in the
code, or on the issue tracker, during the sprint.
Number of sprints that each developer has made a change in before the sprint.
Number of new developers in the team that
have not made a change before.
3. Issue tracker:
Mean number of watchers or people making a
change on an issue.
Mean number of story points that are ‘done’.
Mean number of labels provided to an issue.
Number of impediments.
Number of changes to the order of stories on
the backlog, or the number of points, before or
after the sprint has started.
Number of stories that are not closed as ‘done’.
Number of workdays from the start of the sprint until the pivot day around which the most changes are made.
Velocity, both for the sprint as well as the aver-
age over three sprints prior.
Number of issues that are closed, except stories.
Number of concurrent stories, and the average
number of days that the stories are in progress.
4. Code version control:
Number of commits.
Average number of additions, deletions, total
difference size, number of files affected.
5. Metrics:
The overall sentiment of the team about the
sprint as indicated during the retrospective.
Number of metrics that are shown in the quality
dashboard, and the number of metrics that are
underperforming or not available.
Any of these features may take on the role of a la-
bel, indicating a single outcome of a sprint to be pre-
dicted from the remaining features. The label may be
converted to binary classifications. The features are
rescaled such that training models are not influenced
by unrelated scales.
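A minimal sketch of such rescaling (min-max per feature column is one choice; the exact scaler here is an assumption):

```python
import numpy as np

def rescale(features: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1] so magnitudes become comparable."""
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # constant columns map to zero
    return (features - lo) / span
```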
Because the eventual value of a feature is un-
known while a sprint is in progress, we instead pre-
dict the label for this sprint using features from ear-
lier sprints. We create such a dataset by rolling all
features to the later sprint of the same project. This
loses the features of the latest ‘active’ sprint, as well
as a complete sample of the first sprint of the project.
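A minimal pandas sketch of this roll operation (the column names are illustrative):

```python
import pandas as pd

def roll_features(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Pair each sprint's label with the features of the preceding sprint.

    `df` is sorted by sprint order and contains a `project` column. The
    first sprint of every project has no predecessor and is dropped; the
    features of the latest sprint only serve a future, unseen sprint.
    """
    feature_cols = [c for c in df.columns if c not in ("project", label)]
    shifted = df.groupby("project")[feature_cols].shift(1)
    rolled = shifted.join(df[["project", label]])
    return rolled.dropna(subset=feature_cols)
```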
For example, we may have 15 projects of differ-
ent lifespans, with a total of 530 sprints. After the
roll operation, we remove the label of the first sprint of each project and stow away the latest sprints as our
prediction target or validation set, leaving us with
500 sprints in the main dataset. Table 2 shows the
actual dimensions and other properties of our data.
Table 2: Dimensions and related properties of the database.

PROPERTY           VALUE
Projects           15
Issues             60158
Stories            5369
Changes per issue  8.5
Code repositories  196
Code changes       140357
Metric values      71806613
Sprints            531
We then split up the dataset into training and test
sets, using stratified cross-validation to avoid biased
sets. We also calculate the distribution of labels across the sets and the accuracy when we take the label of the previous sprint as the new label, to better under-
stand the data and to improve the prediction algo-
rithm. The project identifier is never passed to the
model or training algorithm to generalize its use for
all teams; the label distribution may optionally be
used to rebalance the training set.
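A minimal sketch of this split and of the previous-label baseline (scikit-learn's StratifiedKFold is one way to realize the stratification; the data layout is illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_sets(X: np.ndarray, y: np.ndarray, k: int = 5):
    """Yield train/test index pairs that preserve the label distribution."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    yield from skf.split(X, y)

def previous_label_accuracy(labels_per_project) -> float:
    """Baseline: predict each sprint's label as that of the previous sprint."""
    hits = total = 0
    for labels in labels_per_project:  # chronological labels of one project
        for prev, cur in zip(labels, labels[1:]):
            hits += int(prev == cur)
            total += 1
    return hits / total if total else float("nan")
```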
5 VALIDATION
From our thoughts on conceptual frameworks in Sec-
tion 3.3, we deduce certain properties which appear
to be relevant in both a Scrum process and in similar
processes. One point is that there must be some added
value after a period in which the most relevant actions
take place. For Scrum, this means that there must be
some (predetermined) number of story points reached
at the end of the sprint. Certainly, when value is not
realized within this period, it may need to be done in a
later sprint, which is not helpful for throughput of pri-
oritized stories. Thus, if there are stories that are not
done or closed as unfixable at the end of the sprint,
then this indicates a problem.
5.1 Preliminary Results
During our initial research into the quality of the
collected data, we create an inventory of the possi-
ble applications of the data, through discussions with
developers, Scrum coaches, management, and sup-
port team members. We specifically select questions
which can be answered efficiently with the database,
and additionally note whether they lead to unexpected results. Thus we validate the quantitative
data against human expectations regarding the Scrum
process. This allowed us to find some peculiarities in
the data, such as the length of the sprint which is often
predetermined due to a projected end date, or changes
made to priorities or story points at unlikely moments.
One of these questions relates to an often-stated
guideline with regard to the size of a story: If the
story is considered to be large, then it is better to
split it up into multiple smaller stories. We won-
dered whether a story which is awarded many points during the refinement (cf. Section 2.1) is more likely to end up being ‘not done’ than a story with few points. Story points may not be entirely com-
parable across teams, or even across periods of time.
Story points are awarded according to the Fibonacci
scale. Therefore, we acquire a logarithmic normaliza-
tion factor of the largest story of each sprint. In Fig-
ure 4(a) we aggregate stories with the same points and
demonstrate the ratio of not-done stories with those
points. The numbers above each bar indicate the sto-
ries that are ‘not done’, and the total number of sto-
ries with the same amount of points is shown in the
bar. Figure 4(b) shows aggregated ratios after log-normalization. The distinct trend shows an increased likelihood that a story with a higher number of points is not finished. This pattern remains visible
when taking subsets of projects, and indicates that we
are able to answer these questions efficiently.
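As a minimal sketch of such a log-normalization (the precise formula below is illustrative; only its logarithmic nature and the use of the largest story per sprint follow from the text):

```python
import math

def normalize_points(points: float, sprint_max: float) -> float:
    """Log-normalize a story's points against the largest story of its sprint.

    Both values are mapped onto a log scale so that Fibonacci-style
    estimates become comparable across teams and periods of time.
    """
    return math.log(1 + points) / math.log(1 + sprint_max)
```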
Figure 4: Results showing a summary of the story points and the ratio of being ‘not done’ over all stories with the same points. (Panel (a): story points from 0.5 to 13 against the ratio of ‘not done’ stories; panel (b): the same ratios after log-normalization of the story points.)
5.2 Prediction
We identify sprints with a high risk of unsatisfactory
results by training a machine learning algorithm on a
dataset with 23 features and one label, which we in-
troduce in Section 4.4. Even though many indicators
of successful sprints are feasible, we use a single met-
ric in this model for simplicity. We consider a sprint
to be successful if and only if all the stories involved
in the sprint are closed at the end of the sprint, with no
deferrals. We convert the feature providing the num-
ber of ‘not done’ stories to this binary label. The class
distribution is highly biased toward sprints that have
no unfinished stories, making up 80% of the data set.
Other features, such as the number of impediments
(77%) or the number of story point changes after the
sprint has started (83%) exhibit similar distributions.
A weighted label combining such features at different
thresholds may improve this distribution.
We apply the data set to a deep neural network
with various configurations to make use of the capa-
bilities of such architectures to handle a large number
of features. We find a feasible neural network with
three hidden layers of 100, 200 and 300 activation
nodes, respectively. In Figure 5, we show the accu-
racy curve of this experiment. We reach an accuracy
of 84% on the test set after training the neural network
for 1000 steps. A baseline classification using the la-
bel of the previous sprint has an accuracy on the test
set of 78%, indicated by the dashed line. Our trained
model thus outperforms a forecasting operation. This
is a promising result of our novel application of ma-
chine learning on Scrum data.
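A sketch of this network follows; the layer sizes come from the text, while the activation function, optimizer and the use of the current Keras API are our assumptions:

```python
import tensorflow as tf

def build_model(num_features: int = 23) -> tf.keras.Model:
    """Three hidden layers of 100, 200 and 300 nodes, binary output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_features,)),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(all stories done)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.Precision()])
    return model
```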
Figure 5: Accuracy and precision of the three-layer neural network on the data set, with the label ‘all stories are done’, plotted against the training step (0 to 1,000).
6 DISCUSSION
Our research is in the preliminary stages of assessing
the possibilities of a quantitative analysis in Scrum
software development processes. Our results thus far
indicate that such an analysis has promising potential, as it may provide us with relevant factors and risk assessments during a sprint. We wish to know which features impact the progress of the sprint the most, and why they are in-
fluential. From a data mining standpoint, we show
that it is possible to extract features from a collection
of records regarding Scrum sprints. Further analysis
will show what techniques work to improve accuracy.
In our formulation of conceptual models akin to a
Scrum sprint, we find that some of these models are
easier to relate than others. Further abstraction some-
times reveals the most important factors in the newly
created model and thus the original Scrum model. In-
spirations from other scientific fields help make the
model more intelligible through recognizable proper-
ties. A model may be simplified, describe a specific
feature in more detail, or contain inherent attributes
that only become visible once the model is abstracted.
From our proposed models, a feature of stable veloc-
ity, independent of the team size, comes to mind.
7 CONCLUSIONS
This quantitative study of process data from Scrum
development sprints presents a novel application of
data mining in the field of software engineering. The
use of state-of-the-art prediction algorithms plays a central role in this research. Analytical approaches help
find the success factors of Scrum sprints.
Conceptual models present a viable method for
validating a set of features and labels against the
model or fractions of it. The abstractions that are
made by relating one event to another help in finding
features that one could not perceive beforehand.
7.1 Future Research
The Scrum process provides a model which yields a set of features that indicate certain behavior during a sprint. Accurate prediction using these features remains challenging due to noisy data. Future developments, including feature selection and expansion of the data set with more variation, may help solve this task.
We intend not only to predict binary classifications for our end users, but also to provide recommendations to
team members and management. The learning model
must be able to tell why it came to a certain conclu-
sion and what can be done to counteract the risk of
a failing sprint within time constraints. This means
that the model will have more introspective abilities
as well as the capability to provide more than a risk
assessment, leading to new norms.
ACKNOWLEDGEMENTS
We want to thank Stichting ICTU for providing the
funding and data access which makes it possible to
perform research and build tools for prediction and
visualization. Particularly, we thank those who assist
us through feedback during meetings, interviews and
surveys and let us observe Scrum in practice.
REFERENCES
Abdulmalek, F. A. and Rajgopal, J. (2007). Analyz-
ing the benefits of lean manufacturing and value
stream mapping via simulation: A process sector
case study. International Journal of Production Eco-
nomics, 107(1):223–236.
Agile Alliance (2001). Manifesto for agile software devel-
opment. http://agilemanifesto.org/ [2017-08-30].
Almeida, L., Albuquerque, A., and Pinheiro, P. (2011). A
multi-criteria model for planning and fine-tuning dis-
tributed Scrum projects. In Proceedings of the 6th
IEEE International Conference on Global Software
Engineering, pages 75–83.
Arcuri, A. and Yao, X. (2008). A novel co-evolutionary
approach to automatic software bug fixing. In Pro-
ceedings of the IEEE Congress on Evolutionary Com-
putation, pages 162–168.
Cockburn, A. (2007). Agile Software Development: The
Cooperative Game. Addison-Wesley, 2nd edition.
Dybå, T. and Dingsøyr, T. (2008). Empirical studies of agile
software development: A systematic review. Informa-
tion and Software Technology, 50(9):833–859.
Eloranta, V.-P., Koskimies, K., and Mikkonen, T. (2016).
Exploring ScrumBut–An empirical study of Scrum
anti-patterns. Information and Software Technology,
74:194–203.
Highsmith, J. A. (2002). Agile Software Development
Ecosystems. Addison-Wesley.
Lee, R. C. (2012). The success factors of running Scrum: A
qualitative perspective. Journal of Software Engineer-
ing and Applications, 5(6):367–374.
Lei, H., Ganjeizadeh, F., Jayachandran, P. K., and Oz-
can, P. (2017). A statistical analysis of the ef-
fects of Scrum and Kanban on software development
projects. Robotics and Computer-Integrated Manu-
facturing, 43:59–67.
Mann, C. and Maurer, F. (2005). A case study on the im-
pact of Scrum on overtime and customer satisfaction.
In Proceedings of the Agile Development Conference,
pages 70–79.
Paasivaara, M., Durasiewicz, S., and Lassenius, C. (2009).
Using Scrum in distributed agile development: A mul-
tiple case study. In Proceedings of the 4th IEEE Inter-
national Conference on Global Software Engineering,
pages 195–204.
Pikkarainen, M., Haikara, J., Salo, O., Abrahamsson, P.,
and Still, J. (2008). The impact of agile practices on
communication in software development. Empirical
Software Engineering, 13(3):303–337.
Rising, L. and Janoff, N. S. (2000). The Scrum software
development process for small teams. IEEE Software,
17(4):26–32.
Shepperd, M., Bowes, D., and Hall, T. (2014). Researcher
bias: The use of machine learning in software defect
prediction. IEEE Transactions on Software Engineer-
ing, 40(6):603–616.
Wynekoop, J. and Russo, N. (1995). Systems development
methodologies: Unanswered questions. Journal of In-
formation Technology, 10:65–73.