Statistical Model Checking of Distributed Programs within SimGrid
Marie Duflot-Kremer and Yann Duplouy
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Keywords:
Stochastic Distributed Systems, Distributed Programs, Statistical Model Checking, SimGrid, Simulation.
Abstract:
In this paper, we present an approach to perform statistical model-checking over stochastic distributed pro-
grams using the SimGrid framework. The distributed programs are modeled using SimGrid, a fast and
lightweight framework for the simulation of distributed programs, which we have enhanced in three ways:
a cleaner description of the probabilistic evolution of the capacities of resources, a centralized random number
generator, and a protocol for the observation of the simulations. We also propose a toolset for the statistical
model-checking of those simulated distributed programs, and in particular a prototype tool SimGridStatMC.
The toolset is illustrated by evaluating various properties of an implementation of the peer-to-peer BitTorrent
protocol.
1 INTRODUCTION
Distributed systems are, by definition, interesting and
complex to study. Indeed, their complexity stems not
only from the program run by the different agents, but also
(and mainly) from the architecture of the system, the
communication between these agents and their heterogeneity.
They therefore raise new questions, and their
verification requires techniques that can scale to
very large systems.
SimGrid (Casanova et al., 2014) is a framework for
developing simulators of distributed applications. By
emulating both the application to run and the environment
(network capacity, computing power of the different
nodes, ...), it makes it possible to evaluate the appropriateness of
different algorithmic solutions, to measure their scalability,
or to dimension a network to achieve a given
task. SimGrid is by design fast and lightweight in
terms of memory, which allows a fairly large network
to be simulated quickly and on a single machine.
Prior to this work, using SimGrid required a very
precise specification of the behavior of the distributed
system. For example, in the case of networks, one had
to know precisely when and how the bandwidth of the
different links would vary with time, but also when
a server would become unavailable or its speed would
decrease. The precise description of both the program to
run and of the environment is a necessary requirement
to get realistic and reproducible simulations.
By contrast, in order to verify a system, we
need to take into account all possible executions of
the program. These include possible variations of
the computation time (due to a change in the workload
of a node), of the transmission delays (due to changes
in the amount of traffic in the network), and failures.
Work has already been done to add model-checking
capabilities to SimGrid for both safety (something
bad will never happen) (Merz et al., 2011) and liveness
(something good will eventually happen) (Guthmuller
et al., 2018), but so far only for non-probabilistic
systems. Such a verification requires (upper and
lower) bounds for the parameters that can vary, e.g., transmission
delays. Model-checking can then be applied
to this non-deterministic system to check, for example,
whether a computation can be done in a given amount of
time/memory.
A hurdle for this model-checking approach is that,
in many cases, there exists a (worst) case where the
goal is not met. For example, if the application contains
deadlines, there is always a possibility that the network
and/or a node are too slow to meet the deadline,
and the answer to the question "will the message arrive
before the deadline?" is then "no". Knowing that a bad
interleaving of actions can happen is in general not
sufficient; it is more interesting to know
how likely such an event is to happen, and thus to insert
probabilities in the model and use methods that can
handle such probabilities to verify or evaluate our system.
There are two main approaches for the model-
checking of stochastic models:
The numerical approach, based on matrix calculus,
gives precise results (albeit sensitive to
numerical errors) but requires strong probabilistic
hypotheses and, as it stores the transition
relation of the whole system, needs a lot of
memory and is subject to combinatorial explosion,
which makes it impossible to use on large
systems such as realistic distributed systems.
The statistical approach, based on Monte Carlo
simulations, has fewer restrictions and can handle
very large systems, as it only requires an executable
model. It is also easy to parallelize. The counterpart is that
it only gives approximate results together with
a confidence level, and it requires specific handling
of rare events.
Several tools already exist to perform both approaches
to probabilistic model-checking. The most
popular by far is PRISM (Kwiatkowska et al., 2011).
It can perform both numerical and statistical model-
checking on systems modeled as (variants of) Markov
chains or probabilistic automata.
However, none of the existing tools is particularly
suited to handling network communication issues, and,
from a practical point of view, using them requires
a formalization of the distributed program; SimGrid
allows for a faster conversion of an already implemented
distributed program into a simulator.
In this paper, since we aim at handling large
distributed systems, only the statistical approach is
feasible. Furthermore, this approach has two advantages.
First, it makes it possible to benefit from the
power of SimGrid, which will be used to run the simulations.
Second, the integration of our tool in this
framework makes a new verification approach available
to SimGrid users.
2 STOCHASTIC MODELING AND
STATISTICAL
MODEL-CHECKING
In order to use the SimGrid platform to perform statis-
tical model-checking of distributed systems, we need
to enhance it with stochastic aspects and develop a
method that combines SimGrid and statistical tools
to evaluate the properties we want to consider. In
this section, we first explain how SimGrid has been
extended to neatly model stochastic distributed pro-
grams running in a stochastic environment (with a
probabilistic occurrence of failures), and we then de-
scribe our statistical model-checking approach.
Original SimGrid Models. In a SimGrid model,
the distributed program and its associated distributed
system are usually described using three components:
The actors, written in C++, are subprograms that
execute a task of the distributed program;
The platform, usually an XML file, contains the information
about each node and each link;
The deployment, usually an XML file, associates
nodes with one of the actors (both files are sketched below).
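As an illustration, here is a hypothetical minimal pair of platform and deployment files. The element and attribute names follow recent SimGrid versions but are given as an assumption; the SimGrid documentation should be consulted for the exact schema.

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "https://simgrid.org/simgrid.dtd">
<platform version="4.1">
  <!-- two nodes connected by one link (illustrative names and capacities) -->
  <zone id="AS0" routing="Full">
    <host id="node-0" speed="1Gf"/>
    <host id="node-1" speed="1Gf"/>
    <link id="link-01" bandwidth="1MBps" latency="10ms"/>
    <route src="node-0" dst="node-1"><link_ctn id="link-01"/></route>
  </zone>
</platform>

and a matching deployment file, associating the actor registered under the name "peer" with each node:

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "https://simgrid.org/simgrid.dtd">
<platform version="4.1">
  <actor host="node-0" function="peer"/>
  <actor host="node-1" function="peer"/>
</platform>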
We furthermore suppose that the developer of the
model uses the «SimGrid for you» (S4U) interface
(see https://simgrid.org/doc/latest/app_s4u.html). Using
this interface, a simulator can be built in C++ by describing
the actors with C++ classes, including the SimGrid
libraries, initializing a SimGrid Engine object,
loading the platform file, then loading the deployment
file, and finally starting the simulation through
the SimGrid Engine object.
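A minimal sketch of such a simulator, assuming a single actor class Peer and illustrative file names (the exact S4U calls may differ slightly between SimGrid versions):

#include <simgrid/s4u.hpp>
#include <string>
#include <vector>

// An actor: a C++ class executed on one node of the platform.
class Peer {
public:
  explicit Peer(std::vector<std::string> args) { /* arguments from the deployment file */ }
  void operator()() {
    // Body of the actor: communications, computations, sleeps, ...
    simgrid::s4u::this_actor::sleep_for(1.0);
  }
};

int main(int argc, char** argv) {
  simgrid::s4u::Engine engine(&argc, argv);
  engine.register_actor<Peer>("peer");      // name referenced by the deployment file
  engine.load_platform("platform.xml");     // nodes and links
  engine.load_deployment("deployment.xml"); // which actor runs on which node
  engine.run();                             // start the simulation
  return 0;
}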
Until now, it was possible to describe stochastic distributed
programs by adding probabilities directly, via
C++ statements, to the actors. Concerning platform
and deployment files, the XML description files do
not have any field for stochastic descriptions. Probability
distributions on bandwidth or computational
power could be added by describing the whole platform
directly within the C++ main, but this would require
recompiling the simulator each time the distributions
change. As it is a cleaner approach
to keep the platform and deployment files separated from
the C++ simulator code, we chose to use the already existing
SimGrid profiles, which are meant to describe temporal
changes, and to modify them to allow stochastic descriptions.
A profile can be associated with each parameter of
a node or a link (such as the bandwidth or the computational
power) and describes the evolution of this parameter
over time. The following profile could describe
a latency that during the first second is set to its
default value (described in the platform file), then at
time 1 second changes to 3ms and at time 3 seconds rises to
15ms.
1 0.003
3 0.015
The LOOP keyword can be added at the end of
the profile with the number of seconds to wait before
looping. If LOOP 2 is added to the previous file, then
the profile is reset after 5 seconds, setting it back to
the default value, then the latency at 6 seconds would
change back to 3ms, and after 8 to 15ms and so on.
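Assuming the keyword is simply written on its own line at the end of the file, this looping profile would read:

1 0.003
3 0.015
LOOP 2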
Stochastic Profiles. We have enhanced SimGrid
with stochastic profiles, which allow the user to easily
model probabilistic aspects such as
failures of the nodes. To do so we replace the de-
terministic times and values of the profiles with stan-
dard probability distributions. The STOCHASTIC key-
word must be added at the beginning of the profile,
then each line contains, separated by spaces, the time
distribution (either DET, UNIF, NORMAL or EXP), then
the parameters of the time distribution, then the value
distribution, and finally the parameters of the value
distribution. The following profile describes a parameter,
e.g. a latency, that for the first two seconds is
set to its default value. At time 2 a new value for the
latency is drawn uniformly between 10ms and 20ms;
then a time instant t is drawn according to an exponential
law of mean 1/0.05, and at time 2 + t the latency is
drawn according to the normal law of mean 45ms and
standard deviation 5ms. Finally, at time 2 + t + 10,
the latency is drawn according to an exponential law
of mean 1/20. (Note that, unlike the original profiles,
which specify the time instants of the changes since the
start of the simulation or of the loop, the sampled timing
values here denote the delay between two changes;
this avoids overlapping time intervals.)
STOCHASTIC
DET 2 UNIF 0.010 0.020
EXP 0.05 NORMAL 0.045 0.005
DET 10 EXP 20
As for non-stochastic profiles, it is possible to loop
a stochastic profile by adding the LOOP keyword after
STOCHASTIC. In that case, the last drawn time is
used as a base for the loop. In our example
of a stochastic profile, at time 2 + t + 10 + 2, the latency
would be drawn again according to the uniform law, due to
the looping of the profile.
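For instance, assuming the same layout, the looping variant of the example above would read:

STOCHASTIC LOOP
DET 2 UNIF 0.010 0.020
EXP 0.05 NORMAL 0.045 0.005
DET 10 EXP 20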
Observed Variables and Protocol for Simulation
Observation. Before building tools to perform sta-
tistical model-checking, we introduce a protocol for
the observation of the simulation. The tool will com-
municate with the simulator, listening for a number
of observed variables that are defined for the study by
the SimGrid user, and controlling whether the simula-
tion should continue or not. These observed variables
must be initialized before the start of the simulation,
and their value may be modified by the actors during
the simulation. The communication with the simula-
tor is done by hooks on SimGrid signals; these signals
are sent at key moments of the simulation (start, end,
completion of a step). At each step of the simulation,
a line composed of the current time and the value of
each observed variable is sent to our tool; then the
simulator waits for the reply of our tool, i.e., whether
or not it should continue the simulation.
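As an illustration, here is a hypothetical sketch of the tool side of this exchange. The message layout (one line per step, the time followed by the observed values) matches the description above, but the CONTINUE/STOP answers and the stopping criterion are illustrative assumptions, not the exact protocol of our tool.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
  std::string line;
  while (std::getline(std::cin, line)) {      // one line per simulation step
    std::istringstream in(line);
    double time;
    in >> time;                               // first field: current simulated time
    std::vector<double> observed;
    for (double v; in >> v;)                  // remaining fields: the observed variables
      observed.push_back(v);
    bool stop = time > 50000;                 // illustrative criterion: give up on overly long runs
    std::cout << (stop ? "STOP" : "CONTINUE") << std::endl;  // reply read by the simulator
  }
  return 0;
}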
Randomness in SimGrid. SimGrid is meant to per-
form reproducible simulations of a distributed pro-
gram, yet we need different executions in order to
perform a statistical analysis. We also want to keep,
as best as we can, the reproducibility of the statisti-
cal analysis. In the SimGrid framework, the simula-
tions are made using the standard library’s Mersenne-
Twister random number generator. Calls from both
the actors (in the case of a stochastic distributed pro-
gram) and the generation of events from the profile
are redirected to the unique Mersenne-Twister ran-
dom number generator.
When performing multiple simulations in a row,
at the end of each simulation the current state of the
generator is saved to a file, to be read at the start of
the next simulation. Moreover, in the case of paral-
lel simulations, the first batch of executions is per-
formed by seeding the generator with consecutive in-
tegers. These two practices should ensure that the ran-
dom number generation avoids biases in the statistical
evaluation.
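A minimal sketch of these two practices using the standard Mersenne-Twister engine (the file names and the exact seeding scheme are illustrative, not SimGrid's actual code):

#include <fstream>
#include <random>
#include <string>

std::mt19937 load_or_seed(const std::string& state_file, unsigned int seed) {
  std::mt19937 gen(seed);           // in a parallel batch, run i is seeded with base + i
  std::ifstream in(state_file);
  if (in)
    in >> gen;                      // resume from the state saved by the previous simulation
  return gen;
}

void save_state(const std::mt19937& gen, const std::string& state_file) {
  std::ofstream out(state_file);
  out << gen;                       // the whole engine state can be serialized as text
}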
HASL. We now introduce the formalism that we
use for the statistical model-checking toolset presented
in the next paragraph. It comes from
the statistical model-checker Cosmos (Ballarini et al.,
2015) and is called the Hybrid Automata Stochas-
tic Language. A HASL formula consists of two el-
ements:
First, a hybrid automaton that synchronizes with
the execution of the observed program (or, more
generally, of a Discrete Event Stochastic Process).
It makes it possible both to select relevant paths and to
maintain indicators, using data variables evolving
along the path and the observed variables of the
distributed program;
Second, an expression based on the data variables,
which describes the quantity to be evaluated. These
expressions include path operators, such as the
minimum and maximum values reached during an
execution, the last value, the integral over time or
the time average.
Note that the performance indices corresponding to
these expressions are conditional expectations over
the successful paths of the hybrid automaton. More
precisely, the results of the simulation count in the
computation of the value of the expression only if the
automaton reaches a final state during the execution.
Since we cannot in our tool synchronize with the sim-
ulator as precisely as we would with the Cosmos mod-
els, we have added rejecting states. If such a state is
reached, the simulation is ignored. This is equivalent
to a failed synchronization in Cosmos.
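In generic notation (a sketch of the estimation, not the exact HASL semantics), if Y(σ) denotes the value of the expression on a simulated path σ and A_N is the set of accepted paths among the N simulations, the estimated quantity is of the form

E[Y | accepted] ≈ (1 / |A_N|) · Σ_{σ ∈ A_N} Y(σ).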
SimGridStatMC. In order to evaluate those ex-
pressions, we propose a prototype named SimGrid-
StatMC that performs statistical model-checking over
simulations performed with our enhanced version of
SimGrid. This prototype is based on the source of the
model-checker Cosmos, but reworked to support the
observation protocol described previously. It takes as
inputs (the path to) the executable of the simulator, the
deployment and platform files, the file describing
the HASL formula, and other arguments that depend
on the statistical procedure being used.
Like other statistical model-checking tools, SimGridStatMC
generates paths and relies on statistical
results to evaluate the precision of the computed value.
Our tool uses confidence intervals, which aim at
establishing an interval of possible values for the parameter
to estimate, together with a confidence level
that the parameter really lies in that interval. For
confidence interval estimation, several methods can be
used:
The Chernoff-Hoeffding bound (Hoeffding,
1963), for the estimation of the expectation of
a bounded random variable, requires two out of
three related parameters (the interval width, the
confidence level and the number of samples) and
determines the third one from the other two; a
sketch of this computation is given after this list. This
procedure outputs a confidence interval whose
width satisfies the requirement and whose
probabilistic guarantee is exact;
The Chow-Robbins bound (Chow and Robbins,
1965), which applies to the estimation of a random
variable for which no bound is known. It requires
two parameters, the interval width and the
confidence level, and outputs a confidence interval
with the requested width. The number of simulations
is not precomputed; instead, the confidence interval
is recomputed regularly until it is small enough.
The number of samples therefore depends on the variability
of the values obtained while performing the
simulations.
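As an illustration of the first procedure, here is a sketch of the sample-size computation for a random variable bounded in [a, b]; it is a direct application of Hoeffding's inequality, not necessarily the exact formula used by SimGridStatMC.

#include <cmath>
#include <cstdio>

// Number of samples n such that P(|empirical mean - true mean| >= width/2)
// is at most 1 - confidence, for a random variable taking values in [a, b].
long required_samples(double width, double confidence, double a, double b) {
  double eps   = width / 2.0;       // half-width of the confidence interval
  double delta = 1.0 - confidence;  // allowed error probability
  double range = b - a;
  return static_cast<long>(std::ceil(range * range * std::log(2.0 / delta)
                                     / (2.0 * eps * eps)));
}

int main() {
  // e.g. estimating a probability (values in [0, 1]) with an interval of
  // width 0.01 at confidence level 99%:
  std::printf("%ld samples needed\n", required_samples(0.01, 0.99, 0.0, 1.0));
  return 0;
}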
Python Scripts. For some tasks (such as obtaining
a histogram of the different values of a HASL expression
over the simulated paths), we prefer to use dedicated Python
scripts. The general idea is unchanged: we run several
simulations in parallel, starting a new simulation
as soon as a previous one has completed, and gather the
results at the end of each simulation, until the chosen
end condition is satisfied (for example, a given number of
simulations).
Biases. Launching several simulators in parallel and
gathering the results at the end of each simulation may
introduce a bias in favor of the fastest simulations.
If the SimGrid model has greatly varying execution
times, the slowest simulations are likely to still be
running when the tool reaches the chosen end condition;
all remaining simulations are then halted, resulting in
some long simulations being ignored in the computation
of the result. However, the number of ignored simulations
is strictly lower than the number of parallel simulations,
in our case at most 32.
Repository. This toolset is available as a git repository
at https://framagit.org/pikachuyann/simgrid-statmc/
3 CASE STUDY: BitTorrent
In order to illustrate the variety of properties that can
be evaluated using our approach, we apply it to a
stochastic SimGrid model, more precisely a model of
the BitTorrent protocol.
BitTorrent. BitTorrent (Cohen, 2008) is a peer-to-peer
protocol for distributing files. There is thus no
big server distributing the file to all nodes; instead,
the nodes themselves exchange parts of the
file depending on the progress of their own downloads.
One particular node, called the tracker, maintains the
list of peers currently participating in the protocol,
and communicates a randomly chosen list of peers to
each peer that connects and requests it (more precisely,
if more than n nodes are known by the tracker, usually
with n = 50, a randomly chosen list of n nodes is sent
to the requesting peer; otherwise the entire list is sent).
The peers are usually distinguished by whether they already
have the full file (in which case they are called seeders) or
not (in which case they are called leechers). Finally,
the order in which pieces are requested from other peers
and then downloaded is also chosen randomly.
For our experiments, we have decided to use a
100MB file divided into 100 pieces of 1MB each, with
76 peers (initially including one seeder) and a 1MB/s
download and upload speed for each peer (which is
similar to the file used in (Testa et al., 2012)). Our
model is the BitTorrent example included in the SimGrid
distribution, which has been made more resilient to
the unavailability of peers in order to measure the impact
of failures on the completion time, and now also supports
a larger number of pieces. We consider that the
peers attempt to download a piece in one go. The experiments
are performed on nodes of Grid'5000, with
32 simulators being launched in parallel.
Experiments. In the first experiment, we measure
the average completion time without node failure. We
halt the simulations when the confidence interval has
reached a 0.5% relative width, with 99% confidence
level. In this case, we measure an average comple-
tion time of 1236.36 seconds (with a confidence in-
terval [1233.28;1239.47]), over 1213 simulations per-
formed within 533 seconds. We also measure the
mean download time for a peer, which is 951.96 seconds
([950.28;953.66]).
Figure 1: Average n-node completion time (time as a function of the number of completed downloads), out of 1,000 simulations on the standard protocol (squares) and on a variant where nodes leave as soon as their download is complete (circles).
The graph in Figure 1 illustrates how long it takes,
depending on n, for the n fastest nodes to finish the
download. This is illustrated on two variants of the
BitTorrent protocol: the standard one (measurements shown with
squares) and one where all peers except the seeder
stop the protocol upon completing their own download
(circles). The linear curve represents the time
taken for the download if all peers download the file
directly from the seeder. With our hypotheses, we see
that the peer-to-peer version becomes faster after 10
nodes. Note that for the upper curve, the 76th node is
missing due to very long completion times that led to
halted simulations.
In our last experiment on a system without node
failures, we produce in Figure 2 a histogram, over
1,000 simulations, of the total completion time. We
can see that the most frequently encountered total
completion times lie between 1220 and 1230 seconds.
Very few simulations produced a completion time over
1350 seconds, and the fastest and slowest completion
times are respectively 1134.62 and 1397.32 seconds.
Figure 2: Distribution of completion times of the BitTorrent protocol, over 1000 simulations (number of simulations as a function of completion time).
Figure 3: HASL formula for the computation of the mean download time: an automaton with dynamics ẋ = inprogress and ṫ = 1, with guards completed = nodes and t > 50000, and the expression E(LAST(x)/(nodes − seeders)).
We can measure the mean download time of a
node using the HASL formula shown in Figure 3.
In this example, the variable t represents the time
(and hence evolves at constant speed 1 during the
simulation), whereas variable x counts the cumu-
lated waiting time of all the peers downloading (and
thus its evolution speed is the number of peers cur-
rently downloading). The formula E(LAST(x)) then
counts the average cumulated waiting time over a
simulation, and can be divided by the number of
non-initially seeder peers to get the mean waiting
time for a peer. One could also measure the aver-
age number of peers downloading at any time using
E(LAST(x)/LAST(t)).
In the next experiment, we add failures to each
node, with a varying exponential rate and a fixed (10s)
repair time. We measure the completion time and the mean
download time, each with a 99% confidence level and
a 5% relative width, and with a minimum of 100 simulations.
Both the mean completion time and the mean download
time, depending on the failure rate, are shown in
Figure 4. The computation
of the mean completion time required between 421
(for the rate λ = 10000s) and 856 simulations (for the
rate λ = 3000s). For the mean download time, with
100 simulations the confidence interval was already
small enough for every value of λ.
Figure 4: Mean completion time and mean download time per peer for the BitTorrent protocol with introduced failures with an exponential rate (as a function of the failure rate λ).
Figure 5: Distribution of completion times of the BitTorrent protocol with introduced failures with an exponential rate λ = 1000s, over 1000 simulations.
In the case of failures with an exponential rate
of λ = 1000s, we also produced a histogram of the
completion times, shown in Figure 5.
For the last experiment, we implemented a version
of super-seeding (Hoffmann, 2008), as implemented in
libtorrent (https://www.libtorrent.org/). Super-seeding
is a mode in which a unique seeder tries to minimize the
amount of data it sends. It does not announce itself as a
seeder to the other peers, but announces only one piece
at a time, announcing a different piece to each of the other
peers. Moreover, once the piece announced to a node n has
been received by enough other nodes, the super-seeder
announces a new piece to the node n. The goal
is for the seeder to upload the least possible amount
of content while still sharing the entire file. We intro-
duce another observation variable, namely the num-
ber of pieces sent by the seeder. The average number
of pieces sent by the seeder dropped from 189.61
without super-seeding to 104.45 with it, but at the cost
of a longer completion time (3674 seconds with super-seeding
instead of 1230).
4 CONCLUSION
In this article, we have presented how the SimGrid
framework can be used to perform statistical model-
checking of distributed programs. In particular we ex-
plained how the framework can be extended in two
ways: firstly enhancing the model with a stochas-
tic description of the capacities of a resource (the
stochastic profiles) and management of random num-
ber generation, and secondly adding indicators and
communication capabilities for the observation of a
simulation. We finally showcased these extensions
and their performance evaluation capabilities on a Bit-
Torrent model developed with the SimGrid frame-
work. We were able to compute expected times for
completion and compare different variants of the pro-
tocol, in particular super-seeding, in which less con-
tent is sent by the seeder, at the price of a longer com-
pletion time.
ACKNOWLEDGEMENTS
Experiments presented in this paper were carried out
using the Grid’5000 testbed, supported by a scientific
interest group hosted by Inria and including CNRS,
RENATER and several Universities as well as other
organizations (see https://www.grid5000.fr).
This work has been supported by the INRIA
collaborative project IPL HAC-SPECIS:
http://hacspecis.gforge.inria.fr/
REFERENCES
Ballarini, P., Barbot, B., Duflot, M., Haddad, S., and Pekergin,
N. (2015). HASL: A new approach for performance
evaluation and model checking from concepts to experimentation.
Performance Evaluation, 90:53–77.
Casanova, H., Giersch, A., Legrand, A., Quinson, M.,
and Suter, F. (2014). Versatile, scalable, and accu-
rate simulation of distributed applications and plat-
forms. Journal of Parallel and Distributed Comput-
ing, 74(10):2899–2917.
Chow, Y. S. and Robbins, H. (1965). On the asymptotic
theory of fixed-width sequential confidence intervals
for the mean. Ann. Math. Statist., 36(2):457–462.
Cohen, B. (2008). The BitTorrent protocol specification.
http://www.bittorrent.org/beps/bep_0003.html.
Guthmuller, M., Corona, G., and Quinson, M. (2018).
System-level state equality detection for the formal
dynamic verification of legacy distributed applica-
tions. J. Log. Algebraic Methods Program., 96:1–11.
Hoeffding, W. (1963). Probability inequalities for sums of
bounded random variables. Journal of the American
Statistical Association, 58(301):13–30.
Hoffmann, J. (2008). Superseeding.
http://www.bittorrent.org/beps/bep_0016.html.
Kwiatkowska, M., Norman, G., and Parker, D. (2011).
PRISM 4.0: Verification of probabilistic real-time systems.
In Computer Aided Verification, pages 585–591.
Springer.
Merz, S., Quinson, M., and Rosa, C. (2011). SimGrid MC:
Verification support for a multi-API simulation platform.
In Bruni, R. and Dingel, J., editors, Formal
Techniques for Distributed Systems, pages 274–288,
Berlin, Heidelberg. Springer Berlin Heidelberg.
Testa, C., Rossi, D., Rao, A., and Legout, A. (2012). Experimental
assessment of BitTorrent completion time in
heterogeneous TCP/uTP swarms. In Pescapè, A., Salgarelli,
L., and Dimitropoulos, X., editors, Traffic Monitoring
and Analysis, pages 52–65, Berlin, Heidelberg.
Springer Berlin Heidelberg.