Mutation Selection: Some Could be Better than All
Zhiyi Zhang
, Dongjiang You
, Zhenyu Chen
Yuming Zhou
and Baowen Xu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Software Institute, Nanjing University, Nanjing, China
Abstract. In previous research, many mutation selection techniques have been
proposed to reduce the cost of mutation analysis. After a mutant subset is se-
lected, researchers could obtain a test suite which can detect all mutants in the
mutant subset. Then they run all mutants over this test suite, and the detection
ratio to all mutants is used to evaluate the effectiveness of mutation selection
techniques. The higher the ratio is, the better this selection technique is. Ob-
viously, this measurement has a presumption that the set of all mutants is the
best to evaluate test cases. However, there is no clearly evidence to support this
presumption. So we conducted an experiment to answer the question whether
the set of all mutants is the best to evaluate test cases. In this paper, our expe-
riment results show that a subset of mutants may be more similar to faults than
all the mutants. Two evaluation metrics were used to measure the similarity –
rank and distance. This finding reveals that it may be more appropriate to use a
subset rather than all the mutants at hand to evaluate the fault detection capabil-
ity of test cases.
1 Introduction
Mutation analysis is a fault-based testing technique that has been used to identify or
create adequate test cases effectively. It was first proposed by [6] and [3]. In mutation
analysis, a faulty program termed mutant is generated by seeding a fault into the orig-
inal program, and the transformation rule to generate mutants is called mutation oper-
ator. Then we can execute a test case on a mutant and the original program to com-
pare their outputs. If their outputs are different, we say that the mutant can be de-
tected by this test case. Otherwise, this mutant survives. However, there are some
mutants that cannot be detected by any test case and these mutants are deemed to be
equivalent mutants. Automatically detecting equivalent mutants has been proved to
be an undecidable problem [10].
Although mutation analysis is an effective evaluation for testing experiment [1], it
has seldom been used in practice because of its high cost. Even a small program can
generate numerous mutants. Obviously, compiling and executing such numerous
mutants are heavy burdens in mutation analysis. In order to alleviate these burdens,
many cost reduction techniques have been developed. A natural idea is to select a
subset from all the mutants, while maintaining the effectiveness to evaluate the fault
detection capability of test cases. Operator-based selection, clustering-based selection,
Zhang Z., You D., Chen Z., Zhou Y. and Xu B..
Mutation Selection: Some Could be Better than All.
DOI: 10.5220/0003559700100017
In Proceeding of the 1st International Workshop on Evidential Assessment of Software Technologies (EAST-2011), pages 10-17
ISBN: 978-989-8425-58-4
2011 SCITEPRESS (Science and Technology Publications, Lda.)
and random selection are three representative mutation selection techniques.
Previous research showed that mutants can be used to replace faults to evaluate
the fault detection capability of test cases [1]. Most of the existing efforts have an
implication that using all mutants to evaluate the fault detection capability of test
cases will increase the validity of their conclusions than using a subset of mutants.
After selecting a subset of mutants, we can obtain test cases that can detect all the
mutants in this subset. Then we execute all non-equivalent mutants on these test cases
and calculate the mutation detection ratio. The higher the ratio is, the better the subset
is. However, there is no convincing evidence whether this methodology is trustwor-
thy because some mutants may have negative affection in mutation analysis.
In this paper, we investigated the similarity between faults and all mutants gener-
ated using mutation operators, and between faults and subsets of mutants. Our expe-
riment results show that a subset of mutants is more similar to faults in the ability of
evaluating test cases. This finding reveals that it may be more appropriate to use a
subset rather than all the mutants at hand to evaluate the fault detection capability of
test cases in some studies.
The rest of this paper is organized as follows. Section 2 introduces our experiment.
Section 3 reports the results and analysis. Section 4 presents the conclusions and
future work.
2 Experiment
We are interested in the following research question:
When considering the ability of evaluating test cases, can a subset of mutants be
better than all mutants? i.e., can a subset of mutants be more similar to faults than all
2.1 Subject Programs
We used seven Siemens C programs as subjects [5], [11]. We obtained all the subject
programs and the corresponding test cases from Software-artifact Infrastructure Re-
pository (SIR) [4]. Table 1 shows the basic information of the seven programs.
Table 1. Subject Programs.
Programs Lines of Code Test Pool Size Non-Equivalent Faults Non-Equivalent Mutants
tcas 137 1608 40 4011
tot_info 281 1052 23 7870
schedule 296 2650 9 3681
schedule2 263 2710 9 4862
print_tokens 343 4130 7 9165
print_tokens2 355 4115 10 8765
replace 513 5542 31 19861
2.2 Techniques for Generating Mutants
In our experiment, we used the following four techniques to obtain a set of mutants.
For all mutants, we used a tool called ProteumIM2.0 [2], which implemented 108
mutation operators for C language. For each subject program, we generated all mu-
tants with all the mutation operators, and equivalent mutants are discarded in our
In order to increase the validity of our experiment, we generated the subsets of
mutants with the same size (denoted as n) when using Offutt et al. operator-based
selection, clustering-based selection and random selection.
For operator-based selection, we preferred several operators to all operators, and
we generated the operator subset with Offutt et al. 5 operators (ABS, UOI, LCR,
AOR and ROR) [9], [8].
For clustering-based selection [7], we classified all non-equivalent mutants into
several different clusters using simple K-means based on the detectable test cases.
According to the characteristics of clustering, mutants in the same cluster can always
be detected by similar test cases. Then we can select several mutants from one cluster
to represent all the mutants in that cluster. In our experiment, the number of clusters
is 20, and we randomly selected n/20 mutants from each cluster. If the number of
mutants in a cluster was less than n/20, we selected all mutants in this cluster.
For random selection, we randomly selected n mutants from all non-equivalent
mutants to obtain the subset of mutants.
2.3 Generate Test Suite
We generated a test suite of 512 test cases (the maximum number of test cases al-
lowed by the tool we used) from the test pool by the following procedure:
We used the test mode in Proteumim2.0. In this mode, if a mutant has been de-
tected by a test case, this mutant will not be executed by the remaining test cases, and
if a test case cannot detect any survived mutant, this test case is defined as redundant
test case and thus can be removed from the test suite.
We randomly selected 512 test cases to form a test suite. After running this test
suite, we removed redundant test cases (assume the number is m) from the test suite
and added another m test cases from the remaining test cases in the test pool. We
repeated this process until all the test cases in the test pool had been executed (m was
always greater than zero in our experiment). As such, the mutant that can be detected
by the test pool can also be detected by this test suite. If the number of test cases in
this test suite was less than 512 (all of our subject programs met this condition), we
randomly selected test cases that were not in this test suite from the test pool to form a
test suite (denoted as TS) with exactly 512 test cases.
2.4 Generate Results
We randomly selected test cases from TS using a selection rate x% from 25% to
100% in steps of 5%. We denoted these subsets of TS as
TS (i is from 1 to 16). Then
we ran faulty versions on
TS and recorded the number of faults that can be detected
TS . Thus we could obtain the fault detection ratio as follows:
KF = (the number of faults that can be detected by
TS ) / (the number of all faults)
For the sets of mutants generated using each of the four selection techniques, we
define mutation detection ratio of this set as follows:
KT = (the number of non-equivalent mutants that can be detected by
TS in mutant
set) / (the number of non-equivalent mutants in mutant set)
In order to increase the validity of our experiment, we repeated our experiment five
times for each
TS , and computed the average over them as the final result.
3 Results and Analysis
In this section, we firstly present our experiment results and the process of evaluating
the results, and answer the research question we proposed in Section 2.1. Then we
analyze the possible reasons that could explain the results of our experiment.
As previously mentioned in this paper, the more similar the ability of evaluating
test cases of a mutant set is to faults, the better this mutant set is. So we use the fol-
lowing two evaluation metrics to measure the similarity between mutant sets and
3.1 Evaluation by Rank
Table 2. Evaluation by Rank on tcas.
Selection Rate All Operator Clustering Random
25% 3 1 4 2
30% 3 1 4 2
35% 3 1 4 2
40% 3 1 4 2
45% 3 1 4 2
50% 2 1 4 3
55% 3 1 4 2
60% 3 1 4 2
65% 2 1 4 3
70% 3 1 4 2
75% 3 1 4 2
80% 2 1 4 3
85% 2 1 4 3
90% 2 1 4 3
95% 2 4 1 3
100% 1 1 1 1
Sum 40 19 58 37
We firstly calculated the numerical difference between
KF and
KT . Then we com-
pared the values of differences among all mutants and subsets of mutants. We as-
signed the smallest difference as 1, the second smallest difference as 2, etc. For each
technique, there will be 16 such values of differences. Table 2 shows those values on
tcas. The column indicates the four techniques, the row indicates selection rate x%
from 25% to 100% in steps of 5%, in a total of 16 rows.
We summed up these 16 values as the rank value. The lower the rank value is, the
better this technique is. Table 3 shows our experiment results. The column indicates
the four techniques, the row indicates different subject programs, and the number is
the rank value.
Table 3. Evaluation by Rank.
Programs All Operator Clustering Random
tcas 40
58 37
tot_info 38
45 47
schedule 40
52 37
schedule2 39 37
print_tokens 35 38 47
print_tokens2 44 36
replace 36
52 42
3.2 Evaluation by Distance
We calculated the Euclidean distance between the result sequences of fault detection
ratios (
KF to
KF ) and that of mutant detection ratios (
KT to
KT ) for each tech-
nique and each subject program. The lower the distance is, the better this technique is.
We summarized our results in Table 3. The column indicates the four techniques, the
row indicates different subject programs, and the number is the Euclidean distance.
Euclidean distance =
Table 4. Evaluation by Distance.
Programs All Operator Clustering Random
tcas 0.4825
0.5205 0.4694
tot_info 0.2229
0.2214 0.2232
schedule 0.9071
0.9163 0.9117
schedule2 0.3108
0.3237 0.3237
print_tokens 0.4129 0.4283 0.4435
print_tokens2 0.1826 0.1897 0.2000
replace 0.1608
0.1993 0.1751
3.3 Result Analysis
To sum up, there is always a subset of mutants which is better than all mutants. And
in none of the seven subject programs, all mutants can be the best. It also means that
the subset selected by three selection techniques, especially by Offutt et al. operators
has a better ability for evaluating test cases than all mutants. We provide the follow-
ing explanations for this phenomenon in our experiment:
Firstly, the faults in Siemens programs are more difficult to be detected than most
mutants, and this fact has been confirmed in [1]. The faults that can be detected by
350 or more test cases have been discarded from the subject programs. Therefore, the
remaining faults are difficult to be detected.
Secondly, the mutants in operator subset are much more difficult to be detected
than most of other mutants. Taking ABS for example, detecting the mutants generated
by ABS requires the test cases which select from different parts of the input domain
related to the mutated expression [12]. Therefore, only a small number of test cases
can detect these mutants.
Thirdly, because of the characteristics of clusters, the mutants in the same cluster
can be detected by the similar test cases. Therefore, if the test cases we selected can
detect one mutant, the mutants we selected from the same cluster may also be de-
tected by those test cases. Consequently, most mutants in clustering subsets can be
Finally, there are two kinds of mutation operators in proteumIM2.0, unit operators
and interface operators. The unit operators include modifications to operands, state-
ments and operators in expressions. The interface operators are related to modifica-
tions to methods and classes features. In our experiment, however, the mutants in
operator subset were all generated by unit operators, others subsets include mutants
which were generated by interface operators. We suppose that interface operator
mutants are easier to be detected than unit operator mutants. So for many programs in
our experiment, the test sets detected fewer mutants in operator subset than other
3.4 Threats to Validity
Threats to internal validity are uncontrolled factors that are also responsible for our
results. The main threat is that there are defects in the process of generating, selecting
and clustering mutants. To reduce this threat, we used ProteumIM2.0 to generate
mutants and Offutt subset, and used Weka to cluster mutants with simple K-means.
ProteumIM2.0 and Weka are tools that have been widely applied in previous research
Threats to external validity are the representativeness of our subject programs and
experiment procedures. The main threat is that the subject programs with their test
cases and faulty versions may not have generality. To reduce this threat, we chose
seven widely used Siemens programs, which contain plenty of test cases and different
numbers of faults.
Threats to construct validity are the measurements we used to evaluate our results.
To reduce the threat, we applied rank sum measurement and Euclidean distance mea-
surement, which are well-known metrics to measure the (dis)similarity between two
4 Conclusions and Future Work
In this paper, we conducted an experiment to investigate an important question in
mutation analysis. We compared the similarities between faults and all mutants, and
between faults and subsets of mutants. We confirmed that using a subset of mutants is
more appropriate than using all mutants when evaluating test cases.
In the future work, we plan to do the following research. Firstly, we will conduct
experiments on larger programs with more faulty versions. Secondly, we will apply
more selection techniques to extend our research. For example, in operator-based
selection, we will use Barbosa’s operators and Siami Namin’s operators, and use
other algorithms in clustering-based selection. Thirdly, we will do more research to
deeply explain the phenomenon in detail that using the subset of mutants is better
than using all mutants when evaluating test cases. Finally, we will figure out which
selection technique should we use when we want to select a subset of mutants from
all mutants. We propose to apply more programs with different sizes and structures to
answer this question.
The work described in this article was partially supported by the National Natural
Science Foundation of China (90818027, 60803007, 60803008, 61003024), the Fun-
damental Research Funds for the Central Universities.
1. Andrews, J. H., Briand, L. C., and Labiche, Y. (2005). Is mutation an appropriate tool for
testing experiments? ICSE 2005: 402-411.
2. Delamaro, M. E. and Maldonado, J. C. (1996) Proteum – A Tool for the Assessment of
Test Adequacy for C Programs – User’s guide.
3. DeMillo, R. A., Lipton, R. J., and Sayward, F. G. (1978). Hints on Test Data Selection:
Help for the Practicing Programmer. Computer 11(4): 34–41 (1978).
4. Do, H., Elbaum, S. G., Rothermel, G. (2005). Supporting Controlled Experimentation with
Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engi-
neering 10(4): 405-435 (2005).
5. Elbaum, S. G., Malishevsky, A. G., and Rothermel, G., (2000). Prioritizing test cases for
regression testing. ISSTA 2000: 102-112.
6. Hamlet, R. G., (1977). Testing Programs with the Aid of a Compiler. IEEE Transactions on
Software Engineering 3(4): 279-290 (1977).
7. Hussain, S., (2008). Mutation Clustering. Master’s Thesis, King’s College London, Strand,
London, 2008.
8. Mathur, A. P., (1991). Performance, Effectiveness, and Reliability Issues in Software Test-
ing. COMPSAC 1991: 604–605.
9. Offutt, A. J., Lee, A., Rothermel, G., Untch, R. H., and Zapf, C., (1996). An Experimental
Determination of Sufficient Mutant Operators. ACM Transactions on Software Engineering
and Methodology 5(2): 99-118 (1996).
10. Offutt, A. J. and Pan, J. (1977). Automatically Detecting Equivalent Mutants and Infeasible
Paths. Software Testing, Verification and Reliability 7(3): 165-192 (1997).
11. Rothermel, G., Harrold, M. J., Ostrin, J., and Hong, C., (1998). An Empirical Study of the
Effects of Minimization on the Fault Detection Capabilities of Test Suites. ICSM 1998: 34-
12. Wong, W. E. and Mathur, A. P., (1995). Reducing the cost of mutation testing: An empiri-
cal study. Journal of Systems and Software 31(3): 185-196 (1995).