Assessing Information Security Risks using Pairwise Weighting
Henrik Karlzén, Johan Bengtsson and Jonas Hallberg
Department for Information Security and IT Architecture, Swedish Defence Research Agency,
Olaus Magnus väg 42, Linköping, Sweden
{henrik.karlzen, johan.bengtsson, jonas.hallberg}@foi.se
Keywords: Risk Assessments, Pairwise Weighting, Information Security Risk, Cognitive Style, Cognitive Load.
Abstract: In practice, assessing information security risks is difficult since available methods lack specificity on how
to perform the assessments or what input should be used. Thus, the process becomes resource-demanding, with fairly large rater dependency. An established way of facilitating rating processes is to
weight objects against each other, rather than rating each object independently on an absolute scale. In this
paper, we investigate whether such a method, inspired by the Analytic Hierarchy Process, can be useful for
assessing information security risks. However, the new method did not result in higher inter-rater reliability
or lower cognitive load. This result held for both experts and non-experts, as well as among raters with different cognitive styles.
1 INTRODUCTION
A large number of methods have been proposed for
the assessment of information security risks
associated with threats. However, these methods do
not provide any substantial guidance on how to
perform the underlying assessments of probability
(likelihood) and severity (impact or consequence) or
a common description of what input should be used
during such assessments (Korman et al., 2014).
Unfortunately, this makes assessing probability and severity difficult in practice (Fenz et al., 2014), with high resource demands and often insufficient reliability among different raters, leading to rater-dependent assessments.
Although rater-independence does not indicate
assessments closer to the truth per se, the objective
truth typically remains elusive and inter-rater
reliability is a suitable surrogate indicator, since it is
necessary for validity (Gwet, 2014). For these
reasons, it would be useful to find a new way of
assessing risks that shows higher inter-rater
reliability and results in lower cognitive load,
without increasing the number of raters. A potential
method would be to compare the threats rather than
make an absolute assessment about each one, e.g.
similar to the weighting used in the Analytic
Hierarchy Process (AHP) (Saaty, 1990). This paper
investigates the possible advantages of this approach
over the more traditional ones where each threat is
assessed independently.
While improved inter-rater reliability is the main goal of applying a new method to risk assessments, it must be balanced against the resources required to use the method. In the field of cognitive load theory relating to learning, three types of cognitive processing are distinguished (Deleeuw and Mayer, 2008): intrinsic processing, which relates to the inherent complexity of the task; extraneous processing, which concerns the redundant information included and therefore the presentation; and germane processing, which relates to knowledge and learning. Hence, extraneous load
should be kept at a minimum whereas germane load
is encouraged in learning situations. However,
separating the two in measurements has proven
difficult. Measuring cognitive load is typically done
with self-ratings (Paas et al., 2003) where
participants put numerical values on their own
perceived mental burden.
The following hypotheses were tested:
H1. Inter-rater reliability is higher when rating
probability using pairwise weighting rather
than the traditional method.
H2. Inter-rater reliability is higher when rating
severity using pairwise weighting rather
than the traditional method.
H3. Cognitive load is lower when rating
probability and severity using pairwise
weighting rather than the traditional
method. Specifically:
i. Mental effort is lower when rating
probability and severity using
pairwise weighting rather than the
traditional method.
ii. Difficulty is lower when rating
probability and severity using
pairwise weighting rather than the
traditional method.
iii. Time consumption is lower when
rating probability and severity
using pairwise weighting rather
than the traditional method.
Section 2 of the paper describes the method.
Section 3 gives the results and section 4 discusses
these results.
2 METHOD
The sections below describe the participants, the
survey instrument, and the data collection procedure.
2.1 Participants
The survey was distributed to a strategic sample of 10 researchers active in the areas of information security, IT security, IT management or human factors; thus, not all respondents were experts on risk assessments or information security. All respondents
were from the Swedish Defence Research Agency,
possess university degrees, and work as researchers,
consultants or in management.
2.2 Material and Scales
The study was conducted using two paper-based questionnaires, each comprising three parts, which were filled out by each participant. The first part consisted of eight questions about the respondent and was identical between the two questionnaires, although the respondents only had to answer it on whichever questionnaire they filled out first. The second part consisted of 105 potential
incidents that the respondent was asked to assess
regarding both probability and severity using visual
analogue scales. One questionnaire applied the
traditional method, while the other questionnaire
instead used weighting. A third part, identical to
both questionnaires, concerned cognitive load of
filling out the questionnaires.
2.2.1 Incidents
The 105 potential incidents were reused from a
previous study which investigated whether people
truly tend to perceive risk as a multiplicative
function of probability and severity (Sommestad et
al., 2016). The incidents were intended to be
relatable for the participants and to cover the entire
risk matrix, although naturally with fewer incidents
with both high probability and severity. Some
examples include:
“A security flaw in the authentication tokens
allows a malicious outsider access to the
local network.”
“An employee installs freeware that covertly copies local and networked folders to a server controlled by a large defence corporation.”
“An employee gathers large amounts of
secret documents concerning IT-security
and hands them over to a foreign nation.”
2.2.2 Probability and Severity
The questionnaire for the traditional method asked respondents to rate the severity and probability of each incident on two separate lines. Ten guide markers per line were provided, but the exact indicated values were measured using a ruler. The severity scale stretched
from 0 (Minimal, no harm at all) to 10 (Greatest
harm among all listed incidents). The probability
scale stretched from 0% (Minimal, completely
unlikely for the next ten years) to 100% (Maximal,
guaranteed to happen).
Conversely, the second questionnaire compared
incidents with each other. The first incident was
pitted against the second, the second against the
third, the third against the fourth, and so on. For
each comparison, the respondents were asked how
the incidents compared in both probability and
severity separately. A scale was provided where
circling the suitable number on the left part indicated
that the first mentioned incident had greater
probability (or severity if that was measured), while
circling the middle 1 meant that the incidents were
equal in this regard, and finally circling the suitable
number on the right part of the scale indicated that
the second mentioned incident had greater
probability or severity. The scale ran from a factor of 9 on each side down to 1 in the middle, using every odd number: (9, 7, 5, 3, 1, 3, 5, 7, 9), which corresponds to the commonly used original scale in AHP (Ishizaka and Labib, 2011). The
numbers on the scale were also explained as: 1 –
equal, 3 – moderately more important, 5 – strongly
more important, 7 – very strongly more important,
and 9 – extremely more important. Since each
comparison introduced one new incident (except the
first one which introduced two), a total of 104
comparisons were needed for probability and
severity respectively in order to enable the
computation of weights for each of the 105 incidents
(Ishizaka and Lusti, 2004).
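As an illustration only, and not part of the original study materials, the following Python sketch shows one way such a chain of ratio judgements can be turned into normalised weights; the function name and the example judgements are hypothetical.

```python
# Illustrative sketch (hypothetical function and data, not from the study):
# turning a chain of pairwise ratio judgements into normalised weights.

def chain_to_weights(ratios):
    """ratios[i] is the judged ratio of incident i+1 to incident i
    (values > 1: the newer incident is more probable/severe;
    values < 1: the earlier incident dominates)."""
    values = [1.0]                      # express everything relative to the first incident
    for r in ratios:
        values.append(values[-1] * r)   # chain the ratios multiplicatively
    total = sum(values)
    return [v / total for v in values]  # normalise so the weights sum to 1

# Example: four incidents require three comparisons on the AHP-style scale,
# e.g. 3 = "moderately more", 1/5 = the earlier incident is strongly more.
print(chain_to_weights([3, 1 / 5, 7]))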
2.2.3 Cognitive Load
To measure the impact on the participants of performing the assessments, three aspects of cognitive load were gauged on each of the two questionnaires.
One question asked about the mental effort to fill
out the questionnaire (Likert scale 1–7 from
extremely low to extremely high effort). This is
similar to the effort scale in e.g. (Paas, 1992) and is
related to intrinsic cognitive load (Deleeuw and
Mayer, 2008). Another question concerned how
difficult it was to fill out the questionnaire (Likert
scale 1–7 from extremely simple to extremely
difficult), similar to e.g. (Marcus et al., 1996) and
relating to germane cognitive load, according to
(Deleeuw and Mayer, 2008).
As suggested in (Marcus et al., 1996), both self-reported rating scales were administered immediately after the main test, so as to ensure that the evaluations were fresh in memory. Both self-ratings have been shown to be reliable and sensitive, and not to impact performance (Paas et al., 1994).
Furthermore, the time it took each respondent to
fill out the questionnaire was (objectively)
measured. This factor is sometimes underestimated
(Paas et al., 2003) although it was measured in e.g.
(Fink and Neubauer, 2001). In (Deleeuw and Mayer,
2008) response time (to a concurrent secondary task)
related to the third cognitive load dimension of
extraneous load.
2.2.4 Cognitive Style and Expertise
Eight questionnaire items related to decision making
and were taken from (McShane, 2006). Four of these
measured rationality tendency with a focus on
objective information and logical analysis. The other
four instead measured the respondents’ propensity to
utilise intuition and instinct rather than rationality.
Three further items related to expertise
concerning information security, IT security, and
risk assessments.
2.3 Data Collection
A crossover study with counterbalancing was used.
To alleviate order effects, the respondents were
randomly divided into two equal groups. One group assessed the items using the traditional method in the first session and the new weighting method in the second session. The reverse order was applied
for the other group. General risk analysis learning
effects should be fairly equal between the two
methods, so an order effect is unlikely and indeed no
statistically significant order effect was found.
To make sure the respondents’ assessments on
the second questionnaire were not affected by the
first, there was a gap of at least one week between
questionnaires. Also, the respondents were
instructed not to keep notes and were not told the
specific aim of the study beyond the investigation of
risk perceptions.
2.4 Internal Validity Measurement
2.4.1 Probability and Severity
We constructed our own incidents, which may have
led to incidents that were difficult to interpret.
However, we are primarily interested in comparing
the reliability of two methods, which is fairly robust
against ambiguity of incidents, especially in view of
the fairly large number of incidents. As will be seen, the incidents were clearly understood well enough.
Also, for the traditional method, each respondent's answers showed correlations between probability and severity ranging from -0.710 to -0.219. A
negative correlation is natural since most incidents
have high values for at most one of probability and
severity, and is in line with e.g. -0.56 in (Weinstein,
2000).
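For reference, such a per-respondent correlation can be computed directly from the raw ratings; the following minimal Python sketch uses made-up numbers and is only an illustration.

```python
import numpy as np

# Hypothetical probability (0-1) and severity (0-10) ratings from one respondent.
probability = [0.90, 0.60, 0.30, 0.10, 0.05]
severity = [1.0, 2.5, 6.0, 8.0, 9.5]

# Pearson correlation between the two dimensions for this respondent;
# as in the study, a negative value is expected.
print(np.corrcoef(probability, severity)[0, 1])
```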
2.4.2 Cognitive Load
The three cognitive load items had a standardised Cronbach's alpha of 0.797 (95 % CI 0.573–0.913), showing acceptable reliability. That is, the items were internally consistent, indicating that they measure the same construct of cognitive load, so we do not distinguish between different types of cognitive load. The literature is not clear on this point: some studies support our single-construct view, while others report distinct types of cognitive load, e.g. (Deleeuw and Mayer, 2008). For instance, they report a statistically significant correlation of only 0.33 between effort and difficulty in one experiment, compared to our 0.595 (p < 0.01). This is reasonable, seeing as higher complexity demands more effort, although effort can
be high regardless, and increased effort cannot offset every increase in complexity.
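As a reference for how the internal consistency figures above can be obtained, the following Python sketch computes (non-standardised) Cronbach's alpha for a small, made-up data set; it is an illustration only and not the exact procedure used in this study, which in places used the standardised variant.

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array with respondents as rows and items as columns."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_variances = scores.var(axis=0, ddof=1)  # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical effort, difficulty and (rescaled) time scores for five respondents.
example = [[4, 5, 3],
           [2, 2, 2],
           [6, 5, 6],
           [3, 4, 3],
           [5, 6, 5]]
print(round(cronbach_alpha(example), 3))
```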
2.4.3 Cognitive Style and Expertise
The eight cognitive style items had a Cronbach's alpha of 0.913 (95 % CI 0.797–0.975), showing very high internal consistency, which is expected given the items' common basis.
The expertise items had a Cronbach's alpha of 0.756 (95 % CI 0.286–0.934), where removing question 9 (working with security or risk assessments) would produce a considerably higher alpha, suggesting, as expected, that this item reflects a separate construct from questions 10 and 11 (which are about working with security more generally). It should be noted that self-ratings of expertise can be ambiguous, since more knowledgeable people tend to be more humble about their abilities, e.g. in (Holm et al., 2014).
3 RESULTS
3.1 Inter-rater Reliability
3.1.1 Probability
Inter-respondent reliability for probability using the
traditional method had Cronbach’s alpha of 0.861
(95 % CI 0.817–0.897) with corrected item-total
correlations 0.358–0.714.
For the new method, each item asked for a rating pitting two incidents against each other and, to homogenise each rater's scale, ratings were transposed to express each incident in terms of the first incident, I12. Since the value each rater assigned to I12 was not gauged, different raters' transposed ratings may be anchored to different I12 values, resulting in a multiplicative effect between raters. To eliminate such a possible effect, each rater's ratings were independently standardised. Inter-rater reliability for probability using the new method had a standardised Cronbach's alpha of 0.805 (95 % CI 0.744–0.857) with corrected item-total correlations ranging from -0.111 to 0.804, where three of the raters were below 0.3. So, the traditional method
performed better in terms of error for probability, although for the raters as a whole the difference was only slight, with partially overlapping confidence intervals.
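To make the standardisation step concrete, the following sketch (hypothetical data, not the actual analysis script) transposes a rater's chained comparisons so that every incident is expressed relative to the first one, and then z-standardises the rater's values, removing any rater-specific multiplicative factor.

```python
import numpy as np

def standardised_profile(ratios):
    """ratios: chained comparisons, ratios[i] = incident i+1 relative to incident i.
    Returns per-incident values expressed relative to the first incident,
    z-standardised within the rater."""
    values = np.cumprod(np.concatenate(([1.0], np.asarray(ratios, dtype=float))))
    return (values - values.mean()) / values.std(ddof=1)

# Two hypothetical raters judging the same four incidents; any constant
# multiplicative factor between their scales cancels out after standardisation.
rater_a = standardised_profile([3, 1 / 5, 7])
rater_b = standardised_profile([5, 1 / 3, 9])
print(rater_a)
print(rater_b)
```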
3.1.2 Severity
Inter-respondent reliability for severity using the
traditional method had Cronbach’s alpha of 0.908
(95 % CI 0.880–0.932) with corrected item
correlations 0.491–0.818. Inter-rater reliability for
severity using the new method had standardised
Cronbach’s alpha of 0.415 (95 % CI 0.232–0.569)
with corrected item correlations 0.080–0.260. So,
the traditional method performed much better in
terms of error for severity, with the new method
having very low reliability.
3.2 Cognitive Load
3.2.1 Mental Effort
Self-reported mental effort was on average 1.3
points higher for the new method, which is a large
effect size (absolute value of Cohen’s d = 0.80, p =
0.022 < 0.05). So hypothesis H3.i was not supported.
3.2.2 Difficulty
Difficulty ratings were on average 0.6 points higher for the new method, equivalent to a small effect size (absolute value of Cohen’s d = 0.32), although this difference was not statistically significant (p = 0.382). Hence, hypothesis H3.ii was not supported.
3.2.3 Time
The questionnaire for the new method took on average 15.8 minutes longer than the traditional method, equivalent to a large effect size (absolute value of Cohen’s d = 0.94, p = 0.004 < 0.05). Hence, hypothesis H3.iii was not supported.
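For completeness, the within-subject effect sizes reported above correspond to a paired Cohen's d; a minimal sketch with made-up ratings (not the study data) is shown below. Note that other d variants exist, so the exact values in this paper may have been computed slightly differently.

```python
import numpy as np

def cohens_d_paired(first, second):
    """Paired Cohen's d: mean of the differences divided by their standard deviation."""
    diff = np.asarray(second, dtype=float) - np.asarray(first, dtype=float)
    return diff.mean() / diff.std(ddof=1)

# Hypothetical mental effort ratings (1-7) for ten respondents.
traditional = [3, 4, 2, 5, 3, 4, 3, 4, 2, 3]
new_method = [5, 5, 3, 6, 4, 6, 4, 5, 4, 4]
print(round(cohens_d_paired(traditional, new_method), 2))
```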
4 DISCUSSION
4.1 Hypotheses
While it is not possible to know whether the raters’
ratings reflect true probability and severity of the
incidents, the new method performed worse in
regard to all measured factors: the inter-rater
reliability for probability (overlapping CIs) and
severity, time, mental effort, and difficulty (although
not statistically significantly for the last factor).
Hence, none of the hypotheses were supported.
Furthermore, it is important to keep in mind that
measuring probability and severity is usually a
stepping stone to estimating risk. Since risk is the
product of probability and severity, overall
reliability will be at most as high as the lowest
reliability of the constituents, cf. (Krippendorff,
2004).
4.2 Increasing Reliability and the Role
of Cognitive Style and Expertise
To improve reliability, any differences between
raters, including cognitive style and expertise, must
be addressed.
Raters may let personal feelings and attitudes
towards the outcomes (severity) of the incidents play
a role, e.g. not caring that the organisation loses a
document since the rater does not really care about
the organisation, or being overly risk-averse and
easily scared. This amounts to a systematic
difference between raters. Cronbach’s alpha treats systematic inter-rater differences as irrelevant and is equivalent to the intra-class correlation coefficient (ICC) for consistency. Not ignoring systematic differences and
thus using ICC for absolute agreement, the
coefficients decrease by approximately 0.05 for each
of probability and severity using the traditional
method. This shows that systematic error is not a
large part of the reliabilities, but can nevertheless be
meaningful to target for an improving organisation.
It should be noted that standardising scores removes
systematic differences so no similar calculation can
be performed for the new method. An additive
difference between raters would however skew the
calculations for the new method, since each
transposed score in terms of the first incident would
be of the form:
(x_i + a) / (x_{i-1} + a)    (1)

rather than simply:

x_i / x_{i-1}    (2)
With a small overall systematic difference of about -0.05, this should however not be a major issue. Furthermore, in practice, starting questionnaires with a calibration of each rater’s responses would alleviate this.
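As a tiny numerical illustration (hypothetical values only) of how an additive rater offset distorts the chained ratio in Equation (1) relative to Equation (2):

```python
# Hypothetical "true" values for two consecutive incidents.
x_prev, x_curr = 0.20, 0.60          # true ratio x_curr / x_prev = 3.0

for a in (0.0, -0.05, 0.10):         # rater-specific additive offsets
    print(a, round((x_curr + a) / (x_prev + a), 2))
```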
Furthermore, raters may have different
knowledge about the incidents, e.g. what use an
attacker can make of a stolen document, or raters
may not be very used to rating information security
incidents, at least with the specific method. Raters
may also differ in how used they are to risk analysis
and logical thinking. Item 7 on cognitive style was correlated (0.733, p = 0.016) with probability for the new method (in terms of corrected item-total correlation), implying that raters with a more logical cognitive style showed higher inter-rater reliability with other raters. On the other hand, items 2 and 5 were marginally correlated (-0.627, p = 0.053 and -0.629, p = 0.051) with severity for the traditional method, implying that raters with a more logical cognitive style showed lower inter-rater reliability
with other raters. There were no other statistically
significant correlations concerning probability or
severity and cognitive style or expertise. All in all,
these results do not show any clear relationship
between cognitive style or expertise and inter-rater
reliability, for either method. This is not entirely
surprising since experts do not typically perform
better in areas where systems are dynamic and
behavioural, with limited outcome feedback, as is
the case in information security (Shanteau, 2015). It is also possible that the task is more closely related to other fields than to information security, such as business sense or systems engineering.
4.3 Possible Limitations
In contrast with e.g. the NASA Task Load Index (TLX) (Luca, 2014), we do not measure cognitive load in terms of physical effort, but physical effort should in any case be very low in our study. Likewise, we do not explicitly measure TLX’s frustration or the Subjective Workload Assessment Technique’s stress factor (Luca, 2014). However, the exact scale (or even whether it is unidimensional or multidimensional) and the use of verbal labels are not critical when measuring cognitive load (Sweller et al., 1998), and both frustration and stress would intuitively seem to map to effort ratings, so no further fine-grained measures are necessary here.
Another possibly limiting factor is the length of the questionnaires. Comparing the inter-rater reliability between the first 52 items and the remaining 52 or 53 items of each questionnaire showed practically no difference for the traditional method’s probability and only a small difference for the traditional method’s severity (alpha 0.929 for the first half compared to 0.881 for the second), indicating very little impact of questionnaire length. This is fortunate, since 105 incidents is likely not an unusually large number to rate in one sitting. For the new method, the inter-rater reliability for probability was much lower for the first half (0.303 vs. 0.862), while severity showed the reverse, with a higher first half (0.581 vs. 0.298). As the response to each item in the new method depends on all previous items, it is unsurprising that the new method produces large differences over time, and the improvement for the second half of probability is likely inflated because of this. In fact, closer examination shows that the split data no longer fit the necessary underlying
models for the new method.
5 CONCLUSIONS
All in all, information security risk assessments
using the method based on pairwise weighting tested
in this paper cannot be recommended. However,
before dismissing pairwise weighting altogether,
there are a few possible modifications to be
evaluated. First, the merits of the traditional AHP scale used for the comparisons should be compared to those of other scales, such as scales with fewer steps or with different sets of values assigned to the steps.
Secondly, alternative approaches to selecting the
pairs of threats to be compared should be tested.
Ideally, every pair of threats should be compared. However, such an approach would be highly cumbersome for the raters, since the number of necessary comparisons grows quadratically: each additional threat adds roughly as many new comparisons as there are threats already (see the sketch below). Conversely, the approach used in this study is based on the lowest possible number of comparisons, which, although less unwieldy, cannot easily account for inconsistencies in inter-respondent ratings. Redundancy in the comparisons could be used to decrease the problem of inconsistent weightings and provide more consistent results overall among the respondents. A probable improvement would be to
utilise a software tool to give raters a better
overview of the threats as a whole, while also
facilitating backtracking and further analysis.
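As an illustration of the two extremes (the minimal chain used here versus full pairwise comparison), the arithmetic for 105 incidents is:

```python
n = 105
full_pairwise = n * (n - 1) // 2   # every pair compared: 5460 comparisons per dimension
minimal_chain = n - 1              # the chained approach used in this study: 104
print(full_pairwise, minimal_chain)
```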
Consequently, there is room for more
experiments on using pairwise weighting for
information security risk assessments.
ACKNOWLEDGEMENTS
This work was conducted in the FOI research project
Assessment and Analysis of IT Systems, which is
funded by the R&D program of the Swedish Armed
Forces.
REFERENCES
Deleeuw, K. & Mayer, R., 2008. A Comparison of Three
Measures of Cognitive Load: Evidence for Separable
Measures of Intrinsic, Extraneous, and Germane Load.
Journal of Educational Psychology, Vol. 100, No. 1,
223–234.
Fenz, S., Heurix, J., Neubauer, T., & Pechstein, F., 2014.
Current challenges in information security risk
management. Information Management & Computer
Security, 22, 410–430.
Fink, A., & Neubauer, A., 2001. Speed of information
processing, psychometric intelligence, and time
estimation as an index of cognitive load. Personality
& Individual Differences, 30, 1009–1021.
Gwet, K. L., 2014. Handbook of Inter-Rater Reliability:
The Definitive Guide to Measuring The Extent of
Agreement Among Raters (4th ed.). Advanced
Analytics, LLC.
Holm, H., Sommestad T., Ekstedt M., & Honeth, N., 2014.
Indicators of expert judgement and their significance:
an empirical investigation in the area of cyber security.
Expert Systems. Volume 31, Issue 4, pages 299–318.
Ishizaka, A., & Labib, A., 2011. Review of the main
developments in the analytic hierarchy process. Expert
Systems with Applications, 38(11), 14336–14345.
Ishizaka, A., & Lusti, M., 2004. An expert module to
improve the consistency of AHP matrices.
International Transactions in Operational Research,
11(November), 97–105.
Korman, M., Sommestad, T., Hallberg, J., Bengtsson, J.,
& Ekstedt, M., 2014. Overview of Enterprise
Information Needs in Information Security Risk
Assessment. Proceedings of the 18th IEEE
International Enterprise Distributed Object Computing
Conference (EDOC). pp. 42-51.
Krippendorff, K., 2004. Reliability in content analysis:
Some common misconceptions and recommendations.
Human Communication Research. Vol. 30, pp. 411-
433.
Luca, L., 2014. Formalising Human Mental Workload as a
Defeasible Computational Concept. A Dissertation
submitted to the University of Dublin, Trinity College.
Marcus, N., Cooper, M., & Sweller, J., 1996.
Understanding Instructions. Journal of Educational
Psychology. Vol. 88, No. 1, 49-63.
McShane, S., 2006. Activity 8.8: Decision Making Style
Inventory. In Canadian Organizational Behaviour.
McGraw-Hill Education.
Paas, F., 1992. Training strategies for attaining transfer of
problem-solving skill in statistics: A cognitive-load
approach. Journal of Educational Psychology, 84,
429–434.
Paas, F., Tuovinen, J., Tabbers, H. & Van Gerven, P.,
2003. Cognitive Load Measurement as a Means to
Advance Cognitive Load Theory. Educational
Psychologist, 38(1), 63–71.
Paas, F., van Merriënboer, J., & Adam, J., 1994.
Measurement of cognitive load in instructional
research. Perceptual and Motor Skills, 79, 419–430.
Saaty, T. L., 1990. How to make a decision: The analytic
hierarchy process. European Journal of Operational
Research, 48(1), 9–26.
Shanteau, J., 2015. Why Task Domains (Still) Matter for
Understanding Expertise.
Journal of Applied Research
in Memory and Cognition, July 2015.
Sommestad, T., Karlzén, H., Nilsson, P., & Hallberg, J.,
2016. An empirical test of the perceived relationship
between risk and the constituents severity and
probability. Information & Computer Security.
Volume 24, Issue 2.
Sweller, J., van Merriënboer, J., & Paas, F., 1998.
Cognitive Architecture and Instructional Design.
Educational Psychology Review, Vol. 10, No. 3.
Weinstein, N., 2000. Perceived probability, perceived
severity, and health-protective behavior. Health
psychology: official journal of the Division of Health
Psychology, American Psychological Association,
19(1), pp.65–74.