To Inspect or to Test? What Approach Provides Better Results When It Comes to Usability and UX?

Walter T. Nakamura¹, Leonardo C. Marques¹, Bruna Ferreira², Simone D. J. Barbosa² and Tayana Conte¹
¹Institute of Computing (IComp), Federal University of Amazonas, UFAM, Manaus, Brazil
²Informatics Department, Pontifical Catholic University of Rio de Janeiro, PUC-Rio, Rio de Janeiro, Brazil
Keywords: Usability, User eXperience, Usability Inspection, Usability Test, Evaluation Methods.
Abstract: Companies are constantly striving to improve their products to satisfy customers. Evaluating the quality
of these products concerning usability and User eXperience (UX) has become essential for obtaining an
advantage over competing products. However, several evaluation methods exist, making it difficult to decide
which to choose. This paper presents a comparison between usability inspection and testing methods and a
UX evaluation. We investigated the extent to which each method allows identifying usability problems with
efficiency and effectiveness. We also investigated whether there is a difference in UX ratings between
inspectors and users. To do so, we evaluated a Web platform designed for a government traffic department.
Inspectors used TUXEL to evaluate the usability and UX of the platform, while usability testing moderators
employed Concurrent Think-Aloud and User Experience Questionnaire with users. The inspection method
outperformed usability testing regarding effectiveness and efficiency while addressing most major problems
that occurred in usability testing, even when considering only the results from novice inspectors. Finally, the UX evaluation revealed contrasting results: inspectors rated the platform as neutral, reflecting the problems they identified, whereas users rated it very positively, despite the problems they faced during the interaction.
1 INTRODUCTION
For many years, effective and efficient goal
achievement was the prime objective of Human-
Computer Interaction (HCI) (Hassenzahl, 2018),
making usability one of the main concerns when
designing a product. Although usability is necessary, “even the best usability may never be able to put a smile on users’ faces” (Hassenzahl et al., 2006), whereas User eXperience (UX), “when desirable, can do so” (Law et al., 2007). The concept of usability is narrower and task-oriented, focusing primarily on user cognition and performance (Law et al., 2009). By contrast, UX is more holistic, considering not only pragmatic (task-oriented) aspects but also subjective aspects, such as affect, sensations, emotions, and the value of the user’s interaction in everyday life, thus subsuming usability (Law et al., 2009). In
this context, practitioners and researchers from
academia have been looking for new approaches to
the design of interactive products, aiming to
accommodate not only product qualities but also
experiential qualities of technology use (Hassenzahl
et al., 2010). In a scenario of fierce competition,
understanding how technology can be used to
promote unique, satisfying, and enlightening
experiences seems to provide a competitive
advantage for business and industry (Alves et al.,
2014), leading practitioners and researchers to debate
on how to design products capable of providing
positive UX (Ardito et al., 2014). In this context,
usability and UX evaluation has become an important
activity to assess the quality of the products being
developed, aiming to identify improvement
opportunities and meet consumers’ expectations.
Despite the importance of usability and UX
evaluation and its increasing adoption in the industry,
many software development companies still neglect these two quality-in-use attributes for different reasons, such as the lack of suitable methods
(Ardito et al., 2014), resource demands (Alves et al.,
2014), and lack of trained personnel (Teka et al.,
2017). Moreover, the existence of different
evaluation methods might make it difficult for
practitioners to identify which are more efficient or
more adequate to a company’s needs (Nakamura et
al., 2019). As distinct methods allow identifying
different sets of problems (Law & Hvannberg, 2002;
Maguire & Isherwood, 2018) and require different
expertise, resources, and user availability,
comparative studies may help practitioners to identify
which method meets a company’s needs.
This paper presents a comparative study between
two of the most employed types of usability
evaluation methods: inspection and testing. We
carried out the study in a software development
company to evaluate a Web platform designed for a government traffic department serving a state with over 4 million inhabitants. Our goal is to
verify the extent to which each method allows
identifying usability problems with efficiency and
effectiveness while providing a good level of
coverage of the most severe problems. This type of research was carried out extensively in the 1990s. However, due to continuous changes in technology and interaction over time, further comparative studies are needed to investigate whether previous findings still apply (Maguire & Isherwood, 2018). Moreover, the shift to the experiential highlights the need for broader research that considers not only traditional usability but also investigates whether and how its results relate to UX. In this sense, we also
carried out a UX evaluation study with both
inspectors and users to get subjective feedback about
the experience conveyed by the platform. We aimed
to investigate whether there is a difference between
the inspectors’ and users’ perceptions. The results of
this study provide empirical evidence on the benefits
and drawbacks of the methods employed and their
cost-benefit assessment, helping practitioners to
select those that best meet their needs.
2 RELATED WORK
The comparison of evaluation methods is a long-standing concern, dating back to the 1990s, when researchers started investigating the cost-benefit ratio of the methods in an attempt to bring down the cost and time requirements of traditional usability testing (Hartson et al., 2003). In this section, we summarize some of these works.
Jeffries et al. (1991) compared four Usability
Evaluation Methods (UEMs): Heuristic Evaluation
(HE), Usability Testing (UT), Guidelines, and
Cognitive Walkthrough (CW). They evaluated the
methods through the number of problems found,
problem severity, and cost-benefit ratios (problems
found per person-hour). The results indicated that HE
produced the best results, finding more problems,
including more of the most serious ones, and at the
lowest cost. By contrast, it found a large number of
specific, one-time, and low-priority problems. UT
was second, finding recurring and general problems
while avoiding low-priority problems. However, it
was also the most expensive of the four methods.
Desurvire et al. (1992) compared three methods:
HE, CW, and UT. Rather than comparing the number
of problems found by each method as Jeffries et al.
(1991) did, they aimed to investigate whether HE and
CW find problems that users face in UT, according to
the evaluators’ level of expertise. The results
indicated that HE and CW found 44% and 28% of the
problems, respectively, when employed by experts.
By contrast, when employed by system designers and
non-experts, the percentage of problems found
dropped to 16% and 8%, respectively.
Although this is not a new topic, to this day,
researchers keep carrying out comparative studies to
evaluate new methods or to employ the existing ones
in different domains or types of products. As websites
and interaction continually change over time, it is
important to carry out further studies to verify
whether previous findings still apply (Maguire &
Isherwood, 2018). Hasan et al. (2012), for example,
evaluated the usability of three e-commerce Websites
by employing ordinary UT and a specific HE method
they developed for this context. To compare the
methods, the authors considered the number of
problems identified and their severity level. The
results indicated that HE found a great number of
problems, most of them minor ones. By contrast, UT
found fewer problems, but more major ones.
More recently, Maguire and Isherwood (2018)
compared two UEMs: UT and HE. The HE group
comprised 16 participants with experience in
usability evaluation, acting as expert inspectors,
while 16 regular computer users without usability
knowledge acted as users in usability testing. They
compared both methods regarding effectiveness and
efficiency by using four metrics: number of problems
identified, problem severity, type of problem
according to Nielsen’s ten heuristics (1994), and time
spent to find these problems. Overall, HE was more
effective, finding almost five times more individual
problems than UT. By contrast, UT identified slightly
more severe problems and required less time to
complete than HE, excluding the analysis time.
Although recent studies comparing inspection and
testing methods do exist, most of them do not use a
standardized set of usability metrics for analyzing the
data as proposed by Hartson et al. (2003), making it
difficult to compare the results from previous studies
directly. Moreover, they have only compared the
results based on the overall number of problems, not
measuring the effectiveness of HE in identifying
problems found during actual user interaction.
Finally, these studies have not evaluated the UX to
complement the findings from usability evaluations
and provide a broader view of the product evaluated.
In this paper, we compared two UEMs (inspection
and testing) in a software development company by
evaluating a Web platform designed for a government
traffic department. To have a more holistic view of
the methods evaluated, we employed both novice and expert inspectors and used metrics such as
effectiveness and efficiency to compare them. We
also calculated three standard usability metrics
proposed by Hartson et al. (2003) and used by
Hvannberg et al. (2007) to evaluate the extent to
which an inspection method predicts problems that
actual users face during UT: thoroughness, validity,
and effectiveness. Finally, we carried out a UX
evaluation to obtain subjective data about the
platform under evaluation and to investigate whether
there is a difference between the inspectors’ and
users’ perceptions of their experiences.
3 METHODOLOGY
3.1 Participants and Materials
We evaluated a Web platform under development by
a software development company for a government
traffic department. It offers functionalities such as
service scheduling and information about driver’s
licenses and vehicle fines. The stakeholders aimed to
evaluate the usability of this platform before its public
release to deliver a high-quality product for the target
audience. The study involved 20 participants, 10 for
each evaluation method. According to Hwang and Salvendy (2010), a sample following the general 10±2 rule of thumb for usability evaluations can detect about 80% of usability problems.
The inspection group comprised 10 Computer
Science students (six men and four women between
20 and 38 years old) from the Federal University of
Amazonas (UFAM), all licensed drivers. Five
inspectors had low experience with usability
evaluations, i.e., they had learned about it in the
classroom and did some exercises, which makes them
comparable to typical novice practitioners
(Fernandez et al., 2013). The other five had high
experience, i.e., they had already carried out this type
of evaluation at least once in the industry in the last
six months. All inspectors used Web platforms
frequently, but they did not know the application
domain, nor the platform under development.
Ten company employees participated in UTs as
users (four men and six women, between 25 and 52
years old), all licensed drivers and from different
departments unrelated to software development. We
chose company employees to avoid confidentiality issues, as this is a common practice among professionals in usability studies and was required by the stakeholders. We selected participants without much experience with technology to help identify the most common problems that end users may face while using the
platform. Two participants had very low experience
with computers, i.e., they knew how to use the
computer but rarely used it. Seven participants had
low experience with computers, i.e., they knew how
to use the computer and used it occasionally. One
participant had medium experience with computers,
i.e., they knew how to use the computer and used it
regularly. None of them knew about the development
of the platform, nor had used it before.
We used the following materials in this study: (i)
an informed consent form, explaining the study and
the subjects’ voluntariness and confidentiality of their
identities; (ii) a characterization questionnaire; (iii) a
script with the set of tasks; (iv) a screen capture tool (https://www.atube.me) for recording participants’ interactions; and (v)
computers and notebooks.
3.2 Evaluation Methods
For inspection, we employed a method developed by
one of the authors of this paper, called TUXEL
(Technique for User eXperience Evaluation in e-
Learning). Originally designed to evaluate e-learning
platforms, it comprises three main dimensions:
general usability, pedagogical usability, and UX.
Previous studies indicated that TUXEL identifies
more problems in less time than an adapted HE based
on Nielsen’s ten traditional heuristics with additional
criteria for evaluating didactic effectiveness
(Nakamura et al., 2018). We aimed to investigate
whether TUXEL can be applied to evaluate other
types of software products and how well it performs
in comparison to other general evaluation methods.
Given that the evaluated platform is not for learning
purposes, we removed the pedagogical usability
dimension, as it is specific to e-learning
aspects, such as collaborative learning and
instructional assessment.
TUXEL employs a guided inspection approach so
that either experts or non-experts can apply it. It
provides a set of items similar to heuristics, but at a
fine-grained level, in addition to tips that guide the
inspector through examples or actions that they
should perform to identify the problem. TUXEL also provides a tool to facilitate both the evaluation and the analysis processes, especially the consolidation of usability defects. According to Hornbæk (2010),
matching similar descriptions from different
inspectors is not straightforward, given that usability
reports usually contain brief and context-free
descriptions. As a result, researchers can err when
extracting or merging actual discrepancies to produce
a single set of problems, corrupting problem counts
and biasing the study (Cockton et al., 2004). The
TUXEL tool (a Google Chrome extension) minimizes
this issue through its screenshot and markup feature.
By visualizing the screenshot tagged with the selected
item, together with the description provided by the
inspector, the researcher can easily identify where
and what the problem reported is.
First, the inspector performs the tasks while
evaluating the usability of the platform by checking
the items from TUXEL and selecting an adequate one
according to the problem identified. Next, the
inspector marks the area where the problem occurs
and provides additional information about it. The tool
then captures a screenshot with the selected area and
the item identifier associated with it by TUXEL.
Then, the inspector evaluates the overall usability of
the platform through a checklist comprising items
related to ease of use and help and documentation. In
this step, the inspector can provide details about the
items they checked. Finally, the inspector fills in a UX questionnaire comprising 7-point semantic differential scales using adjectives extracted from the User Experience Questionnaire (UEQ) (Laugwitz et al., 2008) to evaluate six UX dimensions: Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation, and Novelty. The inspector evaluates their experience with the platform by marking the point closest to the adjective that best describes the UX it conveys. The
questionnaire also has two open-ended questions
where the inspector can make criticisms based on
their ratings and provide improvement suggestions.
Finally, the tool generates a report with the inspection
time, the problems reported with their corresponding
items, and the URL where each problem occurred.
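To illustrate how such semantic differential ratings can be aggregated per dimension, the following is a minimal sketch; the item-to-dimension mapping and the example answers are illustrative placeholders, not the official UEQ item set:

```python
# Sketch: aggregating 7-point semantic differential ratings per UX dimension.
# Raw answers on a 1..7 scale are recentred to -3..+3 and averaged per dimension.
# The item-to-dimension mapping below is illustrative, not the official UEQ key.
from statistics import mean

ITEM_TO_DIMENSION = {  # hypothetical mapping for illustration
    "annoying/enjoyable": "Attractiveness",
    "not understandable/understandable": "Perspicuity",
    "slow/fast": "Efficiency",
    "unpredictable/predictable": "Dependability",
    "boring/exciting": "Stimulation",
    "conventional/inventive": "Novelty",
}

def dimension_scores(raw_answers: dict[str, int]) -> dict[str, float]:
    """raw_answers maps item name -> rating on the 1..7 scale."""
    per_dimension: dict[str, list[int]] = {}
    for item, rating in raw_answers.items():
        recentred = rating - 4  # 1..7 -> -3..+3
        per_dimension.setdefault(ITEM_TO_DIMENSION[item], []).append(recentred)
    return {dim: mean(values) for dim, values in per_dimension.items()}

# Example: one hypothetical inspector's answers
print(dimension_scores({
    "annoying/enjoyable": 5,
    "not understandable/understandable": 3,
    "slow/fast": 4,
    "unpredictable/predictable": 4,
    "boring/exciting": 2,
    "conventional/inventive": 6,
}))
```

The -3 to +3 range produced by this recentring matches the scale used later when reporting the UX results.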
For UT, we looked for methods that: (i) are easy to apply; (ii) do not require additional equipment (e.g., eye-tracking devices); (iii) are not very time-consuming; (iv) require no more than one observer per participant; and (v) provide real-time information without obstructing the participant’s interaction with the platform. Considering these
criteria, we selected Concurrent Think-Aloud (CTA).
According to Alhadreti & Mayhew (2018), CTA is
one of the most widely used UT methods and allows
the detection of a high number of problems with less
time than its retrospective and hybrid versions. CTA
is a variation of the Think-Aloud method that
provides “real-time” information during the
participant’s interaction with a system (Alhadreti &
Mayhew, 2018). The participant performs tasks as
they verbalize their thoughts while being observed by a moderator who takes notes about their interaction in a problem reporting form. The moderator can identify
the problems through three approaches (Van den
Haak et al., 2004): i) observation (i.e., from observed
evidence without verbal data); ii) verbalization (i.e.,
from verbal data without accompanying behavioral
evidence); and iii) a combination of observation and
verbalization. We also considered using
Retrospective Think-Aloud (RTA) in order to not
interfere with the participant’s thought process.
However, given that RTA requires double the time of
CTA, and that CTA outperformed both RTA and the
Hybrid Method (HB) (Alhadreti & Mayhew, 2018),
we decided to use CTA. Finally, given that CTA does
not evaluate UX specifically, we looked for a method
that was fast, easy, and low cost. As the UX
dimension of TUXEL is derived from UEQ
(Laugwitz et al., 2008), we decided to use UEQ to
make a fair comparison.
3.3 Empirical Procedures
The experiment comprised two sessions, each on a different day. Each participant took part in only one session. The first session involved the inspection
group and was conducted by two researchers in a
laboratory at UFAM. Before the evaluation, we asked
the participants to review and sign a consent form,
explaining the importance of the study and the
confidentiality of their personal information. Next,
we introduced the participants to TUXEL, explaining
its purpose and how to use and report problems with
it, without giving much detail to avoid bias. We also
explained the purpose of the target platform and
provided the script with the set of tasks to be
performed during the inspection process (see Table
1). Each participant inspected individually, and the whole interaction process was recorded for further analyses. Given that it would be important to identify every
problem in the platform, we instructed the participants to record, in a notepad, any problem that did not match a TUXEL item.
Table 1: Description of the functionalities of the platform.

Functionality                    Description
Registration                     It allows users to create an account to manage information regarding their vehicles and driver’s license.
Scheduling                       It allows scheduling a service related to vehicles or drivers’ licenses.
Driver’s license consultation    It allows users to check their driver’s license status and infringements.
Vehicle’s consultation           It allows users to consult the vehicle’s information, fines, and status.
The second session involved the UT participants
and was conducted by three researchers who acted as
moderators in a computer lab at the software
development company. Each researcher carried out
the tests with one participant at a time, and we
recorded all the interaction process for further
analyses. Initially, we introduced ourselves to the participants and explained the concept of usability and the importance of the study. Then, we started the
testing process. First, we introduced the platform to
the participants, explaining its purpose. Next, we
provided the script with the set of tasks and asked
them to perform one task at a time, in order. We also
asked the participants to verbalize their thoughts and
feelings during the accomplishment of the tasks. We
took notes in the problem reporting form, describing
the problem faced by the participant and registering
the start and end time of each task. When a participant
was not able to accomplish a task after many
attempts, we instructed them to skip to the next task.
After performing the tasks, we provided them with the UEQ to evaluate the UX conveyed by the platform, explaining its purpose and how to fill it in.
3.4 Consolidation and Extraction of
Usability Problems
We divided the extraction process among three
researchers. First, we created a spreadsheet in Google
Sheets to facilitate the process. The spreadsheet was an N × M matrix, where the N rows held the descriptions of the discrepancies extracted from the participants and the M columns corresponded to the participant IDs. A discrepancy is any description of a potential problem provided by a participant that has not yet been validated. Each researcher
filled the spreadsheet by including the description of
the discrepancy and assigning it to the ID of the
participant from whom it was obtained. Before including a new discrepancy, the researchers read the previous ones to verify whether it had already been reported by another researcher. After including all the discrepancies in the spreadsheet, we assigned a unique ID to each of them. Similar discrepancies
were merged into a single one, with a clear and
complete description. Discrepancies that addressed
more than one potential problem were split into
different discrepancies. This process was carried out
by one researcher and reviewed by the other two
researchers. After consolidating the discrepancies, we
analyzed each one and discussed whether it was a problem, a false positive (i.e., it did not represent a real problem), or a suggestion (i.e., it did not describe a problem, but a participant’s opinion).
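A minimal sketch of this consolidation structure as a data type is shown below; the class name, helper function, and example entry are hypothetical and only illustrate the idea of one row per discrepancy and one column (here, a set) of reporting participants:

```python
# Sketch: discrepancy consolidation as an N x M structure
# (rows = discrepancy descriptions, columns = participant IDs).
# Names and the example entry below are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass
class Discrepancy:
    uid: str                              # unique ID assigned after collection
    description: str                      # merged, clear, and complete description
    reported_by: set[str] = field(default_factory=set)   # participant IDs (the M axis)
    classification: str = "unclassified"  # later: "problem", "false positive", or "suggestion"

# N rows of the spreadsheet, keyed by unique ID
sheet: dict[str, Discrepancy] = {}

def add_report(uid: str, description: str, participant_id: str) -> None:
    """Record a report, merging it with an existing discrepancy when the ID matches."""
    entry = sheet.setdefault(uid, Discrepancy(uid, description))
    entry.reported_by.add(participant_id)

add_report("D001", "Registration option only visible after clicking the login button", "U1")
add_report("D001", "Registration option only visible after clicking the login button", "I3")
sheet["D001"].classification = "problem"
```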
We set up a presentation with all usability
problems identified and presented them to the
stakeholders and to the development team, which
comprised three team leaders (software architecture,
software quality, and Web design), a designer, two
programmers, a web designer, and two analysts. We
ensured that all information that could lead to the identification of the participants was removed from the presentation. We asked the development team to rate each problem according to its level of severity, as follows (Nielsen, 1994): 1) Cosmetic: need not be fixed unless there is extra time available; 2) Minor: fixing
this should be given low priority; 3) Major:
important to fix, should be given high priority; 4)
Catastrophic: imperative to fix this before product
can be released.
4 RESULTS
For comparing the methods quantitatively, we
calculated effectiveness, efficiency, thoroughness,
and validity. We defined effectiveness as the ratio
between the number of problems identified by the participant/inspector and the total number of problems identified in the study. With regard to
efficiency, ISO 9241-11 defines it as “resources used
in relation to the results achieved”, which includes
time, human effort, costs, and materials (International
Organization for Standardization, 2018). Given that
usability inspection requires only one person (the
inspector), while usability testing requires at least two
persons (the participant and the moderator), we
calculated the cost-efficiency using the formula Effic_i = P_i / (time_i × n), where P_i and time_i refer to the total number of problems found by participant i and the time they spent in the evaluation, respectively, and n is the number of people required to perform the
evaluation (n=1 for inspection and n=2 for testing).
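As a concrete check, the following minimal sketch computes both metrics for single participants; the efficiency values reproduce those in Table 2 when the recorded time (minutes) is converted to person-hours, so efficiency is expressed here as problems per person-hour:

```python
# Sketch: effectiveness and cost-efficiency per participant, as defined above.
# Efficiency is expressed in problems per person-hour, which reproduces the
# Table 2 values when the recorded time (minutes) is converted to hours.

def effectiveness(problems_found: int, total_problems: int) -> float:
    """Share of all problems identified in the study, as a percentage."""
    return 100.0 * problems_found / total_problems

def efficiency(problems_found: int, time_minutes: float, people: int) -> float:
    """Problems found per person-hour (n = 1 for inspection, n = 2 for testing)."""
    return problems_found / ((time_minutes / 60.0) * people)

# Inspector I1 (Table 2): 19 problems in 108 min, n = 1 -> ~10.6 problems/person-hour
print(round(efficiency(19, 108, 1), 1))
# User U3 (Table 2): 13 problems in 69 min, n = 2 (participant + moderator) -> ~5.7
print(round(efficiency(13, 69, 2), 1))
```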
To investigate the extent to which TUXEL predicts
problems that actual users face during usability
testing, we calculated two standard usability metrics
proposed by Hartson et al. (2003) – thoroughness and validity – as follows:

Thoroughness = hits / (hits + misses)

Validity = hits / (hits + false alarms)
Hits are the number of problems found in both
inspection and testing. Misses refers to the number of
problems that were found in testing but not during
inspection. Finally, False Alarms are the number of
problems identified in the inspection but not
confirmed during UT.
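For illustration, a minimal sketch computing these two metrics from the counts later reported in Section 4.6 (21 hits, 41 problems observed in UT, and 106 problems reported by the inspection):

```python
# Sketch: thoroughness and validity as defined above (Hartson et al., 2003).
def thoroughness(hits: int, misses: int) -> float:
    """Share of the problems observed in testing that the inspection also found."""
    return hits / (hits + misses)

def validity(hits: int, false_alarms: int) -> float:
    """Share of the problems reported in the inspection that testing confirmed."""
    return hits / (hits + false_alarms)

# Counts reported in Section 4.6.
hits = 21
misses = 41 - hits          # problems seen in UT but missed by the inspection
false_alarms = 106 - hits   # inspection findings not confirmed in UT
print(f"Thoroughness = {thoroughness(hits, misses):.1%}")   # ~51.2%
print(f"Validity = {validity(hits, false_alarms):.1%}")     # ~19.8%
```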
We formulated the following hypotheses (null and
alternative, respectively):
H1: There is no difference in effectiveness between inspection and testing.
HA1: The effectiveness of inspection is greater than that of testing.
H2: There is no difference in efficiency between inspection and testing.
HA2: The efficiency of inspection is greater than that of testing.
We also compared the number of major and
catastrophic problems identified per method, given
that methods that address a higher number of these
problems may be more useful than those that identify
only minor ones (Hartson et al., 2003). Additionally,
we calculated the number of problems identified by inspectors according to their level of knowledge in usability evaluation to assess whether novices can employ TUXEL without losing effectiveness. Thus,
we formulated the following hypotheses:
H3: There is no difference between the number of major/catastrophic problems identified by inspection and testing.
HA3: Inspection identifies more major/catastrophic problems than testing.
H4: There is no difference in effectiveness in the detection of major/catastrophic problems between novice and expert inspectors.
HA4: The effectiveness of expert inspectors in the detection of major/catastrophic problems is greater than that of novice inspectors.
H5: There is no difference in efficiency in identifying major/catastrophic problems between novice and expert inspectors.
HA5: The efficiency of expert inspectors in identifying major/catastrophic problems is greater than that of novice inspectors.
We selected these metrics because they reflect
aspects that companies with budget and time
constraints may consider when choosing a method.
According to Ardito et al. (2014), practitioners state
that usability/UX evaluation requires several
resources in terms of cost, time, and people involved.
In this sense, it is important that the selected method: i) addresses as many problems as possible (effectiveness) in as little time as possible (efficiency); ii) does not require experts to be employed, helping to reduce costs; and iii) addresses most of the high-priority problems.
To test the hypotheses, we performed statistical
analyses by using IBM SPSS v25 to verify whether
there was a significant difference between the results
of each method per evaluated metric. Before running
each statistical test, we needed to know how the data
were distributed, given that different experiment
designs and data distribution require different
statistical tests (Wohlin et al., 2012). To do so, we
performed a Shapiro-Wilk normality test (Shapiro &
Francia,
1972).

Table 2: Raw data from usability evaluation.

Usability Inspection
Participant                        I1    I2    I3    I4    I5    I7    I8    I9   I10
Discrepancies                      21    20    27    26    16    37    28    23    19
False Positives                     2     2     2     0     1     3     1     0     3
Total Problems                     19    18    25    26    15    34    27    23    16
Time (min)                        108    94   101    81    98   111    96   114    92
Effectiveness (%)                15.0  14.2  19.7  20.5  11.8  26.8  21.3  18.1  12.6
Efficiency (problems/person-hour) 10.6  11.5  14.9  19.3   9.2  18.4  16.9  12.1  10.4

Usability Testing
Participant                        U1    U2    U3    U4    U5    U6    U7    U8    U9   U10
Total Problems                     11     8    13     5     8     7    14     7     9     7
Time (min)                         33    25    69    15    30    20    29    27    57    39
Effectiveness (%)                 8.7   6.3  10.2   3.9   6.3   5.5  11.0   5.5   7.1   5.5
Efficiency (problems/person-hour) 10.0   9.6   5.7  10.0   8.0  10.5  14.5   7.8   4.7   5.4

If the p-value was >= 0.05 (i.e., the data follow a normal distribution) in both groups for a given metric, we applied Student’s t-test. By contrast, if the p-value was < 0.05 (i.e., the data do not follow a normal distribution) in at least one group for that metric, we applied the non-parametric Mann-Whitney test.
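The test-selection procedure above can be expressed compactly. The sketch below uses SciPy purely for illustration; the original analysis was performed in IBM SPSS, so the exact options (e.g., equal-variance assumptions) may differ:

```python
# Sketch: choosing between Student's t-test and Mann-Whitney based on normality,
# mirroring the procedure described above (illustrative; the study used IBM SPSS).
from scipy import stats

def compare_groups(inspection: list[float], testing: list[float], alpha: float = 0.05):
    """Return the test name and p-value for a one-sided 'inspection > testing' comparison."""
    normal = (stats.shapiro(inspection).pvalue >= alpha and
              stats.shapiro(testing).pvalue >= alpha)
    if normal:
        result = stats.ttest_ind(inspection, testing, alternative="greater")
        return "Student's t-test", result.pvalue
    result = stats.mannwhitneyu(inspection, testing, alternative="greater")
    return "Mann-Whitney U", result.pvalue

# Effectiveness values (%) taken from Table 2
inspection_eff = [15.0, 14.2, 19.7, 20.5, 11.8, 26.8, 21.3, 18.1, 12.6]
testing_eff = [8.7, 6.3, 10.2, 3.9, 6.3, 5.5, 11.0, 5.5, 7.1, 5.5]
print(compare_groups(inspection_eff, testing_eff))
```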
Finally, regarding UX evaluation, we compared
the outcomes between inspectors and users. We
aimed to investigate whether there is a difference
between the perceptions of inspectors and users about
the UX conveyed by the platform.
4.1 Usability Problems Overview
A total of 157 unique discrepancies were identified.
Among them, we classified 5 as suggestions and 9 as
not applicable (i.e., aspects related to features that
were not implemented in the platform yet, such as
links for functionalities under development). After
removing these discrepancies, 126 were identified as problems and 17 as false positives. Table 2 presents an overview of the discrepancies per participant, per
group. It is worth mentioning that inspector I6 performed the inspection over two days due to time constraints. As this can affect the results, we removed
the data from this participant from both usability and
UX evaluations.
Regarding usability problems, the registration task had the highest number of issues identified: 9 out of 10 usability testing participants had difficulty finding the registration
option, which was only visible when clicking on the
login button. Among them, four participants were not
even able to complete this task. This issue was also
reported by 6 out of 9 inspectors.
Participants from both groups had difficulty in
defining the password, as it required a combination of
numbers, letters, and one capital letter. Moreover, this requirement was only communicated through a warning message that appeared when trying to submit the registration form. This message also appeared at the bottom of the page for only a few seconds, making it difficult for the participants to read it entirely. Overall, the registration task also demanded considerable time (9 minutes on average).
4.2 Effectiveness and Efficiency
The analysis indicated that the effectiveness and efficiency of the inspection group (18.6% and 13.7 problems per person-hour) were, on average, higher than those of the usability testing group (7.4% and 8.6 problems per person-hour), indicating that the former allows identifying a higher number of usability problems in less time. With regard to these metrics, it is important to highlight some issues. The time recorded in the usability inspection included the time spent by inspectors during the UX evaluation step, given that it is part of TUXEL. For usability testing, we only recorded the time spent during the execution of the tasks. By contrast, the dual task of thinking aloud while working may have interfered with the accuracy of the time-on-task metric.
The normality test showed that the data were normally distributed for effectiveness and efficiency in both groups, so we performed Student’s t-test. The results showed that the inspection was significantly more effective (t(11.096) = 6.089, p < .001) and more efficient (t(17) = 3.294, p = .004) than the testing, thus rejecting both the H1 and H2 null hypotheses.
4.3 Problems by Severity
The analysis of the severity of the problems identified
per evaluation method showed that the inspection
group identified a greater number of cosmetic and
minor problems in comparison to the UT group (see
Figure 1). Additionally, they identified most of the
problems pointed out by the participants of the UT
group, while addressing a higher number of unique
major problems. None of the groups pointed out
catastrophic problems in the platform. The t-test
revealed that TUXEL identified significantly more major problems than CTA (t(17) = 3.349, p = .004), thus rejecting the H3 null hypothesis.
Figure 1: Problems identified by the level of severity.
4.4 Problems by Evaluator Experience
in Usability Evaluations
Usability inspection highly depends on the
inspectors’ expertise to identify usability problems
(Følstad et al., 2012; Hornbæk, 2010). As employing
expert evaluators to perform an inspection may be
costly, it is important to verify how well novice
inspectors perform in comparison to expert ones.
Figure 2(a) presents the average number of
problems grouped by inspectors’ expertise in
usability evaluation. The results indicated that
inspectors with low experience tended to identify
more major issues than those with a high level of
experience. By contrast, the former were not as effective in identifying minor and cosmetic problems.
Figure 2: (a) Average number of problems and (b) effectiveness and efficiency by evaluators’ level of experience in usability evaluation.
We also calculated the effectiveness and efficiency of novices and experts (see Figure 2(b)). The results showed that experts were more efficient than novices. The t-test, however, indicated that the differences were not significant for either effectiveness (t(7) = -1.271, p = .244) or efficiency (t(7) = -1.219, p = .262), thus not rejecting the H4 and H5 null hypotheses.
4.5 Usability Problems Coverage
As stated before, a method that identifies a high
percentage of major problems may have more utility
than those that identify a larger number of minor ones
(Hartson et al., 2003). However, given that two or
more participants can report the same problem, it is
also important to analyze the level of coverage per
evaluation method and per level of experience in
usability evaluations, rather than just verifying the
average number of major problems identified. This highlights how broad each method’s coverage is, i.e., how many unique problems each method allowed us to identify.
Figure 3: Level of coverage of major usability problems per
method and experience in usability evaluation.
First, we calculated the ratio between the number
of major problems identified by each evaluation
method and all major problems identified in the study,
grouping the results according to the level of
experience in usability evaluation (see Figure 3). The
results showed that both novice and expert inspectors
outperformed UT. Novice inspectors identified 18 out
of the 26 major problems (69.2%). In contrast, UT
identified only half of all major problems.
4.6 Thoroughness, Validity, and
Effectiveness
Ideally, an inspection method should identify as many as possible of the problems that occur during actual user interaction. Thus, we
calculated the thoroughness, validity, and
effectiveness as proposed by Hartson et al. (2003).
TUXEL identified a total of 21 out of 41 problems
that occurred during UT, which gives a thoroughness
of 51.2%. This value is greater than those obtained by traditional HE in previous works, such as those by Hvannberg et al. (2007) and Desurvire et al. (1992), which achieved 36% and 44% thoroughness, respectively. Regarding validity, TUXEL identified
106 problems. However, only 21 were confirmed in
UT, yielding a validity of 19.8%.
Among the 13 major problems that occurred in
usability testing, 9 (69.2%) were predicted by novice
inspectors and 7 (53.8%) by experts. All the 7
problems identified by experts were also identified by
novice inspectors.
4.7 UX Evaluation
The results from the UX evaluation revealed a
different perspective of the experience between the
participants who acted as users in UT and inspectors
(see Figure 4). The bars represent the mean for each
dimension evaluated by the participants. The ratings
range from -3 to 3, where values greater than or equal
to 1 indicate a positive perception about the UX of the
platform, while values less than or equal to -1 indicate
a negative perception. Finally, values between -1 and
1 indicate a neutral perception.
The results indicated that, for the participants who
acted as users in UT, despite the usability problems
they faced during the test, the UX conveyed by the
platform was positive, as the average rating for each dimension ranged from approximately 1 to 2 (Figure 4a). On the other hand, the results from the inspection
group revealed a quite different perspective on the
UX (Figure 4b). The results indicated that inspectors
tended to be more consistent about the UX conveyed
by the platform, as the ratings reflected the problems
they identified during the evaluation. The mean for
each dimension ranged from -1 to 1, indicating a
neutral perception of the experience. The t-test analysis revealed that inspectors evaluated the UX significantly lower than users in all UX dimensions (ATTractiveness: t(10.013) = -3.802, p = .003; PERSPicuity: t(11.624) = -3.303, p = .007; EFFiciency: t(16) = -2.616, p = .019; DEPendability:
t(12.170) = -3.561, p = .004; STIMulation: t(16) = -3.653, p = .002), except for NOVelty
(t(16) = -1.981, p = .065). It is worth mentioning that
one participant from UT had to leave the experiment
before evaluating the UX.
Figure 4: Results of each dimension evaluated by the users from usability testing (a) and inspectors (b).
We also investigated the correlation between time
spent, number of problems identified, and UX
dimensions. Since the analysis involves ordinal and
interval scale types, we calculated, for each group, Spearman’s rho correlation coefficient (Mukaka, 2012). We did not find any significant correlation between these variables except for the Stimulation dimension, which had a high negative correlation with the number of problems for the inspection group (r = -.724, p = .028). This indicates that the more problems inspectors find in the platform, the less motivated they are to use it.
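A minimal sketch of this kind of correlation check is shown below; it is illustrative only (the analysis was not run with this code), and the Stimulation ratings used are hypothetical placeholders, whereas the problem counts come from Table 2:

```python
# Sketch: Spearman's rho between problems found and a UX dimension rating,
# mirroring the correlation analysis described above (illustrative only;
# the Stimulation ratings below are hypothetical placeholders).
from scipy import stats

problems_per_inspector = [19, 18, 25, 26, 15, 34, 27, 23, 16]          # Table 2
stimulation_ratings = [0.5, 1.0, -0.5, 0.0, 1.5, -1.0, 0.0, 0.5, 1.0]  # hypothetical

rho, p_value = stats.spearmanr(problems_per_inspector, stimulation_ratings)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")
```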
The qualitative analysis from the open-ended
questions of TUXEL allowed us to identify which
aspects affected the ratings by the inspection group.
We coded the sentences (Corbin & Strauss, 2014) by
analyzing the inspectors’ answers and creating codes
that represent the concepts identified in them. For
example, participant I5 stated: “The platform is little intuitive. I think that it lacks shortcuts to access the platform’s options more easily. I could not read some feedback messages because they were little highlighted and faded out quickly”. Expressions such as ‘little intuitive’ and ‘little highlighted and faded out quickly’ are key points identified in these sentences, which we used to start coding and understanding the
phenomenon. As we wanted to identify what affected
the inspectors’ UX with the platform, we analyzed
these key points and created codes for UX-related
issues. For example, for the key point ‘little intuitive’, we assigned the code ‘hard to understand’, and for
‘little highlighted and faded out quickly’, we assigned
the code ‘low visibility of the feedback message’.
After coding the sentences, we grouped those that
represent the same idea, creating a broader code that
addresses the concepts identified in these sentences.
The first code indicates that the platform is not intuitive. Inspector I2, for instance, reported: “Sometimes I do not know where to go, which limits its utilization. Therefore, I found it a little complicated to use”. The second code relates to the low contrast of the interface’s colors, which may impair the visibility of options and notifications. Inspector I10 pointed out “[...] when choosing the place and time [for scheduling], the font color was not visible”. Finally, the third code reveals the difficulty in visualizing the feedback messages. Inspector I6 stated: “Something recurrent is the lack of helpful feedback to the user because the existing ones are not significant or much visible”.
5 DISCUSSION
The results of our study reinforce previous findings from the literature, in which inspection identified a greater number of problems than usability testing (Hasan et al., 2012; Hvannberg et al., 2007; Maguire & Isherwood, 2018).
The inspection was, overall, more effective and
efficient than the UT, indicating that it is still a cost-
effective method for identifying usability problems.
The analysis of problem severity showed that, relative to the total number of problems identified by each method, inspection led to the identification of more cosmetic and minor problems than major ones, while usability testing identified more minor and major problems than cosmetic ones. As
usability testing is task-oriented, i.e., more focused on
the identification of aspects that may impair the
accomplishment of the tasks, it may identify more
severe problems than cosmetic ones, given that it does
not evaluate the interface as a whole. By contrast,
inspection methods guide inspectors to search for
many specific aspects that may influence the usability
of the product/system, leading to the identification of
details that may be missed during usability testing.
However, although inspection proportionally
identified fewer major problems than minor and
cosmetic ones, the number of major problems reported by inspectors surpassed the number found in usability testing. Additionally, inspectors addressed
most of the major problems reported in usability
testing while identifying a greater number of unique
ones, highlighting the effectiveness of TUXEL in
addressing potential high priority problems that can
occur during actual user interaction.
When considering the level of experience in
usability evaluation, the results showed that novice
inspectors identified as many problems as experts,
indicating that TUXEL supports the identification of
problems even by inspectors without much
experience with usability evaluation. Moreover,
novices identified slightly more major problems than
experts. These results contrast with those found by Desurvire et al. (1992), where non-experts using
HE identified less than half of the problems found by
experts. This indicates that TUXEL supports novice
inspectors to find problems during the evaluation
process. By contrast, experts reported a higher
number of cosmetic and minor problems in
comparison to novices. Given that experts are more familiar with this type of evaluation, they were probably more meticulous in identifying every aspect that did not comply with the evaluated items, which would have led to the identification of those many minor and cosmetic issues. However, as
it may be costly for companies to employ experts,
TUXEL may be a good alternative for reducing costs
without impairing the results of the evaluation, as
significant differences in effectiveness and efficiency
between novices and experts were not observed.
Regarding thoroughness, our results were better
than those obtained by Hvannberg et al. (2007) and Desurvire et al. (1992), who employed inspection methods,
such as Nielsen’s HE. Although we cannot make a
direct comparison, as the evaluated product is
different and we did not employ Nielsen’s HE in this
study, the results indicate good effectiveness,
especially given that TUXEL was primarily designed
for the e-learning context. Moreover, the fact that
novices predicted 69.2% of all major problems found
in UT highlights that TUXEL is cost-effective. By
contrast, TUXEL led to the identification of many
other problems that were not confirmed during UT,
resulting in low validity. This is probably because four out of the ten users in UT failed to create an account on the platform, preventing them from performing tasks that required logging in. Consequently, usability
problems related to these tasks could not be addressed
in UT. Although it cannot be guaranteed that these
unconfirmed problems will occur, they highlight
opportunities for improving the platform.
Despite the advantages of identifying many
problems, TUXEL also requires more effort from the
practitioners for analyzing and consolidating all the
discrepancies and verifying whether they are real
problems or not. In this sense, UT has the advantage
of not requiring further analysis for false positives, as
only real problems faced by users and identified by
the moderator are reported. Moreover, as UT focuses
only on the problems that actually occurred during the
interaction and not on those that violated a given
heuristic or standard, the number of discrepancies to
analyze and consolidate is reduced. A drawback of UT is that it is costly, given that more participants are needed to identify more problems, while inspection requires only a few inspectors, even ones without much experience (in the case of TUXEL).
If the company has access to users, UT is a good
option. By contrast, if the product involves confidentiality issues or is in the early stages of development, employing an inspection method may
be more suitable. A combination of both approaches,
however, may provide the best results.
Finally, regarding UX evaluation, given that
inspectors found a higher number of problems, their
perception about the UX of the platform may have
been influenced by the inspection process, leading to a neutral evaluation. By contrast, the participants from
UT evaluated the UX of the platform very positively,
even those who had many difficulties, could not
perform some tasks, or took a long time to accomplish
them. Previous works have already pointed out this
phenomenon (Nakamura et al., 2019), indicating that other factors may have had a stronger influence on UX than the problems they faced during the interaction.
As they knew that the platform was being developed
by the company, they may not have felt at ease to
criticize it, although we explained the importance of
being honest and that the object of study was the
platform, not the participants themselves. Another
possibility is related to the profile of the UT
participants. As they did not use computers very
often, they probably had never used this type of
platform before, thus everything was new to them.
Previous works, for example, have demonstrated that
participants’ expectations influence UX evaluations
(Kujala et al., 2017; Kujala & Miron-Shatz, 2015). As
they had not used this type of platform before and
only use computers occasionally, they probably did
not have any expectations about the platform, nor a
basis for evaluating their experience, leading to a
more positive evaluation.
It is worth mentioning that the small sample size
limits the generalization of the results. However, it is representative of empirical studies in industry, where not many subjects are available. We also
selected participants whose profiles reflect the target
population. Although we involved employees in user
testing, we selected those from different departments,
with varied digital literacy, low experience with
technology, and were not part of the development
team. For the inspection group, we selected both
participants with and without experience in usability
evaluation to reflect companies that may or may not
have usability experts available. Finally, the platform
domain and its specificities also limit the
generalization of the results, as it did not require
domain knowledge to be evaluated.
6 CONCLUSIONS
Inspection remains a cost-effective approach for
evaluating the usability of current Web platforms,
allowing the identification of a greater number of
problems in comparison to usability testing. These
problems highlight many points that can be improved,
leading to the development of high-quality products.
Our results also showed that it is possible to employ
an inspection method with novices and still maintain
its effectiveness in identifying problems, which can
help companies to reduce costs.
Although usability testing identified considerably
fewer problems, it allowed the identification of a
great number of major ones, considered by the
development team as important to fix with high
priority. As the effort for consolidating and analyzing
the data is proportional to the number of problems
reported, usability testing is a good alternative for
focusing on the main and recurrent problems that
users may face during their interaction.
It is worth mentioning that combining these
approaches might provide more complete results,
allowing practitioners to have a broader view of the
quality of the product being evaluated. However, this
implies more cost due to the need for more personnel
and time for consolidating and analyzing the results.
In this sense, practitioners should decide according to
the company’s constraints and needs.
Regarding UX evaluation, the differences in the
results between inspectors and users raise doubts
about which results to rely on and indicate that other
factors may have influenced their subjective
evaluations. The lower ratings from inspectors
indicate a possible influence of the problem detection
process inherent to inspection, leading them to focus
on the negative aspects of the platform. The higher
ratings of the participants from usability testing, in
turn, may be related to their profile. As they only use
computers occasionally, they may have had no
expectations about the platform nor a baseline for
comparing their experience with previous ones. The
fact that they were also employees of the company
that developed the Web platform may have also
contributed to a more positive evaluation.
In contrast to usability, research in the UX field is
challenging, given that different factors can affect
users’ evaluations due to the subjective nature of
experiences. Future studies may investigate what factors (e.g., previous experience with similar products and with UX evaluations) influence users’ perceptions of their experiences. Doing so would make it possible for practitioners and researchers to focus on the factors that influence UX,
either by reducing their effects during evaluations or
by considering them when designing new products.
Another possibility is to investigate the impact of
different outcomes on practitioners’ decisions in the
development process. As practitioners rely on the
results from this type of evaluation for improving
their products and planning future releases, contrasting results such as those found in our study may lead to different design decisions.
ACKNOWLEDGEMENTS
This work was supported by the Brazilian funding
agency FAPEAM through process number
062.00478/2019, the Coordination for the
Improvement of Higher Education Personnel - Brazil
(CAPES) process 175956/2013, and CNPq processes
204081/2018-1/PDE, 311316/2018-2, 311494/2017-
0, and 423149/2016-4. We especially thank all the
subjects who participated in this research.
REFERENCES
Alhadreti, O., & Mayhew, P. (2018). Rethinking Thinking
Aloud: A Comparison of Three Think-Aloud Protocols.
Proceedings of the 2018 CHI Conference on Human
Factors in Computing Systems - CHI ’18, 1–12.
Alves, R., Valente, P., & Nunes, N. J. (2014). The state of
user experience evaluation practice. Proceedings of the
8th Nordic Conference on Human-Computer
Interaction Fun, Fast, Foundational - NordiCHI ’14,
93–102.
Ardito, C., Buono, P., Caivano, D., Costabile, M. F., &
Lanzilotti, R. (2014). Investigating and promoting UX
practice in industry: An experimental study.
International Journal of Human-Computer Studies,
72(6), 542–551.
Cockton, G., Woolrych, A., & Hindmarch, M. (2004).
Reconditioned merchandise: Extended structured
report formats in usability inspection. Extended
Abstracts of the 2004 Conference on Human Factors
and Computing Systems - CHI ’04, 1433.
Corbin, J., & Strauss, A. (2014). Basics of qualitative
research. Sage Publications, Inc.
Desurvire, H., Kondziela, J., & Atwood, M. E. (1992).
What is gained and lost when using methods other than
empirical testing. Posters and Short Talks of the 1992
SIGCHI Conference on Human Factors in Computing
Systems - CHI ’92, 125.
Fernandez, A., Abrahão, S., & Insfran, E. (2013). Empirical
validation of a usability inspection method for model-
driven Web development. Journal of Systems and
Software, 86(1), 161–186.
Følstad, A., Law, E., & Hornbæk, K. (2012). Analysis in
practical usability evaluation: A survey study.
Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, 2127–2136.
Hartson, H. R., Andre, T. S., & Williges, R. C. (2003).
Criteria For Evaluating Usability Evaluation Methods.
International Journal of Human-Computer Interaction,
15(1), 145–181.
Hasan, L., Morris, A., & Probets, S. (2012). A comparison
of usability evaluation methods for evaluating e-
commerce websites. Behaviour & Information
Technology, 31(7), 707–737.
Hassenzahl, M. (2018). The Thing and I (Summer of ’17
Remix). In M. Blythe & A. Monk (Orgs.), Funology 2
(p. 17–31). Springer International Publishing.
Hassenzahl, M., Diefenbach, S., & Göritz, A. (2010).
Needs, affect, and interactive products – Facets of user
experience. Interacting with Computers, 22(5), 353–
362.
Hassenzahl, M., Law, E. L.-C., & Hvannberg, E. T. (2006).
User Experience-Towards a unified view. Ux Ws
Nordichi, 6, 1–3.
Hornbæk, K. (2010). Dogmas in the assessment of usability
evaluation methods. Behaviour & Information
Technology, 29(1), 97–111.
Hvannberg, E. T., Law, E. L.-C., & Lárusdóttir, M. K.
(2007). Heuristic evaluation: Comparing ways of
finding and reporting usability problems. Interacting
with Computers, 19(2), 225–240.
Hwang, W., & Salvendy, G. (2010). Number of people
required for usability evaluation: The 10±2 rule.
Communications of the ACM, 53(5), 130.
International Organization for Standardization. (2018).
Ergonomics of human-system interaction—Part 11:
Usability: Definitions and concepts.
Jeffries, R., Miller, J. R., Wharton, C., & Uyeda, K. M.
(1991). User interface evaluation in the real world: A
comparison of four techniques. Proceedings of the
SIGCHI Conference on Human Factors in Computing
Systems Reaching through Technology - CHI ’91, 119–
124.
Kujala, S., & Miron-Shatz, T. (2015). The Evolving Role
of Expectations in Long-term User Experience.
Proceedings of the 19th International Academic
Mindtrek Conference, 167–174.
Kujala, S., Mugge, R., & Miron-Shatz, T. (2017). The role
of expectations in service evaluation: A longitudinal
study of a proximity mobile payment service.
International Journal of Human-Computer Studies, 98,
51–61.
Laugwitz, B., Held, T., & Schrepp, M. (2008). Construction
and Evaluation of a User Experience Questionnaire. In
A. Holzinger (Org.), HCI and Usability for Education
and Work (Vol. 5298, p. 63–76). Springer Berlin
Heidelberg.
Law, E. L.-C., & Hvannberg, E. T. (2002).
Complementarity and Convergence of Heuristic
Evaluation and Usability Test: A Case Study of
UNIVERSAL Brokerage Platform. Proceedings of the
Second Nordic Conference on Human-Computer
Interaction, 71–80.
Law, E. L.-C., Roto, V., Hassenzahl, M., Vermeeren, A. P.,
& Kort, J. (2009). Understanding, scoping and defining
user experience: A survey approach. Proceedings of the
SIGCHI conference on human factors in computing
systems, 719–728.
Law, E. L.-C., Vermeeren, A. P., Hassenzahl, M., & Blythe,
M. (2007). Towards a UX manifesto. Proceedings of
the 21st British HCI Group Annual Conference on
People and Computers: HCI... but not as we know it-
Volume 2, 205–206.
Maguire, M., & Isherwood, P. (2018). A Comparison of
User Testing and Heuristic Evaluation Methods for
Identifying Website Usability Problems. In A. Marcus
& W. Wang (Orgs.), Design, User Experience, and
Usability: Theory and Practice (Vol. 10918, p. 429–
438). Springer International Publishing.
Mukaka, M. M. (2012). Statistics Corner: A guide to
appropriate use of Correlation coefficient in medical
research. Malawi Medical Journal, 24(3), 69–71.
Nakamura, W. T., de Oliveira, E. H. T., & Conte, T. (2019).
Negative Emotions, Positive Experience: What Are We
Doing Wrong When Evaluating the UX? Extended
Abstracts of the 2019 CHI Conference on Human
Factors in Computing Systems - CHI EA ’19, 1–6.
Nakamura, W. T., Oliveira, E. H. T., & Conte, T. (2018).
Applying Design Science Research to develop a
Technique to Evaluate the Usability and User
eXperience of Learning Management Systems.
Brazilian Symposium on Computers in Education.
Nielsen, J. (1994). Heuristic Evaluation. In J. Nielsen & R. L. Mack (Eds.), Usability Inspection Methods (pp. 25–61). New York: John Wiley & Sons.
Shapiro, S. S., & Francia, R. S. (1972). An Approximate
Analysis of Variance Test for Normality. Journal of the
American Statistical Association, 67(337), 215–216.
Teka, D., Dittrich, Y., Kifle, M., Ardito, C., & Lanzilotti,
R. (2017). Usability Evaluation in Ethiopian Software
Organizations. Proceedings of the Second International
Conference on Information and Communication
Technology for Africa Development, ICT4AD, 17, 102–
118.
Van den Haak, M. J., de Jong, M. D. T., & Schellens, P. J.
(2004). Employing think-aloud protocols and
constructive interaction to test the usability of online
library catalogues: A methodological comparison.
Interacting with Computers, 16(6), 1153–1170.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C.,
Regnell, B., & Wesslén, A. (2012). Experimentation in
software engineering. Springer Science & Business
Media.