Using Well-Known Techniques to Visualize Characteristics of Data Quality

Roy A. Ruddle

School of Computing and Leeds Institute for Data Analytics, University of Leeds, Leeds, U.K.

Keywords:

Visualization, Data Quality, Data Science, Empirical Study.

Abstract:

Previous work has identiﬁed more than 100 distinct characteristics of data quality, most of which are aspects

of completeness, accuracy and consistency. Other work has developed new techniques for visualizing data

quality, but there is a lack of research into how users visualize data quality issues with existing, well-known

techniques. We investigated how 166 participants identiﬁed and illustrated data quality issues that occurred

in a 54-ﬁle, longitudinal collection of open data. The issues that participants identiﬁed spanned 27 different

characteristics, nine of which do not appear in existing data quality taxonomies. Participants adopted nine

visualization and tabular methods to illustrate the issues, using the methods in ﬁve ways (quantify; alert;

examples; serendipitous discovery; explain). The variety of serendipitous discoveries was noteworthy, as was

how rarely participants used visualization to illustrate completeness and consistency, compared with accuracy.

We conclude by presenting a 106-item data quality taxonomy that combines seven previous works with our

ﬁndings.

1 INTRODUCTION

Investigating data quality is a key part of preparing

data for analysis or modeling (Wirth and Hipp, 2000).

Both descriptive statistics and visualizations have

distinct beneﬁts for such investigations (Anscombe,

1973). Our interest is in the visual approach, where

previous research has primarily focused on develop-

ing new techniques for visualizing data quality (e.g.,

for missing values (Fernstad, 2019) or outliers (Pham

and Dang, 2019)).

That research often includes user studies to eval-

uate the new techniques. However, there is a notable

lack of research that investigated how users ﬁnd and

illustrate data quality issues with existing visualiza-

tion techniques. We addressed that gap by conducting

a study in which 166 data science Masters students in-

vestigated the quality of a large dataset of longitudinal

open data.

The paper makes three main contributions. First,

we identify ﬁve ways (quantify; alert; examples;

serendipitous discovery; explain) in which visualization and table-based methods help users to find and illustrate data quality issues. Second, we provide guidance about methods to use for different issues, taking account of scalability and visual attributes such as pop out (Spence, 2001). Third, we document characteristics of completeness and accuracy that are not in previous data quality taxonomies. We readily acknowledge that being missing from those taxonomies does not mean that the characteristics are completely unknown to practicing data scientists, but it does indicate that they tend to reside only in tacit knowledge.

https://orcid.org/0000-0001-8662-8103

2 RELATED WORK

The ISO/IEC 25012:2008 international standard di-

vides data quality into 15 types (completeness, accu-

racy, consistency, etc.). Previous research gathered

information ﬁrst-hand about data quality (Dungey

et al., 2014; Wang and Strong, 1996) or reviewed

characteristics of data quality that were reported else-

where (Gschwandtner et al., 2014; Kandel et al.,

2012; Laranjeiro et al., 2015; Weiskopf and Weng,

2013). Even though those papers and their source ma-

terial only represent a subset of the full body of pre-

vious work on data quality, they identify more than

100 distinct data quality characteristics. Most of them

are characteristics of completeness (missing data, its

opposite duplicates, and coverage), accuracy (syntax

and semantics) or consistency (within individual entities and between comparable entities).

Ruddle, R.
Using Well-Known Techniques to Visualize Characteristics of Data Quality.
DOI: 10.5220/0011664300003417
In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 3: IVAPP, pages 89-100
ISBN: 978-989-758-634-7; ISSN: 2184-4321
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

Visualizations may be presented using a wide variety of chart types (Munzner, 2014). Tables are generally left out from research about visualization, but

are commonly used by analysts to “eyeball” data to

conﬁrm it meets expectations (Bartram et al., 2022).

Each visualization technique is appropriate for

certain types of data. E.g., bar charts are appropriate

for showing numerical variables against a categorical

variable or a discrete numerical variable such as the

day of the week, whereas scatter plots are appropriate

for showing pairs of continuous numerical variables

(Andrienko and Andrienko, 2006). Data quality vi-

sualizations adhere to the same rules, with bar charts

appropriate for visualizing any scalar (e.g., the num-

ber of missing values in each variable), box plots for

visualizing the distribution of numerical values, line

charts for visualizing temporal data, and pie charts for

showing proportions (e.g., value counts).

Visualizations work by showing people graphi-

cal patterns from which features either “pop out” or

can be found by inspecting the patterns (i.e., “visual

search”) (Spence, 2001). Pop out occurs when peo-

ple notice a pattern instantaneously, irrespective of

the complexity of the visualization, and takes place

when a small number of items differ from others in

terms of visual channels such as color, shape or ori-

entation (Maguire et al., 2012). By contrast, visual

search takes longer as a visualization contains more

information or becomes more complex. Thus, visual-

ization involves a trade-off between simplicity which

facilitates pop out vs. complexity that displays richer

information. Placing that in the context of data qual-

ity, outliers pop out on a box plot because they are dis-

played using a different shape (e.g., dots) to the box

and whiskers that is used for the other data. Pop out

occurs in a bar chart if a bar’s length is substantially

different to the others, but visual search is needed if

they are similar. The same is true for other visual-

ization techniques – whether or not pop out occurs

depends on the type of pattern that is portrayed.

The encoding channel affects the saliency of patterns in a visualization. E.g., length is a more accurate channel than color for encoding numerical data (Mackinlay,

1986), which is why a bar chart is more effective than

a heat map for visualizing the number of missing val-

ues in different variables. As the scale or complex-

ity of data increases, additional aspects of good prac-

tice need to be considered. Perceptual discontinuity

may be needed to ensure that users can distinguish

small numbers from zero values (e.g., inserting a dis-

crete step between 0 and 1 in a color map (Kandel

et al., 2012) or giving bars a minimum length (Ruddle

and Hall, 2019)). When small multiples, sparklines

(Tufte, 2006) or a trellis of visualizations (Stolte et al.,

2002) are used then the spatial arrangement (e.g., a

data- vs. variable-centric layout (Ruddle and Hall,

2019)) affects the saliency of any patterns.
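The idea of giving bars a minimum length can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name `bar_length`, its parameters and the sample counts are hypothetical, and the bars are drawn as text rather than with any particular charting tool.

```python
def bar_length(count, max_count, max_length=40, min_length=1):
    """Map a count to a bar length in characters.

    Non-zero counts never shrink below min_length, so they stay
    distinguishable from zero (a perceptual discontinuity).
    """
    if count <= 0 or max_count <= 0:
        return 0
    scaled = round(max_length * count / max_count)
    return max(min_length, scaled)


# Hypothetical missing-value counts per variable.
missing_counts = {"PCN": 0, "Issued": 3, "Balance": 1798}
largest = max(missing_counts.values())
bar_lengths = {name: bar_length(n, largest) for name, n in missing_counts.items()}

for name, n in missing_counts.items():
    print(f"{name:<8} {'#' * bar_lengths[name]} {n}")
```

Without the minimum, the bar for the count of 3 would round to zero length and look identical to the bar for a variable with no missing values.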

Interaction often makes it easier for users to ﬁnd

patterns. E.g., ﬁltering reduces the quantity of data

that is shown (Monroe et al., 2013) and ordering mul-

tiple attributes reduces the complexity of a visual-

ization (Gratzl et al., 2013). Visualizations may be

panned or scrolled if all of the detail cannot be seen

at once on a computer display, but that increases the

time that users take to analyze the data and makes it

more likely that they completely fail to see some of

the patterns (Ruddle et al., 2013). Alternatively, mul-

tiple views can simultaneously show overviews and

ﬁne-grained details (Shneiderman, 2003).

3 METHOD

The research was conducted by analyzing submis-

sions about a data quality assignment made by Mas-

ters students. Each student’s task was to identify, de-

scribe and illustrate ﬁve of the wide variety of data

quality issues that occurred in a speciﬁc dataset. They

were instructed to illustrate each data quality issue us-

ing a method such as “descriptive statistics output, ex-

ample values or visualization.”

3.1 Participants

A total of 166 Masters students participated. They

came from 12 countries in three continents (Africa,

Asia and Europe), had a variety of academic backgrounds (including computer science, mathematics,

engineering, science and business) and at the time

were studying for degrees in the departments of com-

puting (116 participants), mathematics (47 partici-

pants) and geography (3 participants). The students

completed the assignment in the 5th week of an

11-week course, having already covered topics on

business understanding, data understanding and data

preparation.

The data preparation topic included an overview

of data proﬁling and data quality. The students had

also been given practical training about data visual-

ization, using “getting started” material from Tableau,

and then a custom-written 24-page tutorial and eight

data analysis challenges. Although the students were

at the very beginning of their career as data scien-

tists, they did have the beneﬁt of some formal edu-

cation about both data quality and visualization, un-

like many more experienced data scientists who only

acquire such knowledge during “on the job” training.

IVAPP 2023 - 14th International Conference on Information Visualization Theory and Applications

Table 1: Description of the variables in the dataset (* indicates variable was not documented on the dataset website).

Variable         Description
PCN              Unique identifier for each Parking Charge Notice
Issued*          The date a fine was issued (subdivided into Issue Date and Issue Time in some files)
Location*        The name of the car park (called Parking Location in some files)
Contravention*   The type of parking offence (called Description of Offence in some files)
Charge Level*    H or L
Fine             The amount of the fine (called Fine (£) or Full Fine £ in some files)
Discount £       The amount of the fine if it is paid within 14 days
Last Pay Date*   The date on which a payment was paid for the fine
Total Paid*      The total amount that has been paid for a fine (called Paid in £ or Total Paid (£) in some files)
Balance          The outstanding amount of a fine (called Balance (£) in some files)

3.2 Dataset

The dataset (https://datamillnorth.org/dataset/off-street-parking-fines; see Table 1) is open data and

contained information about every parking ticket

issued for vehicles in car parks over seven years

(April 2013 – September 2020) in a city of 800,000

people. The dataset comprised 54 CSV and Excel

ﬁles (20 MB and 230,038 rows in total).

Like many longitudinal datasets, the columns

changed over time, as did the names of variables

and even the number of ﬁles per quarter (one ﬁle for

each of the ﬁrst 6 quarters, but separate ﬁnes issued

and ﬁnes paid ﬁles for the subsequent quarters). As

well as data quality issues caused by those deliber-

ate changes, others concerned clear-cut omissions or

errors, and some arose from the dataset’s documenta-

tion which was correct for the most recent years but

not for earlier years.

3.3 Data Analysis

Two participants only submitted four rather than ﬁve

issues, and another two participants each appeared to

be confused by the data for one of their issues. From

the illustration and free text description that the partic-

ipants provided, two researchers used emergent cod-

ing to classify the remaining 826 submissions, using

different codes if they involved the same data quality

characteristic but a different variable. E.g., there were

separate codes for missing values in the Balance, Is-

sued, Location and PCN variables.

The researchers performed the classiﬁcation sepa-

rately, apart from liaising to ensure that they under-

stood all the codes. The inter-rater agreement was

79% (Cohen’s Kappa = 0.77, indicating substantial

agreement). The differences were resolved as fol-

lows. First, the researchers discussed the relevant

codes’ descriptions. Then the researchers worked

asynchronously to review each issue where there was

disagreement and decide which of the two codes was

appropriate. Finally, the researchers met online to dis-

cuss the ﬁve issues where disagreement remained and

agree the ﬁnal code for each.
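For readers unfamiliar with the agreement statistics quoted above, the following minimal sketch computes percentage agreement and Cohen's Kappa for two raters. It is not the study's analysis code: the function name `cohens_kappa` and the six example codes are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Return (observed agreement, Cohen's kappa) for two raters."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two equal-length, non-empty code lists")
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Agreement expected by chance, from each rater's code frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical code throughout
    return observed, (observed - expected) / (1 - expected)


# Hypothetical codes assigned to six submissions by two raters.
rater_1 = ["missing", "missing", "outlier", "outlier", "duplicate", "duplicate"]
rater_2 = ["missing", "missing", "outlier", "duplicate", "duplicate", "duplicate"]
agreement, kappa = cohens_kappa(rater_1, rater_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement because it discounts the agreement that two raters would reach by chance given how often each of them uses each code.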

One of the researchers then grouped the issues ac-

cording to the data quality characteristics, and we also

recorded the type of illustration that was used for each

submission.

4 RESULTS

Collectively the participants identiﬁed 79 different is-

sues. One concerned accessibility (illustrated with a

table). The other 78 issues spanned 11 characteristics

of completeness, 12 of accuracy and three of consis-

tency. The rest of this section focuses on those com-

pleteness, accuracy and consistency characteristics.

The majority are included in existing taxonomies of

data quality but nine are not (see Appendix for details

of every issue and our combined 106-item taxonomy).

Participants used seven visualization techniques

(bar chart, box plot, bubble plot, heat map, line chart,

radial bar chart, scatter plot) and two types of table

(summary and data extract) to illustrate the charac-

teristics. A summary table was one in which partic-

ipants presented aggregated output (see Table 3). A

data extract table showed raw data for a subset of the

rows/columns in a data file. Table 2 summarizes the number of times that each technique was used.

Issues typically pop out in a summary table, al-

though it does little to help a user understand why the

issue actually occurred. By contrast, a data extract

table shows raw data, which may aid users’ under-

standing of remedies, but makes an issue less salient

because a user has to inspect the table, and is less scal-

able because only a tiny proportion of the data can be

shown even if the dataset is small (e.g., 1000 records).

The remainder of this section starts by reporting

the results for completeness because that is the start-

ing point for a rigorous investigation of data quality.

Next we report the results for accuracy, and then con-

sistency because that concerns the accuracy of multi-

ple data values that each appear to be accurate when

considered by themselves. Each part describes the us-

age of the visualization techniques and tables, com-

menting about their strengths and weaknesses under

various circumstances.

Table 2: The number of times each method of illustration was used for each data quality characteristic.

Columns (in order): Type, Characteristic, Bar chart, Box plot, Bubble plot, Heat map, Line chart, Radial bar chart, Scatter plot, Summary table, Data extract table

Accessibility Interpretability 1 1
Accuracy Data format 2 1 9
Accuracy Domain violation 1 1 2
Accuracy Validity 1
Accuracy Wrong data type 4 3
Accuracy Extreme: Numeric outliers 31 9 6 1 6 5 2
Accuracy Extreme: Special value 6
Accuracy Extreme: Time-series outliers 20 2 62 10 1
Accuracy Extreme: Unusual category name 2 8
Accuracy Implausible range 10 1 15 1 1 18
Accuracy Pattern of value is unusual 7 1 1 1 1 15
Accuracy Same value for too many records 1
Accuracy Unexpected low/high values 13 7 1 1 19
Completeness Coverage 16 3 32 2 4
Completeness Duplicates: Duplicate header 2 3 60
Completeness Duplicates: Exact duplicates 1 34
Completeness Duplicates: Uniqueness violation 3 1 93
Completeness Empty column 1 1 1 1 82
Completeness Completely missing column 1 22
Completeness Missing column name 2
Completeness Completely missing header 47
Completeness Missing record 10
Completeness Missing value 6 2 1 1 6
Completeness Zero value 1 2 5
Consistency Inconsistent duplicates 1 1
Consistency Violation of functional dependency 9 4 4 1 40
Consistency Different data formats 1 1 7

Table 3: A summary table used to report a missing column name (“Unnamed: 5”). The “Number of missing values” also pops out because it has four digits.

Variable         Data type   Number of missing values
PCN              object      0
ISSUED           object      0
LOCATION         object      0
CONTRAVENTION    object      0
FINE             object      0
Unnamed: 5       float64     1798

The figures are based on participants’ submissions, but redrawn to improve the images, and sometimes simplified to better illustrate the pros and cons of different visualization techniques. Overall, the illustration methods were used in five ways:

• Quantify (e.g., the number of missing values).

• Alert (e.g., warning message about null values, in-

dicating how many values could not be plotted,

but not stating which variable).

• Examples (identify records that exhibit an issue).

• Serendipitous discovery (found by accident with

a visualization created to analyze other aspects of

the data, e.g., noticing an axis label called “null”).

• Explain (characterize issue’s nature, e.g., in terms

of the number of records vs. distinct values).

4.1 Completeness

Participants primarily used tables, with visualizations

only comprising 16% of the illustrations. Missing

records and a completely missing header were only

illustrated with a data extract table, and a missing col-

umn name only with a summary table (see Table 3).

Missing values were presented in a variety of

ways, including conventional ones (a data extract ta-

ble showing null values, or a summary table showing

counts of the number of missing values in each vari-

able). Some visualizations had alerted participants to

the existence of null values (see Figure 1a) when they

were analysing other aspects of the data. Participants

also noticed “null” (or similar text) appearing in axis

labels (see Figure 1b), thereby serendipitously discov-

ering missing values.
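A summary table of missing-value counts, of the kind described above, can be produced with a short script. This is a sketch under assumptions rather than any participant's method: the sample data, the set of tokens treated as missing, and the function name `missing_value_summary` are all hypothetical, and blank headers are renamed in imitation of the “Unnamed: 5” label seen in Table 3.

```python
import csv
import io

# Hypothetical extract; the last column has no header.
SAMPLE = """PCN,ISSUED,LOCATION,FINE,
LS001,01/04/2013,Car Park A,50,
LS002,02/04/2013,,70,
LS003,,Car Park B,null,
"""

MISSING_TOKENS = {"", "null", "nan", "n/a"}  # assumed spellings of "missing"


def missing_value_summary(text):
    """Count missing values per column, naming blank headers 'Unnamed: i'."""
    rows = list(csv.reader(io.StringIO(text)))
    header = [name.strip() or f"Unnamed: {i}" for i, name in enumerate(rows[0])]
    counts = dict.fromkeys(header, 0)
    for row in rows[1:]:
        for name, value in zip(header, row):
            if value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    return counts


summary = missing_value_summary(SAMPLE)
for name, n in summary.items():
    print(f"{name:<12} {n}")
```

Treating the text “null” as missing is a deliberate choice here, because that is the kind of value that participants otherwise found only by noticing it in an axis label.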

An empty column is one in which all of the val-

ues are missing. Participants primarily illustrated that

with a data extract table. One participant used the

quality map approach (Ward et al., 2011) (see Fig-

ure 1c). Other participants found the same empty col-

umn issue after seeing an N nulls alert in a line chart

or noticing a null X-axis label on a bar chart. Columns

that were present in some data files but completely missing from others were primarily illustrated with a data extract table, but one participant discovered the missing column after noticing an unexpected “null” in a bar chart X axis label.

Figure 1: Missing values: (a) alert (“1 null”) provided by the visualization software, (b) serendipitous discovery (the participant noticed a label called Null on the X axis), or “Unnamed: 8”).

Figure 2: A bar chart that quantifies the number of zero values that occurred for “Total Paid” each week.

As analysts know, some data sources set numeri-

cal data equal to zero if it is missing. Some partic-

ipants illustrated that with a data extract table. One

participant provided a summary table together with

the analysis code that they had used to create the ta-

ble, which is an unambiguous way of showing exactly

how they identiﬁed the issue although, clearly, that is

only appropriate for certain audiences. Another par-

ticipant used a bar chart to quantify how often zero

values occurred for each date (see Figure 2).
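Quantifying zero values over time, as in Figure 2, amounts to a grouped count. The sketch below assumes hypothetical (date, Total Paid) records and groups them by ISO week; the function name `zero_counts_by_week` is illustrative and the text bars stand in for a real bar chart.

```python
from collections import Counter
from datetime import date

# Hypothetical (payment date, Total Paid) records.
records = [
    (date(2019, 1, 7), 0.0),
    (date(2019, 1, 8), 35.0),
    (date(2019, 1, 9), 0.0),
    (date(2019, 1, 15), 0.0),
    (date(2019, 1, 16), 50.0),
]


def zero_counts_by_week(rows):
    """Count zero-valued payments per ISO (year, week)."""
    counts = Counter()
    for day, total_paid in rows:
        if total_paid == 0:
            year, week, _ = day.isocalendar()
            counts[(year, week)] += 1
    return counts


weekly_zeros = zero_counts_by_week(records)
for (year, week), n in sorted(weekly_zeros.items()):
    print(f"{year}-W{week:02d} {'#' * n} {n}")
```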

Participants primarily used a data extract table to

illustrate duplicates, but visualization was also some-

times effective. Two participants serendipitously dis-

covered that a data ﬁle contained records that were

actually duplicates of the header row, by noticing a

variable name appearing as an axis label, although the

name only stood out because it was much shorter than the valid location names (see the X-axis label “LOCATION” in Figure 1a). Some of the exact duplicate issues that participants identified were genuine, but others were not (see Figure 3).

Figure 3: A bar chart in which one combination of PCN/day pops out as occurring twice (as the scroll bar indicates, only a few of the 5883 PCNs can be shown at once). In fact payments had been made on the 24th of two different months.

Figure 4: Coverage: (a) line charts plotting the number of records vs. day of a month, with gaps popping out, (b) the same data using a line chart that interpolates across days with no records (the gaps are hidden, so the software has misled users by implying that those days did have fines), so the gap pops out, (d) the same data with a discrete X axis, which omits years with no data so participants had to inspect the labels to notice the gaps, (e) heat map showing there is no data for the 6th and 27th May.
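The three kinds of duplicate discussed in this subsection (a repeated header row, exact duplicates and uniqueness violations) can each be checked programmatically before deciding how to illustrate them. The following is a minimal sketch with hypothetical rows and column names; it is not taken from any participant's submission.

```python
from collections import Counter

HEADER = ("PCN", "LOCATION", "FINE")

# Hypothetical rows, read as tuples of strings.
rows = [
    ("LS001", "Car Park A", "50"),
    ("LS002", "Car Park B", "70"),
    ("PCN", "LOCATION", "FINE"),    # header repeated as a record
    ("LS002", "Car Park B", "70"),  # exact duplicate
    ("LS003", "Car Park A", "50"),
    ("LS003", "Car Park C", "70"),  # same PCN, different content
]

duplicate_headers = [i for i, row in enumerate(rows) if row == HEADER]
data_rows = [row for row in rows if row != HEADER]

row_counts = Counter(data_rows)
exact_duplicates = sorted(row for row, n in row_counts.items() if n > 1)

# A PCN that appears on more than one distinct row violates uniqueness.
pcn_counts = Counter(row[0] for row in set(data_rows))
uniqueness_violations = sorted(pcn for pcn, n in pcn_counts.items() if n > 1)

print(duplicate_headers, exact_duplicates, uniqueness_violations)
```

As Figure 3 shows, a flagged duplicate still needs inspecting, because an apparent repeat may turn out to be legitimate once the full values are examined.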

The coverage issues all involved time and were

the only aspect of completeness for which visual-

ization was dominant. Participants most often used

a line chart, which contained gaps when there were

time gaps in the data and applied semantic encoding

(Ruddle and Hall, 2019) by using a different mark

type (a point) if a date was isolated (see Figure 4a).

That was a beneﬁt of creating the visualizations with

Tableau, because the gaps and different mark types

made the coverage issues pop out. By contrast, some

visualization software interpolates across missing val-

ues, which hides coverage issues from users (see Fig-

ure 4b).
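Because some charting software hides gaps by interpolating, it can be useful to list the missing dates explicitly before plotting. The sketch below does that for a hypothetical list of issue dates; the function name `missing_dates` and the sample values are assumptions made for illustration.

```python
from datetime import date, timedelta


def missing_dates(observed):
    """Return every date between the first and last observation with no record."""
    present = set(observed)
    day, last = min(present), max(present)
    gaps = []
    while day <= last:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical issue dates; the 6th and 7th of May have no records.
issued = [date(2019, 5, d) for d in (3, 4, 5, 8, 9, 9, 10)]
gaps = missing_dates(issued)
print([d.isoformat() for d in gaps])
```

The resulting list can also be used to insert explicit empty entries into the plotted series, so that a line chart breaks at the gap rather than bridging it.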

The effectiveness of bar charts for presenting tem-

porally based coverage depended on whether missing

dates were included or excluded. When they were included then a coverage issue popped out because of

the gap on the time axis (see Figure 4c), but if miss-

ing dates were excluded then participants needed to

carefully read the date axis labels to notice that some

were missing (i.e., the years 2012–2017 in Figure 4d).

Participants also used data extract tables, scatter

plots and heat maps to illustrate temporally based cov-

erage issues. A data extract table is not effective be-

cause it necessitates that a person carefully reads the

table to notice that some dates are missing. A heat

map is superior, provided that adjacent cells abut so

that gaps are clear (see Figure 4e). However, some

software inserts gaps between heat map cells, and scatter plot markers produce a similar problem.

4.2 Accuracy

Data formats can be incorrect in a multitude of ways, and

one that occurred in the present study was a location

that ended with many trailing spaces, which partici-

pants discovered serendipitously from the labels of a

bar chart where that location appeared to be left justi-

ﬁed text unlike all the others.

Data type and other data format issues were only

illustrated with tables.

The domain violation issues always concerned

fines. Some participants presented that using a summary table, which listed the number of times each

distinct value of Fine occurred. The rarity of £60

popped out (it occurred 1000 times less often than

the other two values), leading participants to com-

ment that £60 was not one of the values that were

listed in the dataset’s documentation. Other partic-

ipants used a bar chart or a scatterplot to present the

ﬁne for each PCN, from which a ﬁne that summed to a

total of £250 popped out because it was much greater

than the others, leading to another comment about the

discrepancy between the data and the documentation.
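The summary-table approach to domain violations is essentially a value count compared against the documented domain. The sketch below is illustrative only: the set `DOCUMENTED_FINES` is an assumption (the two fine levels that appear in Figure 6), and the counts are invented.

```python
from collections import Counter

# Assumed for illustration: the fine levels that the documentation lists.
DOCUMENTED_FINES = {50, 70}

# Hypothetical Fine values from one data file.
fines = [50] * 1200 + [70] * 900 + [60]

value_counts = Counter(fines)
violations = {v: n for v, n in value_counts.items() if v not in DOCUMENTED_FINES}

for value, n in value_counts.most_common():
    flag = "  <- not documented" if value in violations else ""
    print(f"£{value:<4} {n:>5}{flag}")
```

Sorting by frequency places the rare value at the end of the table, which is where it popped out for participants.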

Time-series and numeric outliers were most often

illustrated with line and bar charts, which are de facto

methods of presenting numerical and time data when

the reference is continuous (see Figure 5a) or discrete

(see Figure 5b). However, some bar charts contained

the same perceptual distortion as in Figure 4d so outliers did not pop out. Box plots are also purpose-designed to ensure that outliers pop out, because they are displayed using a different shape to the rest of

the plot (see Figure 5c). Scatter plots were also used

effectively for showing outliers. Sometimes that was

for values that were only outlying from a bivariate

perspective (see Figure 5d). Another example showed

univariate outliers, and is notable because the X and

Y axes are for discrete variables so the participant had

to use jittering to avoid overplotting (see Figure 5e).
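Box plots usually decide which points to draw as outliers with a fence rule. The paper does not state which rule the participants' software applied, so the sketch below assumes the common convention of 1.5 times the interquartile range beyond the quartiles; the function name `box_plot_outliers` and the sample values are hypothetical.

```python
import statistics


def box_plot_outliers(values, k=1.5):
    """Return values beyond k * IQR from the quartiles (assumed fence rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical Total Paid values; one is far larger than the rest.
total_paid = [25, 25, 35, 35, 35, 50, 50, 50, 70, 241]
outliers = box_plot_outliers(total_paid)
print(outliers)
```

Quartile definitions differ slightly between tools, so borderline points may be classified differently by different software.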

Figure 5: Time-series and numeric outliers: (a) line chart

with an obvious peak for one date, (b) bar chart showing

that the Total Paid was much greater for one PCN than any

others, (c) box plot showing outlying values of the Total

Paid for the two different values of Fine, (d) scatter plot

revealing a bivariate outlier (a day on which ﬁnes totalling

£1100 had been issued but the Total Paid was £1894), and

(e) scatter plot where an outlier (Total Paid = £241) pops

out because of its Y-axis position.

Participants also reported extreme values for both

categorical and date variables. They found categorical

extremes serendipitously, by noticing that one con-

travention in bar chart labels or a data extract table

had the textual value “QTR”, whereas all the others

had names such as “83 WITHOUT DISPLAYING A

VALID TICKET”. The date extreme concerned PCNs

that had a plausible issue date (in the year 2014) but a

special value (1899/12/30) for the Issue Time, exam-

ples of which were presented in a data extract table.

Participants reported two issues with implausible

values. One was negative balances, which were il-

lustrated using ﬁve methods. A line chart and scatter

plot were best because they were capable of show-

ing every record in a data ﬁle while still allowing the

implausible values to pop out (see Figure 6a and 6c).

Another was an exemplary use of a trellis of bar charts to question the plausibility of some contraventions only

having one value of ﬁne and another set of contraven-

tions having another ﬁne (see Figure 6b).

Of the other plausibility characteristics, the most

common was where there was an unexpectedly long

interval between a ﬁne being issued and paid. Par-

ticipants illustrated that by annotating a data extract

table to highlight examples of the values, showing the

number of ﬁnes for each year of issue in a bar chart,

generating a summary table that showed similar information, serendipitously noticing an old year in the axis labels of a heat map, and creating scatter plots or line charts. The strengths and weaknesses of the bar chart, scatter plot and line chart are illustrated in Figure 7.

Figure 6: Implausible values: (a) downwards pointing spike causing negative balances to pop out from a line chart, (b) trellis of bar charts that groups contraventions into two levels of fine (£50 and £70) so they pop out, and (c) diverging color map causing a negative balance to pop out from a scatter plot.

Figure 7: Implausible range for fines issued vs. paid: (a) bar chart, (b) scatter plot, and (c) line chart. Each has strengths and weaknesses. The bar chart is compact, but the 2009 and 2017 bars are not visible because there are so few fines (that is why the values are labelled). The scatter plot markers are equally salient for all years, but users need to read the X axis labels to notice that there is no data for 2010–2016, and it lacks scalability because it only shows a small number of the 3941 PCNs. The line chart is compact, but users may misunderstand the line from 2009 to 2017, which the software automatically interpolated, and think there was data for the intervening years.

Some values for the Balance or Total Paid were

considered implausible because the pattern of the

value was unusual (most were integers, but a few were

pounds and pence, i.e., decimals). Those decimal val-

ues were sometimes found serendipitously, when par-

ticipants noticed the unusual value amongst the axis

labels or legend items of a visualization, or the text in a summary table. Other participants annotated data extract tables to highlight examples of the values. The other plausibility issue and the validity issue were both only illustrated with a data extract table.

Figure 8: Violation of functional dependency: (a) aggregated line chart showing that, after accounting for the discount that was stated in the data file for every record, the total paid was greater than the fine, (b) bar chart created for records with Total Paid = 0 and showing that about 1200 of those records also had a Balance = 0, which should clearly be impossible, (c) scatter plot from the same data file, showing that the Total Paid + Balance = 0 rule was broken for different fines, but overplotting hides the number of PCNs that were involved for each point in the plot.

4.3 Consistency

Unlike accuracy, for consistency issues participants

only used visualizations a third of the time. The most

common inconsistency occurred when values violated

a functional dependency, and those issues involved

two (e.g., Balance and Fine), three (e.g., Fine, Dis-

count and Total Paid) or four variables (e.g., Fine,

Discount, Total Paid and Issued date). Participants

typically illustrated the issues with a data extract ta-

ble, indicating example records. Visualizations were

used occasionally, but with good effect for several

purposes. One was providing clues that data may be

inconsistent by plotting an aggregated summary (see

Figure 8a), after which individual records could be

checked. Other visualizations quantiﬁed the number

of PCNs that broke a certain rule (see Figure 8b) or

provided a pointer to the PCNs that did so (see Fig-

ure 8c).
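Quantifying how many records break a rule, and pointing to the records concerned, can be sketched as a simple filter. The rule used here is an assumption based on the example in Figure 8b (a fine with nothing paid should not also have a zero balance); the records and the function name `violates_rule` are hypothetical.

```python
# Hypothetical records: (PCN, Fine, Total Paid, Balance).
records = [
    ("LS001", 50, 50, 0),   # paid in full
    ("LS002", 70, 0, 70),   # nothing paid yet
    ("LS003", 50, 0, 0),    # nothing paid, yet nothing owed
    ("LS004", 70, 0, 0),    # nothing paid, yet nothing owed
]


def violates_rule(fine, total_paid, balance):
    """Assumed rule: a positive fine with nothing paid cannot have a zero balance."""
    return fine > 0 and total_paid == 0 and balance == 0


violating_pcns = [pcn for pcn, *amounts in records if violates_rule(*amounts)]
print(f"{len(violating_pcns)} of {len(records)} records break the rule: {violating_pcns}")
```

The count supports the quantify use of an illustration, whereas the list of PCNs supplies the examples that a data extract table would show.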

The other characteristics of consistency were only

reported a few times by participants. Different data

formats were usually illustrated with a data extract ta-

ble, but one participant provided a data summary table

in which the inconsistently formatted values popped

out, and another participant serendipitously discov-

ered the issue from the labels of a bar chart. Inconsis-

tent duplicates involved the values of ﬁnes and were


reported by two participants, using a data extract ta-

ble and a bubble chart. The latter showed the distinct

values of Fine in two data ﬁles, so the presence in one

ﬁle of a small number of £60 Fine records popped out.

5 DISCUSSION

This research helps to improve our understanding of

the methods (with an emphasis on visualization) that

are effective for identifying and illustrating data quality issues. There is a considerable body of previous

work that has developed visualization tools or tech-

niques for data quality investigation (e.g., (Kandel

et al., 2012; Fernstad, 2019; Pham and Dang, 2019)),

but that research has tended to take a tool developer’s

perspective and provide the visualization technique

that the developer thinks is most suitable for each

given aspect of data quality, rather than taking a data-

driven approach (i.e., the visualizations and tables that

our participants created) to investigate the pros and

cons of a broad range of visualization techniques, and

how each can provide “eureka moment” insights.

The research also identified six important characteristics of completeness (concerning duplicates, and missing values, columns & headers) and three of accuracy (concerning extreme values and plausibility)

that are absent from previous data quality taxonomies.

Of course, as has already been noted, those characteristics are known to some data scientists. However,

by documenting the characteristics we make it more

likely that they will be treated equally with the other

characteristics of data quality, and not overlooked by

researchers, educators and practitioners.

5.1 Five Uses of Visualization and Tables

Our results highlighted ﬁve ways in which partici-

pants used the visualization and tabular illustration

methods. Quantifying an issue is a mainstream part

of tools and libraries that are designed for data qual-

ity investigations, through bar and line charts, and the

output of textual information as descriptive statistics

and in summary tables. However, more of those tools

should support perceptual discontinuity (Ruddle and

Hall, 2019) so that bars do not become invisible when

small quantities are being displayed.
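
One simple way to obtain that effect is to impose a minimum drawn height on every non-zero bar. The sketch below is illustrative only: the function bar_heights and its default parameters are assumptions of this example, and it is not necessarily how Ruddle and Hall (2019) implement perceptual discontinuity.

```python
def bar_heights(counts, plot_height_px=200, min_visible_px=3):
    """Scale counts to pixel heights, keeping every non-zero bar visible."""
    largest = max(counts, default=0)
    heights = []
    for count in counts:
        if count == 0 or largest == 0:
            heights.append(0)  # genuinely empty: draw nothing
            continue
        scaled = round(count / largest * plot_height_px)
        # A non-zero count is never allowed to shrink to an invisible bar.
        heights.append(max(scaled, min_visible_px))
    return heights


# Hypothetical record counts with one very small, non-zero category.
print(bar_heights([5883, 2, 0, 1200]))
```

With linear scaling alone, the category with two records would be drawn with a height of zero pixels and would be indistinguishable from the empty category.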

Alerts are an integral part of the visualization

functionality of some tools (e.g., Tableau) but not oth-

ers (e.g., Excel), which hide data quality issues from

users when visualizations are created. The provision

of alerts should be encouraged as standard function-

ality in all visualization software.
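
The kind of alert we have in mind can be sketched as follows. This is a hedged illustration rather than the behavior of any particular tool: the function drop_nulls_with_alert, the set of null markers and the message wording are all assumptions.

```python
# Assumed spellings of a missing value; real data may use others.
NULL_MARKERS = {"", "null", "none", "nan"}


def drop_nulls_with_alert(values, field_name):
    """Return the plottable values plus an alert if any values were dropped."""
    kept = [
        v for v in values
        if v is not None and str(v).strip().lower() not in NULL_MARKERS
    ]
    dropped = len(values) - len(kept)
    alert = None
    if dropped:
        alert = f"{dropped} null value(s) in '{field_name}' not shown"
    return kept, alert


# Hypothetical Balance values containing three missing entries.
values, alert = drop_nulls_with_alert([10.0, None, "null", 25.5, ""], "Balance")
print(values)
print(alert)
```

The important design choice is that the alert is returned alongside the data, so that a chart cannot be drawn without the caller at least receiving the warning.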

Data extract tables were often used to provide examples

of a given issue. A guideline for that is to

annotate the extract to draw users’ attention to the relevant

values/records/columns, as some of our participants

did, exhibiting good practice.

Serendipitous discovery and explaining the nature

of an issue are synonymous with the core capabilities

of data visualization because, as the famous statisti-

cian John Tukey said, “the greatest value of a picture

is when it forces us to notice what we never expected

to see” (Tukey, 1977). Examples of serendipitous dis-

covery included participants noticing outlying graph-

ical elements (e.g., peaks in a line chart or points on a

scatter plot) or unexpected text (e.g., “null” or a vari-

able’s name), formatting or values in the tick labels

of charts. Therefore, another guideline is for users to

always take the time to inspect every label in a visu-

alization – you never know what you will ﬁnd out!

Examples of participants using visualizations to ex-

plain an issue included records that were thought to

be duplicates and inconsistencies in the amount of a

ﬁne, the total paid and the balance.

Serendipitous discovery, and to a lesser extent

explanatory visualizations, depend on patterns pop-

ping out to users so the unexpected becomes obvious.

Classically, pop out occurs in a visualization when

one graphical entity stands out from the others be-

cause of its difference in length, shape, position or

color. However, as our results show, pop out also of-

ten occurred in the axis and legend labels of visualiza-

tions, which led to participants discovering data qual-

ity issues such as missing values, an empty column, a

missing column, a duplicate header, an incorrect data

format, an unusual category name, an implausible pat-

tern of a value, or different data formats. Previous

research has noted that tables are an important visual-

ization idiom in their own right (Bartram et al., 2022),

and our results provided examples where issues such

as a missing variable name, domain violation or dif-

ferent data formats popped out from summary tables.
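
Inspecting every label by eye becomes harder as the number of categories grows, so a simple check can complement it. The sketch below is an assumption-laden illustration: the function suspicious_labels, its three rules and the street names are hypothetical, although the kinds of issue it flags (null-like text, a header repeated as a value, trailing spaces) are ones that our participants reported.

```python
from collections import Counter


def suspicious_labels(values, field_name):
    """Flag labels that are null-like, repeat the field name, or carry spaces."""
    flagged = {}
    for label, n in Counter(values).items():
        text = str(label).strip()
        if text == "" or text.lower() in {"null", "none", "nan"}:
            flagged[label] = (n, "null-like")
        elif text.lower() == field_name.lower():
            flagged[label] = (n, "header repeated as value")
        elif text != str(label):
            flagged[label] = (n, "leading/trailing spaces")
    return flagged


# Hypothetical Location values, as they might appear on a chart axis.
locations = ["High Street", "LOCATION", "null", "High Street ", "Mill Lane"]
for label, (count, reason) in suspicious_labels(locations, "Location").items():
    print(repr(label), count, reason)
```

Such a check only finds the patterns it was written for, whereas reading the labels of a chart can reveal issues that nobody anticipated.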

5.2 Scalability

The ever-increasing size of data (e.g., in terms of

the number of records, variables and distinct values

in variables) presents data scientists with challenges.

Line charts, box plots and scatter plots often scale

well, because the visual properties that cause a graph-

ical entity to pop out still work well if a dataset con-

tains (say) 1000 times more records (e.g., the distinc-

tive peaks in Figure 5a would still appear).

Bar charts and heat maps do not scale very well,

because each bar or heat map cell is discrete, so as

they get more numerous the width of each bar or size

of each cell gets smaller, until they become difﬁcult

to see. Some software avoids that problem by impos-

ing a minimum size for each discrete interval, but that

introduces a new problem which is that users have to

perform many scrolling actions to see all of the data

(e.g., Figure 5b only shows a few of the 5883 PCNs;

they span a 32,000 pixel wide visualization so a user

would have to scroll hundreds of times to see them

all). One approach for partially dealing with the scal-

ability problem is to sort the data. Another is to create

a scatter plot with software that, instead of forcing the

user to scroll, ﬁts all of the data within the plot area

(e.g., Excel or Matplotlib). That causes a lot of over-

plotting for low values, but is effective for discovering

extreme numerical values. A third and more sophisti-

cated approach is for users to interact and create a set

of visualizations that show different levels of detail.
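
The sorting approach can be sketched as follows. Collapsing the remaining categories into a single entry is an assumption of this example, intended as a crude stand-in for showing a second level of detail; the function top_k_with_rest and the sample totals are hypothetical.

```python
def top_k_with_rest(totals, k=3):
    """Sort categories by value and collapse everything after the top k."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    head, tail = ranked[:k], ranked[k:]
    if tail:
        # Summarize the remaining categories so that no scrolling is needed.
        head.append((f"(other {len(tail)})", sum(v for _, v in tail)))
    return head


# Hypothetical Total Paid per PCN; one value is far larger than the rest.
totals = {"PCN-A": 70, "PCN-B": 2500, "PCN-C": 35, "PCN-D": 105, "PCN-E": 50}
print(top_k_with_rest(totals, k=2))
```

Sorting brings extreme high values to the start of the chart, but extreme low values end up hidden in the collapsed entry unless the sort order is reversed.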

5.3 Greater Adoption of Visualization

One striking ﬁnding was the rarity with which partic-

ipants used visualizations for consistency and com-

pleteness issues (28% and 16% of illustrations, re-

spectively) when compared with accuracy (68%), al-

though there was considerable variation within each

of those types of data quality (see Table 2). Coverage

issues, numeric outliers and time-series outliers only

become apparent if users look at details in context

(e.g., individual values against all of the data), which

plays to a general strength of visualization that most

participants exploited. The same is arguably true for

an implausible range and unexpected low/high values,

which were also characteristics of data quality that

participants illustrated more often with a visualization

than a table. The only other characteristic for which

participants used visualization on the majority of oc-

casions was missing values. On six occasions partic-

ipants serendipitously discovered the missing values

from axis labels, and on the other three the visualiza-

tion software provided a null values alert.

So why was visualization not used more often for

the other 21 characteristics? Of course, some characteristics

are inherently well-suited to tables (e.g.,

wrong data type and completely missing header), but

how can greater adoption of visualization be encouraged?

One approach is providing exemplars of more sophis-

ticated visualizations. Some from our results show

the beneﬁts of dimensional stacking (the number of

records for each combination of PCN and day, to

try to identify exact duplicates; Figure 3), trellis lay-

outs (causing an unexpected pattern in the number of

records across three variables to pop out; Figure 6b),

determining speciﬁc criteria to interactively ﬁlter data

prior to creating a visualization (to show a functional

dependency violation where both the Total Paid and

Balance equalled zero; Figure 8b), or interactively

calculating a new combined variable (Total Paid +

Balance) to simplify a three-variable functional de-

pendency violation so that it could be visualized with

an ordinary scatter plot (see Figure 8c).
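
The combined-variable idea can be expressed in a few lines. The sketch below assumes the dependency Balance = Fine - Total Paid and ignores any Discount; the function dependency_violations, the dictionary keys, the tolerance and the sample records are all hypothetical.

```python
def dependency_violations(records, tolerance=0.005):
    """Flag records where Total Paid + Balance differs from Fine."""
    flagged = []
    for record in records:
        # The new combined variable reduces three variables to two.
        combined = record["total_paid"] + record["balance"]
        if abs(combined - record["fine"]) > tolerance:
            flagged.append((record["pcn"], record["fine"], combined))
    return flagged


# Hypothetical records; P2 has Total Paid and Balance both equal to zero.
records = [
    {"pcn": "P1", "fine": 70.0, "total_paid": 70.0, "balance": 0.0},
    {"pcn": "P2", "fine": 70.0, "total_paid": 0.0, "balance": 0.0},
    {"pcn": "P3", "fine": 50.0, "total_paid": 25.0, "balance": 25.0},
    {"pcn": "P4", "fine": 50.0, "total_paid": 60.0, "balance": 0.0},
]
print(dependency_violations(records))
```

Plotting Fine against the combined variable places consistent records on the diagonal of a scatter plot, so the flagged records are the points that lie off it.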

Finally, the following strengths and weaknesses

should be borne in mind. Although the research only

used one dataset, it was real-world data, used “as is”

rather than modified in any way, and comprised

many data files to cover the seven-year period. As

such, the data was typical of the uncurated open data

that is often used in data science projects. Our par-

ticipants were diverse in terms of their academic and

cultural backgrounds, but at the same time were all

students at the beginning of their careers in data sci-

ence rather than having extended experience.

6 CONCLUSIONS

This research investigated the visualization and tab-

ular methods that participants used to illustrate data

quality issues, distinguishing between ﬁve broad

ways in which the methods were used, which range

from those that are central to the mainstream func-

tionality of data quality tools/libraries to serendipi-

tous discovery. We also identiﬁed nine characteristics

of data quality that are not included in existing data

quality taxonomies.

Our ﬁndings point the way to areas where fur-

ther work is needed. One is to encourage the wider

implementation of certain functionality in data qual-

ity visualization software, including alerts, annotation

at the click of a button, semantic encoding (to help

users differentiate between values that are semanti-

cally distinct but numerically similar) and perceptual

discontinuity (to prevent graphical features from be-

ing hidden). The second concerns professional prac-

tice, training and educating data scientists so they are

aware of data quality’s very diverse characteristics

and better equipped to rigorously investigate them.

Finally, further research is required to: (a) run con-

trolled user studies that compare different visualiza-

tion techniques for a suite of benchmark data quality

issues, and (b) investigate effective ways of visualiz-

ing complex data quality issues in large datasets, par-

ticularly for issues that involve multiple variables.

ACKNOWLEDGEMENTS

This research was supported by the Alan Turing Insti-

tute.

REFERENCES

Andrienko, N. and Andrienko, G. (2006). Exploratory anal-

ysis of spatial and temporal data: a systematic ap-

proach. Springer Science & Business Media.

Anscombe, F. J. (1973). Graphs in statistical analysis. The

American Statistician, 27(1):17–21.

Bartram, L., Correll, M., and Tory, M. (2022). Untidy

data: The unreasonable effectiveness of tables. IEEE

Transactions on Visualization & Computer Graphics,

28(01):686–696.

Dungey, S., Beloff, N., Puri, S., Boggon, R., Williams, T.,

and Tate, A. R. (2014). A pragmatic approach for

measuring data quality in primary care databases. In

IEEE-EMBS International Conference on Biomedical

and Health Informatics (BHI), pages 797–800. IEEE.

Fernstad, S. J. (2019). To identify what is not there: A

deﬁnition of missingness patterns and evaluation of

missing value visualization. Information Visualiza-

tion, 18(2):230–250.

Gratzl, S., Lex, A., Gehlenborg, N., Pﬁster, H., and Streit,

M. (2013). Lineup: Visual analysis of multi-attribute

rankings. IEEE Transactions on Visualization and

Computer Graphics, 19(12):2277–2286.

Gschwandtner, T., Aigner, W., Miksch, S., Gärtner, J.,

Kriglstein, S., Pohl, M., and Suchy, N. (2014). Time-

Cleanser: A visual analytics approach for data cleans-

ing of time-oriented data. In Proc. 14th Int. Conf.

Knowledge Technologies and Data-driven Business,

page 18. ACM.

Kandel, S., Parikh, R., Paepcke, A., Hellerstein, J. M., and

Heer, J. (2012). Proﬁler: Integrated statistical analy-

sis and visualization for data quality assessment. In

Proceedings of the working conference on Advanced

visual interfaces, pages 547–554. ACM.

Laranjeiro, N., Soydemir, S. N., and Bernardino, J. (2015).

A survey on data quality: classifying poor data. In

2015 IEEE 21st Paciﬁc rim international symposium

on dependable computing (PRDC), pages 179–188.

IEEE.

Mackinlay, J. (1986). Automating the design of graphical

presentations of relational information. ACM Trans-

actions on Graphics (TOG), 5(2):110–141.

Maguire, E., Rocca-Serra, P., Sansone, S.-A., Davies, J.,

and Chen, M. (2012). Taxonomy-based glyph de-

sign—with a case study on visualizing workﬂows of

biological experiments. IEEE Transactions on Visual-

ization and Computer Graphics, 18(12):2603–2612.

Monroe, M., Lan, R., Lee, H., Plaisant, C., and Shneider-

man, B. (2013). Temporal event sequence simpliﬁca-

tion. IEEE Transactions on Visualization and Com-

puter Graphics, 19(12):2227–2236.

Munzner, T. (2014). Visualization analysis and design.

CRC press.

Pham, V. and Dang, T. (2019). Outliagnostics: Visualizing

temporal discrepancy in outlying signatures of data

entries. In 2019 IEEE Visualization in Data Science

(VDS), pages 29–37. IEEE.

Ruddle, R. A., Fateen, W., Treanor, D., Sondergeld, P.,

and Quirke, P. (2013). Leveraging wall-sized high-resolution

displays for comparative genomics analyses

of copy number variation. In 2013 IEEE Symposium

on Biological Data Visualization (BioVis), pages

89–96. IEEE.

Ruddle, R. A. and Hall, M. (2019). Using miniature vi-

sualizations of descriptive statistics to investigate the

quality of electronic health records. In Proceedings of

the 12th International Joint Conference on Biomedi-

cal Engineering Systems and Technologies-Volume 5:

HEALTHINF, pages 230–238. SciTePress.

Shneiderman, B. (2003). The eyes have it: A task by data

type taxonomy for information visualizations. In The

craft of information visualization, pages 364–371. El-

sevier.

Spence, R. (2001). Information visualization, volume 1.

Springer.

Stolte, C., Tang, D., and Hanrahan, P. (2002). Polaris: A

system for query, analysis, and visualization of multi-

dimensional relational databases. IEEE Transactions

on Visualization and Computer Graphics, 8(1):52–65.

Tufte, E. R. (2006). Beautiful evidence. Graphics Press.

Tukey, J. W. (1977). Exploratory data analysis. Addison-

Wesley Series in Behavioral Science: Quantitative

Methods.

Wang, R. Y. and Strong, D. M. (1996). Beyond accuracy:

What data quality means to data consumers. Journal

of management information systems, 12(4):5–33.

Ward, M., Xie, Z., Yang, D., and Rundensteiner, E. (2011).

Quality-aware visual data analysis. Computational

Statistics, 26(4):567–584.

Weiskopf, N. G. and Weng, C. (2013). Methods and di-

mensions of electronic health record data quality as-

sessment: Enabling reuse for clinical research. Jour-

nal of the American Medical Informatics Association,

20(1):144–151.

Wirth, R. and Hipp, J. (2000). CRISP-DM: Towards a stan-

dard process model for data mining. In Proceedings of

the 4th International Conference on the Practical Ap-

plications of Knowledge Discovery and Data mining,

volume 1, pages 29–40. Manchester.

APPENDIX

The table below lists the 79 data quality issues and

the number of participants who identified each issue.

The last page combines the data quality taxonomies

from seven previous works: a (ISO/IEC 25012:2008),

b (Dungey et al., 2014), c (Gschwandtner et al., 2014),

d (Kandel et al., 2012), e (Laranjeiro et al., 2015),

f (Wang and Strong, 1996), g (Weiskopf and Weng,

2013). An “*” indicates characteristics that were

identiﬁed in the present research but do not appear

in any of those taxonomies.

ACCESSIBILITY: Interpretability

Variable’s name is not included in the dataset’s documentation. 2

COMPLETENESS: Coverage

No rows for certain combinations of Location and Issued value 14

No rows for certain Issued values 43

COMPLETENESS: Duplicate header*

Value is the same as the ﬁeld name (e.g., “LOCATION”) 2

Column names are repeated as rows 63

COMPLETENESS: Exact duplicates

Row is an exact duplicate 35

COMPLETENESS: Uniqueness violation

Uniqueness violation in the PCN ﬁeld 97

COMPLETENESS: Empty column*

Column in data ﬁle does not contain any values 1

Unnamed column in data ﬁle with no values 85

COMPLETENESS: Completely missing variable*

Missing a variable that is in other data ﬁles 23

COMPLETENESS: Missing column name*

Column in data ﬁle does not have a name 2

COMPLETENESS: Completely missing header*

First row of the data ﬁle is used as the column names 12

Data ﬁle has no column names 35

COMPLETENESS: Missing record

Data ﬁle has fewer records than expected 3

Record does not contain any values 7

COMPLETENESS: Missing value

Missing values in the Location ﬁeld 3

Missing values in the Issued ﬁeld 7

Missing values in the Balance ﬁeld 5

Missing values in the PCN ﬁeld 1

COMPLETENESS: Zero value*

Values of zero in the Paid in £ ﬁeld 2

Values of zero in the Total Paid ﬁeld 5

Values of zero in the Balance ﬁeld 1

ACCURACY: Numeric outliers

Sum of Fine is much larger/smaller for one Location 18

A Contravention has a much larger/smaller number of records 11

A Location has a much larger/smaller number of records 14

A value of Total Paid is much larger/smaller 3

The value of Total Paid is much larger for one PCN than the others 1

Sum of Fine is much larger/smaller for one Contravention 6

A value of Balance is much larger/smaller than others 1

Sum of Fine is much larger/smaller for one Location for a specific Issued year

Value of Fine occurs rarely, so may be incorrect 2

A value of Total Paid is much larger/smaller than others for the same value of Fine

Values of Balance and Total Paid are much larger/smaller 1

ACCURACY: Time-series outliers

Sum of Fine is much larger/smaller for one Issued date 25

A Last Pay Date occurs a much larger/smaller number of times 4

Average of Balance is much larger/smaller for one Issued date 4

Sum of Total Paid is much larger/smaller for one Issue Date 2

An Issue Date has a much larger/smaller number of records 2

An Issue Date has a much larger/smaller number of ﬁnes issued 54

Sum of Total Paid is outlier for sum of Fine for one Issued date 1

On a particular day of the week, one Contravention was issued a notably different number of times

A value of Total Paid is much larger/smaller on one date 1

ACCURACY: Special value*

Issued year is 1899 6

ACCURACY: Unusual category name*

Name of a Contravention is much shorter and looks different to others

ACCURACY: Domain violation

Fine has a value that, after taking possible discount into account, is different to those specified in the documentation

Fine has a value that is different to those specified in the documentation

ACCURACY: Validity

Invalid value for Issued date (’R’) 1

ACCURACY: Implausible range

Very long time between date Issued and when ﬁne was paid 45

Last Pay Date is years after ﬁne was issued 1

ACCURACY: Pattern of value is unusual*

Unusual that Total Paid is a decimal value 21

Unusual that Balance is a decimal value 2

Value for Total Paid is decimal and occurs rarely 3

ACCURACY: Same value for too many records

Old ﬁnes all have the same recent Last Paid Date 1

ACCURACY: Unexpected low/high values

The Balance is negative 40

Unexpected relationship between Contraventions and values of Fines

ACCURACY: Data format

Fine values contain currency (£) sign 8

Some Location values have trailing spaces 2

Value has wrong number of decimal places for a Fine 2

ACCURACY: Wrong data type

Fine has wrong data type 5

Fine and Issued have wrong data type 1

Fine, Total Paid and Balance have wrong data type 1

CONSISTENCY: Inconsistent duplicates

Different values of Fine for the same Contravention 1

One data file contains a fine with a value that doesn’t appear in another data file but is not mathematically an outlier.

CONSISTENCY: Violation of functional dependency

The Balance is greater than the Fine 6

Sum of Total Paid is greater than the sum of Fine in some Locations

Total Paid equals Fine, but Discount is non-zero 1

Sum of Fine is similar across Issue Date but sum of Total Paid is not

Balance does not equal Fine - Total Paid 20

Step-change in pattern for Fine vs. Total Paid from one year to another

Last Pay Date is earlier than Issued 2

The Total Paid is greater than the Fine (after accounting for Discount).

Sums of Fine and Total Paid do not match across time 1

Total Paid and Balance are both zero 16

The Total Paid is greater than the Fine, taking Issued date and Discount into account

CONSISTENCY: Different data formats

The date format of Issued is not consistent in the ﬁle. 4

Total Paid values have different number of decimal places 2

Fine has different number of decimal places in different data ﬁles 1

PCN contains an alphabetical character (’A’ not just digits) 1

Total Paid has different formats (and data types) 1

TAXONOMY OF DATA QUALITY TYPES AND CHARACTERISTICS (Level 1–3)

Type Level 1 Level 2 Level 3

Accessibility [a,e,f,g] Interpretability [f]

Completeness [a,b,e,f,g] Missing data [c,d,e] Missing record [d] (also termed: Missing

(also termed: Missingness [g], tuple [c], Unit non-response [b])

Omission [g], Presence [g], Missing value [c,d] Dummy entry [c]

Sensitivity [g]) Item non-response [b]

Semi-empty tuple [c]

Zero value*

Missing variable* Completely missing variable*

Empty column*

Missing header* Completely missing header*

Missing column name*

Duplicates [c,e] Exact duplicates [c]

Uniqueness violation [c,e]

Duplicate header*

Coverage [b] Appropriate amount of data [f] Rate of recording [g]

Concise representation [f]

Relevance [b,f] Coding speciﬁcity [b]

Relevant time intervals [b]

Accuracy [a,b,e,f,g] Ambiguous data [c,e] Abbreviations or imprecise/unusual coding [c]

(also termed: Corrections Extreme [d] Numeric outliers [d]

made [g], Correctness [g], Time-series outliers [d]

Errors [g], Incorrect [d], Unusual category name*

Positive predictive value [g]) Special value*

Incorrect value [c,e] Coded wrongly or not conform to real entity [c]

(also termed: Wrong data [c]) Domain violation [c,e]

Embedded values [c]

Erroneous entry [d]

Extraneous data [d,e]

Incorrect derived values [c]

Measurement or recording error [b]

Misﬁelded [c,d,e]

Misspelling [c,e]

Recording accuracy [b]

Validity [b,g] Invalid substring [c]

Misleading [g] Objectivity [f]

Plausibility [g] Implausible range [c]

(also termed: Unexpected low/high values [c]

Believability [f,g], Same value for too many records [c]

Implausible values [c], Implausible changes of values over time [c]

Trustworthiness [g]) Pattern of value is unusual*

Syntax violation [c] Data format [c]

Wrong data type [c,d,e]

Consistency [a,b,d,e,g] Heterogeneity of semantics [c] Heterogeneity of aggregation/abstraction [c,e]

(also termed: Agreement [g], (also termed: Representational Heterogeneity of measure units [c,d,e]

Concordance [g], Heterogeneity consistency [f]) Inconsistent duplicates [c] Approximate duplicates [c]

of representations [c, d, e], Inconsistent spatial data [c]

Reliability [b, g], Variation [g]) Information refers to different points in time [c,e]

Misspelling (inconsistent) [d]

Naming conﬂicts [c] Synonym/Homonym [c,e]

Ordering [d]

Violation of functional dependency [c,e]

Heterogeneity of syntaxes [c] Different data formats [c]

Different encoding formats [c,e]

Different table structure [c]

Different word orderings [c,e]

Special characters [c,d,e]

References [c] Referential integrity violation [c,e]

Incorrect reference [c,e]

Primary key violation [d]

Circularity among tuples in a self-relationship [c]
