A Strategy for Automating the Presentation of Statistical Graphics for Users without Data Visualization Expertise

A Position Paper

Pere Millán-Martínez¹ and Pedro Valero-Mora²

¹ Universitat de València, València, Spain

² Department of Methodology of the Behavioural Sciences, Universitat de València, València, Spain

Keywords:

Statistical Graphics Taxonomy, Data Visualization, Automatic Presentation, Visual Data Analysis, Graphic

Literacy.

Abstract:

The growing need to convert the data in databases into knowledge for a public without data visualization

expertise requires the ever more precise selection of graphics to be presented to the user for consideration.

This can be achieved through a more detailed characterization of the data as well as the data visualization task

that the user wishes to accomplish. One way to limit the number of possible graphics based on the data is to

characterize the multiple properties that can be described for each variable represented by a column of data.

This paper presents seven dimensions with their respective levels that can serve as a framework for classifying

statistical graphics such that their effectiveness in performing a given task may then be evaluated.

1 INTRODUCTION

Open data policies have made an enormous amount of

data available that needs to be converted into knowl-

edge, and the most effective way of doing it is with

graphics that show the properties and relationships of

the variables in a dataset. If the goal is to help users

unfamiliar with data visualization to better interpret

the data, it is necessary to deﬁne a strategy to ade-

quately limit the number of graphic representations

that are automatically presented for the user’s consideration,

without requiring users to predetermine the

characteristics of the graphic they are looking for.

There have been many attempts to construct an au-

tomated system to present statistical graphics to users,

but none of the strategies employed have received

broad acceptance because they suggest only one sup-

posedly ideal solution or a broad, unranked selection

of possible graphical solutions. These strategies can

be classiﬁed in terms of how they address the char-

acteristics of the data, the characteristics of the user,

the limitations of the hardware and the characteristics

of the sought-after graphic. Figure 1 summarizes

the types of inputs considered by graphic automation

systems.

Figure 1: Types of inputs considered by graphic automation systems.

[Figure: tree diagram rooted at "Graphic automation" with four branches. Data: variables (stand-alone, relations, number), structure. User: perception, task, history. Hardware: graphic display, processing. Graphic: coordinate system, visual variables, type, multiple panels.]

The automated selection of graphics via the characterization of the data, which Kamps (1999) calls “functional design”, can consider different aspects of the data: the characteristics of the variables taken separately; the relationships between variables; the structure of the data; and the number of variables to graphically relate and represent. If the characteristics of the variables taken separately cannot be implicitly deduced from the data, the user’s ability to specify these characteristics is constrained by the number of variables in a dataset and the dimensions of the variables being considered. A much greater effort on the part of the user is required to describe the relationships between the variables if these cannot be implicitly deduced from the data, given that the number of relationships increases exponentially with the number of variables to be related. Additionally, selecting the graphic types to be presented based on the structure of the data may not be useful given that the structure of data may be easily changed without losing the information.

Millán-Martínez P. and Valero-Mora P.

A Strategy for Automating the Presentation of Statistical Graphics for Users without Data Visualization Expertise - A Position Paper.

DOI: 10.5220/0006220702940298

In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017), pages 294-298

ISBN: 978-989-758-228-8

Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

With respect to the characteristics of the user, we

may consider: human perceptual capabilities, or what

Kamps (1999) calls “perceptual design”; the ques-

tions the user is seeking answers to, or what Casner

(1990) calls “task-based graphic design”; and ﬁnally

user preferences as deduced from a user’s graphic selection

history, which is known as a “recommender

system”. Graphic automations that rely on any of

these user characteristics require that the gamut of

possible graphic representations be previously limited

based on the characteristics of the data; otherwise the

system might evaluate and suggest graphics that can-

not be constructed from the data.

Hardware limitations have more of an impact on

aesthetics and usability than on the selection of the

graphic types to be presented, especially consider-

ing that the users of databases tend to work on desk-

top computers with a broadband internet connection;

therefore, these limitations can be overlooked a priori

when classifying statistical graphics.

With respect to the characteristics of the sought

after graphic, an automated system could require the

user to deﬁne the coordinate system, the visual vari-

ables to be used, the more or less generic graphic type,

or its possible decomposition into multiple panels.

Deﬁning any of these, however, requires a certain de-

gree of data visualization knowledge; therefore, this

strategy is not suited to users who are unfamiliar

with data visualization.

For an automated graphic selection system aimed at users who are unfamiliar with data visualization, the strategy is to first identify the graphics that can be constructed from the data and then restrict the selection to those that serve a specific purpose with the greatest effectiveness.

Of the various aspects of data characteristics, the

characterization of variables taken separately is a very

effective method because it allows us to break down

the problem of what types of graphics to present

into three parts: the number of variables to be represented;

the dimensions to be considered to characterize

those variables; and the mutually exclusive levels

to be considered for each dimension. However, the

strategies used thus far by automated graphic systems

have not tried to classify statistical graphics based on

a multidimensional characterization of the variables

taken separately.

The aim of this paper is to identify the various di-

mensions of characteristics that can be described for

a variable as represented in a data column, and that

can then be used to limit the gamut of graphics to be

evaluated prior to presenting them to the user. Before doing

so, we will review the state of the art in automated

graphic selection and then present a list of dimensions

with mutually exclusive levels that make it possible to

limit the set of possible graphic types.

2 PREVIOUS WORKS

Bertin (1967, p.34) refers to the components of the

various variable measurement scales as levels of orga-

nization, and he distinguishes three such levels: qual-

itative for those concepts that can simply be differen-

tiated; ordered for those variables that have an inher-

ent sequence; and quantitative for those with a quan-

tiﬁable quality. Another characteristic Bertin uses to

limit the gamut of acceptable graphics is the length

of a variable, defined by Bertin (1967, p.33) as the

number of divisions that it enables us to identify: variables

are short if their length is four or less, medium if it

lies between five and 15, and long if it is greater

than 15. This classification yields

two dimensions with three levels each. Ware (2004,

p.24) relates Bertin’s measurement scales to those

of Stevens (1946) and distinguishes 4 levels of vari-

able attributes: nominal, ordinal, interval and ratio.

Another classiﬁcation is proposed by Bachi (1968,

p.10), who characterizes variables according to se-

quence type, such as linear, circular, geographical and

unordered qualitative sequences. Further, Bachi iden-

tiﬁed subcategories for linear sequences, differentiat-

ing quantitative, temporal and qualitative linear se-

quences. Bachi also identiﬁed subcategories for ge-

ographic sequences, distinguishing between distribu-

tion and movement. This classiﬁcation yields one di-

mension with seven levels.
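Bertin's length thresholds quoted above (short for four or fewer divisions, medium for five to 15, long for more than 15) can be sketched as a simple lookup. This is a minimal illustration only: the function name `bertin_length` and the string labels are hypothetical and do not come from Bertin or from any of the systems reviewed here.

```python
def bertin_length(n_divisions: int) -> str:
    """Classify a variable by Bertin's length thresholds as quoted in the text.

    Hypothetical helper: the name and the returned labels are illustrative.
    """
    if n_divisions < 1:
        raise ValueError("a variable needs at least one division")
    if n_divisions <= 4:
        return "short"
    if n_divisions <= 15:
        return "medium"
    return "long"


# Example: a weekday variable has seven divisions.
weekday_length = bertin_length(7)
```

Combined with the three levels of organization, a lookup of this kind gives the two-dimensional, three-by-three characterization attributed to Bertin above.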

The BHARAT system (Gnanamgari, 1981), a pio-

neer in the automated presentation of graphics, uses

multiple dimensions to characterize variables, such

as: continuity, totality, cardinality (deﬁned as the

number of unique values for a variable), units and

range. Of these five dimensions, Gnanamgari identifies

levels for only the first two, which are dichotomous;

for the other three, ad hoc rules are established to

evaluate the graphics to be presented to the user.

Other systems, such as APT (Mackinlay, 1986),

BOZ (Casner, 1990) and Vista (Senay and Ignatius,

1994) also use Bertin’s levels of organization, but not

his variable length classification. Thus, their characterization

is unidimensional with three levels. This

characterization has variants, such as that used in

SAGE (Roth and Mattis, 1990), which subdivides

the ordinal and cardinal levels according to whether

they refer to amounts or reference values, and adds

a second dimension that refers to the fundamental

physical magnitudes of time, space, temperature and

mass. The NSP system (Robertson, 1990) distin-

guishes variables according to whether they are nom-

inal or ordinal. Further, for nominal variables, it dis-

tinguishes between those with multiple values and

those with one single value, and for ordinal variables,

it distinguishes between discrete and continuous val-

ues, thus yielding only one dimension with 4 levels.

The Polaris system (Stolte et al., 2002) only distin-

guishes between ordinal and quantitative levels, while

the Tableau system (Mackinlay et al., 2007), which

is derived from Polaris, distinguishes between the categorical

level, with three sublevels based on whether

the values are normal, dates or geographical units,

and the quantitative level with two sublevels, based

on whether the variables are predictor or response

variables, thus yielding one dimension with 5 lev-

els. Lastly, the VizRec recommender system (Mutlu

et al., 2015) distinguishes 3 levels based on whether

the variables are categorical, temporal or numeric.

3 CHARACTERIZATION OF VARIABLES TAKEN SEPARATELY

Usually the variables stored in a database are char-

acterized based on the memory reserved by the com-

puter system to store variable values. Thus we end

up with, for example, datatype classiﬁcations such

as Boolean, text strings of a predeﬁned maximum

length, numeric values that can be whole or real num-

bers with an established decimal point precision, etc.

The characterization we propose is based on the prop-

erties that can be described for each column of values

for a variable, which may affect the selection of one

graphic over another.

3.1 Characterization Dimensions

Graphic Measurement Scale. Variables can be

classiﬁed according to measurement scale, and we

can distinguish between qualitative and quantita-

tive variables, the difference being that the val-

ues of the former cannot be summed up. Among

the qualitative variables, we can distinguish be-

tween unordered and ordered, the difference be-

ing that the values of the latter maintain a lesser-

to-greater relationship. Among the quantitative

variables, we can distinguish three levels: conventionally

bounded scalar values, which are measured

on an interval scale, such as temperature

in degrees Celsius; scalars bounded on one end,

which are measured on a ratio scale, such as a

person’s age; and lastly scalars bounded on two ends,

which are measured on an absolute scale, such as

the probability of a certain event occurring, which

is bounded between zero and one. A variable’s

measurement scale can be transformed in the direction

of greater to lesser restrictions; for instance,

scalars bounded on two ends can be transformed

into scalars bounded on one end, and these can in

turn be transformed into conventionally bounded

scalars, which can then be transformed into or-

dered categories that can then be transformed into

unordered categories. The graphic measurement

scale allows us to distinguish, for example, be-

tween a simple point graph, a simple bar chart and

a dichotomous pie chart.

Cardinality Factor. This dimension relates the cardinality

of a variable’s data (meaning the actual

number of unique values in the data) to the cardinality

of the variable (this being the potential

number of observable values). This dimension

distinguishes between sample type values (when

the values in the data are few compared to those

that can be potentially observed or of interest) and

population type values (when the values in the

data coincide or practically coincide with those

that can be potentially observed or of interest).

Population type values are typically factors, cate-

gories or equidistant intervals of quantitative vari-

ables within an interval of interest. The cardinal-

ity factor allows us to distinguish, for example,

between a histogram and a point graph with drop

lines that connect points to one of its axes.

Sequentiality. This dimension addresses the possibility

that the order in which a variable’s values appear

in a data column contains information about

the sequence in which they were observed. Typ-

ically, this order is found in the “date” variable

of a temporal series, but information on sequence

can be contained in any data column with sequen-

tially ordered values. Sequentially ordered data

columns can be transformed into non-sequentially

ordered data columns when the position of the

data in the column is irrelevant in the graphic rep-

resentation to be suggested. Sequentiality allows

us to distinguish, for instance, between a scatter

plot and a scatter plot with points connected by a

line that follows a sequence.

IVAPP 2017 - International Conference on Information Visualization Theory and Applications


Cyclicality. Cyclicality, meaning the domain’s cyclic

or non-cyclic character, concerns only quantita-

tive and ordered qualitative variables, since un-

ordered qualitative variables lack an inherent or-

der. Periodic variables can sometimes be trans-

formed into aperiodic variables and vice versa, as

is the case, for example, with the variable “time”

depending on the data analysis being performed.

The characterization of a variable as cyclic is es-

pecially important when determining whether to

suggest graphics with polar, cylindrical and spher-

ical coordinate systems.

Explicitness. A variable’s explicitness helps us distinguish

the explicit level, when the value scale

must be represented graphically, from the am-

biguous level, when the scale should not be

represented graphically, but other characteristics

should be, such as the number of values or the

unique values contained in a data column. An ex-

plicit variable can be transformed into an ambigu-

ous variable by simply omitting the scale values,

and vice versa. A clear example of a variable rep-

resented in an ambiguous manner is the various

observations of value pairs for two variables in a

scatter plot, where each point can be discerned,

but not the order that it occupies in the data col-

umn nor the name of the corresponding informant.

Variable Length. This dimension was deﬁned by

Bertin (1967) as the number of unique values

that it is useful to identify, and that inﬂuences

the use of one visual variable over another and

the use of translations, rotations and reﬂections.

As previously noted, Bertin distinguishes between

short, medium and long variables, but automated

graphic systems tend to establish ad hoc rules

when evaluating what graphics to present. The

levels that we believe are the most relevant are:

variables with a unique value, such as the name of

the winner of a race or an observed temperature,

since many graphical methods were developed to

represent single observations; dichotomous vari-

ables that are very common in datasets and may

suggest the use of reﬂections to facilitate the com-

parison between pairs of observations; and vari-

ables with lengths greater than two. A length

threshold of between 5 and 12 could also be estab-

lished to identify, for example, variables suscepti-

ble to being represented via multiple panels or a

retinal visual variable instead of a spatial visual

variable. The ideal number of levels in this di-

mension depends on the accuracy that is intended

in a graphics automation system.

Georeferencing. This dimension is applicable only

to qualitative variables and considers the possi-

bility that these categories can be linked with

geospatial points, lines and polygons. The char-

acterization of a variable as georeferenced allows

us to present values as postal, census and political

units on a map.
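The seven dimensions above can be read as one record per data column. The following is a minimal sketch under stated assumptions, not an implementation from this paper: the names `Scale`, `Length`, `VariableProfile`, `can_transform` and `length_level` are hypothetical, the optional length threshold between 5 and 12 is omitted, and `length_level` simply counts distinct values in the data, which only approximates "the number of unique values that it is useful to identify".

```python
from dataclasses import dataclass
from enum import IntEnum


class Scale(IntEnum):
    """Graphic measurement scale, ordered from least to most restrictive."""
    UNORDERED = 0  # qualitative, unordered categories
    ORDERED = 1    # qualitative, ordered categories
    INTERVAL = 2   # conventionally bounded scalars (e.g. degrees Celsius)
    RATIO = 3      # scalars bounded on one end (e.g. a person's age)
    ABSOLUTE = 4   # scalars bounded on two ends (e.g. a probability)


class Length(IntEnum):
    """Variable length levels named in the text (optional threshold omitted)."""
    UNIQUE = 1       # a single value
    DICHOTOMOUS = 2  # exactly two values
    MANY = 3         # more than two values


@dataclass(frozen=True)
class VariableProfile:
    """Hypothetical record holding one level per dimension for a data column."""
    scale: Scale
    population_type: bool  # cardinality factor: True = population, False = sample
    sequential: bool
    cyclic: bool
    explicit: bool
    length: Length
    georeferenced: bool

    def __post_init__(self) -> None:
        # Cyclicality concerns only quantitative and ordered qualitative variables.
        if self.cyclic and self.scale == Scale.UNORDERED:
            raise ValueError("an unordered qualitative variable cannot be cyclic")
        # Georeferencing is described as applicable only to qualitative variables.
        if self.georeferenced and self.scale > Scale.ORDERED:
            raise ValueError("georeferencing applies only to qualitative variables")


def can_transform(source: Scale, target: Scale) -> bool:
    """A scale may only move toward fewer restrictions (or stay unchanged)."""
    return target <= source


def length_level(values) -> Length:
    """Assumption: approximate variable length by the count of distinct values."""
    n_unique = len(set(values))
    if n_unique == 0:
        raise ValueError("empty data column")
    if n_unique == 1:
        return Length.UNIQUE
    if n_unique == 2:
        return Length.DICHOTOMOUS
    return Length.MANY


# Example: an "age" column treated as a sample-type ratio variable.
age = VariableProfile(
    scale=Scale.RATIO,
    population_type=False,
    sequential=False,
    cyclic=False,
    explicit=True,
    length=length_level([31, 45, 27, 45]),
    georeferenced=False,
)
```

Encoding the measurement scale as an ordered enumeration makes the permitted direction of transformation a single comparison, and the two validation checks mirror the applicability limits stated for cyclicality and georeferencing.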

4 DISCUSSION

The dimensions presented above allow us to divide

the domain of known graphics into ever more limited

subgroups, but if we consider the possible transforma-

tions between levels for each dimension, the gamut of

possible graphics expands. Using the graphic mea-

surement scale, for example, enables us to divide

single-variable graphics into ﬁve groups, two-variable

graphics into ﬁfteen groups, and three-variable graph-

ics into 35 groups. If we also use the cardinality

factor, the number of possible combinations for single-variable

graphics increases to 10, for two-variable graphics it

grows to 55, and for three-variable graphics it jumps

to 220. Thus, the use of successive dimensions al-

lows us to more precisely narrow the set of appropri-

ate graphics.
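The group counts quoted above match the number of unordered selections, with repetition allowed, of k variables from n levels, that is, C(n + k - 1, k). This reading is an assumption inferred from the figures (5, 15, 35 for n = 5 and 10, 55, 220 for n = 10) rather than a formula stated in the text; the function name `n_groups` is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def n_groups(levels: int, variables: int) -> int:
    """Count multisets of size `variables` drawn from `levels` distinct levels."""
    return comb(levels + variables - 1, variables)


# Graphic measurement scale alone: five levels.
scale_only = [n_groups(5, k) for k in (1, 2, 3)]

# Measurement scale crossed with the dichotomous cardinality factor: 5 * 2 levels.
scale_and_cardinality = [n_groups(5 * 2, k) for k in (1, 2, 3)]

# Brute-force enumeration for the three-variable, five-level case.
enumerated = len(list(combinations_with_replacement(range(5), 3)))

print(scale_only)             # [5, 15, 35]
print(scale_and_cardinality)  # [10, 55, 220]
```

Counting multisets rather than ordered tuples treats a graphic of variables (A, B) and one of (B, A) as belonging to the same group, which is what makes the two-variable count 15 rather than 25.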

The multidimensional characterization of vari-

ables taken separately has few comparable an-

tecedents. The characterization implemented by the

BHARAT system comes closest. It considers continuity,

totality, cardinality, units and range, but identifies

dichotomous levels for only the first two dimensions

and establishes ad hoc thresholds for the

other dimensions when evaluating which graphics to

present to the user. Because the automated presenta-

tion of one graphic or another requires the classiﬁca-

tion of graphics based on the dimensions of the vari-

ables represented, we believe that it is necessary to

unambiguously deﬁne each level and limit them to a

reduced number. It is for this reason that we have not

introduced dimensions such as, for example, units of

measurement; doing so would produce as many levels

as combinations of fundamental magnitudes.

Once graphics have been classiﬁed according to a

multidimensional characterization of the data, we can

then further reduce the gamut of graphics to be pre-

sented to the user by using a multidimensional charac-

terization of the tasks to be performed, as proposed by

Schulz et al. (2013). In order to evaluate the task performance

efficiency of a graphic drawn from a gamut

of possibilities based on the data, it is necessary to include

information provided by the user, whether via

experiments that measure a user’s task performance

efficiency with respect to a gamut of graphics, the

score users give to the graphic-task pairing, or a

user’s graphic selection history when that information

is available for the task being performed.

5 CONCLUSION

This article critiques the strategies employed by dif-

ferent automated graphic systems and focuses speciﬁ-

cally on their characterization of variables taken sepa-

rately. We also identify as many as seven dimensions

of attributes that can be used to describe a column of

data, and that help determine the appropriateness of

one graphic over another, and, consequently, can be

used to limit the gamut of graphics to evaluate prior

to presenting them to the user. Of these seven dimen-

sions, one has ﬁve levels, another four, and the rest

are dichotomous; thus, it is not difﬁcult to character-

ize the variables if their properties cannot be implic-

itly deduced from the data.

This paper presents a framework for classifying

statistical graphics without requiring that the user

predetermine the characteristics of the graphic being

sought and without considering hardware limitations.

Ideally, this classiﬁcation should be complemented

with information about the effectiveness with which

the user performs various tasks. In this way, the gamut

of graphics presented to the user can be reduced to just

one or only a few possible solutions.

ACKNOWLEDGEMENTS

This work was developed during a research residency

facilitated by Michael Friendly in his DataVis

laboratory at the Department of Psychology of York

University (Toronto, ON, Canada).

REFERENCES

Bachi, R. (1968). Graphical rational patterns: A new ap-

proach to graphical presentation of statistics. Trans-

action Publishers.

Bertin, J. (1967). Sémiologie graphique. Mouton, Paris.

Casner, S. M. (1990). Task-analytic design of graphic pre-

sentations. Ph.D. dissertation.

Gnanamgari, S. (1981). Information presentation through

default displays. Ph.D. dissertation.

Kamps, T. (1999). Diagram Design: A Constructive The-

ory. Springer Berlin Heidelberg.

Mackinlay, J. (1986). Automating the design of graphical

presentations of relational information. ACM Trans.

Graph., 5(2):110–141.

Mackinlay, J., Hanrahan, P., and Stolte, C. (2007). Show

me: Automatic presentation for visual analysis. Visu-

alization and Computer Graphics, IEEE Transactions

on, 13(6):1137–1144.

Mutlu, B., Veas, E., Trattner, C., and Sabol, V. (2015).

Vizrec: A two-stage recommender system for person-

alized visualizations. In Proceedings of the 20th In-

ternational Conference on Intelligent User Interfaces

Companion, IUI Companion ’15, pages 49–52, New

York, NY, USA. ACM.

Robertson, P. (1990). A methodology for scientiﬁc data vi-

sualisation: choosing representations based on a natu-

ral scene paradigm. In Visualization, 1990. Visualiza-

tion ’90., Proceedings of the First IEEE Conference

on, pages 114–123.

Roth, S. F. and Mattis, J. (1990). Data characterization for

intelligent graphics presentation. In Proceedings of

the SIGCHI Conference on Human Factors in Com-

puting Systems, pages 193–200. ACM.

Senay, H. and Ignatius, E. (1994). A knowledge-based sys-

tem for visualization design. Computer Graphics and

Applications, IEEE, 14(6):36–47.

Stevens, S. S. (1946). On the theory of scales of measure-

ment. Science, 103:677–680.

Stolte, C., Tang, D., and Hanrahan, P. (2002). Polaris: A

system for query, analysis, and visualization of mul-

tidimensional relational databases. Visualization and

Computer Graphics, IEEE Transactions on, 8(1):52–

65.

Ware, C. (2004). Information Visualization: Perception for

Design. Interactive Technologies. Elsevier Science,

2nd edition.
