A Strategy for Automating the Presentation of Statistical Graphics for
Users without Data Visualization Expertise
A Position Paper
Pere Millán-Martínez¹ and Pedro Valero-Mora²
¹Universitat de València, València, Spain
²Department of Methodology of the Behavioural Sciences, Universitat de València, València, Spain
Keywords:
Statistical Graphics Taxonomy, Data Visualization, Automatic Presentation, Visual Data Analysis, Graphic
Literacy.
Abstract:
The growing need to convert the data in databases into knowledge for a public without data visualization
expertise requires the ever more precise selection of graphics to be presented to the user for consideration.
This can be achieved through a more detailed characterization of the data as well as the data visualization task
that the user wishes to accomplish. One way to limit the number of possible graphics based on the data is to
characterize the multiple properties that can be described for each variable represented by a column of data.
This paper presents seven dimensions with their respective levels that can serve as a framework for classifying
statistical graphics such that their effectiveness in performing a given task may then be evaluated.
1 INTRODUCTION
Open data policies have made an enormous amount of
data available that needs to be converted into knowl-
edge, and the most effective way of doing it is with
graphics that show the properties and relationships of
the variables in a dataset. If the goal is to help users
unfamiliar with data visualization to better interpret
the data, it is necessary to define a strategy to ade-
quately limit the number of graphic representations
that are automatically presented for the user’s
consideration, without requiring users to predetermine the
characteristics of the graphic they are looking for.
There have been many attempts to construct an au-
tomated system to present statistical graphics to users,
but none of the strategies employed have received
broad acceptance because they suggest only one sup-
posedly ideal solution or a broad, unranked selection
of possible graphical solutions. These strategies can
be classified in terms of how they address the char-
acteristics of the data, the characteristics of the user,
the limitations of the hardware and the characteristics
of the sought-after graphic. Figure 1 summarizes
the types of inputs considered by graphic automation
systems.
Figure 1: Types of inputs considered by graphic automation
systems. [Diagram with four branches: Data (variables taken
stand-alone, their relations and number; structure), User
(perception, task, history), Hardware (graphic display,
processing) and Graphic (coordinate system, visual
variables, type, multiple panels).]
The automated selection of graphics via the characterization
of the data, which Kamps (1999) calls “functional
design”, is capable of considering different aspects
of the data: the characteristics of the variables
taken separately; the relationships between variables;
the structure of the data; and the number of variables
to graphically relate and represent. If the characteristics
of the variables taken separately cannot be implicitly
deduced from the data, the user’s ability to specify
these characteristics is constrained by the number of
variables in a dataset and the dimensions of the variables
being considered. A much greater effort on the
part of the user is required to describe the relationships
between the variables if these cannot be implicitly
deduced from the data, given that the number of
relationships increases exponentially with the number
of variables to be related. Additionally, selecting the
graphic types to be presented based on the structure
of the data may not be useful given that the structure
of data may be easily changed without losing the
information.
Millán-Martínez P. and Valero-Mora P.
A Strategy for Automating the Presentation of Statistical Graphics for Users without Data Visualization Expertise - A Position Paper.
DOI: 10.5220/0006220702940298
In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017), pages 294-298
ISBN: 978-989-758-228-8
Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
With respect to the characteristics of the user, we
may consider: human perceptual capabilities, or what
Kamps (1999) calls “perceptual design”; the ques-
tions the user is seeking answers to, or what Casner
(1990) calls “task-based graphic design”; and finally
user preferences as deduced from a user’s graphic
selection history, which is known as a “recommender
system”. Graphic automations that rely on any of
these user characteristics require that the gamut of
possible graphic representations be previously limited
based on the characteristics of the data; otherwise the
system might evaluate and suggest graphics that can-
not be constructed from the data.
Hardware limitations have more of an impact on
aesthetics and usability than on the selection of the
graphic types to be presented, especially consider-
ing that the users of databases tend to work on desk-
top computers with a broadband internet connection;
therefore, these limitations can be overlooked a priori
when classifying statistical graphics.
With respect to the characteristics of the
sought-after graphic, an automated system could require the
user to define the coordinate system, the visual variables
to be used, the more or less generic graphic type,
or its possible decomposition into multiple panels.
Defining any of these, however, requires a certain degree
of data visualization knowledge; therefore, this
strategy is not suited to users who are unfamiliar
with data visualization.
For an automated graphic selection system aimed at
users who are unfamiliar with data visualization,
the strategy is to first identify the graphics that
could possibly work based on the data, and then to
restrict the selection to only those that can serve
a specific purpose with the greatest effectiveness.
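The two-stage strategy described above (limit by the data first, then rank by effectiveness for a purpose) can be sketched as follows. This is only an illustrative outline, not the authors’ system: the `Graphic` type, the `suggest` function, the small catalog and the effectiveness scores are all hypothetical placeholders introduced here.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Graphic(NamedTuple):
    name: str
    feasible: Callable[[dict], bool]  # True if the graphic can be built from the data


def suggest(catalog: List[Graphic], data: dict, task: str,
            effectiveness: Dict[Tuple[str, str], float], top: int = 3) -> List[str]:
    # Stage 1: keep only the graphics that can be constructed from the data.
    feasible = [g for g in catalog if g.feasible(data)]
    # Stage 2: rank the survivors by effectiveness for the task
    # (graphic-task pairs without a known score default to 0).
    ranked = sorted(feasible,
                    key=lambda g: effectiveness.get((g.name, task), 0.0),
                    reverse=True)
    return [g.name for g in ranked[:top]]


# Hypothetical catalog and placeholder scores, for illustration only.
catalog = [
    Graphic("bar chart", lambda d: d["scale"] == "qualitative"),
    Graphic("pie chart", lambda d: d["scale"] == "qualitative"),
    Graphic("histogram", lambda d: d["scale"] == "quantitative"),
]
scores = {("bar chart", "compare"): 0.9, ("pie chart", "compare"): 0.4}

print(suggest(catalog, {"scale": "qualitative"}, "compare", scores))
# ['bar chart', 'pie chart']
```

Filtering before ranking means that a graphic which cannot be constructed from the data is never scored, whatever its effectiveness value.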
Of the various aspects of data characteristics, the
characterization of variables taken separately is a very
effective method because it allows us to break down
the problem of what types of graphics to present
into three parts: the number of variables to be represented;
the dimensions to be considered to characterize
those variables; and the mutually exclusive levels
to be considered for each dimension. However, the
strategies used thus far by automated graphic systems
have not tried to classify statistical graphics based on
a multidimensional characterization of the variables
taken separately.
The aim of this paper is to identify the various di-
mensions of characteristics that can be described for
a variable as represented in a data column, and that
can then be used to limit the gamut of graphics to be
evaluated prior to presenting them to the user. Before doing
so, we will review the state of the art in automated
graphic selection and then present a list of dimensions
with mutually exclusive levels that make it possible to
limit the set of possible graphic types.
2 PREVIOUS WORKS
Bertin (1967, p.34) refers to the components of the
various variable measurement scales as levels of organization,
and he distinguishes three such levels: qualitative
for those concepts that can simply be differentiated;
ordered for those variables that have an inherent
sequence; and quantitative for those with a quantifiable
quality. Another characteristic Bertin uses to
limit the gamut of acceptable graphics is the length
of a variable, defined by Bertin (1967, p.33) as its
number of divisions. He identifies short variables as
those whose length is equal to or less than four, long
variables as those whose length is greater than 15, and
medium variables as those with lengths between five
and 15. This classification yields two dimensions with
three levels each. Ware (2004, p.24) relates Bertin’s
measurement scales to those of Stevens (1946) and
distinguishes four levels of variable attributes: nominal,
ordinal, interval and ratio.
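Bertin’s length thresholds described above can be written as a small rule. The following is a minimal sketch; the function name `bertin_length_class` is hypothetical, and approximating the “number of divisions” by the count of distinct values in a column is an assumption made here for illustration.

```python
def bertin_length_class(values) -> str:
    """Classify a data column as a short, medium or long variable.

    Assumption: the variable's length is approximated by the number
    of distinct values observed in the column.
    """
    length = len(set(values))
    if length <= 4:
        return "short"    # length equal to or less than four
    if length <= 15:
        return "medium"   # length between five and 15
    return "long"         # length greater than 15


print(bertin_length_class(["yes", "no", "yes"]))  # short
print(bertin_length_class(range(12)))             # medium
print(bertin_length_class(range(40)))             # long
```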
Another classification is proposed by Bachi (1968,
p.10), who characterizes variables according to se-
quence type, such as linear, circular, geographical and
unordered qualitative sequences. Further, Bachi iden-
tified subcategories for linear sequences, differentiat-
ing quantitative, temporal and qualitative linear se-
quences. Bachi also identified subcategories for ge-
ographic sequences, distinguishing between distribu-
tion and movement. This classification yields one di-
mension with seven levels.
The BHARAT system (Gnanamgari, 1981), a pio-
neer in the automated presentation of graphics, uses
multiple dimensions to characterize variables, such
as: continuity, totality, cardinality (defined as the
number of unique values for a variable), units and
range. Of these five dimensions, Gnanamgari identifies
levels for only the first two, which are dichotomous,
and for the other three he establishes ad hoc
rules to evaluate the graphics to be presented to the
user.
Other systems, such as APT (Mackinlay, 1986),
BOZ (Casner, 1990) and Vista (Senay and Ignatius,
1994) also use Bertin’s levels of organization, but not
his variable length classification. Thus, their charac-
terization is unidimensional with three levels. This
characterization has variants, such as that used in
SAGE (Roth and Mattis, 1990), which subdivides
the ordinal and cardinal levels according to whether
they refer to amounts or reference values, and adds
a second dimension that refers to the fundamental
physical magnitudes of time, space, temperature and
mass. The NSP system (Robertson, 1990) distin-
guishes variables according to whether they are nom-
inal or ordinal. Further, for nominal variables, it dis-
tinguishes between those with multiple values and
those with one single value, and for ordinal variables,
it distinguishes between discrete and continuous val-
ues, thus yielding only one dimension with 4 levels.
The Polaris system (Stolte et al., 2002) only distin-
guishes between ordinal and quantitative levels, while
the Tableau system (Mackinlay et al., 2007), which
derived from Polaris, distinguishes between the cate-
gorical level, with three sublevels based on whether
the values are normal, dates or geographical units,
and the quantitative level with two sublevels, based
on whether the variables are predictor or response
variables, thus yielding one dimension with 5 lev-
els. Lastly, the VizRec recommender system (Mutlu
et al., 2015) distinguishes 3 levels based on whether
the variables are categorical, temporal or numeric.
3 CHARACTERIZATION OF
VARIABLES TAKEN
SEPARATELY
Usually the variables stored in a database are char-
acterized based on the memory reserved by the com-
puter system to store variable values. Thus we end
up with, for example, datatype classifications such
as Boolean, text strings of a predefined maximum
length, numeric values that can be whole or real num-
bers with an established decimal point precision, etc.
The characterization we propose is based on the prop-
erties that can be described for each column of values
for a variable, which may affect the selection of one
graphic over another.
3.1 Characterization Dimensions
Graphic Measurement Scale. Variables can be
classified according to measurement scale, and we
can distinguish between qualitative and quantita-
tive variables, the difference being that the val-
ues of the former cannot be summed up. Among
the qualitative variables, we can distinguish be-
tween unordered and ordered, the difference be-
ing that the values of the latter maintain a lesser-
to-greater relationship. Among the quantitative
variables, we can distinguish three levels: conven-
tionally bounded scalar values, which are mea-
sured on an interval scale, such as the temperature
in degrees Celsius; scalars bounded on one end,
which are measured on a ratio scale, such as a
person’s age; and lastly scalars bounded on two ends,
which are measured on an absolute scale, such as
the probability of a certain event occurring, which
is bounded between zero and one. A variable
can transform its measurement scale in the direc-
tion of greater to lesser restrictions; for instance,
scalars bounded on two ends can be transformed
into scalars bounded on one end, and these can in
turn be transformed into conventionally bounded
scalars, which can then be transformed into or-
dered categories that can then be transformed into
unordered categories. The graphic measurement
scale allows us to distinguish, for example, be-
tween a simple point graph, a simple bar chart and
a dichotomous pie chart.
Cardinality Factor. This dimension relates the cardinality
of a variable’s data (meaning the actual
number of unique values in the data) with the car-
dinality of the variable (this being the potential
number of observable values). This dimension
distinguishes between sample type values (when
the values in the data are few compared to those
that can be potentially observed or of interest) and
population type values (when the values in the
data coincide or practically coincide with those
that can be potentially observed or of interest).
Population type values are typically factors, cate-
gories or equidistant intervals of quantitative vari-
ables within an interval of interest. The cardinal-
ity factor allows us to distinguish, for example,
between a histogram and a point graph with drop
lines that connect points to one of its axes.
Sequentiality. This dimension addresses the possibility
that the order in which a variable’s values appear
in a data column contains information about
the sequence in which they were observed. Typ-
ically, this order is found in the “date” variable
of a temporal series, but information on sequence
can be contained in any data column with sequen-
tially ordered values. Sequentially ordered data
columns can be transformed into non-sequentially
ordered data columns when the position of the
data in the column is irrelevant in the graphic rep-
resentation to be suggested. Sequentiality allows
us to distinguish, for instance, between a scatter
plot and a scatter plot with points connected by a
line that follows a sequence.
IVAPP 2017 - International Conference on Information Visualization Theory and Applications
Cyclicality. Cyclicality, meaning the domain’s cyclic
or non-cyclic character, concerns only quantita-
tive and ordered qualitative variables, since un-
ordered qualitative variables lack an inherent or-
der. Periodic variables can sometimes be trans-
formed into aperiodic variables and vice versa, as
is the case, for example, with the variable “time”
depending on the data analysis being performed.
The characterization of a variable as cyclic is es-
pecially important when determining whether to
suggest graphics with polar, cylindrical and spher-
ical coordinate systems.
Explicitness. A variable’s explicitness helps us distinguish
the explicit level, when the value scale
must be represented graphically, from the am-
biguous level, when the scale should not be
represented graphically, but other characteristics
should be, such as the number of values or the
unique values contained in a data column. An ex-
plicit variable can be transformed into an ambigu-
ous variable by simply omitting the scale values,
and vice versa. A clear example of a variable rep-
resented in an ambiguous manner is the various
observations of value pairs for two variables in a
scatter plot, where each point can be discerned,
but not the order that it occupies in the data col-
umn nor the name of the corresponding informant.
Variable Length. This dimension was defined by
Bertin (1967) as the number of unique values
that it is useful to identify, and that influences
the use of one visual variable over another and
the use of translations, rotations and reflections.
As previously noted, Bertin distinguishes between
short, medium and long variables, but automated
graphic systems tend to establish ad hoc rules
when evaluating what graphics to present. The
levels that we believe are the most relevant are:
variables with a unique value, such as the name of
the winner of a race or an observed temperature,
since many graphical methods were developed to
represent single observations; dichotomous vari-
ables that are very common in datasets and may
suggest the use of reflections to facilitate the com-
parison between pairs of observations; and vari-
ables with lengths greater than two. A length
threshold of between 5 and 12 could also be estab-
lished to identify, for example, variables suscepti-
ble to being represented via multiple panels or a
retinal visual variable instead of a spatial visual
variable. The ideal number of levels in this di-
mension depends on the accuracy that is intended
in a graphics automation system.
Georeferencing. This dimension is applicable only
to qualitative variables and considers the possi-
bility that these categories can be linked with
geospatial points, lines and polygons. The char-
acterization of a variable as georeferenced allows
us to present values as postal, census and political
units on a map.
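The seven dimensions listed above can be gathered into a single record per data column. The following is a minimal sketch under stated assumptions: all class, field and function names are hypothetical, and the three-level `Length` enum is just one possible choice, since the text leaves the number of length levels open. The two validation rules and the direction of allowed scale transformations follow the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Scale(IntEnum):
    """Graphic measurement scale, ordered from least to most restrictive."""
    UNORDERED = 1   # unordered categories
    ORDERED = 2     # ordered categories
    INTERVAL = 3    # conventionally bounded scalars (e.g. degrees Celsius)
    RATIO = 4       # scalars bounded on one end (e.g. a person's age)
    ABSOLUTE = 5    # scalars bounded on two ends (e.g. a probability)


class CardinalityFactor(Enum):
    SAMPLE = "sample"
    POPULATION = "population"


class Length(Enum):  # assumed three-level variant
    UNIQUE = "unique value"
    DICHOTOMOUS = "dichotomous"
    MULTIPLE = "more than two"


@dataclass(frozen=True)
class VariableProfile:
    scale: Scale
    cardinality_factor: CardinalityFactor
    sequential: bool
    cyclic: bool
    explicit: bool
    length: Length
    georeferenced: bool

    def __post_init__(self):
        # Cyclicality concerns only quantitative and ordered qualitative variables.
        if self.cyclic and self.scale is Scale.UNORDERED:
            raise ValueError("an unordered qualitative variable cannot be cyclic")
        # Georeferencing is applicable only to qualitative variables.
        if self.georeferenced and self.scale > Scale.ORDERED:
            raise ValueError("only qualitative variables can be georeferenced")


def can_transform(source: Scale, target: Scale) -> bool:
    """A scale can only move toward fewer restrictions (or stay the same)."""
    return target <= source


age = VariableProfile(Scale.RATIO, CardinalityFactor.SAMPLE, sequential=False,
                      cyclic=False, explicit=True, length=Length.MULTIPLE,
                      georeferenced=False)
print(can_transform(age.scale, Scale.ORDERED))    # True
print(can_transform(Scale.ORDERED, Scale.RATIO))  # False
```

Using an ordered enumeration for the scale keeps the transformation rule to a single comparison; the remaining dimensions are dichotomous and fit plain booleans or two-valued enums.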
4 DISCUSSION
The dimensions presented above allow us to divide
the domain of known graphics into ever more limited
subgroups, but if we consider the possible transforma-
tions between levels for each dimension, the gamut of
possible graphics expands. Using the graphic mea-
surement scale, for example, enables us to divide
single-variable graphics into five groups, two-variable
graphics into fifteen groups, and three-variable graph-
ics into 35 groups. If we also use the cardinality
factor, the number of possible combinations for single-variable
graphics increases to 10, for two-variable graphics it
grows to 55, and for three-variable graphics it jumps
to 220. Thus, the use of successive dimensions al-
lows us to more precisely narrow the set of appropri-
ate graphics.
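The group counts quoted above are consistent with counting unordered selections of variables with repetition allowed, that is, C(n + k - 1, k) for n level combinations and k variables. This reading is an interpretation of the figures, not a formula stated in the text; the function name below is hypothetical. A short check, assuming Python 3.8 or later for `math.comb`:

```python
from math import comb


def graphic_groups(level_combinations: int, variables: int) -> int:
    """Unordered selections, with repetition, of `variables` variable types."""
    return comb(level_combinations + variables - 1, variables)


# 5 levels of the graphic measurement scale alone;
# 5 x 2 = 10 combinations once the dichotomous cardinality factor is added.
for n in (5, 10):
    print(n, [graphic_groups(n, k) for k in (1, 2, 3)])
# 5 [5, 15, 35]
# 10 [10, 55, 220]
```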
The multidimensional characterization of vari-
ables taken separately has few comparable an-
tecedents. The characterization implemented by the
BHARAT system comes closest. It considers continuity,
totality, cardinality, units and range, but identifies
dichotomous levels for only the first two dimensions
and establishes ad hoc thresholds for the
other dimensions when evaluating which graphics to
present to the user. Because the automated presenta-
tion of one graphic or another requires the classifica-
tion of graphics based on the dimensions of the vari-
ables represented, we believe that it is necessary to
unambiguously define each level and limit them to a
reduced number. It is for this reason that we have not
introduced dimensions such as, for example, units of
measurement; doing so would produce as many levels
as combinations of fundamental magnitudes.
Once graphics have been classified according to a
multidimensional characterization of the data, we can
then further reduce the gamut of graphics to be pre-
sented to the user by using a multidimensional charac-
terization of the tasks to be performed, as proposed by
Schulz et al. (2013). In order to evaluate the task performance
efficiency of a graphic drawn from a gamut
of possibilities based on the data, it is necessary to include
information provided by the user, whether via
experiments that measure a user’s task performance
efficiency with respect to a gamut of graphics, the
score users give to the graphic-task pairing, or a
user’s graphic selection history when that information
is available for the task being performed.
5 CONCLUSION
This article critiques the strategies employed by dif-
ferent automated graphic systems and focuses specifi-
cally on their characterization of variables taken sepa-
rately. We also identify as many as seven dimensions
of attributes that can be used to describe a column of
data, and that help determine the appropriateness of
one graphic over another, and, consequently, can be
used to limit the gamut of graphics to evaluate prior
to presenting them to the user. Of these seven dimen-
sions, one has five levels, another four, and the rest
are dichotomous; thus, it is not difficult to character-
ize the variables if their properties cannot be implic-
itly deduced from the data.
This paper presents a framework for classifying
statistical graphics without requiring that the user
predetermine the characteristics of the graphic being
sought and without considering hardware limitations.
Ideally, this classification should be complemented
with information about the effectiveness with which
the user performs various tasks. In this way, the gamut
of graphics presented to the user can be reduced to just
one or only a few possible solutions.
ACKNOWLEDGEMENTS
This work has been developed during a research res-
idency facilitated by Michael Friendly in his DataVis
laboratory at the Department of Psychology of York
University (Toronto, ON, Canada).
REFERENCES
Bachi, R. (1968). Graphical rational patterns: A new ap-
proach to graphical presentation of statistics. Trans-
action Publishers.
Bertin, J. (1967). Sémiologie graphique. Mouton, Paris.
Casner, S. M. (1990). Task-analytic design of graphic pre-
sentations. Ph.D. dissertation.
Gnanamgari, S. (1981). Information presentation through
default displays. Ph.D. dissertation.
Kamps, T. (1999). Diagram Design: A Constructive The-
ory. Springer Berlin Heidelberg.
Mackinlay, J. (1986). Automating the design of graphical
presentations of relational information. ACM Trans.
Graph., 5(2):110–141.
Mackinlay, J., Hanrahan, P., and Stolte, C. (2007). Show
me: Automatic presentation for visual analysis. Visu-
alization and Computer Graphics, IEEE Transactions
on, 13(6):1137–1144.
Mutlu, B., Veas, E., Trattner, C., and Sabol, V. (2015).
Vizrec: A two-stage recommender system for person-
alized visualizations. In Proceedings of the 20th In-
ternational Conference on Intelligent User Interfaces
Companion, IUI Companion ’15, pages 49–52, New
York, NY, USA. ACM.
Robertson, P. (1990). A methodology for scientific data vi-
sualisation: choosing representations based on a natu-
ral scene paradigm. In Visualization, 1990. Visualiza-
tion ’90., Proceedings of the First IEEE Conference
on, pages 114–123.
Roth, S. F. and Mattis, J. (1990). Data characterization for
intelligent graphics presentation. In Proceedings of
the SIGCHI Conference on Human Factors in Com-
puting Systems, pages 193–200. ACM.
Senay, H. and Ignatius, E. (1994). A knowledge-based sys-
tem for visualization design. Computer Graphics and
Applications, IEEE, 14(6):36–47.
Stevens, S. S. (1946). On the theory of scales of measure-
ment. Science, 103:677–680.
Stolte, C., Tang, D., and Hanrahan, P. (2002). Polaris: A
system for query, analysis, and visualization of mul-
tidimensional relational databases. Visualization and
Computer Graphics, IEEE Transactions on, 8(1):52–
65.
Ware, C. (2004). Information Visualization: Perception for
Design. Interactive Technologies. Elsevier Science,
2nd edition.