Domain-independent Data-to-Text Generation for Open Data
Andreas Burgdorf, Micaela Barkmann, André Pomp and Tobias Meisen
Chair of Technologies and Management of Digital Transformation, University of Wuppertal, Wuppertal, Germany
Keywords:
Open Data, Data to Text Generation, Natural Language Generation, Transformer, Semantic Data Management.
Abstract:
As a result of the efforts of the Open Data movement, the number of Open Data portals and the amount of data published in them are steadily increasing. One aspect that greatly increases the usability of data but is nevertheless often neglected is the enrichment of data with textual data documentation. However, creating descriptions of sufficient quality is time-consuming and thus cost-intensive. One approach to this problem is Data-to-Text Generation, which creates descriptions from raw data. In the past, promising results were achieved on data from Wikipedia. Based on a seq2seq model developed for such purposes, we investigate whether this technique can also be applied in the Open Data domain and which challenges arise. In three studies, we reproduce the results of a previous work and apply the model to additional data sets that pose new challenges in terms of data nature and data volume. We conclude that previous methods are not suitable for the Open Data sector without further modification, but the results still exceed our expectations and show the potential of applicability.
1 INTRODUCTION
Nowadays, large amounts of heterogeneous data are produced daily in various contexts. In order to promote public development based on data, the Open Data Charter (ODC) (ODC, 2013) was initiated. The ODC is an international collaboration between experts and governments with the aim of making data publicly available in an appropriate way. Five principles have been elaborated for the implementation of the Charter and the achievement of its goals: (1) Open Data by Default, (2) Quality and Quantity, (3) Useable by All, (4) Releasing Data for Improved Governance and (5) Releasing Data for Innovation. Thousands of Open Data Portals have been created, often with a regional or thematic focus. To generate added value and to comply with the five principles, data sets require a good infrastructure to be easily searchable and discoverable. However, Burgdorf et al. (Burgdorf et al., 2020) observed that the ODC does not prescribe any standardizations, and thus unhindered access to data is not guaranteed. The metadata provided along with data published on Open Data Portals is therefore a critical aspect for the (re-)usability of the data.
Following Chandola and Booker (Chandola and Booker, 2022), we understand metadata as all information provided in addition to the actual data to be published. Metadata are, so to speak, "data about data", containing information on the origin, type, interpretation, dates, descriptions, etc. They specify the given data and provide context. Metadata can help make the associated data easier to understand and interpret by data consumers. Unfortunately, the lack of standardization, synonyms, various formats (Schauppenlehner and Muhar, 2018), ambiguity (Tygel et al., 2016) or the absence of metadata hinder reuse. This circumstance becomes even more pronounced when multiple data portals have to be visited to aggregate certain data (Burgdorf et al., 2020). Nevertheless, metadata is an important source of information, both for the processing systems that eventually add semantic meaning to data and for the human consumer. For both purposes, textual and human-readable data documentation is a crucial part of metadata. Burgdorf et al. propose to build exactly this bridge between modern Natural Language Processing methods and Semantic Modeling. To this end, they outlined different research directions to achieve this objective, such as the identification of methods and the compilation of a data set aligned with the requirements. Nevertheless, the proposed research perspective depends on the availability of textual data documentation for all collected data. However, such documentation is not always available, human-readable, or of good quality (Schauppenlehner and Muhar, 2018). Furthermore, the authors argue that, even if provided, there is also no guarantee that the
textual data documentation actually correlates with
the associated data.
To enrich the data landscape of Open Data Portals on the one hand and to support research in metadata-driven automated Semantic Data Management on the other hand, we propose an automated and domain-independent generation approach for textual data documentation using Data-to-Text Generation methods. For this purpose, we investigate whether we can successfully apply existing data-to-text generation models to the Open Data sector and whether we can achieve satisfactory results even with minimal amounts of data. We also examine which evaluation method is best suited to assess the quality of generated data documentation.
In the remainder of this study, we provide a brief introduction to the Natural Language Generation approach we use for our experiments. We present three experiments utilizing different data sets, report their results, and discuss how well the selected method performed in each experiment. Finally, we give an outlook on what has to be done to actually implement NLG in the Open Data sector.
2 METHODS
To examine the use of Data-to-Text Generation in the Open Data sector, this paper builds on the work of Chen et al. (Chen et al., 2019), who placed similar requirements on a Data-to-Text Generation (DTG) model as we do, apart from the application in Open Data Portals.
Chen et al. present a structure-aware seq2seq model based on the GPT-2 language model. It encodes field information from a given table into the cell memory and state of an LSTM and thus allows an internal structural representation of the given table within the framework. This is achieved using a modified LSTM which has an additional field gate. To efficiently incorporate the additional information about the table into the generation, they employ a Dual Attention mechanism that allows for both word-level and field-level attention. Finally, to teach the model when to copy values from the table and when to generate new words, the authors include a trainable function that calculates a copy probability for copying versus generation. Above all, they showcase Few-Shot settings with only 50-500 training instances across multiple domains, where they achieved strong performance and outperformed the previous best BLEU baselines by 8.0 points. These methods provide a good basis for approaching our research objectives. In our study, we first tried to replicate the results of Chen et al. (Chen et al., 2019) using the WikiBio data set (Lebret et al., 2016). Beyond that, we applied the model to the domain-overlapping ToTTo data set (Parikh et al., 2020) to test its generalization capabilities. In a third step, we applied the model to the VC-SLAM data set (Burgdorf et al., 2022), which is based on data from Open Data Portals and which is very limited in quantity. This allows us to examine how realistic a Few-Shot setting is in the real Open Data sector.
In the following, we introduce the main ideas of Chen et al. (Chen et al., 2019) and the theoretical setup of their model. We then present the data sets used in this work, leaving out for now the practical processing steps needed to fit these data sets to the model. Finally, we present the evaluation methods used in this work.
2.1 Baseline Method
The authors start with the statement that conventional neural-based end-to-end approaches for NLG that take structured data or knowledge as input are very "data-hungry" (Chen et al., 2019, p. 1). As this makes their "adoption for real-world applications difficult" (Chen et al., 2019, p. 1), the authors propose the task of Few-Shot Data-to-Text Generation. With the underlying research questions (1) Can we significantly reduce human annotation effort to achieve reasonable performance using neural NLG models?; and (2) Can we make the best use of generative pre-training, as prior knowledge, to generate text from structured data?; the authors introduce a model architecture based on content selection from the input data and on generating natural language text with the help of a pre-trained Language Model.
According to the authors, one needs two skills to describe information in a table: (1) select and copy factual content from the table; and (2) compose grammatically correct sentences that bring those facts together, whereby the second skill is not restricted to any domain. The task of forming fluent and coherent sentences can thus be detached from the task-specific components of DTG and be provided in the form of a pre-trained Language Model that represents the "innate" language skill of the neural DTG model. In this way, the authors bypass data-intensive training because the content-selection skill can be learned "relatively quickly" (Chen et al., 2019, p. 1).
As previously mentioned, the authors use an architecture separated into a content-selection mechanism and a pre-trained Language Model. A switch policy is applied to decouple the framework into those two tasks. Figure 1 shows a schematic sketch of the approach.
Figure 1: Overview of the approach of Chen et al. (Chen et al., 2019). Illustration modified and adapted from (Chen et al., 2019, p. 2).
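To make the switch policy concrete, the following is a minimal sketch of how such a mixture of copying and generating could be computed at a single decoding step. The function name and the toy distributions are our own; the original implementation of Chen et al. operates on GPT-2 logits and attention weights rather than hand-set vectors.

```python
import numpy as np

def switch_policy_step(p_vocab, p_copy_attn, p_copy):
    """Mix LM generation with copying, as in a switch policy.

    p_vocab     -- next-token distribution of the language model
    p_copy_attn -- attention weights over the table tokens, projected
                   onto the same vocabulary ids
    p_copy      -- learned scalar in [0, 1]: probability of copying
                   from the table instead of generating
    """
    return p_copy * p_copy_attn + (1.0 - p_copy) * p_vocab

# Toy example: vocabulary of 5 ids; the table puts mass on ids 2 and 4.
p_vocab = np.array([0.40, 0.30, 0.10, 0.15, 0.05])
p_attn  = np.array([0.00, 0.00, 0.70, 0.00, 0.30])
print(switch_policy_step(p_vocab, p_attn, p_copy=0.6))
```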
Original Problem Formulation. The input data is semi-structured. The goal is to automatically generate a natural language description based on that data, using only a few hundred training instances. We have semi-structured data in the form of attribute-value pairs, formalized as

$\{R_i : V_i\}_{i=1}^{n}$    (1)

with $R_i$ representing the attribute, $V_i$ representing the value, and $n$ the size of the table. Both $R_i$ and $V_i$ can either be a number, a phrase, or a sentence. Further, each value is represented as a sequence of tokens,

$V_i = \{v_j\}_{j=1}^{m_i}$    (2)

where $m_i$ is the length of the value sequence. This leads to the effect that for each piece of information $v_j$, all the information about its attribute $R_i$ and its position $j$ in the value sequence is available.
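As an illustration of this representation, the following sketch flattens a table of attribute-value pairs into (token, attribute, position) triples; the function name and the example table are our own and merely mimic the formalization above.

```python
def linearize_table(table):
    """Flatten attribute-value pairs {R_i: V_i} into
    (token, attribute, position) triples, so that every value token
    v_j keeps its attribute R_i and its position j in the sequence."""
    triples = []
    for attribute, value in table.items():
        for j, token in enumerate(str(value).split(), start=1):
            triples.append((token, attribute, j))
    return triples

table = {"name": "Ada Lovelace", "birth_date": "10 December 1815"}
print(linearize_table(table))
# [('Ada', 'name', 1), ('Lovelace', 'name', 2),
#  ('10', 'birth_date', 1), ('December', 'birth_date', 2), ...]
```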
Language Model. For generation, Chen et al.
(Chen et al., 2019) used a pre-trained Language
Model. The currently most prominent Neural Lan-
guage Models are GPT from OpenAI (GPT-1: (Rad-
ford et al., 2018); GPT-2: (Radford et al., 2019); GPT-
3: (Brown et al., 2020)) and BERT (Devlin et al.,
2018). The language model used here is GPT-2. It
is a transformer-based NLM trained on a data set of
8 million web pages (approximately 40 GB of text).
The published GPT-2 model contains 117 million pa-
rameters and 12 Transformer layers. In their model,
Radford et al. (Radford et al., 2019) chose to represent the input using Byte-Pair Encoding (BPE).
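The effect of BPE can be illustrated with a few lines of code. Chen et al. build on the original GPT-2 release; purely for illustration, the sketch below uses the Hugging Face transformers package and its "gpt2" checkpoint, which corresponds to the published 117-million-parameter model.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Open Data portals in Wuppertal")
print(tokens)
# Rare words are split into sub-word units, e.g. 'Wuppertal' becomes
# several BPE tokens, while frequent words remain whole.
print(tokenizer.convert_tokens_to_ids(tokens))
```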
2.2 Data Sets
In this part, the data sets used for the studies are pre-
sented in more detail.
WikiBio Data Set. The Wikipedia Biography Data Set (short WikiBio, https://github.com/DavidGrangier/wikipedia-biography-dataset) gathers approximately 728,000 biographies from Wikipedia. Rémi Lebret, David Grangier, and Michael Auli built this data set in 2016 (Lebret et al., 2016) in connection with their work "Neural Text Generation from Structured Data with Application to the Biography Domain". Their paper introduced a neural model for DTG which generates biographical sentences from fact tables. In contrast to the data sets of prevailing related works on DTG, this self-created data set was significantly larger than anything existing up to that point: it contains around 728,000 samples with a vocabulary of over 400,000 words. Essentially, each sample consists of two parts: a text part and the associated structured data.
ToTTo Data Set. The ToTTo data set was published
at the beginning of 2020 by Parikh et al. (Parikh et al.,
2020). It is a designated open-domain data-to-text
data set in the English language with over 120,000
instances. It consists of tables taken from different
domains and articles from Wikipedia. In order to pre-
vent overlaps with the WikiBio data set (Lebret et al.,
2016), Wikipedia infoboxes were excluded from the
collection. When choosing Wikipedia tables, table-
sentence pairs were selected that overlapped in at least
three non-zero digits. This way, mainly statistical ta-
bles were included.
VC-SLAM Data Set. The VC-SLAM (Versatile Corpus for Semantic Labeling and Modeling) corpus (Burgdorf et al., 2022) originally comes from a different domain than DTG, namely Ontology-Based Data Management (OBDM). The focus of the corpus is to advance the developmental landscape of Semantic Mapping as an essential part of OBDM. In this context, Semantic Mapping describes the process of mapping an attribute (e.g., from a data set) to the corresponding entry in an ontology (Burgdorf et al., 2022). Unlike typical DTG data sets, the corpus contains an ontology along with the associated data. The corpus itself consists of 101 data sets coming from different domains of the "(smart) city" context. To implement this limitation in the collection, only data records containing geo-references were considered. However, under the assumption that
the data falls under this context, they can come from
many different domains such as speed limits, pub-
lic restrooms, or air pollution (Burgdorf et al., 2022,
p. 7). The data was gathered from an extensive search
of over 190 Open Data Portals. In addition to the con-
text criterion, the data also had to provide a textual
data description in English to be included in the set.
2.3 Evaluation
In our studies, we use the BLEU-4 (Papineni et al., 2002), ROUGE-4 (Lin, 2004), and PARENT (Dhingra et al., 2019) evaluation methods, which are suitable for the evaluation of DTG results. The BLEU-4 values are calculated and documented by the model during training, so that values are available over time in intervals of 30 epochs. ROUGE-4 scores are calculated separately and retrospectively. For the PARENT scores, we set the λ weight to 0.5 for all studies. Whilst BLEU and ROUGE are not optimal evaluation methods for DTG, they have been and still are used in the majority of cases. For data sets such as WikiBio, where the textual data documentation and the associated tabular data do not offer much room for interpretation and inference, the possible summaries of data records vary proportionally less than with, for example, larger historical data sets, and ROUGE and BLEU can certainly provide a good orientation value. For larger data records, in contrast, PARENT is a more appropriate method to estimate the occurrence of phenomena such as hallucinations, divergences, and omissions. It is thus better suited for the evaluation of DTG models. Nevertheless, the combination of the BLEU, ROUGE, and PARENT scores gives a good quantification of the model's capabilities and allows a differentiated conclusion.
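To illustrate the core idea that distinguishes PARENT from BLEU and ROUGE, the following much-simplified sketch counts an n-gram of the prediction as correct if it either occurs in the reference or is fully backed by the table. The official metric of Dhingra et al. (2019) instead uses smoothed entailment probabilities and combines reference and table recall via the λ weight; this sketch only conveys the intuition, and the function names are our own.

```python
from collections import Counter

def ngrams(tokens, n=4):
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

def entailed_precision(prediction, reference, table_tokens, n=4):
    """Fraction of prediction n-grams that occur in the reference
    text or whose tokens all occur in the table (simplified)."""
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    table = set(table_tokens)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    hits = sum(count for gram, count in pred.items()
               if gram in ref or all(tok in table for tok in gram))
    return hits / total
```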
3 STUDIES
3.1 Study 1: Conceptual Methodological Replication
The first study aims to conduct a conceptual, method-
ological replication of Chen et al. (Chen et al., 2019).
To get an intuition for how the model works, we repli-
cate one of the training settings presented in (Chen
et al., 2019). For evaluation, the authors have only
used the methods BLEU and ROUGE. We addition-
ally evaluate with PARENT.
Experimental Setup. Chen et al. (Chen et al., 2019) use WikiBio as data set, and they further create two more sets according to the same principle and structure for the domains Books and Songs by crawling Wikipedia. In total, the data sets contain 6,452 instances for Wiki_books, 14,787 instances for WikiBio, and 13,079 instances for Wiki_songs. Table 1 shows the statistical properties of the respective input tables and target summaries.
For optimization, Chen et al. (Chen et al., 2019) used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0003. The field gate l_t is applied, and the copy loss weight λ from the switch policy is set to 0.7. Also, the Dual Attention mechanism is applied. The PARENT-λ weight is set to 0.5 for all studies. The model's hyperparameters are set to a hidden size of 500, a field embedding size of 768, and a position embedding of size 5. The number of epochs is not specified explicitly in the paper (Chen et al., 2019); however, the number of epochs in the original code is set to 5000. Due to computational limitations, we ran 330 epochs for all studies.
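For reference, the training configuration of study 1 can be summarized as follows; the key names are our own, as the original code of Chen et al. passes these values as command-line flags.

```python
config = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "copy_loss_weight": 0.7,      # lambda of the switch policy
    "use_field_gate": True,
    "use_dual_attention": True,
    "hidden_size": 500,
    "field_embedding_size": 768,
    "position_embedding_size": 5,
    "epochs": 330,                # 5000 in the original code
    "parent_lambda": 0.5,         # evaluation only
}
```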
Results. The results of the evaluation of our replication and the values published by Chen et al. (Chen et al., 2019) are quite close. Table 2 shows our values for BLEU-4 and ROUGE-4, and Table 3 those of Chen et al. (Chen et al., 2019). Although the results are not identical, we can claim that the values differ within a normal range due to different sampled data records and training time.
Since PARENT scores are not available for the original results of Chen et al., they cannot be compared. However, it can be observed that in our study, in the evaluations with BLEU-4 and ROUGE-4, the Wiki_songs set performs best, while when table information is included in the evaluation process, as in PARENT, the Wiki_books set performs best. We suspect that the reason for the poorer performance of the WikiBio set is that its target summaries may be more diverse and/or show more divergences, as has been shown by Dhingra et al. (Dhingra et al., 2019). This is less likely the case for the Wiki_books and Wiki_songs sets.
While the BLEU-4 and ROUGE-4 scores can only
tell us to what degree prediction and target text match,
the PARENT scores are more relevant in the context
of DTG. Interpreting the PARENT score, we can say
that the reference text or the table entails 66.8% of
all n-grams from the total predictions. However, the
prediction only contains about one-fifth of the infor-
mation from the table and the target text.
Figure 2 shows the copy loss during training. While the copy loss for the data sets Wiki_books and Wiki_songs, after a common initial descent, settles at a significantly lower level, the copy loss for WikiBio remains at a higher value.
Table 1: Study 1: Statistics of table properties of the Wiki sets.

Set                      Cells               Rows                Columns (Attributes)
                     Min  Max  Mean      Min  Max  Mean      Min  Max  Mean
Train  WikiBio         5   82  15.3        1    2   1.1        5   80  13.9
       Wiki_books      3   18  10.6        1    1   1          3   18  10.6
       Wiki_songs      2   18   9.1        1    1   1          2   18   9.1
Valid  WikiBio         2   18  10.5        1    1   1          2   18  10.5
       Wiki_books      3   52  16.7        1    2   1.2        3   40  13.9
       Wiki_songs      1   20   9.4        1    1   1          1   20   9.4
Table 2: Study 1: Results for the three data set variations obtained after 330 epochs with 200 training instances. Architecture and parameters as in (Chen et al., 2019). Data sets used for training were randomly sampled from the original sets.

Set           BLEU-4   ROUGE-4 F-Score   PARENT precision   PARENT recall   PARENT F-Score
Wiki_books     35.2         21.7               66.8              36.1             44.3
WikiBio        33.4         16.6               61.7              26.1             34.1
Wiki_songs     36.9         26.8               66.2              33.5             43.9
Table 3: Study 1: Evaluation results for the Wiki sets from Chen et al. (Chen et al., 2019) after 5000 epochs.

Set           BLEU-4   ROUGE-4 F-Score
Wiki_books     37.9         25.0
WikiBio        36.1         22.1
Wiki_songs     30.4         30.1
Figure 2: Study 1: Development of the copy loss.
The copy loss indicates the performance of the copy probability term p_copy, which aims to learn when to generate over the vocabulary and when to copy a word from the table for the prediction. In preprocessing, values from the input table were matched with the target text. During training, the copy probability was maximized at these positions. High or fluctuating values of the copy loss thus mean that the model has problems learning a suitable mapping between the behavior of copying values from the table and generating new words over the vocabulary. The behavior of the copy loss for WikiBio again indicates that it may contain more divergences than the other two sets.
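The following sketch reflects our reading of this mechanism: the copy probability p_copy is pushed towards 1 at target positions whose token was matched to a table value in preprocessing. Function and variable names are our own, and the original loss of Chen et al. may contain additional terms.

```python
import numpy as np

def copy_loss(p_copy_per_step, copy_mask):
    """-log p_copy, averaged over the target positions whose token
    was matched to a table value in preprocessing (copy_mask == 1)."""
    p = np.clip(p_copy_per_step, 1e-8, 1.0)
    matched = copy_mask.astype(bool)
    if not matched.any():
        return 0.0
    return float(np.mean(-np.log(p[matched])))

# Target of length 5; tokens at positions 2 and 3 come from the table.
print(copy_loss(np.array([0.1, 0.2, 0.9, 0.8, 0.1]),
                np.array([0,   0,   1,   1,   0])))
```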
Figure 3: Study 1: Development of the loss.
The overall model loss (cf. Figure 3) shows a similar picture regarding the behavior of WikiBio compared to Wiki_books and Wiki_songs. While all sets show an initial substantial decrease, WikiBio remains at a higher level than the other two sets. This can be observed very well around epoch 250, where WikiBio shows a slight, simultaneous increase in both the copy loss (Figure 2) and the overall loss (Figure 3). However, there seem to be other mechanisms that dampen the relatively strong fluctuation of the copy loss.
Discussion. Overall, we can claim that we were able to reproduce the results of Chen et al. (Chen et al., 2019) on the sets WikiBio, Wiki_books, and Wiki_songs. Although the values do not fully match their results, we believe that this may be due to the training duration of only 330 epochs and slight variations between our sampled sets and the authors' sets. Although, according to the BLEU-4 and ROUGE-4 scores, Wiki_songs achieves the best results in our study, the best performing set according to the PARENT evaluation, Wiki_books, matches the set with the best scores of Chen et al. (Chen et al., 2019). However, since no PARENT scores are available for the original results of Chen et al. (Chen et al., 2019), we can only speculate whether the best values in the BLEU-4 and ROUGE-4 scores would also be reflected in PARENT.
Concerning our sets, we strongly hypothesize that there are more divergences in WikiBio than in the other two. We infer this from the combination of the worse performance in the evaluation scores and the course of the copy loss. The values here are clearly higher and indicate that the model has problems finding a function for the copy-or-generate behavior of tokens for the prediction. At these points, hallucinations arise because the model learns to use supposedly contextless content words. This, in turn, leads to a lower precision score because the table or the target summary entails fewer n-grams of the prediction. These developments in the copy loss are reflected in the overall loss of the model. While Wiki_books and Wiki_songs remain at a lower level, the loss of WikiBio stands out from them.
The results of this study strengthen our intention to apply the model to a more diverse data set: on the one hand, to test the generalization ability when domains are no longer trained separately from each other, and on the other hand, to observe the model's behavior with more diverse and larger input tables. In addition, the results suggest that PARENT scores provide a more nuanced view of the performance of a DTG model.
3.2 Study 2: Methodological Replication with ToTTo
The second study aims to apply the model to a data set that is more diverse both in terms of the topics covered and in terms of the type of records. ToTTo contains statistics about, e.g., sports, elections, and the sciences. Such data requires a certain degree of deduction skill and the ability to generalize. Also, unlike in Chen et al. (Chen et al., 2019), the different domains are no longer trained separately, in order to examine generalizability.
Experimental Setup. The ToTTo data set offers the possibility to include metadata to different degrees in a model's training. We perform a total of six runs, which are divided into two main categories: (1) sets containing only direct table information such as header and subtitle, marked by TT, and (2) table information as in TT plus the page title, denoted by PT. Since the ToTTo data set comes entirely from Wikipedia entries, page title in this context means the title of the entire Wikipedia entry. For both categories there are three training sets: (1) few-shot with 200 training instances, (2) extended with 400 training instances, and (3) standard with the set split into Train (10%), Valid (10%) and Test (80%). Tables 4 and 5 show the statistical properties of the respective input tables and target summaries. Compared to the Wiki sets, the model has to cope with tables that are about 28 times larger and with a slightly decreased number of tokens per target summary (20 vs. 18). All sets of a category (TT, PT) consist of the same subset of records for a better estimation of the effect of the training size. In total, our studies include 42,452 data records from the ToTTo data set. We excluded records with multiple row and column spans to keep the table format as simple as possible. We also excluded tables that contain field values with more than 20 tokens, to avoid tables that contain text rather than data points.
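These two exclusion rules can be expressed as a simple record filter. The sketch below assumes the ToTTo JSON layout, in which a table is a list of rows whose cells carry value, row_span, and column_span fields; the function name is our own.

```python
def keep_record(record, max_value_tokens=20):
    """Return False for tables with merged cells (multiple row or
    column spans) or with field values that look like running text."""
    for row in record["table"]:
        for cell in row:
            if cell["row_span"] > 1 or cell["column_span"] > 1:
                return False            # merged cells
            if len(cell["value"].split()) > max_value_tokens:
                return False            # text rather than data points
    return True
```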
We only considered the edited, ”final” description
in all sets, as we believe that hallucinations can best
be avoided with a clean table-description alignment.
For the rest, we use the same hyperparameters as in study 1. Again, we trained all sets for 330 epochs. Dual Attention, switch policy, and field gate are applied, the copy loss weight λ is again set to 0.7, and the PARENT-λ weight is set to 0.5. The learning rate remains at 0.0003, along with the Adam optimizer.
Results. We first analyze the evaluation results of all six sets. Table 6 shows that the two standard variations of the ToTTo set achieve the best values except for a small deviation in the ROUGE score. We observe that scores increase with the number of training instances. Additionally, they also tend to increase with the metadata added to the set, but only in tendency, as the values for the few-shot setting with less metadata (ToTTo_few-shot_TT) are better than those for the setting with more metadata (ToTTo_few-shot_PT).
All in all, there is a discrepancy between the BLEU-4 precision values and the PARENT precision values. It shows that the models recognize that the prediction values must be taken from the table (PARENT precision) but do not select the same content as the target summary (BLEU-4).
The PARENT recall shows that only about 1.2-3.0% of the n-grams from the table and target text appear in the predictions at all (with the PARENT-λ weight set to 0.5). So, although on average about 45% of the n-grams of all predictions are entailed by the table and target summary, the predictions, on average, contain only about 1.5% of all n-grams occurring in the table and target text.
The BLEU-4 values for the respective few-shot and extended variants remain below a certain level during training. The two standard variations, on
Table 4: Study 2: Statistics of tokens in tables and target texts of the ToTTo sets.

Set                          Tokens in table       Tokens in target text
                            Min    Max     Mean     Min   Max   Mean
Train  ToTTo_few-shot_TT      8   7752    426.9       6    52   17.8
       ToTTo_few-shot_PT      9   3678    351.6       4    67   17.3
       ToTTo_extended_TT      8   9462    417.0       6    52   17.7
       ToTTo_extended_PT      9   3802    329.5       4    67   17.7
       ToTTo_standard_TT      5  12623    388.8       4    60   17.6
       ToTTo_standard_PT      7  12250    391.5       4    67   17.5
Valid  ToTTo_few-shot_TT      5   9943    392.5       4    59   17.5
       ToTTo_few-shot_PT      8  12250    377.8       4    59   17.5
       ToTTo_extended_TT      5  12623    382.9       4    60   17.5
       ToTTo_extended_PT      7  12250    394.5       4    61   17.5
       ToTTo_standard_TT      5  16920    396.5       4    67   17.3
       ToTTo_standard_PT      6  14029    380.6       4    61   17.3
Table 5: Study 2: Statistics of table properties of the ToTTo sets.

Set                            Cells                  Rows               Columns (Attributes)
                            Min   Max    Mean     Min    Max    Mean     Min  Max  Mean
Train  ToTTo_few-shot_TT      5   4048   198.3      1    513    25.1       4   17   7.9
       ToTTo_few-shot_PT      4   1938   161.0      1    323    18.5       4   19   8.7
       ToTTo_extended_TT      4   5451   202.9      1    908    26.0       4   27   7.8
       ToTTo_extended_PT      4   3391   154.9      1    323    17.6       4   19   8.8
       ToTTo_standard_TT      2   6085   180.2      1    908    23.4       2   39   7.7
       ToTTo_standard_PT      3   8755   179.5      1   1250    20.4       3   39   8.8
Valid  ToTTo_few-shot_TT      2   5597   185.6      1    908    24.1       2   34   7.7
       ToTTo_few-shot_PT      3   8755   177.8      1   1250    20.2       3   35   8.8
       ToTTo_extended_TT      2   6085   177.1      1    908    23.0       2   39   7.7
       ToTTo_extended_PT      3   8755   181.3      1   1250    20.6       3   39   8.8
       ToTTo_standard_TT      2  10352   184.0      1   1529    23.9       2   37   7.7
       ToTTo_standard_PT      3  10353   176.6      1  862.8    20.3       3   40   8.7
the other hand, outperform the other sets, and the variation with the most metadata (ToTTo_standard_PT) achieves higher values from epoch 90 onwards than the variation with less metadata (ToTTo_standard_TT).
The copy loss history of all ToTTo variants is shown in Figure 4. The variants with more training instances tend to achieve better scores (i.e., lower values) than those with fewer training instances. And within these groups, the sets with more metadata again achieve lower scores than those with less metadata. As can be seen in Figure 4, a large part of the summaries contains information from the page title, and therefore ToTTo_standard_PT performs significantly better than ToTTo_standard_TT. Since all sets use the same subset of records, this can be stated reliably. Nevertheless, it is interesting to see that ToTTo_extended_PT performs on about the same level as ToTTo_standard_TT. Probably this is also related to the additional information of the page title.
Figure 4: Study 2: copy loss with ToTTo. The suffix TT denotes data sets with only additional table information and PT data sets with additional page and table information.

Figure 5 shows the overall loss of the models. In general, the loss values decrease visibly for all sets. One can see very well that these developments are characteristic for the respective variant pairs (TT and PT). The few-shot variants show the steepest decline, followed by the extended variants and finally the standard variations. This is due to the fact that the more instances are used for training, the more difficult it is for the model to find a mapping function between input and output.
Discussion. In the evaluation results of study 2, we cannot find any contradictory scores between BLEU-4 and PARENT.
Table 6: Study 2: Results for the ToTTo data set variations obtained after 330 epochs with varying amounts of training instances. The suffix TT denotes data sets with additional table information and PT those with additional page and table information.

Set                   BLEU-4   ROUGE-4 F-Score   PARENT precision   recall   F-Score
ToTTo_few-shot_TT       4.7         1.1               40.5            2.8       4.0
ToTTo_extended_TT       5.2         1.3               42.2            3.1       4.5
ToTTo_standard_TT       8.3         2.6               51.4            4.9       7.0
ToTTo_few-shot_PT       3.4         0.7               38.1            2.4       3.4
ToTTo_extended_PT       5.2         1.2               43.2            3.1       4.5
ToTTo_standard_PT       9.6         0.02              55.4            6.1       8.7
Figure 5: Study 2: loss with ToTTo. The suffix TT denotes data sets with only additional table information and PT data sets with additional page and table information.
The settings with the highest PARENT scores also show the highest BLEU-4 scores and vice versa. However, the discrepancy between the two scores is much higher than in the evaluation scores of study 1. While the best performing set, ToTTo_standard_PT, shows a PARENT precision score of about 55%, its BLEU-4 is only 9.6. In contrast, the PARENT precision of WikiBio from study 1 is only about seven percentage points higher, yet WikiBio has a BLEU-4 score of 33.4. This shows more clearly what was already apparent in study 1: because the BLEU score does not take the input table into account in the evaluation calculation, we get a distorted picture of the results for our purposes. We do not aim to precisely reproduce a table's target text but to obtain an output text that covers the table to a certain extent and represents it correctly in a semantic context. For classic DTG data sets such as WeatherGov, WebNLG, and WikiBio, this metric may still be appropriate, as the target texts leave less room for variation compared to ToTTo.
Regarding the coverage criterion, we can state that the model at least learns that it should take content from the table. This is most noticeable in the sets with the most metadata and training instances.
If we only look at the results of the sets used in this study, we see the same tendency in all aspects: the more training instances a set has, the better it performs; and if the set also includes more metadata, it performs even better. The Few-Shot setting in our study with ToTTo does not achieve results comparable to those of Chen et al. (Chen et al., 2019).
Ultimately, the development in Figure 5 suggests that the two standard variants produce models that are better able to generalize, i.e., to respond to previously unseen input. While the few-shot and extended variants both move relatively quickly to a low level and stay there, the model takes longer to adapt the mapping between input and output in the standard variants. Again, it would be interesting for further research to see if the loss of the standard variants can settle to the level of the other sets with longer training time.
3.3 Study 3: Methodological Replication with VC-SLAM
For our final study, we apply the model to the VC-SLAM data set. As a collection of data records from Open Data Portals, it gives us a first impression of the applicability of the model in the context of the goals of the Open Data Charter. Although the data set is small, it reflects a real-world application in terms of quantity (data record sizes) and quality (unedited and unaligned records and descriptions).
Experimental Setup. This study follows the same settings as study 2 in terms of input representations, input parameters to the model, and evaluation. Since this set is much smaller than WikiBio and ToTTo, we applied only one split: Train (59.4%), Valid (29.7%), and Test (10.9%). With VC-SLAM, we encounter the largest input tables so far, as seen in Table 7 and Table 8. The model has to cope with even more input per example than in the other two studies, while training with just over a quarter of the instances of the Few-Shot setting.
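The split corresponds to 60, 30, and 11 of the 101 VC-SLAM records; a hypothetical way to reproduce such a split (the seed is our own choice) is:

```python
import random

records = list(range(101))          # ids of the 101 VC-SLAM data sets
random.Random(42).shuffle(records)  # hypothetical seed
train, valid, test = records[:60], records[60:90], records[90:]
# 60/101 = 59.4%, 30/101 = 29.7%, 11/101 = 10.9%
```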
Results. Table 9 shows the BLEU-4, ROUGE-4, and PARENT evaluation scores. While VC-SLAM achieves a BLEU-4 value of 5.1, the model, with only 60 training instances, achieves a PARENT precision of 26.7.
Table 7: Study 3: Statistics of tokens in tables and target texts of VC-SLAM.

Set               Tokens in tables           Tokens in target texts
               Min      Max       Mean        Min   Max   Mean
Train VC-SLAM   80   1940850   96338.2         10   367   73.4
Valid VC-SLAM  605    982277   59003.4         10   193   56.4
Table 8: Study 3: Statistics of table properties of VC-SLAM.

Set                  Cells                    Rows               Columns (Attributes)
               Min     Max     Mean      Min    Max     Mean     Min  Max  Mean
Train VC-SLAM   13  934590  32711.7        3  54975   2947.6       4   28  11.1
Valid VC-SLAM  151  539603  20873.7       13  19985   1546.2       4   33  13.5
So roughly one-quarter of the n-grams of the predictions are entailed by the table and the target summary, and these cover about 0.1% of the entire content.
In Figure 7 we can observe that the overall model loss reaches a low level quickly and stays there. Since the set is minimal, the model has fewer problems finding a mapping between input and output than is the case with the other sets.
Figure 6: Study 3: copy loss with VC-SLAM.
Figure 7: Study 3: loss with VC-SLAM.
Discussion. The results of study 3 are difficult to compare with those of the other studies and are therefore considered in isolation. Even though the model received only 60 training samples, the predictions still achieve a PARENT precision value of almost 27%. This exceeds the expectations we had for this study. Although there are few training instances, the input comes from several domains and has extensive data records. Additionally, the summaries are unedited, so they are not corrected with respect to alignment. The underlying input tables and target texts are larger than those of the other two studies. These findings raise the question of what role the BLEU score and the ROUGE score still play in this application area and whether they can still serve as a benchmark for DTG systems. Almost the same applies to the PARENT recall value, because it can only be interpreted to a limited extent in applications with large inputs. For such data sets, a suitable PARENT-λ value should be carefully specified, which defines the relation between the content of the summary and that of the table. Otherwise, a mismatch leads to the PARENT F-Score being pulled down by poor recall values.
Overall, the results provide a first perspective on the application in an authentic setting regarding the quality of the data itself rather than quantity.
4 DISCUSSION AND CONCLUSION
This work aimed to evaluate an existing model for DTG that fits our requirements for its applicability in Open Data Portals or for the support of Semantic Models for a better organization of the former. The requirements were that it should be domain-independent and not data-hungry. We decided on an architecture that divides the task of generating a coherent text for underlying (semi-)structured data into two independent tasks: content selection and language generation, with the latter relying on a pre-trained Language Model. In addition, we aimed to find an evaluation
Table 9: Study 3: BLEU-4, ROUGE-4 and PARENT scores of VC-SLAM after epoch 330.

Set        BLEU-4   ROUGE-4 F-Score   PARENT precision   recall   F-Score
VC-SLAM      5.1         0.2               26.7             0.3       0.5
method that considers both the target text and the associated table in the assessment of the results.
In the first study, the goal was to replicate the results of Chen et al. (Chen et al., 2019). We included the PARENT score for evaluation and were able to gain additional insights. The WikiBio set performed worst in both evaluations. Using the model outputs and the PARENT score, we can fairly confidently claim that the WikiBio data set exhibits more divergences than the other two Wiki sets. Unfortunately, no corresponding data are available for Wiki_songs and Wiki_books, but Dhingra et al. (Dhingra et al., 2019) found that the WikiBio data set in general contains about 62% divergences. The observation that models perform better on data with fewer divergences was confirmed with the help of the ToTTo data set in study 2. Figure 8 shows the development of the copy loss for all sets. Even in comparison with the otherwise better-performing Wiki sets, it can be seen that ToTTo_standard_PT achieves the best scores. Precise table-description alignment and additional metadata allow the framework to better learn the relationship between copying and generation. Overall, it can be observed that among all settings, those with more metadata perform better. In the context of Open Data Portals and the application of DTG for semantic modeling, this inevitably leads to the conclusion that we need metadata to generate metadata for another system. However, it must also be noted that the table-description alignment of ToTTo was done under the premise that the title and other direct metadata belong to the actual table (Parikh et al., 2020).
Figure 8: Development of the copy loss with all Data Sets.
Next, we look at the data sets necessary for DTG. Apart from the Wiki sets, we observed that settings with more training instances perform better. As can be seen in the direct comparison in Table 10, the values improve more or less proportionally to the size of the training set. Despite the pre-trained language model, the task of copying does not seem as trivial as Chen et al. (Chen et al., 2019) assumed. Their claim that the ability of content selection "can be learned by reading a handful of tables" (Chen et al., 2019, p. 1) does not hold in our studies. This may be true in the case of the Wiki sets, but since the tables in most of their records have only one row, the task is easier to perform. The situation is different for ToTTo and VC-SLAM: the records of ToTTo have on average 21 rows, those of VC-SLAM even over 2,400. Also for this reason, a one-to-one transfer of the performance results should be treated with reservation. Additional data sets such as ToTTo or an extension of the VC-SLAM set are needed to directly compare architectures and their generations.
For example, Nan et al. (Nan et al., 2020) published a large open-domain DTG data set named DART (Data Record to Text). The corpus is enriched by annotated tree ontologies converted from the underlying tables. It is constructed from different sources like Wikipedia and WikiSQL and incorporates records from WebNLG. Further comparison and research using such data sets are needed to evaluate their contributions.
Overall, the results of study 1, especially under the BLEU-4 metric, were the best of all three studies conducted. This is not surprising considering that the tables in these sets have, on average, only one row per record. It is to be expected that the target text for such data will entail the majority of the table; thus, the generating system has less room for maneuver in the prediction when taking the table into account.
In general, we have repeatedly encountered the
limits of the validity of the BLEU-4 and ROUGE-4
scores in the context of DTG. While still standard,
we have found that two questions should be asked in
advance when using or interpreting them: (1) Is the
goal that the predictions are as close as possible to
the target text?, and (2) Is the goal that the predic-
tion covers as much of the tables or target texts as
possible? The former need not always be the goal.
Originally, both metrics come from Neural Machine Translation, where the ambition is to get a prediction as close as possible to the target. In the context of DTG, this question certainly also depends on the data used. In our understanding, the objective is different with diverse data. Our goal is a description
Table 10: Overview of all evaluation scores for all data sets.

Set                   Training instances   BLEU-4   ROUGE-4   PARENT precision   recall   F-Score
ToTTo_few-shot_TT            200             4.7       1.1          40.5            2.8      4.0
ToTTo_few-shot_PT            200             3.4       0.7          38.1            2.4      3.4
ToTTo_extended_TT            400             5.2       1.3          42.2            3.1      4.5
ToTTo_extended_PT            400             5.2       1.2          43.2            3.1      4.5
ToTTo_standard_TT           4245             8.3       2.6          51.4            4.9      7.0
ToTTo_standard_PT           4245             9.6       0.02         55.4            6.1      8.7
Wiki_books                   200            35.2      21.7          66.8           36.1     44.3
WikiBio                      200            33.4      16.6          61.7           26.1     34.1
Wiki_songs                   200            36.9      26.8          66.2           33.5     43.9
VC-SLAM                       60             5.1       0.2          26.7            0.3      0.5
that reflects semantic concepts and the context of the
given table. Apart from that, both the BLEU and the
ROUGE score do not include the input table in the
evaluation. Furthermore, Wang (Wang, 2020) argues
that “hallucinated facts may unrealistically boost the
BLEU score. Thus the possibly misleading evalua-
tion results inhibit systems to demonstrate excellence
on this task” (Wang, 2020, p. 312).
Concerning the second question, the PARENT recall score must also be questioned. Again, as just described, this depends on the data sets used. For data sets with more extensive records, it is rather unlikely that one aims for the highest possible coverage of the entire input in the prediction. Above all, such a task description is not available as a reference in the data sets known to us (for larger records like those found in ToTTo and VC-SLAM). In any case, the PARENT-λ weight should be considered in the evaluation, and, if necessary, experiments should be carried out with different proportions of coverage.
Finally, there remains the question of factual accu-
racy. Even if higher PARENT precision scores are
achieved, i.e., a certain number of n-grams of the ta-
ble and the target text are found in the prediction, this
does not necessarily mean that the prediction content
is factually correct. Especially in the field of jour-
nalism or public relations, it must be ensured that no
”fake news” is spread (Portet et al., 2009).
The final development of the models, as observed in the overall comparison in Figure 9, shows that the frameworks, in general, seem to cope with the task of DTG. As expected, the sets with fewer instances show a faster framework adaptation. However, there is the assumption that these models generalize less well than the standard variations. Future research could confirm this.
In our model, we have used the Language Model GPT-2 with 117 million parameters. Studies by Brown et al. (Brown et al., 2020) have shown that proficiency in in-context learning increases with the number of parameters of the pre-trained Language Model. For future research, it would be interesting to evaluate this framework with GPT-3, the successor of GPT-2.
Figure 9: Development of the loss with all Data Sets.
It would also be of interest to extend the studies pre-
sented here and examine the results after a longer
training period. Except for the development in the
overall loss of the smaller sets, the values in the BLEU
and copy loss indicate that the scores have not yet lev-
eled off and that further development could still occur.
In the course of the last year, other promising DTG systems have been developed. For example, Rebuffel et al. (Rebuffel et al., 2021) follow a word-level approach to control hallucinations in generation, with labels obtained by employing co-occurrence analysis and dependency parsing. The authors achieved state-of-the-art performance on WikiBio with a PARENT F-Score of approximately 56%. Filippova (Filippova, 2020) reaches up to 52% PARENT F-Score on WikiBio by adding a hallucination score as an additional attribute to each instance. However, it can be assumed that these frameworks will achieve inferior results in other domains or with data sets other than WikiBio.
In light of the results from the VC-SLAM data set, we encourage expanding research in this direction and collecting more data from Open Data Portals. Despite the small training set, the predictions already show nearly 27% entailment. Although these data records are substantially larger and more diverse, the model seems to adapt. However, concerning our research question, we cannot claim to have obtained satisfactory results with a minimal amount of real-world data. At this point, another research objective emerges, which has already been articulated by Burgdorf et al. (Burgdorf et al., 2020): to say reliably whether given metadata is useful for semantically modeling tabular data requires some kind of assessment or evaluation. The authors propose to use historical data from the (potentially) established ontology to make some kind of prediction about how much manual effort a given semantic model will need with a given data set. Schauppenlehner and Muhar (Schauppenlehner and Muhar, 2018) support this approach.
Finally, our conclusion is somewhat ambivalent. We were able to test the present framework in an open-domain setting and achieved valuable results, even if not in the few-shot setting. However, many open research questions remain: How can the generations of DTG systems be qualitatively evaluated? We have been able to identify a method that allows us to assess the degree of entailment, but this does not tell us anything about factual correctness, nor about whether a generation is semantically relevant for use in the context of Open Data Portals. Furthermore, our results show that the amount of metadata is crucial for the performance of a DTG model. If not all the information needed for the generation can be obtained from the table, we must rely on additional information. At this point, a vicious circle arises because, in order to generate metadata, we need metadata.
REFERENCES
ODC (2013). G8 Open Data Charter and Technical Annex.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan,
J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry,
G., Askell, A., et al. (2020). Language models are
few-shot learners. arXiv preprint arXiv:2005.14165.
Burgdorf, A., Paulus, A., Pomp, A., and Meisen, T. (2022). VC-SLAM—A handcrafted data corpus for the construction of semantic models. Data, 7(2):17.
Burgdorf, A., Pomp, A., and Meisen, T. (2020). Towards NLP-supported semantic data management. arXiv preprint arXiv:2005.06916.
Chandola, T. and Booker, C. (2022). Archival and Sec-
ondary Data. SAGE.
Chen, D. L. and Mooney, R. J. (2008). Learning to
sportscast: a test of grounded language acquisition. In
Proceedings of the 25th international conference on
Machine learning, pages 128–135.
Chen, Z., Eavani, H., Chen, W., Liu, Y., and Wang, W. Y. (2019). Few-shot NLG with pre-trained language model. arXiv preprint arXiv:1904.09521.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dhingra, B., Faruqui, M., Parikh, A., Chang, M.-W., Das,
D., and Cohen, W. W. (2019). Handling divergent ref-
erence texts when evaluating table-to-text generation.
arXiv preprint arXiv:1906.01081.
Filippova, K. (2020). Controlled hallucinations: Learning
to generate faithfully from noisy data. arXiv preprint
arXiv:2010.05873.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Lebret, R., Grangier, D., and Auli, M. (2016). Neural text
generation from structured data with application to the
biography domain. arXiv preprint arXiv:1603.07771.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81.
Nan, L., Radev, D., Zhang, R., Rau, A., Sivaprasad, A., Hsieh, C., Tang, X., Vyas, A., Verma, N., Krishna, P., et al. (2020). DART: Open-domain structured data record to text generation. arXiv preprint arXiv:2007.02871.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.
Parikh, A. P., Wang, X., Gehrmann, S., Faruqui, M., Dhingra, B., Yang, D., and Das, D. (2020). ToTTo: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
Portet, F., Reiter, E., Gatt, A., Hunter, J., Sripada, S., Freer,
Y., and Sykes, C. (2009). Automatic generation of
textual summaries from neonatal intensive care data.
Artificial Intelligence, 173(7-8):789–816.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever,
I. (2018). Improving language understanding by gen-
erative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. (2019). Language models are un-
supervised multitask learners. OpenAI blog, 1(8):9.
Rebuffel, C., Roberti, M., Soulier, L., Scoutheeten, G., Can-
celliere, R., and Gallinari, P. (2021). Controlling hal-
lucinations at word level in data-to-text generation.
arXiv preprint arXiv:2102.02810.
Schauppenlehner, T. and Muhar, A. (2018). Theoretical
availability versus practical accessibility: The criti-
cal role of metadata management in open data portals.
Sustainability, 10(2):545.
Tygel, A., Auer, S., Debattista, J., Orlandi, F., and Cam-
pos, M. L. M. (2016). Towards cleaning-up open
data portals: A metadata reconciliation approach. In
2016 IEEE Tenth International Conference on Seman-
tic Computing (ICSC), pages 71–78. IEEE.
Wang, H. (2020). Revisiting challenges in data-to-text
generation with fact grounding. arXiv preprint
arXiv:2001.03830.