A Comparative Analysis of JSON Schema Inference Algorithms
Ivan Veinhardt Latták and Pavel Koupil
Department of Software Engineering, Charles University, Prague, Czech Republic
ORCID: https://orcid.org/0000-0003-3332-3503
Keywords: Schema Inference, Reverse Engineering, Document Model, JSON.
Abstract: NoSQL databases are becoming increasingly popular due to their undeniable advantages in the context of storing and processing Big Data, mainly horizontal scalability and minimal requirements to define a schema upfront. In the absence of an explicit schema, however, an implicit schema inherent to the stored data still exists, and it needs to be reverse engineered from the data. Once inferred, it is of great value to stakeholders and database maintainers. Nevertheless, the problem of schema inference is non-trivial and is still the subject of ongoing research. In this paper we provide a comparative analysis of five recent proposals of schema inference approaches targeting the JSON format. We provide both a static and a dynamic comparison of the approaches. In the former case we compare various features; in the latter case we involve both functional and performance analysis. Finally, we discuss remaining challenges and open problems.
1 INTRODUCTION
Traditional database management systems (i.e., relational, object, object-relational) enforce that the data are highly structured and conform to a strictly predefined schema which is designed in one of the first steps of forward engineering approaches. From the point of view of data management this approach is denoted as schema-on-write. On the other hand, novel NoSQL systems reflecting the V-features of Big Data (Volume, Velocity, Variety, Veracity, ...) relax this rule, since it does not reflect requirements typical for Big Data applications, which also work with semi-structured or unstructured data. However, when the data is retrieved to be processed, its structure needs to be known. We speak about schema-on-read approaches.
Although semi-structured and unstructured data is not bound to an explicit schema, this schema is implicitly present and can therefore be inferred, i.e., reverse engineered, from the data. Despite its limitations (given by the quality and richness of the input data), a schema inferred from a sample dataset is of great value – it can be used by stakeholders to reason about the data, by automated tools for data validation and migration, for object code generation, etc. Moreover, an inferred schema of newly added schema-less data can be integrated into the originally designed schema and thus enrich the knowledge of the structure of the data.
The inference process itself, however, is non-trivial. Several schema inference approaches already exist, but many are insufficient in various aspects. Hence, our aim in this paper is to thoroughly analyze the problems of schema inference. In particular, we focus on semi-structured data, namely document-oriented data, due to the higher complexity of the document model and the overwhelming popularity of document databases (see https://db-engines.com/en/ranking) compared to key/value or columnar ones.
In our previous paper (Čontoš and Svoboda, 2020), we already examined a number of existing solutions and described their strengths and weaknesses. In addition, we provided an example of an inferred JSON schema for each discussed approach. In this paper we significantly extend our work. The main contributions are as follows:
- We analyze five recent JSON schema inference approaches. Apart from a static analysis and a discussion of various features, we provide an experimental analysis of the algorithms involving both functional and performance analysis.
- Implementations of all the examined approaches were acquired from the authors (the links to the repositories are provided in Table 1) and debugged when necessary. The datasets, as well as the used generator, are available in a GitHub repository (https://github.com/ivan-lattak/schema-inference) for further exploitation.
- We discuss a set of open problems and challenges in the area of schema inference, including the context of currently highly popular multi-model data.
- Since we are not the authors of any of the compared approaches, we provide an unbiased comparative study.
The rest of the paper is structured as follows: Section 2 summarizes both the compared approaches and other related work. In Section 3 we provide a comparison of the selected algorithms based on their static characteristics. In Section 4 we present an experimental analysis of the selected existing solutions, covering both functionality and performance. In Section 5 we overview challenges and discuss possible future improvements. We conclude in Section 6.
2 SCHEMA INFERENCE APPROACHES
Research on schema inference of semi-structured data is not new, as it involves both modern NoSQL databases and somewhat older technologies such as XML (https://www.w3.org/TR/xml/) or RDF (https://www.w3.org/RDF/). In this section, we introduce selected comparative works and summarize the most prominent and relevant schema inference approaches.
There exist several surveys dealing with schema inference approaches in document databases. Mlýnková et al. (Mlýnková and Nečaský, 2013) provide an overview of the field of heuristic XML Schema inference and summarize existing approaches and open problems. Morales in his dissertation thesis (Morales, 2017) statically compares several schema extraction algorithms over multiple NoSQL stores.
We believe that due to the versatility and popularity of the JSON format, JSON schema inference approaches are the most promising, especially in terms of extensibility towards schema inference in popular multi-model systems. Hence, in this paper, we focus on these approaches. In particular, we analyze the following ones:
Sevilla et al. (Sevilla Ruiz et al., 2015b) present an approach for inferring versioned schemas from document NoSQL databases based on Model-Driven Engineering (MDE), along with sample applications created from such inferred schemas. This research is furthered by Morales in his dissertation thesis (Morales, 2017) and by Hernández et al., who tackle the issues of visualization of schemas of aggregate-oriented NoSQL databases and propose desired features which should be supported in visualization tools (Chillón et al., 2017). Most recently, Fernández et al. expand upon the meta-model from (Sevilla Ruiz et al., 2015b) by introducing a unified meta-model capable of modeling both NoSQL and relational data (Candel et al., 2021).
Scherzinger et al. (Scherzinger et al., 2013) introduce a platform-agnostic NoSQL data evolution management and schema maintenance solution. The same research group later proposes an approach for extraction of a schema from JSON data stores, measuring the degree of heterogeneity in the data and detecting structural outliers (Klettke et al., 2015). They also introduce an approach for reconstructing the schema evolution history of data lakes (Klettke et al., 2017b). Additionally, Möller et al. present jHound (Möller et al., 2019), a JSON data profiling tool which can be used to report key characteristics of a dataset, find structural outliers, or detect documents violating best practices of data modeling. Finally, Fruth et al. present Josch (Fruth et al., 2021), a tool that enables NoSQL database maintainers to extract a schema from JSON data more easily, refactor it, and then validate it against the original dataset.
Baazizi et al. (Baazizi et al., 2019b) propose a distributed approach for parameterized schema inference of massive JSON datasets and introduce a simple but expressive JSON type language to represent the schema.
Izquierdo and Cabot provide an MDE-based approach for discovering the schema of multiple JSON web-based services (Izquierdo and Cabot, 2013a) and later put it in practice as a web-based application along with a visualization tool (Izquierdo and Cabot, 2016).
Frozza et al. introduce a graph-based approach for schema extraction of JSON and BSON (https://bsonspec.org) document collections (Frozza et al., 2018a) and another inference process for columnar NoSQL databases (Frozza et al., 2020), specifically HBase (http://hbase.apache.org).
In addition, there exist several approaches that deal with different data models. For example:
Bex et al. (Bex et al., 2010) introduce a method of inference of a concise XML DTD (https://www.w3.org/XML/) by reducing the problem to learning concise regular expressions from positive examples (i.e., documents valid against the target DTD).
DiScala and Abadi (DiScala and Abadi, 2016) present an algorithm for the automatic generation of a relational database schema from JSON data, along with a subsequent transformation of the data itself.
Gallinucci et al. (Gallinucci et al., 2018) propose a way to enable non-technical users to enrich RDF data cubes by recognizing recurring patterns in linked open data.
Finally, Bouhamoum et al. (Bouhamoum et al., 2018) deal with the issues of horizontal scaling of existing RDF schema discovery approaches and present a method based on extracting a condensed representation of the initial dataset.
Nevertheless, in the rest of this comparative study we will focus only on the following JSON schema inference approaches: Sevilla et al. (Sevilla Ruiz et al., 2015b), Klettke et al. (Klettke et al., 2017b), Baazizi et al. (Baazizi et al., 2019b), Izquierdo and Cabot (Izquierdo and Cabot, 2013a), and Frozza et al. (Frozza et al., 2018a).
3 STATIC ANALYSIS
First of all, we focus on the statically determinable features. Table 1 summarizes the comparison of the key characteristics discussed in the following paragraphs.
Input Format: All selected approaches support schema extraction from JSON data. The approach by Frozza et al. also supports BSON data. In addition, the approaches by Sevilla et al. and Klettke et al. mention how other aggregate-oriented data can be trivially converted to JSON, so that in practice any schema inference approach for JSON data can also be used for key/value and columnar data, as the sketch below illustrates.
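To make this conversion argument concrete, the following minimal Python sketch (our own illustration, not code from any of the examined papers; the function and field names are ours) renders a key/value or wide-column record as a JSON document:

```python
# Minimal sketch of the aggregate-to-JSON conversion argument; this is our
# own illustration, not code from any of the examined papers.
import json

def row_to_json(row_key: str, columns: dict) -> str:
    """Render a key/value or wide-column record as a JSON document."""
    doc = {"_id": row_key, **columns}
    return json.dumps(doc)

print(row_to_json("article:42", {"title": "NoSQL", "views": 1337}))
# {"_id": "article:42", "title": "NoSQL", "views": 1337}
```

Once in this form, any of the five JSON approaches can be applied unchanged.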
Multiple Collections: The majority of the algorithms support inference from only a single document collection at a time, while the merging of the resulting schemas is left up to the user. The approaches that can process an entire JSON database are those by Sevilla et al. and by Izquierdo and Cabot, both exploiting MDE to do so.
Inference Process: Most of the approaches extract schema information from all the documents stored in the input collection. The approaches by Sevilla et al. and Frozza et al. first select a minimal collection of mutually distinct documents such that they still cover all the different input cases. A common feature of all the approaches is the replacement of property values by the names of the encountered primitive types. In addition, this step is usually parallelized using MapReduce or Apache Spark, which greatly improves the scalability.
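This shared value-to-type replacement step can be restated compactly; the following Python sketch (our own simplification of the common idea, not the code of any particular approach) produces what is often called a raw schema:

```python
# Simplified sketch of the step shared by all approaches: replace each
# property value with the name of its (primitive) type. Our own
# illustration, not code from any of the examined papers.
def raw_schema(value):
    if isinstance(value, dict):
        return {key: raw_schema(val) for key, val in value.items()}
    if isinstance(value, list):
        return [raw_schema(item) for item in value]
    if isinstance(value, bool):      # bool must be tested before int/float
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, str):
        return "String"
    return "Null"

doc = {"title": "NoSQL", "views": 1337, "tags": ["db", "json"], "draft": False}
print(raw_schema(doc))
# {'title': 'String', 'views': 'Number', 'tags': ['String', 'String'], 'draft': 'Boolean'}
```

Documents sharing the same raw schema can then be deduplicated or counted, which is exactly what makes the MapReduce/Spark parallelization effective.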
Output Format: The majority of approaches output the inferred schema in a textual format. Klettke et al. and Frozza et al. use JSON Schema (https://json-schema.org), and Baazizi et al. use their own type description language. The approaches by Sevilla et al. and Izquierdo and Cabot output the inferred schema as a data model, both based on UML (https://www.omg.org/spec/UML/).
Implementation: The approaches of Sevilla et al. and Izquierdo and Cabot were implemented as Java applications running on the Eclipse platform. Both of them offer a Java API and a simple web application wrapper. Klettke et al. implemented their approach as a Java application running on the Spring Boot platform. Baazizi et al. implemented their approach as a Scala application designed to run in the Apache Spark environment. Finally, the approach by Frozza et al. is implemented as a JavaScript web application. The front end is written in TypeScript and provides the user with a presentation layer, while the back end is written using Node.js and contains the implementation of the inference approach.
Structural Components: All the approaches are capable of inferring various structural components of JSON documents, including simple data types (String, Number, Boolean) and some complex data types (array, object). Parent/child relationships (within aggregates) are also captured in the inferred schemas, either as nesting in the textual form or as an arrow in the graphical form of an inferred schema. On the other hand, no approach is capable of inferring complex data structures, including sets, maps, and tuples.
Integrity Constraints: Only one specific type of integrity constraint (IC), namely simple referential integrity, is detected, specifically by the inference approach by Sevilla et al. No other ICs are inferred by any of the examined inference approaches.
Optional Properties: All approaches, except for Izquierdo and Cabot, are able to describe optional properties in their schemas. The JSON Schema-based approaches, i.e., Klettke et al. and Frozza et al., use the keyword required to enumerate the required properties, while all other properties are optional. The approach of Baazizi et al. uses the optionality modifier (i.e., ?) to describe optional properties when a kind-equivalence relation is used in the reduction. The approach by Sevilla et al. can infer optional properties by merging all versions of a single entity together, marking each property as required if it is present in all of them, and as optional if not.

Table 1: Comparison of selected schema inference approaches.

| Feature | Sevilla | Klettke | Baazizi | Izquierdo | Frozza |
|---|---|---|---|---|---|
| Repository | (Sevilla Ruiz et al., 2015a) | (Klettke et al., 2017a) | (Baazizi et al., 2019a) | (Izquierdo and Cabot, 2013b) | (Frozza et al., 2018b) |
| Algorithm | MapReduce + MDE | Fold into graph | Type reduction in Apache Spark | MDE | Aggregation + fold into graph |
| Input format | Aggregate-oriented NoSQL data | JSON | JSON | JSON web service responses | Extended JSON |
| Input type | Multiple collections | Single collection | Single collection | Multiple collections | Single collection |
| Output format | NoSQL Schema model | JSON Schema | Custom textual type language | Ecore model | JSON Schema |
| Schema root | Entities | Documents | Documents | Entities | Documents |
| Implementation | Eclipse bundle | Spring Boot application | Apache Spark application in Scala | Eclipse bundle | Node.js web application |
| Optional | Yes | Yes | Yes | No | Yes |
| Entity versions | Yes | No | Yes | No | No |
| Union type | No | Yes | Yes | No | Yes |
| References | Yes | No | No | No | No |
| Tuple | No | No | No | No | No |
| Set | No | No | No | No | No |
| Map | No | No | No | No | No |
| Extended JSON | No | No | No | No | Yes |
| Complex IC | No | No | No | No | No |
| Scalable design | Yes | Yes | Yes | Yes | No |
| Scalable implementation | Yes | No | Yes | No | No |
Union Type: The approaches by Klettke et al., Frozza et al., and Baazizi et al. can infer and express union types in the schema. The former two use the JSON Schema keyword oneOf, while the latter defines the union type constructor (i.e., +) for this purpose. The approaches by Sevilla et al. and Izquierdo and Cabot do not support union types of properties. The approach by Sevilla et al. uses entity versioning instead, while the approach by Izquierdo and Cabot takes the alternative route of reducing different types to their most generic type, such as EString. Both JSON Schema devices mentioned here and in the previous paragraph are illustrated in the sketch below.
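For concreteness, the following hand-written JSON Schema fragment (our own illustration, not output produced by any of the examined tools) combines the required keyword for optionality with oneOf for a union type; validation uses the Python jsonschema package:

```python
# Hand-written illustration of the JSON Schema constructs discussed above
# (required and oneOf); not output of any of the examined tools.
# Requires the 'jsonschema' package (pip install jsonschema).
from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        # union type: 'rating' may be either a number or a string
        "rating": {"oneOf": [{"type": "number"}, {"type": "string"}]},
        # 'subtitle' is declared but not listed in 'required', i.e., optional
        "subtitle": {"type": "string"},
    },
    "required": ["title", "rating"],
}

validate({"title": "NoSQL", "rating": 4.5}, schema)                    # passes
validate({"title": "NoSQL", "rating": "ok", "subtitle": "x"}, schema)  # passes
```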
Scalability: Despite the fact that the majority of the approaches have a scalable design, most implementations are not parallelized and therefore do not scale horizontally. Sevilla et al. use MapReduce to decrease the number of input documents considered in the rest of the schema inference process. Similarly, Apache Spark is used in the approach of Baazizi et al.
4 DYNAMIC ANALYSIS
Next we design, execute, and evaluate experiments which demonstrate the behavior of the individual approaches. We identify points of failure and illustrate differences between the approaches. The first phase involves a functional analysis of the given approaches. It exemplifies the functional behavior of the approaches when applied to datasets containing different schema features. The second phase, performance analysis, compares the relative runtime performance of the approaches by executing them in an identical environment and against identical datasets.
4.1 Functional Analysis
For this analysis we manually created 8 separate datasets with self-descriptive names, each focusing on a different schema feature, namely PrimitiveTypes, SimpleArrays, SimpleObjects, ComplexArrays, ComplexObjects, Optional, Union, and References. They are located as JSON files in the /experiment directory within the root of the GitHub repository (https://github.com/ivan-lattak/schema-inference). These datasets were imported into a collection named articles, each within a separate MongoDB database. Then, all 5 inference approaches were run over the databases. Since the majority of the approaches behaved according to expectation on the majority of the datasets, we will only discuss the detected abnormalities.
Dataset SimpleArrays features an empty JSON array in property nothings. The empty array is not handled correctly by the implementation of Sevilla et al. – it throws an uncaught exception during the inference. The implementation by Frozza et al. also has a problem with this edge case, although a less severe one – the resulting JSON Schema is invalid, as it contains an invalid definition for the array element type.
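A minimal document triggering this case, together with a schema that handles it safely, could look as follows (our own reconstruction of the scenario, not the verbatim dataset): with no elements observed, an inferred schema cannot commit to an element type, so a safe output constrains the property to an array and nothing more.

```python
# Our reconstruction of the empty-array edge case (not the verbatim dataset).
from jsonschema import validate

document = {"title": "NoSQL", "nothings": []}
safe_schema = {
    "type": "object",
    "properties": {"nothings": {"type": "array"}},  # element type left open
}
validate(document, safe_schema)  # passes: any array, including an empty one
```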
Dataset ComplexArrays contains a two-dimensional array in property nested_arrays. The dimensionality is not handled correctly by the approach by Izquierdo and Cabot, which models the property as a simple one-to-many relationship.
In dataset Optional, the optional properties were not modeled by the approach by Izquierdo and Cabot. For an unknown reason, the optional property body.compressed was not inferred at all.
The approach by Sevilla et al. inferred the union types in dataset Union as versioned entities instead. However, the union type is also used as the element type of a heterogeneous array in property comments. This heterogeneous array is modeled by Sevilla et al. as a tuple containing a string and a number instead. In the schema inferred by Izquierdo and Cabot, the heterogeneous types are reduced to the most generic type, i.e., String.
Finally, dataset References contains references to entities in two forms: a property named article_id and a property containing a BSON DBRef. These were difficult to handle for most inference approaches: Sevilla et al. and Izquierdo and Cabot output an empty schema and an empty package definition, respectively, while the approaches by Klettke et al. and Frozza et al. end with an error and do not output any schema.
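The two reference forms follow common MongoDB conventions and can be sketched as follows (the field names follow the dataset description; the concrete values are our own examples):

```python
# Two common MongoDB referencing conventions exercised by the References
# dataset; the concrete values are our own examples.
manual_reference = {
    "text": "Nice article!",
    "article_id": "61f2c0aa...",   # plain property named <entity>_id
}
dbref_reference = {
    "text": "Nice article!",
    "article": {                   # BSON DBRef: a structured reference
        "$ref": "articles",        # target collection
        "$id": "61f2c0aa...",      # referenced document's _id
    },
}
```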
4.2 Performance Analysis
Next we compare the approaches in terms of runtime performance. This was done using a series of performance experiments, running the existing implementations against a number of datasets.
4.2.1 Execution
To the best of our knowledge, there is no open JSON dataset covering all the structural aspects we want to test. Therefore, we first generated a dataset of 500,000 JSON documents, serving as the master dataset for our experiments. For this generation we used the open-source library jsongenerator, whose source code is available on GitHub (https://github.com/jimblackler/jsongenerator). This library is able to generate JSON documents according to a given JSON Schema. As the schema for the generation, we used a manually created schema that covers all the aspects of JSON data, i.e., simple types, complex types, nesting, union types, optional fields, etc.
We conducted 8 experiments differing in the size of the input data; the chosen sample sizes were 1k, 2k, 4k, 8k, 16k, 32k, 64k, and 128k documents. Each experiment was conducted using the following steps (restated as a code sketch below):
1. Extract a randomly sampled subset of the given size.
2. Run each algorithm on the extracted subset 30 times.
3. Repeat the previous steps a total of 30 times.
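The measurement loop can be restated compactly; the following Python sketch is our own rendering of the procedure (the algorithm callables are placeholders), not the actual harness from the repository:

```python
# Our own sketch of the measurement procedure, not the actual harness from
# the repository; the algorithm callables are placeholders.
import random
import time

SIZES = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000]
SAMPLES_PER_SIZE = 30   # independent random samples per experiment size
RUNS_PER_SAMPLE = 30    # repeated runs of each algorithm per sample

def benchmark(master_dataset, algorithms):
    results = {}  # (algorithm name, size) -> list of runtimes in seconds
    for size in SIZES:
        for _ in range(SAMPLES_PER_SIZE):
            sample = random.sample(master_dataset, size)
            for algo in algorithms:
                runtimes = []
                for _ in range(RUNS_PER_SAMPLE):
                    start = time.perf_counter()
                    algo(sample)
                    runtimes.append(time.perf_counter() - start)
                # drop the first (warm-up) run so that initialization
                # costs do not skew the averages (see Section 4.2.1)
                key = (algo.__name__, size)
                results.setdefault(key, []).extend(runtimes[1:])
    return results
```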
Experiments of different sizes were chosen because we wanted to measure how the performance of a given implementation changes depending on the number of input documents. If we had extracted just one random sample of a given size for an experiment, the results could have been distorted, as the runtime performance of the algorithms could become dependent on the particularities of each random selection. To mitigate this distortion, 30 different random samples of each size were extracted from the master dataset. Furthermore, if only a single run of each algorithm were performed for a given size, the results could be distorted by the runtime cost of algorithm initialization and would not reflect the sustained performance of the algorithm. Additional distortion could be caused by momentary drops in available system resources due to random external influences. To mitigate both effects, 30 runs of each algorithm were performed on each extracted subset.
The experiments were performed on a virtual machine running on VMware (https://www.vmware.com/) infrastructure with 64 gigabytes of memory and 8 single-thread processor cores. When performing the measurements, the first, so-called warm-up run for each random sample and each approach was significantly longer than the rest. Measurements for these warm-up runs were removed so as not to skew the results.
4.2.2 Results
Figure 1 depicts the behavior of each of the approaches in all experiments. The x-axis represents the different experiment sizes, the y-axis represents the average runtime of each approach, and the legend explains the meaning of the different line colors.
The linear scale shows the differences in average runtimes for large data. Frozza et al. performed best for the larger datasets, twice as fast as the slowest approaches. In the 16k, 32k, and 64k experiments, Sevilla et al., Baazizi et al., and Izquierdo and Cabot kept roughly identical performance. However, in the 128k experiment, Baazizi et al. had significantly better performance than the other two.
[Figure 1: Average runtimes of measured inference approaches across all experiment sizes, linear scale. X-axis: Documents (1k–128k); y-axis: Time [ms].]

Figure 2 contains a line chart almost identical to the previous one, only with a logarithmic y-axis. This chart better expresses the performance behavior of the smaller experiments, since the logarithmic scale emphasizes relative differences between small numbers and shrinks them between large numbers. More importantly, though, this chart demonstrates the linear scalability of each of the measured approaches: all five approaches appear as more-or-less straight lines on the chart.
[Figure 2: Average runtimes of measured inference approaches across all experiment sizes, logarithmic scale. X-axis: Documents (1k–128k); y-axis: Time [ms]. Legend: Sevilla, Klettke, Baazizi, Canovas, Frozza.]
Looking at the logarithmic line chart, we can see that the approaches by Sevilla et al., Izquierdo and Cabot, and especially Klettke et al. performed significantly better for smaller sample sizes. This can be attributed mainly to the high overhead cost of using Apache Spark in the approach by Baazizi et al. The difference between the approach by Klettke et al. and the approach by Baazizi et al. becomes less pronounced for larger sample sizes; Baazizi et al. even performed better in the largest experiment.
4.2.3 Evaluation
The results of the experiments provide useful insight into the performance of each approach and into how the performance depends on the size of the input data.
Different approaches are preferable depending on the size of the input data. For small datasets, Sevilla et al., Klettke et al., Izquierdo and Cabot, and even Frozza et al. are advisable, as they do not incur the high flat overhead costs of the Apache Spark framework, unlike Baazizi et al. Of these, Sevilla et al. had the best performance. Combined with the most interesting feature set (especially the unique ability to infer references), this makes it the best option for small datasets.
As far as large datasets are concerned, Frozza et al. is the best option if the inference is to be run on a single machine. However, as MongoDB datasets can span multiple database nodes and can contain upwards of millions of documents, horizontal scaling of schema inference may be desired. In that case, Frozza et al. is unsuitable, as it cannot be horizontally scaled.
Sevilla et al. and Izquierdo and Cabot can scale horizontally by decreasing the input size using the MapReduce built into MongoDB, in which case the number of MongoDB cluster nodes is the scaling factor. However, in the worst-case scenario where every (or almost every) document has a unique raw schema, this may not significantly decrease the input size, in which case the algorithm will run slowly.
Baazizi et al. can scale linearly even in the described worst-case scenario, simply by adding nodes to the Apache Spark cluster on which the approach is run.
5 CHALLENGES AND OPEN PROBLEMS
After a detailed analysis of the existing approaches, we provide a discussion of important open problems and challenges in the area of schema inference. In general, a schema describes the structure of the data, i.e., it is a set of named (ordered) sets of possibly hierarchical properties. Additionally, a schema may contain a list of integrity constraints, e.g., describing complex business rules or references between data. All these features need to be inferred.
Complex Types: There are complex types which cannot be inferred by any examined approach and whose consideration could increase the usability of the approaches. Maps, sometimes also called dictionaries, are similar in structure to JSON objects but semantically different, because the keys are not part of the schematic information (metadata) but part of the actual data. Similarly, tuples are special cases of arrays – they have a fixed size, and the types of their positional elements must be modeled separately, not as a union. (Tuples in this sense are supported only by Sevilla et al.) Finally, sets, i.e., unordered arrays, should be modeled distinctly from standard ordered arrays. The map/object distinction in particular can be decided heuristically, as the sketch below indicates.
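One possible heuristic for the map/object distinction (our own illustration, not part of any examined approach): if keys vary widely across documents while all values share one type, the keys are likely data rather than schema, and the property is better modeled as a map.

```python
# Our illustration of a map-vs-object heuristic; not part of any examined
# approach. Keys that are data (not schema) suggest a map type.
def looks_like_map(objects: list, key_variety_threshold: float = 0.8) -> bool:
    all_keys = set()
    value_types = set()
    for obj in objects:
        all_keys.update(obj.keys())
        value_types.update(type(v).__name__ for v in obj.values())
    total_keys = sum(len(obj) for obj in objects)
    # many distinct keys relative to the total number of observed keys,
    # combined with a homogeneous value type, indicates a map
    return (len(all_keys) >= key_variety_threshold * total_keys
            and len(value_types) == 1)

ratings = [{"alice": 5, "bob": 3}, {"carol": 4}, {"dave": 2, "erin": 5}]
print(looks_like_map(ratings))  # True: the keys are user names, i.e., data
```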
References: Another area of the approach that
could be improved is modeling of entity refer-
ences. Currently, to confirm that a property is an
entity reference, the inferred entities are searched
to find one with a matching name. To make this
heuristic stronger, the reference itself could be
checked against existing objects of the given en-
tity type. Additionally, a support for other than
primitive-typed entity references, such as refer-
ences with composite keys, could be beneficial.
In general, various types of references are com-
monly used in document databases.
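The strengthened heuristic could look as follows (our own sketch, not part of any examined implementation): besides matching the property name against an entity name, verify that the referencing values actually occur among the target entity's identifiers.

```python
# Our sketch of the strengthened reference heuristic; not part of any
# examined implementation.
def is_reference(property_values: set, target_entity_ids: set,
                 min_hit_ratio: float = 0.95) -> bool:
    """Accept a candidate reference only if (almost) all of its values
    occur among the identifiers of the target entity."""
    if not property_values:
        return False
    hits = len(property_values & target_entity_ids)
    return hits / len(property_values) >= min_hit_ratio

article_ids = {"a1", "a2", "a3"}
comment_article_ids = {"a1", "a2"}   # observed values of comments.article_id
print(is_reference(comment_article_ids, article_ids))  # True
```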
Complex Integrity Constraints: There is room for improvement in the inference of complex integrity constraints for semi-structured data in general. At first sight, this goes beyond the limits of JSON, since JSON Schema does not consider complex integrity constraints. On the contrary, e.g., XML allows one to model not only keys and references, but also conditional expressions and complex integrity constraints. Alternatively, a combination with (a subset of) the Object Constraint Language (https://www.omg.org/spec/OCL/2.4/PDF) may extend the expressive power of JSON Schema for practical purposes.
From a more general point of view, a JSON Schema inference approach could serve as a basis for a multi-model schema inference approach capable of processing data from relational, graph, key/value, document, columnar, and other logical models within a single schema. However, from this point of view the problem of multi-model schema inference involves the following issues:
Fetching Data: The currently existing schema inference approaches are closely bound to a particular database system. They use specific ways of retrieving data and thus are not applicable to other database systems or data models in general. An optimal algorithm should allow for different ways of reading data and be independent of a particular system. Additionally, for performance reasons, it should allow a choice of the framework used for data retrieval, e.g., MapReduce, Apache Spark, etc.
Unified Inference Process: There are at least two approaches to multi-model schema inference: 1) Since JSON is a complex format, it allows us to model constructs of various data models. Thus, existing inference approaches can be applied to a multi-model environment if the other data models are first converted to a collection of JSON documents (similarly to ArangoDB, which stores even the graph model in JSON documents) and their schema is then inferred. 2) Alternatively, a universal schema inference approach can be created that uses data-model-specific wrappers implementing the properties of particular data models; a minimal sketch of such a wrapper contract follows.
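The wrapper idea in alternative 2) essentially amounts to an abstract interface with one implementation per data model. The following is our own design sketch, not an existing API; the MongoWrapper uses the real pymongo collection interface, but the contract itself is hypothetical.

```python
# Our design sketch of the model-specific wrapper idea; this is not an
# existing API, just an illustration of the contract.
from abc import ABC, abstractmethod
from typing import Iterator

class ModelWrapper(ABC):
    """Adapts one logical data model to a universal inference algorithm."""

    @abstractmethod
    def records(self) -> Iterator[dict]:
        """Stream records in a canonical nested-dictionary form."""

    @abstractmethod
    def kind_name(self) -> str:
        """Name of the collection/table/label the records belong to."""

class MongoWrapper(ModelWrapper):
    """Document model: documents are already nested dictionaries."""
    def __init__(self, collection):
        self.collection = collection

    def records(self) -> Iterator[dict]:
        yield from self.collection.find()

    def kind_name(self) -> str:
        return self.collection.name

# A universal inference algorithm then depends only on ModelWrapper,
# never on a concrete database driver or data model.
```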
Schema Representation: Looking at this criterion with the multi-model context in mind, there is the question of how the output schema should be appropriately represented. For example, we have proposed a schema description format (Svoboda et al., 2021) suitable both for semi-structured data (i.e., the document model) and for data from other models, both aggregate-oriented and aggregate-ignorant.
6 CONCLUSION
The purpose of this paper was to provide an unbiased comparative analysis of recent proposals of JSON schema inference approaches. Our aim was to introduce interesting research directions for scientists, as well as to describe the possibilities and limitations of existing solutions for practitioners.
To summarize our findings: despite the fact that there have recently been numerous attempts to devise an approach for schema inference from JSON documents, there are still many areas in which they need to be improved. From the ability to model even deeply nested JSON structures to various issues with the necessary horizontal scalability, there is a significant number of aspects in which the existing solutions are still limited. Above all, as far as we know, there still does not exist a schema inference approach for JSON data able to detect and infer anything beyond basic integrity constraints.
The most promising and at the same time probably the most applicable area seems to be multi-model schema inference. Today, it often makes sense for a company to store different parts of its data using different storage technologies and, consequently, different logical models. For this purpose there exist tens of multi-model databases (https://db-engines.com/en/ranking), originally both relational and NoSQL. This multi-model context is especially difficult to grasp and to cover with a suitably robust schema inference approach. The JSON document model is robust and complex, as are the schema inference approaches based on it; hence, they can serve as a good starting point. However, the problem of detecting and modeling inter-model entity references and other integrity constraints becomes significantly more challenging.
ACKNOWLEDGEMENTS
This paper is based on Ivan Veinhardt Latták's Master thesis (Veinhardt Latták, 2021). This work was supported by the GAČR project no. 20-22276S.
REFERENCES
Baazizi, M.-A., Colazzo, D., Ghelli, G., and Sartiani, C. (2019a). https://gitlab.lip6.fr/collab/pstl2020. (unavailable).
Baazizi, M.-A., Colazzo, D., Ghelli, G., and Sartiani, C. (2019b). Parametric Schema Inference for Massive JSON Datasets. The VLDB Journal.
Bex, G. J., Neven, F., Schwentick, T., and Vansummeren, S. (2010). Inference of Concise Regular Expressions and DTDs. ACM Trans. Database Syst., 35(2):11:1–11:47.
Bouhamoum, R., Kellou-Menouer, K., Lopes, S., and Kedad, Z. (2018). Scaling up Schema Discovery for RDF Datasets. In 2018 IEEE ICDEW, pages 84–89. IEEE.
Candel, C. J. F., Ruiz, D. S., and García-Molina, J. (2021). A Unified Metamodel for NoSQL and Relational Databases. CoRR.
Chillón, A. H., Morales, S. F., Sevilla, D., and Molina, J. G. (2017). Exploring the Visualization of Schemas for Aggregate-Oriented NoSQL Databases. In ER Forum/Demos 1979, volume 1979 of CEUR, pages 72–85.
Čontoš, P. and Svoboda, M. (2020). JSON Schema Inference Approaches. In ER Workshops, pages 173–183. Springer.
DiScala, M. and Abadi, D. J. (2016). Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data. In SIGMOD '16, pages 295–310.
Frozza, A. A., Defreyn, E. D., and dos Santos Mello, R. (2020). A Process for Inference of Columnar NoSQL Database Schemas. In Anais do XXXV Simpósio Brasileiro de Bancos de Dados, pages 175–180. SBC.
Frozza, A. A., dos Santos Mello, R., and da Costa, F. d. S. (2018a). An Approach for Schema Extraction of JSON and Extended JSON Document Collections. In IRI 2018, pages 356–363. IEEE.
Frozza, A. A., dos Santos Mello, R., and da Costa, F. d. S. (2018b). https://github.com/gbd-ufsc/jsonschemadiscovery.
Fruth, M., Dauberschmidt, K., and Scherzinger, S. (2021). Josch: Managing Schemas for NoSQL Document Stores. In ICDE '21, pages 2693–2696. IEEE.
Gallinucci, E., Golfarelli, M., Rizzi, S., Abelló, A., and Romero, O. (2018). Interactive Multidimensional Modeling of Linked Data for Exploratory OLAP. Inf. Syst., 77:86–104.
Izquierdo, J. L. C. and Cabot, J. (2013a). Discovering Implicit Schemas in JSON Data. In ICWE '13, pages 68–83. Springer.
Izquierdo, J. L. C. and Cabot, J. (2013b). https://github.com/som-research/jsondiscoverer.
Izquierdo, J. L. C. and Cabot, J. (2016). JSONDiscoverer: Visualizing the Schema Lurking behind JSON Documents. Knowledge-Based Systems, 103:52–55.
Klettke, M., Awolin, H., Störl, U., Müller, D., and Scherzinger, S. (2017a). https://github.com/dbishagen/darwin.
Klettke, M., Awolin, H., Störl, U., Müller, D., and Scherzinger, S. (2017b). Uncovering the Evolution History of Data Lakes. In 2017 IEEE International Conference on Big Data, pages 2380–2389, New York, United States. IEEE.
Klettke, M., Störl, U., and Scherzinger, S. (2015). Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores. In DBIS '15, pages 425–444.
Mlýnková, I. and Nečaský, M. (2013). Heuristic Methods for Inference of XML Schemas: Lessons Learned and Open Issues. Informatica, 24(4):577–602.
Möller, M. L., Berton, N., Klettke, M., Scherzinger, S., and Störl, U. (2019). jHound: Large-Scale Profiling of Open JSON Data. BTW 2019.
Morales, S. F. (2017). Inferring NoSQL Data Schemas with Model-Driven Engineering Techniques. PhD thesis, University of Murcia, Murcia, Spain.
Scherzinger, S., Klettke, M., and Störl, U. (2013). Managing Schema Evolution in NoSQL Data Stores. In DBPL '13.
Sevilla Ruiz, D., Morales, S. F., and García Molina, J. (2015a). https://github.com/catedrasaes-umu/nosqldataengineering.
Sevilla Ruiz, D., Morales, S. F., and García Molina, J. (2015b). Inferring Versioned Schemas from NoSQL Databases and its Applications. In Conceptual Modeling, pages 467–480. Springer.
Svoboda, M., Čontoš, P., and Holubová, I. (2021). Categorical Modeling of Multi-Model Data: One Model to Rule Them All. In MEDI '21, pages 1–8. Springer.
Veinhardt Latták, I. (2021). Schema Inference for NoSQL Databases. Master thesis, Charles University in Prague, Czech Republic.