Authors:
Fhabiana T. Machado
;
Deise Saccol
;
Eduardo Piveta
;
Renata Padilha
and
Ezequiel Ribeiro
Affiliation:
Federal University of Santa Maria, Santa Maria, Brazil
Keyword(s):
JSON, Schema Extraction, Information Integration.
Abstract:
NoSQL (Not Only SQL) document-oriented databases stand out because of the need for scalability. This storage model promises flexibility in documents, using files and data sources in JSON (JavaScript Object Notation) format. It also allows documents within the same collection to have different fields. Such differences occur in database integration scenarios. When the user needs to access different datasources in an unified way, it can be troublesome, as there is no standardization in the structures. In this sense, this work presents a process for conceptual schema extraction in JSON datasets. Our proposal analyzes fields representing the same information, but written differently. In the context of this work, differences in writing are related to treatment of synonyms and character. To perform this analysis, techniques such as character-based and knowledge-based similarity functions, as well as stemming are used. Therefore, we specify a process to extract the implicit schema present in
these data sources, applying different textual equivalence techniques in field names. We applied the process in an experiment from the scientific publications domain, correctly identifying 80% of the equivalent terms. This process outputs an unified conceptual schema and the respective mappings for the equivalent terms contributing to the schema integration’s problem.
(More)