but also seeks to tackle the root causes of
data-sharing reluctance, such as data ownership concerns.
In this paper, we present an open data paradigm,
embodied in an open data ecosystem called Aralia. It
follows Web3 principles, featuring a
decentralized framework that aligns with the dynamic
nature of modern data to serve the development of AI
agents. The key property of this decentralized
environment is its ability to extract synergetic insights
from multiple sites without downloading or copying
datasets, thereby alleviating concerns such as data
privacy and integrity. Data providers would thus be
more willing to grant secure access to other data users
and unlock opportunities to explore diverse data, all
the while adhering to environmental principles.
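As an illustration of extracting an insight from multiple sites without copying datasets, the following minimal Python sketch lets each site return only an aggregate (a sum and a count), which a coordinator combines into a global mean. It is not Aralia's actual interface: the class `DataSite`, the function `federated_mean`, and the column name `pm25` are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DataSite:
    """A hypothetical data provider whose rows stay on its own site."""
    name: str
    rows: List[Dict[str, float]]

    def partial_sum_count(self, column: str) -> Tuple[float, int]:
        """Return only an aggregate (sum, count), never the raw rows."""
        values = [row[column] for row in self.rows if column in row]
        return sum(values), len(values)


def federated_mean(sites: List[DataSite], column: str) -> float:
    """Combine per-site aggregates into a global mean."""
    total, count = 0.0, 0
    for site in sites:
        site_sum, site_count = site.partial_sum_count(column)
        total += site_sum
        count += site_count
    if count == 0:
        raise ValueError(f"no values found for column {column!r}")
    return total / count


# Two sites with illustrative (made-up) readings.
site_a = DataSite("site_a", [{"pm25": 10.0}, {"pm25": 14.0}])
site_b = DataSite("site_b", [{"pm25": 30.0}])
print(federated_mean([site_a, site_b], "pm25"))  # prints 18.0
```

In this sketch the coordinator only ever receives two numbers per site, so the underlying rows never leave the provider. A real deployment would additionally need access control and safeguards against inference from small aggregates, which are outside the scope of this example.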
An open data ecosystem thrives on its viral
interplay between data providers and data consumers,
enabling data to attract more data without
redundancy. At its core is sustainability driven by
a shared responsibility model, in which data providers
commit to maintaining high-quality datasets and
consumers engage in responsible data usage. Mutual
benefit ensures trust, innovation, and lasting viability.
1.3 AI-Ready Data Provision
This initiative aims to build an AI-ready data service
infrastructure through which AI agents can access live
and diverse data worldwide, sparing agent developers
the effort of data acquisition and preparation.
The key question that remains is what defines
“AI-ready data provision”. An AI agent that relies on
a data provisioning service must trust the service’s
quality and its governance over time. Besides
modelling the Aralia open data ecosystem, we present
a criteria framework for AI-ready data provision,
covering three facets of AI readiness: service
infrastructure, data, and application framework.
In this paper, we focus on structured data (i.e.,
tabular datasets), which constitutes a substantial
portion of grounding data and serves as a critical
foundation for data analysis and AI applications.
While our focus is on structured data, it is worth
noting that the challenges of unstructured data have
been thoroughly addressed in the contexts of the
World Wide Web and AI, both of which provide a
wealth of resources for AI agents to access.
2 BACKGROUND
We began our research after observing the inertia of
the data economy and the paradox it embodies. The
paradigm shift ignited by the LLM and agentic AI
tsunami further amplified the issues related to data
provisioning, which expanded our focus to cover
an AI-ready data provision framework.
2.1 AI-Ready Data
Recent research on AI-ready data focuses on its role in
AI-driven applications, emphasizing attributes such
as real-time availability, interoperability, and
contextual relevance. Whang et al. (2023) found that
data preparation accounts for a significant portion of
the machine learning process and studied data
validation, cleaning, and integration techniques for
data-centric AI. Machine learning and AI models that
rely on large-scale training datasets have also put
pressure on data quality (Jain et al., 2020).
The U.S. National Science Foundation (NSF)
released the report "National AI Research Resource
(NAIRR) Pilot Seeks Datasets to Facilitate AI
Education and Researcher Skill Development",
calling for high-quality AI-ready datasets across
various fields to directly support AI-related education,
research, and model training.
To address AI-readiness, the principles of findability,
accessibility, interoperability, and reusability (FAIR)
for data and AI are adopted to create AI-ready
datasets. This matters especially in disciplines
subject to restrictions that prevent data fusion and
centralized analyses, commonly imposed by federal
regulations, consortium-specific data usage
agreements, and institutional review boards (Chen et
al., 2022). These restrictions have spawned the
development of federated learning approaches,
privacy-enhancing technologies (PETs), and the use of
secure data enclaves. Developing AI models by harnessing
disparate data enclaves will only be feasible if
datasets adhere to a common set of rules, or FAIR
principles.
Meanwhile, there is a growing emphasis on the
ability to include external data in
piecing together a broader understanding for
analytical goals. Without confidence that data is
treated “fairly”, data owners may be unwilling to
participate in data sharing (Aaron Gabisch & R.
Milne, 2014).
The concept of data ownership has become more
important with the prevalence of big data, AI, and
privacy laws (Asswad & Marx Gómez, 2021).
Continuous breaches from data scams, in addition to
unintentional leaks, have raised grave concerns and
prompted calls for data ownership, stewardship, and
governance measures (Baker, 2024).