
tities. The non-redundant schemas derived from this
dataset—particularly at a threshold of 0.98—reveal
coherent semantic structures that mirror typical onto-
logical hierarchies (e.g., from product name to brand,
subcategory, and category). Among the evaluated
cutoffs, θ = 0.98 shows the strongest alignment with
the gold standard by optimizing the precision–recall
trade-off, maximizing agreement across axiom types,
and preserving a compact schema; stricter thresholds
underfit key concepts, whereas looser ones inflate the
axiom set without improving fidelity.
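As a rough illustration of how such a threshold comparison could be scored, the sketch below computes precision, recall, and F1 for the axiom set extracted at each cutoff against a gold standard. The function name, the pair-based encoding of axioms, and all values are hypothetical toy inputs chosen for illustration; they are not the paper's data or results.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted axioms against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold standard: each axiom is a (source, target) pair.
gold = {
    ("product", "brand"),
    ("brand", "subcategory"),
    ("subcategory", "category"),
}

# Hypothetical axiom sets per threshold (toy values, not measured results).
schemas_by_threshold = {
    1.00: {("product", "brand")},
    0.98: set(gold),
    0.90: set(gold) | {("category", "price")},
}

scores = {
    theta: precision_recall_f1(axioms, gold)
    for theta, axioms in schemas_by_threshold.items()
}

# Select the cutoff with the highest F1 on the toy data.
best_theta = max(scores, key=lambda theta: scores[theta][2])
```

In this toy setting the strictest cutoff loses recall and the loosest one loses precision, which is the shape of the trade-off described above.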
In contrast, the E-commerce Transactions dataset
displays a more fragmented structure, where no single
attribute universally determines the others. However,
clusters of strong dependencies (e.g., InvoiceNo →
CustomerID, StockCode → Description) suggest
the presence of localized semantic groupings such
as invoice, customer, and product entities. Despite
this, non-redundant schemas extracted from the de-
pendency matrix reveal limitations: at stricter thresh-
olds, key entities are isolated, while at looser thresh-
olds, semantically implausible dependencies emerge.
This highlights a central trade-off between seman-
tic precision and schema completeness when deter-
mining thresholds for dependency extraction. Within
this trade-off, a threshold of 0.96 best aligns with
the gold standard, recovering the core classes and
datatype properties with minimal noise—improving
completeness over tighter cutoffs (1.0, 0.99) while
avoiding the spurious links that appear at looser set-
tings (θ = 0.91).
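The effect of the threshold on dependency extraction can be sketched as follows. This section does not restate the exact definition of the functional dependency probability, so the estimator below is an assumption: it scores A → B as the fraction of rows whose B value matches the most frequent B value for their A value. The helper names and the toy rows are hypothetical; only the column names come from the dataset discussed above.

```python
from collections import Counter, defaultdict
from itertools import permutations


def fd_probability(rows, lhs, rhs):
    """Estimate how close lhs -> rhs is to a functional dependency.

    ASSUMPTION: the score is the fraction of rows whose rhs value equals
    the most frequent rhs value among rows sharing the same lhs value.
    Rows with a missing (None) value in either column are skipped.
    """
    groups = defaultdict(Counter)
    for row in rows:
        a, b = row.get(lhs), row.get(rhs)
        if a is None or b is None:
            continue
        groups[a][b] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0
    consistent = sum(max(counts.values()) for counts in groups.values())
    return consistent / total


def dependencies_above(rows, columns, theta):
    """Return every ordered pair (lhs, rhs) whose score is at least theta."""
    kept = {}
    for lhs, rhs in permutations(columns, 2):
        score = fd_probability(rows, lhs, rhs)
        if score >= theta:
            kept[(lhs, rhs)] = score
    return kept


# Toy rows (hypothetical values); "lamp" vs "Lamp" simulates a noisy label.
rows = [
    {"InvoiceNo": "A1", "CustomerID": "C1", "StockCode": "S1", "Description": "mug"},
    {"InvoiceNo": "A1", "CustomerID": "C1", "StockCode": "S2", "Description": "lamp"},
    {"InvoiceNo": "A2", "CustomerID": "C2", "StockCode": "S1", "Description": "mug"},
    {"InvoiceNo": "A3", "CustomerID": "C1", "StockCode": "S2", "Description": "Lamp"},
]
columns = ["InvoiceNo", "CustomerID", "StockCode", "Description"]

strict = dependencies_above(rows, columns, theta=0.96)
loose = dependencies_above(rows, columns, theta=0.75)
```

On these toy rows the strict cutoff keeps only the two exact dependencies, while the looser one admits many more pairs, including ones with no obvious semantic reading.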
The analysis of quality ratios confirms the impor-
tance of data completeness: missing values notably
reduce the interpretability and confidence of discov-
ered dependencies.
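One simple way to make the role of completeness concrete is a per-pair ratio of usable rows. The quality ratios used in this work are not defined in this section, so the measure below is only an assumed stand-in: the fraction of rows in which both attributes of a candidate dependency are present. The function name and the toy rows are hypothetical.

```python
def completeness_ratio(rows, lhs, rhs):
    """Fraction of rows in which both lhs and rhs hold a usable value.

    ASSUMPTION: a value counts as missing when it is None or an empty string.
    """
    if not rows:
        return 0.0

    def present(value):
        return value is not None and value != ""

    usable = sum(
        1 for row in rows if present(row.get(lhs)) and present(row.get(rhs))
    )
    return usable / len(rows)


# Toy rows (hypothetical values) with one missing CustomerID.
rows = [
    {"InvoiceNo": "A1", "CustomerID": "C1"},
    {"InvoiceNo": "A1", "CustomerID": "C1"},
    {"InvoiceNo": "A2", "CustomerID": None},
    {"InvoiceNo": "A3", "CustomerID": "C2"},
]

ratio = completeness_ratio(rows, "InvoiceNo", "CustomerID")
```

A dependency score computed on a pair with a low ratio rests on fewer observations, which is one reason to report the two numbers together.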
Our approach (i) provides a smooth, quantitative
spectrum for assessing how close a relationship is to
being functionally deterministic; (ii) supports practical
applications in data quality assessment, normalization
design, and error detection in tabular data; and
(iii) allows the granularity of instances to be determined
for each inferred class. Further work will focus on
modeling hierarchical attributes, computing multivariate
dependencies that consider relationships between
multiple attributes, and score-based schema selection.
5 CONCLUSIONS
This study presents a probabilistic framework for
modeling functional dependencies in tabular datasets.
Our approach captures varying degrees of
functional association through the functional dependency
probability matrix, complemented by quality
ratios. This enables the identification of semantically
meaningful structures, even in the presence of noisy
or incomplete data, and facilitates the construction of
non-redundant schemas that align with intrinsic data
semantics.
DATA AVAILABILITY
The data generated in this work is available in our
GitHub repository³.
ACKNOWLEDGEMENTS
This research has been funded by
MICIU/AEI/10.13039/501100011033/ [grant numbers
PID2020-113723RB-C22, PID2024-155257OB-I00].
³ https://github.com/gines-almagro/Inferring-Semantic-Schemas-from-Functional-Probabilities
Inferring Semantic Schemas on Tabular Data Using Functional Probabilities