Joins vs. Links or Relational Join Considered Harmful
Alexandr Savinov
Bosch Software Innovations GmbH, Stuttgarterstr. 130, 71332 Waiblingen, Germany
Keywords: Data Processing, Data Modeling, Join Operation, Links, References, Connectivity.
Abstract: Since the introduction of the relational model of data, the join operation is part of almost all query languages
and data processing engines. Nowadays, it is not only a formal operation but rather a dominating pattern of
thought for the concept of data connectivity. In this paper, we critically analyze properties of this operation,
its role and uses by demonstrating some of its fundamental drawbacks in the context of data processing. We
also analyze an alternative approach which is based on the concept of link by showing how it can solve
these problems. Based on this analysis, we argue that link-based mechanisms should be preferred to joins as
a main operation in data model and data processing systems.
1 INTRODUCTION
A data model is a definition of data, that is, it
answers the question what is data. It defines how
data is organized and how it is manipulated by
providing structural elements and operations. For
example, the relational model (Codd, 1970)
organizes data by using tuples, domains and
relations while the functional data model (Sibley and
Kerschberg; 1977) does the same by using functions.
One of the major concerns in any data model is
describing how elements are related or connected as
well as providing means for retrieving related
elements. The mechanism of connectivity
determines such important aspects as semantic
clarity, conciseness of queries, maintainability and
performance.
Probably the most wide spread approach to
retrieving related elements in a database is based on
the operation of relational algebra called join. A join
takes two or more relations as input and produces a
new relation as output. The output relation contains
tuples composed of related tuples from the inputs.
The way they are related is specified in the join
condition which is a parameter of the operation.
Although the join operation dominates in the
area of data management, there are also alternative
approaches. One of them defines how data elements
are related and accessed by using the notion of link
or reference (we will use these terms
interchangeably because their differences are not
important for this paper). A link is a value which
identifies a data element (referent) and is used to
access it. It is a very simple and natural concept
which dominates in programming but is also used in
many data processing systems and models.
This paper is devoted to comparing joins and
links as alternative mechanisms for accessing related
data elements which are based on completely
different patterns of thought. We critically analyze
join operation and demonstrate that it has some
significant drawbacks while links on the other hand
can solve these problems. We argue that joins should
not be used in data modeling and data processing
systems or at least their role should be significantly
diminished. We also want to show that links can be
used as a basis for a new mechanism that can replace
joins and do what joins have been intended for.
The paper has the following layout. Section 2
and Section 3 are devoted to describing joins and
links, respectively. Each of these sections starts from
describing the corresponding mechanism and then
we analyze their more specific properties. Section 4
makes concluding remarks.
2 “WHO IS TO BLAME?” JOINS
2.1 What is in a Join? Common Value
Let us assume that there are two CSV files with a list
of employees and departments which might have the
followings structure and data:
362
Savinov, A.
Joins vs. Links or Relational Join Considered Harmful.
DOI: 10.5220/0005932403620368
In Proceedings of the International Conference on Internet of Things and Big Data (IoTBD 2016), pages 362-368
ISBN: 978-989-758-183-0
Copyright
c
2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
emp,name,dept_id
25,Smith,RD
dept,mngr,location
RD,30,Stuttgart
These two lines are stored in different files and
formally are not connected because there is no
indication of any relation between them. However,
we can easily see that both of these lines store the
same value ‘
RD’ and we also know that this value
(semantically) represents the same entity.
This fact of being characterized by the same
value is a basis for establishing relationships
between data elements. In our example, the fact that
one employee and one department are both
characterized by the value ‘
RD’ is not a coincidence
but rather a way to represent that they are connected
(Fig. 1). In other words, two data elements are
supposed to be related if they have something in
common or, more specifically, share the same value.
Therefore, this general connectivity principle could
be called a common value or shared value approach.
Importantly, in addition to the two data items being
connected, there is a third item stored in both of
them and treated as a means of connectivity. There
is no way to connect two elements without
specifying a third element they share.
Figure 1: Two elements are related if they share a third
element (data value).
Having something in common is a conceptual
definition of related elements. It allows us to
determine whether two elements are related or not.
However, the problem is not only to define related
elements but also to retrieve them by materializing
this conceptual relationship. In the relational
algebra, such a mechanism is provided by the join
operation which is applied to relations and returns a
new relation. In addition to input relations, this
operation needs a parameter which provides a
criterion of connectivity. The output relation will
contains only tuples satisfying this condition.
Although joins can specify any condition the
combinations of input tuples have to satisfy (theta-
join), the most common case is to select only input
tuples which contain equal values of some attributes
(equijoin). Fig. 2 provides an example of joining two
tables
Employees and Departments by producing a
new table
Emps_Depts. Note that every tuple of the
result table is a combination of matching tuples from
the two input tables which share the same value.
Figure 2: Output relation contains combinations of related
tuples from input relations.
The idea of using common values for matching
data elements has its formal roots in predicate
calculus and is used various technologies, for
example, deductive databases (Ullman & Zaniolo;
1990) or query languages. If two predicates in a
logical expression have the same free variables then
they have to be bound to the same value in order for
the resulting proposition to be true. For example,
given two predicates
Employees(emp,ename,city)
Departments(dept,dname,city)
we can define a logical expression Employees(emp,
ename, city) & Departments(dept, dname,
city) which will be true only if the city variable
takes the same value. In this way, we can retrieve all
employees who are located in (share) the same city
as their department.
2.2 Join is Symmetric
The underlying semantics of the join operation is
two elements are defined to be related if they have
the same property. This property makes it a
symmetric operation where all inputs have the same
roles. For example, if we look at this join condition
Employees.dept_id=Departments.dept
then it is not possible to assign a special role to one
of the input tables
Employees or Departments (Fig.
3).
Here we see one major problem of joins: in fact,
these two tables have different semantic roles and
the relationship between them is not symmetric. One
indication of asymmetricity is that it is a many-to-
one relationship where many employees belong to
one department. Another observation is that this
Employees
Departments
emp name dept_id
25 Alex RD
26 John HR
27 Anna HR
emp name dept_id
25 Alex RD
dept mngr location
RD 30 Dresden
HR 20 Berlin
dept mngr location
RD 30 Dresden
emp name dept_id dept mngr locatioin
25 Alex RD RD 30 Dresden
26 John HR HR 20 Berlin
27 Anna HR HR 20 Berlin
Emps_Depts
Employees
Departments
25 SmithRD
RD
RD 30 Stuttgart
Sharedvalue
Joins vs. Links or Relational Join Considered Harmful
363
same semantics can be (correctly) represented as an
employee record referencing one department (but
not vice versa). Also, the shared value is a
department identifier and employee identifier. If we
change the direction of this relationship then we
change the meaning of the connection. Yet, joins are
not able to represent this semantics because all input
relations have equal rights in a join.
Figure 3: Join is symmetric.
In the relational model, a unit of connectivity
(between domains) is one relation and composition
means joining relations. In other words, join allows
to compose or chain relations. Now assume that we
have 10 relations and want to retrieve tuples from
one of them which are related to tuples in another
relation. Formally, we need to build a Cartesian
product of these relations by adding also all relevant
join conditions. This approach is highly unnatural
and very difficult to use in practice (therefore, its use
is quite rare). Why we have to include all input
relations if we want to retrieve records from only
one of them? It is also not obvious what join
conditions to use (especially if we do not have FKs
declared). It is therefore very easy to produce
formally correct but semantically wrong results. And
the reason is that the constraints are propagated
along a sequence of relations connected by common
values. It is probably one of the reasons why the
development of the conception of automatic
reasoning in the unified relation model (URM) failed
(Maier et al., 1984).
It should be noted that there exist a mechanism
of foreign keys (FK) that can solve this problem of
symmetricity of joins. Indeed, a FK declares one
input relation and one output relation which have
different roles. If two relations are used in a join
then this FK declaration can be used as a semantic
specification of our intention in the join operation.
Yet, the use of FKs for this purpose has the
following drawbacks. First, it is not an original
purpose of FKs to describe semantics of joins (FK is
a mechanism of imposing constraints). Second, the
need in an additional mechanism like FK
emphasizes that joins have some limitations. Third,
FKs are not enforced by existing models in general
and they must not be used in the context of joins in
particular. Join is an operation which is used at
query time while FK is a declaration which is used
at design time. Fourth, it can be difficult to
understand how to use FKs in the case of arbitrarily
complex join conditions. Essentially, FK is an
attempt to introduce a mechanism of links but they
have incompatible semantics and therefore their
simultaneous use is quite controversial and eclectic.
The use of FKs in combination with joins is
analogous to introducing constructs for structural
programming along with goto operator. Their
simultaneous use will result in strange mixtures of
different patterns.
2.3 Join is a Cross-cutting Concern
Let us assume that we want to get a list of
employees working at some department. It can be
done by means of the following query:
SELECTE.emp,D.dept,D.location
FROMEmployeesE,DepartmentsD
WHEREE.dept_id=D.dept
An important observation here is that many similar
queries will include the same join condition. In other
words, this same join condition
E.dept_id=D.dept
will appear in quite many queries which involve
these two tables. It is because join conditions
describe the details of how entities are connected in
the model rather than the logic of what needs to be
retrieved.
Such fragments of the source code which scatter
throughout the whole program or query are referred
to as a cross-cutting concern (Kiczales et al., 1997).
The existence of such repeated fragments of code is
an indication of either bad design or impossibility to
modularize their logic due to limitations of the
language. The main negative consequence is that the
same fragment can appear in quite different and
unrelated contexts semantically belonging to
different levels of organization. As a result, the
program or query can become error-prone and
difficult to maintain. The solution of this problem is
to provide a mechanism for modularizing such
repeated fragments in a separate modeling construct.
Ideally, a mechanism of connectivity should
declare how different entities in the whole model are
connected independent of where these connections
will be used. Yet, in join-based queries, both the
logic of the query and the logic of the connections
between relations are described together in the same
construct. It is a typical example of mixing different
concerns. On one hand, the main purpose of the
query is to retrieve employees with the related
department information. It is application-specific
logic and we do not care how these relations are
Employee
String
Departments
dept_id
dept
domains
relations
join
IoTBD 2016 - International Conference on Internet of Things and Big Data
364
connected. On the other hand, this same query
involves a fragment which has nothing to do with
this application because it describes how the
connection between employees and departments is
implemented (independent of any queries). These
two concerns should be separated but the approach
based on joins does not support this separation.
This join-based approach to querying data leads
to exposing low-level structure of connections at the
higher level of application-specific logic. For
example, if later on we will change the way relations
(departments and employees) are connected then all
queries that use it will have to be updated. It is also
error-prone because query writers are not necessarily
experts in connectivity – they have to know only that
these two tables have a specific connection while its
implementation should be effectively hidden in a
separate module or mechanism. This task of
combining connectivity criteria with the business
logic of the query is especially difficult in the case
of complex multi-table queries. It is also a potential
security flaw because a low-level definition of
connectivity can be influenced from the higher level
of user-oriented operations. This explicit use of join
conditions can also result in lower performance
because the database engine cannot use
optimizations for pre-defined connections but rather
has to assume the possibility of any join conditions
in any new query.
2.4 Joins do Not Support Types
Let us assume that department identifiers are unique
strings and we want to retrieve tuples with the same
department:
SELECT*
FROMEmployees,Departments
WHEREdept_id=dept
If we now modify this query by changing its join
conditions as follows
WHEREdept_id=dept_name
then we get a formally valid query which however is
semantically wrong because it does not make sense
to compare department identifiers with department
names. The problem is that
dept_id is declared as a
string but semantically it is a
Department. Yet, we
are not able to declare the correct type of
dept_id:
neither as an attribute type at the level of the model
(relations cannot be used as types) nor at the query
level in the join condition. Therefore, it is not
possible to prevent the user from writing
semantically wrong conditions even if we know that
they are wrong.
It should be noted that foreign keys could help in
this situation. But it is an auxiliary mechanism
which exists independently of types (and joins).
Both FKs and types are used to constrain sets but do
it differently. In our example, if we want to
(correctly) declare the
dept_id attribute as
referencing a department then we either add a
foreign key that constrain the values by those
existing in the
Departments table or directly use
Departments as a type. It is obviously duplication of
functionality in two mechanisms with significant
negative consequences.
3 “WHAT IS TO BE DONE?”
LINKS
3.1 What is in a Link? Inclusion
Links or references are used if it is necessary to
represent some target element within this element.
This representation can be then used to access the
target element. Links are widely used in
programming languages to represent and access
objects using some kind of a unique value. This
value, called reference, can be then stored in other
objects. As a result, a program is represented as a
graph where nodes are objects and edges are
references. Link is also a basic mechanism of the
world wide web where text documents can reference
other text documents by using their unique
identifiers. Users can traverse this graph of
documents by following links and retrieving new
documents.
Link as a mechanism of connectivity possess the
following general properties:
Link is a value (passed by-value only) which
uniquely represents some object (referent) and is
used to access it. Primary key or similar
annotation is a design pattern that can be used to
model references but is different from true
references because references do not belong to
the represented object properties.
Link is a mechanism of indirectly including one
object in another object. Foreign keys can be
used to declare (annotate) such an inclusion. Yet,
it is a design pattern and not a true link
declaration. In particular, foreign keys cannot be
used for access and do not have types.
The mechanism of links assumes the existence of
three roles: represented object (referent),
reference (link), and a referencing object which
stores the reference.
Joins vs. Links or Relational Join Considered Harmful
365
The mechanism of access provided by a
reference is hidden in its implementation and is
not exposed at the application level where links
are used.
Link is a directed relation. In particular, this
means that one (referencing) entity knows about
the other (referenced) entity but not vice versa.
Links on sets can be formally described by using
mathematical functions (mappings).
3.2 Link is a Directed Relationship
Links are not supported in the relational model of
data. Yet, conceptually we can think of table records
as being linked by using the following interpretation.
If a record stores one or more values which uniquely
identify and can be used to access another record
then these values are thought of as a link. For
example, if one employee record stores a unique
identifier of a department then this means that this
employee references this department (Fig. 4). Note
however that although the (physical) representation
via tables is the same for both joins and links, they
have different meanings.
One of the main distinguishing features of links
is that the ordering of the roles differs from that used
in the join-based model. Links also use three roles:
identifier (reference), referenced entity (referent)
and referencing entity. However, rather than sharing
the same value (department identifier in Fig. 4), the
Employees table includes an identifier from the
Department table while which in turn includes a
string as identifiers. It is not only a visual ordering –
it has significant semantic consequences.
Figure 4: Two tuples are connected by storing in one of
them a data value representing the other one.
In contrast to joins, links represent binary
directed relationships among data elements with
opposite roles for the two arguments. One advantage
of such an approach is that links are much easier to
compose. For example, given an employee element,
we can find the corresponding department address
by composing two references:
Addressaddr=emp.department.address;
This composition operation (syntactically encoded
as dot notation) follows a path in a directed graph of
links which is a very natural and semantically
unambiguous operation. For comparison, the same
task can be solved by applying two join operations:
SELECT*FROM
EmployeesE,DepartmentsD,AddressesA
WHEREE.dept_id=D.dept&&D.addr_id=A.addr
Here it is difficult to determine the real purpose of
this query and the role of its arguments (relations)
because the result of the operation is one relation
combining tuples from the three input relations. It is
also not obvious how to inverse this query.
The difference between link composition and
join composition is analogous to that between
function composition in mathematics and bindings
in predicate calculus. In fact, these are two quite
different ways to think about connectivity. The join-
based approach assumes that two elements are
connected if they both include a third element. The
link-based approach assumes that two elements are
connected if one of them includes (a reference to)
the other.
3.3 Links Modularize and Hide the
Access Mechanism
If we want to access related records using the
mechanism of joins then we have to provide a
specification of how the records are related.
Moreover, this specification has to be provided for
each query in the same statement with other
application specific parameters.
Links are much easier to use. In order to retrieve
related records it is necessary to provide only one
attribute name. For example, given a reference to an
employee object, we can get a related department
object by simply specifying an attribute name:
Departmetdept=emp.department;
What is important here is that we actually do not
know how departments are retrieved and what is the
definition of the relation between departments and
employees. This definition and the access
mechanism are hidden in the
department attribute.
Effectively, two concerns are principally separated:
how links are used, and
how links are defined
This approach provides significant advantages
especially for complex systems. We can easily
Employees
Departments
emp name dept_id
25 Alex RD
RD
dept mngr location
RD 30 Dresden
Canbemodele
d
byPKs
Canbemodele
d
byFKs
IoTBD 2016 - International Conference on Internet of Things and Big Data
366
change the definition of a link without the need to
update all its usages. One possible implementation
of user-defined links is implemented in
DataCommandr (Savinov; 2016). For example,
assume that we want to re-design our data model so
that an employee can be assigned to more than one
department. In this case, the department attribute
will have to be redefined so that it returns an
element marked as a main department. If we were
using joins then this would require modification of
all queries that access departments using an
employee. In the case of links, it is enough to change
only one link definition. References can also be
applied to sets in the form of project operation and
arrow notation as proposed in the concept-oriented
model of data (Savinov; 2014).
3.4 Links Support Types
Type is a set of elements and specifying a type
means imposing constraints on possible elements.
Any link definition involves two such types: a set of
input elements and a set of output elements. Thus
links very naturally support typing as an integral part
of this mechanism.
For example, if we want to define a link from
employees to departments then this means that it has
to work for only input elements from the
Employees
set and output elements from the
Departments set.
These constraints can be declared as a function
signature:
Departmentsdepartments(Employeesthis);
They also can be declared as a field type of a class:
classEmployees{
Departmentsdepartments;
}
Note that links allow us to impose type constraints
independent of whether the output is a relational
domain or another relation. Also, typed links make
foreign keys unnecessary.
4 CONCLUSIONS
The main result of this paper is that although join is
a powerful formal operation it has the following
major drawbacks in the context of data processing:
Join relies on the semantics of common values
for representing connectivity which is not very
natural semantically and can be counterintuitive
when using composition.
Join is a cross-cutting concern because the same
join condition can spread over many queries by
mixing two concerns: application specific logic
of the query and how connectivity is
implemented.
Join does not inherently support typing because
two related (joined) sets are separated by a third
set with common values.
Therefore, the use of joins normally requires high
expertise and can lead to error prone and difficult to
maintain queries and data models.
We also analyzed an alternative mechanism of
links which provides the following benefits:
Links rely on the semantics of inclusion of an
identifier of one element into another element
which is very natural for describing how things
are connected and allows for easily composing
and inverting the relationships.
Links effectively modularize and hide the
underlying mechanism of access so that the main
logic of the query does not involve the
description of how the connectivity is
implemented. Primary keys can be viewed as one
possible design pattern for implementing true
references.
Links are typed by specifying sets for their input
and output elements which is precisely what is
normally needed. Foreign keys can be viewed as
one possible design pattern for implementing
true typing.
Taking into account these properties we argue that
links should be preferred to joins as a primary
mechanism for accessing related elements in
databases. Yet, classical references miss some
properties which are of crucial importance for data
modeling. How links can be revisited in order to
overcome these drawbacks will be our focus for
future research.
REFERENCES
Codd, E., 1970. A Relational Model for Large Shared
Data Banks. Communications of the ACM, 13(6), 377-
387.
Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C.,
Lopes, C., Loingtier, J.-M., Irwin, J., 1997. Aspect-
Oriented Programming. ECOOP’97, 220-242.
Maier, D., Ullman, J.D., Vardi, M.Y., 1984. On the
foundation of the universal relation model. TODS’84,
9(2), 283–308.
Savinov, A., 2014. Concept-oriented model. In J. Wang
(Ed.), Encyclopedia of Business Analytics and
Joins vs. Links or Relational Join Considered Harmful
367
Optimization. IGI Global, 502-511.
Savinov, A., 2016. DataCommandr: Column-Oriented
Data Integration, Transformation and Analysis.
Internet of Things and Big Data (IoTBD’2016).
Sibley, E.H. & Kerschberg, L., 1977. Data architecture
and data model considerations. In Proceedings of the
AFIPS Joint Computer Conferences. 85-96.
Ullman, J.D., Zaniolo, C., 1990. Deductive databases:
achievements and future directions. ACM SIGMOD
Record, 19(4), 75-82.
IoTBD 2016 - International Conference on Internet of Things and Big Data
368