On Usage Control in Relational Database Management Systems

Obligations and Their Enforcement in Joining Datasets

Mortaza S. Bargh (1), Marco Vink (1) and Sunil Choenni (1,2)

(1) Research and Documentation Centre, Ministry of Security and Justice, The Hague, The Netherlands

(2) Creating 010, Rotterdam University of Applied Sciences, Rotterdam, The Netherlands

{m.shoae.bargh, m.e.vink, r.choenni}@minvenj.nl, r.choenni@hr.nl

Keywords: Access Control, Inner Join, Obligations, Privacy, Usage Control.

Abstract: When datasets are collected and accessed legitimately, they must still be used appropriately according to

policies, guidelines, rules, laws, and/or the (current) preferences of data subjects. Any inconsistency

between the data collection and data usage processes can conflict with many principles of privacy like the

transparency principle, no secondary use principle, or intended purpose usage principle. In this contribution

we show how the usage control for the inner join operation in vertically separated relational datasets can be

characterized as pre- and post-obligations of the Usage Control (UCON) model. This type of obligation is defined not only by the state of the UCON object (i.e., a dataset) itself, but also with respect to the state of another dataset. Such dependency on two datasets/objects provides a new insight into UCON obligation constructs when applied to the join operation. We also describe a mechanism to realize the identified obligation in a database management system and present an example realization of the proposed mechanism. Furthermore, we list a number of methods to determine whether two given datasets can be joined.

1 INTRODUCTION

Currently, data are created at an explosive rate with

the surge of new services/applications as well as

smart and sensory devices. Digitalization and e-

administration, e-services, Big Data, Open Data, and

Internet of Things are example cases that contribute

to this data outpouring and overflow. Consequently,

it becomes a common practice in (business) data

analytics and data intensive applications to integrate

data from different sources, of various types, of

large volumes, and/or of high rates. These

applications and services aim at easing our daily

lives, providing insight into societal phenomena, or

creating added values for businesses. Delivering

these benefits, however, must not violate or

compromise, for example, the privacy, commercial,

and intellectual rights of individuals and parties who

contribute their data to the data integration process.

For a long time, access control mechanisms have

been used to protect the security and privacy of data.

An access control mechanism controls the access to

the data by granting or rejecting an access request.

Although in this way the input datasets for a data

integration process may be acquired or accessed

legitimately, it is crucial for the output dataset of the

data integration process to be legitimate and

acceptable for all parties who provided the input

datasets. For example, the privacy and business

sensitivity requirements of these parties must be

preserved. Today, personal devices produce more personal data than ever before. Big data analytics makes it possible to combine these data, resulting in (new) personal data that may expose the private lives of people in considerable detail. Such data combinations may result in unexpected and harmful impacts on individuals. Therefore, access control is insufficient in the current era of data explosion.

Given the fact that the access to data is obtained

legitimately, one needs to control how the data are

used practically. Suppose that a tax officer needs to

know the name, the annual income, the spouse’s

name, and the number of children of a person in

order to carry out his/her tasks. It is not, however,

the business of the tax officer to find out how many

spouses or children per spouse a certain person (like

a celebrity) has had. The system, therefore, should

note such illegitimate use of attribute values and

exclude them from the tax officer’s access.

Therefore, a query like “find all spouses of singer-X and for each spouse the name of the children” is an improper use of the attribute values and should not be executed.

Bargh, M., Vink, M. and Choenni, S.

On Usage Control in Relational Database Management Systems - Obligations and Their Enforcement in Joining Datasets.

DOI: 10.5220/0006209801900201

In Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP 2017), pages 190-201

ISBN: 978-989-758-209-7

Determining the (privacy) policies that govern such data integrations becomes increasingly difficult due to the availability of vast amounts of background information to data receivers and adversaries. For

example, one cannot predetermine the datasets that

will be encountered and integrated with a given

dataset in the future. This makes it difficult to assess

the potential risks in combining the released data

with any other datasets (i.e., with the background

information). This uncertainty relates to the extrinsic

characteristics of data, e.g., the (privacy) issues of a given dataset in relation to other datasets. The other datasets exist in the outside world due to, for example,

sequential data release, multiple data release,

continuous data release, collaborative data release,

social networks, Big Data, and Open Data.

One may conclude that it is unwise to share data

anymore. This policy appears to be too restrictive

and unrealistic nowadays. Another solution direction

is to devise and realize mechanisms that control

compliance with data privacy policies after sharing

the data with others, i.e., during the data usage

lifecycle. This solution, which can be realized in

controllable environments like an organization’s

Database Management System (DBMS), requires a

flexible and adaptive framework that decides based

on a data integration policy and enforces the

decision at runtime. Hereby it becomes possible to

deal with the issue of authorized-access and

unauthorized-use of datasets (Choenni et al., 2016).

To this end, for example, the Usage Control

(UCON) model (Park and Sandhu, 2004) is one of

the promising models.

Our research objective is to control the usage of

relational datasets in volatile and dynamic settings

i.e., when data analysts gradually and unforeseeably

gain access to datasets and want to link/integrate a

subset of these datasets. We limit our scope to

relational databases and those structured datasets

that are vertically separated. By vertically separated

datasets we mean vertically distributed datasets, as

illustrated in (Karr et al., 2007), which are not

necessarily at different locations (i.e., they can be

collocated as in the case of typical data warehouse

environments). We consider the usage control for

the inner join operation among these vertically

separated datasets. Inspired by the UCON model, we specifically investigate how the inner join operation can be framed in such a data usage control framework. This investigation results in a new insight into UCON obligation constructs. As our first contribution, we distinguish a new type of obligation where the state of the object (e.g., a dataset) is determined with respect to the existence of another dataset. This type of dependency, to the best of our knowledge, has not been identified so far. As

our second contribution, we present a mechanism to

realize the identified obligation in a DBMS. As our

third contribution, we present an example realization

to illustrate how the proposed mechanism can be

implemented and analyze the results. Furthermore, we list a number of methods for determining whether two given datasets can be joined.

The paper starts with a problem statement in

Section 2 and provides some background

information on access control and usage control in

Section 3. Subsequently Section 4 presents our

proposed approach and mechanism. Section 5

describes our example realization of the proposed

mechanism and discusses its issues. Section 6

presents the related work and Section 7 captures our

conclusions and future research directions.

2 PROBLEM STATEMENT

In this contribution we shall focus on the issue of

authorized-access and unauthorized-use of datasets

that are vertically separated, as described below.

2.1 Motivation

When collected datasets are accessed legitimately,

they should still be used appropriately according to

policies, guidelines, rules, laws, and/or the (current)

preferences of data subjects. For example, in the

context of business and public administration, data

may be collected for a specific data registration (e.g., for hospital, municipality or judicial administration purposes), due to a service operation (e.g., the list of website visitors or mobile telephony users), or for a research study (e.g., a study on household or crime victimization). As such, the data can be collected

within different legal domains corresponding to

regions/countries, public sectors (e.g., healthcare,

justice, and trade), etc. Many issues may arise when the data are used in a context other than the one for which they were collected and accessed. Such an

inconsistency between the data collection and data

usage processes can conflict with, for instance, many

principles of privacy like the transparency principle,

no secondary use principle, or intended purpose

usage principle.

Nowadays many cases arise where it is important

to deal with unauthorized usage of those datasets


that are accessed in an authorized way.

Businesses, organizations and services merge in various public, private and semi-public sectors. For

example, Google has merged various services like

Gmail, Google+, Google Drive; and Facebook has

acquired Instagram and WhatsApp. Such strategic mergers require integration of information systems,

with various datasets that are generally collected for

different purposes and within various contexts.

There are also Open Data initiatives to release public sector data to citizens as a means of, among others, promoting government transparency, stimulating innovation and economic growth, and enabling public participation in government (Dawes, 2010a; Dawes, 2010b). Such

initiatives motivate and encourage combining data

from various sources in order to deliver added value

services and insights. In such cases where

information systems and data are integrated, there

are potential risks of privacy breaches when (self-

provided) data of users are combined with data

retrieved from elsewhere (Bargh and Choenni, 2013; Fung et al., 2010).

Within one organization collected data can also

be used in an unauthorized way due to, for example,

secondary use, i.e., data that are collected for one

purpose but are used for another one.

Crowdsourcing, for instance, is a means of

collecting relevant data in an affordable way. The

resulting datasets may encompass some sorts of

personal data from participants such as profile data

(including their names, email addresses and phone

numbers), activity data (indicating their sporting,

sleeping, and eating habits), and situational data

(revealing their visited locations, adjacency to other

users/objects, and conversation buddies). Such

personal data must basically be accessible to a

limited number of authorized entities (like system

administrators and specific services/systems) and be

used in an authorized way (like for the specified

purpose). Authorized insiders with ill intentions (i.e.,

those insider intruders or employees with

questionable ethics as mentioned in (Agrawal et al.,

2002)) may reveal and misuse such personal

information that they have access to for their

illegitimate purposes like personal satisfaction,

financial gains, and political benefits. Revealing

personal information makes data subjects (i.e., those

individuals and organizations that the data are about)

vulnerable to cyber attacks such as identity theft, phishing and spam, and privacy breaches.

Therefore, the crowd may become fearful and

unwilling to participate in the data collection process

due to being subjected to such threats and becoming

victims of such attacks. Even when users voluntarily participate in crowdsourcing, they sometimes desire that their personal information not be processed in certain situations, for instance during evenings, at weekends, and during holidays.

Even highly sensitive data attributes may be

disclosed or inferred by means of easily accessible

data and data linkage. Kosinski et al. (2013) show that easily accessible digital records

of behavior, e.g., Facebook Likes, can be used to

automatically and accurately predict a range of

highly sensitive personal attributes (such as sexual

orientation, ethnicity, religious and political views,

personality traits, intelligence, happiness, use of

addictive substances, parental separation, age, and

gender). De Montjoye et al. (2013) analyzed a

dataset of fifteen months of human mobility data for

1.5 million individuals. According to (de Montjoye

et al., 2013), human mobility traces are highly

unique. For example, when the location of an

individual is specified hourly at the precision level

of mobile network cells, it is possible to uniquely

identify 95% of the individuals based on four

spatiotemporal points. They also found that even

rather highly aggregated datasets provide little

anonymity.

2.2 Problem Formalization

For scientific studies, our research center maintains

a data warehouse that contains various datasets from

several organizations involved in the Dutch justice

system. These organizations include the Police, the

Public Prosecution Office, the courts, the Central

Fine Collection Agency, the agency of correctional

institutions (i.e., prisons) and the Probation Service.

In some projects the data of more than one

organization can be used to measure the

performance of the Dutch justice chain by

combining the necessary datasets from these

organizations. For combining these datasets, a case

number is used to uniquely identify a judicial case

across all these organizations. Although our data

analysts have access to all datasets, they may not

combine all datasets due to privacy and other

reasons (for instance, as the number and contents of

the datasets are growing over time, one may

combine old and new data under certain conditions).

Inspired by (Agrawal et al., 2002), we focus on the (privacy) policy violation issues that arise when linking/integrating datasets in relational databases during their usage time. Assume a data analyst, who works for project A, obtains access to dataset A at time $t_A$. At a later time $t_B > t_A$ the data analyst, who now also works for project B, gets access to dataset B. We shall, therefore, also denote these datasets by $D_A^{t_A}$ and $D_B^{t_B}$, respectively.

Datasets A and B contain data tuples represented by $a_n = (a_{n,1}, a_{n,2}, \cdots, a_{n,M_A})$ and $b_m = (b_{m,1}, b_{m,2}, \cdots, b_{m,M_B})$, respectively, where $n: 1, \cdots, N_A$ and $m: 1, \cdots, N_B$. Every tuple $a_n$ is defined over single-valued attributes from set $AT_A = \{at_1^A, at_2^A, \cdots, at_{M_A}^A\}$. In other words, dataset A is a subset of the Cartesian product $\mathrm{dom}(at_1^A) \times \mathrm{dom}(at_2^A) \times \cdots \times \mathrm{dom}(at_{M_A}^A)$, in which $\mathrm{dom}(at_i^A)$ is the set of the values that can be assumed by attribute $at_i^A$, where $i: 1, \cdots, M_A$. A tuple $a_n$ in dataset A is an ordered list of attribute values to which a unique identifier is attached. Without loss of generality, we assume this identifier corresponds to attribute $at_1^A$, which is drawn from $\mathrm{dom}(at_1^A)$.

Similarly, every tuple $b_m$ is defined over single-valued attributes from set $AT_B = \{at_1^B, at_2^B, \cdots, at_{M_B}^B\}$ and dataset B is a subset of the Cartesian product $\mathrm{dom}(at_1^B) \times \mathrm{dom}(at_2^B) \times \cdots \times \mathrm{dom}(at_{M_B}^B)$. A tuple $b_m$ in dataset B is an ordered list of attribute values to which a unique identifier is attached. Without loss of generality, we assume this identifier corresponds to attribute $at_1^B$, which is drawn from $\mathrm{dom}(at_1^B)$.

We assume that datasets A and B are partially vertically separated, i.e.,

a) $AT_A \cap AT_B \neq \emptyset$,

b) $AT_A \neq AT_B$; because otherwise datasets A and B become horizontally separated (and thus they provide the very same information about those entities whose tuples appear in both datasets),

c) For some $n$ and $m$ we have $a_{n,1} = b_{m,1}$, where $a_{n,1} \in \mathrm{dom}(at_1^A)$ and $b_{m,1} \in \mathrm{dom}(at_1^B)$. I.e., data tuples $a_n$ and $b_m$ refer to the same entity.
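The data model above can be illustrated with a minimal sketch. It assumes that "partially vertically separated" means that the two attribute sets overlap without being identical and that at least one identifier value occurs in both datasets. The dictionary-based representation, the helper names and the identifier attribute `case_id` (echoing the case number mentioned earlier) are hypothetical choices made only for this illustration.

```python
from typing import Dict, List, Set

Row = Dict[str, object]  # one tuple: attribute name -> value


def attribute_set(dataset: List[Row]) -> Set[str]:
    """Collect the attribute names over which a dataset is defined."""
    names: Set[str] = set()
    for row in dataset:
        names.update(row.keys())
    return names


def partially_vertically_separated(a: List[Row], b: List[Row]) -> bool:
    """True if the attribute sets overlap but are not identical (assumed reading)."""
    at_a, at_b = attribute_set(a), attribute_set(b)
    return bool(at_a & at_b) and at_a != at_b


def shared_entities(a: List[Row], b: List[Row], key: str = "case_id") -> Set[object]:
    """Identifier values that occur in both datasets, i.e. the same entities."""
    return {row[key] for row in a} & {row[key] for row in b}


# Toy data; attribute names and values are made up for illustration.
dataset_a = [{"case_id": 1, "offence": "theft"}, {"case_id": 2, "offence": "fraud"}]
dataset_b = [{"case_id": 2, "fine_eur": 250}, {"case_id": 3, "fine_eur": 90}]

print(partially_vertically_separated(dataset_a, dataset_b))
print(shared_entities(dataset_a, dataset_b))
```

In this toy case only the entity with identifier 2 appears in both datasets, so an inner join on `case_id` would produce a single combined tuple.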

From time $t_B$ on, the data analyst has access to both datasets A and B for two different purposes: the execution of project A and the execution of project B. It is

foreseeable that combining/joining datasets A and B

may not be allowed from the viewpoint of project A,

project B or both. This lack of permission for joining

datasets A and B can be due to, for example, privacy

or information sensitivity reasons.

We define an inner join action $ja$ as a tuple $\langle D_A^{t_A}, D_B^{t_B}, P_A^{t}, P_B^{t}, c, t \rangle$, where

‐ Parameter $c$ represents the criterion, condition or predicate for the inner join action,

‐ $D_A^{t_A}$ and $D_B^{t_B}$ are datasets A and B, obtained by the data analyst at times $t_A$ and $t_B$, respectively,

‐ Parameter $t$, being $t \geq t_B > t_A$, represents the time of executing the inner join operation,

‐ $P_A^{t}$ and $P_B^{t}$ are the (privacy) policies associated with datasets A and B at time $t$. These policies, which are obtained by the data analyst at times $t_A$ and $t_B$, respectively, can be adapted during the lifecycle of the corresponding datasets.

In order to allow the inner join action $ja$ to be carried out at $t$, two requirements should be satisfied, namely:

‐ The resulting dataset should not violate the (privacy) policy of project A. This is denoted by requirement $R_A^{t}(D_B^{t_B}, D_A^{t_A}, P_A^{t})$, where the notation should be read as: Requirement for project A in regard to dataset $D_B^{t_B}$ to be considered for the join operation at time $t$, given project A’s own dataset $D_A^{t_A}$ and own policy $P_A^{t}$, which are acquired at times $t_A$ and $t$, respectively.

‐ The resulting dataset should not violate the (privacy) policy of project B. This is denoted by requirement $R_B^{t}(D_A^{t_A}, D_B^{t_B}, P_B^{t})$.
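The join action and its two requirements can be sketched as a small data structure with a check that both projects' policies are satisfied. This is only an illustration under stated assumptions: a policy is modelled as a predicate over the other project's dataset and the project's own dataset, and the class name `InnerJoinAction`, the helpers `forbid_attribute` and `allow_all`, and the attribute names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Dataset = List[Row]
# Assumed policy shape: predicate over (other project's dataset, own dataset).
Policy = Callable[[Dataset, Dataset], bool]


@dataclass
class InnerJoinAction:
    dataset_a: Dataset  # dataset A, obtained at time t_A
    dataset_b: Dataset  # dataset B, obtained at time t_B
    policy_a: Policy    # policy of project A in force at join time t
    policy_b: Policy    # policy of project B in force at join time t
    criterion: str      # join attribute (a simplified join predicate)
    t: int              # time of executing the inner join


def join_allowed(action: InnerJoinAction) -> bool:
    """The join may proceed only if both projects' requirements hold."""
    requirement_a = action.policy_a(action.dataset_b, action.dataset_a)
    requirement_b = action.policy_b(action.dataset_a, action.dataset_b)
    return requirement_a and requirement_b


def forbid_attribute(name: str) -> Policy:
    """Hypothetical policy: never combine own data with attribute `name`."""
    def policy(other: Dataset, own: Dataset) -> bool:
        return all(name not in row for row in other)
    return policy


def allow_all(other: Dataset, own: Dataset) -> bool:
    """Hypothetical policy that places no restriction on the join."""
    return True


a = [{"case_id": 2, "offence": "fraud"}]
b = [{"case_id": 2, "fine_eur": 250}]
blocked = InnerJoinAction(a, b, forbid_attribute("fine_eur"), allow_all, "case_id", t=5)
permitted = InnerJoinAction(a, b, forbid_attribute("salary"), allow_all, "case_id", t=5)
print(join_allowed(blocked), join_allowed(permitted))
```

Because the policies are passed in with the action, a policy that is adapted during the dataset lifecycle simply results in a different predicate at the next evaluation.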

The research questions to be addressed in this

contribution are:

‐ How can the UCON model be characterized for

the inner join operation of the datasets?

‐ How can we determine when the inner join is

(dis)allowed?

‐ How is it possible to realize the resulting

restricted join functionality?

3 BACKGROUND

In this section we present the theoretical background

on access and usage control models.

3.1 Access Control

Traditionally, gaining access to a resource (e.g., a

service, document, computer system, and processing

time of a computer) has been realized by a system

functionality called ‘access control’. Access control

can be defined as the ability to permit or deny access

to a particular resource by a particular entity

(Lazouski et al., 2010). The entity that seeks access

to the resource and the resource that is sought by the

entity are referred to as subject and object,

respectively, in access control terminology. The

access to an object can be in a specific mode like

read, write and execute. These are usage


permissions, or so-called rights, that the subject is

allowed to carry out on the object. Note that a

particular right is not predefined and it exists at the

time of the authorization (Lazouski et al., 2010). In

our example, the data analyst is the subject, dataset

A for project B and dataset B for project A are the

objects, and the ‘inner join’ is the right. Figure 1

illustrates a traditional access control model. As

mentioned in (Hilty et al., 2005) the reference

monitor in Figure 1 is a control program that

monitors and prohibits actions.

Figure 1: A traditional access control model.

Traditionally Discretionary Access Control

(DAC), Mandatory Access Control (MAC), and

Role Based Access Control (RBAC) models have

been used. The DAC model may use a set of predicates to represent access constraints/rules. These access rules are often stored in an access control matrix; see the survey paper (Lopez et al.,

2004) and the references therein for the material

presented in the rest of this subsection. For an object

there can be a so-called Access Control List that

specifies which subjects have which

permissions/rights to the object. For a subject, on the

other hand, there can be a Capability List to specify

the access rights of the subject to various objects.
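The relation between the access control matrix, the Access Control List and the Capability List can be sketched as follows. The sparse dictionary representation and all subject, object and right names are hypothetical and serve only to show that both lists are views of the same matrix.

```python
from typing import Dict, Set, Tuple

# Access control matrix stored sparsely: (subject, object) -> set of rights.
matrix: Dict[Tuple[str, str], Set[str]] = {
    ("alice", "report.txt"): {"read", "write"},
    ("bob", "report.txt"): {"read"},
    ("alice", "backup.sh"): {"execute"},
}


def access_control_list(obj: str) -> Dict[str, Set[str]]:
    """Column view of the matrix: which subjects hold which rights on obj."""
    return {s: rights for (s, o), rights in matrix.items() if o == obj}


def capability_list(subject: str) -> Dict[str, Set[str]]:
    """Row view of the matrix: which rights the subject holds on which objects."""
    return {o: rights for (s, o), rights in matrix.items() if s == subject}


def is_permitted(subject: str, obj: str, right: str) -> bool:
    """Grant or reject an access request by looking up the matrix entry."""
    return right in matrix.get((subject, obj), set())


print(access_control_list("report.txt"))
print(capability_list("alice"))
print(is_permitted("bob", "report.txt", "write"))
```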

The MAC model, which originated from the military, focuses on the flow of information within a system. The model assigns security labels to objects and subjects, called Classification Label (to represent the object’s sensitivity) and Clearance Label (to represent the subject’s trustworthiness), respectively. The model grants a subject access

to an object if their labels match from the viewpoints

of their classification (e.g., top secret, confidential,

non secret) and category properties (e.g.,

management level, department level, and project

level). Compared to DAC, MAC requires higher

implementation costs due to the complexity of

planning and management of access rights.
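One common way to interpret "matching" labels is a dominance check, sketched below for read access only. This interpretation (clearance level at least as high as the classification level, and the object's categories contained in the subject's categories) is an assumption in the style of classical multilevel security models, not something the text above prescribes; the level ordering and category names are illustrative.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Assumed total order on the example levels mentioned in the text.
LEVELS = {"non secret": 0, "confidential": 1, "top secret": 2}


@dataclass(frozen=True)
class Label:
    level: str
    categories: FrozenSet[str]


def may_read(clearance: Label, classification: Label) -> bool:
    """Clearance dominates classification: level and categories both suffice."""
    return (
        LEVELS[clearance.level] >= LEVELS[classification.level]
        and classification.categories <= clearance.categories
    )


analyst = Label("confidential", frozenset({"project", "department"}))
project_file = Label("confidential", frozenset({"project"}))
board_minutes = Label("top secret", frozenset({"management"}))

print(may_read(analyst, project_file))
print(may_read(analyst, board_minutes))
```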

Both DAC and MAC models require a considerable overhead of time and effort when a new user is introduced into an organization. The new subject, in such cases, should be related to every resource in the organization. This process is also prone to human error. Therefore, RBAC was proposed, introducing roles as a link between

subjects and objects. In RBAC subjects are

authorized for roles and roles are authorized for

objects to hold certain permissions or rights. Hereby

instead of establishing subject-object associations

one needs to establish two sets of subject-role

associations and role-object associations.
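The two sets of associations in RBAC can be sketched with two small mappings. The role names, subjects and objects below are hypothetical; the point is that adding a new subject only requires a subject-role entry rather than one entry per object.

```python
from typing import Dict, Set, Tuple

# Subject-role associations.
subject_roles: Dict[str, Set[str]] = {
    "alice": {"analyst"},
    "bob": {"clerk"},
}

# Role-object associations: role -> set of (object, right) pairs.
role_permissions: Dict[str, Set[Tuple[str, str]]] = {
    "analyst": {("dataset_a", "read"), ("dataset_b", "read")},
    "clerk": {("dataset_a", "read")},
}


def is_permitted(subject: str, obj: str, right: str) -> bool:
    """A subject holds a right if any of its roles is authorized for it."""
    return any(
        (obj, right) in role_permissions.get(role, set())
        for role in subject_roles.get(subject, set())
    )


# Introducing a new user needs a single subject-role association.
subject_roles["carol"] = {"clerk"}

print(is_permitted("alice", "dataset_b", "read"))
print(is_permitted("carol", "dataset_a", "read"))
print(is_permitted("carol", "dataset_b", "read"))
```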

There are other access control methods in the

literature that we do not mention for brevity of the

presentation. Traditional access control models have

been used successfully in many application domains

for many years. These traditional models are mostly

suitable for closed organizational environments,

where the subjects and objects are well known and

when the sensitivity and trustworthiness of the

objects and subjects are well defined and rather

static. In modern application settings, where for

example social networks, Big Data, and information

(sharing) systems across organizations like in our

case are dealt with, one needs to cope with rather

dynamic environments to authorize (previously

unknown) entities who want to access and use

objects with dynamic sensitivity, within varying

contextual situations, across multiple organizational

domains and boundaries, and with a commitment to

unprecedented conditions.

3.2 Usage Control

To cope with the shortcomings of the traditional

access control models, usage control models (like the UCON model (Sandhu and Park, 2003; Park and Sandhu, 2004; Zhang et al., 2005)) have been introduced. The UCON model extends the

traditional models to include also controlling the

access decisions during the object’s usage interval

(i.e., access decision continuity) and to allow also

adapting the access criteria before, during and after

object usage interval (i.e., attribute mutability). The

continuity of decisions and mutability of attributes in

UCON allow adapting to the changes of subject,

object and environmental attributes before, during or

after the data usage period. For example, the number

of subjects that concurrently may access the object

can change depending on the consumption intensity.
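Decision continuity and attribute mutability can be illustrated with the concurrent-access example. In the sketch below the class name `UsageMonitor`, the mutable attribute `max_concurrent` and the choice to revoke the most recently admitted subjects first are all assumptions made for illustration.

```python
from typing import List


class UsageMonitor:
    """Toy reference monitor that re-evaluates access during usage."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent  # mutable object attribute
        self.active: List[str] = []           # subjects currently using the object

    def try_access(self, subject: str) -> bool:
        """Pre-usage decision: admit the subject if the limit is not reached."""
        if len(self.active) < self.max_concurrent:
            self.active.append(subject)
            return True
        return False

    def recheck(self) -> List[str]:
        """Ongoing decision: revoke the latest sessions exceeding the current limit."""
        revoked: List[str] = []
        while len(self.active) > self.max_concurrent:
            revoked.append(self.active.pop())
        return revoked


monitor = UsageMonitor(max_concurrent=2)
admitted = [monitor.try_access(s) for s in ("alice", "bob", "carol")]
monitor.max_concurrent = 1   # attribute changes while the object is in use
revoked = monitor.recheck()  # access decision is evaluated again
print(admitted, revoked, monitor.active)
```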

As illustrated in Figure 2, the reference monitor in

the UCON model uses three types of decision-

making factors. The first type is called

authorizations. These authorizations include those

predicates that put constraints on subject and object


attributes. The attributes of the subject (e.g., the

name, age, role, and nationality) in UCON are

similar to Capability List in DAC and Clearance

Label in MAC. The attributes of the object (e.g.,

document type, content sensitivity, and data

ownership) in UCON are similar to Access Control

List in DAC and Classification Label in MAC (Park

and Sandhu, 2004).

Figure 2: An illustration of the UCON model, adapted from (Park and Sandhu, 2004) and (Zhang et al., 2005).

The other two types of decision-making factors

in the UCON model are conditions and obligations,

which are not uniquely defined in the literature

(Colombo and Ferrari, 2014). For conditions, the

authors in (Park and Sandhu, 2004) consider the

environmental or system-oriented constraints that

should hold before or during the object’s usage

interval. Example conditions are those related to the time of day, room temperature, and disaster situations. As such, conditions do not depend on the subject and the object (i.e., the data) directly. We

shall elaborate upon obligations (especially in the

context of relational databases) in Section 4.1.

The reference monitor in UCON controls the

access to and usage of the object (e.g., data items) by

the subject. Similarly to (Hilty et al., 2005), we

regard the UCON “reference monitor rather liberally

to describe control programs that can not only

monitor and prohibit actions, but can also trigger

corrective actions such as the application of

penalties” (Hilty et al., 2005).

4 DESIGN AND REALISATION

In this section we adapt the UCON model to the

problem at hand. First in Subsection 4.1 we frame

the data integration scenario as UCON obligations.

Subsequently in Subsection 4.2 we describe the

policy decision-making component of the reference

monitor that determines whether datasets A and B

are joinable at a given time. Finally in Subsection

4.3 we formalize the policy decision-enforcement

component of the reference monitor for the inner

join operation.

4.1 Specifics of Obligations

Obligations are an active area of research currently.

Particularly, the enforcement of those obligations that are concerned with fulfilling some tasks and actions during or after the usage of the object (i.e., data) is an open research issue (Lazouski et al., 2010).

Obligations mandate those actions that someone

should execute before, during or after an object’s

usage interval (Lazouski et al., 2010). For example,

the credit card owner must be informed within 30 days after a credit card has been used, a license agreement

must be signed before data usage, an ad must be

watched for 20 seconds, and the document must be

downloaded just one time. When the actions are

executed appropriately, the subject could access or

could continue to use the object. Note that the entity

that fulfills the obligation, i.e., carries out the

action(s), might be the subject or someone else,

depending on the usage scenario. Similarly, the

entity on which an obligation activity is carried out

might be the object or something else.

In (Colombo and Ferrari, 2014) the authors

consider the enforcement of obligations, which are

derived from privacy policies, on relational database

management operations. They regard obligations as

“the constraints that refer to the (expected) state of

the data [object] stored in the database” at the time

in which the object is accessed or used (i.e., when SQL code is invoked). For example, the bank

account balance must be positive after withdrawing.

In summary, one can regard obligations as the

constraints (a) on the state of the object (i.e., the

data in the database) or (b) on specific actions being

executed by someone. Fulfillment of both constraint

types can be required before, during or after an

object’s usage interval.
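The two kinds of constraints, fulfilled before or after usage, can be sketched with the bank account example. The function `run_with_obligations`, the state keys and the representation of an already executed action as a flag in the state are hypothetical; obligations that must hold during usage are not covered by this sketch.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
Check = Callable[[State], bool]
Operation = Callable[[State], State]


def run_with_obligations(
    state: State, operation: Operation, pre: List[Check], post: List[Check]
) -> State:
    """Run operation on a copy of state, enforcing pre- and post-obligations."""
    if not all(check(state) for check in pre):
        raise PermissionError("pre-obligation not fulfilled")
    new_state = operation(dict(state))  # the original state stays untouched
    if not all(check(new_state) for check in post):
        raise PermissionError("post-obligation violated; result discarded")
    return new_state


def withdraw(amount: float) -> Operation:
    """Build an operation that withdraws amount from the balance."""
    def operation(state: State) -> State:
        state["balance"] -= amount
        return state
    return operation


# Pre-obligation as an executed action (agreement signed, encoded as 1.0);
# post-obligation as a constraint on the resulting state (positive balance).
pre_checks: List[Check] = [lambda s: s["agreement_signed"] == 1.0]
post_checks: List[Check] = [lambda s: s["balance"] > 0]

account = {"balance": 100.0, "agreement_signed": 1.0}
after = run_with_obligations(account, withdraw(30.0), pre_checks, post_checks)
print(after["balance"], account["balance"])
```

A withdrawal of the full balance would raise `PermissionError` and leave `account` unchanged, because the operation works on a copy.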

Our usage control on the inner join operation in

this contribution can be categorized as “obligation”

because the authorization of the right (i.e., the inner

join of datasets A and B) is constrained by the

state of the objects (e.g., the datasets A and B for

projects A and B, respectively). Our first


contribution hereto is that we distinguish a new type of obligation where the state of an object (i.e.,

dataset A or dataset B) is determined with respect to

another object. This type of dependency, to the best

of our knowledge, has not been identified so far in

the UCON literature.

In distributed usage control, where information is disseminated in open networks, post-usage obligations are widely applicable (Lazouski et al., 2010). We observe that this is also the case in our centralized usage control, when an operation on a data object (like dataset A in our scenario) is dependent on another (upcoming) data object (like dataset B in our scenario). More specifically, from requirements $R_A^{t}(D_B^{t_B}, D_A^{t_A}, P_A^{t})$ for project A and $R_B^{t}(D_A^{t_A}, D_B^{t_B}, P_B^{t})$ for project B, where $t \geq t_B > t_A$, one can define the data integration obligations for the data analyst as the UCON’s subject. These obligations can be of type:

‐ Pre-obligation for project B when $t = t_B$,

‐ Post-obligation for project A (because $t > t_A$) and for project B when $t > t_B$.

So when $t = t_B$ the constraint on datasets A and B can be of type pre-obligation (for project B) and post-obligation (for project A) simultaneously. This duality is another new insight, to the best of our knowledge, provided in this contribution.
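The timing argument can be made concrete with a short sketch that classifies the obligation type per project from the three time points. It assumes that the join time is not earlier than the time dataset B was obtained, which in turn is later than the time dataset A was obtained; the function name `obligation_types` and the integer timestamps are hypothetical.

```python
from typing import Dict


def obligation_types(t: int, t_a: int, t_b: int) -> Dict[str, str]:
    """Classify the join constraint per project as a pre- or post-obligation.

    Assumes t >= t_b > t_a: dataset A is obtained first, dataset B later,
    and the join is executed no earlier than the moment B is obtained.
    """
    if not (t >= t_b > t_a):
        raise ValueError("expected t >= t_b > t_a")
    return {
        "A": "post",                        # A has been in use since t_a < t
        "B": "pre" if t == t_b else "post"  # B is only just obtained when t == t_b
    }


print(obligation_types(t=5, t_a=1, t_b=5))  # join at the moment B is obtained
print(obligation_types(t=8, t_a=1, t_b=5))  # join after both are in use
```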

4.2 Decision Making

The reference monitor should decide on whether two datasets A and B can be joined or not based on requirements $R_A^{t}(D_B^{t_B}, D_A^{t_A}, P_A^{t})$ and $R_B^{t}(D_A^{t_A}, D_B^{t_B}, P_B^{t})$ that, in turn, depend on the momentary policies of project A and B (i.e., $P_A^{t}$ and $P_B^{t}$) as well as on the datasets of projects A and B (i.e., $D_A^{t_A}$ and $D_B^{t_B}$). Joining two datasets may extend the attribute sets $AT_A$ and $AT_B$ to set $AT_A \cup AT_B = \{at_1^A, \cdots, at_{M_A}^A\} \cup \{at_1^B, \cdots, at_{M_B}^B\}$.

To decide on the join of two datasets, one could check whether the resulting combination of attributes is allowed. A domain expert can verify this based on existing laws, regulations, and policies. Alternatively, similarly to (Byun and Li, 2008) and based on the purposes for which datasets A and B are collected, the reference monitor can check whether the privacy policies of project A and project B allow their data objects to be part of the resulting table. This can be done by checking for any inconsistency between the data collection and data usage purposes. For example, if datasets A and B are collected for commercial and system administration purposes, respectively, then the join should not proceed, assuming that commercial and administrative purposes are disjoint/inconsistent.
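The purpose-based check described above can be sketched as follows. This is a minimal illustration only: the function names, the purpose labels other than "commercial" and "system administration", and the table of consistent purpose pairs are assumptions made for this example and are not part of the proposed mechanism.

```python
def purposes_consistent(purpose_a, purpose_b, consistent_pairs):
    """Return True if two collection purposes are regarded as consistent."""
    if purpose_a == purpose_b:
        return True
    # Order does not matter, so each pair is stored as a frozenset.
    return frozenset((purpose_a, purpose_b)) in consistent_pairs


# Hypothetical policy: distinct purposes that may still be combined.
CONSISTENT_PAIRS = {frozenset(("research", "statistics"))}


def join_allowed(purpose_a, purpose_b):
    """Allow the join only when the collection purposes are consistent."""
    return purposes_consistent(purpose_a, purpose_b, CONSISTENT_PAIRS)
```

With this hypothetical policy, a join of a dataset collected for commercial purposes with one collected for system administration purposes is refused, in line with the example in the text.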

Another way to decide on allowing the join operation is to check whether the join operation would cause undesired information leakage. To explain this aspect, let us assume that every attribute in the attribute sets of datasets A and B, and thus in their union, can be represented by a random variable, i.e., each attribute has a corresponding random variable. (NB: In the rest of this section we abuse notation and let the two attribute sets and their union represent sets of attributes as well as the corresponding attribute random variables.)

Further, let random variable set ⊂

∪

the set of those random variable attributes after the

join operation that are (privacy) sensitive. In our

setting,  includes at least one member, i.e., we have







∋ (being the same as 





∋, as we

assumed). Let random variable set 

∪

∖

be the set of those (privacy) non-sensitive random

variable attributes after the join operation. Thus, sets

 and  encompass those attributes that cannot be

and can be, respectively, revealed to the data analyst

in our scenario according to requirements 



and





In order to determine the information leakage in the dataset resulting from the inner join operation as defined earlier, one may use the mutual information function (Sankar et al., 2013); (Wang et al., 2014). The amount of information leaked about the sensitive random variables due to the non-sensitive random variables can be determined by the mutual information between the two sets, which should ideally be zero. If this value reaches an unacceptably high level due to the join operation, then the join should be disallowed. One can also consider the information leakage for a subset of these random variables and examine whether the corresponding mutual information reaches an unacceptably high level due to the join operation. The thresholds for the unacceptably high level can be determined from the policies of projects A and B at or up to runtime.

4.3 Decision Enforcement

We defined the inner join action 



〈











,









,









,









,,

〉

. Datasets A and B

are represented by their tuples as 



















:1,⋯,





and 













:1,⋯,



. The

inner join action 



can further be specified as those

ICISSP 2017 - 3rd International Conference on Information Systems Security and Privacy


members of the Cartesian product of sets 









and











, i.e.,







:1,⋯,









:1,⋯,



, for

which

a) The predicate  holds for those tuples that



,



,

, and

b) Requirements 



and 



hold.

In other words,





≜







:1,⋯,



;:1,⋯,



;

,

,

where 

,

is a Boolean function defined as the

conjunction of the following operands:

‐ O

: 



,



  ∧

,



,

,

‐ O

: 













,









,









,

‐ O

: 













,









,









 .

In practice, due to for example privacy requirements,

the (personal) identifiers 

,

and 

,

in datasets A

and B (i.e., for all andj) are anonymized. In this

section we assume they are pseudonymized by Hash

functions 



∙ and 



∙, respectively. In this case,

operand O

can be written as:

: 



,



  ∧





,

≡





,

,

where 





,

≡





,

holdsif

,



,

. Note

that

a) The operator ≡ represents the fact that there is a

one-to-one mapping possible between 





,



and 





,

, and it does not imply equality

necessarily; and

b) The probability that 

,



,

if 





,

≡







,

 is (extremely) negligible for collision-

resistant hash functions practically.

In Subsection 5.1 we shall present a technique to

realize operator ≡ in DBMSs.
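The pseudonymization by hash functions and the operator ≡ can be illustrated with the sketch below. It assumes salted SHA-256 as the hash function for each dataset and an in-memory dictionary as the mapping that realizes ≡; the salts and the helper names (`make_hash`, `build_repository`, `equivalent`) are hypothetical and chosen only for this example.

```python
import hashlib


def make_hash(salt):
    """Return a pseudonymization function keyed by a dataset-specific salt."""
    def h(identifier):
        data = (salt + ":" + str(identifier)).encode("utf-8")
        return hashlib.sha256(data).hexdigest()
    return h


h_a = make_hash("salt-A")  # hypothetical salt for dataset A
h_b = make_hash("salt-B")  # hypothetical salt for dataset B


def build_repository(ids_a, ids_b):
    """Map every pseudonym back to its source identifier (kept secret)."""
    repo = {h_a(i): i for i in ids_a}
    repo.update({h_b(j): j for j in ids_b})
    return repo


def equivalent(repo, pseudo_a, pseudo_b):
    """Operator ≡: both pseudonyms map to the same source identifier."""
    return (pseudo_a in repo and pseudo_b in repo
            and repo[pseudo_a] == repo[pseudo_b])
```

Note that the two pseudonyms of the same entity differ as strings, so ≡ expresses the existence of a mapping between them rather than equality, as stated in Note (a).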

5 EVALUATION

In this section we present an example realization of

the proposed mechanism and subsequently discuss

its characteristics and limitations.

5.1 Example Realization

To illustrate how to realize the proposed mechanism

we explain an example with 3 tables A, B, and C,

each containing two attributes: an identification

number ID and a data attribute. The data analyst has

access to all three tables. We assume that the analyst

is authorized to join tables A and B, but (s)he is not

authorized to join table A and C (for the generic

scenario see Section 2.2). To illustrate the idea we

use these simple tables without loss of generality

(i.e., instead of a single table one can use a set of

tables just like the case of our data warehouse).

Table 1: Table A.

ID  ATTR
1   Source value A1
2   Source value A2
3   Source value A3

Table 2: Table B.

ID  ATTR
1   Source value B1
3   Source value B3

Table 3: Table C.

ID  ATTR
2   Source value C2
3   Source value C3

Using the unique identifiers, i.e., attribute ID, as

the primary key the data analyst can now easily

combine the data from tables A and C:

Select * from a join c on c.id = a.id

As the first step of our implementation, we replace the original unique identifiers, which are unique per criminal case, by a new set of global identifiers, which are unique globally (i.e., across these tables). Alternatively, one could use hash functions, as suggested in Note (b) in Subsection 4.3. Consequently, tuples from different tables that correspond to the same case/entity no longer have the same identifier in the new dataset. The mapping between the new identifiers of an entity, via the old identifier of the entity (see Table 7), is one-to-one and is stored safely in a separate repository, called the identifier repository. The identifier repository realizes the operator ≡ in practice (see Note (a) in Subsection 4.3).

In the proposed approach, joins are made through the identifier repository, and a usage rights table is used to check whether the tables may be combined. The tables with the new identifiers, the identifier repository, and the usage rights table are shown in Tables 4 to 8 below.

Table 4: Table DWH_A.

ID  ATTR
1   Source value A1
2   Source value A2
3   Source value A3

Table 5: Table DWH_B.

ID  ATTR
4   Source value B1
5   Source value B3

Table 6: Table DWH_C.

ID  ATTR
6   Source value C2
7   Source value C3

Table 7: Table ID_REP (identifier repository).

ID_SRC  ID_DWH  SRC_DATASET
1       N1      A
2       N2      A
3       N3      A
1       N4      B
3       N5      B
2       N6      C
3       N7      C

Table 8: Table USAGE_RIGHT (the authorization policy).

DATASET1  DATASET2
A         B
B         A

Now direct joining like in the previous SQL

query will return an empty result set, since the

identifiers no longer match. In the next query the

join is performed through the identifier repository,

and includes a check on the usage rights:

-- Join DWH_A and DWH_B through the identifier repository
select dwh_a.*, dwh_b.*
from dwh_a
join id_rep rep1 on rep1.id_dwh = dwh_a.id
join id_rep rep2 on rep2.id_src = rep1.id_src
join dwh_b on dwh_b.id = rep2.id_dwh
-- Keep only rows whose dataset pair is listed in the usage rights
join usage_right on
  (usage_right.dataset1 = rep1.src_dataset and
   usage_right.dataset2 = rep2.src_dataset)

For tables DWH_A and DWH_B this will result in a

dataset with the combined data. When a similar

query is run for table DWH_A and DWH_C there

will be an empty result set because the join is not

allowed according to the USAGE_RIGHT table.
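The behaviour described above can be reproduced with the following sketch, which loads the example tables into an in-memory SQLite database and runs the join through the identifier repository. It assumes that the ID_DWH values N1 to N7 of Table 7 correspond to the identifiers 1 to 7 used in Tables 4 to 6; the choice of SQLite and the Python wrapper are assumptions of this example and not the DBMS used in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dwh_a (id INTEGER PRIMARY KEY, attr TEXT);
CREATE TABLE dwh_b (id INTEGER PRIMARY KEY, attr TEXT);
CREATE TABLE dwh_c (id INTEGER PRIMARY KEY, attr TEXT);
CREATE TABLE id_rep (id_src INTEGER, id_dwh INTEGER, src_dataset TEXT);
CREATE TABLE usage_right (dataset1 TEXT, dataset2 TEXT);
INSERT INTO dwh_a VALUES (1, 'Source value A1'), (2, 'Source value A2'),
                         (3, 'Source value A3');
INSERT INTO dwh_b VALUES (4, 'Source value B1'), (5, 'Source value B3');
INSERT INTO dwh_c VALUES (6, 'Source value C2'), (7, 'Source value C3');
-- Assumption: N1..N7 in Table 7 are the identifiers 1..7 of Tables 4-6.
INSERT INTO id_rep VALUES (1, 1, 'A'), (2, 2, 'A'), (3, 3, 'A'),
                          (1, 4, 'B'), (3, 5, 'B'),
                          (2, 6, 'C'), (3, 7, 'C');
INSERT INTO usage_right VALUES ('A', 'B'), ('B', 'A');
""")

# Same join structure as the query in the text; only the right-hand table
# varies. The table name comes from fixed strings below, not from user input.
QUERY = """
SELECT l.attr, r.attr
FROM dwh_a AS l
JOIN id_rep AS rep1 ON rep1.id_dwh = l.id
JOIN id_rep AS rep2 ON rep2.id_src = rep1.id_src
JOIN {right} AS r ON r.id = rep2.id_dwh
JOIN usage_right AS u
  ON u.dataset1 = rep1.src_dataset AND u.dataset2 = rep2.src_dataset
ORDER BY l.id
"""

rows_ab = conn.execute(QUERY.format(right="dwh_b")).fetchall()
rows_ac = conn.execute(QUERY.format(right="dwh_c")).fetchall()
```

Here `rows_ab` holds the two combined tuples for source identifiers 1 and 3, while `rows_ac` is empty because the pair (A, C) does not occur in USAGE_RIGHT.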

The final step in our implementation is to set

access control to the tables in the example. The data

analyst is not allowed to see the identifier repository

or to change the usage rights. Joins are carried out

through a stored procedure, which has access to the

identifier repository. It generates a table with the

join of the two given tables if allowed by the

USAGE_RIGHT table. The result does not contain

the old or new identifiers from the identifier

repository, but a new independent set of identifiers is

generated. In this way the identifiers in the new

dataset cannot easily and readily be tracked back to

the original datasets, nor will it be possible to easily

link the identifiers of the joined dataset to other

(possible future) datasets.
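A minimal sketch of such a stored procedure is given below as a Python function over an in-memory SQLite database. The function name `controlled_join`, the table whitelist, the use of random hexadecimal tokens as the new independent identifiers, and the choice to raise an error for an unauthorized join (instead of returning an empty result) are assumptions of this example. As before, the repository identifiers N1 to N7 are assumed to be the integers 1 to 7.

```python
import secrets
import sqlite3

# Whitelist mapping dataset names to table names (avoids SQL injection).
TABLES = {"A": "dwh_a", "B": "dwh_b", "C": "dwh_c"}


def setup():
    """Create the example warehouse, identifier repository and policy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dwh_a (id INTEGER, attr TEXT);
    CREATE TABLE dwh_b (id INTEGER, attr TEXT);
    CREATE TABLE dwh_c (id INTEGER, attr TEXT);
    CREATE TABLE id_rep (id_src INTEGER, id_dwh INTEGER, src_dataset TEXT);
    CREATE TABLE usage_right (dataset1 TEXT, dataset2 TEXT);
    INSERT INTO dwh_a VALUES (1, 'Source value A1'), (2, 'Source value A2'),
                             (3, 'Source value A3');
    INSERT INTO dwh_b VALUES (4, 'Source value B1'), (5, 'Source value B3');
    INSERT INTO dwh_c VALUES (6, 'Source value C2'), (7, 'Source value C3');
    INSERT INTO id_rep VALUES (1, 1, 'A'), (2, 2, 'A'), (3, 3, 'A'),
                              (1, 4, 'B'), (3, 5, 'B'),
                              (2, 6, 'C'), (3, 7, 'C');
    INSERT INTO usage_right VALUES ('A', 'B'), ('B', 'A');
    """)
    return conn


def controlled_join(conn, left, right):
    """Join two datasets via the identifier repository if the policy allows."""
    allowed = conn.execute(
        "SELECT 1 FROM usage_right WHERE dataset1 = ? AND dataset2 = ?",
        (left, right),
    ).fetchone()
    if allowed is None:
        raise PermissionError(f"join of {left} and {right} is not authorized")
    rows = conn.execute(
        f"""
        SELECT l.attr, r.attr
        FROM {TABLES[left]} AS l
        JOIN id_rep AS rep1
          ON rep1.id_dwh = l.id AND rep1.src_dataset = ?
        JOIN id_rep AS rep2
          ON rep2.id_src = rep1.id_src AND rep2.src_dataset = ?
        JOIN {TABLES[right]} AS r ON r.id = rep2.id_dwh
        """,
        (left, right),
    ).fetchall()
    # Fresh random identifiers, unrelated to source or warehouse identifiers.
    return [(secrets.token_hex(8), a, b) for a, b in rows]
```

Only the attribute values and the freshly generated identifiers are returned, so neither the old nor the new identifiers from the identifier repository reach the data analyst.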

5.2 Discussion and Limitations

As described before the database consists of

different datasets. Within a dataset the tables can be

joined as usual so the performance will be the same

as for standard joins. When tables of different

datasets are joined there is a performance penalty

since the identifiers have to be looked up in the

identifier repository and usage control rights have to

be checked. As we have seen, this adds three extra joins, which is obviously less efficient than a single join. However, modern database engines perform joins very efficiently, and such joins can be optimized further with techniques like indexing.

Furthermore we propose a scenario where the

combination of two datasets is generated once with a

new identifier set. Within the new combined dataset

the joins will have no performance loss. The

generation of the new dataset will have a cost, but in

a data warehouse setting this is usually not a

problem.

The USAGE_RIGHT table is a good location to

hook in extra usage control decision factors like

authorizations, conditions or obligations. For

example only tuples for adults may be combined, the

datasets may only be joined during a fixed period, or

the requester has to sign a privacy statement. For

example:

Table 9: Extended table USAGE_RIGHT.

DATASET1  DATASET2  AUTH    COND                 OBLIG
A         D         D.age>  System.date<15 July  Agree on
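How such an extended usage right could be evaluated is sketched below. All concrete values are hypothetical: the adult age threshold of 18, the year of the cut-off date, the obligation label `privacy_statement_signed`, and the function names are assumptions made for this example and are not taken from Table 9.

```python
from datetime import date

ADULT_AGE = 18  # hypothetical threshold for "adults"

# Hypothetical policy entry for joining datasets A and D.
POLICY = {
    ("A", "D"): {
        "auth": lambda row: row["age"] >= ADULT_AGE,      # per-tuple check
        "cond": lambda today: today < date(2017, 7, 15),  # fixed period
        "oblig": "privacy_statement_signed",              # requester duty
    }
}


def may_join(left, right, today, fulfilled_obligations):
    """Check the condition and the obligation for a pair of datasets."""
    rule = POLICY.get((left, right))
    if rule is None:
        return False
    return rule["cond"](today) and rule["oblig"] in fulfilled_obligations


def filter_rows(left, right, rows):
    """Keep only the tuples that satisfy the authorization predicate."""
    rule = POLICY[(left, right)]
    return [row for row in rows if rule["auth"](row)]
```

In this sketch the condition and the obligation decide whether the join may take place at all, whereas the authorization predicate restricts which tuples may be combined.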

We created a new identifier (i.e., a pseudo ID) for the tuples in the dataset resulting from the join operation. In practice, however, a data analyst with ill intentions can infer which identifiers in table A (see Table 4) and in table B (see Table 5) belong to records that also appear in the resulting table of the join operation. The data analyst can infer the mappings in the identifier repository for these records by matching the values of the other attributes in table A, table B, and the table resulting from their join, using techniques described in (Narayanan and Shmatikov, 2008); (Choenni et al., 2010). This can be seen as an attack on the pseudonymization part of the proposed approach. As mentioned in Subsection 5.1, our proposed approach attempts to impede those attackers (i.e., data analysts, the employees with questionable ethics as mentioned in (Agrawal et al., 2002)) who want to directly trace the identifiers in the new dataset back to the original datasets.

6 RELATED WORK

This section provides a review of some related

works on obligations and on controlling the join

operation in relational databases.

6.1 Obligations

Obligations are considered an important means of realizing privacy- and security-aware systems. Nevertheless, as mentioned before, there is still no consensus on the precise meaning of the term obligation (Colombo and Ferrari, 2014).

One misalignment in the literature relates to the

concepts of pre-obligation, on-going obligation and

post-obligation. In the UCON-ABC model of (Park

and Sandhu, 2004) pre-obligations and on-going

obligations are recognized. The concept of post-

obligation is added to the UCON model in (Katt et

al., 2008). Note that, not within the context of the

UCON model, others (e.g. (Hilty et al., 2005);

(Gama et al., 2006) and (Bettini et al., 2003)) had

already considered obligations as requirements that

must be fulfilled after data access has been done. In

(Ni et al., 2008) the authors introduce pre and post

obligations to the Role Based Access Control (P-

RBAC). In (Bettini et al., 2003) pre-obligations are

characterized as provisions.

In (Hilty et al., 2005) obligations are further

classified in two dimensions of being

(un)observational and being temporally-

(un)bounded. The observational aspect characterizes

whether the reference monitor can observe the

fulfillment of the obligation or not. The temporal boundedness characterizes whether obligations should be fulfilled within a certain time period or not (i.e., must be checked forever). These criteria define four obligation types:

‐ Bounded future and observable (e.g., pay a fee

within a fixed number of days, data item may not

be accessed for x days, the reference monitor

must notify the data owner about the access

within x days),

‐ Bounded future and non-observable (e.g., data

item must be deleted within x days, data item

must not be redistributed in the next x days),

‐ Unbounded future and observable (e.g., re-access

the data at least every x days to maintain

freshness of data as demanded by some data

protection regulations), and

‐ Unbounded future and non-observable (e.g., data

should be used only for statistical analysis, data

should not be distributed further, each usage of

the data must be reported immediately, or must

be protected with protection level L until it is

declassified by the owner).

The obligation for the join operation is temporally unbounded, i.e., it holds for as long as there is a possibility of joining any pair of vertically separated datasets (e.g., in our case datasets A and B). The obligation for the join operation is also unobservable (i.e., in project A one cannot foresee that project B is going to link dataset A with its dataset B and vice versa). By introducing the reference monitor we ensure that the join operation is observable to the central reference monitor and, eventually, that those non-observable data protection requirements are adhered to. This strategy is also mentioned in (Hilty et al., 2005), whereby a non-observable obligation is enforced by transforming it into a set of provisions and observable obligations that prevent unwanted executions. One can think of not only this “strict sense of enforcement”, i.e., “the prevention of unwanted executions of a system through system monitoring and denying actions that would violate the policy”, but also additional corrective or “compensating actions (e.g. penalties) in case the execution violates the policy” (Hilty et al., 2005). Unlike in our case, obligations in (Hilty et al., 2005) are those conditions that must be imposed in the future (i.e., the time after an access is authorized), and (Hilty et al., 2005) uses provisions instead of obligations to refer to those conditions that must be imposed by/at the time of an access being authorized. In our case, furthermore, we showed that it is possible for obligations to be of types pre-obligation/on-going-obligation and post-obligation at the same time.


6.2 Relational Databases

As the work presented in this contribution relates to

usage control for relational databases and privacy

protection for the join operation in relational DBMS,

we review some related works on these topics in the

following.

In (Colombo and Ferrari, 2014) the authors

consider enforcing obligations, which are derived

from privacy policies, on relational database

management operations. While Colombo and Ferrari

(2014) consider SQL operations in general, we focus on the inner join operation in particular, and zoom in on its peculiarities from the viewpoints of the parties (i.e., projects) involved in the operation. Similarly to

our work, (Colombo and Ferrari, 2014) considers

obligations as constraints on “the [expected] state of

the data [(i.e., the object)] stored in the database at

the time in which the execution of an action (i.e.,

SQL code) is invoked (like the account balance after

withdrawing must be positive)”. We go one step

further and take into account also the state of each of

the two datasets of the join operation with respect to

the other dataset.

Secure Multi Party Computing (SMPC) methods

aim at computing a function F on vertically or

horizontally distributed datasets for data mining or

data processing purposes, without requiring the raw

datasets to be shared with a central entity (a Trusted

Third Party, TTP) or with the peers. In this way

every party learns only the result of function F and

its own dataset. SMPC methods are applied in

combination with the SQL join operation in multi-

party settings in (Laur et al., 2013) for horizontally

distributed datasets. As mentioned above, the

objective of SMPC is to compute a specific function

F on the joined dataset in a privacy preserving way

(i.e., without sharing the datasets with a TTP or the

peers). For example, the function F in (Laur et al.,

2013) delivers the number of rows in the join table

(for which the join predicate holds). In our setting,

however, the aim is to authorize the inner join

operation or not, regardless of which function the

data analyst intends to apply to the resulting datasets

in the future. As such, our approach acts as a sort of

on-the-fly access control (thus a usage control)

mechanism rather than a privacy preserving data

mining or data processing mechanism.

7 CONCLUSION

To deal with the issue of authorized-access and

unauthorized-use of datasets, there is a need for a

flexible and adaptive framework to decide on and

enforce the data integration policy at runtime. We

motivated this need for the inner join operation in

vertically separated relational datasets where one

cannot predetermine which datasets would be

encountered and integrated with a given dataset.

We characterized the usage control model of the

inner join operation by the obligations of the UCON

model. Here the authorization of the right (i.e., the

inner join of datasets A and B) is constrained with

the state of the object. In this study we distinguished

a new type of obligations where the state of the

object (i.e., dataset A or dataset B) is determined

with respect to another dataset. These obligations

can be of both pre-obligation and post-obligation

types simultaneously, depending on the timing of the join operation with respect to the moments at which datasets A and B become available. This duality is another

new insight provided in this contribution.

We proposed a few methods for deciding whether two datasets A and B can be joined. The decision can be based on whether the resulting combination of attributes is allowed according to domain knowledge, on a comparison of the data collection and data usage purposes of datasets A and B, or on the information leakage about the sensitive attributes due to the join operation. Finally, we proposed a mechanism to enforce the obligations and demonstrated it in an example realization. The reference monitor of

the proposed usage control is realized as a stored

procedure that maps the pseudo identifiers from the

identifier repository to the original identifiers,

checks the usage rights to determine if a join is

allowed, and joins the data if that is the case.

Our scheme uses different pseudo identifiers for

the input and output datasets of the join operation

and relies on a secure lookup table to map among

these pseudo identifiers during the realized join

functionality. This solution creates a first barrier

against the threat of inferring pseudo identifiers.

In search of a more robust and secure solution, future research can be directed towards, for example, adopting and adapting the method of Polymorphic Encryption and Pseudonymisation (Verheul et al., 2016).

REFERENCES

Agrawal, R. et al., 2002. Hippocratic databases.

Proceedings of the 28th international conference on

Very Large Data Bases, 4(1890), pp.143–154.

Bargh, M.S. & Choenni, S., 2013. On preserving privacy

whilst integrating data in connected information


systems. In Proceedings of International Conference

on Cloud Security Management (ICCSM’13).

Guimarães, Portugal.

Bettini, C. et al., 2003. Provisions and Obligations in

Policy Rule Management. Journal of Network and

Systems Management, 11(3), pp.351–372.

Byun, J. & Li, N., 2008. Purpose based access control for

privacy protection in relational database systems. The

VLDB Journal, pp.603–619.

Choenni, S. et al., 2016. Privacy and security in smart data

collection by citizens. In J. R. Gil-Garcia, T. A. Pardo,

& T. Nam, eds. Smarter as the New Urban Agenda.

Springer, pp. 349–366.

Choenni, S., Dijk, J. van & Leeuw, F., 2010. Preserving

privacy whilst integrating data: Applied to criminal

justice. Information Polity, 15(1–2), pp.125–138.

Colombo, P. & Ferrari, E., 2014. Enforcing obligations

within relational database management systems. IEEE

Transactions on Dependable and Secure Computing,

pp.1–14.

Dawes, S.S., 2010a. Information Policy Meta-Principles: Stewardship and Usefulness. In R. H. Sprague Jr., ed. Proceedings of the 43rd Hawaii International Conference on System Sciences (HICSS-43), pp. 1–10.

Dawes, S.S., 2010b. Stewardship and usefulness: Policy

principles for information-based transparency.

Government Information Quarterly, 27(4), pp.377–

383.

Fung, B.C.M. et al., 2010. Privacy-preserving data

publishing. ACM Computing Surveys, 42(4), pp.1–53.

Gama, P., Ribeiro, C. & Ferreira, P., 2006. Heimdhal: A History-based Policy Engine for Grids. In Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

Hilty, M., Basin, D. & Pretschner, A., 2005. On

obligations. Computer Security–ESORICS 2005,

pp.98–117.

Karr, A.F. et al., 2007. Secure, privacy-preserving analysis

of distributed databases. Technometrics, 49(3),

pp.335–345.

Katt, B. et al., 2008. A general obligation model and

continuity: enhanced policy enforcement engine for

usage control. Proceedings of the 13th ACM

symposium on Access control models and technologies

(SACMAT ’08), pp.123–132.

Kosinski, M., Stillwell, D. & Graepel, T., 2013. Private

traits and attributes are predictable from digital records

of human behavior. Proceedings of the National

Academy of Sciences of the United States of America,

110(15), pp.5802–5.

Laur, S., Talviste, R. & Willemson, J., 2013. From

oblivious AES to efficient and secure database join in

the multiparty setting. Lecture Notes in Computer

Science (including subseries Lecture Notes in

Artificial Intelligence and Lecture Notes in

Bioinformatics), 7954 LNCS, pp.84–101.

Lazouski, A., Martinelli, F. & Mori, P., 2010. Usage

control in computer security: A survey. Computer

Science Review, 4(2), pp.81–99.

Lopez, J., Oppliger, R. & Pernul, G., 2004. Authentication

and authorization infrastructures (AAIs): a

comparative survey. Computers & Security, 23(7),

pp.578–590.

de Montjoye, Y.-A. et al., 2013. Unique in the Crowd: The

privacy bounds of human mobility. Scientific reports,

3, p.1376.

Narayanan, A. & Shmatikov, V., 2008. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy (SP’08), pp. 111–125.

Ni, Q., Bertino, E. & Lobo, J., 2008. An obligation model

bridging access control policies and privacy policies.

Proceedings of the 13th ACM symposium on Access

control models and technologies - SACMAT’08, p.133.

Park, J. & Sandhu, R., 2004. The UCON ABC usage

control model. ACM Transactions on Information and

System …, 7(1), pp.128–174.

Sandhu, R. & Park, J., 2003. Usage Control : A Vision for

Next Generation Access Control. , pp.17–31.

Sankar, L., Rajagopalan, S. & Poor, H., 2013. Utility-

Privacy Tradeoff in Databases: An Information-

theoretic Approach. IEEE Transactions on

Information Forensics and Security, pp.1–1.

Verheul, E. et al., 2016. Polymorphic Encryption and

Pseudonymisation for Personalised Healthcare,

Available at:

https://www.semanticscholar.org/paper/Polymorphic-

Encryption-and-Pseudonymisation-for-Verheul-

Jacobs/7dfce578644bc101ae4ffcd0184d2227c6d07809

Wang, W., Ying, L. & Zhang, J., 2014. On the relation between identifiability, differential privacy and mutual-information privacy. In 52nd IEEE Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1086–1092. Available at: http://arxiv.org/abs/1402.3757.

Zhang, X. et al., 2005. Formal model and policy

specification of usage control. ACM Transactions on

Information and System Security, 8(4), pp.351–387.
