HOUDAL: A Data Lake Implemented for Public Housing
Etienne Scholly¹,², Cécile Favre¹, Eric Ferey² and Sabine Loudcher¹
¹Université de Lyon, Lyon 2, UR ERIC, France
²BIAL-X, France
Keywords:
Data Lake, Metadata System, Implementation, Enterprise Use Case.
Abstract:
Like all areas of economic activity, public housing is impacted by the rise of big data. While Business In-
telligence and Data Science analyses are more or less mastered by social landlords, combining them inside a
shared environment is still a challenge. Moreover, processing big data, such as geographical open data that
sometimes exceed the capacity of traditional tools, raises a second issue. To face these problems, we propose
to use a data lake, a system in which data of any type can be stored and from which various analyses can be
performed. In this paper, we present a real use case on public housing that fueled our motivation to introduce
a data lake. We also propose a data lake framework that is versatile enough to meet the challenges induced by
the use case. Finally, we present HOUDAL, an implementation of a data lake based on our framework, which
is operational and used by a social landlord.
1 INTRODUCTION
In the public housing sector, the use of data al-
lows social landlords to improve the management of
dwellings, as well as to better understand tenants.
Many use cases have already been dealt with in this
context (Scholly, 2019). Through Business Intelli-
gence (BI), data are organized within a data ware-
house, and then provide decision-makers (company
managers, management controllers, etc.) with dash-
boards enabling them to manage their activity. BI
analyses consist of “looking back” to manage one’s
business (Watson and Wixom, 2007). Data Science
methods applied to public housing can be used for de-
scriptive analyses, such as identifying groups of ten-
ants or dwellings, but also for predictive analyses,
for example, to predict non-payment or vacancy of
a dwelling. These analyses allow a better understand-
ing of one’s activity by “looking forward” (Mortenson
et al., 2015).
However, during the past ten years, the big data revolution has changed the way economic sectors extract knowledge from data (Gandomi and Haider, 2015; Assunção et al., 2015), and public housing is no
exception. New, often unstructured, data are emerg-
ing and can be exploited. This is particularly the case
with telephone calls (tenant complaints), or social net-
works. Open data are also increasingly in fashion,
especially geographical data, from which it is pos-
sible to obtain information on schools, services, or
the employment rate by geographic sector, for exam-
ple (Gandomi and Haider, 2015; Chen et al., 2014).
All this raises new challenges to be addressed.
The use of Data Science methods on a company’s
data is called Business Analytics (BA) (Chen et al.,
2012). Although the Data Science methods involved have been around for a long time, the democratization of
BA is quite recent, so its definition remains relatively
unclear. Some authors believe that BA is an evolu-
tion of BI and tends to replace it (Mortenson et al.,
2015), while others consider that BI and BA comple-
ment each other and form a whole called Business In-
telligence & Analytics (BI&A) (Chen et al., 2012).
We adopt the latter position. However, instead of using the term Business Intelligence & Analytics, we prefer to introduce the notion of Data Intelligence,
which we define as “carrying out analyses, both sim-
ple and advanced, on all types of data, whether big
data or not”.
Authors have proposed architectures to combine
BI and BA in the general context of big data. (Baars
and Ereth, 2016) propose the concept of analytical
atoms, which they see as small, ready-to-use, au-
tonomous data warehouses. The interest of this notion
lies in the combination of analytical atoms, to form
what they call "virtual data warehouses". (Gröger, 2018) offers a fairly general data analysis platform,
based on the Lambda architecture, in which both
batch and stream data can be processed in their re-
spective layers. We believe these proposals make
sense from the perspective of combining BI and BA.
The authors propose architectures to capture data in
various formats, with the possibility of processing
them through different analyses.
These proposals have similarities with the notion
of data lake, where a data lake is a system in which it
is possible to store data of any type, without a prede-
fined schema, and from which various analyses can be
performed (Sawadogo et al., 2019b). In the absence
of a schema, the use of metadata is essential in order
to exploit the data lake. Without an efficient metadata
system, a data lake becomes unusable, and is called a
data swamp (Miloslavskaya and Tolstoy, 2016). The
notion of data lake is still relatively new and much
work is in progress, both on the theoretical definition
of metadata and on the concrete implementation of a
data lake. In addition, the metadata system greatly
influences the data lake’s framework, which is also a
matter of debate.
In our case, we rely on a real business use case that
highlights the problems inherent to achieving Data In-
telligence without the right tools. A project aiming to
marry BI and BA within the same scope was carried
out, but using traditional tools. A number of limitations were encountered, especially regarding the
storage of various data and the analyses of these data
by different methods. This allows us to show that the
use of a data lake could efficiently address these is-
sues.
In this paper, we propose to use our metadata sys-
tem named MEDAL (Sawadogo et al., 2019b), but re-
worked, in order to best meet the requirements of our
use case on public housing. With this metadata sys-
tem, we also propose a framework for data lakes that
we consider to be versatile and well suited for Data
Intelligence. By using this framework, users can inte-
grate all types of data and run various kinds of anal-
yses without giving the data lake a predefined shape.
Moreover, we introduce HOUDAL (public HOUsing
DAta Lake), a data lake implementation based on our
framework and adapted to our use case. This imple-
mentation allows users to create and manage metadata
in the data lake and to interact with data stored in the
lake.
This paper is organized as follows. Section 2
presents the state of the art on data lakes. Section 3
explains our use case about public housing. Section 4
presents our data lake framework. Section 5 details
HOUDAL, the implementation of our proposal. Fi-
nally, Section 6 concludes the document by present-
ing the perspectives of evolution and future work.
2 STATE OF THE ART
The term data lake was initially proposed by (Dixon, 2010), who defined it as a vast repository of raw and heterogeneous data fed by external data sources, and from which various analytical operations can be performed. After Dixon's proposal, the term data lake
was quickly associated with specialized technologies
for massive data processing such as Apache Hadoop,
with low storage cost and better ability to handle large
volumes of data. However, this view is less and less in
vogue in the literature as most research teams have fo-
cused on Dixon’s definition (Miloslavskaya and Tol-
stoy, 2016; Gandomi and Haider, 2015).
Several other proposals have been made to define
a data lake (Miloslavskaya and Tolstoy, 2016). Two
main characteristics stand out: the variety of data and
the absence of a schema. This indicates that a data
lake can integrate all types of data and that it works
via a schema-on-read approach (i.e. data are stored
without a predefined schema). This second property
contrasts with the schema-on-write approach of data
warehouses, where data are transformed before inser-
tion to match the schema defined upstream.
The most consensual and complete definition is
proposed by (Madera and Laurent, 2016). However,
we amended this definition by discussing some ele-
ments (Sawadogo et al., 2019b). From our perspec-
tive, a data lake is “an evolutionary system (in terms
of scaling) of storage and analysis of data of all types,
in their native format, used mainly by data specialists
(statisticians, data scientists, data analysts) for knowl-
edge extraction. Characteristics of a data lake include:
1) a metadata catalog that ensures data quality; 2) a
data governance policy and tools; 3) openness to all
types of users; 4) integration of all types of data; 5)
logical and physical organization; 6) scalability”.
Besides the definition of the data lake concept, a
lot of work has been carried out on the framework of
a data lake (Maccioni and Torlone, 2017; Hai et al.,
2016). Since the definition of the concept is not yet
totally consensual, and proposals of frameworks have
emerged in parallel to the concept, there are several
visions (Hellerstein et al., 2017; Halevy et al., 2016).
It should be noted that the framework of a data lake is
intimately conditioned by its metadata system.
Among framework proposals, some consider the
data lake only as an economical storage space for
massive data (Miloslavskaya and Tolstoy, 2016). In
this vision, the architectures are essentially based on
technologies such as Apache Hadoop, Azure Blob
Storage, or Amazon S3. We prefer to set aside this
view as it does not match our definition of a data lake.
A first framework, introduced in particular by (In-
mon, 2016), proposes an organization in data ponds.
The idea is to partition data into three ponds according to their format: structured, semi-structured and unstructured. Variations exist, with for example a raw data pond, an archived data pond, or a division of
the unstructured data pond by sub-formats (text data
pond, audio data pond, etc.) (Sawadogo et al., 2019a).
Other frameworks prefer to divide the data lake
into zones (Ravat and Zhao, 2019). The zones re-
fer to the "refinement level" of data: raw data (i.e., data as ingested in the lake) remain in the raw data zone, while reworked data are placed in the processed data zone, and ready-to-use data are in the access data zone. Other authors refer to these zones as bronze, silver, and gold, respectively (Heintz and Lee, 2019).
Finally, other authors, as well as ourselves, propose frameworks based on data semantics (Diamantini et al., 2018; Sawadogo et al., 2019b). Here, the idea is to group data according to their meaning. These proposals introduce the term object, designating a homogeneous set of information. An object can group several versions and representations of the same set of data, to track whether data have been updated or transformed by a user. Moreover, crossing data from different objects generates a new object, with different semantics.
While these frameworks are relevant and address
the issues identified by their authors, they impose a
predefined shape on the data lake (e.g., by organizing
data according to their structure, refinement or seman-
tics). We believe this is contrary to the schema-on-
read approach of data lakes.
Let us consider the following example. A company with many textual documents is interested in extracting structured descriptors (e.g., bags of words or term-frequency vectors) to compute document similarity. To this end, both textual documents and
descriptors are stored in a data lake. Let us distinguish
between the three cases listed above. If the company
organizes its data lake through:
Data structure, and thus creates data ponds within
the lake: textual documents are in the unstructured
data pond, while descriptors are in the structured
data pond;
Data refinement, and thus creates zones within the
lake: textual documents are in the raw data zone,
while descriptors are in another zone (for exam-
ple, process data or access data);
Data semantics, and thus creates data objects within the lake: each textual document and its descriptors belong to the same object.
We observe from this example that, depending on the framework chosen for the data lake, data are placed in different areas inside the lake. As a result, the
lake’s usage will be different according to the frame-
work chosen. We believe the proposals are too spe-
cific and thus we propose a more general framework
that avoids giving a predefined shape to the data lake.
This will be detailed in Section 4.
3 PUBLIC HOUSING USE CASE
The use case, anchored in the business issue of pub-
lic housing, was initially carried out in a Data Intelli-
gence consulting company without using a data lake.
Named Projet Etat Des Lieux (EDL, i.e., "inventory of fixtures project" in English), it aims to study a social landlord's data in order to provide two predictions regarding the departure of tenants.
The first study seeks to predict whether the equipment of a dwelling might be degraded when a tenant leaves, and thus whether an exit visit should be planned. In-
deed, it is frequent that an agent of the landlord trav-
els to carry out the exit visit, but if the dwelling is
in good condition, no compensation will be requested
from the outgoing tenant. If this is the case, the visit
is “useless” and the landlord loses money, since an
agent’s travel is not cost-free. The objective of this
prediction is thus to identify in advance the dwellings
that present a high probability of having degraded
equipment for which the landlord could ask the outgo-
ing tenant for an indemnity, thus making the agent’s
visit useful and profitable.
A second, similar study was conducted to predict whether repairs to the dwelling will be required at the tenant's departure before it can be rented again. The objective of this prediction is to reduce the dwelling's vacancy period by anticipating the works and ordering them before the tenant's departure. Indeed, if the need for works is only observed when the tenant leaves, the dwelling cannot be re-rented immediately: the time needed to order the works and to carry them out must be added. If the works are ordered in advance, they can begin as soon as the tenant leaves, which reduces the vacancy period of the dwelling and thus the landlord's losses.
In parallel to these predictive analyses, data are
inserted into a reporting tool, in which a small data
warehouse is modeled through a star model, in or-
der to conduct BI analyses. Once the forecasting is
done, the results are also inserted into the dashboard, combined with the star model, and returned to the landlord. With this dashboard, the landlord can
easily use the results of the two predictions and thus
better visualize the dwellings presenting risks in terms
of damaged equipment or works to be planned when
the tenant leaves.
Since this project did not use a data lake from the
beginning, many problems appeared when trying to
do Data Intelligence. We distinguish four main problems
that cannot be easily handled by traditional systems
such as data warehouses or notebooks.
Multiple Data Formats. Data used to complete
this project are essentially structured data in CSV
format. The social landlord sent these files, which
are extracted from its information system through
queries. Nevertheless, the studies carried out through
this project have generated new data in a wide variety
of formats (e.g., files in .RData and .pkl format for
R and Python analyses ; .png images generated with
predictive models ; PowerBI reports in .pbix ; and
project tracking documents in .ppt and .docx, among
others). Storing data in a variety of formats is always
a challenge. This results in data being scattered in
different environments, which can quickly complicate
things for data specialists. With a data lake, this prob-
lem no longer exists because all data are in one place.
Multiple Types of Analysis. This project involved
several data specialists: data scientists for building
predictive models and BI consultants for designing
the reporting dashboard. These two types of profiles
do not work with the same tools and methods. Having
to integrate the forecasting result into the star model
in the BI reporting tool was a complicated task. Be-
sides, for BA analyses alone, the project was carried
out in several environments and technologies (e.g.,
the data cleansing was done in R, the first predictions
in a Jupyter notebook and the second predictions in
Python). Combined with the fact that data are scattered,
this complicates things even more since analyses are
saved in different places and performed in different
programming languages. By using a data lake, it is
much simpler to track which processes use and gen-
erate which data.
Data Lineage. The social landlord sent data from
queries performed on their internal data warehouse,
so the provided data were sometimes redundant,
and/or not complete. As a result, there were several
versions of a file, with some versions being used by
some scripts and not others. With both data and anal-
ysis scattered around, it is not easy to know what data
are being used, by whom, for what, and how. Since we also have very little information about potential redundancies between data and processing steps, the information is limited to file names or comments in the code.
This would not be a problem when using a data lake,
since data lineage is recorded within metadata.
Skills Transfer. Finally, the three problems men-
tioned above make it difficult, without the help of the
data specialists who actively took part in the project,
to understand the progress of the project, to know
what data have been exploited, by what, for what pur-
pose, and so on. In our case, it took several explana-
tory meetings with the different actors of the project
in order to have a clear and precise vision of what was
done in the project, whether in a detailed way (file by
file) or in a more general way (goals).
If the project had been conducted within a data lake, merely exploring the metadata would have avoided these time-consuming and ultimately unnecessary meetings. The use of a data lake makes it possible to han-
dle these issues, which are difficult to manage with
conventional tools. For these reasons, we consider the use of a data lake relevant for this type of Data Intelligence project.
4 FRAMEWORK FOR DATA LAKES
In this section, we detail our approach to building the data lake, by presenting its metadata system and its framework.
4.1 Metadata System
The metadata system is the keystone of a data lake,
and guarantees its proper functioning. In the ab-
sence of an effective metadata system, the data lake
becomes unusable and is then referred to as a data
swamp. Several proposals have been made for the
metadata system of a data lake. As discussed in Sec-
tion 2, the metadata system greatly influences the
framework of the data lake. Thus, the choice of the
metadata system is crucial, and the use case we are
dealing with will also influence this decision.
As mentioned in Section 3, our use case mostly
concerns structured data, which are then reworked
through various data wrangling operations. How-
ever, reworked data often end up being stored in uncommon formats, such as .RData or .pkl files.
These unconventional formats are considered as un-
structured data by most tools, because the files can
only be exploited by the right tools (in this case, R
for .RData files and Python for .pkl files). We there-
fore consider that an approach based on data structure,
with the implementation of data ponds, is not relevant
in our case, since the variety of data formats may cre-
ate ambiguities.
As far as the zones are concerned, source data
in their original format (.csv) would fill the raw data
zone, and the final dashboard would be in the access
data zone. These two zones would be sparingly filled
compared to the process data zone, which would con-
tain most of the data, with multiple formats. Thus,
we believe that distinguishing data according only to
their degree of refinement is not the best idea in our
use case. Ultimately, this leaves us with the option of
using a framework based on data semantics.
We conducted a study on metadata systems, identifying six key functionalities that the metadata system of a data lake must be able to manage in order to overcome the problems associated with big data (Sawadogo et al., 2019b). This study showed that apart from our metadata system, namely MEDAL, no other system offers all six key functionalities, the most complete systems implementing only five of them. Moreover, MEDAL is a metadata system
based on data semantics, which best suits our con-
straints. Thus, we consider that MEDAL is the most
relevant metadata system to use for our data lake.
MEDAL is based on the concept of object, which
designates a set of homogeneous information, as well
as a metadata typology in three categories: intra-
object, inter-object and global metadata. Intra-object
metadata refer to metadata associated with a given ob-
ject, while inter-object metadata are about relation-
ships between objects, and global metadata add more
context to all the data lake’s metadata (Sawadogo
et al., 2019b).
For our data lake, however, we have reworked
MEDAL. The main change lies in the introduction of three main concepts that generalize some of MEDAL's. The first concept is the data entity, which generalizes the versions and representations of an object in
the sense of MEDAL. A data entity can represent a
spreadsheet file, an image, a database table, and so
on.
The second concept is grouping, which is a set of
groups, and data entities can be gathered into groups.
For example, we can reproduce the object notion of
MEDAL, by creating a grouping “semantic object”,
each group of this grouping being an object in the
sense of MEDAL. Note that it is possible to have several groupings, which makes it possible, among other things, to gather data entities according to their structure or zone if such groupings are needed. Thanks to this, it
becomes possible to reproduce classic frameworks as
presented in Section 2.
Lastly, the third concept is the process, which generalizes MEDAL's notions of transformation, update
and parenthood relationship. It refers to any trans-
formation applied to a set of data entities and that
generates a new set of data entities. With processes,
data lineage can be tracked, allowing the data process-
ing chain to be monitored. A process can represent a
script or a manual operation made by a user.
Let us illustrate this with an example. A CSV file is
stored in the data lake, thus creating a data entity in
the metadata system. This file is then processed by a
data cleansing script: a second file with cleaned data
is generated, and also stored in the data lake. Thus,
a second data entity is also created, as well as a pro-
cess representing the script. The two data entities are
connected by this process. Besides, if a grouping on
zones is created, then the first data entity is in the “raw
data zone” group, while the second one belongs to the
“processed data zone” group.
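This example can also be sketched in code. The following listing is a minimal sketch in Python, with illustrative names and structures that do not reflect the actual goldMEDAL or HOUDAL implementation; it only shows how the three concepts relate to each other in this scenario.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataEntity:            # a file, an image, a database table...
    name: str
    fmt: str
    description: str = ""

@dataclass
class Process:               # a script or manual operation linking entities
    name: str
    inputs: List[DataEntity]
    outputs: List[DataEntity]

@dataclass
class Grouping:              # a set of groups (zones, formats, semantics...)
    name: str
    groups: Dict[str, List[DataEntity]] = field(default_factory=dict)

# The raw CSV file and its cleansed counterpart are two data entities...
raw = DataEntity("tenants.csv", "csv", "raw file stored in the lake")
clean = DataEntity("tenants_clean.csv", "csv", "output of the cleansing script")

# ...connected by a process representing the cleansing script.
cleansing = Process("cleansing script", inputs=[raw], outputs=[clean])

# An optional "zone" grouping places each entity in its refinement level.
zones = Grouping("zone", {"raw data zone": [raw], "processed data zone": [clean]})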
For further reading, a complete presentation of
this evolution of MEDAL, named goldMEDAL, was
the subject of additional work (Scholly et al., 2021).
4.2 Framework
A data lake, in its simplest form, is composed of two
main parts: the data storage area, which can store any
type of data, and the metadata system, which describes the data stored in the lake and helps users navigate inside it. However, to improve the user experience, we believe it is relevant to add two other layers, concerning data ingestion and data analysis, respectively.
As discussed before, a data lake can store any type
of data, in its raw format. But to do so, an efficient
metadata system is mandatory. However, if users have
to enter all metadata manually, even the most efficient
metadata system can become very tedious to use. This
is where the data ingestion layer comes into play: its
goal is to facilitate metadata creation by automatically
generating as much metadata as possible, allowing
users to quickly validate or invalidate metadata and
move on.
Data stored in the lake are not only meant to be
stored, but also analyzed. All types of data analysis
can be considered, but since our final goal is to com-
bine BI and BA within Data Intelligence, we focus
on the presence of BI and BA analyses. Among these
analyses, we are particularly interested in the creation
of “advanced indicators” through BA methods (for
example with machine learning), which then enrich
BI analyses, whether at the data warehouse level or
directly in dashboards. To illustrate the notion of ad-
vanced indicator, our use case is a perfect example:
once predictions are made, they are recorded in the
BI dashboard, and cross-referenced with data in the
star model. This gives us an example of BA analyses
that enrich BI analyses. It is important to note that
since the data lake can contain reworked data, it can
also contain a data warehouse and/or dashboards.
Our proposal of a data lake’s framework combines
data ingestion, storage, analysis, and the metadata
system. Firstly, data are extracted from sources, and
go through the ingestion layer. There, metadata are
created and stored in the metadata system, while data
are saved in the storage area. Then, users can browse
metadata to find useful data, and run either BI or BA
analyses on them thanks to the data analysis layer. In
the end, if an analysis generates new data, it is possi-
ble to store it back into the data lake: in this case, data
go through the ingestion layer, in order to create meta-
data, and so on. This is referred to as “back-feeding”
the data lake.
The schema-on-read property of a data lake im-
plies that the data schema is specified at query time.
That is why data lakes operate in Extract - Load -
Transform (ELT) mode, which is opposed to the Ex-
tract - Transform - Load (ETL) mode typical of data
warehouses, where data are transformed before being
loaded. In a data lake, data are not transformed before being stored, but only at query time.
In our case, to best describe our framework, we
prefer to say that the data lake works in EDLT(L):
Extract: data are extracted from sources;
Describe: to-be-inserted data are described by
metadata;
Load: data are stored in the data lake, without any
modification;
Transform: when queried by the user, data are
modified as required;
(Re)Load: if the user wishes to save the result of
his analysis, he can “back-feed” the lake and save
the transformed data, along with the input of ade-
quate metadata.
By adding the Describe step, we emphasize the im-
portance of metadata, and that it must be generated
before storing the data in the data lake. Saving data in the lake without capturing the associated metadata carries a significant risk of losing track of them and of quickly turning the lake into an unusable swamp. Fur-
thermore, with the (Re)Load step, we insist on the
fact that the data lake must be the central piece of
the whole data analysis chain. Of course, data are
first extracted from sources outside the lake; however,
when the user reworks data from the lake, whether
through BI or BA analyses, the results may have to be
stored in the lake. Thus, for these reworked data, the
framework “restarts” at the Describe step. Figure 1
summarizes this approach in a schematic way.
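To make the chaining of these steps more concrete, the following listing gives a high-level sketch of the EDLT(L) cycle in Python; all function names and bodies are illustrative placeholders, not HOUDAL's actual API.

lake = []   # stands in for both the storage area and the metadata system

def extract(source):
    """Extract: pull data from an external source (file, database export...)."""
    return {"source": source, "content": "..."}

def describe(data):
    """Describe: generate metadata (data entity, groupings, process) before storage."""
    return {"file_name": data["source"], "description": "to be completed by the user"}

def load(data, metadata):
    """(Re)Load: store data untouched in the lake, and metadata in the metadata system."""
    lake.append((data, metadata))

def transform(data):
    """Transform: rework lake data at analysis time (BI or BA)."""
    return {"source": "predictions.csv", "content": "..."}

# Initial ingestion: Extract - Describe - Load.
raw = extract("landlord_export.csv")
load(raw, describe(raw))

# Later analysis: Transform, then the optional (Re)Load of the result,
# which goes through the Describe step again ("back-feeding").
result = transform(raw)
load(result, describe(result))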
To illustrate the framework, we refer to the use
case. First, data sent by the landlord are ingested into
the data lake. Data scientists then process data, and
back-feed the data lake with these reworked data, as
well as the transformation scripts. In the end, the two
predictions are made, and then exploited in a BI dash-
board. With all this, we have a perfect example of
the creation and exploitation of advanced indicators,
as presented before.
Placing the data lake at the center of the knowledge extraction chain provides several advantages. Firstly, we benefit from one of the fundamental properties of the lake, namely the ability to store all types of data in their native format. Indeed, we
store in the lake raw data sent by the landlord, reworked data, predictive models, dashboards, as well as project tracking documents. This centralized storage
of all data also means data are not scattered and envi-
ronments are not multiplied. This was a major issue
for our use case without the data lake, and it is quite a
common problem, whether in BI or BA projects: the
data as well as the processes can be stored locally on a
computer, on the company server, on a cloud, or retrieved directly from a client's virtual machine, and so on.
An additional advantage is that metadata must be
entered directly when data are inserted into the data
lake. Indeed, whether for BI or BA projects, there
are several ways to generate post-project documenta-
tion. For example, in BI, DataGalaxy (https://datagalaxy.com) provides an
ergonomic data catalog to manage data governance.
On the other hand, having to maintain this documen-
tation after the project has been set up is an expen-
sive, error-prone task, in addition to being tedious,
and therefore rarely done properly. With the data lake,
data governance is done upstream: we can define this
as “on-write data governance”, and there is little need
to maintain this information afterwards.
5 HOUDAL DATA LAKE
After explaining in detail the framework of our data
lake, we now present how we implemented HOUDAL
to address the issues of our use case, and to validate
the efficiency of our framework.
5.1 Selection of Technologies
Although the concept of a data lake is still relatively
recent, and there is no consensus yet on the architec-
ture that a data lake should have, there are already
several tools and technologies available to implement
this type of solution. We therefore conducted a study
to identify which tools could be used to meet the re-
quirements of our use case and remain the most faith-
ful to our vision of the data lake and its framework.
Figure 1: Proposed data lake’s framework.
We have identified four possible ways to implement
the data lake.
All-in-One Tool. The first is to use all-in-one tools,
such as Apache Kylo (https://kylo.io), which is defined as “a self-
service data lake management software platform for
data acquisition and preparation with integrated meta-
data management”. Both data storage and analysis are
handled with Kylo, as well as the metadata system.
However, this option was abandoned for three main
reasons. First of all, the metadata system proposed
by Kylo is not exhaustive enough to let us implement
our goldMEDAL-based metadata system. Also, Kylo
is mainly oriented to manage structured data. Al-
though we mainly have source data in .csv format in
our use case, this becomes problematic when dealing
with files of various formats. The last reason con-
cerns data analysis: the data preparation part of the
tool is certainly interesting, especially in our case, but
as far as more advanced analyses (BI and BA) are con-
cerned, we would have had to manage them outside
Kylo. This would have had several unattractive con-
sequences, such as managing metadata outside Kylo.
Proprietary Clouds. The second option consid-
ered is to use proprietary clouds, such as Microsoft’s
Azure
3
or Amazon’s AWS
4
. Both have the main ad-
vantage of offering a wide range of services, allow-
ing one to ingest all types of data, manage metadata,
and conduct all types of data analysis. Users can
stay in the same ecosystem for all their needs, and
no longer have to worry about deploying the proper
server. However, having everything in the cloud, and
services for everything, also turns out to be a disad-
vantage. Firstly, if developments have already been
made, then they have to be migrated to the appro-
priate services within the cloud, which can be very
time-consuming. Besides, we noticed that these pro-
prietary clouds are not flexible: in particular, the ser-
vices for metadata management lacked completeness
on certain aspects, and we would have liked to be
able to make some modifications to them to match
our expectations better, but this turned out to be im-
possible. Finally, proprietary clouds are pay-per-use,
which slowed us down for a first proof of concept.
However, for future work on larger use cases or with
different needs, they might be reconsidered.
Hadoop Ecosystem. A third option explored concerns using an ecosystem of open-source big data technologies, such as Apache Hadoop (https://hadoop.apache.org). Like pro-
prietary clouds, Hadoop comes with its own ecosys-
tem, and a lot of services are available to ingest data,
store data (especially with the Hadoop Distributed
File System, HDFS), manage the metadata system and
run different types of data analysis. In particular, the
metadata system Apache Atlas is very interesting, and
seems to be complete enough for us to implement
our metadata system. Nevertheless, following pre-
liminary tests, this solution was finally not retained.
On the one hand, HDFS is suitable for large files, but
not for small files, which we have in large quantities
in our use case. On the other hand, installing an en-
tire Apache Hadoop ecosystem is a very long, com-
plex process, and requires specific skills, in addition
to the need for hardware resources to run the cluster
properly. Furthermore, these open source technolo-
gies evolve very quickly, and it is necessary to keep a
watchful eye on new developments, but also to keep
the cluster up to date, which takes time and can some-
times be a complicated task. Finally, in our vision
of the data lake, we consider that an Apache ecosys-
tem is welcome when there is a need to manage large
amounts of data, but this was not an issue in our use
case. Integrating an HDFS storage area into the data
lake, as well as some associated services, may how-
ever be a very interesting topic for future work.
From Scratch Development. Finally, the last option studied is to develop our data lake from scratch, in other words to take apparently unrelated tools and technologies and assemble them to form our data lake. Although it is a risky gamble, this is the so-
lution we have chosen, in particular to have as much
flexibility as possible in the implementation of the
metadata system, but also to remain as faithful as pos-
sible to our vision of the data lake. Using already ex-
isting technologies and/or being part of an ecosystem
meant that we ran the risk of seeing these tools evolve
over time, possibly in a direction that might one day
no longer suit our vision of the data lake. In addi-
tion, developing a system from scratch gives us the
possibility to have a more complete control of it and
to be able to make it evolve as we wish, according to
the future use cases we might have to deal with. Fi-
nally, considering the limited volume of data in our
use case, using relatively simple technologies was the
best choice to make.
5.2 Technical Architecture
HOUDAL (public HOUsing DAta Lake), our implementation of the data lake for public housing, is based on a web application. It has two major parts: the front-
end, or client part, which we can refer to as the user
interface, and the back-end, or server part, which is
broken down into various services, namely the API,
the metadata system, the data storage, and the user
management service.
User Interface. HOUDAL is operated through a user
interface. Users can connect to the interface and then
interact with the data lake. They can consult and ex-
plore existing metadata, with a search bar among oth-
ers. Files stored in the lake can be downloaded. In
addition, it is also possible for users to create new
metadata, but also to add new files to the data lake.
For this, a form is available, with which the user up-
loads the file. During this process, metadata are en-
tered semi-automatically : some information, such as
file name, size or format are extracted automatically,
while others have to be entered manually by the user,
such as a description. Once the file upload has been
validated, data are saved in the storage area, and meta-
data are created. From a technical standpoint, the user
interface has been developed in ReactJS.
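As an illustration of this semi-automatic capture, the following listing is a minimal sketch in Python of the metadata that can be derived from the uploaded file itself; property names are illustrative, and the actual extraction is performed by the ReactJS/NodeJS application, not by this code.

from datetime import datetime
from pathlib import Path

def auto_metadata(file_path: str) -> dict:
    """Derive the metadata that can be extracted automatically from the file."""
    p = Path(file_path)
    return {
        "file_name": p.name,                      # extracted automatically
        "format": p.suffix.lstrip(".").lower(),   # extracted automatically
        "size_bytes": p.stat().st_size,           # extracted automatically
        "ingestion_date": datetime.now().isoformat(),
        "description": "",                        # to be entered manually by the user
    }

# Toy file for the example; the user then only completes the description.
Path("tenants.csv").write_text("tenant_id;dwelling_id\n1;42\n")
meta = auto_metadata("tenants.csv")
meta["description"] = "Raw tenant file sent by the landlord"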
API. Through HTTP requests, the user interface communicates with an API. Any action made by the user on the interface generates a request that is sent to the API. The API then queries the applica-
tion’s various services, such as the metadata system,
and returns the results of this query to the interface in
response to the user’s request. This API has been de-
veloped in NodeJS. Moreover, HOUDAL works with an authentication system: the user must log in to access the data lake. Besides, some pages of the inter-
face are reserved for users with an administrator role.
The information about the users is saved in a Mon-
goDB collection. Note that a MongoDB database is
not a necessity, and this may change in the future.
Metadata System. Since our metadata system is
based on goldMEDAL, which has a graph-based rep-
resentation, we store the data lake’s metadata in a
Neo4J (https://neo4j.com) graph database. The database is accessed
only by the API, whether to consult existing metadata
or to create new ones. Data entities are nodes, and
are linked to each other through processes, or con-
nected to groups. Having all the metadata in a sin-
gle database is an advantage because it simplifies API
calls.
Storage Area. In our use case, we only deal with files, so there is no database to manage (although
database data can also be considered as files). This
implies that for the storage area of HOUDAL, we did
not need anything other than a directory of a conven-
tional file system to store our data. Note however, that
this practice, although sufficient in our specific case,
is limited and could be problematic in other use cases.
This is discussed in more detail in the following sub-
section.
A screenshot of HOUDAL's management interface is presented in Figure 2. The interface allows
users to create different types of metadata. To this
end, the different buttons at the top of the screen redi-
rect users to pages containing forms to create meta-
data. Moreover, the interface provides a search bar,
where the user can perform a keyword search in data
entities. On the screenshot, the user has searched for
the keyword “solicitation”. The interface then pro-
vides the user with all data entities containing this
keyword in their file name or description.
Figure 2: HOUDAL's interface.
Note that the screenshot of Figure 2 is only par-
tial and does not show all of the functionalities of the
interface, such as downloading data or viewing the
groups related to a data entity. In the following subsection, we discuss how HOUDAL helps us tackle the issues encountered in our use case.
5.3 HOUDAL Applied to the Use Case
In Section 3, we presented how the use case generates
issues when using traditional BI and BA tools. We
also showed that, from a theoretical standpoint, using
a data lake would be a good solution to tackle these
problems. In order to validate our proposal, we have
tried to replicate in HOUDAL the project carried out
without the use of a data lake. Here we discuss what HOUDAL's metadata system looks like when applied to this use case, and how it assists data specialists.
As presented before, HOUDAL's metadata system
is based on three main concepts. The different data
files that populate the data lake are data entities. They
can be either raw data files sent by landlords (often comma-separated value files) or reworked data, some-
times stored in various formats such as .pkl or .RData,
for Python and R analyses, respectively. In Neo4J,
each data entity has its node labeled :ENTITY and
the entity’s properties, such as file name or descrip-
tion, are stored in the node’s attributes.
Groupings are used for categorizing data entities.
With HOUDAL, users can create as many groupings
as necessary, and several groups for each grouping.
Data entities can be linked to zero, one or several
groups for each grouping. In the physical model,
groupings are modeled by nodes carrying a :GROUPING label. Groups are also nodes, carrying both a :GROUP label and the grouping's name as a second label,
in order to facilitate querying. A data entity node can
be linked to several group nodes. A data entity node
(resp. group node) is linked to a group node (resp.
grouping node) with an edge labeled with the group-
ing's name (resp. :GROUPING). With groups and groupings, users can, for example, indicate whether data are internal or external, their refinement level (zones), and so on.
Processes reflect the modifications that data may undergo, and are used for tracking data lineage. A process can be, for example, a script for transforming or cleaning a data file, i.e., a data entity. In Neo4J, a process is
also modeled by a node, labeled :PROCESS. Data en-
tities can be input of the process (for example, raw
data sent by the landlord), or output (cleansed data).
If a data entity is the input of a process, there is an
edge labeled :PROCESS_IN from the entity node to the process node. Inversely, an edge labeled :PROCESS_OUT from the process node to the entity node
is created if a new data entity is generated by the pro-
cess.
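As an illustration of this graph model, the following listing sketches how the metadata of the earlier cleansing example could be created in Neo4J with the official Python driver; the connection details and property names are illustrative assumptions, and HOUDAL's API layer is actually written in NodeJS.

from neo4j import GraphDatabase

# Illustrative connection parameters for a local Neo4J instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(
        """
        // A raw data entity, the cleansing process and the cleansed entity.
        CREATE (raw:ENTITY {file_name: 'tenants.csv', format: 'csv'})
        CREATE (clean:ENTITY {file_name: 'tenants_clean.csv', format: 'csv'})
        CREATE (p:PROCESS {name: 'cleansing script'})
        CREATE (raw)-[:PROCESS_IN]->(p)
        CREATE (p)-[:PROCESS_OUT]->(clean)
        // A 'zone' grouping, its groups, and their links to the entities.
        CREATE (g:GROUPING {name: 'zone'})
        CREATE (rz:GROUP:zone {name: 'raw data zone'})
        CREATE (pz:GROUP:zone {name: 'processed data zone'})
        CREATE (rz)-[:GROUPING]->(g)
        CREATE (pz)-[:GROUPING]->(g)
        CREATE (raw)-[:zone]->(rz)
        CREATE (clean)-[:zone]->(pz)
        """
    )

driver.close()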
Figure 3 presents a sample of metadata stored in
Neo4J. Data entity nodes are colored in red. On both
sides of the Figure, a data entity node is highlighted:
some of its attributes are depicted at the bottom in
grey.
The left-hand side of the Figure gives an example
of groups. There are three groupings: a zone group-
ing, a format grouping and a granularity grouping.
Each grouping has its group nodes, colored in green,
purple and blue, respectively. Data entity nodes are
connected to group nodes with an edge. For example,
we can see that the highlighted data entity node (on
the left) is a raw .csv file, and the granularity level is
“Tenant”, meaning that each line corresponds to a ten-
ant. Note that in Neo4J, groupings are also modeled
as nodes, but are not represented in this Figure.
An example of a process is depicted on the right side
of Figure 3. The process node is colored in yellow.
We can see that three data entity nodes are the pro-
cess’ input, and three data entity nodes are the pro-
cess’ output, meaning that they are generated by the
process.
5.4 Limitations and Future Work
HOUDAL is functional and is currently in an advanced testing phase. Several social landlords have shown
interest in our work and, although much work re-
mains to be done, discussions are underway to deploy
this application to other social landlords over the long
term. However, while being functional, it is still a
work in progress, and thus has limitations.
The first limitation of HOUDAL is its data stor-
age system. Indeed, considering the use case we have
treated, in which there are only files, having a storage
system other than a directory was not necessary. On
the other hand, in the long term, we would like to be
able to offer users the possibility of carrying out BI
analyses, which often implies the presence of a data
warehouse, and therefore a relational database.
To overcome this problem, we wish to add, at least
initially, a relational database in the storage space of
the data lake. In order to make it easy for the user to
use it, we also wish to propose through the interface a
page allowing the user to enter SQL queries, in order to create and fill their databases and tables.
The second limitation arises at the level of filling
the data lake with source data. Although this process
is currently simple, it remains tedious in the case of
a large number of files, and if the data source is not
a native file, it forces the user to generate a file from
the source. If we take the example of a source being
a relational database, the user must extract data from
the tables as CSV files and then insert them in the
data lake. This adds one more step in the process of
storing data into the lake, and we could argue that it
“denatures” the source in a certain way.
This is why we wish to broaden the range of data sources that can feed the data lake. In particular, we want to be able to automate the
filling of the lake, and not have to go through the in-
terface, which requires human intervention each time.
For this, we want to develop an ETL component (for
example, Pentaho Data Integration), as well as a li-
brary for a scripting language (for example, Python).
With these tools, the user would no longer use the "extract from file/table" and "insert into file/table" functions, but "extract from data lake" and "insert into data lake" functions instead, as sketched below.
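The following listing is a purely hypothetical sketch, in Python, of what such a scripting-language library could look like; the function names are placeholders for an API that does not exist yet.

import pandas as pd

def extract_from_data_lake(entity_name: str) -> pd.DataFrame:
    """Hypothetical: would query the metadata system, locate the corresponding
    file in the storage area and return its content, replacing manual exports."""
    raise NotImplementedError("planned feature, not yet implemented in HOUDAL")

def insert_into_data_lake(df: pd.DataFrame, description: str) -> None:
    """Hypothetical: would store the data, then create the corresponding data
    entity and process metadata automatically (the Describe step)."""
    raise NotImplementedError("planned feature, not yet implemented in HOUDAL")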
The data lake can contain data of all types and
from all sources. In addition, many users with dif-
ferent profiles can connect to the lake to exploit it. On
the other hand, some data may be of a sensitive nature
and should only be accessible by a specific category
of users. For example, in the case of data concern-
ing Social Housing, if a dataset contains details about
the income of tenants or their professional situation, it
must be considered sensitive and therefore only acces-
sible by users who have been validated beforehand. In the use case we dealt with in the data lake, there was no notion of sensitive data.
For these reasons, we wish to work on this aspect.
To this end, we want to introduce the notion of roles
and restrictions in the data lake management applica-
tion. We want the roles to be modular, i.e. to be able
to define the roles for a given project, and to associate
restrictions with these roles. We also want it to be possible to restrict access to metadata according to a profile: consulting only, creating new metadata, modifying it, and so on.
6 CONCLUSION
To achieve Data Intelligence in the context of big data, some challenges must be addressed first. One of our proposals is the use of a data lake, in which all
types of data are stored, and from which various kinds
of analyses can be performed. This is a trendy re-
search topic with several possibilities being explored.
Given the constraints of our real-life use case, we
drew inspiration from our metadata system based on
data semantics, namely MEDAL, to which we made a
few modifications to create goldMEDAL. It is based
on three main concepts: data entity, grouping and pro-
cess.
Figure 3: A sample of HOUDAL's Neo4J metadata.
This metadata system allows us to define a frame-
work that does not impose a predefined shape on the data lake. Conventional data lakes operate in ELT; we
prefer to add two steps to get an EDLT(L) operating
mode. Step D, for Describe, indicates that metadata
are extracted before storing the data in the lake. The
optional step L, for (Re)Load, is for analyses con-
ducted on the lake’s data, which may generate new
reworked data; when this is the case, these data are
back-fed into the lake, and go through the Describe
step again.
In the context of public housing, we have implemented our framework and created HOUDAL. After a
review of the different possibilities to carry out this
development, we chose to develop a web application
from scratch, in order to best fit the requirements of
our use case. HOUDAL is composed of a user inter-
face and of an API that queries the various services
necessary for the proper functioning of the lake, in
particular the data lake storage area and the metadata
system.
HOUDAL is showing satisfactory results, but we
have many areas for improvement to make it more
complete. In particular, we would like to improve
the feeding of the data lake by being able to query
various sources, but also to simplify the data extrac-
tion from the lake, whether through ETL or scripting
languages. Finally, we also want to make HOUDAL
more robust against GDPR-related problems, which we did not have to deal with in our use case, but which are nevertheless very common.
One of HOUDAL's long-term objectives is to allow the study of a social landlord's housing stock, and more precisely the attractiveness of its dwellings.
This is indeed a key issue for social landlords. In
the past, qualitative studies have been carried out with
the help of business experts to obtain strategic infor-
mation on their housing stock and the actions to be taken
for the future. The next step would be to conduct a
data-driven, quantitative study to compare it with the
qualitative opinions drawn up by business experts.
But this objective raises many challenges. The at-
tractiveness of a dwelling is defined on the one hand
according to its characteristics, such as surface area,
number of rooms, or heating type; social landlords
have this information, in what we call “internal data”.
On the other hand, attractiveness is also defined by
the urban environment in which the dwelling is lo-
cated, with for example the presence of public trans-
portation and schools nearby, the employment rate in
the district or the general standard of living in the city.
Here, we need to retrieve this information from the in-
ternet, whether through open data, or social networks:
this is “external data”.
To meet these constraints, we believe that using
HOUDAL as presented in this article can be success-
ful. Indeed, it is a good solution to store both inter-
nal data from social landlords and external data gath-
ered from the Internet, while describing them prop-
erly with the appropriate metadata. Moreover, the
study of housing attractiveness is a good example of a
Data Intelligence project and the creation of advanced
indicators. The external data will be reworked and
then cross-referenced with internal data, with the fi-
nal objective of providing an attractiveness score for
each dwelling.
REFERENCES
Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., and Buyya, R. (2015). Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79:3–15.
Baars, H. and Ereth, J. (2016). From data warehouses to
analytical atoms-the internet of things as a centrifugal
force in business intelligence and analytics. In 24th
European Conference on Information Systems (ECIS),
Istanbul, Turkey, page ResearchPaper3.
Chen, H., Chiang, R. H., and Storey, V. C. (2012). Business
intelligence and analytics: from big data to big impact.
MIS quarterly, pages 1165–1188.
Chen, M., Mao, S., and Liu, Y. (2014). Big data: A survey.
Mobile networks and applications, 19(2):171–209.
Diamantini, C., Giudice, P. L., Musarella, L., Potena, D.,
Storti, E., and Ursino, D. (2018). A New Meta-
data Model to Uniformly Handle Heterogeneous Data
Lake Sources. In European Conference on Advances
in Databases and Information Systems (ADBIS 2018),
Budapest, Hungary, pages 165–177.
Dixon, J. (2010). Pentaho, Hadoop, and Data Lakes. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/.
Gandomi, A. and Haider, M. (2015). Beyond the hype: Big
data concepts, methods, and analytics. International
Journal of Information Management, 35(2):137–144.
Gröger, C. (2018). Building an industry 4.0 analytics platform. Datenbank-Spektrum, 18(1):5–14.
Hai, R., Geisler, S., and Quix, C. (2016). Constance: An
Intelligent Data Lake System. In International Con-
ference on Management of Data (SIGMOD 2016),
San Francisco, CA, USA, ACM Digital Library, pages
2097–2100.
Halevy, A. Y., Korn, F., Noy, N. F., Olston, C., Polyzotis, N.,
Roy, S., and Whang, S. E. (2016). Goods: Organizing
Google’s Datasets. In Proceedings of the 2016 Inter-
national Conference on Management of Data (SIG-
MOD 2016), San Francisco, CA, USA, pages 795–
806.
Heintz, B. and Lee, D. (2019). Productionizing Machine Learning with Delta Lake. https://databricks.com/fr/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html.
Hellerstein, J. M., Sreekanti, V., Gonzalez, J. E., Dalton,
J., Dey, A., Nag, S., Ramachandran, K., Arora, S.,
Bhattacharyya, A., Das, S., Donsky, M., Fierro, G.,
She, C., Steinbach, C., Subramanian, V., and Sun, E.
(2017). Ground: A Data Context Service. In Bien-
nial Conference on Innovative Data Systems Research
(CIDR 2017), Chaminade, CA, USA.
Inmon, B. (2016). Data Lake Architecture: Designing the
Data Lake and avoiding the garbage dump. Technics
Publications.
Maccioni, A. and Torlone, R. (2017). Crossing the finish
line faster when paddling the data lake with KAYAK.
VLDB Endowment, 10(12):1853–1856.
Madera, C. and Laurent, A. (2016). The next informa-
tion architecture evolution: the data lake wave. In
International Conference on Management of Digital
EcoSystems (MEDES 2016), Biarritz, France, pages
174–180.
Miloslavskaya, N. and Tolstoy, A. (2016). Big Data, Fast
Data and Data Lake Concepts. In International Con-
ference on Biologically Inspired Cognitive Architec-
tures (BICA 2016), NY, USA, volume 88 of Procedia
Computer Science, pages 1–6.
Mortenson, M. J., Doherty, N. F., and Robinson, S. (2015).
Operational research from taylorism to terabytes: A
research agenda for the analytics age. European Jour-
nal of Operational Research, 241(3):583–595.
Ravat, F. and Zhao, Y. (2019). Metadata management for
data lakes. In European Conference on Advances in
Databases and Information Systems (ADBIS 2019),
Bled, Slovenia, pages 37–44. Springer.
Sawadogo, P., Kibata, T., and Darmont, J. (2019a). Meta-
data management for textual documents in data lakes.
arXiv preprint arXiv:1905.04037.
Sawadogo, P. N., Scholly, E., Favre, C., Ferey, E., Loud-
cher, S., and Darmont, J. (2019b). Metadata sys-
tems for data lakes: models and features. In Inter-
national Workshop on BI and Big Data Applications
(BBIGAP@ADBIS 2019), Bled, Slovenia, pages 440–
451. Springer.
Scholly, E. (2019). Business intelligence & analytics ap-
plied to public housing. In ADBIS Doctoral Consor-
tium (DC@ADBIS 2019), Bled, Slovenia, pages 552–
557. Springer.
Scholly, E., Sawadogo, P., Liu, P., Espinosa-Oviedo, J. A., Favre, C., Loudcher, S., Darmont, J., and Noûs, C. (2021). Coining goldMEDAL: A new contribution to data lake generic metadata modeling. In 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP@EDBT 2021).
Watson, H. J. and Wixom, B. H. (2007). The current state
of business intelligence. Computer, 40(9):96–99.