De-identification of Clinical Text for Secondary Use:
Research Issues
Hanna Berg, Aron Henriksson, Uno Fors and Hercules Dalianis
Department of Computer and Systems Sciences, Stockholm University, Sweden
Keywords:
De-identification, Privacy, Electronic Health Records, Clinical Text, Natural Language Processing.
Abstract:
Privacy is challenged by both advances in AI-related technologies and recently introduced legal regulations.
The problem of privacy has been extensively studied within the privacy community, but has largely focused on
methods for protecting and assessing the privacy of structured data. Research aiming to protect the integrity
of patients based on clinical text has primarily referred to US law and relied on automatically recognising
predetermined direct and indirect identifiers. This article discusses the various challenges concerning
the re-use of unstructured clinical data, in particular in the form of clinical text, and focuses on ambiguous and
vague terminology, how different legislation affects the requirements for de-identification, differences between
methods for unstructured and structured data, the impact of approaches based on named entity recognition and
replacing sensitive data with surrogates, as well as the lack of measures for usability and re-identification risk.
1 INTRODUCTION
Electronic health records (EHRs) are a valuable re-
source for research aimed at developing and evalu-
ating health care, but they also contain information
which may jeopardise personal integrity. Due to the
sensitive nature of EHRs, access is restricted and pro-
tected by data protection laws, patient data laws or
similar, which may for example require consent by
patients and/or care organisations in order for such
data to be used for research. However, if the risk of
identification is deemed sufficiently low, patient con-
sent may not be required. Automatic de-identification
techniques aim to reduce the risk of identification and
may therefore enable access to clinical data for re-
search while protecting the privacy of patients.
In order to allow for secondary use of EHRs, in
cases where informed consent is difficult to obtain,
it is not sufficient to de-identify only structured data;
one must also consider unstructured data. One type
of unstructured data that needs to be considered is
clinical text. Clinical text may describe, e.g., pa-
tient history, social background, relatives’ contact in-
formation and the patient’s living situation (Dalianis,
2018). Clinical text may, in fact, include more sen-
sitive information than structured EHR data, but is
unfortunately more challenging to de-identify. (This paper is an extension of an abstract submitted to HealTAC 2020.) Research may also require access to both structured and
unstructured EHR data, for example to study and de-
velop new algorithms for decision support. There has
been an increasing interest in AI-based solutions with
advanced and data-hungry algorithms. The need for
ever-increasing amounts of data has led to a need for
well-functioning methods to protect patients’ right to
privacy, while maintaining high-quality data.
De-identification is the process of mitigating the
risk of identifying individuals in datasets by alter-
ing the data. The most appropriate de-identification
method and de-identification level may depend on the
type of data, as well as the context in which the data
will be used. There are methods for structured data to
ensure that possibly identifying values are common
enough, both separately and in combination, not to be
identifying. With free text data, it is not as clear which
information is stored in which section. Therefore, the
primary method for de-identification is based on find-
ing information belonging to certain pre-determined
classes that are deemed to be possibly identifying.
To that end, natural language processing and, specif-
ically, named entity recognition (NER) is used, after
which the identified information is obscured.
In this paper, a number of research issues concern-
ing de-identification of unstructured clinical data, in the form of clinical text, are highlighted and dis-
cussed. The issues concern privacy regulations, ter-
minology, de-identification methods and evaluation.
2 PRIVACY REGULATIONS
Privacy regulations and, specifically, data protection
laws exist to protect individuals' rights to the protection of personal data and privacy. As regulations vary across
countries, so do requirements for de-identification.
In this article, two privacy regulations are described: HIPAA, which applies in the United States, and GDPR, which applies in the European Union. HIPAA is included because it is commonly used as the basis for de-identification (Stubbs et al., 2017, Meystre et al., 2010), and GDPR because it is relevant to the context in which we work. The HIPAA Privacy Rule was in-
troduced in 2003, while GDPR was introduced in
2018. The US Health Insurance Portability and Accountability Act,
HIPAA, regulates the privacy protection of Protected
Health Information, PHI (HIPAA, 1996). PHI is es-
sentially defined as all health information with in-
dividual identifiers. There are 18 PHI identifiers;
if these are removed, the data is not considered to
be sensitive. These identifiers include names, dates
more specific than year, geographic data, contact in-
formation and any unique identifying number or other
code. The method of removing PHI identifiers, named
Safe Harbour, is one of two HIPAA-compliant de-
identification methods. The other method, Expert De-
termination, instead requires that an expert applies
mitigating methods based on statistics and mathemat-
ics until the risk of identification is very small.
The General Data Protection Regulation, GDPR,
covers personal data, defined as data relating to an
identified or identifiable natural person, which is a
person who could be identified directly or indirectly
(European Commission, 2016). Examples of identifiers
are specific physiological or social attributes, which,
when combined, point to one individual.
While HIPAA does not cover datasets where
data of certain classes has been removed or the re-
identification risk is very small, GDPR considers as personal data all data which could potentially be attributed to a natural person through supplementary information or tools that could be used for re-identification. This includes supplementary information accessible only to the data controller. A dataset
which is not identifiable without supplementary in-
formation is pseudonymised. GDPR encourages the
use of pseudonymisation as a privacy-protecting mea-
sure, but pseudonymised data is still considered sen-
sitive. If there are no re-identification risks, the data
is not considered personal data and is not covered by
GDPR. The requirement of zero risk has been crit-
icised as unattainable, with the claim that the only
way to achieve this is by not disclosing any data at
all (El Emam, 2013).
The different levels of risk required may mean that de-identification systems developed for HIPAA are not suitable for use under GDPR. HIPAA and,
specifically, Safe Harbour are often the basis for clini-
cal text de-identification (Stubbs et al., 2017, Kushida
et al., 2012, Meystre et al., 2010, Marimon et al.,
2019). It is unknown whether these methods are ap-
propriate under GDPR, or if there is a need for other
methods with guarantees for privacy.
3 TERMINOLOGY
Another challenge is the lack of a consistent use of
terminology within research on anonymisation and
de-identification. According to a literature review, the
definition of both terms varies within the biomedi-
cal literature, and definitions are often vague or non-
existent (Chevrier et al., 2019). Only around half of
the articles provided a definition of the terms. Ar-
ticles using both terms often made a distinction between the two, where anonymisation most commonly refers to probabilistic and statistical techniques. According to the same review, the
term de-identification more commonly refers to rule-
based techniques where information belonging to pre-
defined categories is removed. According to an-
other literature review (Meystre et al., 2010), de-
identification and anonymisation are often used in-
terchangeably; in their usage, however, de-identification means that explicit identifiers are obscured, while anonymisation implies that implicit identifiers cannot be used
to identify individuals by linkage. HIPAA, on the
other hand, uses the term de-identification regardless
of whether the approach is rule-based, where spe-
cific predetermined information types are removed,
or probabilistic, where statistical methods are ap-
plied to ensure that the risk of re-identification is
sufficiently low. GDPR, in turn, uses
the word anonymisation to refer to data with no re-
identification risks, and pseudonymisation for data
which is not identifiable in the absence of supple-
menting data. Here, the term de-identification will
be used to refer to any type of method where personal
information is hidden or obscured with the intention
to protect the privacy of data subjects.
A dataset is sometimes referred to as de-identified
when risk mitigation techniques have been applied
to ensure a small enough re-identification risk. This
term usage has been criticised as misleading since it
implies that the dataset has a level of re-identification
risk that is, in reality, not met (El Emam, 2013).
The term pseudonymised has more than one mean-
ing. Pseudonymisation may refer to the process of mask-
ing sensitive information with surrogates (Dalianis,
2019). Pseudonymisation may also refer to the us-
age of an alias that, with a key, may be linked to the
real original data (European Commission, 2016). These
are similar concepts, where the former could be seen
as a version of the latter, but without keys. The dis-
tinction made in GDPR between pseudonymised data
and anonymised data is that the pseudonymised data
can be re-identified by linkage, while anonymised
data cannot. This is similar to the distinction made
by, for example, Meystre et al. (2010) concerning
anonymization and de-identification.
The different uses of each term increase the risk
for misunderstandings. A dataset can be described
as de-identified, anonymised, pseudonymised and as
personal data simultaneously by different people, de-
pending on the definition for each term and the con-
text. There needs to be an increased awareness sur-
rounding the use of terminology in de-identification
research, with explicit definitions.
4 DE-IDENTIFICATION
METHODS
While the task of de-identification is difficult no mat-
ter the type of data, the process is especially difficult
for unstructured data. The methods for risk mitiga-
tion and assessment used on well-structured data are
specifically designed for structured data and therefore
not readily applicable to noisy, unstructured data like
free text. Figure 1 provides an architectural overview
of the different approaches for preserving privacy of
clinical data and specifically clinical text.
4.1 De-identification of Structured Data
There are, as previously mentioned, methods to statis-
tically ensure that structured data is protected against
identification. Some examples are k-anonymity, l-diversity and differential privacy. k-anonymity ensures that every combination of potentially identifying values in a structured dataset is shared by at least k individuals (El Emam and Dankar, 2008). l-diversity instead en-
sures that there is enough diversity in sensitive val-
ues within a dataset (Machanavajjhala et al., 2007).
Another method is differential privacy (Dwork et al.,
2014), where aggregation and noise introduction are
combined to create de-identified views of the data;
each time a new data request is made, the level of
noise is adjusted based on previous information given.
Wagner and Eckhoff (2018) conducted a systematic
review of different privacy metrics for struc-
tured data, describing and discussing over 80 metrics.
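To make the first two of these notions concrete, the following minimal Python sketch computes k-anonymity and (distinct) l-diversity for a toy table; the quasi-identifiers, the sensitive attribute and all values are invented for the example and are not a recommendation for real data.

```python
from collections import defaultdict

# Toy structured records: quasi-identifiers (age band, zip prefix, sex)
# plus a sensitive attribute (diagnosis). All values are invented.
records = [
    {"age": "30-39", "zip": "114", "sex": "F", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "114", "sex": "F", "diagnosis": "diabetes"},
    {"age": "30-39", "zip": "114", "sex": "F", "diagnosis": "asthma"},
    {"age": "60-69", "zip": "118", "sex": "M", "diagnosis": "copd"},
    {"age": "60-69", "zip": "118", "sex": "M", "diagnosis": "copd"},
]

QUASI_IDENTIFIERS = ("age", "zip", "sex")

def group_by_quasi_identifiers(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in QUASI_IDENTIFIERS)].append(row)
    return groups

def k_anonymity(rows):
    """Smallest equivalence-class size: every quasi-identifier
    combination is shared by at least this many records."""
    return min(len(g) for g in group_by_quasi_identifiers(rows).values())

def l_diversity(rows, sensitive="diagnosis"):
    """Smallest number of distinct sensitive values within any
    equivalence class (distinct l-diversity)."""
    return min(len({r[sensitive] for r in g})
               for g in group_by_quasi_identifiers(rows).values())

print("k =", k_anonymity(records))   # 2: the smallest group has 2 records
print("l =", l_diversity(records))   # 1: one group has a single diagnosis
```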
While all of these methods may individually have
their disadvantages and there are risks associated with
applying them incorrectly, they do offer the possibil-
ity of assessing the level of risk of re-identification,
and then mitigating the risk. Differential privacy may
also provide statistical guarantees against what could
be inferred from the information provided (Dwork
et al., 2014), while for example k-anonymity does not
(Machanavajjhala et al., 2007).
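To make the contrast concrete, the sketch below shows the simplest building block of differential privacy, the Laplace mechanism applied to a single count query; the epsilon value and the toy data are illustrative assumptions, and a real deployment would also have to track the privacy budget across queries.

```python
import numpy as np

def dp_count(values, predicate, epsilon):
    """Return a noisy count of records satisfying `predicate`.
    A count query has sensitivity 1 (adding or removing one record
    changes it by at most 1), so Laplace noise with scale 1/epsilon
    gives epsilon-differential privacy for this single query."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 37, 62, 64, 38]            # invented toy data
noisy = dp_count(ages, lambda a: a < 40, epsilon=0.5)
print(f"noisy count of patients under 40: {noisy:.1f}")
```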
There are a few examples of methods designed for structured data being used to de-identify unstructured data
in the form of text, for example applying k-anonymity
on quasi-identifiers identified with NER (Gardner and
Xiong, 2009). Another example is recursive parti-
tioning to cluster medical text records based on in-
formation similarity and value-enumeration to de-
identify potentially identifying information (Li and
Qin, 2017). These are promising alternatives that may
provide additional flexibility and ways to deal with
potentially identifying quasi-identifiers.
If a machine learning model is built using sensitive
data, the process can be re-engineered and the individ-
uals revealed. Papernot et al. (2017) have proposed a
solution to this in which differential privacy is used to inject noise during the training process. In Figure 1, one can ob-
serve that adding external databases may enable the
re-identification of individuals, so-called data linkage.
Encryption of data and the use of synthetic data are
other methods for protecting privacy.
4.2 De-identification of Clinical Text
Systems intending to de-identify text generally rely
on NER (Meystre et al., 2010), which is the task of lo-
cating and classifying named entities in unstructured
text (Nadeau and Sekine, 2007). Using NER has the
potential to be useful within the scope of HIPAA
to enable data sharing. The strategy is to, through
hand-crafted rules and/or machine learning, find enti-
ties belonging to any of the HIPAA classes, possibly
with the addition of other classes which are deemed
identifying (Stubbs et al., 2017). If any of these are
found, they are either marked as belonging to a certain
class, or replaced with similar data of the same class.
A common method is to combine machine learning
for the irregular or less structured entities, and rules
for the more structured or regular entities within the
free text (Meystre et al., 2010).
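A minimal Python sketch of such a hybrid pipeline is given below: regular expressions cover the regular identifiers, while `ner_model` stands for a hypothetical, pre-trained machine learning component with an assumed `predict_spans` interface; the note text and patterns are invented for illustration.

```python
import re

# Rules cover regular identifiers such as dates and phone numbers; an
# assumed, pre-trained NER model would cover irregular ones such as names.
PHONE = re.compile(r"\b\d{2,4}[- ]\d{5,7}\b")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def rule_spans(text):
    spans = [(m.start(), m.end(), "PHONE") for m in PHONE.finditer(text)]
    spans += [(m.start(), m.end(), "DATE") for m in DATE.finditer(text)]
    return spans

def deidentify(text, ner_model=None):
    spans = rule_spans(text)
    if ner_model is not None:                   # hypothetical ML component
        spans += ner_model.predict_spans(text)  # e.g. NAME and LOCATION spans
    # Replace from the end of the text so earlier offsets remain valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

note = "Pat. seen 2020-03-14, relative reachable on 08-1234567."
print(deidentify(note))  # Pat. seen [DATE], relative reachable on [PHONE].
```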
4.2.1 Identifying Sensitive Information
Early de-identification systems, for example the Scrub system (Sweeney, 1996), were rule-based. Rule-
based approaches rely on rules, patterns and gazetteer
lists. Since they rely on hand-crafted rules, little or
no annotated data is needed other than for evaluation purposes (Meystre et al., 2010).

Figure 1: An architectural overview of the different approaches to preserve privacy in clinical data. The protected data is shown in the red rectangle. For structured data, methods like aggregation, generalisation, pseudonymisation and noise perturbation may be used to create a safer database. For unstructured text, sections known to have a high density of sensitive information and low relevance for the task may be removed before named entity recognition is performed to find sensitive information. The identified sensitive information is then replaced with, e.g., pseudonyms to make the text safer. Beyond this, there are methods for creating synthetic databases and privacy-preserving learning techniques for machine learning models.

Crafting rules is,
however, a complex task where the developers need
to be aware of all possible PHI patterns that can oc-
cur. They also typically require customisation to a
particular dataset and are therefore less generalisable.
Supervised machine learning methods do not require
hand-crafted rules, but require annotated data for the
algorithm to train on. Feature-based supervised learn-
ing approaches rely on feature engineering. Common
features are lexical features (e.g. word casing, word
shape, punctuation, numerical characters), syntactic
features (e.g. part-of-speech tags) and semantic fea-
tures (e.g. terms from dictionaries, semantic types).
Section headers may also be used. Unlike rule-based
methods, supervised machine learning methods can
automatically learn to recognise complex patterns.
For telephone numbers or other data that tends to
be regular and where the patterns are not complex,
rule-based methods are still often used within hybrid
systems in which rules and machine learning meth-
ods are combined. The first neural network for de-
identification was introduced in 2016 (Dernoncourt
et al., 2017). Neural networks can effectively learn
features through composition over token embeddings
and therefore do not require handcrafted features or
feature engineering to the same extent as feature-
based machine learning methods. These embeddings
can be initialised randomly or pre-trained on large un-
labeled data sets.
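The sketch below outlines, in PyTorch, the core of such a neural token classifier: an embedding layer followed by a bidirectional LSTM and a per-token output layer. It is a minimal illustration only; the dimensions, vocabulary size and tag count are invented, and published systems typically add character-level representations, pre-trained embeddings and a CRF output layer.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM token classifier: embeddings -> BiLSTM -> per-token
    logits over PHI tags (all sizes below are illustrative)."""
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128, num_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):              # (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden)                # (batch, seq_len, num_tags)

model = BiLSTMTagger()
dummy_batch = torch.randint(0, 5000, (2, 20))  # two sequences of 20 token ids
predicted_tags = model(dummy_batch).argmax(dim=-1)
print(predicted_tags.shape)                    # torch.Size([2, 20])
```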
A number of shared task challenges have been or-
ganised to drive the development of de-identification
systems forward. The challenges have focused on
the identification of personally identifying informa-
tion. As a part of the i2b2 project, three challenges
have been organised: the first in 2006 (Uzuner et al.,
2007), the second in 2014 (Stubbs et al., 2015) and
the most recent in 2016 (Stubbs et al., 2017). Further-
more, a de-identification challenge on Spanish syn-
thetic health records, MEDDOCAN, was organized
during IberLEF 2019 (Marimon et al., 2019). Dur-
ing the first challenge in 2006, the submitted sys-
tems were either supervised feature-based machine
learning systems, rule-based systems or a combina-
tion of the two (Uzuner et al., 2007). Machine learn-
ing methods such as SVMs, CRFs, hierarchical Hidden
Markov Models and decision trees were used. The
best performing systems were CRFs or decision trees
with rule template features. The pure rule-based sys-
tems performed the worst. The system with the best
performance scored an entity-based binary precision
score of 0.99 and a recall score of 0.98 on classes
based on HIPAA's Safe Harbour (Stubbs et al., 2015).
In the 2014 i2b2 challenge, a majority of the sub-
mitted systems were hybrid systems using CRFs and
rules, and these also performed the best. The system
with the best performance scored an entity-based bi-
nary precision score of 0.99 and a recall score of 0.96
on classes based on HIPAA's Safe Harbour. In the
2016 i2b2 de-identification task on psychiatric intake
records, CRFs were still the most popular approach
to the de-identification task, but among the top five,
two teams also used LSTMs (Stubbs et al., 2017).
While the most common approach was supervised
machine learning, three of the top four systems were
hybrid systems in which multiple machine learning techniques were combined with hand-crafted rules.
Similarly, during the MEDDOCAN anonymisation
challenge, on narrative text from synthetic Spanish electronic health records, deep learning systems outper-
formed other systems (Marimon et al., 2019). The
best performing system was a bidirectional LSTM
with FLAIR embeddings, as well as both domain-
independent and domain-dependent fastText embed-
dings (Lange et al., 2019). This system achieved an
entity-based binary precision score of 0.98 and a recall score of 0.97 (Marimon et al., 2019).
NER is used to find direct and indirect identi-
fiers in order to remove them. The removal of, e.g.,
names, contact information and serial codes may pro-
tect against re-identification. Other types of informa-
tion are, however, difficult to handle using only NER
methods. Diagnosis codes, and the diagnoses them-
selves, have been shown to be identifying. Accord-
ing to Loukides et al. (2010), 96% of all individuals
in 2,600 patient records could be identified through
their diagnosis codes. While unique combinations of
diseases along with years may appear in written form
within a patient record, such a combination is not considered an identifier according to HIPAA, and is not something that traditional de-identification of text can handle. It
is important to be aware of these limitations, and to
seek ways to overcome them.
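One way to see why diagnosis codes matter is to count how many patients carry a combination of codes that nobody else in the dataset shares; the following sketch does this for an invented toy mapping (the codes and patient identifiers are made up, and a real assessment would use full record counts).

```python
from collections import Counter

# Invented patient-to-diagnosis-code mapping; real ICD codes and real
# record counts would be needed for an actual risk assessment.
patients = {
    "p1": {"J45", "E11"},
    "p2": {"J45"},
    "p3": {"J45", "E11"},
    "p4": {"I10", "E11", "N18"},
}

combo_counts = Counter(frozenset(codes) for codes in patients.values())
unique_patients = [p for p, codes in patients.items()
                   if combo_counts[frozenset(codes)] == 1]

# p2 and p4 hold code combinations shared with nobody else and would be
# vulnerable to linkage if the same codes appear in another data source.
print(unique_patients)
```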
De-identification of text is so heavily connected to
the NER step that the main evaluation metrics of text
de-identification are the ones used for NER: recall,
precision and F1-score (Meystre et al., 2014a). These
metrics measure how many entities of each predefined
class a system manages to find and classify correctly,
but do not consider how identifying the retained information is. In reality, a name could be either very common, as for example John Smith, and hardly identifying, or unique, as for example Severus Snape. Similarly, the disclosed phone number of a close relative increases the risk of identifying a patient to a greater degree than the number of a hospital. Likewise, the incorrect labelling of an entity as PHI may affect downstream tasks to different degrees. Labelling Parkinson's as a surname could be assumed to potentially cause trouble, whereas incorrectly identifying a generic mention of a health care unit as a specific health care unit is likely less harmful.
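For reference, the sketch below computes these standard entity-level scores from sets of (start, end, label) spans under exact matching; the gold and predicted spans are invented. Note that the scores treat every missed entity alike, regardless of how identifying it actually is.

```python
def entity_scores(gold_spans, predicted_spans):
    """Entity-level precision, recall and F1 for exact (binary) matching
    of (start, end, label) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(10, 20, "NAME"), (35, 45, "DATE"), (60, 70, "PHONE")}
pred = {(10, 20, "NAME"), (35, 45, "DATE"), (80, 85, "NAME")}
print(entity_scores(gold, pred))  # (0.67, 0.67, 0.67), rounded
```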
4.2.2 Obscuring Identifiers
If a token is classified as belonging to a possibly iden-
tifying class, it is either removed or replaced. The
removal may either take the form of masking, or of a placeholder that retains information about which type of information was identified. The method of instead replacing data
with similar data has both its advantages and disad-
vantages.
Replacing a name with a surrogate may lead to either an increase or a decrease in re-identification risk. A system relying on NER is unlikely to be able to find all possible sensitive data. The use of realistic sur-
rogates, as opposed to masking, conceals information
about which data is real and which data is not (Car-
rell et al., 2012). This is called Hidden In Plain Sight,
HIPS.
The use of realistic surrogates, or generalising the
information, may at the same time allow for some in-
formation to be kept. The total removal of tempo-
ral information may make the dataset insufficient for
some research, whereas the use of surrogates would
maintain the usability of the dataset. There is, however, a risk of altering clinical informa-
tion which should be kept as it is, with a possible in-
crease in the risk of false conclusions (Meystre et al.,
2014a).
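A minimal sketch of surrogate generation is shown below: names are swapped for entries from a surrogate pool and dates are shifted by a constant per-patient offset so that intervals between events are preserved; the names, date format and offset range are assumptions for the example.

```python
import random
from datetime import datetime, timedelta

# Invented surrogate pool; real systems draw from much larger name lists
# so that surrogates blend in with any identifiers the NER step missed.
SURROGATE_NAMES = ["Karin Lind", "Erik Holm", "Maria Berg"]

def surrogate_for(label, original, date_shift_days):
    if label == "NAME":
        return random.choice(SURROGATE_NAMES)
    if label == "DATE":
        # The same per-patient offset is applied to every date in the
        # record, so the clinical timeline stays internally consistent.
        shifted = (datetime.strptime(original, "%Y-%m-%d")
                   + timedelta(days=date_shift_days))
        return shifted.strftime("%Y-%m-%d")
    return f"[{label}]"            # fall back to masking for other classes

offset = random.randint(-30, 30)   # one offset drawn per patient
print(surrogate_for("NAME", "Anna Svensson", offset))
print(surrogate_for("DATE", "2020-03-14", offset))
```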
5 EVALUATION
There are two aspects of de-identification methods
that are evaluated: the risk of re-identification and the
impact de-identification has on downstream tasks.
5.1 Re-identification
Re-identification is the identification of individuals in
a dataset which is claimed to be de-identified. This is
most commonly achieved by linkage of data between
data sources. The desire to use interconnected medi-
cal data from various sources may lead to a higher risk
of re-identification through linkability.
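The sketch below illustrates the basic linkage attack on structured data: a "de-identified" extract is joined with an external register on quasi-identifiers, and any unique match re-identifies a patient. All rows are invented; the point is only how little is needed for linkage once structure is available, which is precisely what free text lacks.

```python
# Toy "de-identified" research extract (names removed) and a toy public
# register with names; both tables are invented.
research_rows = [
    {"birth_year": 1954, "zip": "118", "sex": "M", "diagnosis": "copd"},
    {"birth_year": 1988, "zip": "114", "sex": "F", "diagnosis": "asthma"},
]
public_register = [
    {"name": "E. Holm", "birth_year": 1954, "zip": "118", "sex": "M"},
    {"name": "K. Lind", "birth_year": 1988, "zip": "114", "sex": "F"},
    {"name": "M. Berg", "birth_year": 1988, "zip": "114", "sex": "F"},
]

QUASI_IDENTIFIERS = ("birth_year", "zip", "sex")
for row in research_rows:
    matches = [p["name"] for p in public_register
               if all(p[k] == row[k] for k in QUASI_IDENTIFIERS)]
    if len(matches) == 1:  # a unique match means the record is re-identified
        print(row["diagnosis"], "->", matches[0])
```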
Most, if not all, examples of automatic re-
identification involve structured data. This may have
various causes: firstly, it is a more common data
source; secondly, there are strategies for assessing the
risk of re-identification; finally, linking information
from various data tables is not a complex task. There
are examples where information in newspaper text
has been extracted manually and then linked to de-
identified structured data to successfully re-identify
individuals (Yoo et al., 2018); however, there are few
examples of re-identification from unstructured text.
Re-identification studies on de-identified text have
focused on the risk of re-identification by another in-
dividual. Carrell et al. (2012) showed that by re-
placing identified sensitive data with surrogates, the
risk of humans finding residual identifiers not found
by the NER system significantly decreased. Meystre
et al. (2014b) set up an experiment where physicians
were asked if they could recognise their patients in a
dataset. In 4.65% of the 86 pseudonymised discharge
summaries, the physician thought they recognised a
patient, but in no case was a patient correctly iden-
tified. A study by Grouin et al. (2015) showed that,
for de-identified text, it was possible to recover spe-
cific values. The disclosed information was, however,
not sufficient to re-identify any patient unless the ad-
versary had access to the hospital health information
system and several documents from the same patient.
There are no clear estimates of how large the risk of
re-identification is with the Safe Harbour method on
unstructured data. In general, there are few methods
for assessing the risk of re-identification for unstruc-
tured data. The most common metric relating to the
safety of a de-identification system for text is based
on calculating how many sensitive entities are found
in a test set. While this recall measurement is likely
to correlate with the risk of re-identification, it does
not directly measure this.
Since there are no examples known to us of auto-
matic re-identification of text in the same way as for
structured data, it is difficult to determine how such
an attack would be performed and therefore also dif-
ficult to determine the level of effort required. In the
end, this makes it difficult to determine the needs for
de-identification in practice. From this perspective, a
step toward improving de-identification would be to
investigate possible re-identification techniques.
5.2 Impact of De-identification
There is a concern that de-identification would af-
fect data quality. It has been hypothesised that the
de-identification process may be harmful on down-
stream tasks as clinical information erroneously clas-
sified as PHI may lead to a reduction in information
content and the introduction of misleading informa-
tion (Meystre et al., 2014a). It has also, however,
been hypothesised that de-identification may poten-
tially improve machine learning performance by re-
ducing dimensionality and noise (Obeid et al., 2019).
Studies have so far shown no significant differ-
ences between using original text data or de-identified
text data as training data for text classification (Obeid
et al., 2019), and small but possibly statistically sig-
nificant benefits of training on de-identified data for
medication name extraction (Deleger et al., 2013).
Meystre et al. (2014a) noted that between 1.2% and 3%
fewer SNOMED-CT concepts were found in the de-
identified dataset than in the original version, but the
difference was largely explained by PHIs being erro-
neously recognised as SNOMED-CT concepts in the
original version rather than a decrease in information
content. So far, no significant deterioration has been
documented for de-identification systems on down-
stream tasks. It could, however, be assumed that a de-
identification system with too low precision reduces
the amount of available information in a way that neg-
atively affects downstream tasks (Berg et al., 2020).
Uncertainty about the impact de-identification has
on the dataset results in uncertainty about which mea-
sures to use in order to compare systems. In prac-
tice, the F1-score, the harmonic mean of precision
and recall, is often used to compare systems (Stubbs
et al., 2017, Meystre et al., 2010). Studies point to
the fact that the precision of de-identification systems
does not necessarily have a significant impact on the
results of downstream research (Obeid et al., 2019, Meystre
et al., 2014a), perhaps because there are so few sensi-
tive tokens in total. There is, however, a need to deter-
mine what impact precision has, in order to be able to
determine what weight should be placed on this when
comparing and evaluating systems but also to decide
how they should be built.
One case where de-identification does, however, seem
to have an impact on the end product is when using
de-identified data for training a NER de-identification
system (Berg et al., 2019, Yeniterzi et al., 2010). Ac-
cording to these studies, models trained on datasets
with surrogates will perform worse on real data than
a model trained on real data would.
6 OTHER APPROACHES
There are other privacy-protecting approaches, in-
cluding synthesised datasets, encryption of data and
black box tools.
Synthesised data is data that is not real but shares the properties of real data. Synthetic data can be created manually, as in Rama et al. (2018), or take the form of multiply-imputed synthetic microdata with the same statistical information as the real microdata, which should therefore lead to the same statistical inferences (Nowok et al., 2016). Methods for generating synthetic data with the same statistical information, such as synthpop, rely on and generate only structured data.
In one approach by Dalianis and Boström (2012),
all the lexical items themselves were removed, while
word features, such as part-of-speech, were kept and
used for the machine learning, resulting in only a
slightly decreased performance. For structured data,
complete encryption has been carried out before ma-
chine learning algorithms have been applied (Bos
et al., 2014, Arellano et al., 2018).
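The sketch below illustrates the general idea of removing the words while keeping their features (not the original implementation): every token is reduced to a coarse word shape plus a part-of-speech tag supplied by a hypothetical `pos_tagger`, before the data leaves the protected environment.

```python
def word_shape(token):
    # Map characters to a coarse shape, e.g. "Anna" -> "Xxxx", "2020" -> "dddd".
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else c for c in token)

def strip_lexical_items(tokens, pos_tagger=None):
    """Drop the words themselves and keep only shape and POS features,
    which can still be used to train a machine learning model."""
    tags = pos_tagger(tokens) if pos_tagger else ["UNK"] * len(tokens)
    return list(zip((word_shape(t) for t in tokens), tags))

tokens = ["Anna", "Svensson", "was", "admitted", "on", "2020-03-14", "."]
print(strip_lexical_items(tokens))
# [('Xxxx', 'UNK'), ('Xxxxxxxx', 'UNK'), ('xxx', 'UNK'), ...]
```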
Finally, an alternative approach for research relying on machine learning is to develop the software, possibly training it on non-sensitive data, externally to where the sensitive clinical text is stored; the models are then applied or executed internally by a gate-keeper and the results reported back to the researcher (Almgren et al., 2016). This is a viable
alternative for research related to clinical text mining.
7 CONCLUSIONS
We have discussed various research issues regarding
the preservation of patients’ integrity in the reuse of
clinical text. The main method is based on HIPAA
and Safe Harbour. This involves first determining
what kind of information is potentially identifying,
finding entities that belong to these classes, and then
deleting or obscuring them. No statistical methods are used to determine which information is identifying; the method relies instead on predetermined rules. While
this may be appropriate in a US context, it is un-
clear whether this type of de-identification allows for
the sharing of data with researchers in a European con-
text since the risk of re-identification remains unclear.
There is a need for other methods to determine which
information is sensitive and to manage it, possibly
by taking advantage of available structured data. At
present, de-identification through named entity recog-
nition seems, however, to be the best we have.
REFERENCES
Almgren, S., Pavlov, S., and Mogren, O. (2016). Named
Entity Recognition in Swedish Health Records with
Character-Based Deep Bidirectional LSTMs. Pro-
ceedings of the Fifth Workshop on Building and
Evaluating Resources for Biomedical Text Mining
(BioTxtM 2016), held in conjunction with Coling
2016, pages 30–29.
Arellano, A. M., Dai, W., Wang, S., Jiang, X., and Ohno-
Machado, L. (2018). Privacy policy and technology in
biomedical data science. Annual review of biomedical
data science, 1:115–129.
Berg, H., Chomutare, T., and Dalianis, H. (2019). Building
a de-identification system for real Swedish clinical text
using pseudonymised clinical text. In Proceedings of
the Tenth International Workshop on Health Text Min-
ing and Information Analysis (LOUHI 2019), pages
118–125.
Berg, H., Henriksson, A., and Dalianis, H. (2020). The Im-
pact of De-identification on Downstream Named En-
tity Recognition in Clinical Text. In To appear in the
Proceedings of the 11th International Workshop on
Health Text Mining and Information Analysis (LOUHI
2020).
Bos, J. W., Lauter, K., and Naehrig, M. (2014). Private
predictive analysis on encrypted medical data. Journal
of biomedical informatics, 50:234–243.
Carrell, D., Malin, B., Aberdeen, J., Bayer, S., Clark, C.,
Wellner, B., and Hirschman, L. (2012). Hiding in
plain sight: use of realistic surrogates to reduce ex-
posure of protected health information in clinical text.
Journal of the American Medical Informatics Associ-
ation, 20(2):342–348.
Chevrier, R., Foufi, V., Gaudet-Blavignac, C., Robert, A.,
and Lovis, C. (2019). Use and understanding of
anonymization and de-identification in the biomedical
literature: Scoping review. Journal of medical Inter-
net research, 21(5):e13484.
Dalianis, H. (2018). Clinical text mining: Secondary use of
electronic patient records. Springer.
Dalianis, H. (2019). Pseudonymisation of Swedish
electronic patient records using a rule-based ap-
proach. In Proceedings of the Workshop on NLP
and Pseudonymisation, pages 16–23, Turku, Finland.
Linköping Electronic Press.
Dalianis, H. and Boström, H. (2012). Releasing a Swedish Clinical Corpus after Removing all Words – De-identification Experiments with Conditional Random
Fields and Random Forests. In Proceedings of the
Third Workshop on Building and Evaluating Re-
sources for Biomedical Text Mining (BioTxtM 2012)
held in conjunction with LREC, pages 45–48.
Deleger, L., Molnar, K., Savova, G., Xia, F., Lingren, T., Li,
Q., Marsolo, K., Jegga, A., Kaiser, M., Stoutenbor-
ough, L., and Solti, I. (2013). Large-scale evaluation
of automated clinical note de-identification and its im-
pact on information extraction. Journal of the Ameri-
can Medical Informatics Association, 20(1):84–94.
Dernoncourt, F., Lee, J. Y., Uzuner, O., and Szolovits, P.
(2017). De-identification of patient notes with recur-
rent neural networks. Journal of the American Medi-
cal Informatics Association, 24(3):596–606.
Dwork, C., Roth, A., et al. (2014). The algorithmic
foundations of differential privacy. Foundations
and Trends in Theoretical Computer Science, 9(3–4):211–407.
El Emam, K. (2013). Guide to the de-identification of per-
sonal health information. Auerbach Publications.
El Emam, K. and Dankar, F. K. (2008). Protecting Privacy
Using k-Anonymity. Journal of the American Medical
Informatics Association, 15(5):627–637.
European Commission (2016). Regulation on the protection
of natural persons with regard to the processing of per-
sonal data and on the free movement of such data.
Gardner, J. and Xiong, L. (2009). An integrated framework
for de-identifying unstructured medical data. Data &
Knowledge Engineering, 68(12):1441–1451.
Grouin, C., Griffon, N., and Névéol, A. (2015). Is it possible to recover personal health information from an automatically de-identified corpus of French EHRs? In
Proceedings of the Sixth International Workshop on
Health Text Mining and Information Analysis, pages
31–39.
HIPAA (1996). Guidance regarding methods for
de-identification of protected health informa-
tion in accordance with the Health Insurance
Portability and Accountability Act (HIPAA)
Privacy Rule, https://www.hhs.gov/hipaa/for-
professionals/privacy/special-topics/de-
identification/index.html. Accessed: 2020-01-17.
Kushida, C. A., Nichols, D. A., Jadrnicek, R., Miller, R.,
Walsh, J. K., and Griffin, K. (2012). Strategies for de-
identification and anonymization of electronic health
record data for use in multicenter research studies.
Medical care, 50(Suppl):S82.
Lange, L., Adel, H., and Strötgen, J. (2019). NLNDE:
The Neither-Language-Nor-Domain-Experts’ Way of
Spanish Medical Document De-Identification. arXiv
preprint arXiv:2007.01030.
Li, X.-B. and Qin, J. (2017). Anonymizing and sharing
medical text records. Information Systems Research,
28(2):332–352.
Loukides, G., Denny, J. C., and Malin, B. (2010). The dis-
closure of diagnosis codes can breach research par-
ticipants’ privacy. Journal of the American Medical
Informatics Association, 17(3):322–327.
Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkita-
subramaniam, M. (2007). l-diversity: Privacy beyond
k-anonymity. ACM Transactions on Knowledge Dis-
covery from Data (TKDD), 1(1):3–es.
Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodríguez, H., Lopez Martin, J., Villegas, M., and
Krallinger, M. (2019). Automatic De-Identification of
Medical Texts in Spanish: the MEDDOCAN Track,
Corpus, Guidelines, Methods and Evaluation of Re-
sults. In Proceedings of the Iberian Languages Eval-
uation Forum (IberLEF 2019). CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain.
Meystre, S. M., Ferrández, Ó., Friedlin, F. J., South, B. R.,
Shen, S., and Samore, M. H. (2014a). Text de-
identification for privacy protection: A study of its
impact on clinical text information content. Journal
of biomedical informatics, 50:142–150.
Meystre, S. M., Friedlin, F. J., South, B. R., Shen, S., and
Samore, M. H. (2010). Automatic de-identification
of textual documents in the electronic health record:
a review of recent research. BMC medical research
methodology, 10(1):70.
Meystre, S. M., Shen, S., Hofmann, D., and Gundlapalli,
A. V. (2014b). Can physicians recognize their own
patients in de-identified notes? In MIE, pages 778–
782.
Nadeau, D. and Sekine, S. (2007). A survey of named entity
recognition and classification. Lingvisticae Investiga-
tiones, 30(1):3–26.
Nowok, B., Raab, G. M., and Dibben, C. (2016). synthpop:
Bespoke Creation of Synthetic Data in R. Journal of
Statistical Software, 74(11):1–26.
Obeid, J. S., Heider, P. M., Weeda, E. R., Matuskowitz,
A. J., Carr, C. M., Gagnon, K., Crawford, T., and
Meystre, S. M. (2019). Impact of de-identification on
clinical text classification using traditional and deep
learning classifiers. Studies in health technology and
informatics, 264:283.
Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., and
Talwar, K. (2017). Semi-supervised knowledge trans-
fer for deep learning from private training data. In Proceedings of the 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April
24-26, 2017.
Rama, T., Brekke, P., Nytrø, Ø., and Øvrelid, L. (2018).
Iterative development of family history annotation
guidelines using a synthetic corpus of clinical text. In
Proceedings of the Ninth International Workshop on
Health Text Mining and Information Analysis, pages
111–121.
Stubbs, A., Filannino, M., and Uzuner, Ö. (2017). De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks track 1. Journal of
biomedical informatics, 75:S4–S18.
Stubbs, A., Kotfila, C., and Uzuner, Ö. (2015). Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth
shared task track 1. Journal of biomedical informatics,
58:S11–S19.
Sweeney, L. (1996). Replacing personally-identifying in-
formation in medical records, the scrub system. In
Proceedings of the AMIA annual fall symposium, page
333. American Medical Informatics Association.
Uzuner, Ö., Luo, Y., and Szolovits, P. (2007). Evaluating the
state-of-the-art in automatic de-identification. Jour-
nal of the American Medical Informatics Association,
14(5):550–563.
Wagner, I. and Eckhoff, D. (2018). Technical privacy met-
rics: a systematic survey. ACM Computing Surveys
(CSUR), 51(3):1–38.
Yeniterzi, R., Aberdeen, J., Bayer, S., Wellner, B.,
Hirschman, L., and Malin, B. (2010). Effects of
personal identifier resynthesis on clinical text de-
identification. Journal of the American Medical In-
formatics Association, 17(2):159–168.
Yoo, J. S., Thaler, A., Sweeney, L., and Zang, J. (2018).
Risks to patient privacy: A re-identification of patients
in Maine and Vermont statewide hospital data. Technology Science, October 2018.