Identifying Personal Data Processing for Code Review

Feiyang Tang

1 a

, Bjarte M. Østvold

1 b

and Magiel Bruntink

2 c

Norwegian Computing Center, Oslo, Norway

Software Improvement Group, Amsterdam, The Netherlands

Keywords:

Data Privacy Protection, Code Review, Static Analysis.

Abstract:

Code review is a critical step in the software development life cycle, which assesses and boosts the code’s

effectiveness and correctness, pinpoints security issues, and raises its quality by adhering to best practices. Due

to the increased need for personal data protection motivated by legislation, code reviewers need to understand

where personal data is located in software systems and how it is handled. Although most recent work on

code review focuses on security vulnerabilities, privacy-related techniques are not easy for code reviewers to

implement, making their inclusion in the code review process challenging. In this paper, we present ongoing

work on a new approach to identifying personal data processing, enabling developers and code reviewers

in drafting privacy analyses and complying with regulations such as the General Data Protection Regulation

(GDPR).

1 INTRODUCTION

The General Data Protection Regulation (GDPR) lays

the legal foundation for data protection in the EU and

increases individual data protection rights through-

out Europe. It also carries signiﬁcant ﬁnes of up to

4% of yearly worldwide revenue for businesses that

do not comply with the legislation. Many IT system

providers, especially software-producing ﬁrms, may

need to alter their systems in order to comply with

the GDPR. This is predicted to require signiﬁcant ef-

fort (Blume, 2016). As a result, providing software

engineers in the industry with effective and system-

atic ways to build data protection into software is an

essential and beneﬁcial study topic (Lenhard et al.,

2017). Organizations are pushing security to the soft-

ware development life cycle, such as code review,

to prevent software security vulnerabilities (Braz and

Bacchelli, 2022). Similarly, to comply with privacy-

by-design and perform privacy analysis tasks, code

reviewers would beneﬁt from similar tools to those

used for security to identify privacy-related patterns

in software.

Developers address privacy concerns using data

security terminology, and this vocabulary conﬁnes

their notions of privacy to threats outside of the orga-

https://orcid.org/0000-0002-8720-6743

https://orcid.org/0000-0001-6922-4027

https://orcid.org/0000-0002-6117-6347

nization (Hadar et al., 2018). However, even though

data security is the main prerequisite of data privacy,

privacy protection in software is still very much dif-

ferent from traditional security-related vulnerabilities.

And according to Bambauer: “security and privacy

can and should be treated as distinct concerns” (Bam-

bauer, 2013). Developers struggle to convert legal,

ethical, and social privacy concerns into concrete

technology and solutions (Notario et al., 2015).

Assessing privacy involves not only ﬁnding per-

sonal data in the software but also evaluating compli-

ance with the related processing. GDPR deﬁnes as

processing: “any operation or set of operations which

is performed on personal data or on sets of personal

data, whether or not by automated means.” The deﬁ-

nition encompasses a vast range of actions performed

on personal data, such as collecting, recording, orga-

nization, structuring, storage, adaption or modiﬁca-

tion, retrieval, transit, etc. Privacy assessment tasks

beg the question: How can we assist code review-

ers and software developers in assessing personal data

processing? By identifying personal data and the rele-

vant processing in the system, code reviewers can un-

cover interesting patterns and utilize them to redesign

the system to be more privacy-friendly or perform pri-

vacy analysis.

In this paper, we present ongoing work on a

novel approach designed to assist developers and code

reviewers in identifying personal data processing,

568

Tang, F., Østvold, B. and Br untink, M.

Identifying Personal Data Processing for Code Review.

DOI: 10.5220/0011725700003405

In Proceedings of the 9th International Conference on Information Systems Security and Privacy (ICISSP 2023), pages 568-575

ISBN: 978-989-758-624-8; ISSN: 2184-4356

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

which can subsequently be used for privacy analy-

sis. This enables developers and code reviewers to as-

sist organizations with a variety of important privacy-

related tasks, such as completing a data protection im-

pact assessment (DPIA) and creating a privacy policy.

2 RELATED WORK

An essential step in the software development pro-

cess, code reviewing incorporates both manual and/or

automated reviews. The main goal of code reviews is

to assess and boost the code’s effectiveness and cor-

rectness, pinpoint security issues, and raise its qual-

ity by adhering to best practices (McIntosh et al.,

2014). To automatically evaluate code, a variety of

vulnerability detection tools have been built. They are

also known as source code analyzers or static analysis

tools, as they can analyze a program’s code without

having to execute it (McGraw, 2008).

CodeQL

, and Semgrep (r2c, 2022)

are two pop-

ular code review tools that utilize static analysis. Cod-

eQL treats code as if it were data, and issues are mod-

eled as queries. Following the extraction of these

queries from the code, they are executed against a

database. The database is a directory containing data,

a source reference for displaying query results, query

results, and log ﬁles. Semgrep matches grammatical

patterns on parsed programs (represented as an Ab-

stract Syntax Tree (AST)) instead of matching string

or regular expression (regex) patterns on the program

as a string. Semgrep makes it considerably simpler

to construct customized rules than CodeQL, which

needs rules to be deﬁned in QL, a declarative object-

oriented query language.

There is relatively little published work that fo-

cuses on code reviews to identify privacy-related vul-

nerabilities, and it is problematic to translate current

security knowledge to privacy, which we will explain

in Section 3. There are studies on the identiﬁcation

of personal data that are valuable to our research.

Fugkeaw et al. (Fugkeaw et al., 2021) proposed AP2I

to enable organizations to identify and manage per-

sonal data in the local ﬁle system automatically. By

monitoring network trafﬁc, ReCon (Ren et al., 2016)

utilized machine learning to identify probable per-

sonal data breaches. van der Plas et al. (van der Plas,

2022) used CodeBERT, a RoBERT-like transformer

model, to identify personal data in Git commits.

https://codeql.github.com/

https://semgrep.dev/

3 BACKGROUND AND

CHALLENGES

Data privacy analysis is becoming as crucial as secu-

rity vulnerability discovery and has brought a new di-

mension to the data security dilemma (Bertino, 2016).

It is advantageous for code reviewers to be able to

conduct a similar privacy analysis that they did for se-

curity.

The current state of the art is mostly focused on

security analysis. Although data security is a primary

requirement for data privacy, the analysis domain and

identiﬁcation process are rather different (Jain et al.,

2016). Simply adopting security mechanisms and

mindsets to analyze privacy can be misguided, and

even harmful (Bambauer, 2013).

Integration of recent studies on assessing software

privacy during code review is challenging. On the

subject of program analysis, three well-known pri-

vacy analysis methods are available. First, static anal-

ysis based on bytecode requires project compilation,

whereas dynamic taint analysis requires project ex-

ecution. This is not practical nor efﬁcient for code

reviewers to implement. A machine learning-based

technique is similarly difﬁcult to implement, as it re-

quires a large and diverse training data set. Obtaining

and generating such data sets requires additional ef-

fort and could be outside the scope of code reviewers’

capabilities. Lastly, text analysis based on UI widgets

is constrained for privacy by domain-speciﬁc UI at-

tributes. A ﬁnancial web application that employs a

model trained on an Android health mobile applica-

tion is unlikely to beneﬁt. Code reviewers require an

approach that is simple to deploy, efﬁcient, and adapt-

able (Buse and Zimmermann, 2012).

Due to the complex nature of privacy and the ﬂu-

idity of the deﬁnition of personal data, identifying the

processing of personal data in the codebase presents

challenges.

In the following paragraphs, we highlight the two

most signiﬁcant challenges related to the task in the

context of code review.

3.1 The Ambiguous Deﬁnition of

Personal Data

Article 4(1) in GDPR deﬁnes personal data as:

any information relating to an identiﬁed or

identiﬁable natural person (‘data subject’);

an identiﬁable natural person is one who can

be identiﬁed, directly or indirectly, in partic-

ular by reference to an identiﬁer such as a

name, an identiﬁcation number, location data,

Identifying Personal Data Processing for Code Review

569

an online identiﬁer or to one or more factors

speciﬁc to the physical, physiological, genetic,

mental, economic, cultural or social identity

of that natural person.

The deﬁnition of personal data in the GDPR is so

broad that almost any information may qualify as per-

sonal data if it refers to a speciﬁc individual, such as

the fact that a person is wearing a red shirt (Ber

and George, 2009). The deﬁnition is also semanti-

cally ambiguous.

In contrast to the fact that certain data may be

anonymous from the start (such as weather sensor

data without any connection to real people), other data

may initially be personal data but later be success-

fully altered to no longer have any connection to an

identiﬁed or identiﬁable natural person. This empha-

sizes how ﬂexible the categorization of personal data

is (Finck and Pallas, 2020).

The same data point may be personal or non-

personal depending on the context and may thus be

covered by the regulation or not. This implies that

the categories of personal data in the software vary

depending on the software and the processing under-

lying it. For instance, health data such as blood pres-

sure and medical records, for example, are sensitive

for a health application, but location data is sensitive

for navigation software.

Even if we accept that content-wise every item of

information can be considered personal data if it can

be related to an individual, the GDPR’s deﬁnition is

still rather vague structurally since it is not always

clear what kind of structure every ‘record’ of an indi-

vidual must have to be considered personal data (Voss

and Houser, 2019).

Due to the ambiguous nature of the deﬁnition of

personal data in the relevant legislation, it is practi-

cally difﬁcult for us to have a clear and ﬁxed identiﬁer

to precisely locate personal data in code.

3.2 What Counts as Sensitive

Processing?

Data subjects may agree to data processing for par-

ticular reasons. This is the usual legal basis but only

counts as one factor. Processing may also be “nec-

essary for the performance of a contract to which the

data subject is a party or in order to take steps at the

request of the data subject prior to entering into a con-

tract.” (Voss and Houser, 2019)

Unfortunately, concerns that arise in principle

about the relationship between contract and consent

tend to be avoided in reality by disregarding consent

requirements (Pormeister, 2017).

We cannot rely on existing privacy policies and

written consent to uncover personal data processing in

the codebase. This requires us to consider all potential

personal data processing in the codebase. Later we

will explain how we deﬁne and identify the relevant

processing in software in Section 4.1.2.

4 APPROACH

We present an approach to identify instances of per-

sonal data processing in the codebase and present

them in a way that facilitates the code review.

The approach has three primary phases: pattern

matching, labeling, and grouping of results. As input,

we take the codebase, which consists of source code

ﬁles. Then, a static analyzer will evaluate these source

code ﬁles using our rules and patterns. The code snip-

pets discovered by the static analyzer are then labeled

according to the various features they include. Fi-

nally, we allow users to group the results by single

or several labels, allowing a personalized exploration

of the ﬁndings.

An illustration of our approach is shown in Fig-

ure 1.

Codebase

Code snippets

Labeled snippets

Results

pattern

matching

labeling

grouping

Figure 1: Approach.

4.1 Design Choices

In the following paragraphs, we discuss our design

choices for implementing the approach.

4.1.1 Types of Findings

We want to have a basic default list of personal data

that we want to locate, this is mostly personal identi-

ﬁcation and characteristics data, such as full name,

email address, gender, sexual orientation, and age.

We call them ﬁxed personal data.

According to different types of software, we cus-

tomize default lists for them. For example, for a bank-

ing/ﬁnance application, the list may contain bank ac-

count numbers, credit scores, and salary information.

ICISSP 2023 - 9th International Conference on Information Systems Security and Privacy

570

This type of personal data is subject to context - the

types of processing in speciﬁc software, which we

named contextual personal data.

Depending on how we locate the mentioned per-

sonal data in the software, we can divide their occur-

rences in code into simply two types.

The ﬁrst is in clear text. This includes all kinds

of locations where personal data appear in clear text.

It is verbatim or direct personal data. For example,

a credit card number appears in an SQL query, or an

email address falls into a log function.

The other type is more common and subtle, where

personal data is stored in a variable or an object. De-

pending on the different types of programming lan-

guages, the object types might vary from a local vari-

able, a class instance, or a prototype. This means we

aim to ﬁnd the code that processes this type of data.

4.1.2 Types of Processing

Simply locating every instance of personal data pro-

duces a large number of results. Many of these do not

directly help the code reviewer’s work, which is to

ﬁnd meaningful processing. We want to use a hybrid

approach to cover as many as processing as possible.

Processing personal data represents a speciﬁc

behavior. This motivates our ﬁrst approach: to

use an action name tag to ﬁnd relevant processing.

We adopted most of the verbs from Section 3 of

DPV (Pandit et al., 2019)

. These vocabularies help

us to ﬁnd relevant processes in the software.

The second approach is the identiﬁcation of exter-

nal libraries. We know that modern applications rely

on various APIs to achieve different goals. Therefore,

obtaining a list of relevant APIs and detecting the ex-

istence of personal data that ﬂows into them helps us

ﬁnd meaningful patterns.

4.2 Pattern Matching

The ﬁrst step is to feed our codebase (consisting of

source code ﬁles) to the static analyzer for pattern

matching. We chose Semgrep as our analyzer because

of its user-friendly rules and rapid processing perfor-

mance. Depending on the different syntactic char-

acteristics of personal data, as we discussed in Sec-

tion 4.1.1, we adopt a hybrid approach that combines

two different types of analysis.

• Match personal data in clear text using regular ex-

pression matching.

• Taint analysis to ﬁnd ﬂows in each ﬁle between

a source (where personal data enters the analysis

https://w3c.github.io/dpv/dpv/

scope) and a sink (where personal data gets pro-

cessed) that match our criteria.

Our personal data processing rules currently support

Java, JavaScript, and TypeScript as our primary anal-

ysis domains. However, our rules for identifying

clear-text personal data apply to the vast majority of

Semgrep-supported languages.

4.2.1 Source and Sink

Our prototype classiﬁes the sources into nine separate

categories. As stated in Section 4.1.1, we divide ﬁxed

personal data into four different categories: account,

contact, national ID, and personal ID. Included are

ﬁve more contextual personal data categories, such as

location, health, and ﬁnancial data. In addition, we

provide a template for identifying the processing of

personal data and enable code reviewers and develop-

ers to submit additional personal data simply by enter-

ing the relevant keywords. Then, corresponding rules

will be automatically produced for future use.

Sinks are categorized into ﬁve main types. Three

types of action: data manipulation (M), data trans-

portation (T), and data creation/deletion (C/D). An-

other two represent two special types: database (DB)

and encryption (E).

A sink’s name may contain a speciﬁc type of

source. For example, setLatitude(100,100)

does not take any source into the method, but

includes a source identiﬁer Latitude and a sink

identiﬁer set, showing that it processes values

directly as a source into a sink. We call this spe-

cial type of sink a source-speciﬁc sink. When a

source-speciﬁc sink invokes anything, we mark this

source-speciﬁc sink as the new source but the caller

of the source-speciﬁc sink as the new sink. For exam-

ple, in gpsTracker.setLatitude(100,100),

setLatitude becomes the new source and

gpsTracker is the new sink.

Inspired by how Privado

uses regular expres-

sions to identify GDPR-related data in Java applica-

tions, a sample Semgrep rule that matches the pattern

of account data source goes into a transportation (T)

sink is shown below in Figure 2, followed by a sample

code snippet detected in Figure 5.

4.3 Labeling

The identiﬁed ﬁndings from Semgrep are in the form

of various lengths of code snippets (consisting of

statements and expressions). Each ﬁnding contains

at least one detected sink and one source (or an object

that received value from a source). We abstract the

https://www.privado.ai

Identifying Personal Data Processing for Code Review

571

Figure 2: Semgrep rule: ﬁnd personal data ﬂows from ac-

count data source to transportation sink.

Figure 3: Sample code snippet (from ToolJet) detected by

Semgrep showing a ﬂow from account personal data to a

transportation sink.

structure of possible sources and sinks in each code

snippet using the symbols below.

• O ranges over sources

• I ranges over sinks

• I

ranges over source-speciﬁc sinks

We write

O as shorthand for a possibly empty se-

quence O

, ··· , O

. Here the underscore represents

a placeholder for an expression that is insigniﬁcant in

terms of privacy - it is neither a source nor sink nor

contains a value from a source.

Below is a list of the common ﬂow abstracts be-

tween sources and sinks that we observed in each

code snippet. Each abstract represents a typical ﬂow,

for example, 1 to 3 show that there are values pass-

ing through a sink to a source, from a non-privacy

sensitive value ( 1 ) or from another source ( 2 ) or

from innovating a sink inside another source object

( 3 ).

1 O = .I( )

2 O

= .I(O

, )

3 O

= .O

.I( )

4 = .O.I( )

5 = .I(

O, )

6 .O.I( )

7 .O.I( ,

8 .I

( )

9 .I

( ,

10 .I(

O, )

For each identiﬁed code snippet, we label them

with 22 labels (9 types of source, 5 types of sink, 5

types of source-speciﬁc sink, and 3 types of change

in the sensitivity level), which are listed in Table 1.

Besides the deﬁnition of source and sinks, we also in-

troduce an important label: sensitivity. The sensitivity

level can increase, decrease, and stay the same in one

identiﬁed code snippet.

Table 1: Labels to be assigned to each code snippet.

O Nine types of source: {O

, O

, . . . , O

}

I Five types of sink: {I

, I

, . . . , I

}

Five types of source-speciﬁc sink :

, I

, . . . , I

}

S Sensitivity level change: {up, down, equal}

Sensitivity Level. Not every result shares the same

level of sensitivity regarding personal data process-

ing. After processing, the data from the source might

remain at a similar sensitivity level, become more sen-

sitive, or become less sensitive.

• S = up: 1 , 4 , 5

• S = equal: 2 , 3 , 6 , 7 , 8 , 9

• S = down: 10

4.4 Result Presentation

Johnson et al. (Johnson et al., 2013) pointed out

that “because the results are dumped onto a code re-

viewer’s screen with no distinct structure causing him

to spend a lot of time trying to ﬁgure out what needs

to be done”. This indicates that developers and code

reviewers may not beneﬁt from ungrouped code snip-

pets from static analysis tools if they are not presented

in a sensible manner.

To tackle this issue, we present a two-phase tech-

nique to process the ﬁndings from Semgrep and

present them to code reviewers in a smart way.

After each code snippet is labeled, we start to

group them for presentation using their labels and

other criteria. Criteria for grouping include not only

the labels but also other properties:

• neighboring results will be combined (same ﬁle

and within a line number threshold);

ICISSP 2023 - 9th International Conference on Information Systems Security and Privacy

572

• same or similar source/sink name;

• same API usage (e.g. every code snippet that is

related to the same API MongoDB).

Figure 4 following provides a straightforward illus-

tration of how we present our results. The results are

presented in two separate sections: plain text results

and ﬂow results. Users have the ﬂexibility to select

any label or label combination to ﬁlter the results.

Plain Text Personal Data

email: 'user@enterprise.org'

queryExecutor('SELECT John Doe FROM userDB')

Flows Group by: source

Account data

Government ID data

Transportation

Manipulation

Sensitivity level: up

Creation/deletion

Manipulation

createRepository{userId: userId;

pasId: pasId;

ssn: ssn}

Figure 4: Example presentation of the result. Personal data

occurrences is at the top and personal data processing code

is at the bottom.

5 DEMONSTRATION

We created rules in Semgrep trying to capture as

many useful ﬁndings for our analysis. The soft-

ware we analyzed here is ToolJet

, an open-source

low-code framework for building React-based web

applications. ToolJet’s implementation is mostly in

JavaScript and TypeScript. Users can build internal

tools using ToolJet’s prebuilt UI widgets to connect to

data sources like databases, API endpoints, and exter-

nal services. This means ToolJet has many parts that

process personal data, which makes it a good starting

example.

Our Semgrep rules produce a total of 1,589 results

from ToolJet’s source code. We manually reviewed

each of the results and calculated the precision for

each category. If a single result can clearly demon-

strate the processing of personal data, we consider it

relevant and it could be beneﬁcial for privacy code

review. Surprisingly, most false positives come from

the personal data occurrence detector (with a preci-

sion of only 46.6%), while most personal data pro-

cessing results are relevant (with an average of 90.9%

precision for categories that have more than 50 code

snippets identiﬁed).

Detailed statistics are listed in Tables 2 and Ta-

ble 3.

https://github.com/ToolJet/ToolJet

Table 2: The code snippet count for each identiﬁed source

and sink identiﬁed, ‘-’ marks labels for which our approach

detected no code snippet. Sink types are: data manipulation

(M), data transportation (T), data creation/deletion (C/D),

database (DB), encryption (E) and log (L).

M T C/D DB E L

Account 66 171 84 24 - 21

Contact 89 175 36 3 - 3

Personal ID 56 133 41 7 1 4

Online ID 6 26 1 - - 1

Location 1 2 - - - -

Table 3: The precision of code snippet relevance (in %) for

each identiﬁed type of source and sink, ‘-’ marks the labels

for which our approach did not detect any code snippet, ‘*’

marks the labels for which our approach detected less than

10 results. Sink types are: data manipulation (M), data

transportation (T), data creation/deletion (C/D), database

(DB), encryption (E) and log (L).

M T C/D DB E L

Account 90.9 90.6 95.2 91.67 - 95.2

Contact 89.9 94.9 80.6 * - *

Personal ID 92.9 81.9 85.4 * * *

Online ID * 84.6 * - - *

Location * * - - - -

Figure 5 shows a simple interesting example

of a grouped result showing how personal data

userId is retrieved from a local repository in

app users.service.ts and then utilized to generate

many data structures, such as the app object in

app service.ts.

app.service.ts

app_users.service.ts

Figure 5: Grouped example results showing how

organizationUserId ﬂows between functions.

Identifying Personal Data Processing for Code Review

573

5.1 Future Work

Since our objective is to identify all relevant process-

ing of personal data in source code, reducing false

negatives is our next primary priority. However, in

our case, false positives are not a major concern. Due

to the subtlety of personal data processing, determin-

ing relevance without human assistance is particularly

challenging. Specifying the analysis to certain spe-

ciﬁc patterns would ease manual analysis. This ne-

cessitates the implementation of a privacy taxonomy.

Using Ethyca’s taxonomy (Ethyca, 2022) as an exam-

ple, we may modify our labels to match the technique

with the taxonomy.

As an extension of this article, we propose an au-

tomated mapping of personal data in an unpublished

(under review) manuscript (Tang et al., 2023) to assist

developers and code reviewers in identifying privacy-

related code. The mapping based on static analysis

automatically detects personal data and the code that

processes it, and we offer semantics of personal data

ﬂows.

6 CONCLUSIONS

This short paper presented ongoing work on a novel,

customizable approach to identify personal data pro-

cessing for code review. This three-phase technique

ﬁrst uses Semgrep to match patterns in the code based

on rules for sources and sinks, then associates code

snippets generated from pattern matching with a set of

behavioral labels, and ﬁnally groups results to reduce

code reviewer workload. Our demonstration shows

the utility and feasibility of this method for gathering

and presenting code snippets related to personal data

processing from a codebase.

Along with the continued development of the ap-

proach architecture (reﬁned rules for source and sink,

more meaningful labels, and additional criteria for

grouping), future work will focus on expanding the

case study to include a larger set of open-source soft-

ware from various domains and conducting a thor-

ough user evaluation of the resulting platform.

ACKNOWLEDGEMENTS

This work is part of the Privacy Matters (PriMa)

project. The PriMa project has received funding from

European Union’s Horizon 2020 research and inno-

vation program under the Marie Skłodowska-Curie

grant agreement No. 860315.

REFERENCES

Bambauer, D. E. (2013). Privacy versus security. J. Crim.

L. & Criminology, 103:667.

Ber

c, B. and George, C. (2009). Identifying personal data

using relational database design principles. Interna-

tional Journal of Law and Information Technology,

17(3):233–251.

Bertino, E. (2016). Data security and privacy: Concepts, ap-

proaches, and research directions. In 2016 IEEE 40th

Annual Computer Software and Applications Confer-

ence (COMPSAC), volume 1, pages 400–407. IEEE.

Blume, P. (2016). Impact of the EU General Data Protec-

tion Regulation on the public sector. Journal of Data

Protection & Privacy, 1(1):53–63.

Braz, L. and Bacchelli, A. (2022). Software security dur-

ing modern code review: The developer’s perspective.

arXiv preprint arXiv:2208.04261.

Buse, R. P. and Zimmermann, T. (2012). Information

needs for software development analytics. In 2012

34th International Conference on Software Engineer-

ing (ICSE), pages 987–996. IEEE.

Ethyca (2022). Fides language. https://ethyca.github.io/

ﬁdeslang/. (Accessed on 11/15/2022).

Finck, M. and Pallas, F. (2020). They who must

not be identiﬁed—distinguishing personal from non-

personal data under the GDPR. International Data

Privacy Law, 10(1):11–36.

Fugkeaw, S., Chaturasrivilai, A., Tasungnoen, P., and

Techaudomthaworn, W. (2021). AP2I: Adaptive PII

scanning and consent discovery system. In 2021 13th

International Conference on Knowledge and Smart

Technology (KST), pages 231–236. IEEE.

Hadar, I., Hasson, T., Ayalon, O., Toch, E., Birnhack, M.,

Sherman, S., and Balissa, A. (2018). Privacy by de-

signers: software developers’ privacy mindset. Em-

pirical Software Engineering, 23(1):259–289.

Jain, P., Gyanchandani, M., and Khare, N. (2016). Big data

privacy: a technological perspective and review. Jour-

nal of Big Data, 3(1):1–25.

Johnson, B., Song, Y., Murphy-Hill, E., and Bowdidge, R.

(2013). Why don’t software developers use static anal-

ysis tools to ﬁnd bugs? In 2013 35th International

Conference on Software Engineering (ICSE), pages

672–681. IEEE.

Lenhard, J., Fritsch, L., and Herold, S. (2017). A literature

study on privacy patterns research. In 2017 43rd Eu-

romicro Conference on Software Engineering and Ad-

vanced Applications (SEAA), pages 194–201. IEEE.

McGraw, G. (2008). Automated code review tools for secu-

rity. Computer, 41(12):108–111.

McIntosh, S., Kamei, Y., Adams, B., and Hassan, A. E.

(2014). The impact of code review coverage and code

review participation on software quality: A case study

of the qt, vtk, and itk projects. In Proceedings of the

11th working conference on mining software reposito-

ries, pages 192–201.

Notario, N., Crespo, A., Mart

ın, Y.-S., Del Alamo, J. M.,

Le M

etayer, D., Antignac, T., Kung, A., Kroener, I.,

and Wright, D. (2015). PRIPARE: integrating privacy

ICISSP 2023 - 9th International Conference on Information Systems Security and Privacy

574

best practices into a privacy engineering methodology.

In 2015 IEEE Security and Privacy Workshops, pages

151–158. IEEE.

Pandit, H. J., Polleres, A., Bos, B., Brennan, R., Bruegger,

B., Ekaputra, F. J., Fern

andez, J. D., Hamed, R. G.,

Kiesling, E., Lizar, M., et al. (2019). Creating a vo-

cabulary for data privacy. In OTM Confederated In-

ternational Conferences” On the Move to Meaningful

Internet Systems”, pages 714–730. Springer.

Pormeister, K. (2017). Informed consent to sensitive per-

sonal data processing for the performance of digital

consumer contracts on the example of “23andMe”.

Journal of European Consumer and Market Law, 6(1).

r2c (2022). Semgrep. https://semgrep.dev/. (Accessed on

11/15/2022).

Ren, J., Rao, A., Lindorfer, M., Legout, A., and Choffnes,

D. (2016). Recon: Revealing and controlling pii leaks

in mobile network trafﬁc. In Proceedings of the 14th

Annual International Conference on Mobile Systems,

Applications, and Services, pages 361–374.

Tang, F., Østvold, B. M., and Bruntink, M. (2023). Mapping

personal data in source code for GDPR compliance.

van der Plas, N. (2022). Detecting PII in Git commits. TU

Delft Master’s thesis.

Voss, W. G. and Houser, K. A. (2019). Personal data

and the GDPR: providing a competitive advantage

for US companies. American Business Law Journal,

56(2):287–344.

Identifying Personal Data Processing for Code Review

575