Collaborative Reproducible Reporting

Git Submodules as a Data Security Solution

Peter E. DeWitt

and Tellen D. Bennett

Biostatistics and Bioinformatics, Colorado School of Public Health,

University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, U.S.A.

Pediatric Critical Care, University of Colorado School of Medicine, Children’s Hospital Colorado, Aurora, CO, U.S.A.

Keywords:

Data Security, Collaborative Authoring, Reproducible Reports, Workﬂow, Software.

Abstract:

Sensitive data and collaborative projects pose challenges for reproducible computational research. We present

a workﬂow based on literate programming and distributed version control to produce well-documented and

dynamic documents collaboratively authored by a team composed of members with varying data access priv-

ileges. Data are stored on secure institutional network drives and incorporated into projects using a feature

of the Git version control system: submodules. Code to analyze data and write text is managed on public

collaborative development environments. This workﬂow supports collaborative authorship while simultane-

ously protecting sensitive data. The workﬂow is designed to be inexpensive and is implemented primarily

with a variety of free and open-source software. Work products can be abstracts, manuscripts, posters, slide

decks, grant applications, or other documents. This approach is adaptable to teams of varying size in other

collaborative situations.

1 INTRODUCTION

Reproducible reporting, deﬁned here as processing

data and generating an abstract, manuscript, slide

deck, or poster via a fully documented and auto-

mated process, is considerably more difﬁcult when

working with multiple authors and sensitive data,

such as protected health information (PHI). Work-

ﬂows for reproducible computational research using

tools such as the Jupyter Notebook (http://jupyter.org/), the Galaxy project (http://galaxyproject.org/), or RStudio (Gandrud, 2015) are not consistently used in biomedical research (Peng et al., 2006; National Academies of Sciences, Engineering, and Medicine, Division on Engineering and Physical Sciences, Board on Mathematical Sciences and Their Applications, Committee on Applied and Theoretical Statistics, 2016). This may be due to concerns about slower production, the need for investigators to learn new tools, or barriers to collaboration between investigators with varying computational skills and development environments. Collaborative research involving sensitive data poses additional challenges.

One solution would be for a team to work in a single development environment hosted on a computational server with the necessary physical and electronic security standards for the level of sensitivity of

the data. However, the ﬁnancial investment required

to build a full development environment behind an in-

stitutional ﬁrewall might be prohibitive for some re-

search teams. Fortunately, a reproducible collabora-

tive workﬂow that protects sensitive data is possible

at much lower cost.

We minimize team hardware and software ex-

penses in two ways. First, only those team mem-

bers who require data access are provided with

institutionally-owned laptops with licenses for whole-

disk encryption and other proprietary software. Sec-

ond, by using free and open-source software for ver-

sion control, analysis, and manuscript authoring, we

incur minimal ﬁnancial expenses when new team

members join or when we collaborate with external

investigators.

Our solution to data protection and collabora-

tion is to compartmentalize and distribute our project

such that data resources, analysis scripts, and text

are all linked together, version-controlled, and access-

controlled via implicit and explicit read/write permis-

sions. Raw data is stored on institutionally owned

network drives and cloned on laptop computers which

have been approved for storage of our data. Only team members with institutional review board approval can access the data. Data analysis scripts and manuscript text files are available to all team members on public code hosting services. The linkage between the data and the code is made possible by a feature of the Git version control software: submodules. A 40-character hexadecimal sequence (SHA-1 hash) allows us to share the version of the data source publicly without compromising the data itself.

DeWitt P. and Bennett T.
Collaborative Reproducible Reporting - Git Submodules as a Data Security Solution.
DOI: 10.5220/0006109302300235
In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2017), pages 230-235
ISBN: 978-989-758-213-4

The objective of this manuscript is to present a

workﬂow that we developed to 1) protect sensitive

data from unauthorized access, 2) allow multiple authors, including those with and without data access rights, to contribute to a single set of files, and 3) minimize the financial commitment to hardware and software.

Our primary focus is on the use of the Git version

control system and speciﬁcally, Git submodules. We

will note other software tools and programs used in

our workﬂow, but they can often be substituted for

other similar software.

2 WORKFLOW OVERVIEW

Dynamic document authoring is a key component of

the overall reproducible research paradigm. Varia-

tions on literate programming (Knuth, 1984) are ideal

for this purpose. The R package knitr (Xie, 2015),

an evolution of R’s Sweave (Leisch, 2002) package, provides a structured process for authoring a

manuscript using a literate programming paradigm.

knitr was highlighted several times at a recent

workshop supported by the National Academies

of Sciences, Engineering, and Medicine (National

Academies of Sciences, Engineering, and Medicine,

Division on Engineering and Physical Sciences,

Board on Mathematical Sciences and Their Applica-

tions, Committee on Applied and Theoretical Statis-

tics, 2016).

We typically perform data analysis with the statistical language R (https://www.r-project.org/) and rely on either markdown or LaTeX for markup. The desired format of our deliverables dictates the markup language selection. Weaving R code with a markup language is well described (Gandrud, 2015).

Our team manages collaborative projects using a distributed version control system, Git (https://git-scm.com/). Git is free to use and is supported on all major operating systems. Distributed version control systems are becoming more common than centralized systems, although some distributed version control projects, including many of ours, have a centralized design (De Alwis and Sillito, 2009).

In the simplest centralized design, a Git server

hosts the repository and each team member would

push to, and pull from, that server. It is possible to

have the individual team members’ repositories di-

rectly linked, but we did not use this option because

of network security concerns. Another option is to

have a bare repository on a network drive act as the

central code repository. We use that design for a mi-

nority of projects with unusually sensitive data. For

most projects, our team takes advantage of the inte-

grated issue tracker, web editing interface, and addi-

tional read/write permissions provided by a Git server.

Several public Git repository sites exist. We chose to use Atlassian’s Bitbucket (https://bitbucket.org) to host our repositories. At the time this choice was made, Bitbucket allowed academic account holders unlimited private repositories and unlimited collaborators. Recently, GitHub.com has offered similar packages.

Code repositories solved the problems of dynamic

document authoring and collaboration, but we also

needed to track data set versions and limit data ac-

cess to approved team members without preventing

collaboration.

The solution was to use Git submodules. “Sub-

modules allow you to keep a Git repository as a subdi-

rectory of another Git repository. This lets you clone

another repository into your project and keep your

commits separate.” (Chacon and Straub, 2014). Also,

while the data ﬁles within the submodule exist in a

subdirectory and are visible in the working directory,

only the SHA-1 of the commit of the submodule is

stored in the primary project repository. Thus, when

the manuscript repository is pushed to bitbucket.org,

the only reference to the data is a 40-digit hexadec-

imal number. The data never leaves the team mem-

bers’ machines, but the status of the data is shared

and documented between team members.
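How a submodule exposes only a commit hash can be illustrated with a minimal shell sketch. It is an illustration under stated assumptions rather than a record of our project setup: the directory names, file contents, and commit identity are hypothetical, and a throwaway local repository stands in for a data repository on a secure drive. It assumes Git is installed and uses only standard commands (git init, git submodule add, git ls-tree).

```shell
#!/bin/sh
set -eu

# Hypothetical scratch location; nothing here touches real data.
work=$(mktemp -d)
cd "$work"

# Identity used only for the throwaway commits below.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

# 1. A stand-in for the data repository (in practice, on a secure drive).
git init -q data-source
printf 'id,value\n1,42\n' > data-source/records.csv
git -C data-source add records.csv
git -C data-source commit -q -m "Add example data"

# 2. The manuscript (code) repository, with the data added as a submodule.
git init -q manuscript
cd manuscript
# Recent Git versions refuse local-path submodules unless the file
# transport is explicitly allowed for this command.
git -c protocol.file.allow=always submodule --quiet add "$work/data-source" data
git commit -q -m "Track data as a submodule"

# 3. What the manuscript repository stores for data/: a single entry of
#    the form "160000 commit <40-character SHA-1>  data".
entry=$(git ls-tree HEAD data)
echo "$entry"
```

Mode 160000 marks the entry as a reference to a commit in another repository, so the data files themselves are never added to the manuscript repository's history. Git also records the submodule's path and URL in a .gitmodules file, so the location of the data repository, though not its contents, is visible to anyone who can read the code repository.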

3 INFRASTRUCTURE

Below we describe how we have used existing in-

frastructure, open source software, and free hosting

services to generate reproducible reports while pro-

tecting sensitive data. We designed the workﬂow

so that sensitive data is stored on a secure network

hard drive or whole-disk encrypted personal machine.

Data transfer between the network drive and a team

member’s machine only occurs on the institution’s


network. The following subsections describe the nec-

essary hardware, repository design, and workﬂow for

collaboration.

3.1 Hardware

Our institution maintains a Microsoft Windows net-

work. We chose to work on Windows machines be-

cause they are available to all of our team members

and because they support whole-disk encryption soft-

ware that meets our institution’s requirements for data

security. Each team member with access to the data

has a whole-disk encrypted laptop or desktop. This

software costs approximately 100 US Dollars per ma-

chine, but allows each team member to have local

copies of the data repositories relevant to their work.

Like investigators at many academic institutions, we

have access to secure network drives behind the uni-

versity’s ﬁrewall. We rely on the network drives for

data repository hosting and backup.

3.2 Repository Design

Although Git is a distributed version control platform, we conceptually have central data and code repositories on a network drive or Git server (see Figure 1). Each collaborator has a local clone of the necessary data and code repositories on their machine; these clones serve as distributed backups of the central data and code.

3.2.1 Data Repositories

Data are housed in .csv format within our local data

repositories. For collaboration, team members with

data access privileges push to and pull from bare Git

repositories on our institution’s secure network drives.

Bare repositories do not contain a working directory:

individual ﬁles are not visible when inspecting the

contents of the directory and subdirectories. As such,

inadvertently editing or over-writing the data ﬁles is

very unlikely. We rely on the read/write access limits

enforced by the institution’s network to limit access to

these bare repositories and entrust those with access

to not manually edit ﬁles. The repositories theoret-

ically could become corrupted. If that occurred, we

would compare the distributed copies of the repos-

itories between team members and re-initialize the

repository from the most current local copy. This is an

advantage of the distributed version control paradigm:

every local copy is a backup for all others.
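The arrangement described above can be sketched in shell as follows. This is a minimal illustration, not our production configuration: two temporary directories stand in for the secure network drive and an encrypted laptop, and the repository name study-data and its contents are hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-ins: $drive for the secure network drive,
# $laptop for a whole-disk encrypted machine.
drive=$(mktemp -d)
laptop=$(mktemp -d)

# Identity used only for the throwaway commit below.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

# Central bare repository: Git's internal files only, no checked-out data.
git init -q --bare "$drive/study-data.git"

# Local working repository that pushes to the bare repository.
git init -q "$laptop/study-data"
cd "$laptop/study-data"
git remote add origin "$drive/study-data.git"
printf 'id,value\n1,42\n' > records.csv
git add records.csv
git commit -q -m "Add example data"
git push -q origin HEAD

# Listing the bare repository shows Git internals, not records.csv.
ls "$drive/study-data.git"

# If the central copy were damaged, it could be rebuilt from a local clone.
git clone -q --bare "$laptop/study-data" "$drive/study-data-restored.git"
```

In this sketch the restored repository is written next to the original so that the two can be compared. In practice, which local copy is the most current would need to be agreed among team members before the central repository is replaced.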

3.2.2 Code Repositories

In the simplest form of this workﬂow, a work product

such as a manuscript has its own code repository.

A basic repository design is shown in Figure 2. An example code repository can be found at https://bitbucket.org/pedstbi/example_collaborative_report, which has a data submodule available at https://bitbucket.org/pedstbi/example_collaborative_data_source. The analysis and manuscript authoring

code is free of sensitive data. Therefore, the remote

code repository can be maintained on a publicly

available code development system such as GitHub

or Bitbucket. We use private code repositories to

maintain academic conﬁdentiality prior to manuscript

publication. Repositories on either GitHub or Bit-

bucket can be made public at any time, such as when

a manuscript is submitted for publication.

A team member working on a project (manuscript,

in this case) would have a local clone of the code

repository on their machine. Their daily workﬂow

would be to fetch and merge (pulling is shorthand for

the fetch then merge process) any changes on the re-

mote repository made by other team members, make

changes to ﬁles using the text editor of their choice,

stage and commit the changes using their local Git in-

stance, then push those changes to the remote reposi-

tory. Team members (clinical authors or collaborators

at other institutions, for example) whose contributions

are focused on writing the manuscript or who do not

have a whole-disk encrypted machine might have a

local copy of the code repository but not the data

repository. Those team members can have the bene-

ﬁts of a version-controlled project without cloning the

data submodule. One challenge introduced by this ap-

proach is that collaborators without local data reposi-

tories cannot compile manuscripts. Because the quan-

titative results in the manuscript are generated by em-

bedded analytic code within the manuscript ﬁle, those

results cannot be updated without a local data copy.

Periodically, team members with both data and code

access must compile the manuscript (which runs the

embedded code) and commit the ﬁnished product to

the central code repository for reference by collabo-

rators who primarily write and edit manuscript text.
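The daily cycle, and the position of a collaborator without data access, can be sketched in shell. The sketch rests on assumptions: a local bare repository stands in for the hosted code repository, and all paths, file contents, and commit messages are hypothetical.

```shell
#!/bin/sh
set -eu

root=$(mktemp -d)   # hypothetical scratch area
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

# Stand-in for the data repository.
git init -q "$root/data-source"
printf 'id,value\n1,42\n' > "$root/data-source/records.csv"
git -C "$root/data-source" add records.csv
git -C "$root/data-source" commit -q -m "Add example data"

# An analyst with data access creates the code repository and publishes it.
git init -q "$root/analyst"
cd "$root/analyst"
git -c protocol.file.allow=always submodule --quiet add "$root/data-source" data
printf 'Draft text\n' > manuscript.Rmd
git add manuscript.Rmd
git commit -q -m "Manuscript with data submodule"
git init -q --bare "$root/hosted.git"   # stands in for the hosting service
git remote add origin "$root/hosted.git"
git push -q origin HEAD

# A writing-focused collaborator clones the code only.
git clone -q "$root/hosted.git" "$root/writer"
cd "$root/writer"
status=$(git submodule status)
echo "$status"   # a leading '-' marks a submodule that is not initialized

# Daily cycle: pull, edit, stage, commit, push.
git pull -q
printf 'Revised text\n' >> manuscript.Rmd
git add manuscript.Rmd
git commit -q -m "Edit introduction"
git push -q origin HEAD
```

The collaborator's data/ directory stays empty throughout, yet their text edits reach the shared repository. A team member with data access would additionally run git submodule update --init inside their clone, on a machine that can reach the data repository, to populate data/ at the recorded commit.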

3.2.3 Limitations

The size of data submodules is the most important

limitation of this repository design. Thus far, the

largest data submodule in our system is approxi-

mately 10GB. Segmentation of the data repositories

into, for example, a large raw data repository and a

smaller analysis data repository for one project can

improve efﬁciency.

Additional features of Git such as branching strategies, forking, pull requests, rebasing, and others, provide additional levels of structure within the collaboration. However, such tools can be overwhelming for a novice Git user. Increased training time, or limited participation, must be weighed against the benefit of Git feature use.

HEALTHINF 2017 - 10th International Conference on Health Informatics

[Figure 1 diagram. Labels: Shared Network Drive; Developer; Hosting Service; Collaborator 1; Collaborator 2; data sets DS1 and DS2; manuscripts M2 and M3. Legend: Repository (DS: Data Set; M: Manuscript); Direct remote access; Access only through submodules.]

Figure 1: Collaboration Structure. Data is version-controlled in bare repositories on our institutional shared network drives. The datasets are tracked within projects as submodules. Each developer has access to the data on his or her whole-disk encrypted laptop or desktop. Non-sensitive code, manuscript text, references, etc. are hosted on bitbucket.org. Other authors are able to contribute by having access to the bitbucket.org code repositories. Note that the manuscript repositories, M1, M2, and M3, only have access to the data sets via Git submodules. The copies of M1, M2, and M3 on the hosting service and on each collaborator machine have no access to the data sets. The hosting service and collaborator only see a 40-digit hexadecimal SHA-1 to reference the version of the data repository.

This workﬂow does include copies of sensitive

datasets on the whole-disk encrypted local machines

of selected team members. Our experience has been

that data owners and institutional review boards are

supportive of this approach. If a particular dataset was

not permitted to be housed on a local machine with

whole-disk encryption, then a computational server

within the institution’s ﬁrewall would likely be nec-

essary.

Clinical members of our research team without

computational backgrounds have been able to adopt

most or all of this workﬂow with a modest time in-

vestment. However, like all complex tools, regular

use is needed to maintain comfort. A more integrated

environment that was friendly to the naïve user would

increase the accessibility of a reproducible reporting

workﬂow.

3.2.4 Extensions/Other Options

Our team initially hosted code repositories on GitHub

and moved to Bitbucket as the team grew and the

number of projects increased. GitLab.com is another

option that offers unlimited private repositories, un-

limited collaborators, and up to 10GB disk space per

repository (compared to Bitbucket’s 1GB soft and

2GB hard limits). Placing a dedicated Git server behind our institutional firewall would provide a solution for data management and access control, as well as useful collaboration tools. Hardware and administrative

support costs would need to be considered.

4 COSTS

This reproducible reporting workflow is powerful and also cost-effective. For many investigators, a Windows operating system and Microsoft Office software are supported by the institution. Whole-disk encryption software is inexpensive (100 US Dollars per team member). Other software needed to implement this workflow is free to use under the GNU General Public License (GPL; http://www.gnu.org/licenses/gpl-3.0.en.html) or a similar license. There are no hardware costs if investigators currently have individual computers capable of performing the planned analyses and access to a secure network drive. Many academic investigators already have this hardware in place.

The time and effort needed to learn the necessary tools to adopt this workflow are likely higher than the software and hardware costs. However, the return on investment can be high. Our experience in an academic research environment suggests that a team adopting this workflow might see research production slow for up to six months, recover to initial levels within a year, and show potential increases after one year. Improvements in quality and reproducibility are difficult to quantify but are valuable.

. <user-path>/project1/
|-- .git/                                # the Git repository
|-- analysis-scripts/                    # data analysis scripts
|   |-- data-import.R
|   |-- primary-analysis.R
|   |-- secondary-analysis.R
|   `-- figures.R
|-- data/                                # A Git submodule
|-- products_donotedit/                  # generated files
|   |-- cache/
|   |   |-- documentation-data-import-cache/
|   |   |-- documentation-analysis-cache/
|   |   `-- manuscript-cache/
|   |-- figures/
|   |-- tables/
|   |-- coverletter.docx
|   |-- coverletter.md
|   |-- documentation-data-import.html
|   |-- documentation-analysis.html
|   |-- manuscript.docx
|   |-- manuscript.md
|   `-- poster.pdf
|-- coverletter.Rmd                      ## Files for authoring
|-- documentation-data-import.Rmd        ## coverletters,
|-- documentation-analysis.Rmd           ## documentation,
|-- manuscript.Rmd                       ## manuscripts, posters,
|-- poster.Rnw                           ## etc.
`-- README.md                            # project README

Figure 2: A generic repository layout for a manuscript written in Rmarkdown. Not shown in the graphic, but part of our overall design, are build scripts. A build script is an R script, .cmd or .sh file, or makefile. The format and location of the build script is project-specific. We decide which format to use based on the complexity of the build required, the development platforms (Windows, Mac, or Linux), the integrated development environments (RStudio or vim are used by our team), and ease of use.
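A build script of the .sh kind mentioned in the Figure 2 caption might look like the following minimal sketch. It is an assumption-laden illustration, not one of our actual build scripts: the function names data_ready and build are hypothetical, the file and directory names are taken from Figure 2, and the build step assumes that R and the rmarkdown package are installed.

```shell
#!/bin/sh
# Hypothetical build script, intended to be run from the project root.
set -eu

# Succeeds only if the given directory exists and is not empty, that is,
# if the data submodule has been initialized on this machine.
data_ready() {
    [ -d "$1" ] && [ -n "$(ls -A "$1")" ]
}

# Compile the manuscript, refusing to run without a local copy of the data.
build() {
    if ! data_ready data; then
        echo "data/ is empty: run 'git submodule update --init' first" >&2
        return 1
    fi
    # Assumes the rmarkdown package; writes output to products_donotedit/.
    Rscript -e 'rmarkdown::render("manuscript.Rmd", output_dir = "products_donotedit")'
}

# Build only when called with the argument "build", e.g. sh build.sh build.
if [ "${1:-}" = "build" ]; then
    build
fi
```

Checking for the data before compiling gives a collaborator without data access a clear message instead of an R error part-way through the document. A makefile or an R script could serve the same purpose.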

5 ALTERNATIVE APPROACHES

Another solution to the simultaneous problems of multiple collaborators and sensitive data might be to run a local instance of Galaxy (https://galaxyproject.org/). However, most Galaxy tools use Python (https://www.python.org/). Few clinical researchers have the training and experience to collaboratively develop analysis code in Python. Many more have been trained to use R. A capable computational server would solve the problems of multiple collaborators and data security. However, the purchase (5,000 US Dollars and up) and maintenance (varying, but potentially exceeding 1,000 US Dollars per year) costs for such a server are beyond the reach of most small research teams. Because many biomedical manuscripts

are generated by small teams, we think it likely that

the workﬂow we present here will be generalizable.

Existing cloud-based solutions such as RunMyCode.org (http://www.runmycode.org/) and the Open Science Framework (https://osf.io/) are reproducible and support multiple collaborators, but are

not designed to protect sensitive data. Cloud-based

computational server services, some of which now

have robust data security features, are another op-

tion. Their utility will grow once institutional review

boards and data owners (health care organizations, in-

surance companies, etc.) gain enough conﬁdence in

the data security measures used by those services that

researchers are consistently permitted to analyze sen-

sitive datasets in those environments.

We did not extensively test our Git-based solution

against other possible solutions. This was primarily

for two reasons. First, most available alternative ap-

proaches did not provide sufﬁcient data security. Sec-

ond, alternative approaches with sufﬁcient data secu-

rity required additional ﬁnancial commitment beyond

standard operating expenditures. We developed this

workﬂow as part of an active academic research team

and needed to maintain productivity in our content

areas. The lack of formal method comparison is a

limitation of this manuscript at this time. However,

our team’s ability to rapidly adopt this workﬂow and

maintain productivity highlights the value and ease of

use of this approach.

6 DISCUSSION

Collaborative and reproducible biomedical reporting

can be inexpensive and have low barriers to entry

even when working with sensitive data and a team

with variable technical skills. Our goal is to intro-

duce an overall workﬂow and one set of viable tools.

Many data processing/analysis languages, markup

languages, text editors, ﬁle formats, and ﬁle sharing

systems can be used.

Peng (Peng et al., 2006; Peng, 2011) has sug-

gested criteria for the reproducibility of epidemio-

logic and clinical computational research. The work-

ﬂow we present here would meet the criteria for Meth-

ods (free, open-source software, public code reposi-

tories), Documentation (well-commented code in the

repository), and Distribution (code repositories on


public Git servers). However, due to the limitations

regarding disclosure of data, our workﬂow would not

meet Peng’s Data Availability criterion. Summary

statistics (Peng et al., 2006) could in some situations

be posted publicly, but overall the balance between

reproducibility and data privacy will need additional

public discussion (National Academies of Sciences,

Engineering, and Medicine, Division on Engineering

and Physical Sciences, Board on Mathematical Sci-

ences and Their Applications, Committee on Applied

and Theoretical Statistics, 2016).

Rossini and Leisch described how “information

and knowledge was divided asymmetrically between

[collaborators]. . . ” (Rossini and Leisch, 2003) and

Donoho reported that one of the beneﬁts of a repro-

ducible computational workﬂow was improved team-

work (Donoho, 2010). Our experience would support

both of those ideas, as team members with variable

clinical, statistical, and technical backgrounds have

all contributed to the development of this workﬂow

and to the quality of the workﬂow’s research prod-

ucts.

In conclusion, reproducible reporting is a key

component of the reproducible research paradigm.

This manuscript presents an inexpensive, practical,

and easily adopted workﬂow for collaborative repro-

ducible biomedical reporting when working with sen-

sitive data.

ACKNOWLEDGEMENTS

We thank Dr. Michael Kahn for his comments and

suggestions on our manuscript. Dr. Bennett is sup-

ported by Eunice Kennedy Shriver National Insti-

tute for Child Health and Human Development Grant

K23HD074620.

REFERENCES

Chacon, S. and Straub, B. (2014). Pro git. Apress. Online

at https://git-scm.com/book/en/v2.

De Alwis, B. and Sillito, J. (2009). Why are software

projects moving from centralized to decentralized ver-

sion control systems? In Cooperative and Human

Aspects on Software Engineering, 2009. CHASE’09.

ICSE Workshop on, pages 36–39. IEEE.

Donoho, D. L. (2010). An invitation to reproducible com-

putational research. Biostatistics, 11(3):385–388.

Gandrud, C. (2015). Reproducible Research with R and

RStudio. Chapman & Hall/CRC Press, second edition.

Knuth, D. E. (1984). Literate programming. The Computer

Journal, 27(2):97–111.

Leisch, F. (2002). Sweave: Dynamic generation of statistical reports using literate data analysis. In Härdle, W. and Rönz, B., editors, Compstat 2002 - Proceedings in Computational Statistics, pages 575–580. Physica Verlag, Heidelberg. ISBN 3-7908-1517-9.

National Academies of Sciences, Engineering, and

Medicine, Division on Engineering and Physical Sci-

ences, Board on Mathematical Sciences and Their

Applications, Committee on Applied and Theoretical

Statistics (2016). Statistical Challenges in Assessing

and Fostering the Reproducibility of Scientiﬁc Results:

Summary of a Workshop. National Academies Press.

Peng, R. D. (2011). Reproducible research in computational

science. Science, 334(6060):1226–1227.

Peng, R. D., Dominici, F., and Zeger, S. L. (2006). Re-

producible epidemiologic research. Am J Epidemiol,

163(9):783–9.

Rossini, A. and Leisch, F. (2003). Literate statistical prac-

tice. Biostatistics Working Paper Series. Working Pa-

per 194. accessed May 17th, 2016.

Xie, Y. (2015). Dynamic Documents with R and knitr, Sec-

ond Edition. Chapman & Hall/CRC The R Series.

CRC Press.
