Archiving Programs for the Future
Alexander Binun, Shlomi Dolev and Yin Li
Department of Computer Science, Ben-Gurion University of the Negev, Beer Sheva, Israel
Keywords: Data Archiving, Long-term Bit Preservation, OISC, Subleq.
Abstract: This paper presents a novel approach for long-term software archiving which is based on preserving
programs as bit blocks. A simple machine that is able to execute a single command is used to interpret these
bit blocks. We suggest to compile the existing programs into the bit representation of the One-Instruction-
Set computer (OISC) command “SUBtract and Branch if Less than or EQual to zero”, shortly Subleq. This
is in order to keep the resulting bit stream using error correcting code in a reliable storage unit. At any
moment, this bit stream can be executed by a simple interpreter that possesses the functionality of a basic
Random Access Machine. Furthermore, a compiler prototype based on an existing compiler and interpreter
is also proposed to convert a program written by a procedural language (e.g., C) into the Subleq assembly
language, and then translates it into a binary executable format. Error correcting is achieved by
supplementing bit streams with Hamming codes. Our scheme nullifies the need to preserve legacy hardware
in order to support/operate preserved software systems thus serving as a program “time capsule” for the
future.
1 INTRODUCTION
Information preservation is one of the most crucial
challenges facing scholarly communities today. We
have large amounts of digital data, such as e-
journals, e-books, emails and blogs that will be
preserved for generations to come. Nowadays, there
are certain programs (National Digital Information
Infrastructure, 2015) and organizations (Research
Information Info, 2015) that are dealing with digital
asset preservation, before technological
obsolescence, or data loss creep in.
Attempts to preserve information for future
generations have been known since the ancient
times, for example ancient Egyptian or Cretan
hieroglyphs and Babylonian cuneiform scripts.
Organized efforts to conserve information for future
generations are observed during the recent centuries.
These efforts manifest in placing and tracking time
capsules (Time Capsule, 2015), i.e. containers of
such information. However, languages do not
become dead and time capsules do not disintegrate
during a short period. Modern works on
computerized long-term data preservation such as
(Campisi et al., 2009) focus mainly on
supplementing data with auxiliary information
needed for proper interpretation (metadata) and
synchronizing metadata.
In this paper, our work is devoted to finding a
uniform solution for the preservation of code, in an
architectural and compiler independent form, in the
extreme, to ensure preservation of codes for a long
time – for decades, centuries, or even for thousands
of years. To put it another way, we suggest to store
program code in a time capsule that should be
opened in the future for technology preservation and
can still be executed in the future. For example,
imagine that we need to run an important program
written in PL/I with DOS on PDP-11, with
procedure written in Pascal for MAC. Obviously,
backward compatibility issues become impassable
barriers for executing legacy programs as time goes
by. Thus, a new paradigm for software repository is
extremely important.
Our approach focuses on different aspects:
We suggest using a minimalistic and simple
command, namely, OISC machine commands
that can be easily interpreted by any future
machine, or even manually.
In the longer range future (thousands of years),
people may cease to understand the rationale
behind code, its effects and even the language
used to express the code ideas. We emphasize
the need to develop a universal writing system in
Binun, A., Dolev, S. and Li, Y..
Archiving Programs for the Future.
In Proceedings of the 6th International Workshop on Software Knowledge (SKY 2015), pages 53-57
ISBN: 978-989-758-162-5
Copyright
c
2015 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
53
order to keep our code today understandable for
future generations.
We assume that the design concepts behind the code
will become less relevant in a certain amount of
years. Our assumption is based on the fast pace of
software evolution. In the 1970s efficiency was
amongst the major requirements for C programs
which were often enriched by assembly fragments.
Nowadays, thanks to modern and powerful
computers, (that can endure huge libraries)
readability and maintainability have become
“affordable” and actual issues. The design rationale
behind modern OO libraries (such as Java Beans)
may become less relevant in the upcoming 30-40
years but sometimes we might be willing to run
programs based on these libraries. We need to be
able to understand the execution effects rather than
the design details of old programs.
We therefore focus on bit preservation, i.e., the
ability to restore the bits of a program stored for a
long period of time and obtain their effect, namely,
run the bytecode of the restored program. The bit-
level interpreter should thus embody some
fundamental computing ideas based on bit
manipulation (e.g., Random Access Machine,
Goldstine et al., 1947)). This concept is powerful,
simple to simulate, efficient (with relation to Turing
machine) and has not be changed (and will probably
not change) since the beginning of the computer
science era.
However, there are several challenges for long-
term code preservation as in (Factor et al., 2009).
These are storage media degradation, hardware
obsolescence and hardware failures. These
drawbacks are aggravated by targeted attacks against
media systems staged by malicious parties. As a
result, corrupted data may be fetched from the
storage. To address this challenge, we use the
Subleq machine as in (Mazonka et al., 2011) as an
appropriate format for keeping and running code.
Redundancy is also employed when storing the data,
augmenting data with error correcting information.
The use of error correcting codes enables integrity
checks and correction during data access. One
possible error correcting code is Hamming code that
can be utilized for the preservation of every bit of a
command.
RAM based Subleq is proven to be Turing-
equivalent and therefore capable of expressing code
for any feasible computation. In our scheme any
code is converted into the Subleq assembly format
and then into a bit stream. Such bit streams enriched
with the Hamming error-correcting codes (Arazi,
1988) are placed into media storage units (e.g., CD-
ROMS, or even into a plate with Braille style marks
on a surface).
The second challenge is to ensure that our
descendants in future will want to learn more about
our lives and will be able to do so. Even if we focus
solely on execution effects, what is the input/output
format? Textual input/output is usually expressed in
some modern language (today mostly in English).
Images and sound might be based on modern
cultural references (like modern computers, cars).
With that being said, modern languages and other
cultural references may be forgotten in the future,
sharing the destiny of undeciphered ancient scripts.
A possible solution would be to propose a set of
visual or tactile signs allowing any language group
to express their own utterances; that is, using a
universal writing system (Universal Writing System,
2015). Possible inputs/outputs can be translated into
this system and deciphered in the future. There is a
clear need to bring together computer scientists,
linguists, psychologists etc. into the joint
interdisciplinary project aimed at developing a
universal writing system.
Paper Organization. Section 2 outlines the
principles behind preserving code excerpts as bit
streams. It includes compiling any procedural
language into the Subleq assembly language and
further into bytecode as well as integrating error-
correcting capabilities. Section 3 outlines the main
idea behind the usage of a universal writing system
in our context and provides an example of a simple
C program converted into the bit stream format.
Finally, some conclusions are drawn in Section 4.
2 PRESERVATION OF CODE
EXCERPTS
2.1 Keeping Code as Bit Stream
A Subleq program represents an infinite array of
memory cells, where each cell holds an integer
number. This number can be an address of another
memory cell. The Subleq interpreter considers this
array as a sequence of instructions; each instruction
has 3 operands: A,B,C. An operand may be either
an integer number or a symbolic label describing a
memory address. Execution of a Subleq instruction
subtracts the value in the memory cell at the address
stored in A from the content of a memory cell at the
address stored in B and then writes the result back
into the cell with the address in B. If the value after
subtraction in B is less than or equal to zero, the
SKY 2015 - 6th International Workshop on Software Knowledge
54
execution jumps to the address specified in C.
Otherwise, the execution continues to the next
instruction, which is the address of the memory cell
next to the current cell.
We use the compiler that accepts a simplified
version of C language (mentioned as Higher Subleq
in (Mazonka et al., 2011)) that is referenced
throughout the article as the compiler. Higher
Subleq does not have a preprocessor, does not
support structures, bit fields, abstract declarations
and bit operations. It has only one underlying type
int. Depending on the configuration the compiler
can produce:
A Subleq assembly language module.
An executable Subleq module. It is assumed that
the address of the first instruction in a Subleq
assembly program is zero so every Subleq
instruction is converted into a triple of integer
numbers.
We configure the compiler to transform a program
written in a high-level language into an executable
Subleq module. Then we convert the integer
numbers of this module into the binary format. Thus,
eventually a program written in a high-level
language is converted and stored as an array of bits.
There are a number of high-level languages that
possess similar syntax and semantics for basic
operations. Based on the similarities and the
modularity of the compiler, it can seamlessly adapt
to many procedural languages.
2.2 Using Error-correcting Codes
Long-term preservation of bit streams may be
challenged by storage media degradation, hardware
obsolescence and hardware failures (Factor et al.,
2009). Media systems may be also attacked by
malicious users. As a result, the stored bits that need
to be interpreted may be corrupted.
To preserve the integrity of bit streams we stored
those streams using Hamming error-correcting codes
as in (Arazi, 1988). We omit the details of Hamming
codes as it is a classical basic error-correcting code;
in fact we can use any other code. The storage
system recognizes code excerpts by their IDs.
Periodic storage scans of stored bit streams can
exploit its redundancy, preserving the data integrity.
When a user wants to retrieve a code excerpt, the
corresponding excerpt ID is sent to the storage
system. The required excerpt with error correction
codes is fetched. When a bit stream arrives at the
interpreter site, Hamming codes are stripped off,
necessary correction is performed and the “clean” bit
stream is executed.
3 A UNIVERSAL WRITING
SYSTEM
All natural languages carry a lot of irregularities in
grammar which make them more difficult to learn.
They are also associated with the national and
cultural dominance of the nation that speaks it as its
mother tongue. Therefore creating an artificial or
constructing language lies at the core of an approach
of universal language where people from different
nations can communicate, just like numbers and
calculations became universal.
The notion of “phoneme“, that is a single “unit”
of sound that has meaning in any language is at the
heart of the ongoing efforts to devise a universal
writing system. This is because pronouncing of
individual letters may differ in various languages
and depend on their context; English is known for
many irregularities and exceptional cases.
It should be noted that writers of ancient
manuscripts did not invest decent efforts in making
their writing understandable for future generations.
As a result, some dead languages (e.g., Cretan
hieroglyphs) still remain undeciphered. The ancient
Egyptian hieroglyphs were not deciphered until
several stones with bilingual writings in two
languages (Egyptian and ancient Greek) were found
(the so-called Rosetta Stones). Ancient Greek was
known to archaeologists and served as a universal
writing system when deciphering Egyptian.
As a first attempt to instruct our descendants how
to use our storage system, we suggest to use the
pictogram approach, to draw several pictures
describing an operator executing a program. We thus
will be following ancient Egyptians whose
hieroglyphs carried intuitive sense (e.g., the
hieroglyph “king” resembles a king, one can even
recognize the crown, see Figure 1).
Figure 1: The picture of crown represents ‘the King”.
3.1 A Simple Program Written as a Bit
Stream
In this subsection, we give a very simple example to
show the whole procedure of our scheme. Consider
the following simple program written in Higher
Subleq:
Archiving Programs for the Future
55
int main(void){
int x=0;
x=x-2;
}
The Subleq assembly code for this fragment is.
main:
c1 x ?
x=x-2;
. x:0 c1:2
The semantics behind the Subleq code is: The
row starting from “.” denotes the data section
where variables are initialized. Thus x gets the
value 0, the constant 2 is placed into the variable
c1”.
The instruction “c1 x ?“ performs subtraction
of the value stored at the address denoted by
c1” from the value stored at the address
denoted by “x”. Afterwards the execution
proceeds to the next instruction (whose address
is denoted by “?”).
The bytecode (and the bit stream resulting from it)
are obtained by setting the address of the first
instruction to 0. We assume that 2-byte integer
numbers hold actual addresses.
The actual address of “x” is resolved to 402,
the actual address of “c1” is 404, the instruction
following “c1 x ?” has the address 350. The
instruction bytecode has the form “404
402 350”. The corresponding bit stream is:
0000000110010100
0000000110010010
0000000101011110
If we use the Hamming (7,4)-code principle to add
redundant bit for error correcting, the above bit
stream is extended as follows:
0000000 1101001 0011001
0101010
0000000 1101001 0011001
1001100
0000000 1101001 0100101 0010110
.
In above codes, every 4 bits are extended into 7 bits,
the bits underlined are additional parity bits that we
added.
3.2 Visual Puzzle Styled Manual Guide
As mentioned previously, we use a pictogram to
explain our operations and ideas. More explicitly, a
“user manual” is needed to explain the process and
should be stored as a part of the record.
Here, we can follow the rule of visual puzzles to
write such a manual. For example, the following
figure gives a manual for Subleq
Figure 2: The visual manual guide for Subleq.
In the above figure, we use different types of
arrows to show one Subleq command step by step,
and describe the way it is manipulated. For the sake
of simplicity, Figure 2 does not include a visual
puzzle for explaining the subtractions operation and
comparison. One can easily add related con- tents to
make these operations comprehensible. The manual
for Hamming error correcting code could be made in
a similar fashion.
In the future, when people want to recover the
program, they first check the integrity of the bit
stream using these parity bits. Then, they delete
these parity bits and recover the original bits
stream. Intelligent people should be able to follow
the visual puzzle manual and recover the whole
program, with the input and output.
4 CONCLUSIONS
This paper proposed a new prototype for program
archiving for the future. The OISC command Subleq
is used to simplify the classic complex program
codes and Hamming error-correcting code to protect
the program itself. Meanwhile, visual puzzles are
utilized to design instruction manual. All these tricks
form an alternative universal writing system. This
scheme allows us to archive various programs
without preserving different hardware and software
platforms, allowing for future execution by future
generations of computers. At last our scheme can be
used as an Arecibo message as in (Arecibo Message,
2015).
SKY 2015 - 6th International Workshop on Software Knowledge
56
ACKNOWLEDGEMENTS
We thank Oleg Mazonka for providing us with the
Subleq compiler prototype and for giving thorough
explanations regarding its usage. We thank the
support of the Rita Altura Trust Chair in Computer
Science. We also indebted to the Lynne and
William Frankel center for its generous support.
REFERENCES
B. Arazi, 1998. A Commonsense Approach to the Theory
of Error- Correcting Codes. Computer System Series,
The MIT Press, ISBN-10:0262010984.
P. Campisi, E. Maiorana, E. Ducci Teri, A. Neri, 2009.
Challenges to long-term data preservation: a glimpse of
the Italian experience. In DSP’09: Proceedings of the
16th international conference on Digital Signal
Processing. Pages 120-127, IEEE Press, Piscataway,
NJ.
M. Factor, E. Henis, D. Naor, S. Rabinovici-Cohen, P.
Reshef, S. Ronen, G. Michetti and M. Guercio, 2009.
Authenticity and Provenance in Long Term Digital
Preservation: Modeling and Implementation in
Preservation Aware Storage. In TAPP09: First
Workshop on on Theory and Practice of Provenance,
Pages 6:1-6:10, San Francisco, USA.
H. H. Goldstine and J. von Neumann,1947. Planning and
Coding of the Problems for an Electronic Computing
Instrument. Institute for Advanced Study, Princeton.
McGraw-Hill Book Company, New York.
O. Mazonka and A. Kolodin,2011. A Simple Multi-
Processor Computer Based on Subleq. In CoRR,
Volume 1106.2593.
Time Capsule, 2015. http://en.wikipedia.org/
wiki/Time_capsule
A Universal Writing System, 2015.
http://www.omniglot.com/pdfs/phonbook.pdf
National Digital Information Infrastructure, 2015.
http://en.wikipedia.org/wiki/National_Digital_Informa
tion_Infrastructure
Research Information Info, 2015. http://
www.researchinformation.info/features/feature.php?
feature_id=506
Arecibo Message, 2015. http://en.wikipedia.org/
wiki/Arecibo_message
Archiving Programs for the Future
57