TYPE-FLOW ANALYSIS FOR LEGACY COBOL CODE
Alvise Spanò, Michele Bugliesi and Agostino Cortesi
Dipartimento di Scienze Ambientali, Informatica e Statistica, Università Ca’ Foscari Venezia, Venice, Italy
Keywords:
Static analysis, Analyzer, Type system, Type flow, Flow Type, Type rule, Storage, Picture, Record, Cobol,
Label, Variable, Branch, Termination, Status, Convergence, Abstract interpretation, Coercion, Coerce, Environment,
Judgement, Substitution, Grammar, Island grammar, Parser, Island parsing, Lexer, Parsing, LALR,
Yacc, Lex, F#, .NET, IBM, z/OS, COBOL, COBOL85.
Abstract:
Many business applications today still rely on COBOL programs written decades ago that are difficult to
maintain and upgrade due to technological limitations and lack of experts in the language. Several companies
have been trying to migrate their software base to modern platforms, but code translation is problematic
because most of the business processes they implement are no longer documented or even known. Applying
existing Program Understanding techniques to COBOL could help the IT specialists in charge of
a port, but useful raw information must first be extracted from the source code for these techniques
to yield meaningful results. We believe that the types of the variables used in programs are an important part
of such raw information, and we present an approach based on static analysis of types rather than data. Our
system is capable of reconstructing the type-flow of a COBOL program throughout branches, jumps and loops
in finite time, and of tracking type information on reused variables occurring in the code. It also detects a number
of error-prone situations, type mismatches and misuses, and reports them by means of messages annotated in the
code along with the types inferred for each variable occurrence.
1 INTRODUCTION
Analyzing COBOL code using type inference tech-
niques has been proposed many times in the last
decade and before. From the system first described
in (van Deursen and Moonen, 1998) to its later re-
finement in (Moonen, 2003), giving informative types
to COBOL variables seems to be a good way for
automatically generating a basic tier of documenta-
tion of legacy software (van Deursen and Moonen,
2006) and is also a reliable starting point for fur-
ther Program Understanding approaches (Kuipers and
Moonen, 2000). These systems are quite sophisticated
and rely on a number of complex side models
and tools aimed at extracting properties and information
from COBOL programs at a high level of abstraction,
thus inevitably omitting several details at the lower
language and type level, e.g. how to deal exactly with
the many picture formats supported by COBOL and
with control constructs that alter the program flow.
In this paper, we propose a light-weight system for
typing COBOL with rich yet simple types that pursues
a number of goals:
1. model the COBOL picture system without ap-
proximating storage format information such as
computational fields or the amount of digits in
a numeric, in order to reconstruct the exact in-
memory representation of datatypes and perform
precise comparisons among the many formats
COBOL supports;
2. deal with what in (van Deursen and Moonen,
2001) is called pollution in such a way that no
complex relational property system among types
is needed, by tracking type alterations that vari-
ables are subject to in the following scenarios:
(a) when data is reused for different purposes in a
program: many COBOL programmers are used
to this practice in order to save memory and the
result is often poor maintainability and error-
proneness;
(b) when the language performs an implicit
datatype cast, reformatting values to fit target
variables, either at compile-time or at run-time.
3. deal with branches in the program flow that are
not statically decidable (i.e. conditional state-
ments) by embedding into the type itself multiple
types a variable may possibly assume during the
execution.
Spanò A., Bugliesi M. and Cortesi A..
TYPE-FLOW ANALYSIS FOR LEGACY COBOL CODE.
DOI: 10.5220/0003506700640075
In Proceedings of the 6th International Conference on Software and Database Technologies (ICSOFT-2011), pages 64-75
ISBN: 978-989-8425-77-5
Copyright © 2011 SCITEPRESS (Science and Technology Publications, Lda.)
We introduce a kind of storage type for variables
declared as pictures in COBOL and a special flow-
type for collecting storage types resulting from condi-
tional branches in the program.
On top of that, since we deal with a language
where GOTO and other low-level commands altering
the control-flow are frequently used in programs, our
system cannot behave like an ordinary type-checker
or type-inference algorithm: it is a type analyzer able
to follow jumps and branches in the code, detect cycles
and avoid loops by checking for convergence in
the status of the typing function, in much the
same way as many basic techniques of Abstract
Interpretation for Static Analysis do (F. Nielson, 1999).
The status here consists of a special topological environment
mapping variable occurrences to their flow-type
at that point in the program. Overall, this approach
resembles a sort of data-flow analysis where
data are actually types rather than values.
A prototype of the system described in this article
has been developed in F# for the .NET 4.0 platform
and implements a Lex & Yacc tweak for reproduc-
ing the behavior of Island Grammars (Moonen, 2001)
with the benefit of efficient LALR parsing.
It is able to parse large COBOL source programs
(up to many thousands of lines) and to type them,
generating as output the flow-types annotated at every
variable occurrence (i.e. the topological environment
mentioned above). Additionally, it produces
useful information about type usage in the form of error
messages, warnings and hints. Again as opposed
to a compiler, here errors do not imply a failure:
the system adopts a keep-going approach and is tolerant
of most recoverable error scenarios. All type
mismatches or misuses are reported and other hints
about possible error-prone situations are signaled; an
undefined variable, though, would make the system
fail. Thus, we assume that the system processes production
code that compiled successfully and does work.
1.1 Overview
Our system does not manipulate COBOL code directly:
as other remarkable systems do (van Deursen and
Moonen, 1999), we translate COBOL into a more
comfortable intermediate language (from now on referred
to as IL) resembling modern imperative languages
without altering COBOL semantics and principles.
Notably, what in COBOL parlance is referred to
as a program (i.e. a compilation unit) is translated
here into a procedure, with its own static variable
declarations. A COBOL application made up of many
units becomes a single large IL program, where the
main code shows up as the bottom unnamed block.
Before performing the type analysis, the system
must also label all variables occurring in the program
with a unique identifier, simply a fresh integer tag.
The type analyzer then explores the code, statement
by statement and recursively descending into
expressions, basically performing two operations that
affect either the type or the topological environment:
1. keeping track of the current type(s) of variables
by updating flow-types in the type environment;
2. annotating variable occurrences with their flow-
type at that point in the program, i.e. creating new
bindings in the topological environment.
Assignments and call-by-reference argument applications
are two scenarios where variables could be
subject to an implicit cast, hence the flow-type of a
variable appearing for example at the left-hand side
of an assignment must be updated. Conditional constructs,
instead, lead to branches in the code exploration,
thus the analyzer would produce two parallel
results for the two sub-blocks of an if-then-else
statement: the resulting environments must therefore
be merged to reflect that the same variables
may have different types after the if block,
and these multiple choices are collected in the flow-type
itself.
Look at the following example code directly writ-
ten in IL:
{
x := x + 1;
if x > 0 then
{
x := "foo";
}
x := x + 23;
}
where x : num[2] := 11
What we want to achieve is reconstructing the
types of the program and producing annotations for
each occurrence of variable x with its type at that
point of the code, as well as outputting error and
warning messages. To do that, the system has to
follow all branches in the control flow and keep the
type of x up to date: by the end of the conditional
block we want to show that x might have
become a string. And where there is an ambiguous
operation, we want the system to fall back to a default
decision and add a comment about it.
{
(x : num[2]) := (x : num[2]) + 1;
// [WARNING] possible truncation
// detected in assignment:
// num[3] :> num[2]
if (x : num[2]) > 0 then
{
(x : alpha[2]) := "foo";
// [ERROR] truncation detected in
// assignment:
// alpha[3] :> num[2]
}
(x : num[2]) := (x : num[3]|alpha[3]) + 23;
// [HINT] type of ’x’ is ambiguous in
// expression at right-hand of
// assignment: assuming
// initialization type num[2]
// [WARNING] possible truncation
// detected in assignment:
// num[3] :> num[2]
}
where x : num[2] := 11
In the first statement, where x is incremented by 1,
the type of the variable is annotated both in its usage
as an expression term and as the target on the left side
of an assignment. In the right-hand case its type is the
initialization type num[2] that appears in the global
declaration, which happens to be its current type at
the beginning of the program; in the left-hand case x
should be given a wider numeric type, because the
sum of a num[2] and a literal whose type is num[1]
would lead to num[3] (see footnote 1), but it gets truncated
in order to fit the initialization type, as the COBOL
run-time would do. Since the resulting storage
class is still num, its final type happens to be equivalent
to its initialization type.
The system tracks the types that variables are supposed
to have from a type-flow point of view, i.e. as if
data movements were tracked across expressions and
statements and the type of what each variable is supposed
to contain were recorded.
Encountering the if statement makes the analyzer
descend into its then block: a truncation is detected
therein, alpha[3] being surely wider than the target
type num[2], and the truncated type alpha[2]
is given to x, which fits the initialization type. Such
information must then be merged with that
collected before branching: this is why the
type of x in the expression at the right-hand side of the
assignment after the if block is not a simple type. The
flow-type has grown here due to the merge and it now
consists of all possible types x might have at that point.
That leads to an ambiguous choice when typing
the sum operation, so the system needs to fall back
to the initial type declaration. This might seem odd,
but is in fact a viable solution: in COBOL every
variable strictly adheres to its picture declaration, thus
falling back to it is not an unsafe decision when
better information cannot be reconstructed.
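The first statement of the example can be replayed with a small sketch. It assumes that the sum of two unsigned integral numerics of n1 and n2 digits is given max(n1, n2) + 1 digits, which agrees with the num[2] + num[1] case discussed above but is only a stand-in for the actual rules of table 6. The function names `sum_digits` and `assign_numeric` are hypothetical and are not part of the prototype.

```python
def sum_digits(n1: int, n2: int) -> int:
    """Digits needed by the sum of two unsigned integers (assumed rule)."""
    return max(n1, n2) + 1


def assign_numeric(target_digits: int, source_digits: int):
    """Type an assignment to a numeric variable declared with target_digits.

    Returns the type kept by the variable and the messages produced.
    The variable always keeps its declared width, as its picture dictates.
    """
    messages = []
    if source_digits > target_digits:
        messages.append(
            "[WARNING] possible truncation detected in assignment: "
            f"num[{source_digits}] :> num[{target_digits}]"
        )
    return f"num[{target_digits}]", messages


rhs = sum_digits(2, 1)                 # x : num[2] plus the literal 1 : num[1]
x_type, msgs = assign_numeric(2, rhs)  # assign the result back to x : num[2]
```

Here the right-hand side is typed num[3] and the target keeps num[2], which is the situation reported by the first warning in the annotated listing.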
1.2 Comparisons and Motivation
As already mentioned, the legacy software analysis
system thoroughly presented in (Moonen, 2003) (see footnote 2) relies
on mechanisms for producing information over types
that mainly serve Program Understanding techniques,
Concept Analysis (Kuipers and Moonen, 2000) and
other high-level elaborations. In general, its scope
is wider than ours and not entirely overlapping.
Nonetheless the two have something in common, namely
giving informative types to COBOL variables, and this
makes it worth comparing our system with what we believe
is the most advanced type-based system for COBOL analysis
available to date.
• We translate COBOL into a simpler intermediate language as (van Deursen and Moonen, 1998) does, though without leaving out important language constructs whose behavior is relevant to typing real-world programs, such as goto, perform and perform-thru jump statements, call-by-reference procedure calls and if statements.
• Our type syntax is more complete, clearer and open to more orthodox type manipulation, as it does not provide just a plain AST-ization of COBOL picture declarations (see footnote 3).
• The type inference (see footnote 4) rules given in (van Deursen and Moonen, 2001) are sometimes trivial. We define a type-system that reconstructs more detailed type information, e.g. our type rules for arithmetic operators in table 6 recalculate the resulting type format length in order to include within the type itself as much information as possible about changes in value ranges.
• We do not infer a type equivalence when two or more types are expected to be the same (as would happen in ML in a homogeneous binary application, for example). Our system rather falls back to a variable initialization type in case a type mismatch or ambiguity is detected. This trade-off makes type derivations simpler, does not necessarily imply a loss of information and reflects COBOL run-time semantics better.

Footnotes:
1. In general, a number made of 2 digits plus a number made of 1 digit could possibly lead to a number made of 3 digits, as in 99 + 9 = 108. See type rules for expressions in table 6 for details on how arithmetic operations formally affect numeric type formats.
2. That is a Ph.D. thesis collecting previous works on the same subject and anticipating some that were yet to come. In general, that system has been proposed several times in more articles with some additions; we might therefore refer to either (van Deursen and Moonen, 1998), (van Deursen and Moonen, 2001), (van Deursen and Moonen, 2000), (Kuipers and Moonen, 2000) or (Moonen, 2003).
3. The syntax of types in (van Deursen and Moonen, 1998) oddly carries along the variable identifiers and picture format strings as they are, leaving it unclear how the type environment and type comparisons formally relate to them.
4. That system uses the word inference, with a clear reference to the world of ML and functional languages, though we would prefer reconstruction, as there is actually no use of type variables and unification for resolving a set of constraints over type equations.
• The system in (Kuipers and Moonen, 2000) represents the inferred set of type relations via a Relational Algebra and resolves them applying an algorithm written in Grok (Holt, 2008): the resolution is actually a simplification process performing iterative unification. This approach is rather inefficient and does not take into account type dynamics due to control-flow jumps. Our system performs a code analysis at typing-time by following jumps (see footnote 5), thus it detects a wider range of possible type anomalies and variable reuses.
• According to (van Deursen and Moonen, 2001), pollution occurs whenever a type-equivalence involves types that are not equivalent or subtypes: we do not handle this as a special case, but it automatically comes from non-singleton choices within flow-types, which are natively supported by our type-system and do not require any further processing.
• Our subtype relation deals with the in-memory representation of a wider range of type formats and qualifiers that are very common in COBOL programs, such as all COMP fields (translated into native integer, floating point and binary-coded-decimal types), signed/unsigned numeric formats and mixed alphabetic/alphanumeric strings.
• In (van Deursen and Moonen, 2001) there is no mention of how COBOL references (see footnote 6) are handled, nor of how COBOL run-time data conversions affect type rules of commands that manipulate different picture formats and computational fields (e.g. the COMPUTE instruction). A major feature of our system is reproducing such behaviors at typing-time by giving temporary types to R-value expressions (see footnote 7) and eventually promoting them to storage types when a type coercion is invoked (see definitions 3.1 and 3.6).
Footnotes:
5. Until a convergence in the topological environment is detected (see section 2.4).
6. According to the COBOL syntax specification in (IBM, 2009), accessible memory cells are called references. We renamed them Left Values in our intermediate language for the sake of symmetry with imperative languages such as C, which define them as a sub-class of expressions that can appear at the left side of an assignment and refer to an in-memory value (Kernighan and Ritchie, 1988).
7. Symmetrically, right values are expressions that can stand on the right side of an assignment, hence evaluate to a temporary value (Stroustrup, 2000).

Let’s now apply our system to a COBOL code
fragment mentioned in (van Deursen and Moonen,
1998) and other papers of the series:
DATA DIVISION.
WORKING-STORAGE SECTION.
01 N000.
05 N100-N PIC S9(03) COMP-3.
01 TAB000.
05 TAB100-NAME-PART.
10 TAB100-POS PIC X(01) OCCURS 40.
05 TAB100-MAX PIC S9(03) COMP-3 VALUE 40.
05 TAB100-FILLED PIC S9(03) COMP-3 VALUE ZERO.
01 RAR001-RECORD.
03 RAR001-VAST.
05 RAR001-INITIALS PIC X(05).
PROCEDURE DIVISION.
R210-INITIALS.
MOVE RAR001-INITIALS TO TAB100-NAME-PART
PERFORM R300-COMPOSE-NAME
EXIT.
R300-COMPOSE-NAME.
MOVE TAB100-MAX TO N100.
MOVE ZERO TO TAB100-FILLED.
PERFORM UNTIL N100 EQUAL ZERO
IF TAB100-POS (N100) EQUAL SPACE
SUBTRACT 1 FROM N100
ELSE
MOVE N100 TO TAB100-FILLED
MOVE ZERO TO N100
END-IF
END-PERFORM.
The whole code above is translated into one single
annotated IL program showing operations among
types and warning messages (see footnote 8):
{
R210-INITIALS: // main code
{
(TAB000 : { NAME-PART : alphanum[40]; .. })
.NAME-PART :=
RAR001-RECORD.VAST.INITIALS;
// [WARNING] reverse subsumption
// detected in assignment: right-hand
// type is smaller than left-hand type
perform R300-COMPOSE-NAME;
return;
}
R300-COMPOSE-NAME:
{
N000.N := TAB000.MAX;
TAB000.FILLED := 000;
__loop0:
{
if N000.N = 000 then goto __loop0_exit;
else
{
if (TAB000 : { NAME-PART :
{ POS : alphanum[1] array[40] };..})
.NAME-PART.POS[N000.N] = ’ ’
// [WARNING] possible access to
// corrupted data: accessing TAB000
// with its initialization type
// but its type had changed to
// { NAME-PART : alphanum[40]; .. }
// [WARNING] possible error in array
// subscript: type ’num.bcd[S3]’ has
// signed format
then {
(N000 : { N : num.bcd[S3] }).N :=
N000.N - 1;
// [WARNING] possible truncation
// detected in assignment:
// num.bcd[S4] :> num.bcd[S3]
}
else
{
TAB000.FILLED := N000.N;
(N000 : { N : num.bcd[S3] }).N :=
000;
}
goto __loop0;
}
}
__loop0_exit: {}
}
}
where N000 : { N : num.bcd[S3] };
// Footnote 8: type annotations are omitted where flow-types do not differ
// from the previous variable occurrence or from its initialization type.
TAB000 : {
NAME-PART : {
POS : alphanum[1] array[40] };
MAX : num.bcd[S3] := 40;
FILLED : num.bcd[S3] := 0 };
RAR001-RECORD : {
VAST : { INITIALS : alphanum[5] } }
(Moonen, 2003) unfortunately does not contain
practical code samples of pollution or other anomalies,
thus we cannot compare how the two systems behave
in that regard. As a matter of fact, though,
our system is ultimately more concerned with accurate
typing and detecting error-proneness than with
program reasoning and collecting statistics.
2 TYPE SYSTEM
2.1 Storage Types and Flow-types
COBOL picture declarations in the Working Storage
section of the Data Division define data instances
along with their own storage format: they are not type
declarations for instantiating data elsewhere, as in most
modern languages. Our system must of course
reproduce this design, while mapping COBOL picture
format strings into types. For example, consider the
following picture declaration:
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9(3) COMP-3 OCCURS 10.
01 N PIC COMP 9(8).
01 R1.
02 R1-S PIC A(2).
02 R1-B PIC X(3)9(2)A(3).
01 R2 OCCURS 7.
02 X PIC S99V9 COMP-2.
We translate it into more orthodox type bindings
that are quite self-explanatory:
A : num.bcd[3] array[10]
N : num.int32[8]
R1 : {S : alpha[2]; B : alphanum[8]}
R2 : {X : num.float64[S2.1]} array[7]
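The mapping from a picture string to a storage class can be sketched as follows. This is a minimal illustration that only handles the symbols 9, A, X, S and V with repetition counts; it ignores usage clauses such as COMP-3, OCCURS and record nesting, and the function names `expand` and `picture_to_type` are hypothetical.

```python
import re


def expand(picture: str) -> str:
    """Expand repetition counts, e.g. 'X(3)9(2)' becomes 'XXX99'."""
    return re.sub(r"(.)\((\d+)\)", lambda m: m.group(1) * int(m.group(2)), picture)


def picture_to_type(picture: str) -> str:
    """Map a simple picture string to a storage type written as text."""
    symbols = expand(picture.upper())
    signed = symbols.startswith("S")
    body = symbols[1:] if signed else symbols
    if set(body) <= {"9", "V"}:
        # Numeric picture: V separates integral and fractional digits.
        integral, _, fractional = body.partition("V")
        fmt = ("S" if signed else "") + str(len(integral))
        if fractional:
            fmt += "." + str(len(fractional))
        return f"num[{fmt}]"
    if set(body) <= {"A"}:
        return f"alpha[{len(body)}]"
    # Any mix of X, A and 9 symbols is an alphanumeric string.
    return f"alphanum[{len(body)}]"
```

For instance, `picture_to_type("X(3)9(2)A(3)")` gives `alphanum[8]`, the type of field B of R1 above.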
Picture format strings are mapped into either numeric,
pure alphabetic or alphanumeric types according
to their structure; arrays and records are also
first-class citizens of the type language in our system
and can therefore be nested at will, yielding
types that resemble those of modern functional languages.
Moreover, numeric types carry along detailed
information on their in-memory representation
at machine level, sign and length of both integral
and fractional parts, while arrays and
alphabetic/alphanumeric strings simply carry their length.
The full syntax of the type-system follows:
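$$
\begin{array}{rcll}
\tau & := & & \text{storage types} \\
 & & \mathtt{num}_{q}[\rho] & \text{numeric} \\
 & \mid & \mathtt{alpha}[n] & \text{alphabetic string} \\
 & \mid & \mathtt{alphanum}[n] & \text{alphanumeric string} \\
 & \mid & \tau\ \mathtt{array}[n] & \text{array} \\
 & \mid & \{x_1 : \tau_1\ ..\ x_n : \tau_n\} & \text{record} \\
\sigma & := & & \text{temporary types} \\
 & & \tau & \\
 & \mid & \mathtt{bool} & \text{boolean} \\
 & \mid & \mathtt{num}[\rho] & \text{abstract numeric} \\
q & := & & \text{numeric storage qualifier} \\
 & & \mathtt{ascii} & \text{display or ASCII} \\
 & \mid & \mathtt{bcd} & \text{binary-packed decimal} \\
 & \mid & \mathtt{int}_{16|32|64} & \text{native integer} \\
 & \mid & \mathtt{float}_{32|64} & \text{native float} \\
\rho & := & [\mathtt{S}]\,n.d & \text{numeric format} \\
\varphi & := & \{\tau_1\ ..\ \tau_k\} & \text{flow-item or choice} \\
\Phi & := & \langle \varphi; \tau \rangle & \text{flow-type}
\end{array}
$$
where $k \geq 1$, $n \in \mathbb{N}$, $d \in \mathbb{N}$.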
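As an illustration, the storage types and flow-types of this syntax could be encoded with immutable data classes, so that flow-items can be kept in sets. This is only a sketch: the class and field names are assumptions, and temporary types (bool and the abstract numeric) are left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Num:
    qualifier: str   # "ascii", "bcd", "int16", "int32", "int64", "float32", "float64"
    signed: bool     # the optional S of the numeric format
    n: int           # integral digits
    d: int = 0       # fractional digits


@dataclass(frozen=True)
class Alpha:
    n: int


@dataclass(frozen=True)
class Alphanum:
    n: int


@dataclass(frozen=True)
class Array:
    elem: "Storage"
    n: int


@dataclass(frozen=True)
class Record:
    fields: Tuple[Tuple[str, "Storage"], ...]   # ordered (label, type) pairs


Storage = Union[Num, Alpha, Alphanum, Array, Record]


@dataclass(frozen=True)
class FlowType:
    choices: FrozenSet[Storage]   # the flow-item: at least one storage type
    init: Storage                 # the initialization type


# R2 : {X : num.float64[S2.1]} array[7] from the example above
r2 = Array(Record((("X", Num("float64", True, 2, 1)),)), 7)
# x : num[2], before any branch has been merged
x_flow = FlowType(frozenset({Num("ascii", False, 2)}), Num("ascii", False, 2))
```

Frozen data classes are hashable and compare structurally, so adding the same storage type twice to a flow-item has no effect.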
There are two distinct classes of types:
• τ is the type of storage variables and L-values in general, i.e. the type of data that stands in memory and has some representation (see footnote 9);
• σ, where σ ⊇ τ, is the type given to expression terms only and is never produced by picture translation, serving just as a temporary light-weight type whose in-memory representation is not yet known in that context.
As typing rules will show, such temporary types
are eventually promoted to ordinary τ types as soon as
the storage type of an actual variable becomes known,
for example when an expression that is given a temporary
type is then assigned to an L-value or passed as a
call-by-ref argument in a procedure call.
Finally, a flow-type is simply a pair made of possibly
multiple storage types (those a variable may
have at the same time as a result of statically undecidable
conditional branches in the program flow, as stated in
section 1) and an additional single storage type, which is
the type initially declared for the variable in the global
environment. We will often refer to the first component
of a flow-type as flow-item or choice.
Footnote 9: ASCII is the default qualifier for numeric types: whenever unspecified this one holds, as in num[3] for example.
2.2 Environments
Type judgements operate over a number of environments.
Type Environment Γ maps variable identifiers to flow-types: this environment is initially populated with global type declarations and its bindings are then updated when the current flow-type changes during typing. It contains bindings of form $x : \Phi$.
Topological Environment Θ collects all annotations produced by the type analyzer by mapping each labeled variable occurrence $x_\kappa$ to its flow-type at that program point. It also represents the status of the typing function in the detection of loop termination. It contains bindings of form $x_\kappa : \Phi$.
Procedure Environment Π maps procedure names to signatures (see definition 3.8). It contains bindings of form $p \mapsto \langle y_1 : \tau^p_1\ ..\ y_n : \tau^p_n; \Gamma_p \rangle$.
Block Environment Σ maps label identifiers to blocks of statements. It contains bindings of form $l \mapsto \{st_1\ ..\ st_n\}$.
Type environments also support a binary function
merge, used by rules IF and IF-ELSE, which combines
the bindings collected in separate environments
by the typing of program branches, as informally introduced
in section 1. The merge function alone is
responsible for the growth of the flow-item component
within a flow-type.
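A minimal sketch of such a merge follows, under the assumption that it is a per-variable set-union of flow-items that leaves initialization types untouched. Environments are modelled as dictionaries from identifiers to (choices, init) pairs and types are written as plain strings; both are simplifications chosen for the example.

```python
def merge(env_a: dict, env_b: dict) -> dict:
    """Merge two type environments mapping identifiers to (choices, init)."""
    merged = {}
    for name in env_a.keys() | env_b.keys():
        choices_a, init_a = env_a.get(name, (frozenset(), None))
        choices_b, init_b = env_b.get(name, (frozenset(), None))
        # The initialization type never changes; take it from either side.
        init = init_a if init_a is not None else init_b
        # The flow-item can only grow: it is the union of both branches.
        merged[name] = (choices_a | choices_b, init)
    return merged


# Environment reaching the if statement, and the one leaving its then block.
before_if = {"x": (frozenset({"num[2]"}), "num[2]")}
then_branch = {"x": (frozenset({"alpha[2]"}), "num[2]")}
after_if = merge(before_if, then_branch)
```

After the merge, x is bound to a flow-type with two choices and its original initialization type.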
2.3 Coercion of L-Values
Take the following example:
{
a[0].l := "boo";
}
where a : { l : num[2]; m : alpha[10] } array[5]
And its annotated form resulting from the type
analysis:
{
(a : { l : alpha[2]; m : alpha[10] } array[5])[0].l
:= "boo";
}
where a : { l : num[2]; m : alpha[10] } array[5]
The literal "boo", having type alpha[3], is assigned
to field l of a record within a cell of an array.
The flow-type of variable a needs to be updated here
with the type of the right-hand side of the assignment,
and of course such type must not naively be given to a
itself, but to the record field l nested
within. However, the environment binds variable
identifiers to flow-types, thus there is no way to update
the type of a record label (as l in our case) or
of an array cell alone. Therefore the whole type of a
variable must be updated, keeping the original structure
layout and replacing the appropriate part nested
within it. Hence, the whole type of a in the example
becomes { l : alpha[2]; m : alpha[10] } array[5].
This also shows that the expected type alpha[3]
of the literal "boo" has been adapted to fit into the
initialization type num[2]: coercion in assignments
therefore needs both to replace a piece of a type and
to resize it accordingly, keeping the storage
class of the assigned value (alpha in our example) and
recalculating the format in such a way that the overall
size of the new resulting type fits the initialization one.
For these reasons, judgements for L-value terms are
slightly different: $\Pi;\Sigma;\Gamma;\Theta_0 \vdash_{lv} lv : \tau \backslash \theta_{x_\kappa} \triangleright \Theta_1$
means that the L-value lv has a storage type τ coercible
by the substitution $\theta_{x_\kappa}$, where x is the root variable
of lv (formally $x = \mathrm{root}(lv)$ as of definition 3.7) and
$x_\kappa$ is its labeled occurrence. θ is a function from storage
types to storage types; typing rules that need to update
the type of the root variable of an L-value pass it
to the coerce function C (see definition
3.6), which performs the proper fit operation among
other things.
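The role of the substitution can be sketched as a function built recursively along the L-value, which rebuilds the whole type of the root variable while replacing only the innermost part. The encoding is an assumption made for this example: records are dictionaries, arrays are ("array", element, size) tuples, and an L-value is a list of steps where "[]" stands for a subscript and any other string selects a record field.

```python
def substitution(lvalue_path, declared):
    """Build a function that replaces the innermost type reached by the path.

    The returned function takes the new innermost type and returns the whole
    type of the root variable, with the original layout preserved.
    """
    if not lvalue_path:
        return lambda new: new
    step, rest = lvalue_path[0], lvalue_path[1:]
    if step == "[]":
        # Array subscript: descend into the element type, keep the size.
        _, elem, size = declared
        inner = substitution(rest, elem)
        return lambda new: ("array", inner(new), size)
    # Record field selection: rebuild the record with one field replaced.
    inner = substitution(rest, declared[step])
    return lambda new: {**declared, step: inner(new)}


# a : { l : num[2]; m : alpha[10] } array[5], and the L-value a[0].l
a_type = ("array", {"l": "num[2]", "m": "alpha[10]"}, 5)
theta = substitution(["[]", "l"], a_type)
```

Calling `theta("alpha[2]")` yields the updated type of a shown in the annotated example, while `a_type` itself is left unchanged.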
2.4 Loops and Convergence
As informally stated in section 1, the type analyzer
follows goto and perform statements unless they have
already been visited and a convergence in the status of
the typing function is detected. In subsection 2.2 we said that
this status actually consists of the topological environment
Θ. The typing function at step i of the analysis
can be defined as a function taking the statement
fetched at that step and the topological environment:
$$T_i(st_{B,p}, \Theta_i) = \Theta_{i+1}$$
where $st_{B,p}$ is the statement located within block
B at position p.
Each time the typing function encounters a jump
statement, it performs a number of operations. Say
a jump statement $st_{A,q}$, namely goto l, is encountered by T
at step i while typing block $A = \{st_{A,1}\ ..\ st_{A,n}\}$ (with
$q \in [1,n]$):
1. it saves the topological environment $\Theta_i$ built up so far, binding it to the current program location;
2. it looks up the destination block of statements from the block environment, hence $B = \{st_{B,1}\ ..\ st_{B,m}\} = \Sigma(l)$;
3. it continues the analysis from there, i.e. from statement $st_{B,1}$.
Suppose that later, at step j (obviously j > i),
T reaches the jump statement $st_{A,q}$ again: then the
new current topological environment $\Theta_j$ is compared
against $\Theta_i$, which had formerly been saved at that program
location. If $\Theta_j \subseteq \Theta_i$ (see definition 3.11) then
no further type information has been
collected during the second pass and we can therefore
assume that the analysis can safely skip the jump
statement $st_{A,q}$ and continue from $st_{A,q+1}$. Otherwise, the
new topological environment $\Theta_j$ is saved (replacing
the old $\Theta_i$ previously stored) and the analysis continues
from the jump statement destination $st_{B,1}$ again.
We observed that even in complex spaghetti-code
scenarios with several goto statements within nested
conditional blocks the system detects a convergence
quite soon: on average after 1, and in any case after at
most 3, re-iterations of the same piece of code. The reason is
twofold:
• the topological environment cannot by definition be subject to binding removal, hence $\forall x_\kappa \in \Theta_i\ .\ x_\kappa \in \Theta_{i+1}$ at any given step i;
• flow-types bound to variable occurrences in the topological environment can only grow: they can never diminish in width. Given that we are dealing with types and not values, stability is certain: storage types of variables do not change from pass to pass, and the only thing that could change and modify the status Θ of the typing function T is the flow-item ϕ part of the flow-types bound to variable occurrences. ϕ is defined as a set of storage types τ in section 2.1 and it is subject to a single operation: the merge function of definition 3.9, which basically consists of a set-union between flow-items. Duplicate types can therefore never occur and no element can be removed.
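The convergence check at a jump can be sketched as a loop that re-types the destination block until the topological environment stops growing. This is a toy model, not the analyzer itself: the environment is a dictionary from occurrence labels to sets of types, `loop_body` is a hypothetical stand-in for typing one pass of a block, and the inclusion test is an assumed reading of the comparison between saved and current environments.

```python
def included(new: dict, old: dict) -> bool:
    """True when every binding of `new` is already covered by `old`."""
    return all(k in old and types <= old[k] for k, types in new.items())


def follow_jump(body, theta: dict, max_passes: int = 100):
    """Re-type the destination block until the environment stops growing."""
    saved = None
    passes = 0
    while saved is None or not included(theta, saved):
        if passes == max_passes:          # safety net for this sketch only
            raise RuntimeError("no convergence")
        saved = theta                     # save the environment at the jump
        theta = body(theta)               # type the destination block again
        passes += 1
    return theta, passes


def loop_body(theta: dict) -> dict:
    """One pass over a block where x is read (x@1) and then assigned (x@2)."""
    grown = dict(theta)
    # x@1 sees its declared type plus whatever the previous pass left in x.
    grown["x@1"] = (theta.get("x@1", frozenset())
                    | theta.get("x@2", frozenset())
                    | {"num[2]"})
    # x@2 is the target of a string assignment inside the loop.
    grown["x@2"] = theta.get("x@2", frozenset()) | {"alpha[2]"}
    return grown


final_theta, passes = follow_jump(loop_body, {})
```

Since each pass can only add bindings or enlarge sets of types, the loop stops as soon as a pass adds nothing new; in this toy block that happens on the third pass.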
2.5 Ambiguity
Having non-singleton flow-items within flow-types is
indeed a central feature of this system, signaling that
the programmer reused a variable in different ways
along the program. Nonetheless, that makes judgements
for L-values problematic: how are we supposed
to type an L-value appearing in an expression, for instance,
if its current flow-type says that it could have
many storage types at the same time? In fact, we cannot,
and that is exactly what flow-types stand for: detecting
anomalous scenarios that may lead to unwanted results
at run-time.
In our code example in section 1, imagine that the
system had output another hint message for the ambiguous
statement, claiming that among the possible
choices num[3] would have been suitable, and that
typing had then proceeded by selecting num[3] as candidate,
leading to a different type for x than the one shown in
the original example.
{
(x : num[3]) := (x : num[2]) + 1;
// [WARNING] possible truncation detected
// in assignment:
// num[3] :> num[2]
if (x : num[3]) > 0 then
{
(x : alpha[3]) := "foo";
// [ERROR] truncation detected in
// assignment:
// alpha[3] :> num[2]
}
(x : num[6]) := (x : num[3]|alpha[3]) + 23;
// [HINT] type of ’x’ is ambiguous in
// expression at right-hand of
// assignment: choice num[3]
// would fit
// [WARNING] possible truncation detected
// in assignment:
// num[6] :> num[2]
}
where x : num[2] := 11
What if more than one type were suitable, though?
The flow-type would literally explode in order to track
several implications among possible typing paths and
in the end it would hardly be useful.
Our proposal in such situations is to do the simplest
thing: falling back to the initial type of the variable,
and of course reporting the choice with a hint
message. However, this leads to a duplication of the
type rule for variables, as table 4 shows.
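The fall-back policy can be sketched in a few lines. The function name `resolve` and the string encoding of types are assumptions made for the example; the two branches mirror the two variants of the rule for variables, one for singleton flow-items and one for ambiguous ones.

```python
def resolve(name: str, choices: frozenset, init: str):
    """Pick the storage type used to type an occurrence of `name`.

    Returns the chosen type and the list of messages produced.
    """
    if len(choices) == 1:
        (only,) = choices                 # unambiguous: use the current type
        return only, []
    hint = (f"[HINT] type of '{name}' is ambiguous in expression: "
            f"assuming initialization type {init}")
    return init, [hint]                   # ambiguous: fall back to the picture


chosen, messages = resolve("x", frozenset({"num[3]", "alpha[3]"}), "num[2]")
```

With the flow-item of the example in section 1, the occurrence of x is typed with its initialization type num[2] and a single hint is produced.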
3 FORMAL SPECIFICATION
In this section we give the full specification of the
type-system described in section 2. A number of definitions
used by the type rules are given below.
Definition 3.1 (Promote). The promotion $[\![\sigma]\!]_{\tau}$ of a
temporary type σ to a storage type τ produces a storage
type that transforms σ into a storable type inheriting
the characteristics of τ. The promotion function
is defined as follows (top-down closest-match rule on
the left hand holds):
$$
\begin{array}{rcl}
[\![\mathtt{num}[\rho_2]]\!]_{\mathtt{num}_q[\rho_1]} & = & \mathtt{num}_q[\rho_2] \\
[\![\mathtt{num}[\rho]]\!]_{\tau} & = & \mathtt{num}_{\mathtt{ascii}}[\rho] \\
[\![\mathtt{bool}]\!]_{\tau} & = & [\![\mathtt{num}[1.0]]\!]_{\tau} \\
[\![\tau_2]\!]_{\tau_1} & = & \tau_2
\end{array}
$$
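The four cases of the promotion can be read as an ordered pattern match. The sketch below uses an assumed tuple encoding: ("absnum", rho) for the abstract numeric, ("bool",) for booleans, ("num", q, rho) for storage numerics and any other tuple for the remaining storage types.

```python
def promote(sigma, tau):
    """Promote temporary type sigma to a storage type, with tau as context."""
    if sigma[0] == "absnum" and tau[0] == "num":
        return ("num", tau[1], sigma[1])      # inherit the qualifier of tau
    if sigma[0] == "absnum":
        return ("num", "ascii", sigma[1])     # default qualifier
    if sigma[0] == "bool":
        return promote(("absnum", "1.0"), tau)
    return sigma                              # already a storage type


# An abstract numeric assigned to a packed-decimal variable.
promoted = promote(("absnum", "3.0"), ("num", "bcd", "S3.0"))
```

The cases are tried top-down, so the first one takes precedence over the second whenever the context type is numeric.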
Definition 3.2 (Representation). We define a function
$rep : \tau \to \mathbb{N}$ for calculating the in-memory byte
size of a storage type:
$$
\begin{array}{rcl}
rep(\mathtt{num}_{\mathtt{ascii}}[n.d]) & = & n + d \\
rep(\mathtt{num}_{\mathtt{bcd}}[n.d]) & = & \left\lceil \dfrac{n + d + 1}{2} \right\rceil \\
rep(\mathtt{num}_{\mathtt{int}_b}[\rho]) & = & b/8 \\
rep(\mathtt{num}_{\mathtt{float}_b}[\rho]) & = & b/8 \\
rep(\mathtt{alpha}[n]) & = & n \\
rep(\mathtt{alphanum}[n]) & = & n \\
rep(\tau\ \mathtt{array}[n]) & = & rep(\tau) \cdot n \\
rep(\{x_1 : \tau_1\ ..\ x_n : \tau_n\}) & = & \sum_{i=1}^{n} rep(\tau_i)
\end{array}
$$
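A sketch of rep over an assumed tuple encoding of storage types follows. It assumes that the packed-decimal size rounds up, i.e. one half-byte per digit plus one for the sign, and that qualifiers are written as strings such as "int32" or "float64".

```python
def rep(t) -> int:
    """In-memory byte size of a storage type (tuple encoding, see lead-in)."""
    kind = t[0]
    if kind == "num":
        _, qualifier, n, d = t
        if qualifier == "ascii":
            return n + d                      # one byte per digit
        if qualifier == "bcd":
            return (n + d + 2) // 2           # ceiling of (n + d + 1) / 2
        bits = int("".join(ch for ch in qualifier if ch.isdigit()))
        return bits // 8                      # native integers and floats
    if kind in ("alpha", "alphanum"):
        return t[1]
    if kind == "array":
        _, elem, size = t
        return rep(elem) * size
    if kind == "record":
        return sum(rep(field_type) for _, field_type in t[1])
    raise ValueError(f"unknown type: {t!r}")


# The bindings of section 2.1 in this encoding.
a = ("array", ("num", "bcd", 3, 0), 10)
n = ("num", "int32", 8, 0)
r1 = ("record", (("S", ("alpha", 2)), ("B", ("alphanum", 8))))
r2 = ("array", ("record", (("X", ("num", "float64", 2, 1)),)), 7)
```

Under these assumptions A occupies 20 bytes, N 4 bytes, R1 10 bytes and R2 56 bytes.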
Definition 3.3 (Subtype). We define a total order between
storage types such that the relation $\tau_1 \leq \tau_2$
holds when $rep(\tau_1) \leq rep(\tau_2)$.
Definition 3.4 (Var-Bound Substitution). A substitution
$\theta_{x_\kappa}$ is a function from storage types to storage
types that carries along a labeled identifier $x_\kappa$,
which stands for the variable occurrence whose type
the substitution has been built from and is supposed
to replace (see footnote 10).
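Definition 3.5 (Fit). The fit $\lfloor\tau_1\rfloor_{\tau_2}$ of a storage type $\tau_1$ to a storage type $\tau_2$ produces a storage type whose storage class is equivalent to that of $\tau_1$ and whose size fits into that of $\tau_2$. The fit function is defined as follows:
\[
\begin{aligned}
\lfloor \mathrm{num}_q[\rho] \rfloor_{\tau} &= \mathrm{num}_q[\rho'] && \text{for some } \rho' \text{ such that } \mathrm{rep}(\mathrm{num}_q[\rho']) = \mathrm{rep}(\tau) \\
\lfloor \mathrm{alpha}[n] \rfloor_{\tau} &= \mathrm{alpha}[n'] && \text{for some } n' \text{ such that } \mathrm{rep}(\mathrm{alpha}[n']) = \mathrm{rep}(\tau) \\
\lfloor \mathrm{alphanum}[n] \rfloor_{\tau} &= \mathrm{alphanum}[n'] && \text{for some } n' \text{ such that } \mathrm{rep}(\mathrm{alphanum}[n']) = \mathrm{rep}(\tau) \\
\lfloor \tau_a\ \mathrm{array}[n] \rfloor_{\tau} &= \tau'_a\ \mathrm{array}[n'] && \text{for some } \tau'_a \text{ and } n' \text{ such that } \mathrm{rep}(\tau'_a\ \mathrm{array}[n']) = \mathrm{rep}(\tau) \\
\lfloor \{l_1 : \tau_1\ ..\ l_n : \tau_n\} \rfloor_{\tau} &= \{l_1 : \tau'_1\ ..\ l_n : \tau'_n\} && \text{for some } \tau'_1\ ..\ \tau'_n \text{ such that } \mathrm{rep}(\{l_1 : \tau'_1\ ..\ l_n : \tau'_n\}) = \mathrm{rep}(\tau)
\end{aligned}
\]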
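The definition above leaves the choice of the fitted format open ("for some ..."). The sketch below picks one admissible choice per storage class, purely as an illustration: the tuple encoding, the restriction to ascii numerics and the decision to keep as many decimals as possible are all assumptions, and the record case is omitted.

```python
def rep(t):
    """Byte size for the few storage classes used in this sketch."""
    if t[0] == "num":                       # ("num", n, d), ascii digits only
        return t[1] + t[2]
    if t[0] in ("alpha", "alphanum"):       # ("alpha", n) or ("alphanum", n)
        return t[1]
    if t[0] == "array":                     # ("array", element_type, n)
        return rep(t[1]) * t[2]
    raise ValueError("unsupported type: %r" % (t,))

def fit(t1, t2):
    """Return a type of the same storage class as t1 whose size is rep(t2)."""
    size = rep(t2)
    if t1[0] == "num":
        d = min(t1[2], size)                # keep decimals if they fit: assumption
        return ("num", size - d, d)
    if t1[0] in ("alpha", "alphanum"):
        return (t1[0], size)
    if t1[0] == "array":
        elem = t1[1]                        # keep the element type: assumption
        if size % rep(elem) != 0:
            raise ValueError("no whole number of elements fits")
        return ("array", elem, size // rep(elem))
    raise ValueError("unsupported type: %r" % (t1,))
```

Whatever choice is made, the result must satisfy the side condition of the definition, namely that its size equals the size of the target type.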
Definition 3.6 (Coerce). The coerce function $\mathcal{C}$ updates the given type and topological environments by applying a given substitution function $\theta_{x^\kappa}$ to the types a given flow-item $\varphi$ consists of; it produces a new pair of form $\langle\Gamma;\Theta\rangle$ consisting of the type and topological environments endowed with updated bindings for the variable $x$ and the occurrence label $\kappa$ annotated on the substitution function $\theta_{x^\kappa}$ itself:
\[
\mathcal{C}(\varphi, \theta_{x^\kappa}, \Gamma, \Theta) = \langle \Gamma, x : \Phi' ;\ \Theta, \kappa : \Phi' \rangle
\]
where
\[
\begin{aligned}
\langle \varphi' ; \tau_x \rangle &= \Gamma(x) \\
\Phi' &= \langle \{ \tau'_i \mid \forall \tau_i \in \varphi.\ \tau'_i = \lfloor \theta_{x^\kappa}(\tau_i) \rfloor_{\tau_x} \} ; \tau_x \rangle
\end{aligned}
\]
Footnote 10: Substitution functions are recursively defined by type rules for L-Values as shown in table 4. They are meant for generically replacing a term nested within a storage type of arbitrary complexity by reproducing its original structure of recursive type terms and changing the innermost part only.
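Under one reading of the coerce function (each type of the flow-item is passed through the substitution and then fitted to the initial type of the variable), the update can be sketched as below. The dictionary encoding of the environments, the explicit `x` and `kappa` parameters, and the toy `fit_scalar` helper are assumptions made for the example.

```python
def coerce(phi, theta, x, kappa, gamma, topo, fit):
    """Return updated copies of the type and topological environments.

    gamma maps a variable to a pair (set of current types, initial type);
    topo maps an occurrence label to the same kind of pair.
    """
    _, tau_x = gamma[x]                       # the initial type is preserved
    new_flow = (frozenset(fit(theta(t), tau_x) for t in phi), tau_x)
    gamma_out = {**gamma, x: new_flow}        # rebinding for the variable
    topo_out = {**topo, kappa: new_flow}      # annotation for the occurrence
    return gamma_out, topo_out

def fit_scalar(t, target):
    """Toy fit for one-parameter types such as ("num", 3) or ("alphanum", 2)."""
    return (t[0], target[1])
```

The input environments are left untouched, mirroring the way the type rules thread environments from premises to conclusions.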
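Definition 3.7 (Root Variable). Given an L-value $lv$, its root variable is the identifier $x$ evaluated by the recursive function $\mathrm{root}$ defined as:
\[
\begin{aligned}
\mathrm{root}(x) &= x \\
\mathrm{root}(lv[e]) &= \mathrm{root}(lv) \\
\mathrm{root}(lv.l) &= \mathrm{root}(lv)
\end{aligned}
\]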
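As a small illustration of the root-variable function, the sketch below walks an L-value down to its identifier. The tagged-tuple encoding of L-values and the name `root` are assumptions.

```python
def root(lv):
    """Return the identifier at the root of an L-value.

    Hypothetical encoding: ("var", x), ("index", lv, e) or ("field", lv, label).
    """
    while lv[0] != "var":
        lv = lv[1]        # both subscripts and selections wrap an inner L-value
    return lv[1]
```

For example, the root variable of `a[3].z` is `a`.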
Definition 3.8 (Signature). A signature is a pair $\langle Y_p ; \Gamma_p \rangle$ where $p$ is a procedure name, $Y_p$ are its formal parameters $y_1 : \tau^p_1\ ..\ y_n : \tau^p_n$ and $\Gamma_p$ is the output type environment returned by typing the body of $p$.
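Definition 3.9 (Type Environment Merge). The binary function $\sqcup$ merges two given type environments into one as follows:
\[
\Gamma_1 \sqcup \Gamma_2 = \Gamma_\cap \cup (\Gamma_1 \setminus \Gamma_2) \cup (\Gamma_2 \setminus \Gamma_1)
\]
where
\[
\Gamma_\cap = \{ x : \langle \varphi_1 \cup \varphi_2 ; \tau_1 \rangle \mid \Gamma_1(x) = \langle \varphi_1 ; \tau_1 \rangle \wedge \Gamma_2(x) = \langle \varphi_2 ; \tau_2 \rangle \wedge \tau_1 = \tau_2 \}
\]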
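The merge of two type environments can be sketched with dictionaries. This is an illustration under stated assumptions: environments map variables to pairs (frozenset of current types, initial type), types are plain strings, and the set differences in the definition are read domain-wise (variables bound in only one environment).

```python
def merge(gamma1, gamma2):
    """Merge two type environments as in the definition above."""
    merged = {}
    for x in gamma1.keys() | gamma2.keys():
        if x in gamma1 and x in gamma2:
            (phi1, tau1), (phi2, tau2) = gamma1[x], gamma2[x]
            if tau1 == tau2:
                # Common variable: union of the flow-items, same initial type.
                merged[x] = (phi1 | phi2, tau1)
            # If the initial types differ, this reading of the definition
            # yields no binding for x, so x is left out.
        else:
            # Variable bound on one side only: keep its binding unchanged.
            merged[x] = gamma1[x] if x in gamma1 else gamma2[x]
    return merged
```

This is the operation used by the conditional rules to join the environments produced by different branches.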
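Definition 3.10 (Partial Ordering of Flow-Types). We define a partial order between flow-types such that $\Phi_1 \sqsubseteq \Phi_2$ holds when, letting $\Phi_1 = \langle \varphi_1 ; \tau_1 \rangle$ and $\Phi_2 = \langle \varphi_2 ; \tau_2 \rangle$, we have $\varphi_1 \subseteq \varphi_2 \wedge \tau_1 = \tau_2$.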
Definition 3.11 (Partial Ordering of Topological Environments). We define a partial order between topological environments such that $\Theta_1 \sqsubseteq \Theta_2$ holds when $\forall x : \Phi_1 \in \Theta_1.\ x \in \mathrm{dom}(\Theta_2) \wedge \Phi_1 \sqsubseteq \Phi_2$, where $\Phi_2 = \Theta_2(x)$.
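The two orderings translate directly into set inclusion checks. In the sketch below the function names and the dictionary encoding of topological environments are assumptions; flow-types are pairs (frozenset of types, initial type).

```python
def flow_leq(flow1, flow2):
    """Flow-type ordering: included flow-item and identical initial type."""
    (phi1, tau1), (phi2, tau2) = flow1, flow2
    return phi1 <= phi2 and tau1 == tau2

def topo_leq(topo1, topo2):
    """Environment ordering: every binding of topo1 is dominated in topo2."""
    return all(key in topo2 and flow_leq(flow, topo2[key])
               for key, flow in topo1.items())
```

A check of this kind is one way to decide that re-typing a piece of code has added no new information.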
3.1 Type Rules
Syntax-directed type rules are divided by category. Rules for Programs are shown in table 1, for Statements in table 2, for Arguments in table 3, for L-Values in table 4, for Literals in table 5 and for Expressions in table 6.
Most judgements give a type to a term of the language in a context consisting of a tuple of environments and output the updated Γ and Θ, except judgements for Statements and Programs, which give no type and simply update the environments. As a general rule, the topological environment Θ is always forwarded to and returned by all judgements (except those for literals), because flow-types must be annotated recursively on each variable occurring in any subterm of the program. The type environment Γ, instead, is output only by rules that actually update it: consider it returned untouched when it is not mentioned among the outputs.
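Judgements are of a number of forms, each syntactic category having its own, though most of them are quite self-explanatory. For example, $\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e : \sigma \triangleright \Theta_1$ denotes that, in the given environments, expression $e$ is given a temporary type $\sigma$ and the topological environment $\Theta_1$ is output.

Table 1: Type Rules for Programs and Body.

MAIN
\[
\frac{\Pi;\emptyset;\Theta_0 \vdash_B B \triangleright \Gamma;\Theta_1}
     {\Pi;\Theta_0 \vdash_P B \triangleright \Theta_1}
\]

PROC
\[
\frac{\begin{array}{c}
\Gamma_p = \emptyset, y_1 : \langle\{\tau^p_1\};\tau^p_1\rangle\ ..\ y_n : \langle\{\tau^p_n\};\tau^p_n\rangle \\
\Pi;\Gamma_p;\Theta_0 \vdash_B B \triangleright \Gamma_p;\Theta_1 \\
\Pi, p \mapsto \langle y_1 : \tau^p_1\ ..\ y_n : \tau^p_n ; \Gamma_p\rangle;\Theta_1 \vdash_P P \triangleright \Theta_2
\end{array}}
{\Pi;\Theta_0 \vdash_P \mathtt{proc}\ p(y_1 : \tau^p_1\ ..\ y_n : \tau^p_n)\ B\ \mathtt{in}\ P \triangleright \Theta_2}
\]

BODY
\[
\frac{\begin{array}{c}
\forall i \in [1,n].\ \Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{lit} lit_i : \sigma_i \qquad [\![\sigma_i]\!]_{\tau_i} \leq \tau_i \\
\Pi;\emptyset;\Gamma_0, x_1 : \langle\{\tau_1\};\tau_1\rangle\ ..\ x_n : \langle\{\tau_n\};\tau_n\rangle;\Theta_0 \vdash_{st} st \triangleright \Gamma_1;\Theta_1
\end{array}}
{\Pi;\Gamma_0;\Theta_0 \vdash_B st\ \mathtt{where}\ x_1 : \tau_1 := lit_1\ ..\ x_n : \tau_n := lit_n \triangleright \Gamma_1;\Theta_1}
\]

Table 2: Type Rules for Statements.

ASSIGN
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_e e : \sigma_e \triangleright \Theta_1 \qquad
\Pi;\Sigma;\Gamma_0;\Theta_1 \vdash_{lv} lv : \tau_{lv} \backslash \theta_{x^\kappa} \triangleright \Theta_2 \\
x^\kappa = \mathrm{root}(lv) \qquad
\langle\Gamma_1;\Theta_2\rangle = \mathcal{C}([\![\sigma_e]\!]_{\tau_{lv}}, \theta_{x^\kappa}, \Gamma_0, \Theta_2)
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} lv := e \triangleright \Gamma_1;\Theta_2}
\]

IF
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_e e : \mathrm{bool} \triangleright \Theta_1 \qquad
\Pi;\Sigma;\Gamma_0;\Theta_1 \vdash_{st} st_1 \triangleright \Gamma_1;\Theta_2 \\
\Gamma_2 = \Gamma_0 \sqcup \Gamma_1
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} \mathtt{if}\ e\ \mathtt{then}\ st_1 \triangleright \Gamma_2;\Theta_2}
\]

IF-ELSE
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_e e : \mathrm{bool} \triangleright \Theta_1 \\
\Pi;\Sigma;\Gamma_0;\Theta_1 \vdash_{st} st_1 \triangleright \Gamma_1;\Theta_2 \qquad
\Pi;\Sigma;\Gamma_0;\Theta_2 \vdash_{st} st_2 \triangleright \Gamma_2;\Theta_3 \\
\Gamma_3 = \Gamma_1 \sqcup \Gamma_2
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} \mathtt{if}\ e\ \mathtt{then}\ st_1\ \mathtt{else}\ st_2 \triangleright \Gamma_3;\Theta_3}
\]

PERFORM
\[
\frac{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} \Sigma(l) \triangleright \Gamma_1;\Theta_1}
     {\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} \mathtt{perform}\ l \triangleright \Gamma_1;\Theta_1}
\]

PERFORM-THRU
\[
\frac{\forall i \in [a,b).\ \Pi;\Sigma;\Gamma_{i-a};\Theta_{i-a} \vdash_{st} \Sigma(l_i) \triangleright \Gamma_{i-a+1};\Theta_{i-a+1}}
     {\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} \mathtt{perform}\ l_a - l_b \triangleright \Gamma_{b-a-1};\Theta_{b-a-1}}
\]

GOTO
\[
\frac{\begin{array}{c}
l_n \in \mathrm{dom}(\Sigma) \mid \nexists\, l_m.\ m > n \\
\forall i \in [k,n].\ \Pi;\Sigma;\Gamma_{i-k};\Theta_{i-k} \vdash_{st} \Sigma(l_i) \triangleright \Gamma_{i-k+1};\Theta_{i-k+1}
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} \mathtt{goto}\ l_k \triangleright \Gamma_{n-k};\Theta_{n-k}}
\]

CALL
\[
\frac{\begin{array}{c}
\langle y_1 : \tau^p_1\ ..\ y_n : \tau^p_n ; \Gamma_p\rangle = \Pi(p) \\
\forall i \in [1,n].\ \Pi;\Sigma;\Gamma_{i-1};\Theta_{i-1} \vdash_a a_i : \tau^p_i \triangleright \Gamma_i;\Theta_i
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} p(a_1\ ..\ a_n) \triangleright \Gamma_n;\Theta_n}
\]

BLOCK
\[
\frac{\begin{array}{c}
\Sigma' = \Sigma, l_j \mapsto \{st_{j,1}\ ..\ st_{j,n_j}\}\ ..\ (\forall j \mid st_{0,j} \equiv l_j : \{st_{j,1}\ ..\ st_{j,n_j}\}) \\
\forall i \in [1,n].\ \Pi;\Sigma';\Gamma_{i-1};\Theta_{i-1} \vdash_{st} st_i \triangleright \Gamma_i;\Theta_i
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{st} [l_0 :]\ \{\, st_{0,1}\ ..\ st_{0,n_0} \,\} \triangleright \Gamma_n;\Theta_n}
\]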
Judgements for Arguments need a few extra words. Call-by-reference calls need to update the type environment of the caller, because the flow-type of an argument might be modified by the invoked procedure. The procedure environment $\Pi$ stores the type environment $\Gamma_p$ for each procedure $p$ of the program, thus the flow-type of a variable passed by reference to $p$ can be updated according to the flow-type of the corresponding formal parameter bound in $\Gamma_p$. Such an update is carried out by the coerce function $\mathcal{C}$, as shown by rule BYREF in table 3. The mechanism resembles that of rule ASSIGN in table 2: call-by-reference argument application indeed behaves like an assignment (call-by-value does not).
Rules for Arguments have form $\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_a a : \tau^p_i \triangleright \Gamma_1;\Theta_1$, meaning that, in the given environments, the actual argument $a$ has type $\tau^p_i$, which is the type of the $i$-th formal parameter of procedure $p$.
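As a final notice, for the sake of simplicity we assume that all labels in the program are named in order of occurrence: if $l_n$ and $l_m$ are two labels and $m > n$, then $l_m$ appears below $l_n$ in the program. That makes type rules for jump statements simpler.

Table 3: Type Rules for Arguments.

BYVAL
\[
\frac{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_e e : \sigma \triangleright \Theta_1 \qquad [\![\sigma]\!]_{\tau} \leq \tau^p_i}
     {\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_a \mathtt{val}\ e : \tau^p_i \triangleright \Gamma_0;\Theta_1}
\]

BYREF
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{lv} lv : \tau \backslash \theta_{x^\kappa} \triangleright \Theta_1 \qquad
x = \mathrm{root}(lv) \qquad \tau \leq \tau^p_i \\
\langle y_1 : \tau^p_1\ ..\ y_n : \tau^p_n ; \Gamma_p\rangle = \Pi(p) \qquad
\langle \varphi^p_i ; \tau^p_i \rangle = \Gamma_p(y_i) \\
\langle\Gamma_1;\Theta_2\rangle = \mathcal{C}(\varphi^p_i, \theta_{x^\kappa}, \Gamma_0, \Theta_1)
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_a \mathtt{ref}\ lv : \tau^p_i \triangleright \Gamma_1;\Theta_2}
\]

Table 4: Type Rules for L-Values.

VAR-INIT
\[
\frac{\Gamma(x) = \Phi = \langle\{\tau_1\ \tau_2\ ..\ \tau_n\};\tau_0\rangle \qquad
      \Theta_1 = \Theta_0, x^\kappa : \Phi \qquad \theta_{x^\kappa}(\tau) = \tau}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_{lv} x^\kappa : \tau_0 \backslash \theta_{x^\kappa} \triangleright \Theta_1}
\]

VAR-CURR
\[
\frac{\Gamma(x) = \Phi = \langle\{\tau_1\};\tau_0\rangle \qquad
      \Theta_1 = \Theta_0, x^\kappa : \Phi \qquad \theta_{x^\kappa}(\tau) = \tau}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_{lv} x^\kappa : \tau_1 \backslash \theta_{x^\kappa} \triangleright \Theta_1}
\]

SUBSCRIPT
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e : \mathrm{num}[\rho] \triangleright \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash_{lv} lv : \tau\ \mathrm{array}[n] \backslash \theta_{x^{\kappa_0}} \triangleright \Theta_2 \\
x = \mathrm{root}(lv) \qquad
\theta_{x^\kappa}(\tau) = \theta_{x^{\kappa_0}}(\tau\ \mathrm{array}[n])
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_{lv} lv[e] : \tau \backslash \theta_{x^\kappa} \triangleright \Theta_2}
\]

SELECT
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash_{lv} lv : \{z_1 : \tau_1\ ..\ z : \tau\ ..\ z_n : \tau_n\} \backslash \theta_{x^{\kappa_0}} \triangleright \Theta_1 \\
x = \mathrm{root}(lv) \qquad
\theta_{x^\kappa}(\tau) = \theta_{x^{\kappa_0}}(\{z_1 : \tau_1\ ..\ z : \tau\ ..\ z_n : \tau_n\})
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_{lv} lv.z : \tau \backslash \theta_{x^\kappa} \triangleright \Theta_2}
\]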
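The way the L-value rules build substitution functions (identity for variables, one more layer of wrapping for each subscript or selection) can be sketched with closures. The encodings of L-values and types, the `env` mapping and the name `type_lvalue` are assumptions; occurrence labels and the topological environment are left out.

```python
def type_lvalue(lv, env):
    """Return (type of lv, theta) for a hypothetical L-value encoding.

    lv is ("var", x), ("index", lv, e) or ("field", lv, label); env maps each
    variable to the storage type chosen for it. theta takes a replacement for
    the innermost term and rebuilds the whole type of the root variable.
    """
    if lv[0] == "var":
        return env[lv[1]], (lambda new: new)         # variable rules: identity
    inner_type, outer = type_lvalue(lv[1], env)
    if lv[0] == "index":                             # subscript rule
        _, elem, n = inner_type                      # ("array", elem, n)
        return elem, (lambda new: outer(("array", new, n)))
    if lv[0] == "field":                             # selection rule
        label, fields = lv[2], inner_type[1]         # ("record", fields)
        chosen = dict(fields)[label]

        def theta(new):
            rebuilt = tuple((l, new if l == label else t) for l, t in fields)
            return outer(("record", rebuilt))
        return chosen, theta
    raise ValueError("not an L-value: %r" % (lv,))
```

Applying the substitution obtained for `r.a[1]` to a new element type returns the type of the whole record `r` with only the element type of field `a` changed, which is the behaviour described in footnote 10.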
4 RESULTS AND CONCLUSIONS
Our implementation of the system, as already said, can detect a number of type misuses and mismatches besides producing flow-type annotations for each variable occurrence. At the time of writing, several tests have been run over real-world legacy business code, mainly written in COBOL85 for z/OS during the 1990s and owned by a big local company within the mechanical vehicle industry. The following considerations and evidence have emerged:
- variable reuse involves up to 30% of overall variable usage in COBOL programs
  - nearly 90% of these, though, accumulate fewer than 5 storage types simultaneously within their flow-type; 3 on average
  - the remaining 10% are however unlikely to grow wider than 8
- 75% of non-singleton flow-types indicate reuse of numeric types
  - 80% of these come from in-place arithmetic operations possibly exceeding target variable space, such as the typical scenario (x : num[3]) := (x : num[2]) + 1
  - probably few of such operations are potentially risky at run-time, because programmers typically declare pictures wider than actually needed for their numerics
  - the remaining 20% are re-assignments or data movements, i.e. assignments where variables on the right-hand side do not appear on the left-hand side
- 25% of non-singleton flow-types indicate reuse of non-numeric types
  - 70% of these are alphanumeric-string-to-array type switches and vice versa
  - 10% involve complex data types, such as nested records overlapping arrays
  - only 2% occur between incompatible types, thus probably leading to data corruption and bugs
  - the remaining 18% involve data movements implying no truncation, which might be bad code but does not lead to unwanted run-time behaviors
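- 80% of jump statements require up to 3 visits (including the first one, hence 2 re-visits) to reach a convergence in the typing function status; 2 on average, hence 1 re-visit
  - 98% of those are actually pretty ordinary loops coming from COBOL iterative constructs; just 2% are unusual custom cycles created by the programmer
- the remaining 20% of jump statements need up to 5 visits before a convergence occurs
  - 70% of the latter are actually just nested conditional loops that COBOL iterative constructs cannot express and are explicitly written by programmers via IF and GOTO statements.

Table 5: Type Rules for Literals.

NUM-U
\[
\frac{n = \mathrm{len}(n_1) \qquad d = \mathrm{len}(n_2)}
     {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} n_1[.n_2] : \mathrm{num}[n.d]}
\]

NUM
\[
\frac{n = \mathrm{len}(n_1) \qquad d = \mathrm{len}(n_2)}
     {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \pm n_1[.n_2] : \mathrm{num}[Sn.d]}
\]

STRING-ALPHANUM
\[
\frac{\{0\ ..\ 9\} \cap \texttt{"str.."} \neq \emptyset \qquad n = \mathrm{len}(str)}
     {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \texttt{"str.."} : \mathrm{alphanum}[n]}
\]

STRING-ALPHA
\[
\frac{n = \mathrm{len}(\texttt{"str.."})}
     {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \texttt{"str.."} : \mathrm{alpha}[n]}
\]

TRUE
\[
\frac{}{\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \mathtt{true} : \mathrm{bool}}
\]

FALSE
\[
\frac{}{\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \mathtt{false} : \mathrm{bool}}
\]

Table 6: Type Rules for Expressions.

DEMOTE-NUM
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e : \mathrm{num}_q[\rho] \triangleright \Theta_0}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e : \mathrm{num}[\rho] \triangleright \Theta_0}
\]

LV
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_{lv} lv : \tau \backslash \theta_{x^\kappa} \triangleright \Theta_1}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e lv : \tau \triangleright \Theta_1}
\]

LIT
\[
\frac{\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} lit : \sigma}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e lit : \sigma \triangleright \Theta_0}
\]

NEG-S
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e : \mathrm{num}[Sn.d] \triangleright \Theta_1}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e -e : \mathrm{num}[Sn.d] \triangleright \Theta_1}
\]

NEG-U
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e : \mathrm{num}[n.d] \triangleright \Theta_1}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e -e : \mathrm{num}[Sn.d] \triangleright \Theta_1}
\]

NOT
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e : \mathrm{bool} \triangleright \Theta_1}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e \mathtt{not}\ e : \mathrm{bool} \triangleright \Theta_1}
\]

PLUS-U
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 : \mathrm{num}[n_1.d_1] \triangleright \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash_e e_2 : \mathrm{num}[n_2.d_2] \triangleright \Theta_2 \\
n = \max(n_1, n_2) \qquad d = \max(d_1, d_2)
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 + e_2 : \mathrm{num}[Sn.d] \triangleright \Theta_2}
\]

PLUS-MINUS-S
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 : \mathrm{num}[S_1 n_1.d_1] \triangleright \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash_e e_2 : \mathrm{num}[S_2 n_2.d_2] \triangleright \Theta_2 \\
S = S_1 \vee S_2 \qquad n = \max(n_1, n_2) + 1 \qquad d = \max(d_1, d_2)
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1\ (+ \mid -)\ e_2 : \mathrm{num}[Sn.d] \triangleright \Theta_2}
\]

MULT
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 : \mathrm{num}[S_1 n_1.d_1] \triangleright \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash_e e_2 : \mathrm{num}[S_2 n_2.d_2] \triangleright \Theta_2 \\
S = S_1 \vee S_2 \qquad n = n_1 + n_2 \qquad d = d_1 + d_2
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 \times e_2 : \mathrm{num}[Sn.d] \triangleright \Theta_2}
\]

DIV
\[
\frac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 : \mathrm{num}[S_1 n_1.d_1] \triangleright \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash_e e_2 : \mathrm{num}[S_2 n_2.d_2] \triangleright \Theta_2 \\
S = S_1 \vee S_2 \qquad n = n_1 + d_2 \qquad d = d_1 + n_2
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 / e_2 : \mathrm{num}[Sn.d] \triangleright \Theta_2}
\]

BIN-REL-NUM
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 : \mathrm{num}[S_1 n_1.d_1] \triangleright \Theta_1 \qquad
      \Pi;\Sigma;\Gamma;\Theta_1 \vdash_e e_2 : \mathrm{num}[S_2 n_2.d_2] \triangleright \Theta_2}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1\ op_r\ e_2 : \mathrm{bool} \triangleright \Theta_2}
\]

BIN-REL-ALPHANUM
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 : \mathrm{alphanum}[n_1] \triangleright \Theta_1 \qquad
      \Pi;\Sigma;\Gamma;\Theta_1 \vdash_e e_2 : \mathrm{alphanum}[n_2] \triangleright \Theta_2}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1\ op_r\ e_2 : \mathrm{bool} \triangleright \Theta_2}
\]

BIN-LOGIC
\[
\frac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1 : \mathrm{bool} \triangleright \Theta_1 \qquad
      \Pi;\Sigma;\Gamma;\Theta_1 \vdash_e e_2 : \mathrm{bool} \triangleright \Theta_2}
     {\Pi;\Sigma;\Gamma;\Theta_0 \vdash_e e_1\ op_l\ e_2 : \mathrm{bool} \triangleright \Theta_2}
\]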
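The format arithmetic performed by the signed arithmetic rules for expressions (PLUS-MINUS-S, MULT and DIV) can be sketched as follows. The triple encoding `(signed, n, d)` of a numeric format and the function names are assumptions, as is the reading of the sign combination as "signed if either operand is signed".

```python
def plus_minus_signed(t1, t2):
    """Format of e1 + e2 or e1 - e2: one extra integer digit for the carry."""
    (s1, n1, d1), (s2, n2, d2) = t1, t2
    return (s1 or s2, max(n1, n2) + 1, max(d1, d2))

def mult(t1, t2):
    """Format of a product: digit counts add up on both sides of the point."""
    (s1, n1, d1), (s2, n2, d2) = t1, t2
    return (s1 or s2, n1 + n2, d1 + d2)

def div(t1, t2):
    """Format of a quotient, following the DIV rule of the expression rules."""
    (s1, n1, d1), (s2, n2, d2) = t1, t2
    return (s1 or s2, n1 + d2, d1 + n2)
```

For example, adding a two-digit and a one-digit signed integer yields a three-digit format, which is the kind of growth behind the truncation warnings raised on in-place arithmetic.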
All this suggests that type-flow analysis is actually able to detect a number of possible errors in COBOL programs coming from bad reuse of variables or incompatible data movements. Either way leads to data truncation or corruption, which are major sources of run-time bugs. Incidentally, the statistics above do not differ much from those collected and shown by (Moonen, 2003).
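The visit counts reported in the list above refer to re-typing the target of a jump until the environments stop changing. The sketch below illustrates that general idea only; it is not the authors' algorithm. Environments are simplified to dictionaries from variables to sets of type names, and `merge`, `visit_until_stable` and `loop_body` are hypothetical names.

```python
def merge(g1, g2):
    """Pointwise union of the type sets bound to each variable."""
    return {x: g1.get(x, frozenset()) | g2.get(x, frozenset())
            for x in g1.keys() | g2.keys()}

def visit_until_stable(body, gamma, max_visits=100):
    """Re-type body until the environment stops growing; count the visits."""
    for visits in range(1, max_visits + 1):
        widened = merge(gamma, body(gamma))
        if widened == gamma:
            return gamma, visits
        gamma = widened
    raise RuntimeError("no convergence within max_visits")

def loop_body(g):
    """Toy loop body: y := x, then x is reused for an alphanumeric value."""
    out = dict(g)
    out["y"] = g["x"]
    out["x"] = frozenset({"alphanum[2]"})
    return out
```

On the toy body, starting from `x` and `y` both holding `num[2]`, the environment is stable at the third visit. Termination relies on the type sets only growing within a finite universe of types.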
In the following example we show how a data move from a smaller type to a larger one might lead to unwanted scenarios where previous data has not been fully replaced by the new data:
{
  a := r;
  // [WARNING] reverse subsumption detected in
  //           assignment: right-hand type
  //           is smaller than left-hand type
  n := a[3];
  // [WARNING] possible access to corrupted
  //           data: accessing 'a' with its
  //           initialization type
  //           'alphanum[2] array[4]' but its
  //           content and type have changed
}
where a : alphanum[2] array[4];
      r : { x : num[3];
            y : alphanum[2];
            z : num[2] };
      n : alphanum[2]
Record r is 7 bytes long and array a is 8 bytes long; therefore, once r is copied into a, accessing the latter with its initialization array type would yield unwanted data whenever the last byte is accessed. Although the example uses a literal as the subscript, in general the analyzer cannot know which element is accessed, and therefore the warning is output.
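As a worked check of the two sizes, the representation function of Definition 3.2 can be applied to the declarations of the example. The numeric fields are assumed here to use the ascii representation, since the example does not state a qualifier:
\[
\begin{aligned}
\mathrm{rep}(r) &= \mathrm{rep}(\mathrm{num}_{\mathrm{ascii}}[3]) + \mathrm{rep}(\mathrm{alphanum}[2]) + \mathrm{rep}(\mathrm{num}_{\mathrm{ascii}}[2]) = 3 + 2 + 2 = 7 \\
\mathrm{rep}(a) &= \mathrm{rep}(\mathrm{alphanum}[2]) \cdot 4 = 2 \cdot 4 = 8
\end{aligned}
\]
The copy therefore overwrites only the first 7 of the 8 bytes of a, so the last two-byte element of a still contains one byte of the previous content.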
For this reason, static evaluation of constant expressions has been implemented in our prototype, even though we have not considered it in this article: it would avoid the warning if the assignment were n := a[1], and we noted that overall it slightly reduces the number of messages logged by the analyzer. Also, a GUI front-end is under development for letting users browse annotated source programs and understand complex flow-types more easily.
Finally, we are considering extending the system with the following features:
- dealing with unknown statements in some interesting way, type-wise, such as adding weak types to the type system indicating that type assumptions might get broken whenever a variable is used by a COBOL command whose semantics are unknown
- support for COBOL language extensions such as SQL, introducing the notion of cursor and table types within the system for detecting possible inconsistencies between declared records and the actual row layout in the database
- adding some form of data-flow analysis over value domains and ranges
- designing some custom Program Understanding approaches, such as pattern recognition over identifier names or code snippets, for making the system aware of typical COBOL programming trends, styles, practices and design patterns
REFERENCES
Nielson, F., Nielson, H. R., and Hankin, C. (1999). Principles of Program Analysis. Springer Verlag.
Holt, R. C. (2008). WCRE 1998 most influential paper: Grokking software architecture. In WCRE (Working Conference on Reverse Engineering), pages 5–14. IEEE.
IBM (2009). Cobol z/OS language reference. Website. http://publib.boulder.ibm.com/infocenter/pdthelp/v1r1/index.jsp?topic=/com.ibm.debugtool.doc_7.1/eqa7rm0293.htm.
Kernighan, B. W. and Ritchie, D. (1988). The C Programming Language, Second Edition. Prentice-Hall.
Kuipers, T. and Moonen, L. (2000). Types and concept analysis for legacy systems. In IWPC, pages 221–230. IEEE Computer Society.
Moonen, L. (2001). Generating robust parsers using island grammars. In WCRE (Working Conference on Reverse Engineering).
Moonen, L. (2003). Exploring software systems. In ICSM, pages 276–280. IEEE Computer Society.
Stroustrup, B. (2000). The C++ Programming Language. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 3rd edition.
van Deursen, A. and Moonen, L. (1998). Type inference for COBOL systems. In WCRE (Working Conference on Reverse Engineering), pages 220–230.
van Deursen, A. and Moonen, L. (1999). Understanding COBOL systems using inferred types. In IWPC. IEEE Computer Society.
van Deursen, A. and Moonen, L. (2000). Exploring legacy systems using types. In WCRE (Working Conference on Reverse Engineering), pages 32–41.
van Deursen, A. and Moonen, L. (2001). An empirical study into COBOL type inferencing. Sci. Comput. Program., 40(2-3):189–211.
van Deursen, A. and Moonen, L. (2006). Documenting software systems using types. Sci. Comput. Program., 60(2):205–220.