Generalizing Data Flow Information
Aug, 2007
skape
mmiller@hick.org

Abstract: Generalizing information is a common method of reducing the quantity
of data that must be considered during analysis. This fact has been plainly
illustrated in relation to static data flow analysis where previous research
has described algorithms that can be used to generalize data flow information.
These generalizations have helped support more optimal data flow analysis in
certain situations. In the same vein, this paper describes a process that can
be employed to generalize and persist data flow information along multiple
generalization tiers. Each generalization tier is meant to describe the data
flow behaviors of a conceptual software element such as an instruction, a
basic block, a procedure, a data type, and so on. This process makes use of
algorithms described in previous literature to support the generalization of
data flow information. To illustrate the usefulness of the generalization
process, this paper also presents an algorithm that can be used to determine
reachability at each generalization tier. The algorithm determines
reachability starting from the least specific generalization tier and uses the
set of reachable paths found to progressively qualify data flow information
for each successive generalization tier. This helps to constrain the amount
of data flow information that must be considered to a minimal subset.

1) Introduction

Data flow analysis uses data flow information to solve a particular data flow
problem such as determining reachability, dependence, and so on. The
algorithms used to obtain data flow information may vary in terms of accuracy
and precision. To help quantify effectiveness, data flow algorithms may
generally be categorized based on specific sensitivities. The first category,
referred to as flow sensitivity, is used to convey whether or not an algorithm
takes into account the implied order of instructions. Path sensitivity is
used to convey whether or not an algorithm considers predicates. Finally,
algorithms may also be context-sensitive if they take into account a calling
context to restrict analysis to realizable paths when considering
interprocedural data flow information.

Data flow information is typically collected by statically analyzing the data
dependence of instructions or statements. For example, conventional def-use
chains describe the variables that exist within the def(), use(), in(), out(),
and kill() sets for each instruction or statement. Understanding data flow
information with this level of detail makes it possible to statically solve a
particular data flow problem. However, the resources needed to represent the
def-use data flow information can be prohibitive when working with large
applications. Depending on the data flow problem, the amount of data flow
information required to come to a solution may be in excess of the physical
resources present on a computer performing the analysis. This physical
resource problem can be solved using at least two general approaches.

The most basic approach might involve simply partitioning, or fragmenting,
analysis information such that smaller subsets are considered individually
rather than attempting to represent the complete set of data flow information
at once[15]. While this would effectively constrain the amount of physical
resources required, it would also directly impact the accuracy and precision
of the underlying algorithm used to perform data flow analysis. For instance,
identifying the ``interesting portion'' of a program may require more state
than can be feasibly obtained in a single program fragment. A second and
potentially more optimal approach might involve generalizing data flow
information. By generalizing data flow information, an algorithm can operate
within the bounds of physical resources by making use of a more abstract view
of the complete set of data flow information. The distinction between the
generalizing approach and the partitioning approach is that the generalized
data flow information should not affect the accuracy of the algorithm since it
should still be able to represent the complete set of generalized data flow
information at once.

There has been significant prior work that has illustrated the effectiveness
of generalizing data flow information when performing data flow analysis. The
def-use information obtained between instructions or statements has been
generalized to describe sets for basic blocks. Horwitz, Reps, and Binkley
describe how a system dependence graph (SDG) can be derived from
intraprocedural data flow information to produce a summary graph which conveys
context-sensitive data flow information at the procedure level[7]. Their paper
went on to describe an interprocedural slicing algorithm that made use of
SDGs. Reps, Horwitz, and Sagiv later described a general framework (IFDS) in
which many data flow analysis problems can be solved as graph reachability
problems[13, 14]. The algorithms proposed in their paper focus on restricting
analysis to interprocedurally realizable paths to improve precision.
Identifying interprocedurally realizable paths has since been compared to the
concept of context-free-language reachability (CFL-reachability)[8]. These
algorithms have helped to form the basis for techniques used in this paper to
both generalize and analyze data flow information.

This paper approaches the generalization of data flow information by defining
generalization tiers at which data flow information can be conveyed. A
generalization tier is intended to show the data flow relationships between a
set of conceptual software elements. Examples of software elements include an
instruction, a basic block, a procedure, a data type, and so on. To define
these relationships, data flow information is collected at the most specific
generalization tier, such as the instruction tier, and then generalized to
increasingly less-specific generalization tiers, such as the basic block,
procedure, and data type tiers.

To illustrate the usefulness of generalizing data flow information, this paper
also presents a progressive algorithm that can be used to determine
reachability between nodes on a data flow graph at each generalization tier.
The algorithm starts by generating a data flow graph using data flow
information from the least-specific generalization tier. The graph is then
analyzed using a previously described algorithm to determine reachability
between an arbitrary set of nodes. The set of reachable paths found is then
used to qualify the set of more-specific potentially reachable paths found at
the next generalization tier. The more-specific paths are used to construct a
new data flow graph. These steps then repeat using each more-specific
generalization tier until it is not possible to obtain more detailed
information. The benefit of this approach is that a minimal set of data flow
information is considered as a result of progressively qualifying data flow
paths at each generalization tier. It should be noted that different
reachability problems may require state that is prohibitively large. As such,
it is helpful to consider refining a reachability problem to operate more
efficiently by making use of generalized information.
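
The progressive qualification described above can be pictured with a small
sketch. This is not the algorithm from section 3; it is a simplified
illustration that assumes each tier is given as a list of edges plus a
hypothetical parent_of map tying each node to its node in the next
less-specific tier. The names on_path_nodes and progressive_reachability are
invented for this example.

```python
def on_path_nodes(edges, sources, sinks):
    """Return the nodes that lie on at least one source-to-sink path."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, set()).add(v)
        bwd.setdefault(v, set()).add(u)

    def closure(start, adj):
        # Plain depth-first search over the adjacency map.
        seen, stack = set(start), list(start)
        while stack:
            for nxt in adj.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # A node is on a path if it is reachable from a source and reaches a sink.
    return closure(sources, fwd) & closure(sinks, bwd)


def progressive_reachability(tiers, sources, sinks):
    """Walk tiers from least specific to most specific.

    tiers:   list of (edges, parent_of); parent_of is None for the first tier
             (assumed representation, not taken from the paper).
    sources: list of source node sets, one per tier.
    sinks:   list of sink node sets, one per tier.
    """
    allowed = None
    for (edges, parent_of), src, dst in zip(tiers, sources, sinks):
        if allowed is not None:
            # Keep only edges whose endpoints generalize to surviving nodes.
            edges = [(u, v) for u, v in edges
                     if parent_of[u] in allowed and parent_of[v] in allowed]
        allowed = on_path_nodes(edges, src, dst)
    return allowed


# Procedure tier first, then a basic block tier (all names are made up).
procedure_tier = ([("f", "g"), ("f", "h")], None)
block_tier = ([("f.b1", "g.b1"), ("f.b2", "h.b1")],
              {"f.b1": "f", "f.b2": "f", "g.b1": "g", "h.b1": "h"})
result = progressive_reachability(
    [procedure_tier, block_tier],
    [{"f"}, {"f.b1", "f.b2"}],
    [{"g"}, {"g.b1", "h.b1"}])
print(sorted(result))
```

Because h is never reachable at the procedure tier, the edge into h.b1 is
discarded before the block tier graph is even built, which is the source of the
savings the paragraph above describes.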

This paper is organized into two sections. Section 2 discusses the algorithms
used to generalize data flow information at each generalization tier. Section
3 describes the algorithm used to determine reachable data flow paths by
progressively analyzing data flow information at each generalization tier. It
should be noted in advance that the author does not claim to be an expert in
this field; rather, this paper is simply an explanation of the author's
current thoughts. These thoughts attempt to take into account previous work
whenever possible to the extent known by the author. Given that this is the
case, the author is more than willing to receive criticism relating to the
ideas put forth in this paper.

2) Generalization

Generalizing data flow information can make it possible to analyze large data
sets without losing accuracy. This section describes the process of
generalizing information at each generalization tier. As a matter of course,
each generalization tier uses data flow information obtained from its
preceding more specific generalization tier. In this way, the basic block
tier generalizes information obtained at the instruction tier, the procedure
tier generalizes information obtained at the basic block tier, and so on. The
algorithms used to generalize information at each generalization tier can have
a direct impact on the accuracy of the information that can be obtained when
used during data flow analysis. The subject of accuracy will be addressed for
each specific tier.

To obtain generalized data flow information, a set of target executable image
files, or modules, must be defined. The target modules serve to define the
context from which data flow information will be obtained and generalized.
The general process used to accomplish this involves visiting each procedure
within each module. For each procedure, data flow information is collected at
the instruction tier and is then generalized to each less-specific tier. To
facilitate the reachability algorithm, it is assumed that as the data flow
information is collected, it is persisted in a form that can be accessed on
demand. The process described in this paper assumes a normalized database
is used to contain the data flow information found at each generalization
tier. In this manner, the upper limit associated with the number of target
modules is tied to the amount of available persistent storage with respect to
the amount required by a given data flow problem.

Before proceeding, it is important to point out that while this paper
describes explicit algorithms for generalizing at each tier, it is entirely
possible to substitute alternative algorithms. This serves to illustrate that
the concept of generalizing information along generalization tiers is
sufficiently abstract to support representing alternate forms of data
flow and control flow information. By using different algorithms, it is
possible to convey different forms of data flow relationships which vary in
terms of precision and accuracy.

2.1) Instruction Tier

Generalizing data flow information presupposes that there is data flow
information to generalize. As such, a base set of data flow information must
be collected first. For the purposes of this paper, the most specific data
flow information is collected at the instruction tier using the Static Single
Assignment (SSA) implementation provided by Microsoft's Phoenix framework,
though other algorithms could just as well be used[11]. SSA is an elegant
solution to the problem of representing data flow information in a
flow-sensitive manner. Each definition and use of a given variable is
defined in terms of a unique variable version which makes it possible to show
clear, unambiguous data flow relationships. In cases where data flow
information may merge along control flow paths, SSA makes use of a phi
function which acts as a pseudo-instruction to represent the merge point.
Obtaining distinct data flow paths at the instruction tier can be accomplished
by traversing an SSA graph for a given procedure starting from each root
variable, which has no prior definitions, and proceeding until each reachable
leaf variable, which has no subsequent uses, is encountered along each data
flow path. The end result of this traversal is the complete set of data flow
paths found within the context of a given procedure.
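
As a rough illustration of the traversal just described, the sketch below
enumerates every root-to-leaf path in a small def-use graph. It does not use
Phoenix; the def_use dictionary and the variable version names are
hypothetical stand-ins for whatever an SSA implementation would provide.

```python
def data_flow_paths(def_use):
    """Enumerate root-to-leaf paths in a def-use graph.

    def_use maps each variable version to the versions that use it
    (an assumed representation chosen for this example).
    """
    used = {v for uses in def_use.values() for v in uses}
    nodes = set(def_use) | used
    roots = sorted(nodes - used)          # versions with no prior definition

    paths = []

    def walk(node, path):
        # Skip versions already on the path so loop-carried phi edges
        # cannot cause unbounded recursion.
        successors = [n for n in def_use.get(node, []) if n not in path]
        if not successors:                # leaf: no subsequent uses
            paths.append(path)
            return
        for nxt in successors:
            walk(nxt, path + [nxt])

    for root in roots:
        walk(root, [root])
    return paths


example = {"x0": ["t1", "t2"], "c0": ["t1"], "t2": ["r3"]}
for path in data_flow_paths(example):
    print(" -> ".join(path))
```

The number of distinct paths can grow quickly with the size of a procedure,
which is part of the motivation for generalizing them at the tiers that follow.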

One of SSA's limitations is that it is only designed to work intraprocedurally
and therefore makes no effort to describe the behavior of passing data between
procedures, such as through input and output parameters. In order to provide
an accurate, distinct path data flow representation, one must take into
account interprocedural data flow. One method of accomplishing this is to
generalize the concept of SSA's phi function and use it to represent formal
parameters. In this way, the phi function can be used to represent data flow
merges that happen as a result of data passing as input or output parameters
when a procedure is called. A phi function can be created to represent each
formal input and output parameter for a procedure, thus linking definitions of
parameters at a call site to actual parameter uses in a callee. Reps,
Horwitz, and Sagiv describe a concept similar to this[13].

In addition to using phi functions to link the definitions and uses of formal
parameters, it is also necessary to fracture data flow paths at call sites
that are found within a procedure. This is necessary because data flow paths
collected using SSA information will convey a relationship between the input
parameters passed to a procedure and the output parameters returned by a
procedure. This is the case because a call instruction at a call site appears
to use input parameters and define output parameters, thus creating an
implicit link between input and output parameters. Since SSA information is
obtained intraprocedurally, it is not possible to know in advance whether or
not an input parameter will influence an output parameter.

To fracture a data flow path, the instructions that define input parameters
passed at a given call site are instead linked directly to the associated
formal input parameter phi functions that are found in the context of the
target procedure. Likewise, instructions that use output parameters
previously defined by the call instruction are instead linked directly to the
associated formal output parameter phi functions found in the context of the
target procedure. This has the effect of breaking the original data flow path
into two disconnected data flow paths at the call site location. The linking
of actual parameters and call site parameters with formal parameters has been
illustrated in previous literature. Horwitz, Reps, and Binkley used this
concept during the construction of a system dependence graph (SDG)[7]. The
concept of creating symbolic variables that are later used to link information
together is not new[14]. Figure 2 provides an example of what a conventional and
fractured data flow path might look like.

  Conventional          Fractured

  .---------.          .---------.
  | ldarg.0 |          | ldarg.0 |
  `---------`          `---------`
       |                    |
       V                    V
  .---------.          .---------.
  | call g  |          | fin(x)  |
  `---------`          `---------`
       |
       |           ------------------
       V
  .---------.          .---------.
  | stloc.0 |          | fout(g) |
  `---------`          `---------`
                            |
                            V
                       .---------.
                       | stloc.0 |
                       `---------`

                 Figure 2
  Fracturing a data flow path at a call
  site. Call instructions no longer act
  as the producers or receivers of data
  that is passed between procedures.
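
The fracturing step shown in figure 2 can be approximated in a few lines. The
sketch below is only an illustration under simplifying assumptions: a path is
a list of (opcode, operand) tuples, each call is treated as passing a single
input and returning a single output, and the fracture helper is a name
invented for this example.

```python
def fracture(path):
    """Split an instruction data flow path at every call site.

    Each ("call", callee) entry is replaced by a fin pseudo instruction that
    terminates the current path and a fout pseudo instruction that starts the
    next one, so the call no longer links its inputs to its outputs.
    """
    pieces, current = [], []
    for opcode, operand in path:
        if opcode == "call":
            current.append(("fin", operand))   # data flows into the callee
            pieces.append(current)
            current = [("fout", operand)]      # data flows out of the callee
        else:
            current.append((opcode, operand))
    pieces.append(current)
    return pieces


conventional = [("ldarg.0", None), ("call", "g"), ("stloc.0", None)]
for piece in fracture(conventional):
    print(piece)
```

A fuller version would create one fin or fout node per formal parameter
instead of one per callee, as the text describes.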

Using the fracturing concept, the instruction tier's path-sensitive data flow
information for a given procedure becomes disconnected. This helps to improve
the overall accuracy of the data flow paths that are conveyed. Fracturing
also has the added advantage of making it possible to use formal parameter phi
functions to dynamically link a caller and a callee at runtime. This makes it
possible to identify context-sensitive interprocedural data flow paths at the
granularity of an instruction. This ability will be described in more detail
when the reachability algorithm is described in section 3.

With an understanding of the benefits of fracturing, it is now possible to
define the general form that data flow paths may take at the instruction tier.
This general form is meant to describe the structure of data flow paths at the
instruction tier in terms of the potential set of origin, transient, and
terminal points with respect to the general instruction types. Based on the
description given above, it is possible to categorize instructions into a few
general types. Using these general instruction types, the general form of
instruction data flow paths can be captured as illustrated by the diagram in
figure 3.

  value:   Defines or uses a data value
  compare: Compares a data value
  fin:     Pseudo instruction representing a formal input parameter
  fout:    Pseudo instruction representing a formal output parameter

.---------. .---------. .---------.
|  value  | |   fin   | |  fout   |
`---------` `---------` `---------`
      \          |            /
       `---------|-----------`
                 V
         .--------------.
         |    value     |
         |   compare    |
         |     ...      |
         `--------------`
                 |
    .------------|---------.-------------.
    V            V         V             V
.---------. .---------. .---------. .---------.
|  value  | |   fin   | |  fout   | | compare |
`---------` `---------` `---------` `---------`

                    Figure 3
General forms of data flow paths at the instruction tier

Based on this general description of instruction data flow paths, it is
helpful to consider a concrete example. Consider the example source code
described below which shows the implementation of the f function.

    static public int f(int x)
    {
        return (x >= 0) ? g(x) : x + 1;
    }

This function is intentionally very simple so as to limit the number of data
flow paths that must be represented visually. Using the concepts described
above, the instruction data flow paths that would be created as a result of
analyzing this procedure are shown in figure 4. Note that the call site for
the g function results in two disconnected data flow paths. The end result is
that there are four unique data flow paths within this procedure.

                  .----------.
                  | fin(f,x) |
                  `----------`
                  /   /     |
                 /   /      |
                V   |       V
       .----------. |   .----------.  .----------.
       | ldarg.0  | |   | ldarg.0  |  |   ldc    |
       `----------` |   `----------`  `----------`
          |         |         |        /
          |         |         |       /
          |         |         V      V
.------.  |         |      .----------.   .----------.
| ldc  |  |         |      |   add    |   | fout(g)  |
`------`  |         |      `----------`   `----------`
    \     |         |           \   \      /
     \    |         |            \   \    /
      V   V         V             V   V  V
    .-------.  .----------.      .--------.
    | brcmp |  | ldarg.0  |      |  ret   |
    `-------`  `----------`      `--------`
                    |                |
                    |                |
                    V                V
               .----------.     .---------.
               | fin(g,x) |     | fout(f) |
               `----------`     `---------`

                        Figure 4
 Instruction tier data flow paths for the example code.
 The context of these data flow paths is the f function.

2.2) Basic Block Tier

Once the complete set of data flow paths is identified at the instruction
tier for a given procedure, the next step is to generalize data flow
information to the basic block tier. At the basic block tier, instruction
data flow paths should be generalized to show path-sensitive data flow
interactions between basic blocks rather than instructions. This level of
generalization reduces the amount of information needed to represent data flow
paths. For example, there are many cases where data will be passed between
multiple instructions within the same basic block. Using basic block tier
generalization, those individual operations can be generalized and represented
as a single basic block. The generalized basic block data flow paths can then
be persisted for subsequent use when determining reachability in much the same
fashion that was used at the instruction tier.

Since the instruction tier's data flow information has been fractured and
parameters passed at call sites have been tied to phi functions, an approach
must be defined to preserve this information at the basic block tier during
generalization. An easy way of preserving this information is to define the
formal parameters which represent input and output parameters as being
contained within distinct pseudo blocks. For example, the phi functions
representing formal input parameters can exist within a formal entry pseudo
block. Likewise, the phi functions representing formal output parameters can
exist within a formal exit pseudo block. Both pseudo blocks can then be tied
to the procedure associated with the formal parameters. Defining the
underlying instruction tier phi functions in this way makes it trivial to
retain information that will be needed to define context-sensitive
interprocedural data flow at less-specific generalization tiers. Like the
instruction tier, it is possible to dynamically link data passed to a pseudo
block in a caller's context to subsequent uses in a callee's context. Figure
5 shows the general form that basic block data flow paths may take.

.---------.   .---------.   .---------.
|   fin   |   |  fout   |   |  block  |
`---------`   `---------`   `---------`
         \         |         /
     .----`--------+--------'----.
     V             V             V
.---------.   .---------.   .---------.
|   fin   |   |  fout   |   |  block  |
`---------`   `---------`   `---------`

                  Figure 5
General forms of data flow paths at the basic block tier

The act of generalizing instruction data flow paths means that two or more
distinct instruction data flow paths may produce the same basic block data
flow path. When this occurs, only one basic block data flow path should be
defined since it will effectively capture the information conveyed by the set
of distinct instruction data flow paths. Each corresponding instruction data
flow path should still be associated with a single basic block data flow path.
This association makes it possible to show the set of instruction data flow
paths that have been generalized by a specific basic block data flow path.
The association can be persisted in a normalized database by creating a
one-to-many link table between basic block and instruction data flow paths.
Figure 6 provides an example of what would happen when generalizing the
instruction data flow paths described in figure 4.

               .----------.
               | fin(f,x) |
               `----------`
                    |
     .--------------+--------------.
     V              V              V
.----------.   .----------.   .----------.
| block 1  |   | block 2  |   | block 3  |
`----------`   `----------`   `----------`
     |              |              |
     V              |              |
.----------.        |              |
| fin(g,x) |        |              |
`----------`        |              |
                    V              V
               .----------.   .----------.
               | fout(g)  |-->| block 4  |
               `----------`   `----------`
                                   |
                                   V
                              .----------.
                              | fout(f)  |
                              `----------`

                        Figure 6
Basic block tier data flow paths obtained by generalizing
the instruction data flow paths described in figure 4. The
context for these data flow paths is the f function.
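
One way to picture this generalization step is sketched below. It assumes a
hypothetical container_of map from each instruction to its basic block (or
pseudo block) and simply collapses consecutive instructions that share a
block. The returned dictionary doubles as the one-to-many association between
a basic block path and the instruction paths it generalizes. The node names
are invented for illustration.

```python
from itertools import groupby


def generalize(paths, container_of):
    """Generalize paths to the next less-specific tier.

    paths:        iterable of tuples of nodes at the more specific tier.
    container_of: maps each node to its containing element at the next tier
                  (an assumed input, for example instruction -> basic block).
    Returns {generalized_path: [specific paths it stands for]}.
    """
    generalized = {}
    for path in paths:
        containers = [container_of[node] for node in path]
        # Collapse runs of nodes that live in the same container.
        key = tuple(name for name, _ in groupby(containers))
        generalized.setdefault(key, []).append(path)
    return generalized


instruction_paths = [("i1", "i3"), ("i2", "i3"), ("i1", "i2")]
block_of = {"i1": "block A", "i2": "block A", "i3": "block B"}
for block_path, members in generalize(instruction_paths, block_of).items():
    print(block_path, "<-", members)
```

Two of the three instruction paths collapse into the single path from block A
to block B, while the third never leaves block A. The same function could be
reused at less-specific tiers by swapping in a different container map.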

2.3) Procedure Tier

Generalizing data flow paths from the basic block tier to the procedure tier
further reduces the amount of information needed to show data flow behavior.
Procedure tier data flow paths are meant to show how data is passed between
procedures through formal parameters. This covers scenarios such as passing a
procedure's formal input parameter to a child procedure's formal input
parameter, using the formal output parameter of a child procedure as the
formal input parameter to another called procedure, and so on. These
behaviors are all represented within the context of a particular procedure.

Based on these constraints, only two classes of basic block data flow paths
need to be considered. The first class involves data traveling from any block
to a formal input or output parameter, thus showing interprocedural flows.
The second class involves data traveling from a formal input or formal output
parameter to a terminal point in a procedure. This effectively eliminates any
intraprocedural data flows that are not carried over to another procedure in
some form. Since data flow information about which formal parameters are used
or defined is conveyed by basic block data flow paths, it is possible to
simply generalize this data flow information to show data flowing to formal
parameters within the context of a given procedure. While it may be tempting
to think that one must only show data flow paths between two formal
parameters, it is also necessary to show data flow paths that originate from
data that is locally defined within a procedure, such as through a local
variable which is not populated by a formal parameter. As such, the general
form that data flow paths may take at the procedure tier is illustrated by
figure 7. Figure 8 provides an example of what would happen when generalizing
the basic block data flow paths described in figure 6.

.---------.   .---------.   .---------.
|   fin   |   |  fout   |   | origin  |
`---------`   `---------`   `---------`
         \         |         /
     .----`--------+--------'----.
     V             V             V
.---------.   .---------.   .---------.
|   fin   |   |  fout   |   | origin  |
`---------`   `---------`   `---------`

                  Figure 7
General forms of data flow paths at the procedure tier

.----------.   .-----------.   .----------.
| fin(f,x) |   | origin(f) |   | fout(g)  |
`----------`   `-----------`   `----------`
   |       \         |           /
   |        `--------+----------'
   V                 V
.----------.   .-----------.
| fin(g,x) |   |  fout(f)  |
`----------`   `-----------`

                     Figure 8
Procedure tier data flow paths obtained by generalizing the
basic block data flow paths described in figure 6. The context
for these data flow paths is the f function.

Procedure data flow paths may generalize multiple basic block data flow paths
and thus can make use of a one-to-many link table to illustrate this
association. While generalizing data flow paths to the procedure tier is
trivial, the challenging aspect comes when determining reachability. This
will be discussed in more detail in section 3.
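
The selection of basic block data flow paths for the procedure tier might be
sketched as follows. The representation is assumed rather than taken from the
paper: formal parameters are strings beginning with fin( or fout(, locally
defined data is labeled origin(name), and the terminal(name) label for the
second class of paths is purely a placeholder chosen for this example.

```python
def is_formal(node):
    """True for the pseudo blocks that stand for formal parameters."""
    return node.startswith(("fin(", "fout("))


def procedure_paths(block_paths, procedure):
    """Reduce basic block paths to procedure tier paths for one procedure."""
    result = set()
    for path in block_paths:
        first, last = path[0], path[-1]
        if is_formal(last):
            # First class: any block flowing to a formal parameter. Data that
            # does not start at a formal parameter is a local origin.
            origin = first if is_formal(first) else f"origin({procedure})"
            result.add((origin, last))
        elif is_formal(first):
            # Second class: a formal parameter flowing to a terminal point.
            result.add((first, f"terminal({procedure})"))
        # Purely local flows are dropped.
    return result


block_paths_f = [
    ("fin(f,x)", "block 1"),
    ("block 1",),
    ("fin(f,x)", "block 2", "fin(g,x)"),
    ("fin(f,x)", "block 3", "block 4", "fout(f)"),
    ("block 3", "block 4", "fout(f)"),
    ("fout(g)", "block 4", "fout(f)"),
]
for src, dst in sorted(procedure_paths(block_paths_f, "f")):
    print(src, "->", dst)
```

The block path list above is a hand-written reading of figure 6 and may not
match the exact set of paths the author's implementation would produce.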

2.4) Data Type Tier

Using data flow information obtained from the procedure tier, it is sometimes
possible, depending on language features, to generalize data flow information
to the data type tier. Generalizing to the data type tier is meant to show
how formal parameters are passed between data types within the context of a
given data type. This relies on the underlying language having the ability to
associate procedures with data types. For example, object-oriented languages
are all capable of associating procedures with data types, such as through
classes defined in C++, C#, and other languages. In the case of languages
where data types do not have procedures, it may instead be possible to
associate procedures with the name of the source file that contains them. In
both cases, it is possible to show formal parameters passing between elements
that act as containers for procedures, regardless of whether the underlying
elements are true data types.

The benefit of generalizing data flow information at the data type tier is
that it helps to further reduce the amount of data flow information that must
be represented. Since the small example source code that has been used to
illustrate generalizations at each tier only involves passing formal
parameters within the same data type, it is useful to consider an alternative
example which involves passing data between multiple data types.

    class Company {
        void AddEmployee(int num) {
            Person employee = new Person(num);
            employees.Add(employee);
            Console.WriteLine("New employee {0}", employee);
        }
        int EmployeeCount() {
            return employees.Count;
        }
        private ArrayList employees;
    }

Figure 9 shows the data type data flow paths for the example code shown above.
It is important to note that unlike previous tiers, the specific formal
parameters that are being passed between types are not preserved. Instead,
only the fact that formal parameters are passed between data types is
retained. In this manner, fin indicates a data type's formal input parameter
and fout indicates a data type's formal output parameter.

.---------------------.      .--------------.
| fin(System.Console) |<-----| fout(Person) |
`---------------------`      `--------------`
                                     |
.-----------------------.           /
| fin(System.ArrayList) |<---------`
`-----------------------`

.--------------.
| fin(Company) |
`--------------`
       |
       V
.-------------.
| fin(Person) |
`-------------`

.------------------------.     .---------------.
| fout(System.ArrayList) |---->| fout(Company) |
`------------------------`     `---------------`

                     Figure 9
Data type tier data flow paths obtained by generalizing the
procedure tier data flow paths. The context for these data
flow paths is the Company data type.

In a fashion much the same as previous generalization tiers, a single data
type data flow path can represent multiple underlying procedure data flow
paths. Each generalized procedure data flow path can be associated with its
corresponding data type data flow path through a one-to-many link table in a
normalized database.
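
The paper does not give a schema, so the following is only one possible layout
for the link tables mentioned at each tier, sketched with Python's sqlite3
module. The table names, column names, and path descriptions are all
assumptions made for this example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tier (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE path (
    id          INTEGER PRIMARY KEY,
    tier_id     INTEGER NOT NULL REFERENCES tier(id),
    description TEXT NOT NULL
);
-- One general path links to many specific paths; the UNIQUE constraint
-- keeps each specific path tied to a single general path.
CREATE TABLE path_link (
    general_id  INTEGER NOT NULL REFERENCES path(id),
    specific_id INTEGER NOT NULL UNIQUE REFERENCES path(id),
    PRIMARY KEY (general_id, specific_id)
);
"""


def open_store(location=":memory:"):
    db = sqlite3.connect(location)
    db.executescript(SCHEMA)
    return db


def add_path(db, tier, description, generalized_by=None):
    """Insert a path and optionally link it to the path that generalizes it."""
    db.execute("INSERT OR IGNORE INTO tier (name) VALUES (?)", (tier,))
    tier_id = db.execute(
        "SELECT id FROM tier WHERE name = ?", (tier,)).fetchone()[0]
    path_id = db.execute(
        "INSERT INTO path (tier_id, description) VALUES (?, ?)",
        (tier_id, description)).lastrowid
    if generalized_by is not None:
        db.execute("INSERT INTO path_link VALUES (?, ?)",
                   (generalized_by, path_id))
    return path_id


def specific_paths(db, general_id):
    """Return the descriptions of the paths generalized by general_id."""
    rows = db.execute(
        "SELECT p.description FROM path_link l "
        "JOIN path p ON p.id = l.specific_id "
        "WHERE l.general_id = ? ORDER BY p.id", (general_id,))
    return [row[0] for row in rows]


db = open_store()
type_path = add_path(db, "data type", "fin(Company) -> fin(Person)")
add_path(db, "procedure", "fin(AddEmployee,num) -> fin(Person,num)",
         generalized_by=type_path)
print(specific_paths(db, type_path))
```

Keeping every tier in the same pair of tables means a query can walk from a
data type path down to its procedure paths, and from there to more specific
tiers, on demand.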

2.5) Module Tier

Generalizing data flow information to the module tier is meant to show how
data flows between distinct modules. As with each step in the generalization
process, the module tier data flow paths lose much of the information that is
conveyed at more specific tiers. Figure 10 shows module tier data flow paths
that would be defined when generalizing the data type data flow paths
illustrated in figure 9.

.------------------.    .-----------------.
| fin(Company.dll) |    | fin(System.dll) |
`------------------`    `-----------------`
         |                       ^
         V                       |
.-----------------.     .------------------.
| fin(Person.dll) |     | fout(Person.dll) |
`-----------------`     `------------------`

.------------------.      .-------------------.
| fout(System.dll) |----->| fout(Company.dll) |
`------------------`      `-------------------`

                    Figure 10
Module tier data flow paths obtained by generalizing the
data type tier data flow paths. The context for these data
flow paths is the Company.dll module.

2.6) Abstract Tiers

Once data flow paths have been generalized from the instruction tier through
the module tier, it is no longer possible to create additional concrete
generalizations for most runtime environments. An exception to this is managed
code which has an additional concrete assembly tier. Even though it may not
be possible to establish concrete generalizations, it is possible to define
abstract generalizations. An abstract generalization attempts to show data
flow relationships between abstract elements. A good example of an abstract
element would be a logical component which is defined in the architecture of a
given application. For example, a VPN client application might be composed of
a user interface component and a networking component, each of which may
consist of multiple concrete modules. By defining logical components and
associating concrete modules with each component, it is possible to further
generalize information beyond the module tier.

Given the example described above, it may be prudent to define two abstract
generalization tiers. The first abstract tier is the component tier. In this
context, a component is defined as a logical software component that contains
one or more concrete modules. The component tier makes it possible to
illustrate data flow between conceptual components within an application as
derived from how data flows between concrete modules. The second abstract
tier is the application tier. The application tier can be used to illustrate
how data is passed between conceptual applications. For example, a web
browser application passes data in some form to a web server application, both
of which consist of conceptual components which, in turn, consist of concrete
modules.
|
||
|
||
The caveat with abstract generalization tiers is that it must be possible to
|
||
illustrate data flow between what may otherwise be disjoint concrete elements.
|
||
The reason for this is that, often times, the paths that data will take
|
||
between two modules which belong to different logical components will be
|
||
entirely indirect with respect to one another. For this reason, it is
|
||
necessary to devise a mechanism to bridge data flow paths between concrete
|
||
software elements that belong to each logical component or application. A
|
||
particularly useful example of an approach that can be taken to bridge two
|
||
distinct components can be found in web services.
|
||
|
||
In a web services application, it is often common to have a client component
|
||
and a server component. The two components pass data to one another through
|
||
an indirect channel, such as through a web request. For this reason, it is
|
||
not immediately possible to show direct data flow paths from a web client
|
||
component to a web service component. To solve this problem, one can define a
|
||
mechanism that bridges the formal parameters associated with the web service
|
||
method that is being invoked. In this manner, the the formal input parameters
|
||
for a web service method found on the client side can be implicitly linked and
|
||
shown to define the formal input parameters received on the web service
|
||
side. By illustrating data flow at a concrete tier, it is possible to
|
||
generalize data flow behaviors all the way up through the abstract tiers.
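One way to picture such a bridge is as a pass that adds implicit edges between
matching formal input parameters. The sketch below assumes that client-side
proxy methods and server-side web methods have already been identified and
can be paired by method name and parameter index. The tuple layout, the
function name, and the method names are hypothetical and are not drawn from
the author's implementation.

```python
from collections import namedtuple

# Hypothetical vertex: formal input parameter `index` of `method` in `component`.
FormalIn = namedtuple("FormalIn", ["component", "method", "index"])

def bridge_web_methods(client_params, service_params):
    """Return implicit edges that link each client-side formal input to the
    server-side formal input with the same method name and parameter index."""
    by_key = {(p.method, p.index): p for p in service_params}
    edges = []
    for param in client_params:
        target = by_key.get((param.method, param.index))
        if target is not None:
            edges.append((param, target))
    return edges

# Illustrative data: one client proxy method and two exposed web methods.
client = [FormalIn("Web Client", "ExecuteCommand", 0)]
service = [FormalIn("Web Service", "ExecuteCommand", 0),
           FormalIn("Web Service", "Ping", 0)]

for src, dst in bridge_web_methods(client, service):
    print(f"fin({src.component}.{src.method}, {src.index}) -> "
          f"fin({dst.component}.{dst.method}, {dst.index})")
```

Because the implicit edges are ordinary data flow paths at the concrete tier,
they are carried upward by the same generalization steps as every other path.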
|
||
|
||
The benefit of describing data flow behavior at abstract tiers is that it
makes it possible to derive data flow behaviors between abstract software
elements rather than strictly focusing on concrete software elements. This is
useful when attempting to view an application's behavior at a glance rather
than worrying about the specific details relating to how data is passed. For
example, this could be used to help validate threat models which describe how
data is expected to be passed between abstract components within an
application.

When generalizing information at abstract tiers, the only information that can
be conveyed, at least based on the approach described thus far, is whether or
not a component or application is passing data through a formal input or
formal output parameter. The specifics of which formal parameters are passed
are no longer available for use in generalization. Using the example shown at
the data type tier, one might assume the following component associations:
Company.dll and Person.dll, which contain the Company data type and Person
data type, are part of the user interface component of a human resources
application. The classes used from system libraries can be generically
grouped as belonging to an external library component. Using these groupings,
the component data flow paths may be represented as shown in figure 11.

     .------------------------.     .-----------------------.
     | fout(External Library) |     | fin(External Library) |
     `------------------------`     `-----------------------`
                  |                             ^
                  V                             |
      .----------------------.       .---------------------.
      | fout(User Interface) |       | fin(User Interface) |
      `----------------------`       `---------------------`

                                Figure 11
        Component tier data flow paths obtained by generalizing the
       module tier data flow paths. The context for these data flow
                  paths is the user interface component.

As with all previously described generalization tiers, a single component data
flow path may represent multiple module data flow paths. The single component
data flow path should be associated with each corresponding module data flow
path through a one-to-many link table in a normalized database.
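The component generalization step can be sketched as a simple grouping pass.
The module-to-component mapping and the module paths below are illustrative
assumptions based loosely on the example above, and the function and variable
names are hypothetical. Dropping paths that stay inside one component is also
an assumption of this sketch rather than a rule stated in the text.

```python
from collections import defaultdict

# Assumed component membership for each concrete module.
COMPONENT_OF = {
    "Company.dll": "User Interface",
    "Person.dll": "User Interface",
    "System.dll": "External Library",
}

# Illustrative module tier paths: ((kind, module), (kind, module)).
MODULE_PATHS = [
    (("fout", "System.dll"), ("fout", "Company.dll")),
    (("fout", "System.dll"), ("fout", "Person.dll")),
    (("fin", "Company.dll"), ("fin", "System.dll")),
    (("fin", "Company.dll"), ("fin", "Person.dll")),
]

def generalize_to_components(module_paths, component_of):
    """Group module paths under the component path that generalizes them.

    The returned dict doubles as the one-to-many link table: each key is a
    component path and each value lists the module paths it represents.
    """
    links = defaultdict(list)
    for (src_kind, src_mod), (dst_kind, dst_mod) in module_paths:
        src = (src_kind, component_of[src_mod])
        dst = (dst_kind, component_of[dst_mod])
        if src[1] != dst[1]:  # skip paths that never leave the component
            links[(src, dst)].append(((src_kind, src_mod), (dst_kind, dst_mod)))
    return dict(links)

links = generalize_to_components(MODULE_PATHS, COMPONENT_OF)
for (src, dst), members in links.items():
    print(f"{src[0]}({src[1]}) -> {dst[0]}({dst[1]}): {len(members)} module path(s)")
```

Four module paths collapse into two component paths here, and the dictionary
values retain enough information to expand each component path again later.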
|
||
|
||
3) Reachability

The real benefit of the generalizations described in the previous section can
be realized when attempting to solve a graph reachability problem. By
generalizing data flow behaviors to both abstract and concrete generalization
tiers, it is possible to reduce the amount of information that must be
represented when attempting to determine graph reachability. This is further
improved by the fact that data flow paths found at less-specific
generalization tiers can be used to progressively qualify potential data flow
paths at more-specific generalization tiers. This qualification is possible
because less-specific data flow paths are associated with more-specific data
flow paths at each generalization tier through a one-to-many link table, thus
permitting trivial expansion. The benefit of qualifying data flow paths in
this fashion is that only the minimal set of information needed to determine
reachability must be considered at once at each generalization tier. This can
drastically reduce the physical resources required to solve a graph
reachability problem by effectively limiting the size of a graph. This
general approach is captured by the Progressive Qualified Elaboration (PQE)
algorithm shown below. This concept is very similar to the ideas outlined
by Schultes' highway hierarchy, which is used to optimize fast path discovery
when identifying travel routes in road networks[16].

   PQE(Elements, SourceDescriptor, SinkDescriptor)
      Paths := {}
      Graph := BuildGraph(Elements)

      while Graph != {}
         SourceVertices  := Vertices(Graph, SourceDescriptor)
         SinkVertices    := Vertices(Graph, SinkDescriptor)
         Paths           := Reachability(Graph, SourceVertices, SinkVertices)
         ElaboratedPaths := Elaborate(Paths)
         Graph           := BuildGraph(ElaboratedPaths)
      end

      return Paths
   end

For the purposes of this paper, graph reachability is restricted to
determining realizable paths between two flow descriptors: a source and a
sink. A flow descriptor provides information that is needed to identify
corresponding vertices within a graph at each generalization tier. The tables
in figure 12 and figure 13 show the information needed to identify source and
sink vertices at each generalization tier for the example that will be
described in this section.

The PQE algorithm itself requires three parameters. The first parameter,
Elements, contains the set of generalized elements to be analyzed. For
example, it may contain the set of target modules that should be analyzed.
The second and third parameters, SourceDescriptor and SinkDescriptor,
represent the source and sink flow descriptors, respectively.

The first step taken by the algorithm is to define Paths as an empty set.
Paths will be used to contain the set of reachable paths between an actual set
of sources and sinks at a given generalization tier. After Paths has been
initialized, Graph is initialized to a flow graph that conveys data flow
relationships between the set of elements provided in Elements. The approach
taken to construct the flow graph involves retrieving persisted data flow
information for the appropriate generalization tier. Once Paths and Graph
have been initialized, the qualified elaboration process can begin.

For each loop iteration, a check is made to see if Graph is an empty graph
(contains no vertices). If Graph is empty, the loop terminates. If Graph is
not an empty graph, reachable paths between the actual set of sources and
sinks are determined. This is accomplished by first identifying the vertices
in Graph that are associated with the flow descriptors SourceDescriptor and
SinkDescriptor at the current generalization tier. The actual sets of sources
and sinks found to be associated with these descriptors are stored in
SourceVertices and SinkVertices, respectively. With the set of actual source
and sink vertices identified, a reachability algorithm, Reachability(), can be
used to determine the set of reachable paths in the flow graph between the two
sets of vertices. The result of this determination is stored in Paths. The
final step in the iteration involves using qualified elaboration to construct
a new flow graph containing more-specific data flow paths which are qualified
by the set of data flow paths encountered in the reachable paths found in
Paths. This set is then elaborated to a subset that contains the associated
data flow paths from the next, more specific tier, such as by elaborating to a
subset of basic block data flow paths from a more general set of procedure
data flow paths. The result of the elaboration is stored in ElaboratedPaths.
Finally, a new flow graph is constructed and stored in Graph using the
elaborated set of flow paths contained within ElaboratedPaths.

When it is not possible to obtain a more-detailed flow set, such as when the
instruction tier is reached, an empty graph is created and the algorithm
completes by returning Paths. In the final iteration, Paths contains the most
detailed set of reachable data flow paths found between the source and sink
flow descriptors. The benefit of approaching graph reachability problems in
this fashion is that only a subset of the elements at any generalization tier
needs to be considered at once. These subsets are dictated by the set of
reachable data flow paths found at each preceding generalization tier. In
this manner, the subset of procedure data flow paths that need to be
considered would be effectively qualified by the set of data types and modules
found to be involved in data flow paths between the source and sink flow
descriptors at less-specific tiers.
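The loop described above can be sketched in a few lines of Python. This is a
simplified illustration under several assumptions: paths are plain
(source, sink) edges, each flow descriptor is a list holding one vertex label
per tier, the link tables are a single dictionary from a path to the
more-specific paths it generalizes, and reachability is an ordinary
forward/backward search rather than a context-sensitive one. The module and
procedure labels are hypothetical.

```python
from collections import deque

def build_graph(edges):
    """Adjacency map built from an iterable of (source, sink) edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph

def reachability(graph, sources, sinks):
    """Return the edges that lie on some path from a source to a sink."""
    def closure(starts, neighbors):
        seen, queue = set(starts), deque(starts)
        while queue:
            for nxt in neighbors(queue.popleft()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    reverse = {v: set() for v in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].add(src)
    forward = closure(sources, lambda v: graph[v])    # reachable from a source
    backward = closure(sinks, lambda v: reverse[v])   # able to reach a sink
    return {(s, d) for s in graph for d in graph[s]
            if s in forward and d in backward}

def pqe(edges, source_desc, sink_desc, links):
    """Progressively qualify paths, starting at the least specific tier."""
    paths, tier = set(), 0
    graph = build_graph(edges)
    while graph:
        sources = [v for v in graph if v == source_desc[tier]]
        sinks = [v for v in graph if v == sink_desc[tier]]
        paths = reachability(graph, sources, sinks)
        # Elaborate: expand each reachable path through the link table.
        elaborated = {e for p in paths for e in links.get(p, ())}
        graph = build_graph(elaborated)
        tier += 1
    return paths

# Illustrative two-tier data set (module tier, then procedure tier).
MODULE_EDGES = [
    ("fout(System.Web.dll)", "fin(WebClient.dll)"),
    ("fin(WebClient.dll)", "fin(WebService.dll)"),
    ("fin(WebService.dll)", "fin(System.dll)"),
    ("fout(System.Xml.dll)", "fin(WebClient.dll)"),  # not fed by the source
]
LINKS = {
    MODULE_EDGES[0]: [("fout(get_QueryString, 0)", "fin(WebClient.ExecuteCommand, 0)")],
    MODULE_EDGES[1]: [("fin(WebClient.ExecuteCommand, 0)", "fin(WebService.ExecuteCommand, 0)")],
    MODULE_EDGES[2]: [("fin(WebService.ExecuteCommand, 0)", "fin(Start, 0)")],
    MODULE_EDGES[3]: [("fout(XmlReader.Read, 0)", "fin(WebClient.Log, 0)")],
}
SOURCE = ["fout(System.Web.dll)", "fout(get_QueryString, 0)"]
SINK = ["fin(System.dll)", "fin(Start, 0)"]

result = pqe(MODULE_EDGES, SOURCE, SINK, LINKS)
for src, dst in sorted(result):
    print(src, "->", dst)
```

In this data set the System.Xml.dll path is discarded at the module tier, so
its procedure tier path is never loaded into the second graph.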
|
||
|
||
For the purposes of this paper, the algorithm is designed to consider
realizable paths at each generalization tier in a manner that is similar to
the concept described by Reps et al. This involves traversing the graph in a
context-sensitive fashion. To accomplish this, the algorithm keeps a scope
stack at each generalization tier. The scope may be an assembly, a type, or a
procedure. When data is passed through to a formal input parameter, the scope
for the formal input parameter is pushed onto the stack. When data is
returned through a formal output parameter to another location, the
algorithm ensures that the scope that is being returned to is the parent
scope. In this way, only realizable paths are considered at each
generalization tier, which limits the number of paths that must be
considered and also has the benefit of producing more accurate results.
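A minimal sketch of the scope stack check follows. It assumes that a path is
given as a sequence of steps, where ("fin", scope) means data enters a scope
through a formal input parameter and ("fout", scope, return_to) means data
leaves that scope through a formal output parameter toward return_to. This
step encoding, the function name, and the scope names are hypothetical, and
the check is far simpler than the formulation given by Reps et al.

```python
def is_realizable(path, root_scope):
    """Check that every return goes back to the scope that passed the data in."""
    stack = [root_scope]
    for step in path:
        if step[0] == "fin":
            # Data enters a scope through a formal input parameter.
            stack.append(step[1])
        elif step[0] == "fout":
            _, scope, return_to = step
            # Data leaves `scope`; its parent on the stack must be the target.
            if len(stack) < 2 or stack[-1] != scope or stack[-2] != return_to:
                return False
            stack.pop()
    return True

# Data flows from Main into Helper and back to Main: realizable.
good = [("fin", "Helper"), ("fout", "Helper", "Main")]
# Data flows from Main into Helper but "returns" to Other: not realizable.
bad = [("fin", "Helper"), ("fout", "Helper", "Other")]

print(is_realizable(good, "Main"))
print(is_realizable(bad, "Main"))
```

Rejecting the second path is what keeps a traversal from entering a procedure
through one caller and leaving through a different, unrelated one.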
|
||
|
||
The specific algorithm used for the Elaborate() function involves using the
set of data flow paths found at a less-specific tier to identify the set of
more-specific data flow paths that have been generalized. This is
accomplished by simply using the one-to-many link tables that were populated
during generalization to determine the subset of data flow paths that must be
considered at the next generalization tier. For example, elaborating from a
set of procedure data flow paths would involve determining the complete set of
basic block data flow paths that have been generalized by the affected set of
procedure data flow paths.

Based on this general description of the algorithm, it is useful to consider a
concrete example. This section provides a concrete illustration by
determining reachability between a source and a sink using an example web
application that consists of a web client and a web service component. This
is illustrated by progressively drilling down through each generalization tier
starting from the least-specific tier, the abstract tier, and working toward
the most-specific tier, the instruction tier. At each tier, a description of
the number of data flow paths that must be represented and the number of
reachable data flow paths found is given. This particular example will
attempt to determine concrete reachable data flow paths between the return
value of HttpRequest.get_QueryString and the first formal input parameter of
Process.Start. A reachable path between these two points could be indicative
of a command injection vulnerability within the application. The tables in
figure 12 and figure 13 show the flow descriptors for the source and sink,
respectively. These flow descriptors are used to identify associated vertices
at each generalization tier.

              +-------------+------------------------------+
              | Tier        | Information                  |
              +-------------+------------------------------+
              | Component   | fout(Undefined)              |
              | Module      | fout(System.Web.dll)         |
              | Data Type   | fout(System.Web.HttpRequest) |
              | Procedure   | fout(get_QueryString, 0)     |
              | Basic Block | fout(get_QueryString, 0)     |
              | Instruction | fout(get_QueryString, 0)     |
              +-------------+------------------------------+

                                Figure 12
              Source flow descriptor for the return value of
                       HttpRequest.get_QueryString

             +-------------+---------------------------------+
             | Tier        | Information                     |
             +-------------+---------------------------------+
             | Component   | fin(Undefined)                  |
             | Module      | fin(System.dll)                 |
             | Data Type   | fin(System.Diagnostics.Process) |
             | Procedure   | fin(Start, 0)                   |
             | Basic Block | fin(Start, 0)                   |
             | Instruction | fin(Start, 0)                   |
             +-------------+---------------------------------+

                                Figure 13
            Sink flow descriptor for the first formal parameter
                             of Process.Start
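The two flow descriptors can be encoded directly as per-tier lookup tables.
The sketch below assumes that vertices are labeled with the same strings used
in figures 12 and 13. The dictionary layout, the vertices helper, and the
small module graph (including the WebClient.dll and WebService.dll names) are
hypothetical stand-ins used only to show how the Vertices() step might select
vertices at one tier.

```python
TIERS = ["Component", "Module", "Data Type",
         "Procedure", "Basic Block", "Instruction"]

# Source flow descriptor, transcribed from figure 12.
SOURCE_DESCRIPTOR = {
    "Component": "fout(Undefined)",
    "Module": "fout(System.Web.dll)",
    "Data Type": "fout(System.Web.HttpRequest)",
    "Procedure": "fout(get_QueryString, 0)",
    "Basic Block": "fout(get_QueryString, 0)",
    "Instruction": "fout(get_QueryString, 0)",
}

# Sink flow descriptor, transcribed from figure 13.
SINK_DESCRIPTOR = {
    "Component": "fin(Undefined)",
    "Module": "fin(System.dll)",
    "Data Type": "fin(System.Diagnostics.Process)",
    "Procedure": "fin(Start, 0)",
    "Basic Block": "fin(Start, 0)",
    "Instruction": "fin(Start, 0)",
}

def vertices(graph, descriptor, tier):
    """Return the vertices of `graph` that match `descriptor` at `tier`."""
    label = descriptor[tier]
    return [v for v in graph if v == label]

# Hypothetical module tier graph as an adjacency map.
module_graph = {
    "fout(System.Web.dll)": {"fin(WebClient.dll)"},
    "fin(WebClient.dll)": {"fin(WebService.dll)"},
    "fin(WebService.dll)": {"fin(System.dll)"},
    "fin(System.dll)": set(),
}

print(vertices(module_graph, SOURCE_DESCRIPTOR, "Module"))
print(vertices(module_graph, SINK_DESCRIPTOR, "Module"))
```

Keeping one entry per tier lets the same descriptor be reused unchanged as the
analysis moves from one generalization tier to the next.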
|
||
|
||
For this illustration, there is in fact a data flow path that exists from the
source descriptor to the sink descriptor. However, unlike conventional data
flow paths, this data flow path happens to cross an abstract boundary between
the two components. In this case, data is passed from the web client
component through an HTTP request to a method hosted by the web service
component. This path can be seen by first looking at a portion of the web
client code:

   class Program {
      static void Main(string[] args) {
         HttpRequest request = new HttpRequest(
            "a","b","c");
         WebClient client = new WebClient();

         // Source: the value returned by HttpRequest.get_QueryString
         client.ExecuteCommand(
            request.QueryString["abc"]);
      }
   }
   [WebServiceBinding]
   public class WebClient : SoapHttpClientProtocol {
      [SoapDocumentMethod]
      public void ExecuteCommand(string command) {
         Invoke("ExecuteCommand",
            new object[] { command });
      }
   }

In this contrived example, data is shown as being passed from a query string
obtained from what is presumably a real HTTP request to the client portion of
the web service method ExecuteCommand. The web service application, in turn,
contains the following code:

   [WebService]
   public class WebService {
      [WebMethod]
      public void ExecuteCommand(string command) {
         // Sink: the first formal input parameter of Process.Start
         System.Diagnostics.Process.Start(command);
      }
   }

In conventional tools, it would not be possible to directly model this data
flow path because the data flow path is indirect. However, using a simple
methodology to bridge the client-side formal input parameters with the
server-side formal input parameters at the instruction tier, it is possible to
connect the two and represent data flow between the two conceptual software
elements at each generalization tier. The following sections will provide
visual examples of how the PQE algorithm narrows down and eliminates
unnecessary data flow paths at each generalization tier by progressively
qualifying data flow information. One thing to note about the graphs at each
tier is that implicit edges have been created between formal input and output
parameters that reside in external (unanalyzed) libraries. This is done
under the assumption that a formal input parameter may affect a formal output
parameter in some way in the context of the code that is not analyzed. If all
target code paths have been analyzed, then this is not necessary. The graphs
shown at each tier were automatically generated but have been modified to
allow them to fit within the margins of this document and in some cases
highlight important features.
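The implicit edges for unanalyzed libraries can be sketched as follows. The
sketch assumes that vertices are (kind, element) pairs and that the set of
unanalyzed elements is known up front. Connecting the formal input vertex of
an element to its formal output vertex is one conservative reading of the
assumption described above, not a confirmed detail of the author's tool, and
the function name is hypothetical.

```python
def add_implicit_edges(edges, unanalyzed):
    """Return `edges` plus a fin -> fout edge for every unanalyzed element
    that has both a formal input and a formal output vertex in the graph."""
    vertices = {v for edge in edges for v in edge}
    result = set(edges)
    for element in unanalyzed:
        fin, fout = ("fin", element), ("fout", element)
        if fin in vertices and fout in vertices:
            # Assume data entering the library may influence what leaves it.
            result.add((fin, fout))
    return result

# Illustrative component tier edges.
edges = {
    (("fout", "Web Client"), ("fin", "Undefined")),
    (("fout", "Undefined"), ("fin", "Web Service")),
}

augmented = add_implicit_edges(edges, unanalyzed={"Undefined"})
for src, dst in sorted(augmented):
    print(f"{src[0]}({src[1]}) -> {dst[0]}({dst[1]})")
```

Without the added edge, the two paths through the Undefined component would
remain disconnected and no path between the Web Client and the Web Service
would be found through the external code.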
|
||
|
||
3.1) Abstract Tiers

Abstract tiers represent the most general view of the data flow behaviors of
an application. The data flow behavior is modeled with respect to abstract
software elements, such as a component, rather than concrete software
elements, such as a module or a type. For this example, it is assumed that
the PQE algorithm begins by modeling data flow behaviors between conceptual
components in a web application. The web application is composed of two
manually defined abstract components, a Web Client and a Web Service. These
two components both rely on external libraries, as represented by the
Undefined component, which are outside of the scope of the application itself.
When starting at the abstract tier, all abstract data flow paths must be
considered as potential data flow paths. The component tier data flow graph
for this application is shown in figure 14.
|
||
|
||
|
||
.--------------------.
|
||
.------| origin(Web Client) |
|
||
| `--------------------`
|
||
V |
|
||
.----------------. |
|
||
| fin(Undefined) |<---. |
|
||
`----------------` \ |
|
||
.-^ | ^ | |
|
||
/ V | | |
|
||
| .----------------. | |
|
||
| | fin(Undefined) | | |
|
||
| `----------------` | |
|
||
| \ | |
|
||
| `--. | |
|
||
| V | V
|
||
.------------------. .-----------------.
|
||
| fin(Web Service) |<---| fin(Web Client) |
|
||
`------------------` `-----------------`
|
||
| | ^
|
||
V V |
|
||
.------------------. .------------------.
|
||
| fin(Web Service) | | fout(Web Client) |
|
||
`------------------` `------------------`
|
||
|
||
                                Figure 14
             Complete component tier data flow graph for the
                             web application.
|
||
|
||
Using the data flow graph shown in figure 14, PQE uses the Reachability()
algorithm to determine data flow paths between a formal output parameter in
the Undefined component and a formal input parameter in the Undefined
component. At this generalization tier, there are many different paths that
can be taken between these two components. This effectively results in the
qualification of nearly all of the module (assembly) tier data flow paths.
These data flow paths are used to represent the data flow graph at the module
tier.

In this example, PQE offers no improvements at abstract tiers because it is a
requirement that all abstract data flow information be represented. Since the
amount of information required to represent abstract data flow is minimal,
this is not seen as a deficiency. Furthermore, for this particular example,
nearly all component data flow paths are found to be involved in reachable
paths. At worst, this indicates that for small applications, it may not be
necessary to start the algorithm by looking at abstract data flow
information. Instead, one might immediately progress to the module or data
type tiers.

3.2) Module Tier

The module tier uses the set of data flow paths found at the abstract
component tier to construct a data flow graph that shows the data flow
relationships between formal input and formal output parameters passed between
modules. The graph is generated using the one-to-many table that was
populated during generalization, which conveys the module data flow paths that
were generalized by the set of qualified component data flow paths. For this
particular example, nearly all of the module data flow paths were qualified as
potentially being involved in a reachable path between the source and sink
flow descriptors. The graph that is generated as a result is shown in figure
15.

                [Figure 15: module tier data flow graph]

Using this graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptors at the module tier. In this
case, only the edges between the nodes highlighted in dark orange are found to
be involved in reachable paths between fout(System.Web.dll) and
fin(System.dll). The important thing to note is that even at the module tier,
a data flow path is illustrated between fin(WebClient) and fin(WebService).
This is a trend that will continue at each more specific generalization tier.
|
||
|
||
3.3) Data Type Tier

The data type tier uses the set of data flow paths found at the module tier to
construct a data flow graph that shows the data flow relationships between
formal input and formal output parameters passed between data types. The
graph is generated using the one-to-many table that was populated during
generalization, which conveys the data type data flow paths that were
generalized by the set of qualified module data flow paths. The graph that is
generated as a result is shown in figure 16.

               [Figure 16: data type tier data flow graph]

Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptors at the data type tier. Due to
the simplicity of the example application, only a few data flow paths were
rendered. The complete data flow path from fout(System.Web.HttpRequest) to
fin(System.Diagnostics.Process) can be clearly seen.

3.4) Procedure Tier

The procedure tier uses the set of data flow paths found at the data type tier
to construct a data flow graph that shows the data flow relationships between
formal input and formal output parameters passed between procedures. Unlike
previous tiers, procedure tier data flow paths explicitly identify the formal
parameter index that data is being passed to. This helps to further isolate
data flow paths from one another and improves the overall accuracy of paths
that are selected. The graph is generated using the one-to-many table that
was populated during generalization, which conveys the procedure data flow
paths that were generalized by the set of qualified data type data flow paths.
The graph that is generated as a result is shown in figure 17.

               [Figure 17: procedure tier data flow graph]

Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptors at the procedure tier. Due to
the simplicity of the example application, only a few data flow paths were
rendered. The complete data flow path from fout(get_QueryString, 0) to
fin(Start, 0) can be clearly seen.

3.5) Basic Block Tier

The basic block tier uses the set of data flow paths found at the procedure
tier to construct a data flow graph that shows the data flow relationships
between formal input and formal output parameters passed between basic blocks.
Like the procedure tier, basic block tier data flow paths also explicitly
identify the formal parameter index that data is being passed to. The graph
is generated using the one-to-many table that was populated during
generalization, which conveys the basic block data flow paths that were
generalized by the set of qualified procedure data flow paths. The graph that
is generated as a result is shown in figure 18. Due to the way that Phoenix
currently represents basic blocks, the basic block tier data flow paths offer
very little generalization beyond the instruction tier.

              [Figure 18: basic block tier data flow graph]

Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptors at the basic block tier. Due to
the simplicity of the example application, only a few data flow paths were
rendered. The complete data flow path from fout(get_QueryString, 0) to
fin(Start, 0) can be clearly seen.

3.6) Instruction Tier

The instruction tier uses the set of data flow paths found at the basic block
tier to construct a data flow graph that shows the data flow relationships
between formal input and formal output parameters passed between instructions.
Like the basic block tier, instruction tier data flow paths also explicitly
identify the formal parameter index that data is being passed to. The graph
is generated using the one-to-many table that was populated during
generalization, which conveys the instruction data flow paths that were
generalized by the set of qualified basic block data flow paths. The graph
that is generated as a result is shown in figure 19. The instruction tier
data flow paths represent the final step taken by the algorithm as they
contain the most specific description of data flow paths.

              [Figure 19: instruction tier data flow graph]

Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptors at the instruction tier. Due to
the simplicity of the example application, only a few data flow paths were
rendered. The complete data flow path from fout(get_QueryString, 0) to
fin(Start, 0) can be clearly seen along with source lines that are encountered
along the way.
|
||
|
||
4) Acknowledgements

The author would like to thank Rolf Rolles, Richard Johnson, Halvar Flake,
Jordan Hind, and many others for thoughtful discussions and feedback.

5) Conclusion

This document has attempted to convey the potential benefits of generalizing
data flow information along generalization tiers. Each generalization tier is
used to represent the data flow behaviors of an abstract or concrete software
element such as an instruction, basic block, procedure, and so on. Using this
concept, data flow information can be collected at the most specific tier, the
instruction tier, and then generalized to increasingly less-specific tiers.
The generalization process has the effect of reducing the amount of data that
must be considered at once while still conveying a general description of the
manner in which data flows within an application.

Generalized data flow information can be immediately used in conjunction with
existing graph reachability problems. For instance, a common task that
involves determining reachable data flow paths between a conceptual source and
sink location within an application can potentially benefit from operating on
generalized data flow information. This paper has illustrated these potential
benefits by defining the Progressive Qualified Elaboration (PQE) algorithm
which can be used to progressively determine reachability at each
generalization tier. By starting at the least specific generalization tier
and progressing toward the most specific, it is possible to restrict the
amount of data flow information that must be considered at once to a minimal
set. This is accomplished by using reachable paths found at each
generalization tier to qualify the set of data flow paths that must be
considered at more specific generalization tiers.

While these benefits are thought to be present, the author has yet to
conclusively prove this to be the case. The results presented in this paper
do not prove the presumed usefulness of generalizing data flow information
beyond the procedure tier. The author believes that analysis of large
applications involving hundreds of modules could benefit from generalizing
data flow information to the data type, module, and more abstract tiers.
However, at the time of this writing, conclusive data has not been
collected to prove this usefulness. The author hopes to collect
information that either confirms or refutes this point during future
research.

At present, the underlying implementation used to generate the results
described in this paper has a number of known limitations. The first
limitation is that it does not currently take into account formal parameters
that are not passed at a call site, such as fields, global variables, and so
on. This significantly restricts the accuracy of the data flow model that it
is currently capable of generating. This limitation represents a more general
problem of needing to better refine the underlying completeness of the data
flow information that is captured.

While the algorithms presented in this paper were portrayed in the context of
data flow analysis, it is entirely possible to apply them to other fields as
well, such as control flow analysis. The PQE algorithm itself is conceptually
generic in that it simply describes a process that can be employed to qualify
the next set of analysis information that must be considered from a more
generic set of analysis information. This may facilitate future research
directions.
|
||
|
||
References

[1] Atkinson, Griswold. Implementation Techniques for Efficient Data-flow
    Analysis of Large Programs.
    Proceedings of the IEEE International Conference on Software Maintenance
    (ICSM'01). 2001.
    http://www.cse.scu.edu/~atkinson/papers/icsm-01.ps

[2] Das, M. Static Analysis of Large Programs: Some Experiences.
    2000. http://research.microsoft.com/manuvir/Talks/pepm00.ppt

[3] Das, M., Lerner, S., Seigle, M. ESP: Path-Sensitive Program Verification
    in Polynomial Time.
    Proceedings of the SIGPLAN 2002 Conference on programming language
    design. 2002.
    http://www.cs.cornell.edu/courses/cs711/2005fa/papers/dls-pldi02.pdf

[4] Das, M., Fahndrich, M., Rehof, J. From Polymorphic Subtyping to CFL
    Reachability: Context-Sensitive Flow Analysis Using Instantiation
    Constraints. 2000.
    http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-99-84

[5] Dhurjati, D., Das, M., Yang, Y.
    Path-Sensitive Dataflow Analysis with Iterative Refinement.
    SAS'06: The 13th International Static Analysis Symposium, Seoul,
    August 2006.

[6] Erikson, Manocha. Simplification Culling of Static and Dynamic Scene
    Graphs.
    TR9809-009 by University of North Carolina at Chapel Hill. 1998.
    citeseer.ist.psu.edu/erikson98simplification.html

[7] Horwitz, S., Reps, T., and Binkley, D. Interprocedural slicing using
    dependence graphs.
    In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language
    Design and Implementation, (Atlanta, GA, June 22-24, 1988), ACM SIGPLAN
    Notices 23, 7 (July 1988), pp. 35-46.

[8] Horwitz, S., Reps, T., and Binkley, D. Retrospective: Interprocedural
    slicing using dependence graphs.
    20 Years of the ACM SIGPLAN Conference on Programming Language Design and
    Implementation (1979 - 1999): A Selection, K.S. McKinley, ed., ACM
    SIGPLAN Notices 39, 4 (April 2004), 229-231.

[9] Gregor, D., Schupp, S. Retaining Path-Sensitive Relations across Control
    Flow Merges.
    Technical report 03-15, Rensselaer Polytechnic Institute, November 2003.
    http://www.cs.rpi.edu/research/ps/03-15.ps

[10] Kiss, Jász, Lehotai, Gyimóthy. Interprocedural Static Slicing of Binary
     Executables.
     http://www.inf.u-szeged.hu/~akiss/pub/kiss_interprocedural.pdf

[11] Microsoft Corporation. Phoenix Framework.
     http://research.microsoft.com/phoenix/

[12] Naumovich, G., Avrunin, G. S., and Clarke, L. A. 1999.
     Data flow analysis for checking properties of concurrent Java programs.
     In Proceedings of the 21st international Conference on Software Engineering
|
||
(Los Angeles, California, United States, May 16 - 22, 1999).
|
||
International Conference on Software Engineering.
|
||
IEEE Computer Society Press, Los Alamitos, CA, 399-410.
|
||
|
||
[13] Reps, T., Horwitz, S., and Sagiv, M., Precise interprocedural dataflow analysis via graph reachability.
|
||
In Conference Record of the 22nd ACM Symposium on Principles of Programming Languages,
|
||
(San Francisco, CA, Jan. 23-25, 1995), pp. 49-61.
|
||
|
||
[14] Reps, T., Sagiv, M., and Horwitz S., Interprocedural dataflow analysis via graph reachability.
|
||
TR 94-14, Datalogisk Institut, University of Copenhagen, Copenhagen, Denmark, April 1994.
|
||
|
||
[15] A. Rountev, B. G. Ryder, and W. Landi. Dataflow analysis of program fragments.
|
||
In Proc. Symp. Foundations of Software Engineering, LNCS 1687, pages 235--252, 1999.
|
||
http://citeseer.ist.psu.edu/rountev99dataflow.html
|
||
|
||
[16] Schultes, Dominik. Fast and Exact Shortest Path Queries Using Highway Hierachies. 2005.
|
||
http://algo2.iti.uka.de/schultes/hwy/hwyHierarchies.pdf
|