Zines/uninformed/8.6.txt

Generalizing Data Flow Information
Aug, 2007
skape
mmiller@hick.org

Abstract: Generalizing information is a common method of reducing the quantity
of data that must be considered during analysis.  This fact has been plainly
illustrated in relation to static data flow analysis where previous research
has described algorithms that can be used to generalize data flow information.
These generalizations have helped support more optimal data flow analysis in
certain situations.  In the same vein, this paper describes a process that can
be employed to generalize and persist data flow information along multiple
generalization tiers.  Each generalization tier is meant to describe the data
flow behaviors of a conceptual software element such as an instruction, a
basic block, a procedure, a data type, and so on.  This process makes use of
algorithms described in previous literature to support the generalization of
data flow information.  To illustrate the usefulness of the generalization
process, this paper also presents an algorithm that can be used to determine
reachability at each generalization tier.  The algorithm determines
reachability starting from the least specific generalization tier and uses the
set of reachable paths found to progressively qualify data flow information
for each successive generalization tier.  This helps to constrain the amount
of data flow information that must be considered to a minimal subset.

1) Introduction

Data flow analysis uses data flow information to solve a particular data flow
problem such as determining reachability, dependence, and so on.  The
algorithms used to obtain data flow information may vary in terms of accuracy
and precision.  To help quantify effectiveness, data flow algorithms may
generally be categorized based on specific sensitivities.  The first category,
referred to ask flow sensitivity is used to convey whether or not an algorithm
takes into account the implied order of instructions.  Path sensitivity is
used to convey whether or not an algorithm considers predicates.  Finally,
algorithms may also be context-sensitive if they take into account a calling
context to restrict analysis to realizable paths when considering
interprocedural data flow information.

Data flow information is typically collected by statically analyzing the data
dependence of instructions or statements.  For example, conventional def-use
chains describe the variables that exist within def(), use(), in(), out(), and
kill() set for each instruction or statement.  Understanding data flow
information with this level of detail makes it possible to statically solve a
particular data flow problem.  However, the resources needed to represent the
def-use data flow information can be prohibitive when working with large
applications.  Depending on the data flow problem, the amount of data flow
information required to come to a solution may be in excess of the physical
resources present on a computer performing the analysis.  This physical
resource problem can be solved using at least two general approaches.

The most basic approach might involve simply partitioning, or fragmenting,
analysis information such that smaller subsets are considered individually
rather than attempting to represent the complete set of data flow information
at once[15].  While this would effectively constrain the amount of physical
resources required, it would also directly impact the accuracy and precision
of the underlying algorithm used to perform data flow analysis.  For instance,
identifying the ``interesting portion'' of a program may require more state
than can be feasibly obtained in single program fragment.  A second and
potentially more optimal approach might involve generalizing data flow
information.  By generalizing data flow information, an algorithm can operate
within the bounds of physical resources by making use of a more abstract view
of the complete set of data flow information.  The distinction between the
generalizing approach and the partitioning approach is that the generalized
data flow information should not affect the accuracy of the algorithm since it
should still be able to represent the complete set of generalized data flow
information at once.

There has been significant prior work that has illustrated the effectiveness
of generalizing data flow information when performing data flow analysis.  The
def-use information obtained between instructions or statements has been
generalized to describe sets for basic blocks.  Horwitz, Reps, and Binkley
describe how a system dependence graph (SDG) can be derived from
intraprocedural data flow information to produce a summary graph which convey
context-sensitive data flow information at the procedure level[7].  Their paper
went on to describe an interprocedural slicing algorithm that made use of
SDGs.  Reps, Horwitz, and Sagiv later described a general framework (IFDS) in
which many data flow analysis problems can be solved as graph reachability
problems[13, 14].  The algorithms proposed in their paper focus on restricting
analysis to interprocedurally realizable paths to improve precision.
Identifying interprocedurally realizable paths has since been compared to the
concept of context-free-language (CFL) reachability (CFL-reachability)[8].  These
algorithms have helped to form the basis for techniques used in this paper to
both generalize and analyze data flow information.

This paper approaches the generalization of data flow information by defining
generalization tiers at which data flow information can be conveyed.  A
generalization tier is intended to show the data flow relationships between a
set of conceptual software elements.  Examples of software elements include an
instruction, a basic block, a procedure, a data type, and so on.  To define
these relationships, data flow information is collected at the most specific
generalization tier, such as the instruction tier, and then generalized to
increasingly less-specific generalization tiers, such as the basic block,
procedure, and data type tiers.

To illustrate the usefulness of generalizing data flow information, this paper
also presents a progressive algorithm that can be used to determine
reachability between nodes on a data flow graph at each generalization tier.
The algorithm starts by generating a data flow graph using data flow
information from the least-specific generalization tier.  The graph is then
analyzed using a previously describe algorithm to determine reachability
between an arbitrary set of nodes.  The set of reachable paths found is then
used to qualify the set of more-specific potentially reachable paths found at
the next generalization tier.  The more-specific paths are used to construct a
new data flow graph.  These steps then repeat using each more-specific
generalization tier until it is not possible to obtained more detailed
information.  The benefit of this approach is that a minimal set of data flow
information is considered as a result of progressively qualifying data flow
paths at each generalization tier.  It should be noted that different
reachability problems may require state that is prohibitively large.  As such,
it is helpful to consider refining a reachability problem to operate more
efficiently by making use of generalized information.

This paper is organized into two sections. Section 2  discusses the algorithms
used to generalize data flow information at each generalization tier.  Section
3 describes the algorithm used to determine reachable data flow paths by
progressively analyzing data flow information at each generalization tier.  It
should be noted in advance that the author does not claim to be an expert in
this field; rather, this paper is simply an explanation of the author's
current thoughts.  These thoughts attempt to take into account previous work
whenever possible to the extent known by the author.  Given that this is the
case, the author is more than willing to receive criticism relating to the the
ideas put forth in this paper.

2) Generalization

Generalizing data flow information can make it possible to analyze large data
sets without losing accuracy.  This section describes the process of
generalizing information at each generalization tier.  As a matter of course,
each generalization tier uses data flow information obtained from its
preceding more specific generalization tier.  In this way, the basic block
tier generalizes information obtained at the instruction tier, the procedure
tier generalizes information obtained at the basic block tier, and so on.  The
algorithms used to generalize information at each generalization tier can have
a direct impact on the accuracy of the information that can be obtained when
used during data flow analysis.  The subject of accuracy will be addressed for
each specific tier.

To obtain generalized data flow information, a set of target executable image
files, or modules, must be defined.  The target modules serve to define the
context from which data flow information will be obtained and generalized.
The general process used to accomplish this involves visiting each procedure
within each module.  For each procedure, data flow information is collected at
the instruction tier and is then generalized to each less-specific tier.  To
facilitate the reachability algorithm, it is assumed that as the data flow
information is collected, it is persisted in a form such that can be accessed
on demand.  The process described in this paper assumes a normalized database
is used to contain the data flow information found at each generalization
tier.  In this manner, the upper limit associated with the number of target
modules is tied to the amount of available persistent storage with respect to
the amount required by a given data flow problem.

Before proceeding, it is important to point out that while this paper
describes explicit algorithms for generalizing at each tier, it is entirely
possible to substitute alternative algorithms.  This serves to illustrate that
the concept of generalizing information along generalization tiers is
sufficiently abstract enough to support representing alternate forms of data
flow and control flow information.  By using different algorithms, it is
possible to convey different forms of data flow relationships which vary in
terms of precision and accuracy.

2.1) Instruction Tier

Generalizing data flow information presupposes that there is data flow
information to generalize.  As such, a base set of data flow information must
be collected first.  For the purposes of this paper, the most specific data
flow information is collected at the instruction tier using the Static Single
Assignment (SSA) implementation provided by Microsoft's Phoenix framework,
though other algorithms could just as well be used[11].  SSA is an elegant
solution to the problem of representing data flow information in a
flow-sensitive manner.  Each definition and use of a given variable are
defined in terms of a unique variable version which makes it possible to show
clear, unambiguous data flow relationships.  In cases where data flow
information may merge along control flow paths, SSA makes use of a phi
function which acts as a pseudo-instruction to represent the merge point.
Obtaining distinct data flow paths at the instruction tier can be accomplished
by traversing an SSA graph for a given procedure starting from each root
variable, which have no prior definitions, and proceeding to each reachable
leaf variable, which have no subsequent uses, are encountered along each data
flow path.  The end result of this traversal is the complete set of data flow
paths found within the context of a given procedure.

One of SSA's limitations is that it is only designed to work intraprocedurally
and therefore makes no effort to describe the behavior of passing data between
procedures, such as through input and output parameters.  In order to provide
an accurate, distinct path data flow representation, one must take into
account interprocedural data flow.  One method of accomplishing this is to
generalize the concept of SSA's phi function and use it represent formal
parameters.  In this way, the phi function can be used to represent data flow
merges that happen as a result of data passing as input or output parameters
when a procedure is called.  A phi function can be created to represent each
formal input and output parameter for a procedure, thus linking definitions of
parameters at a call site to actual parameter uses in a callee.  Reps,
Horwitz, and Sagiv describe a concept similar to this[13].

In addition to using phi functions to link the definitions and uses of formal
parameters, it is also necessary to fracture data flow paths at call sites
that are found within a procedure .  This is necessary because data flow paths
collected using SSA information will convey a relationship between the input
parameters passed to a procedure and the output parameters returned by a
procedure.  This is the case because a call instruction at a call site appears
to use input parameters and define output parameters, thus creating an
implicit link between input and output parameters.  Since SSA information is
obtained intraprocedurally, it is not possible to know in advance whether or
not an input parameter will influence an output parameter.

To fracture a data flow path, the instructions that define input parameters
passed at a given call site are instead linked directly to the associated
formal input parameter phi functions that are found in the context of the
target procedure.  Likewise, instructions that use output parameters
previously defined by the call instruction are instead linked directly to the
associated formal output parameter phi functions found in the context of the
target procedure.  This has the effect of breaking the original data flow path
into two disconnected data flow paths at the call site location. The linking
of actual parameters and call site parameters with formal parameters has been
illustrated in previous literature.  Horwitz, Reps, and Binkley used this
concept during the construction of a system dependence graph (SDG)[7].  The
concept of creating symbolic variables that are later used to link information
together is not new[14].  Figure 2 provides an example of what a conventional and
fractured data flow path might look like.

   Conventional        Fractured

   .---------.        .---------.
   | ldarg.0 |        | ldarg.0 |
   `---------`        `---------`
        |                  |
        V                  V
   .---------.        .---------.
   | call g  |        | fin(x)  |
   `---------`        `---------`
        |
        |          ------------------
        V
   .---------.        .---------.
   | stloc.0 |        | fout(g) |
   `---------`        `---------`
                           |
                           V
                      .---------.
                      | stloc.0 |
                        `---------`

             Figure 2
   Fracturing a data flow path at a call
   site.  Call instructions no longer act
   as the producers or receivers of data
   that is passed between procedures.

Using the fracturing concept, the instruction tier's path-sensitive data flow
information for a given procedure becomes disconnected.  This helps to improve
the overall accuracy of the data flow paths that are conveyed.  Fracturing
also has the added advantage of making it possible to use formal parameter phi
functions to dynamically link a caller and a callee at runtime.  This makes it
possible to identify context-sensitive interprocedural data flow paths at the
granularity of an instruction.  This ability will be described in more detail
when the reachability algorithm is described in section 3.

With an understanding of the benefits of fracturing, it is now possible to
define the general form that data flow paths may take at the instruction tier.
This general form is meant to describe the structure of data flow paths at the
instruction tier in terms of the potential set of origins, transient, and
terminal points with respect to the general instruction types.  Based on the
description given above, it is possible to categorize instructions into a few
general types.  Using these general instruction types, the general form of
instruction data flow paths can be captured as illustrated by the diagram in
figure 3.

value: Defines or uses a data value
compare: Compares a data value
fin: Pseudo instruction representing a formal input parameter
fout: Pseudo instruction representing a formal output parameter


   .---------.  .---------.  .---------.
   |  value  |  |   fin   |  |   fout  |
   `---------`  `---------`  `---------`
          \          |            /
           `---------|-----------`
                     V
              .--------------.
              |     value    |
              |    compare   |
              |      ...     |
              `--------------`
                     |
        .------------|---------.-------------.
        V            V         V             V
 .---------.  .---------.  .---------.  .---------.
 |  value  |  |   fin   |  |   fout  |  | compare |
 `---------`  `---------`  `---------`  `---------`

                    Figure 3
  General forms of data flow paths at the instruction tier

Based on this general description of instruction data flow paths, it is
helpful to consider a concrete example.  Consider the example source code
described below which shows the implementation of the f function.

static public int f(int x)
{
   return (x >= 0) ? g(x) : x + 1;
}

This function is intentionally very simple so as to limit the number of data
flow paths that must be represented visually.  Using the concepts described
above, the instruction data flow paths that would be created as a result of
analyzing this procedure are shown in figure 4.  Note that the call site for
the g function results in two disconnected data flow paths.  The end result is
that there are four unique data flow paths within this procedure.


               .----------.
               | fin(f,x) |
               `----------`
              / /     |
             / /      |
            V  |      V
 .----------.  |  .----------.  .----------.
 | ldarg.0  |  |  | ldarg.0  |  |   ldc    |
 `----------`  |  `----------`  `----------`
          |    |       |       /
          |    |       |      /
          |    |       V     V
.------.  |    |  .----------.  .----------.
| ldc  |  |    |  |   add    |  |  fout(g) |
`------`  |    |  `----------`  `----------`
     \    |    |          \ \        /
      \   |    |           \ \      /
       V  V    V            V V    V
   .-------.  .----------.  .--------.
   | brcmp |  | ldarg.0  |  |  ret   |
   `-------`  `----------`  `--------`
                   |             |
                   |             |
                   V             V
              .----------.  .---------.
              | fin(g,x) |  | fout(f) |
              `----------`  `---------`

                Figure 4
   Instruction tier data flow paths for the example code.
   The context of these data flow paths is the f function.

2.2) Basic Block Tier

Once the complete set of data flow paths are identified at the instruction
tier for a given procedure, the next step is to generalize data flow
information to the basic block tier.  At the basic block tier, instruction
data flow paths should be generalized to show path-sensitive data flow
interactions between basic blocks rather than instructions.  This level of
generalization reduces the amount of information needed to represent data flow
paths.  For example, there are many cases where data will be passed between
multiple instructions within the same basic block.  Using basic block tier
generalization, those individual operations can be generalized and represented
as a single basic block.  The generalized basic block data flow paths can then
be persisted for subsequent use when determining reachability in much the same
fashion that was used at the instruction tier.

Since the instruction tier's data flow information has been fractured and
parameters passed at call sites have been tied to phi functions, an approach
must be defined to preserve this information at the basic block tier during
generalization.  An easy way of preserving this information is to define the
formal parameters which represent input and output parameters as being
contained within distinct pseudo blocks.  For example, the phi functions
representing formal input parameters can exist within a formal entry pseudo
block.   Likewise, the phi functions representing formal output parameters can
exist within a formal exit pseudo block.  Both pseudo blocks can then be tied
to the procedure associated with the formal parameters.  Defining the
underlying instruction tier phi functions in this way makes it trivial to
retain information that will be needed to define context-sensitive
interprocedural data flow at less-specific generalization tiers.  Like the
instruction tier, it is possible to dynamically link data passed to a pseudo
block in a caller's context to subsequent uses in a callee's context.  Figure
5 shows the general form that basic block data flow paths may take.


   .---------.  .---------.  .---------.
   |  fin    |  |   fout  |  |  block  |
   `---------`  `---------`  `---------`
            \        |         /
            .`-------+--------'.
            V        V         V
   .---------.  .---------.  .---------.
   |  fin    |  |   fout  |  |  block  |
   `---------`  `---------`  `---------`

                    Figure 3
  General forms of data flow paths at the basic block tier

The act of generalizing instruction data flow paths means that two or more
distinct instruction data flow paths may produce the same basic block data
flow path.  When this occurs, only one basic block data flow path should be
defined since it will effectively capture the information conveyed by the set
of distinct instruction data flow paths.  Each corresponding instruction data
flow path should still be associated with a single basic block data flow path.
This association makes it possible to show the set of instruction data flow
paths that have been generalized by a specific basic block data flow path.
The association can be persisted in a normalized database by creating a
one-to-many link table between basic block and instruction data flow paths.
Figure  provides an example of what would happen when generalizing the
instruction data flow paths described in figure 6.


               .----------.
               | fin(f,x) |
               `----------`
                    |
         .----------+----------.
         V          V          V
 .----------.  .----------.  .----------.
 |  block 1 |  | block 2  |  | block 3  |
 `----------`  `----------`  `----------`
                    |            |  |
                    V            |  |
               .----------.      |  |
               | fin(g,x) |      |  |
               `----------`      |  |
                                 V  V
              .----------.   .----------.
              | fout(g)  |-->| block 4  |
              `----------`   `----------`
                                  |
                                  V
                             .----------.
                             | fout(f)  |
                             `----------`

                Figure 6
   Basic block tier data flow paths obtained by generalizing
   the instruction data flow paths described in figure 4.  These
   context for these data flow paths is the f function.

2.3) Procedure Tier

Generalizing data flow paths from the basic block tier to the procedure tier
further reduces the amount of information needed to show data flow behavior.
Procedure tier data flow paths are meant to show how data is passed between
procedures through formal parameters.  This covers scenarios such as passing a
procedure's formal input parameter to a child procedure's formal input
parameter, using the formal output parameter of a child procedure as the
formal input parameter to another called procedure, and so on.  These
behaviors are all represented within the context of a particular procedure.

Based on these constraints, only two classes ofbasic block data flow paths
need to be considered.  The first class involves data traveling from any block
to a formal input or output parameter, thus showing interprocedural flows.
The second class involves data traveling from a formal input or formal output
parameter to a terminal point in a procedure.  This effectively eliminates any
intraprocedural data flows that are not carried over to another procedure in
some form.  Since data flow information about which formal parameters are used
or defined is conveyed by basic block data flow paths, it is possible to
simply generalize this data flow information to show data flowing to formal
parameters within the context of a given procedure.  While it may be tempting
to think that one must only show data flow paths between two formal
parameters, it is also necessary to show data flow paths that originate from
data that is locally defined within a procedure, such as through a local
variable which is not populated by a formal parameter.  As such, the general
form that data flow paths may take at the procedure tier is illustrated by
figure 7.  Figure 8 provides an example of what would happen when generalizing
the basic block data flow paths described in figure 6.


   .---------.  .---------.  .---------.
   |  fin    |  |   fout  |  |  origin |
   `---------`  `---------`  `---------`
            \        |         /
            .`-------+--------'.
            V        V         V
   .---------.  .---------.  .---------.
   |  fin    |  |   fout  |  |  origin |
   `---------`  `---------`  `---------`

                    Figure 7
  General forms of data flow paths at the procedure tier


 .----------.  .-----------.  .----------.
 | fin(f,x) |  | origin(f) |  | fout(g)  |
 `----------`  `-----------`  `----------`
    |     \         |           /
    |      `--------+----------'
    V               V
 .----------.  .-----------.
 | fin(g,x) |  |  fout(f)  |
 `----------`  `-----------`

                    Figure 8
  Procedure tier data flow paths obtained by generalizing the
  basic block data flow paths described in figure 4.  The context
  for these data flow paths is the f function.

Procedure data flow paths may generalize multiple basic block data flow paths
and thus can make use of a one-to-many link table to illustrate this
association.  While generalizing data flow paths to the procedure tier is
trivial, the challenging aspect comes when determining reachability.  This
will be discussed in more detail in section 3.

2.4) Data Type Tier

Using data flow information obtained from the procedure tier, it is sometimes
possible, depending on language features, to generalize data flow information
to the data type tier.  Generalizing to the data type tier is meant to show
how formal parameters are passed between data types within the context of a
given data type.  This relies on the underlying language having the ability to
associate procedures with data types.  For example, object-oriented languages
are all capable of associating procedures with data types, such as through
classes defined in C++, C, and other languages.  In the case of languages
where data types do not have procedures, it may instead be possible to
associate procedures with the name of the source file that contains them.  In
both cases, it is possible to show formal parameters passing between elements
that act as containers for procedures, regardless of whether the underlying
elements are true data types.

The benefit of generalizing data flow information at the data type tier is
that it helps to further reduce the amount of data flow information that must
be represented.  Since the small example source code that has been used to
illustrate generalizations at each tier only involves passing formal
parameters within the same data type, it is useful to consider an alternative
example which involves passing data between multiple data types.

class Company {
   void AddEmployee(int num) {
      Person employee = new Person(num);
      employees.Add(employee);
      Console.WriteLine("New employee {0}", employee);
   }
   int EmployeeCount() {
      return employees.Count;
   }
   private ArrayList employees;
}

Figure 9 shows the data type data flow paths for the example code shown above.
It is important to note that unlike previous tiers, the specific formal
parameters that are being passed between types is not preserved.  Instead,
only the fact that formal parameters are passed between data types is
retained.  In this manner, fin indicates a data type's formal input parameter
and fout indicates a data type's formal output parameter.


 .---------------------.      .--------------.
 | fin(System.Console) |<-----| fout(Person) |
 `---------------------`      `--------------`
                                      |
 .-----------------------.           /
 | fin(System.ArrayList) |<---------`
 `-----------------------`

 .--------------.
 | fin(Company) |
 `--------------`
       |
       V
 .-------------.
 | fin(Person) |
 `-------------`

 .------------------------.     .---------------.
 | fout(System.ArrayList) |---->| fout(Company) |
 `------------------------`     `---------------`

                    Figure 9
  Data type tier data flow paths obtained by generalizing the
  procedure tier data flow paths.  The context for these data
  flow paths is the Company data type.

In a fashion much the same as previous generalization tiers, a single data
type data flow path can represent multiple underlying procedure data flow
paths.  Each generalized procedure data flow path can be associated with its
corresponding data type data flow path through a one-to-many link table in a
normalized database.

2.5) Module Tier

Generalizing data flow information to the module tier is meant to show how
data flows between distinct modules.  As with each step in the generalization
process, the module tier data flow paths lose much of the information that is
conveyed at more specific tiers.  Figure  shows module tier data flow paths
that would be defined when generalizing the data type data flow paths
illustrated in figure 10.


 .------------------.      .-----------------.
 | fin(Company.dll) |      | fin(System.dll) |
 `------------------`      `-----------------`
         |                          ^
         V                          |
 .-----------------.      .------------------.
 | fin(Person.dll) |      | fout(Person.dll) |
 `-----------------`      `------------------`

 .------------------.      .-------------------.
 | fout(System.dll) |----->| fout(Company.dll) |
 `------------------`      `-------------------`

                    Figure 10
  Module tier data flow paths obtained by generalizing the
  data type tier data flow paths.  The context for these data
  flow paths is the Company.dll module.

2.6) Abstract Tiers

Once data flow paths have been generalized from the instruction tier through
the module tier, it is no longer possible to create additional concrete
generalizations for most runtime environments An exception to this is managed
code which has an additional concrete assembly tier.  Even though it may not
be possible to establish concrete generalizations, it is possible to define
abstract generalizations.  An abstract generalization attempts to show data
flow relationships between abstract elements.  A good example of an abstract
element would be a logical component which is defined in the architecture of a
given application.  For example, a VPN client application might be composed of
a user interface component and a networking component, each of which may
consist of multiple concrete modules.  By defining logical components and
associating concrete modules with each component, it is possible to further
generalize information beyond the module tier.

Given the example described above, it may be prudent to define two abstract
generalization tiers.  The first abstract tier is the component tier.  In this
context, a component is defined as a logical software component that contains
one or more concrete modules.  The component tier makes it possible to
illustrate data flow between conceptual components within an application as
derived from how data flows between concrete modules.  The second abstract
tier is the application tier.  The application tier can be used to illustrate
how data is passed between conceptual applications.  For example, a web
browser application passes data in some form to a web server application, both
of which consist of conceptual components which, in turn, consist of concrete
modules.

The caveat with abstract generalization tiers is that it must be possible to
illustrate data flow between what may otherwise be disjoint concrete elements.
The reason for this is that, often times, the paths that data will take
between two modules which belong to different logical components will be
entirely indirect with respect to one another.  For this reason, it is
necessary to devise a mechanism to bridge data flow paths between concrete
software elements that belong to each logical component or application.  A
particularly useful example of an approach that can be taken to bridge two
distinct components can be found in web services.

In a web services application, it is often common to have a client component
and a server component.  The two components pass data to one another through
an indirect channel, such as through a web request.  For this reason, it is
not immediately possible to show direct data flow paths from a web client
component to a web service component.  To solve this problem, one can define a
mechanism that bridges the formal parameters associated with the web service
method that is being invoked.  In this manner, the the formal input parameters
for a web service method found on the client side can be implicitly linked and
shown to define the formal input parameters received on the web service
side.  By illustrating data flow at a concrete tier, it is possible to
generalize data flow behaviors all the way up through the abstract tiers.

The benefit of describing data flow behavior at abstract tiers is that it
makes it possible to derive data flow behaviors between abstract software
elements rather than strictly focusing on concrete software elements.  This is
useful when attempting to view an application's behavior at a glance rather
than worrying about the specific details relating to how data is passed.  For
example, this could be used to help validate threat models which describe how
data is expected to be passed between abstract components within an
application.

When generalizing information at abstract tiers, the only information that can
be conveyed, at least based on the approach described thus far, is whether or
not a component or application are passing data through a formal input or
formal output parameter.  The specifics of which formal parameters are passed
is no longer available for use in generalization.  Using the example shown at
the data type tier, one might assume the following component associations:
Company.dll and Person.dll, which contain the Company data type and Person
data type, are part of the user interface component of a human resources
application.  The classes used from system libraries can be generically
grouped as belonging to an external library component.  Using these,
groupings, the component data flow paths may be represented as shown in figure
11.

 .------------------------.      .-----------------------.
 | fout(External Library) |      | fin(External Library) |
 `------------------------`      `-----------------------`
            |                               ^
            V                               |
 .----------------------.        .---------------------.
 | fout(User Interface) |        | fin(User Interface) |
 `----------------------`        `---------------------`

                    Figure 11
  Component tier data flow paths obtained by generalizing the
  module tier data flow paths.  The context for these data flow
  paths is the user interface component.

As with all previously described generalization tiers, a single component data
flow path may represent multiple module data flow paths.  The single component
data flow path should be associated with each corresponding module data flow
path through a one-to-many link table in a normalized database.

3) Reachability

The real benefit of the generalizations described in  can be realized when
attempting to solve a graph reachability problem.  By generalizing data flow
behaviors to both abstract and concrete generalization tiers, it is possible
to reduce the amount of information that must be represented when attempting
to determine graph reachability.  This is further improved by the fact that
data flow paths found at less-specific generalization tiers can be used to
progressively qualify potential data flow paths at more-specific
generalization tiers.  This qualification is possible due to the fact that
less-specific data flow paths are associated with more-specific data flow
paths at each generalization tier through a one-to-many link table, thus
permitting trivial expansion.  The benefit of qualifying data flow paths in
this fashion is that only the minimal set of information needed to determine
reachability must be considered at once at each generalization tier.  This can
drastically reduce the physical resources required to solve a graph
reachability problem by effectively limiting the size of a graph.  This
general approach is captured by the Progressive Qualified Elaboration (PQE)
algorithm described by .  This concept is very similar to the ideas outlined
by Schultes' highway hierarchy which is used to optimize fast path discovery
when identifying travel routes in road networks[16].


PQE(Elements, SourceDescriptor, SinkDescriptor)
   Paths := 0
   Graph := BuildGraph(Elements)

   while Graph != 0
      SourceVertices  := Vertices(Graph, SourceDescriptor)
      SinkVertices    := Vertices(Graph, SinkDescriptor)
      Paths           := Reachability(Graph, SourceVertices, SinkVertices)
      ElaboratedPaths := Elaborate(Paths)
      Graph           := BuildGraph(ElaboratedPaths)
   end

   return Paths
end

For the purposes of this paper, graph reachability is restricted to
determining realizable paths between two flow descriptors: a source and a
sink.  A flow descriptor provides information that is needed to identify
corresponding vertices within a graph at each generalization tier.  The tables
in figure  and figure  show the information needed to identify source and sink
vertices at each generalization tier for the example that will be described in
this section.

The PQE algorithm itself requires three parameters.  The first parameter,
Elements, contains the set of generalized elements to be analyzed.  For
example, it may contain the set of target modules that should be analyzed.
The second and third parameters, SourceDescriptor and SinkDescriptor,
represent the source and sink flow descriptors, respectively.

The first step taken by the algorithm is to define Paths as an empty set.
Paths will be used to contain the set of reachable paths between an actual set
of sources and sinks at a given generalization tier.  After Paths  has been
initialized, Graph  is initialized to a flow graph that conveys data flow
relationships between the set of elements provided in .  The approach taken to
construct the flow graph involves retrieving persisted data flow information
for the appropriate generalization tier.  Once Paths and Graph have been
initialized, the qualified elaboration process can begin.

For each loop iteration, a check is made to see if Graph is an empty graph
(contains no vertices).  If Graph is empty, the loop terminates.  If Graph is not an
empty graph, reachable paths between the actual set of sources and sinks are
determined.  This is accomplished by first identifying the vertices in  that
are associated with the flow descriptors SourceDescriptor and SinkDescriptor
at the current generalization tier.  The actual set of sources and sinks found
to be associated with these descriptors are stored in SourceVertices and
SinkVertices, respectively.  With the set of actual source and sink vertices
identified, a reachability algorithm, Reachability(), can be used to determine
the set of reachable paths in flow graph between the two sets of vertices.
The result of this determination is stored in Paths.  The final step in the
iteration involves using qualified elaboration to construct a new flow graph
containing more-specific data flow paths which are qualified by the set of
data flow paths encountered in the reachable paths found in Paths.  This set
is then elaborated to a subset that contains the associated data flow paths
from the next, more specific tier, such as by elaborating to a subset of basic
blocks data flow paths from a more general set of procedure data flow paths.
The result of the elaboration is stored in ElaboratedPaths.  Finally, a new
flow graph is constructed and stored in Graph using the elaborated set of flow
paths contained within ElaboratedPaths.

When it is not possible to obtain a more-detailed flow set, such as when the
instruction tier is reached, an empty graph is created and the algorithm
completes by returning .  In the final iteration, Paths contains the most detailed
set of reachable data flow paths found between the source and sink flow
descriptors.  The benefit of approaching graph reachability problems in this
fashion is that only a subset of the elements at any generalization tier need
to be considered at once.  These subsets are dictated by the set of reachable
data flow paths found at each preceding generalization tier.  In this manner,
the subset of procedure data flow paths that need to be considered would be
effectively qualified by the set of data types and modules found to be
involved in data flow paths between the source and sink flow descriptors at
less-specific tiers.

For the purposes of this paper, the  algorithm is designed to consider
realizable paths at each generalization tier in manner that is similar to the
concept described by Reps et al.  This involves traversing the graph in
context-sensitive fashion.  To accomplish this, the algorithm keeps a scope
stack at each generalization tier.  The scope may be an assembly, a type, or a
procedure.  When data is passed through to a formal input parameter, the scope
for the formal input parameter is pushed onto the stack.  When data is
returned through a formal output parameter to another location, the
algorithm ensures that the scope that is being returned to is the parent
scope.  In this way, only realizable paths are considered at each
generalization tier which limits the number of paths that must be
considered and also has the benefit of producing more accurate results.

The specific algorithm used for the  function involves using the set of data
flow paths found at a less-specific tier to identify the set of more-specific
data flow paths that have been generalized.  This is accomplished by simply
using the one-to-many link tables that were populated during generalization to
determine the subset of data flow paths that must be considered at the next
generalization tier.  For example, elaborating from a set of procedure data
flow paths would involve determining the complete set of basic block data flow
paths that have been generalized by the affected set of procedure data flow
paths.

Based on this general description of the algorithm, it is useful to consider a
concrete example  This section provides a concrete illustration by determining
reachability between a source and a sink using an example web application that
consists of a web client and a web service component.  This is illustrated by
progressively drilling down through each generalization tier starting from the
least-specific tier, the abstract tier, and working toward the most-specific
tier, the instruction tier.  At each tier, a description of the number of data
flow paths that must be represented and the number of reachable data flow
paths found is given.  This particular example will attempt to determine
concrete reachable data flow paths between the return value of
HttpRequest.getQueryString and the first formal input parameter of
Process.Start.  The implications of a reachable path between these two points
could be indicative of a command injection vulnerability within the
application.  The tables in figure 12 and figure 13 show the flow descriptors for
the source and sink, respectively.  These flow descriptors are used to
identify associated vertices at each generalization tier.

  +-------------+------------------------------+
  | Tier        | Information                  |
  +-------------+------------------------------+
  | Component   | fout(Undefined)              |
  | Module      | fout(System.Web.dll)         |
  | Data Type   | fout(System.Web.HttpRequest) |
  | Procedure   | fout(get_QueryString, 0)     |
  | Basic Block | fout(get_QueryString, 0)     |
  | Instruction | fout(get_QueryString, 0)     |
  +-------------+------------------------------+

                Figure 12
   Source flow descriptor for the return value of
        HttpRequest.get_QueryString

  +-------------+---------------------------------+
  | Tier        | Information                     |
  +-------------+---------------------------------+
  | Component   | fin(Undefined)                  |
  | Module      | fin(System.dll)                 |
  | Data Type   | fin(System.Diagnostics.Process) |
  | Procedure   | fin(Start, 0)                   |
  | Basic Block | fin(Start, 0)                   |
  | Instruction | fin(Start, 0)                   |
  +-------------+---------------------------------+

                Figure 13
   Sink flow descriptor for the first formal parameter
   of Process.Start

For this illustration, there is in fact a data flow path that exists from the
source descriptor to the sink descriptor.  However, unlike conventional data
flow paths, this data flow path happens to cross an abstract boundary between
the two components.  In this case, data is passed from the web client
component through an HTTP request to a method hosted by the web service
component.  This path can be seen by first looking at a portion of the web
client code:

class Program {
   static void Main(string[] args) {
      HttpRequest request = new HttpRequest(
         "a","b","c");
      WebClient client = new WebClient();

      client.ExecuteCommand(
         request.QueryString["abc"]);
   }
}
[WebServiceBinding]
public class WebClient : SoapHttpClientProtocol {
   [SoapDocumentMethod]
   public void ExecuteCommand(string command) {
      Invoke("ExecuteCommand",
         new object[] { command });
   }
}

In this contrived example, data is shown as being passed from a query string
obtained from what is presumably a real HTTP request to the client portion of
the web service method ExecuteCommand.  The web service application, in turn,
contains the following code:

[WebService]
public class WebService {
   [WebMethod]
   public void ExecuteCommand(string command) {
      System.Diagnostics.Process.Start(command);
   }
}

In conventional tools, it would not be possible to directly model this data
flow path because the data flow path is indirect.  However, using a simple
methodology to bridge the client-side formal input parameters with the
server-side formal input parameters at the instruction tier, it is possible to
connect the two and represent data flow between the two conceptual software
elements at each generalization tier.  The following sections will provide
visual examples of how the PQE algorithm narrows down and eliminates
unnecessary data flow paths at each generalization tier by progressively
qualifying data flow information.  One thing to note about the graphs at each
tier is that implicit edges have been created between formal input and output
parameters that reside in external (un-analyzed) libraries.  This is done
under the assumption that a formal input parameter may affect a formal output
parameter in some way in the context of the code that is not analyzed.  If all
target code paths have been analyzed, then this is not necessary.  The graphs
shown at each tier were automatically generated but have been modified to
allow them to fit within the margins of this document and in some cases
highlight important features.

3.1) Abstract Tiers

Abstract tiers represent the most general view of the data flow behaviors of
an application.  The data flow behavior is modeled with respect to abstract
software elements, such as a component, rather than concrete software
elements, such as a module or a type.  For this example, it is assumed that
the PQE algorithm begins by modeling data flow behaviors between conceptual
components in a web application.  The web application is composed of two
manually defined abstract components, a Web Client and a Web Service.  These
two components both rely on external libraries, as represented by the
Undefined component, which are outside of the scope of the application itself.
When starting at the abstract tier, all abstract data flow paths must be
considered as potential data flow paths.  The component tier data flow graph
for this application is shown in figure 14.


                            .--------------------.
                     .------| origin(Web Client) |
                     |      `--------------------`
                     V               |
         .----------------.          |
         | fin(Undefined) |<---.     |
         `----------------`     \    |
        .-^    |    ^            |   |
       /       V    |            |   |
       |  .----------------.     |   |
       |  | fin(Undefined) |     |   |
       |  `----------------`     |   |
       |                  \      |   |
       |                   `--.  |   |
       |                      V  |   V
      .------------------.    .-----------------.
      | fin(Web Service) |<---| fin(Web Client) |
      `------------------`    `-----------------`
               |                    |      ^
               V                    V      |
      .------------------.    .------------------.
      | fin(Web Service) |    | fout(Web Client) |
      `------------------`    `------------------`

                   Figure 14
	Complete compenet tier data flow graph for the
	web application.

Using the data flow graph shown in figure 14, PQE uses the Reachability()
algorithm to determine data flow paths between a formal output parameter in
the Undefined component and a formal input parameter in the Undefined
component.  At this generalization tier, there are many different paths that
can be taken between these two components.  This effectively results in the
qualification of nearly all of the assembly tier data flow paths.  These data
flow paths are used to represent the data flow graph at the assembly tier.

In this example, PQE offers no improvements at abstract tiers because it is a
requirement that all abstract data flow information be represented.  Since the
amount of information required to represent abstract data flow is minimal,
this is not seen as a deficiency.  Furthermore, for this particular example,
nearly all component data flow paths are found to be involved in reachable
paths.  At worst, this is indicative that for small applications, it may not
be necessary to start the algorithm by looking at abstract data flow
information.  Instead, one might immediately progress to the module or data
type tiers.

3.2) Module Tier

The module tier uses the set of data flow paths found at the abstract
component tier to construct a data flow graph that shows the data flow
relationships between formal input and formal output parameters passed between
modules.  The graph is generated using the one-to-many table that was
populated during generalization which conveys the module data flow paths that
were generalized by the set of qualified component data flow paths.  For this
particular example, nearly all of the module data flow paths were qualified as
potentially being involved in a reachable path between the source and sink
flow descriptor.  The graph that is generated as a result is shown in figure
15.


Using this graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptor at the module tier.  In this case,
only the edges between the nodes highlighted in dark orange are found to be
involved in reachable paths between fout(System.Web) and fin(System).  The
important thing to note is that even at the module tier, a data flow path is
illustrated between fin(WebClient) and fin(WebService).  This will be a trend
that will continue to each more specific generalization tier.

3.3) Data Type Tier

The data type tier uses the set of data flow paths found at the module tier to
construct a data flow graph that shows the data flow relationships between
formal input and formal output parameters passed between data types.  The
graph is generated using the one-to-many table that was populated during
generalization which conveys the data type data flow paths that were
generalized by the set of qualified module data flow paths.  The graph that is
generated as a result is shown in figure 16.


Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptor at the data type tier.  Due to the
simplicity of the example application, only a few data flow paths were
rendered.  The complete data flow path from fout(System.Web.HttpRequest) to
fin(System.Diagnostics.Process.Start) can be clearly seen.

3.4) Procedure Tier

The procedure tier uses the set of data flow paths found at the data type tier
to construct a data flow graph that shows the data flow relationships between
formal input and formal output parameters passed between procedures.  Unlike
previous tiers, procedure tier data flow paths explicitly identify the formal
parameter index that data is being passed to.  This helps to further isolate
data flow paths from one another and improves the overall accuracy of paths
that are selected.  The graph is generated using the one-to-many table that
was populated during generalization which conveys the procedure data flow
paths that were generalized by the set of qualified data type data flow paths.
The graph that is generated as a result is shown in figure 17.


Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptor at the procedure tier.  Due to the
simplicity of the example application, only a few data flow paths were
rendered.  The complete data flow path from fout(getQueryString, 0) to
fin(Start, 0) can be clearly seen.

3.5) Basic Block Tier

The basic block tier uses the set of data flow paths found at the procedure
tier to construct a data flow graph that shows the data flow relationships
between formal input and formal output parameters passed between basic blocks.
Like the procedure tier, basic block tier data flow paths also explicitly
identify the formal parameter index that data is being passed to.  The graph
is generated using the one-to-many table that was populated during
generalization which conveys the basic block data flow paths that were
generalized by the set of qualified procedure data flow paths.  The graph that
is generated as a result is shown in figure 18.  Due to the way that Phoenix
currently represents basic blocks, the basic block tier data flow paths offer
very little generalization beyond the instruction tier.


Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptor at the basic block tier.  Due to
the simplicity of the example application, only a few data flow paths were
rendered.  The complete data flow path from fout(getQueryString, 0) to
fin(Start, 0) can be clearly seen.

3.6) Instruction Tier

The instruction tier uses the set of data flow paths found at the basic block
tier to construct a data flow graph that shows the data flow relationships
between formal input and formal output parameters passed between instructions.
Like the basic block tier, instruction tier data flow paths also explicitly
identify the formal parameter index that data is being passed to.  The graph
is generated using the one-to-many table that was populated during
generalization which conveys the instruction data flow paths that were
generalized by the set of qualified basic block data flow paths.  The graph
that is generated as a result is shown in figure 19.  The instruction tier
data flow paths represent the final step taken by the algorithm as they
contain the most specific description of data flow paths.


Using the graph, the Reachability() algorithm is again employed to find paths
between the source and sink flow descriptor at the instruction tier.  Due to
the simplicity of the example application, only a few data flow paths were
rendered.  The complete data flow path from fout(getQueryString, 0) to
fin(Start, 0) can be clearly seen along with source lines that are encountered
along the way.

4) Acknowledgements

The author would like to thank Rolf Rolles, Richard Johnson, Halvar Flake,
Jordan Hind, and many others for thoughtful discussions and feedback.

5) Conclusion

This document has attempted to convey the potential benefits of generalizing
data flow information along generalization tiers.  Each generalization tier is
used to represent the data flow behaviors of an abstract or concrete software
element such as an instruction, basic block, procedure, and so on. Using this
concept, data flow information can be collected at the most specific tier, the
instruction tier, and then generalized to increasingly less-specific tiers.
The generalization process has the effect of reducing the amount of data that
must be considered at once while still conveying a general description of the
manner in which data flows within an application.

Generalized data flow information can be immediately used in conjunction with
existing graph reachability problems.  For instance, a common task that
involves determining reachable data flow paths between a conceptual source and
sink location within an application can potentially benefit from operating on
generalized data flow information.  This paper has illustrated these potential
benefits by defining the Progressive Qualified Elaboration (PQE) algorithm
which can be used to progressively determine reachability at each
generalization tier.  By starting at the least specific generalization tier
and progressing toward the most specific, it is possible to restrict the
amount of data flow information that must be considered at once to a minimal
set.  This is accomplished by using reachable paths found at each
generalization tier to qualify the set of data flow paths that must be
considered at more specific generalization tiers.

While these benefits are thought to be present, the author has yet to
conclusively prove this to be the case.  The results presented in this paper
do not prove the presumed usefulness of generalizing data flow information
beyond the procedure tier.  The author believes that analysis of large
applications involving hundreds of modules could benefit from generalizing
data flow information to the data type, module, and more abstract tiers.
However, at the time of this writing, conclusive data has not been
collected to prove this usefulness.  The author hopes to collect
information that either confirms or refutes this point during future
research.

At present, the underlying implementation used to generate the results
described in this paper has a number of known limitations.  The first
limitation is that it does not currently take into account formal parameters
that are not passed at a call site, such as fields, global variables, and so
on.  This significantly restricts the accuracy of the data flow model that it
is currently capable of generating.  This limitation represents a more general
problem of needing to better refine the underlying completeness of the data
flow information that is captured.

While the algorithms presented in this paper were portrayed in the context of
data flow analysis, it is entirely possible to apply them to other fields as
well, such as control flow analysis.  The PQE algorithm itself is conceptually
generic in that it simply describes a process that can be employed to qualify
the next set of analysis information that must be considered from a more
generic set of analysis information.  This may facilitate future research
directions.

References

[1] Atkinson, Griswold.  Implementation Techniques for Efficient Data-flow Analysis of Large Programs.
    Proceedings of the IEEE International Conference on Software Maintenance (ICSM'01).  2001.
    http://www.cse.scu.edu/ atkinson/papers/icsm-01.ps

[2] Das, M.  Static Analysis of Large Programs: Some Experiences
    2000.  http://research.microsoft.com/manuvir/Talks/pepm00.ppt

[3] Das, M., Lerner, S., Seigle, M. ESP: Path-Sensitive Program Verification in Polynomial Time.
    Proceedings of the SIGPLAN 2002 Conference on programming language design. 2002.
    http://www.cs.cornell.edu/courses/cs711/2005fa/papers/dls-pldi02.pdf

[4] Das, M., Fahndrich, M., Rehof, J. From Polymorphic Subtyping to CFL Reachability: Context-Sensitive Flow Analysis Using Instantiation Constraints. 2000.
    http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-99-84

[5] Dinakar Dhurjati1, Manuvir Das, and Yue Yang.
    Path-Sensitive Dataflow Analysis with Iterative Refinement.
    SAS'06: The 13th International Static Analysis Symposium, Seoul, August 2006.

[6] Erikson, Manocha.  Simplification Culling of Static and Dynamic Scene Graphs.
    TR9809-009 by University of North Carolina at Chapel Hill. 1998.  citeseer.ist.psu.edu/erikson98simplification.html

[7] Horwitz, S., Reps, T., and Binkley, D., Interprocedural slicing using dependence graphs.
    In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation,
    (Atlanta, GA, June 22-24, 1988), ACM SIGPLAN Notices 23, 7 (July 1988), pp. 35-46.

[8] Horwitz, S., Reps, T., and Binkley, D., Retrospective: Interprocedural slicing using dependence graphs.
    20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation (1979 - 1999):
    A Selection, K.S. McKinley, ed., ACM SIGPLAN Notices 39, 4 (April 2004), 229-231.

[9] Gregor, D., Schupp, S.  Retaining Path-Sensitive Relations across Control Flow Merges.
    Technical report 03-15, Rensselaer Polytechnic Institute, November 2003.
    http://www.cs.rpi.edu/research/ps/03-15.ps

[10] Kiss, Ja<4A>sz, Lehotai, Gyimo<6D>thy.  Interprocedural Static Slicing of Binary Executables.
     http://www.inf.u-szeged.hu/ akiss/pub/kiss_interprocedural.pdf

[11] Microsoft Corporation.  Phoenix Framework.
     http://research.microsoft.com/phoenix/

[12] Naumovich, G., Avrunin, G. S., and Clarke, L. A. 1999.
     Data flow analysis for checking properties of concurrent Java programs.
     In Proceedings of the 21st international Conference on Software Engineering
     (Los Angeles, California, United States, May 16 - 22, 1999).
     International Conference on Software Engineering.
     IEEE Computer Society Press, Los Alamitos, CA, 399-410.

[13] Reps, T., Horwitz, S., and Sagiv, M., Precise interprocedural dataflow analysis via graph reachability.
     In Conference Record of the 22nd ACM Symposium on Principles of Programming Languages,
     (San Francisco, CA, Jan. 23-25, 1995), pp. 49-61.

[14] Reps, T., Sagiv, M., and Horwitz S., Interprocedural dataflow analysis via graph reachability.
     TR 94-14, Datalogisk Institut, University of Copenhagen, Copenhagen, Denmark, April 1994.

[15] A. Rountev, B. G. Ryder, and W. Landi. Dataflow analysis of program fragments.
     In Proc. Symp. Foundations of Software Engineering, LNCS 1687, pages 235--252, 1999.
     http://citeseer.ist.psu.edu/rountev99dataflow.html

[16] Schultes, Dominik.  Fast and Exact Shortest Path Queries Using Highway Hierachies. 2005.
     http://algo2.iti.uka.de/schultes/hwy/hwyHierarchies.pdf