Memalyze: Dynamic Analysis of Memory Access Behavior in Software

skape
mmiller@hick.org
4/2007

Abstract

This paper describes strategies for dynamically analyzing an application's memory access behavior. These strategies make it possible to detect when a read or write is about to occur at a given location in memory while an application is executing. An application's memory access behavior can provide additional insight into how it operates. For example, it may be able to provide an idea of how data propagates throughout the address space. Three individual strategies which can be used to intercept memory accesses are described in this paper. Each strategy makes use of a unique method of intercepting memory accesses. These methods include the use of Dynamic Binary Instrumentation (DBI), x86 hardware paging features, and x86 segmentation features. A detailed description of the design and implementation of these strategies for 32-bit versions of Windows is given. Potential uses for these analysis techniques are described in detail.

1) Introduction

If software analysis had a holy grail, it would more than likely be centered around the ability to accurately model the data flow behavior of an application. After all, applications aren't really much more than sophisticated data processors that operate on varying sets of input to produce varying sets of output. Describing how an application behaves when it encounters these varying sets of input makes it possible to predict future behavior. Furthermore, it can provide insight into how the input could be altered to cause the application to behave differently. Given these benefits, it's only natural that a discipline exists that is devoted to the study of data flow analysis.

There are two general approaches that can be taken to perform data flow analysis. The first approach is referred to as static analysis, and it involves analyzing an application's source code or compiled binaries without actually executing the application. The second approach is dynamic analysis which, as one would expect, involves analyzing the data flow of an application as it executes. The two approaches have both common and unique benefits, and no argument will be made in this paper as to which may be better or worse. Instead, this paper will focus on describing three strategies that may be used to assist in the process of dynamic data flow analysis.

The first strategy involves using Dynamic Binary Instrumentation (DBI) to rewrite the instruction stream of the executing application in a manner that makes it possible to intercept instructions that read from or write to memory. Two well-known DBI implementations are DynamoRIO and Valgrind[3, 11]. The second strategy that will be discussed involves using the hardware paging features of the x86 and x64 architectures to trap and handle access to specific pages in memory. Finally, the third strategy makes use of the segmentation features included in the x86 architecture to trap memory accesses by making use of the null selector. Though these three strategies vary greatly, they all accomplish the same goal of being able to intercept memory accesses within an application as it executes.

The ability to intercept memory reads and writes during runtime can support research in additional areas relating to dynamic data flow analysis. For example, the ability to track what areas of code are reading from and writing to memory could make it possible to build a model for the data propagation behaviors of an application. Furthermore, it might be possible to show with what degree of code-level isolation different areas of memory are accessed. Indeed, it may also be possible to validate the data consistency model of a threaded application by investigating the access behaviors of various regions of memory which are referenced by multiple threads. These are but a few of the many potential candidates for dynamic data flow analysis.

This paper is organized into three sections. Section 2 gives an introduction to three different strategies for facilitating dynamic data flow analysis. Section 3 enumerates some of the potential scenarios in which these strategies could be applied in order to render some useful information about the data flow behavior of an application. Finally, section 4 describes some of the previous work whose concepts have been used as the basis for the research described herein.

2) Strategies

This section describes three strategies that can be used to intercept runtime memory accesses. The strategies described herein do not rely on any static binary analysis. Techniques that do make use of static binary analysis are outside of the scope of this paper.

2.1) Dynamic Binary Instrumentation

Dynamic Binary Instrumentation (DBI) is a method of analyzing the behavior of a binary application at runtime through the injection of instrumentation code. This instrumentation code executes as part of the normal instruction stream after being injected. In most cases, the instrumentation code will be entirely transparent to the application that it's been injected into. Analyzing an application at runtime makes it possible to gain insight into the behavior and state of an application at various points in execution. This highlights one of the key differences between static binary analysis and dynamic binary analysis. Rather than considering what may occur, dynamic binary analysis has the benefit of operating on what actually does occur. This is by no means exhaustive in terms of exercising all code paths in the application, but it makes up for this by providing detailed insight into an application's concrete execution state.

The benefits of DBI have made it possible to develop some incredibly advanced tools. Examples where DBI might be used include runtime profiling, visualization, and optimization tools. DBI implementations generally fall into two categories: light-weight or heavy-weight. A light-weight DBI operates on the architecture-specific instruction stream and state when performing analysis. A heavy-weight DBI operates on an abstract form of the instruction stream and state. An example of a heavy-weight DBI is Valgrind, which performs analysis on an intermediate representation of the machine state[11, 7]. An example of a light-weight DBI is DynamoRIO, which performs analysis using the architecture-specific state[3]. The benefit of a heavy-weight DBI over a light-weight DBI is that analysis code written against the intermediate representation is immediately portable to other architectures, whereas light-weight DBI analysis implementations must be fine-tuned to work with individual architectures. While Valgrind is a novel and interesting implementation, it is currently not supported on Windows. For this reason, attention will be given to DynamoRIO for the remainder of this paper. There are many additional DBI frameworks and details, but for the sake of limiting scope these will not be discussed. The reader should consult reference material to learn more about this subject[11].

DynamoRIO is an example of a DBI framework that allows custom instrumentation code to be integrated in the form of dynamic libraries. The tool itself is a combination of Dynamo, a dynamic optimization engine developed by researchers at HP, and RIO, a runtime introspection and optimization engine developed by MIT. The fine-grained details of the implementation of DynamoRIO are outside of the scope of this paper, but it's important to understand the basic concepts[2].

At a high-level, figure 1 from Transparent Binary Optimization provides a great visualization of the process employed by Dynamo[2]. In concrete terms, Dynamo works by processing an instruction stream as it executes. To accomplish this, Dynamo assumes responsibility for the execution of the instruction stream. It uses a disassembler to identify the point of the next branch instruction in the code that is about to be executed. The set of instructions disassembled is referred to as a fragment (although, it's more commonly known as a basic block). If the target of the branch instruction is in Dynamo's fragment cache, it executes the (potentially optimized) code in the fragment cache. When this code completes, it returns control to Dynamo to disassemble the next fragment. If at some point Dynamo encounters a branch target that is not in its fragment cache, it will add it to the fragment cache and potentially optimize it. This is the perfect opportunity for instrumentation code to be injected into the optimized fragment that is generated for a branch target. Injecting instrumentation code at this level is entirely transparent to the application. While this is an oversimplification of the process used by DynamoRIO, it should at least give some insight into how it functions.

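The fragment cache loop described above can be sketched as a toy model in C. This is not DynamoRIO's actual code; the names and the cache layout are hypothetical, and a real engine stores translated (and possibly instrumented) code rather than bare addresses:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 64

/* Toy fragment cache: maps a branch-target "address" to a cached
 * fragment slot. A real engine would store translated code here. */
typedef struct {
    unsigned long target[CACHE_SIZE];
    int valid[CACHE_SIZE];
    int inserts;   /* fragments built on a cache miss */
    int hits;      /* times cached code was reused */
} fragment_cache;

static size_t slot_for(unsigned long target) {
    return (size_t)(target % CACHE_SIZE);
}

/* Dispatch one branch target: on a miss, "build" the fragment (the
 * point where instrumentation would be injected); on a hit, reuse
 * the cached copy without re-entering the engine's builder. */
void dispatch(fragment_cache *fc, unsigned long target) {
    size_t s = slot_for(target);
    if (fc->valid[s] && fc->target[s] == target) {
        fc->hits++;
    } else {
        fc->target[s] = target;
        fc->valid[s] = 1;
        fc->inserts++;   /* instrumentation insertion happens here */
    }
}
```

The essential property the sketch captures is that the cost of building (and instrumenting) a fragment is paid once per branch target; subsequent executions run straight out of the cache.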
One of the best features of DynamoRIO from an analysis standpoint is that it provides a framework for inserting instrumentation code during the time that a fragment is being inserted into the fragment cache. This is especially useful for the purposes of intercepting memory accesses within an application. When a fragment is being created, DynamoRIO provides analysis libraries with the instructions that are to be included in the fragment that is generated. To optimize for performance, DynamoRIO provides multiple levels of disassembly information. At the most optimized level, only very basic information about the instructions is provided. At the least optimized level, very detailed information about the instructions and their operands can be obtained. Analysis libraries are free to control the level of information that they retrieve. Using this knowledge of DynamoRIO, it is now possible to consider how one might design an analysis library that is able to intercept memory reads and writes while an application is executing.

2.1.1) Design

DBI, and DynamoRIO in particular, make designing a solution that can intercept memory reads and writes fairly trivial. The basic design involves having an analysis library that scans the instructions within a fragment that is being created. When an instruction that accesses memory is encountered, instrumentation code can be inserted prior to the instruction. The instrumentation code can be composed of instructions that notify an instrumentation function of the memory operand that is about to be read from or written to. This has the effect of causing the instrumentation function to be called when the fragment is executed. These few steps are really all that it takes to instrument the memory access behavior of an application as it executes using DynamoRIO.

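The scan-and-insert step can be modeled in C with a toy instruction representation. Everything here (toy_instr, instrument_fragment, the "callout" marker) is hypothetical; a real DynamoRIO client would use the framework's own instruction list and operand queries instead:

```c
#include <assert.h>
#include <string.h>

/* Toy instruction: a mnemonic plus a flag indicating whether any of
 * its operands reference memory. A real client would query the DBI
 * framework's decoder for this information. */
typedef struct {
    const char *mnemonic;
    int accesses_memory;
} toy_instr;

/* Walk a fragment's instruction list and emit a "callout" marker
 * before every instruction that reads or writes memory, mirroring
 * the pre-insertion step described above. Returns the new length,
 * or -1 if the output buffer is too small. */
int instrument_fragment(const toy_instr *in, int n,
                        toy_instr *out, int max_out) {
    static const toy_instr callout = { "callout", 0 };
    int m = 0;
    for (int i = 0; i < n; i++) {
        int needed = in[i].accesses_memory ? 2 : 1;
        if (m + needed > max_out)
            return -1;
        if (in[i].accesses_memory)
            out[m++] = callout;   /* notify the analysis code first */
        out[m++] = in[i];         /* then run the original instruction */
    }
    return m;
}
```

When the instrumented fragment executes, each callout fires before the memory operand is actually dereferenced, which is exactly the interception point the design calls for.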
2.1.2) Implementation

The implementation of the DBI approach is really just as easy as the design description makes it sound. To cooperate with DynamoRIO, an analysis library must implement a well-defined routine named dynamorio_basic_block which is called by DynamoRIO when a fragment is being created. This routine is passed an instruction list which contains the set of instructions taken from the native binary. Using this instruction list, the analysis library can make a determination as to whether or not any of the operands of an instruction either explicitly or implicitly reference memory. If an instruction does access memory, then instrumentation code must be inserted.

Inserting instrumentation code with DynamoRIO is a pretty painless process. DynamoRIO provides a number of macros that encapsulate the process of creating and inserting instructions into the instruction list. For example, INSTR_CREATE_add will create an add instruction with a specific set of arguments, and instrlist_meta_preinsert will insert an instruction prior to another instruction within the instruction list.

A proof of concept implementation is included with the source code provided along with this paper.

2.1.3) Considerations

This approach is particularly elegant thanks to the concepts of dynamic binary instrumentation and to DynamoRIO itself, which provides a framework that supports inserting instrumentation code into the fragment cache. Since DynamoRIO is explicitly designed to be a runtime optimization engine, the fact that the instrumentation code is cached within the fragment cache means that it gains the benefits of DynamoRIO's fragment optimization algorithms. When compared to alternative approaches, this approach also has significantly less overhead once the fragment cache begins to become populated. This is because all of the instrumentation code is placed entirely inline with the application code that is executing rather than having to rely on alternative means of interrupting the normal course of program execution. Still, this approach is not without its set of considerations. Some of these considerations are described below:

1. Requires the use of a disassembler

   DynamoRIO depends on its own internal disassembler. This can be a source of problems and limitations.

2. Self-modifying and dynamic code

   Self-modifying and dynamically generated code can potentially cause problems with DynamoRIO.

3. DynamoRIO is closed source

   While this has nothing to do with the actual concept, the fact that DynamoRIO is closed source can be limiting in the event that there are issues with DynamoRIO itself.

2.2) Page Access Interception

The hardware paging features of the x86 and x64 architectures represent a potentially useful means of obtaining information about the memory access behavior of an application. This is especially true due to the well-defined actions that the processor takes when a reference is made to a linear address whose physical page is either not present or has had its access restricted. In these cases, the processor will assert the page fault interrupt (0x0E) and thereby force the operating system to attempt to gracefully handle the virtual memory reference. In Windows, the page fault interrupt is handled by nt!KiTrap0E. In most cases, nt!KiTrap0E will issue a call into nt!MmAccessFault which is responsible for making a determination about the nature of the memory reference that occurred. If the memory reference fault was a result of an access restriction, nt!MmAccessFault will return an access violation error code (0xC0000005). When an access violation occurs, an exception record is generated by the kernel and is then passed to either the user-mode exception dispatcher or the kernel-mode exception dispatcher depending on which mode the memory access occurred in. The job of the exception dispatcher is to give a thread an opportunity to gracefully recover from the exception. This is accomplished by providing each of the registered or vectored exception handlers with the exception information that was collected when the page fault occurred. If an exception handler is able to recover, execution of the thread can simply restart where it left off. Using the principles outlined above, it is possible to design a system that is capable of both trapping and handling memory references to specific pages in memory during the course of normal process execution.

2.2.1) Design

The first step that must be taken to implement this system involves identifying a method that can be used to trap references to arbitrary pages in memory. Fortunately, previous work has done much to identify some of the different approaches that can be taken to accomplish this[8, 4]. For the purposes of this paper, one of the most useful approaches centers around the ability to define whether or not a page is restricted from user-mode access. This is controlled by the Owner bit in a linear address' page table entry (PTE)[5]. When the Owner bit is set to 0, the page can only be accessed at privilege level 0. This effectively restricts access to kernel-mode in all modern operating systems. Likewise, when the Owner bit is set to 1, the page can be accessed from all privilege levels. By toggling the Owner bit to 0 in the PTEs associated with a given set of linear addresses, it is possible to trap all user-mode references to those addresses at runtime. This effectively solves the first hurdle in implementing a solution to intercept memory access behavior.

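In architectural terms, the Owner bit is the x86 User/Supervisor bit, which is bit 2 of a non-PAE page table entry. The toggle the design relies on is simple bit manipulation over the PTE, sketched here (the function names are illustrative, not Windows kernel APIs):

```c
#include <assert.h>
#include <stdint.h>

/* x86 (non-PAE) page table entry flag bits. The "Owner" bit in
 * Windows terminology is the architecture's User/Supervisor bit. */
#define PTE_PRESENT 0x001u
#define PTE_WRITE   0x002u
#define PTE_OWNER   0x004u  /* 1 = user-mode may access, 0 = ring 0 only */

/* Restrict a page to kernel-mode by clearing the Owner bit. */
uint32_t pte_restrict_to_kernel(uint32_t pte) {
    return pte & ~PTE_OWNER;
}

/* Re-allow user-mode access by setting the Owner bit. */
uint32_t pte_allow_user(uint32_t pte) {
    return pte | PTE_OWNER;
}

/* Would a user-mode reference through this PTE fault? It does if the
 * page is not present or is marked supervisor-only. */
int pte_user_access_faults(uint32_t pte) {
    return !(pte & PTE_PRESENT) || !(pte & PTE_OWNER);
}
```

In the real system this bit flip must of course be performed on the live PTE from kernel-mode, which is where the locking caveat discussed in section 2.2.3 comes in.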
Using the approach outlined above, any reference that is made from user-mode to a linear address whose PTE has had the Owner bit set to 0 will result in an access violation exception being passed to the user-mode exception dispatcher. This exception must be handled by a custom exception handler that is able to distinguish transient access violations from ones that occurred as a result of the Owner bit having been modified. This custom exception handler must also be able to recover from the exception in a manner that allows execution to resume seamlessly. Distinguishing exceptions is easy if one assumes that the custom exception handler has knowledge in advance of the address regions that have had their Owner bit modified. Given this assumption, the act of distinguishing exceptions is as simple as seeing if the fault address is within an address region that is currently being monitored. While distinguishing exceptions may be easy, being able to recover gracefully is an entirely different matter.

To recover and resume execution with no noticeable impact to an application means that the exception handler must have a mechanism that allows the application to access the data stored in the pages whose virtual mappings have had their access restricted to kernel-mode. This, of course, would imply that the application must have some way, either direct or indirect, to access the contents of the physical pages associated with the virtual mappings that have had their PTEs modified. The most obvious approach would be to simply toggle the Owner bit to permit user-mode access. This has many different problems, not the least of which being that doing so would be expensive and would not behave properly in multi-threaded environments (memory accesses could be missed or worse). An alternative to updating the Owner bit would be to have a device driver designed to provide support to processes that would allow them to read the contents of a virtual address at privilege level 0. However, having the ability to read and write memory through a driver means nothing if the results of the operation cannot be factored back into the instruction that triggered the exception.

Rather than attempting to emulate the read and write access, a better approach can be used. This approach involves creating a second virtual mapping to the same set of physical pages described by the linear addresses whose PTEs were modified. This second virtual mapping would behave like a typical user-mode memory mapping. In this way, the process' virtual address space would contain two virtual mappings to the same set of physical pages. One mapping, which will be referred to as the original mapping, would represent the user-mode inaccessible set of virtual addresses. The second mapping, which will be referred to as the mirrored mapping, would be the user-mode accessible set of virtual addresses. By mapping the same set of physical pages at two locations, it is possible to transparently redirect address references at the time that exceptions occur. An important thing to note is that in order to provide support for mirroring, a disassembler must be used to figure out which registers need to be modified.

To better understand how this could work, consider a scenario where an application contains a mov [eax], 0x1 instruction. For the purposes of this example, assume that the eax register contains an address that is within the original mapping as described above. When this instruction executes, it will lead to an access violation exception being generated as a result of the PTE modifications that were made to the original mapping. When the exception handler inspects this exception, it can determine that the fault address was one that is contained within the original mapping. To allow execution to resume, the exception handler must update the eax register to point to the equivalent address within the mirrored region. Once it has altered the value of eax, the exception handler can tell the exception dispatcher to continue execution with the now-modified register information. From the perspective of an executing application, this entire operation will occur transparently. Unfortunately, there's still more work that needs to be done in order to ensure that the application continues to execute properly after the exception dispatcher continues execution.

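The register fix-up at the heart of this example is plain address arithmetic: the faulting address's offset within the original mapping is rebased onto the mirrored mapping. A minimal sketch, assuming hypothetical base addresses and a region descriptor of the author's own design rather than anything from the paper's actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A tracked region: an original (kernel-only) mapping and a mirrored
 * (user-accessible) mapping of the same physical pages. */
typedef struct {
    uintptr_t orig_base;
    uintptr_t mirror_base;
    size_t    size;
} mirror_region;

/* If the fault address falls inside the original mapping, return the
 * equivalent address inside the mirror; otherwise return it unchanged.
 * This is the adjustment the handler applies to the faulting register
 * (eax in the mov [eax], 0x1 example) before resuming execution. */
uintptr_t redirect_to_mirror(const mirror_region *r, uintptr_t fault) {
    if (fault >= r->orig_base && fault < r->orig_base + r->size)
        return r->mirror_base + (fault - r->orig_base);
    return fault;
}
```

Because both mappings describe the same physical pages, the redirected access observes and modifies exactly the data the original instruction intended to touch.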
The biggest problem with modifying the value of a register to point to the mirrored address is that it can unintentionally alter the behavior of subsequent instructions. For example, the application may not function properly if it assumes that it can access other non-mirrored memory addresses relative to the address stored within eax. Not only that, but allowing eax to continue to be accessed through the mirrored address will mean that subsequent reads and writes to memory made using the eax register will be missed for the time that eax contains the mirrored address.

In order to solve this problem, it is necessary to come up with a method of restoring registers to their original value after the instruction executes. Fortunately, the underlying architecture has built-in support that allows a program to be notified after it has executed an instruction. This support is known as single-stepping. To make use of single-stepping, the exception handler can set the trap flag (0x100) in the saved value of the eflags register. When execution resumes, the processor will generate a single step exception after the original instruction executes. This will result in the custom exception handler being called. When this occurs, the custom exception handler can determine if the single step exception occurred as a result of a previous mirroring operation. If it was the result of a mirroring operation, the exception handler can take steps to restore the appropriate register to its original value.

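Arming and disarming single-stepping amounts to toggling one bit in the saved flags of the thread context, as this small sketch shows (the helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_TF 0x100u  /* trap flag: single-step after one instruction */

/* Arm single-stepping in a saved thread context's EFLAGS value so the
 * handler regains control right after the faulting instruction runs. */
uint32_t eflags_set_trap(uint32_t eflags)   { return eflags | EFLAGS_TF; }

/* Disarm single-stepping once the register has been restored. */
uint32_t eflags_clear_trap(uint32_t eflags) { return eflags & ~EFLAGS_TF; }
```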
Using these four primary steps, a complete solution to the problem of intercepting memory accesses can be formed. First, the Owner bit of the PTEs associated with a region of virtual memory can be set to 0. This will cause user-mode references to this region to generate an access violation exception. Second, an additional mapping to the set of physical pages described by the original mapping can be created which is accessible from user-mode. Third, any access violation exceptions that reach the custom exception handler can be inspected. If they are the result of a reference to a region that is being tracked, the register contents of the thread context can be adjusted to reference the user-accessible mirrored mapping. The thread can then be single-stepped so that the fourth and final step can be taken. When a single-step exception is generated, the custom exception handler can restore the original value of the register that was modified. When this is complete, the thread can be allowed to continue as if nothing had happened.

2.2.2) Implementation

An implementation of this approach is included with the source code released along with this paper. This implementation has two main components: a kernel-mode driver and a user-mode DLL. The kernel-mode driver provides a device object interface that allows a user-mode process to create a mirrored mapping of a set of physical pages and to toggle the Owner bit of PTEs associated with address regions. The user-mode DLL is responsible for implementing a vectored exception handler that takes care of processing access violation exceptions by mirroring the address references to the appropriate mirrored region. The user-mode DLL also exposes an API that allows applications to create a memory mirror. This abstracts the entire process and makes it simple to begin tracking a specific memory region. The API also allows applications to register callbacks that are notified when an address reference occurs. This allows further analysis of the memory access behavior of the application.

2.2.3) Considerations

While this approach is most definitely functional, it comes with a number of caveats that make it sub-optimal for any sort of large-scale deployment. The following considerations are by no means all-encompassing, but some of the more important ones have been enumerated below:

1. Unsafe modification of PTEs

   It is not safe to modify PTEs without acquiring certain locks. Unfortunately, these locks are not exported and are therefore inaccessible to third party drivers.

2. Large amount of overhead

   The overhead associated with having to take a page fault and pass the exception on to be handled by user-mode is substantial. Memory access time with respect to the application could jump from nanoseconds to microseconds or even milliseconds.

3. Requires the use of a disassembler

   Since this approach relies on mirroring memory references from one virtual address to another, a disassembler has to be used to figure out which registers need to be modified with the mirrored address. Any time a disassembler is needed is an indication that things are getting fairly complicated.

4. Cannot track memory references to all addresses

   The fact that this approach relies on locking physical pages prevents it from feasibly tracking all memory references. In addition, because the thread stack is required to be valid in order to dispatch exceptions, it's not possible to track reads and writes to thread stacks using this approach.

2.3) Null Segment Interception

Segmentation is an extremely old feature of the x86 architecture. Its purpose has been to provide software with the ability to partition the address space into distinct segments that can be referenced through a 16-bit segment selector. Segment selectors are used to index either the Global Descriptor Table (GDT) or the Local Descriptor Table (LDT). Segment descriptors convey information about all or a portion of the address space. On modern 32-bit operating systems, segmentation is used to set up a flat memory model, primarily because there is no way to disable segmentation entirely. This is further illustrated by the fact that the x64 architecture has effectively done away with the ES, DS, and SS segment registers in 64-bit mode. While segment selectors are primarily intended to make it possible to access memory, they can also be used to prevent access to it.

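For reference, a segment selector packs three fields: a descriptor table index, a table indicator (0 = GDT, 1 = LDT), and a requested privilege level (RPL). The decoding is simple bit extraction, sketched below; as a concrete case, the 0x23 selector used later in this section decodes to GDT index 4 with RPL 3:

```c
#include <assert.h>
#include <stdint.h>

/* An x86 segment selector packs a descriptor table index (bits 15:3),
 * a table indicator (bit 2: 0 = GDT, 1 = LDT), and a requested
 * privilege level (bits 1:0). */
unsigned sel_index(uint16_t sel) { return sel >> 3; }
unsigned sel_ti(uint16_t sel)    { return (sel >> 2) & 1; }
unsigned sel_rpl(uint16_t sel)   { return sel & 3; }
```

The null selector is simply the value 0 (GDT index 0, RPL 0); the architecture reserves that descriptor slot so that loading it is legal but dereferencing through it always faults.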
2.3.1) Design

Segmentation is one of the easiest ways to trap memory accesses. The majority of instructions which reference memory implicitly use either the DS or ES segment registers to do so. The one exception to this rule is the set of instructions that deal with the stack. These instructions implicitly use the SS segment register. There are a few different ways one can go about causing a general protection fault when accessing an address relative to a segment selector, but one of the easiest is to take advantage of the null selector. The null selector, 0x0, is a special segment selector that will always cause a general protection fault when used to reference memory. By loading the null selector into DS, for example, the mov [eax], 0x1 instruction would cause a general protection fault when executed. Using the null selector solves the problem of being able to intercept memory accesses, but there still needs to be some mechanism to allow the application to execute normally after intercepting the memory access.

When a general protection fault occurs in user-mode, the kernel generates an access violation exception and passes it off to the user-mode exception dispatcher in much the same way as was described in 2.2. Registering a custom exception handler makes it possible to catch this exception and handle it gracefully. To handle this exception, the custom exception handler must restore the DS and ES segment registers to valid segment selectors by updating the thread context record associated with the exception. On 32-bit versions of Windows, the segment registers should be restored to 0x23. Once the segment registers have been updated, the exception dispatcher can be told to continue execution. However, before this happens, there is an additional step that must be taken.

It is not enough to simply restore the segment registers and then continue execution. This would lead to subsequent reads and writes being missed as a result of the DS and ES segment registers no longer pointing to the null selector. To address this, the custom exception handler should toggle the trap flag in the context record prior to continuing execution. Setting the trap flag will cause the processor to generate a single step exception after the instruction that generated the general protection fault executes. This single step exception can then be processed by the custom exception handler to reset the DS and ES segment registers to the null selector. After the segment registers have been updated, the trap flag can be disabled and execution can be allowed to continue. By following these steps, the application is able to make forward progress while also making it possible to trap all memory reads and writes that use the DS and ES segment registers.

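The two-phase handler described above can be sketched as a small state machine over a toy thread context. This is a conceptual model with invented names (toy_context, on_access_violation, on_single_step), not the actual vectored exception handler from the paper's implementation:

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_TF     0x100u
#define USER_DATA_SEL 0x23u  /* flat user-mode data selector, 32-bit Windows */

/* Minimal model of the saved thread context the handler edits. */
typedef struct {
    uint32_t seg_ds, seg_es, eflags;
} toy_context;

/* Access violation caused by the null selector: restore valid
 * selectors so the faulting instruction can re-execute, and arm the
 * trap flag so the handler regains control immediately afterwards. */
void on_access_violation(toy_context *c) {
    c->seg_ds = USER_DATA_SEL;
    c->seg_es = USER_DATA_SEL;
    c->eflags |= EFLAGS_TF;
}

/* Single step after the re-executed instruction: reload the null
 * selector to keep trapping future accesses and disarm the trap flag. */
void on_single_step(toy_context *c) {
    c->seg_ds = 0;
    c->seg_es = 0;
    c->eflags &= ~EFLAGS_TF;
}
```

Each trapped memory access therefore costs two exceptions: one general protection fault to observe the access and one single step to re-arm the trap.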
2.3.2) Implementation

The implementation for this approach involves registering a vectored exception handler that is able to handle the access violation and single step exceptions that are generated. Since this approach relies on setting the segment registers DS and ES to the null selector, an implementation must take steps to update the segment register state for each running thread in a process and for all new threads as they are created. Updating the segment register state for running threads involves enumerating running threads in the calling process using the toolhelp library. For each thread that is not the calling thread, the SetThreadContext routine can be used to update segment registers. The calling thread can update the segment registers using native instructions. To alter the segment registers for new threads, the DLL_THREAD_ATTACH notification can be used. Once all threads have had their DS and ES segment registers updated, memory references will immediately begin causing access violation exceptions.

When these access violation exceptions are passed to the vectored exception
handler, appropriate steps must be taken to restore the DS and ES segment
registers to a valid segment selector, such as 0x23. This is accomplished by
updating the SegDs and SegEs segment registers in the CONTEXT structure that
is passed in association with an exception. In addition to updating these
segment registers, the trap flag (0x100) must also be set in the EFlags
register so that the DS and ES segment registers can be restored to the null
selector in order to trap subsequent memory accesses. Setting the trap flag
will lead to a single step exception after the instruction that generated the
access violation executes. When the single step exception is received, the
SegDs and SegEs segment registers can be restored to the null selector.

These few steps capture the majority of the implementation, but there is a
specific Windows nuance that must be handled in order for this to work right.
When the Windows kernel returns to a user-mode process after a system call has
completed, it restores the DS and ES segment selectors to their normal value
of 0x23. The problem with this is that without some way to reset the segment
registers to the null selector after a system call returns, there is no way to
continue to track memory accesses after a system call. Fortunately, there is
a relatively painless way to reset the segment registers after a system call
returns. On Windows XP SP2 and more recent versions of Windows, the kernel
determines where to transfer control to after a system call returns by looking
at the function pointer stored in the shared user data memory mapping.
Specifically, the SystemCallReturn attribute at 0x7ffe0304 holds a pointer to
a location in ntdll that typically contains just a ret instruction as shown
below:

0:001> u poi(0x7ffe0304)
ntdll!KiFastSystemCallRet:
7c90eb94 c3              ret
7c90eb95 8da42400000000  lea esp,[esp]
7c90eb9c 8d642400        lea esp,[esp]

Replacing this single ret instruction with code that resets the DS and ES
registers to the null selector followed by a ret instruction is enough to make
it possible to continue to trap memory accesses after a system call returns.
However, this replacement code should not take these steps if a system call
occurs in the context of the exception dispatcher, as this could lead to a
nesting issue if anything in the exception dispatcher references memory, which
is very likely.

An implementation of this approach is included with the source code provided
along with this paper.

2.3.3) Considerations

There are a few considerations that should be noted about this approach. On
the positive side, this approach is unique when compared to the others
described in this paper due to the fact that, in principle, it should be
possible to use it to trap memory accesses in kernel-mode, although it is
expected that the implementation may be much more complicated. This approach
is also much simpler than the other approaches in that it requires far less
code. While these are all good things, there are some negative considerations
that should also be pointed out. These are enumerated below:

1. Will not work on x64

   The segmentation approach described in this section will not work on x64
   due to the fact that the DS, ES, and even SS segment selectors are
   effectively ignored when the processor is in 64-bit mode.

2. Significant performance overhead

   Like many of the other approaches, this one also suffers from significant
   performance overhead involved in having to take a GP and DB fault for
   every address reference. This approach could be further optimized by
   creating a custom LDT entry (using NtSetLdtEntries) that describes a
   region whose base address is 0 and length is n, where n is just below the
   address of the region(s) that should be monitored. This would have the
   effect of allowing memory accesses to succeed within the lower portion of
   the address space and fail in the higher portion (which is being
   monitored). It's important to note that the base address of the LDT entry
   must be zero. This is problematic since most of the regions that one
   would like to monitor (heap) are allocated low in the address space. It
   would be possible to work around this issue by having
   NtAllocateVirtualMemory allocate using MEM_TOP_DOWN.

3. Requires a disassembler

   Unfortunately, this approach also requires the use of a disassembler in
   order to extract the effective address that caused the access violation
   exception to occur. This is necessary because general protection faults
   that occur due to a segment selector issue generate exception records that
   flag the fault address as being 0xffffffff. This makes sense in the
   context that without a valid segment selector, there is no way to
   accurately calculate the effective address. The use of a disassembler
   means that the code is inherently more complicated than it would otherwise
   need to be. There may be some way to craft a special LDT entry that would
   still make it possible to determine the address that caused the fault, but
   the author has not investigated this.

3) Potential Uses

The ability to intercept an application's memory accesses is an interesting
concept, but on its own it is of little use beyond simple statistical and
visual analysis. Even so, the data that can be collected by analyzing memory
access behavior can make it possible to perform much more extensive forms of
dynamic binary analysis. This section gives a brief introduction to some of
the hypothetical areas that might benefit from being able to understand the
memory access behavior of an application.

3.1) Data Propagation

Being able to gain knowledge about the way that data propagates throughout an
application can provide extremely useful insights. For example, understanding
data propagation can give security researchers an idea of the areas of code
that are affected, either directly or indirectly, by a buffer that is received
from a network socket. In this context, having knowledge about areas affected
by data would be much more valuable than simply understanding the code paths
that are taken as a result of the buffer being received. Though the two may
seem closely related, the areas of code affected by a buffer that is received
should actually be restricted to a subset of the overall code paths taken.

Even if understanding data propagation within an application is beneficial, it
may not be clear exactly how analyzing memory access behavior could make this
possible. To understand how this might work, it's best to think of memory
access in terms of its two basic operations: read and write. In the course of
normal execution, any instruction that reads from a location in memory can be
said to be dependent on the last instruction that wrote to that location.
When an instruction writes to a location in memory, it can be said that any
instructions that originally wrote to that location no longer have claim over
it. Using these simple concepts, it is possible to build a dependency graph
that shows how areas of code become dependent on one another in terms of a
reader/writer relationship. This dependency graph would be dynamic and would
change as a program executes just the same as the data propagation within an
application would change.

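As a rough sketch of how such a dependency graph might be built, the
following assumes a hypothetical trace of (pc, op, addr) tuples produced by
one of the interception strategies; the trace format and all addresses are
invented for illustration.

```python
from collections import defaultdict

def build_dependency_graph(trace):
    """Build a code-location dependency graph from a memory access trace.

    trace is a sequence of (pc, op, addr) tuples, where op is 'r' or 'w'.
    An edge reader_pc -> writer_pc means the instruction at reader_pc read
    a location that was last written by the instruction at writer_pc.
    """
    last_writer = {}           # addr -> pc of the most recent write
    deps = defaultdict(set)    # reader pc -> set of writer pcs
    for pc, op, addr in trace:
        if op == 'r':
            if addr in last_writer:
                deps[pc].add(last_writer[addr])
        else:
            # A write supersedes any previous writer's claim over addr.
            last_writer[addr] = pc
    return deps

# A tiny synthetic trace: 0x401000 writes a buffer, 0x401020 reads it,
# 0x401040 overwrites it, and 0x401060 reads the new contents.
trace = [
    (0x401000, 'w', 0x12f000),
    (0x401020, 'r', 0x12f000),
    (0x401040, 'w', 0x12f000),
    (0x401060, 'r', 0x12f000),
]
deps = build_dependency_graph(trace)
# Each reader depends only on the writer whose data it actually observed.
```

Note that the second read depends on 0x401040 alone: the overwrite removed
0x401000's claim over the location, which is exactly the reader/writer
relationship described above.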
At this point in time, the author has developed a very simple implementation
based on the DBI strategy outlined in this paper. The current implementation
is in need of further refinement, but it is capable of showing reader/writer
relationships as the program executes. This area is ripe for future research.

3.2) Memory Access Isolation

From a visualization standpoint, it might be interesting to be able to show
with what degrees of code-level isolation different regions of memory are
accessed. For example, being able to show what areas of code touch individual
heap allocations could provide interesting insight into the containment model
of an application that is being analyzed. This type of analysis might be able
to show how well designed the application is by inferring code quality based
on the average number of areas of code that make direct reference to unique
heap allocations. Since this concept is a bit abstract, it might make sense
to discuss a more concrete example.

One example might involve an object-oriented C++ application that contains
multiple classes such as Circle, Shape, Triangle, and so on. In the first
design, the application allows classes to directly access the attributes of
instances. In the second design, the application forces classes to reference
attributes through public getters and setters. Using memory access behavior
to identify code-level isolation, the first design might be seen as a poor
design due to the fact that there will be many code locations where unique
heap allocations (class instances) have the contents of their memory accessed
directly. The second design, on the other hand, might be seen as a more
robust design due to the fact that the unique heap allocations would be
accessed by fewer places (the getters and setters).

It may actually be the case that there's no way to draw a meaningful
conclusion by analyzing code-level isolation of memory accesses. One specific
case that was raised to the author involved how the use of inlining or
aggressive compiler optimizations might incorrectly indicate a poor design.
Even though this is likely true, there may be some knowledge that can be
obtained by researching this further. The author is not presently aware of an
implementation of this concept but would love to be made aware if one exists.

3.3) Thread Data Consistency

Programmers familiar with the pains of thread deadlocks and thread-related
memory corruption should be well aware of how tedious these problems can be to
debug. By analyzing memory access behavior in conjunction with some
additional variables, it may be possible to make determinations as to whether
or not a memory operation is being made in a thread safe manner. At this
point, the author has not defined a formal approach that could be taken to
achieve this, but a few rough ideas have been identified.

The basic idea behind this approach would be to combine memory access behavior
with information about the thread that the access occurred in and the set of
locks that were acquired when the memory access occurred. Determining which
locks are held can be as simple as inserting instrumentation code into the
routines that are used to acquire and release locks at runtime. When a lock
is acquired, it can be pushed onto a thread-specific stack. When the lock is
released, it can be removed. The nice thing about representing locks as a
stack is that in almost every situation, locks should be acquired and released
in symmetric order. Acquiring and releasing locks asymmetrically can quickly
lead to deadlocks and therefore can be flagged as problematic.

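The lock-stack bookkeeping described above might be sketched as follows,
assuming the acquire and release routines have been instrumented to call
into a per-thread tracker; the class and method names are hypothetical.

```python
class LockTracker:
    """Per-thread lock stack; flags asymmetric (non-LIFO) releases."""

    def __init__(self):
        self.stack = []       # locks currently held, in acquire order
        self.warnings = []    # locks that were released out of order

    def acquired(self, lock):
        # Called by the instrumented acquire routine.
        self.stack.append(lock)

    def released(self, lock):
        # Called by the instrumented release routine. A release that does
        # not match the top of the stack is asymmetric and gets flagged.
        if not self.stack or self.stack[-1] != lock:
            self.warnings.append(lock)
            if lock in self.stack:
                self.stack.remove(lock)
        else:
            self.stack.pop()

t = LockTracker()
t.acquired('A'); t.acquired('B')
t.released('B'); t.released('A')   # symmetric order: no warnings
t.acquired('A'); t.acquired('B')
t.released('A')                    # asymmetric: 'A' is flagged
```

In a real implementation one tracker instance would live in thread-local
storage so that each thread's lock stack is maintained independently.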
Determining data consistency is quite a bit trickier, however. An analysis
library would need some means of historically tracking read and write access
to different locations in memory. Still, determining what might be a data
consistency issue from this historical data is challenging. One example of a
potential data consistency issue might be if two writes occur to a location in
memory from separate threads without a common lock being acquired between the
two threads. This isn't guaranteed to be problematic, but it is at the very
least indicative of a potential problem. Indeed, it's likely that many
other types of data consistency examples exist that may be possible to capture
in relation to memory access, thread context, and lock ownership.

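The two-writes-without-a-common-lock heuristic could be expressed roughly as
shown below; the event format (thread id, address, set of held locks) is an
assumption about what the combined instrumentation would record.

```python
def find_unlocked_write_pairs(events):
    """Flag addresses written by two threads that held no common lock.

    events: sequence of (thread_id, addr, held_locks) tuples describing
    writes, where held_locks is the set of locks held at write time.
    """
    history = {}     # addr -> list of (thread_id, frozenset of locks)
    suspects = set()
    for tid, addr, locks in events:
        for prev_tid, prev_locks in history.get(addr, []):
            # Different threads and a disjoint lock set: potential race.
            if prev_tid != tid and not (prev_locks & locks):
                suspects.add(addr)
        history.setdefault(addr, []).append((tid, frozenset(locks)))
    return suspects

writes = [
    (1, 0x1000, {'L1'}),
    (2, 0x1000, {'L1'}),   # both writers held L1: consistent
    (1, 0x2000, {'L1'}),
    (2, 0x2000, {'L2'}),   # disjoint lock sets: flagged
]
suspects = find_unlocked_write_pairs(writes)
```

As the text notes, a flagged address is only indicative of a potential
problem, not proof of one; the heuristic would need refinement in practice.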
Even if this concept can be made to work, the very fact that it would be a
runtime solution isn't a great thing. It may be the case that code paths that
lead to thread deadlocks or thread-related corruption are only executed rarely
and are hard to coax out. Regardless, the author feels like this represents
an interesting area of future research.

4) Previous Work

The ideas described in this paper benefit greatly from the concepts
demonstrated in previous works. The memory mirroring concept described in 2.2
draws heavily from the PaX team's work relating to their VMA mirroring and
software-based non-executable page implementations[8]. Oded Horovitz provided
an implementation of the paging approach for Windows and applied it to
application security[4]. In addition, there have been other examples that use
concepts similar to those described by PaX to achieve additional results, such
as OllyBone, ShadowWalker, and others[10, 9]. The use of DBI in 2.1 for
memory analysis is facilitated by the excellent work that has gone into
DynamoRIO, Valgrind, and indeed all other DBI frameworks[3, 11].

It should be noted that if one is strictly interested in monitoring writes to
a memory region, Windows provides a built-in feature known as a write watch.
When allocating a region with VirtualAlloc, the MEM_WRITE_WATCH flag can be
set. This flag tells the kernel to track writes that occur to the region.
These writes can be queried at a later point in time using GetWriteWatch[6].

It is also possible to use guard pages and other forms of page protection,
such as PAGE_NOACCESS, to intercept memory access to a page in user-mode.
Pedram Amini's PyDbg supports the concept of memory breakpoints which are
implemented using guard pages[12]. This type of approach has two limitations
that are worth noting. The first limitation involves an inability to pass
addresses to kernel-mode that have had a memory breakpoint set on them (either
guard page or PAGE_NOACCESS). If this occurs, it can lead to unexpected
behavior, such as causing a system call to fail when referencing the
user-mode address. This would not trigger an exception in user-mode.
Instead, the system call would simply return STATUS_ACCESS_VIOLATION. As a
result, an application might crash or otherwise behave improperly. The second
limitation is that memory accesses may be missed in multi-threaded
environments.

5) Conclusion

The ability to analyze the memory access behavior of an application at runtime
can provide additional insight into how an application works. This insight
might include learning more about how data propagates, deducing the code-level
isolation of memory references, identifying potential thread safety issues,
and so on. This paper has described three strategies that can be used to
intercept memory accesses within an application at runtime.

The first approach relies on Dynamic Binary Instrumentation (DBI) to inject
instrumentation code before instructions that access memory locations. This
instrumentation code is then capable of obtaining information about the
address being referenced when instructions are executed.

The second approach relies on hardware paging features supported by the x86
and x64 architectures to intercept memory accesses. This works by restricting
access to a virtual address range to kernel-mode access. When an application
attempts to reference a virtual address that has been marked as such, an
exception is generated that is then passed to the user-mode exception
dispatcher. A custom exception handler can then inspect the exception and
take the steps necessary to allow execution to continue gracefully after
having tracked the memory access.

The third approach uses the segmentation feature of the x86 architecture to
intercept memory accesses. It does this by loading the DS and ES segment
registers with the null selector. This has the effect of causing instructions
which implicitly use these registers to generate a general protection fault
when referencing memory. This fault results in an access violation exception
being generated that can be handled in much the same way as the hardware
paging approach.

It is hoped that these strategies might be useful to future research which
could benefit from collecting memory access information.

References

[1] AMD. AMD64 Architecture Programmer's Manual: Volume 2 System Programming.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf; accessed 5/2/2007.

[2] Bala, Duesterwald, Banerjia. Transparent Dynamic Optimization.
http://www.hpl.hp.com/techreports/1999/HPL-1999-77.pdf; accessed 5/2/2007.

[3] Hewlett-Packard, MIT. DynamoRIO.
http://www.cag.lcs.mit.edu/dynamorio/; accessed 4/30/2007.

[4] Horovitz, Oded. Memory Access Detection.
http://cansecwest.com/core03/mad.zip; accessed 5/7/2007.

[5] Intel. Intel Architecture Software Developer's Manual Volume 3: System Programming.
http://download.intel.com/design/PentiumII/manuals/24319202.pdf; accessed 5/1/2007.

[6] Microsoft Corporation. GetWriteWatch.
http://msdn2.microsoft.com/en-us/library/aa366573.aspx; accessed 5/5/2007.

[7] Nethercote, Nicholas. Dynamic Binary Analysis and Instrumentation.
http://valgrind.org/docs/phd2004.pdf; accessed 5/2/2007.

[8] PaX Team. PAGEEXEC.
http://pax.grsecurity.net/docs/pageexec.txt; accessed 5/1/2007.

[9] Sparks, Butler. Shadow Walker: Raising the Bar for Rootkit Detection.
https://www.blackhat.com/presentations/bh-jp-05/bh-jp-05-sparks-butler.pdf; accessed 5/3/2007.

[10] Stewart, Joe. Ollybone.
http://www.joestewart.org/ollybone/; accessed 5/3/2007.

[11] Valgrind. Valgrind.
http://valgrind.org/; accessed 4/30/2007.

[12] Amini, Pedram. PaiMei.
http://pedram.redhive.com/PaiMei/docs/; accessed 5/10/2007.