Memalyze: Dynamic Analysis of Memory Access Behavior in Software

skape
mmiller@hick.org
4/2007

Abstract

This paper describes strategies for dynamically analyzing an application's memory access behavior. These strategies make it possible to detect when a read or write is about to occur at a given location in memory while an application is executing. An application's memory access behavior can provide additional insight into how it operates. For example, it may be able to provide an idea of how data propagates throughout the address space. Three individual strategies which can be used to intercept memory accesses are described in this paper. Each strategy makes use of a unique method of intercepting memory accesses. These methods include the use of Dynamic Binary Instrumentation (DBI), x86 hardware paging features, and x86 segmentation features. A detailed description of the design and implementation of these strategies for 32-bit versions of Windows is given. Potential uses for these analysis techniques are described in detail.

1) Introduction

If software analysis had a holy grail, it would more than likely be centered around the ability to accurately model the data flow behavior of an application. After all, applications aren't really much more than sophisticated data processors that operate on varying sets of input to produce varying sets of output. Describing how an application behaves when it encounters these varying sets of input makes it possible to predict future behavior. Furthermore, it can provide insight into how the input could be altered to cause the application to behave differently. Given these benefits, it's only natural that a discipline exists that is devoted to the study of data flow analysis.

There are two general approaches that can be taken to perform data flow analysis. The first approach is referred to as static analysis, and it involves analyzing an application's source code or compiled binaries without actually executing the application. The second approach is dynamic analysis which, as one would expect, involves analyzing the data flow of an application as it executes. The two approaches have both common and unique benefits, and no argument will be made in this paper as to which may be better or worse. Instead, this paper will focus on describing three strategies that may be used to assist in the process of dynamic data flow analysis.

The first strategy involves using Dynamic Binary Instrumentation (DBI) to rewrite the instruction stream of the executing application in a manner that makes it possible to intercept instructions that read from or write to memory. Two well-known DBI implementations are DynamoRIO and Valgrind[3, 11]. The second strategy that will be discussed involves using the hardware paging features of the x86 and x64 architectures to trap and handle access to specific pages in memory. Finally, the third strategy makes use of the segmentation features included in the x86 architecture to trap memory accesses by making use of the null selector. Though these three strategies vary greatly, they all accomplish the same goal of being able to intercept memory accesses within an application as it executes.

The ability to intercept memory reads and writes during runtime can support research in additional areas relating to dynamic data flow analysis. For example, the ability to track what areas of code are reading from and writing to memory could make it possible to build a model for the data propagation behaviors of an application. Furthermore, it might be possible to show with what degree of code-level isolation different areas of memory are accessed. Indeed, it may also be possible to validate the data consistency model of a threaded application by investigating the access behaviors of various regions of memory which are referenced by multiple threads. These are but a few of the many potential candidates for dynamic data flow analysis.

This paper is organized into three sections. Section 2 gives an introduction to three different strategies for facilitating dynamic data flow analysis. Section 3 enumerates some of the potential scenarios in which these strategies could be applied in order to render some useful information about the data flow behavior of an application. Finally, section 4 describes some of the previous work whose concepts have been used as the basis for the research described herein.

2) Strategies

This section describes three strategies that can be used to intercept runtime memory accesses. The strategies described herein do not rely on any static binary analysis. Techniques that do make use of static binary analysis are outside of the scope of this paper.

2.1) Dynamic Binary Instrumentation

Dynamic Binary Instrumentation (DBI) is a method of analyzing the behavior of a binary application at runtime through the injection of instrumentation code. This instrumentation code executes as part of the normal instruction stream after being injected. In most cases, the instrumentation code will be entirely transparent to the application that it's been injected into. Analyzing an application at runtime makes it possible to gain insight into the behavior and state of an application at various points in execution. This highlights one of the key differences between static binary analysis and dynamic binary analysis. Rather than considering what may occur, dynamic binary analysis has the benefit of operating on what actually does occur. This is by no means exhaustive in terms of exercising all code paths in the application, but it makes up for this by providing detailed insight into an application's concrete execution state.

The benefits of DBI have made it possible to develop some incredibly advanced tools. Examples where DBI might be used include runtime profiling, visualization, and optimization tools. DBI implementations generally fall into two categories: light-weight or heavy-weight. A light-weight DBI operates on the architecture-specific instruction stream and state when performing analysis. A heavy-weight DBI operates on an abstract form of the instruction stream and state. An example of a heavy-weight DBI is Valgrind, which performs analysis on an intermediate representation of the machine state[11, 7]. An example of a light-weight DBI is DynamoRIO, which performs analysis using the architecture-specific state[3]. The benefit of a heavy-weight DBI over a light-weight DBI is that analysis code written against the intermediate representation is immediately portable to other architectures, whereas light-weight DBI analysis implementations must be fine-tuned to work with individual architectures. While Valgrind is a novel and interesting implementation, it is currently not supported on Windows. For this reason, attention will be given to DynamoRIO for the remainder of this paper. There are many additional DBI frameworks and details, but for the sake of limiting scope these will not be discussed. The reader should consult reference material to learn more about this subject[11].

DynamoRIO is an example of a DBI framework that allows custom instrumentation code to be integrated in the form of dynamic libraries. The tool itself is a combination of Dynamo, a dynamic optimization engine developed by researchers at HP, and RIO, a runtime introspection and optimization engine developed by MIT. The fine-grained details of the implementation of DynamoRIO are outside of the scope of this paper, but it's important to understand the basic concepts[2].

At a high-level, figure 1 from Transparent Binary Optimization provides a great visualization of the process employed by Dynamo[2]. In concrete terms, Dynamo works by processing an instruction stream as it executes. To accomplish this, Dynamo assumes responsibility for the execution of the instruction stream. It uses a disassembler to identify the point of the next branch instruction in the code that is about to be executed. The set of instructions disassembled is referred to as a fragment (although, it's more commonly known as a basic block). If the target of the branch instruction is in Dynamo's fragment cache, it executes the (potentially optimized) code in the fragment cache. When this code completes, it returns control to Dynamo to disassemble the next fragment. If at some point Dynamo encounters a branch target that is not in its fragment cache, it will add it to the fragment cache and potentially optimize it. This is the perfect opportunity for instrumentation code to be injected into the optimized fragment that is generated for a branch target. Injecting instrumentation code at this level is entirely transparent to the application. While this is an oversimplification of the process used by DynamoRIO, it should at least give some insight into how it functions.

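The fragment cache loop described above can be sketched as a toy model in C. This is not DynamoRIO's actual code; the names and the cache layout are hypothetical, and a real engine stores translated (and possibly instrumented) code rather than bare addresses:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 64

/* Toy fragment cache: maps a branch-target "address" to a cached
 * fragment slot. A real engine would store translated code here. */
typedef struct {
    unsigned long target[CACHE_SIZE];
    int valid[CACHE_SIZE];
    int inserts;   /* fragments built on a cache miss */
    int hits;      /* times cached code was reused */
} fragment_cache;

static size_t slot_for(unsigned long target) {
    return (size_t)(target % CACHE_SIZE);
}

/* Dispatch one branch target: on a miss, "build" the fragment (the
 * point where instrumentation would be injected); on a hit, reuse
 * the cached copy without re-entering the engine's builder. */
void dispatch(fragment_cache *fc, unsigned long target) {
    size_t s = slot_for(target);
    if (fc->valid[s] && fc->target[s] == target) {
        fc->hits++;
    } else {
        fc->target[s] = target;
        fc->valid[s] = 1;
        fc->inserts++;   /* instrumentation insertion happens here */
    }
}
```

The essential property the sketch captures is that the cost of building (and instrumenting) a fragment is paid once per branch target; subsequent executions run straight out of the cache.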
One of the best features of DynamoRIO from an analysis standpoint is that it provides a framework for inserting instrumentation code during the time that a fragment is being inserted into the fragment cache. This is especially useful for the purposes of intercepting memory accesses within an application. When a fragment is being created, DynamoRIO provides analysis libraries with the instructions that are to be included in the fragment that is generated. To optimize for performance, DynamoRIO provides multiple levels of disassembly information. At the most optimized level, only very basic information about the instructions is provided. At the least optimized level, very detailed information about the instructions and their operands can be obtained. Analysis libraries are free to control the level of information that they retrieve. Using this knowledge of DynamoRIO, it is now possible to consider how one might design an analysis library that is able to intercept memory reads and writes while an application is executing.

2.1.1) Design

DBI, and DynamoRIO in particular, make designing a solution that can intercept memory reads and writes fairly trivial. The basic design involves having an analysis library that scans the instructions within a fragment that is being created. When an instruction that accesses memory is encountered, instrumentation code can be inserted prior to the instruction. The instrumentation code can be composed of instructions that notify an instrumentation function of the memory operand that is about to be read from or written to. This has the effect of causing the instrumentation function to be called when the fragment is executed. These few steps are really all that it takes to instrument the memory access behavior of an application as it executes using DynamoRIO.

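The scan-and-insert step can be modeled in C with a toy instruction representation. Everything here (toy_instr, instrument_fragment, the "callout" marker) is hypothetical; a real DynamoRIO client would use the framework's own instruction list and operand queries instead:

```c
#include <assert.h>
#include <string.h>

/* Toy instruction: a mnemonic plus a flag indicating whether any of
 * its operands reference memory. A real client would query the DBI
 * framework's decoder for this information. */
typedef struct {
    const char *mnemonic;
    int accesses_memory;
} toy_instr;

/* Walk a fragment's instruction list and emit a "callout" marker
 * before every instruction that reads or writes memory, mirroring
 * the pre-insertion step described above. Returns the new length,
 * or -1 if the output buffer is too small. */
int instrument_fragment(const toy_instr *in, int n,
                        toy_instr *out, int max_out) {
    static const toy_instr callout = { "callout", 0 };
    int m = 0;
    for (int i = 0; i < n; i++) {
        int needed = in[i].accesses_memory ? 2 : 1;
        if (m + needed > max_out)
            return -1;
        if (in[i].accesses_memory)
            out[m++] = callout;   /* notify the analysis code first */
        out[m++] = in[i];         /* then run the original instruction */
    }
    return m;
}
```

When the instrumented fragment executes, each callout fires before the memory operand is actually dereferenced, which is exactly the interception point the design calls for.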
2.1.2) Implementation

The implementation of the DBI approach is really just as easy as the design description makes it sound. To cooperate with DynamoRIO, an analysis library must implement a well-defined routine named dynamorio_basic_block which is called by DynamoRIO when a fragment is being created. This routine is passed an instruction list which contains the set of instructions taken from the native binary. Using this instruction list, the analysis library can make a determination as to whether or not any of the operands of an instruction either explicitly or implicitly reference memory. If an instruction does access memory, then instrumentation code must be inserted.

Inserting instrumentation code with DynamoRIO is a pretty painless process. DynamoRIO provides a number of macros that encapsulate the process of creating and inserting instructions into the instruction list. For example, INSTR_CREATE_add will create an add instruction with a specific set of arguments, and instrlist_meta_preinsert will insert an instruction prior to another instruction within the instruction list.

A proof of concept implementation is included with the source code provided along with this paper.

2.1.3) Considerations

This approach is particularly elegant thanks to the concepts of dynamic binary instrumentation and to DynamoRIO itself, which provides a framework that supports inserting instrumentation code into the fragment cache. Since DynamoRIO is explicitly designed to be a runtime optimization engine, the fact that the instrumentation code is cached within the fragment cache means that it gains the benefits of DynamoRIO's fragment optimization algorithms. When compared to alternative approaches, this approach also has significantly less overhead once the fragment cache begins to become populated. This is because all of the instrumentation code is placed entirely inline with the application code that is executing rather than having to rely on alternative means of interrupting the normal course of program execution. Still, this approach is not without its set of considerations. Some of these considerations are described below:

1. Requires the use of a disassembler

   DynamoRIO depends on its own internal disassembler. This can be a source of problems and limitations.

2. Self-modifying and dynamic code

   Self-modifying and dynamically generated code can potentially cause problems with DynamoRIO.

3. DynamoRIO is closed source

   While this has nothing to do with the actual concept, the fact that DynamoRIO is closed source can be limiting in the event that there are issues with DynamoRIO itself.

2.2) Page Access Interception

The hardware paging features of the x86 and x64 architectures represent a potentially useful means of obtaining information about the memory access behavior of an application. This is especially true due to the well-defined actions that the processor takes when a reference is made to a linear address whose physical page is either not present or has had its access restricted. In these cases, the processor will assert the page fault interrupt (0x0E) and thereby force the operating system to attempt to gracefully handle the virtual memory reference. In Windows, the page fault interrupt is handled by nt!KiTrap0E. In most cases, nt!KiTrap0E will issue a call into nt!MmAccessFault which is responsible for making a determination about the nature of the memory reference that occurred. If the memory reference fault was a result of an access restriction, nt!MmAccessFault will return an access violation error code (0xC0000005). When an access violation occurs, an exception record is generated by the kernel and is then passed to either the user-mode exception dispatcher or the kernel-mode exception dispatcher depending on which mode the memory access occurred in. The job of the exception dispatcher is to give a thread an opportunity to gracefully recover from the exception. This is accomplished by providing each of the registered or vectored exception handlers with the exception information that was collected when the page fault occurred. If an exception handler is able to recover, execution of the thread can simply restart where it left off. Using the principles outlined above, it is possible to design a system that is capable of both trapping and handling memory references to specific pages in memory during the course of normal process execution.

2.2.1) Design

The first step that must be taken to implement this system involves identifying a method that can be used to trap references to arbitrary pages in memory. Fortunately, previous work has done much to identify some of the different approaches that can be taken to accomplish this[8, 4]. For the purposes of this paper, one of the most useful approaches centers around the ability to define whether or not a page is restricted from user-mode access. This is controlled by the Owner bit in a linear address' page table entry (PTE)[5]. When the Owner bit is set to 0, the page can only be accessed at privilege level 0. This effectively restricts access to kernel-mode in all modern operating systems. Likewise, when the Owner bit is set to 1, the page can be accessed from all privilege levels. By toggling the Owner bit to 0 in the PTEs associated with a given set of linear addresses, it is possible to trap all user-mode references to those addresses at runtime. This effectively solves the first hurdle in implementing a solution to intercept memory access behavior.

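In architectural terms, the Owner bit is the x86 User/Supervisor bit, which is bit 2 of a non-PAE page table entry. The toggle the design relies on is simple bit manipulation over the PTE, sketched here (the function names are illustrative, not Windows kernel APIs):

```c
#include <assert.h>
#include <stdint.h>

/* x86 (non-PAE) page table entry flag bits. The "Owner" bit in
 * Windows terminology is the architecture's User/Supervisor bit. */
#define PTE_PRESENT 0x001u
#define PTE_WRITE   0x002u
#define PTE_OWNER   0x004u  /* 1 = user-mode may access, 0 = ring 0 only */

/* Restrict a page to kernel-mode by clearing the Owner bit. */
uint32_t pte_restrict_to_kernel(uint32_t pte) {
    return pte & ~PTE_OWNER;
}

/* Re-allow user-mode access by setting the Owner bit. */
uint32_t pte_allow_user(uint32_t pte) {
    return pte | PTE_OWNER;
}

/* Would a user-mode reference through this PTE fault? It does if the
 * page is not present or is marked supervisor-only. */
int pte_user_access_faults(uint32_t pte) {
    return !(pte & PTE_PRESENT) || !(pte & PTE_OWNER);
}
```

In the real system this bit flip must of course be performed on the live PTE from kernel-mode, which is where the locking caveat discussed in section 2.2.3 comes in.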
Using the approach outlined above, any reference that is made from user-mode to a linear address whose PTE has had the Owner bit set to 0 will result in an access violation exception being passed to the user-mode exception dispatcher. This exception must be handled by a custom exception handler that is able to distinguish transient access violations from ones that occurred as a result of the Owner bit having been modified. This custom exception handler must also be able to recover from the exception in a manner that allows execution to resume seamlessly. Distinguishing exceptions is easy if one assumes that the custom exception handler has knowledge in advance of the address regions that have had their Owner bit modified. Given this assumption, the act of distinguishing exceptions is as simple as seeing if the fault address is within an address region that is currently being monitored. While distinguishing exceptions may be easy, being able to recover gracefully is an entirely different matter.

To recover and resume execution with no noticeable impact to an application means that the exception handler must have a mechanism that allows the application to access the data stored in the pages whose virtual mappings have had their access restricted to kernel-mode. This, of course, would imply that the application must have some way, either direct or indirect, to access the contents of the physical pages associated with the virtual mappings that have had their PTEs modified. The most obvious approach would be to simply toggle the Owner bit to permit user-mode access. This has many different problems, not the least of which being that doing so would be expensive and would not behave properly in multi-threaded environments (memory accesses could be missed or worse). An alternative to updating the Owner bit would be to have a device driver designed to provide support to processes that would allow them to read the contents of a virtual address at privilege level 0. However, having the ability to read and write memory through a driver means nothing if the results of the operation cannot be factored back into the instruction that triggered the exception.

Rather than attempting to emulate the read and write access, a better approach can be used. This approach involves creating a second virtual mapping to the same set of physical pages described by the linear addresses whose PTEs were modified. This second virtual mapping would behave like a typical user-mode memory mapping. In this way, the process' virtual address space would contain two virtual mappings to the same set of physical pages. One mapping, which will be referred to as the original mapping, would represent the user-mode inaccessible set of virtual addresses. The second mapping, which will be referred to as the mirrored mapping, would be the user-mode accessible set of virtual addresses. By mapping the same set of physical pages at two locations, it is possible to transparently redirect address references at the time that exceptions occur. An important thing to note is that in order to provide support for mirroring, a disassembler must be used to figure out which registers need to be modified.

To better understand how this could work, consider a scenario where an application contains a mov [eax], 0x1 instruction. For the purposes of this example, assume that the eax register contains an address that is within the original mapping as described above. When this instruction executes, it will lead to an access violation exception being generated as a result of the PTE modifications that were made to the original mapping. When the exception handler inspects this exception, it can determine that the fault address was one that is contained within the original mapping. To allow execution to resume, the exception handler must update the eax register to point to the equivalent address within the mirrored region. Once it has altered the value of eax, the exception handler can tell the exception dispatcher to continue execution with the now-modified register information. From the perspective of an executing application, this entire operation will occur transparently. Unfortunately, there's still more work that needs to be done in order to ensure that the application continues to execute properly after the exception dispatcher continues execution.

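The register fix-up at the heart of this example is plain address arithmetic: the faulting address's offset within the original mapping is rebased onto the mirrored mapping. A minimal sketch, assuming hypothetical base addresses and a region descriptor of the author's own design rather than anything from the paper's actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A tracked region: an original (kernel-only) mapping and a mirrored
 * (user-accessible) mapping of the same physical pages. */
typedef struct {
    uintptr_t orig_base;
    uintptr_t mirror_base;
    size_t    size;
} mirror_region;

/* If the fault address falls inside the original mapping, return the
 * equivalent address inside the mirror; otherwise return it unchanged.
 * This is the adjustment the handler applies to the faulting register
 * (eax in the mov [eax], 0x1 example) before resuming execution. */
uintptr_t redirect_to_mirror(const mirror_region *r, uintptr_t fault) {
    if (fault >= r->orig_base && fault < r->orig_base + r->size)
        return r->mirror_base + (fault - r->orig_base);
    return fault;
}
```

Because both mappings describe the same physical pages, the redirected access observes and modifies exactly the data the original instruction intended to touch.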
The biggest problem with modifying the value of a register to point to the mirrored address is that it can unintentionally alter the behavior of subsequent instructions. For example, the application may not function properly if it assumes that it can access other non-mirrored memory addresses relative to the address stored within eax. Not only that, but allowing eax to continue to be accessed through the mirrored address will mean that subsequent reads and writes to memory made using the eax register will be missed for the time that eax contains the mirrored address.

In order to solve this problem, it is necessary to come up with a method of restoring registers to their original value after the instruction executes. Fortunately, the underlying architecture has built-in support that allows a program to be notified after it has executed an instruction. This support is known as single-stepping. To make use of single-stepping, the exception handler can set the trap flag (0x100) in the saved value of the eflags register. When execution resumes, the processor will generate a single step exception after the original instruction executes. This will result in the custom exception handler being called. When this occurs, the custom exception handler can determine if the single step exception occurred as a result of a previous mirroring operation. If it was the result of a mirroring operation, the exception handler can take steps to restore the appropriate register to its original value.

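Arming and disarming single-stepping amounts to toggling one bit in the saved flags of the thread context, as this small sketch shows (the helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_TF 0x100u  /* trap flag: single-step after one instruction */

/* Arm single-stepping in a saved thread context's EFLAGS value so the
 * handler regains control right after the faulting instruction runs. */
uint32_t eflags_set_trap(uint32_t eflags)   { return eflags | EFLAGS_TF; }

/* Disarm single-stepping once the register has been restored. */
uint32_t eflags_clear_trap(uint32_t eflags) { return eflags & ~EFLAGS_TF; }
```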
Using these four primary steps, a complete solution to the problem of intercepting memory accesses can be formed. First, the Owner bit of the PTEs associated with a region of virtual memory can be set to 0. This will cause user-mode references to this region to generate an access violation exception. Second, an additional mapping to the set of physical pages described by the original mapping can be created which is accessible from user-mode. Third, any access violation exceptions that reach the custom exception handler can be inspected. If they are the result of a reference to a region that is being tracked, the register contents of the thread context can be adjusted to reference the user-accessible mirrored mapping. The thread can then be single-stepped so that the fourth and final step can be taken. When a single-step exception is generated, the custom exception handler can restore the original value of the register that was modified. When this is complete, the thread can be allowed to continue as if nothing had happened.

2.2.2) Implementation

An implementation of this approach is included with the source code released along with this paper. This implementation has two main components: a kernel-mode driver and a user-mode DLL. The kernel-mode driver provides a device object interface that allows a user-mode process to create a mirrored mapping of a set of physical pages and to toggle the Owner bit of PTEs associated with address regions. The user-mode DLL is responsible for implementing a vectored exception handler that takes care of processing access violation exceptions by mirroring the address references to the appropriate mirrored region. The user-mode DLL also exposes an API that allows applications to create a memory mirror. This abstracts the entire process and makes it simple to begin tracking a specific memory region. The API also allows applications to register callbacks that are notified when an address reference occurs. This allows further analysis of the memory access behavior of the application.

2.2.3) Considerations

While this approach is most definitely functional, it comes with a number of caveats that make it sub-optimal for any sort of large-scale deployment. The following considerations are by no means all-encompassing, but some of the more important ones have been enumerated below:

1. Unsafe modification of PTEs

   It is not safe to modify PTEs without acquiring certain locks. Unfortunately, these locks are not exported and are therefore inaccessible to third party drivers.

2. Large amount of overhead

   The overhead associated with having to take a page fault and pass the exception on to be handled by user-mode is substantial. Memory access time with respect to the application could jump from nanoseconds to microseconds or even milliseconds.

3. Requires the use of a disassembler

   Since this approach relies on mirroring memory references from one virtual address to another, a disassembler has to be used to figure out which registers need to be modified with the mirrored address. Any time a disassembler is needed is an indication that things are getting fairly complicated.

4. Cannot track memory references to all addresses

   The fact that this approach relies on locking physical pages prevents it from feasibly tracking all memory references. In addition, because the thread stack is required to be valid in order to dispatch exceptions, it's not possible to track reads and writes to thread stacks using this approach.

2.3) Null Segment Interception

Segmentation is an extremely old feature of the x86 architecture. Its purpose has been to provide software with the ability to partition the address space into distinct segments that can be referenced through a 16-bit segment selector. Segment selectors are used to index either the Global Descriptor Table (GDT) or the Local Descriptor Table (LDT). Segment descriptors convey information about all or a portion of the address space. On modern 32-bit operating systems, segmentation is used to set up a flat memory model, primarily because there is no way to disable segmentation entirely. This is further illustrated by the fact that the x64 architecture has effectively done away with the ES, DS, and SS segment registers in 64-bit mode. While segment selectors are primarily intended to make it possible to access memory, they can also be used to prevent access to it.

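For reference, a segment selector packs three fields: a descriptor table index, a table indicator (0 = GDT, 1 = LDT), and a requested privilege level (RPL). The decoding is simple bit extraction, sketched below; as a concrete case, the 0x23 selector used later in this section decodes to GDT index 4 with RPL 3:

```c
#include <assert.h>
#include <stdint.h>

/* An x86 segment selector packs a descriptor table index (bits 15:3),
 * a table indicator (bit 2: 0 = GDT, 1 = LDT), and a requested
 * privilege level (bits 1:0). */
unsigned sel_index(uint16_t sel) { return sel >> 3; }
unsigned sel_ti(uint16_t sel)    { return (sel >> 2) & 1; }
unsigned sel_rpl(uint16_t sel)   { return sel & 3; }
```

The null selector is simply the value 0 (GDT index 0, RPL 0); the architecture reserves that descriptor slot so that loading it is legal but dereferencing through it always faults.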
2.3.1) Design

Segmentation is one of the easiest ways to trap memory accesses. The majority of instructions which reference memory implicitly use either the DS or ES segment registers to do so. The one exception to this rule is the set of instructions that deal with the stack. These instructions implicitly use the SS segment register. There are a few different ways one can go about causing a general protection fault when accessing an address relative to a segment selector, but one of the easiest is to take advantage of the null selector. The null selector, 0x0, is a special segment selector that will always cause a general protection fault when used to reference memory. By loading the null selector into DS, for example, the mov [eax], 0x1 instruction would cause a general protection fault when executed. Using the null selector solves the problem of being able to intercept memory accesses, but there still needs to be some mechanism to allow the application to execute normally after intercepting the memory access.

When a general protection fault occurs in user-mode, the kernel generates an access violation exception and passes it off to the user-mode exception dispatcher in much the same way as was described in 2.2. Registering a custom exception handler makes it possible to catch this exception and handle it gracefully. To handle this exception, the custom exception handler must restore the DS and ES segment registers to valid segment selectors by updating the thread context record associated with the exception. On 32-bit versions of Windows, the segment registers should be restored to 0x23. Once the segment registers have been updated, the exception dispatcher can be told to continue execution. However, before this happens, there is an additional step that must be taken.

It is not enough to simply restore the segment registers and then continue execution. This would lead to subsequent reads and writes being missed as a result of the DS and ES segment registers no longer pointing to the null selector. To address this, the custom exception handler should toggle the trap flag in the context record prior to continuing execution. Setting the trap flag will cause the processor to generate a single step exception after the instruction that generated the general protection fault executes. This single step exception can then be processed by the custom exception handler to reset the DS and ES segment registers to the null selector. After the segment registers have been updated, the trap flag can be disabled and execution can be allowed to continue. By following these steps, the application is able to make forward progress while also making it possible to trap all memory reads and writes that use the DS and ES segment registers.

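The two-phase handler described above can be sketched as a small state machine over a toy thread context. This is a conceptual model with invented names (toy_context, on_access_violation, on_single_step), not the actual vectored exception handler from the paper's implementation:

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_TF     0x100u
#define USER_DATA_SEL 0x23u  /* flat user-mode data selector, 32-bit Windows */

/* Minimal model of the saved thread context the handler edits. */
typedef struct {
    uint32_t seg_ds, seg_es, eflags;
} toy_context;

/* Access violation caused by the null selector: restore valid
 * selectors so the faulting instruction can re-execute, and arm the
 * trap flag so the handler regains control immediately afterwards. */
void on_access_violation(toy_context *c) {
    c->seg_ds = USER_DATA_SEL;
    c->seg_es = USER_DATA_SEL;
    c->eflags |= EFLAGS_TF;
}

/* Single step after the re-executed instruction: reload the null
 * selector to keep trapping future accesses and disarm the trap flag. */
void on_single_step(toy_context *c) {
    c->seg_ds = 0;
    c->seg_es = 0;
    c->eflags &= ~EFLAGS_TF;
}
```

Each trapped memory access therefore costs two exceptions: one general protection fault to observe the access and one single step to re-arm the trap.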
2.3.2) Implementation

The implementation for this approach involves registering a vectored exception handler that is able to handle the access violation and single step exceptions that are generated. Since this approach relies on setting the segment registers DS and ES to the null selector, an implementation must take steps to update the segment register state for each running thread in a process and for all new threads as they are created. Updating the segment register state for running threads involves enumerating running threads in the calling process using the toolhelp library. For each thread that is not the calling thread, the SetThreadContext routine can be used to update segment registers. The calling thread can update the segment registers using native instructions. To alter the segment registers for new threads, the DLL_THREAD_ATTACH notification can be used. Once all threads have had their DS and ES segment registers updated, memory references will immediately begin causing access violation exceptions.

When these access violation exceptions are passed to the vectored exception
handler, appropriate steps must be taken to restore the DS and ES segment
registers to a valid segment selector, such as 0x23. This is accomplished by
updating the SegDs and SegEs segment registers in the CONTEXT structure that
is passed in association with an exception. In addition to updating these
segment registers, the trap flag (0x100) must also be set in the EFlags
register so that the DS and ES segment registers can be restored to the null
selector in order to trap subsequent memory accesses. Setting the trap flag
will lead to a single step exception after the instruction that generated the
access violation executes. When the single step exception is received, the
SegDs and SegEs segment registers can be restored to the null selector.

These few steps capture the majority of the implementation, but there is a
specific Windows nuance that must be handled in order for this to work right.
When the Windows kernel returns to a user-mode process after a system call has
completed, it restores the DS and ES segment selectors to their normal value
of 0x23. The problem with this is that without some way to reset the segment
registers to the null selector after a system call returns, there is no way to
continue to track memory accesses after a system call. Fortunately, there is
a relatively painless way to reset the segment registers after a system call
returns. On Windows XP SP2 and more recent versions of Windows, the kernel
determines where to transfer control to after a system call returns by looking
at the function pointer stored in the shared user data memory mapping.
Specifically, the SystemCallReturn attribute at 0x7ffe0304 holds a pointer to
a location in ntdll that typically contains just a ret instruction as shown
below:

0:001> u poi(0x7ffe0304)
ntdll!KiFastSystemCallRet:
7c90eb94 c3              ret
7c90eb95 8da42400000000  lea esp,[esp]
7c90eb9c 8d642400        lea esp,[esp]

Replacing this single ret instruction with code that resets the DS and ES
registers to the null selector followed by a ret instruction is enough to make
it possible to continue to trap memory accesses after a system call returns.
However, this replacement code should not take these steps if a system call
occurs in the context of the exception dispatcher, as this could lead to a
nesting issue if anything in the exception dispatcher references memory, which
is very likely.

An implementation of this approach is included with the source code provided
along with this paper.

2.3.3) Considerations

There are a few considerations that should be noted about this approach. On
the positive side, this approach is unique when compared to the others
described in this paper due to the fact that, in principle, it should be
possible to use it to trap memory accesses in kernel-mode, although it is
expected that the implementation may be much more complicated. This approach
is also much simpler than the other approaches in that it requires far less
code. While these are all good things, there are some negative considerations
that should also be pointed out. These are enumerated below:

1. Will not work on x64

   The segmentation approach described in this section will not work on x64
   due to the fact that the DS, ES, and even SS segment selectors are
   effectively ignored when the processor is in 64-bit mode.

2. Significant performance overhead

   Like many of the other approaches, this one also suffers from significant
   performance overhead involved in having to take a GP and DB fault for
   every address reference. This approach could be further optimized by
   creating a custom LDT entry (using NtSetLdtEntries) that describes a
   region whose base address is 0 and length is n, where n is just below the
   address of the region(s) that should be monitored. This would have the
   effect of allowing memory accesses to succeed within the lower portion of
   the address space and fail in the higher portion (which is being
   monitored). It's important to note that the base address of the LDT entry
   must be zero. This is problematic since most of the regions that one
   would like to monitor (heap) are allocated low in the address space. It
   would be possible to work around this issue by having
   NtAllocateVirtualMemory allocate using MEM_TOP_DOWN.

3. Requires a disassembler

   Unfortunately, this approach also requires the use of a disassembler in
   order to extract the effective address that caused the access violation
   exception to occur. This is necessary because general protection faults
   that occur due to a segment selector issue generate exception records that
   flag the fault address as being 0xffffffff. This makes sense in the
   context that without a valid segment selector, there is no way to
   accurately calculate the effective address. The use of a disassembler
   means that the code is inherently more complicated than it would otherwise
   need to be. There may be some way to craft a special LDT entry that would
   still make it possible to determine the address that caused the fault, but
   the author has not investigated this.

3) Potential Uses

The ability to intercept an application's memory accesses is an interesting
concept, but on its own it is of little use beyond simple statistical and
visual analysis. Even so, the data that can be collected by analyzing memory
access behavior can make it possible to perform much more extensive forms of
dynamic binary analysis. This section gives a brief introduction to some of
the hypothetical areas that might benefit from being able to understand the
memory access behavior of an application.

3.1) Data Propagation

Being able to gain knowledge about the way that data propagates throughout an
application can provide extremely useful insights. For example, understanding
data propagation can give security researchers an idea of the areas of code
that are affected, either directly or indirectly, by a buffer that is received
from a network socket. In this context, having knowledge about areas affected
by data would be much more valuable than simply understanding the code paths
that are taken as a result of the buffer being received. Though the two may
seem closely related, the areas of code affected by a buffer that is received
should actually be restricted to a subset of the overall code paths taken.

Even if understanding data propagation within an application is beneficial, it
may not be clear exactly how analyzing memory access behavior could make this
possible. To understand how this might work, it's best to think of memory
access in terms of its two basic operations: read and write. In the course of
normal execution, any instruction that reads from a location in memory can be
said to be dependent on the last instruction that wrote to that location.
When an instruction writes to a location in memory, it can be said that any
instructions that originally wrote to that location no longer have claim over
it. Using these simple concepts, it is possible to build a dependency graph
that shows how areas of code become dependent on one another in terms of a
reader/writer relationship. This dependency graph would be dynamic and would
change as a program executes just the same as the data propagation within an
application would change.

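As a rough sketch of how such a dependency graph might be built, the
following assumes a hypothetical trace of (pc, op, addr) tuples produced by
one of the interception strategies; the trace format and all addresses are
invented for illustration.

```python
from collections import defaultdict

def build_dependency_graph(trace):
    """Build a code-location dependency graph from a memory access trace.

    trace is a sequence of (pc, op, addr) tuples, where op is 'r' or 'w'.
    An edge reader_pc -> writer_pc means the instruction at reader_pc read
    a location that was last written by the instruction at writer_pc.
    """
    last_writer = {}           # addr -> pc of the most recent write
    deps = defaultdict(set)    # reader pc -> set of writer pcs
    for pc, op, addr in trace:
        if op == 'r':
            if addr in last_writer:
                deps[pc].add(last_writer[addr])
        else:
            # A write supersedes any previous writer's claim over addr.
            last_writer[addr] = pc
    return deps

# A tiny synthetic trace: 0x401000 writes a buffer, 0x401020 reads it,
# 0x401040 overwrites it, and 0x401060 reads the new contents.
trace = [
    (0x401000, 'w', 0x12f000),
    (0x401020, 'r', 0x12f000),
    (0x401040, 'w', 0x12f000),
    (0x401060, 'r', 0x12f000),
]
deps = build_dependency_graph(trace)
# Each reader depends only on the writer whose data it actually observed.
```

Note that the second read depends on 0x401040 alone: the overwrite removed
0x401000's claim over the location, which is exactly the reader/writer
relationship described above.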
At this point in time, the author has developed a very simple implementation
based on the DBI strategy outlined in this paper. The current implementation
is in need of further refinement, but it is capable of showing reader/writer
relationships as the program executes. This area is ripe for future research.

3.2) Memory Access Isolation

From a visualization standpoint, it might be interesting to be able to show
with what degrees of code-level isolation different regions of memory are
accessed. For example, being able to show what areas of code touch individual
heap allocations could provide interesting insight into the containment model
of an application that is being analyzed. This type of analysis might be able
to show how well designed the application is by inferring code quality based
on the average number of areas of code that make direct reference to unique
heap allocations. Since this concept is a bit abstract, it might make sense
to discuss a more concrete example.

One example might involve an object-oriented C++ application that contains
multiple classes such as Circle, Shape, Triangle, and so on. In the first
design, the application allows classes to directly access the attributes of
instances. In the second design, the application forces classes to reference
attributes through public getters and setters. Using memory access behavior
to identify code-level isolation, the first design might be seen as a poor
design due to the fact that there will be many code locations where unique
heap allocations (class instances) have the contents of their memory accessed
directly. The second design, on the other hand, might be seen as a more
robust design due to the fact that the unique heap allocations would be
accessed by fewer places (the getters and setters).

It may actually be the case that there's no way to draw a meaningful
conclusion by analyzing code-level isolation of memory accesses. One specific
case that was raised to the author involved how the use of inlining or
aggressive compiler optimizations might incorrectly indicate a poor design.
Even though this is likely true, there may be some knowledge that can be
obtained by researching this further. The author is not presently aware of an
implementation of this concept but would love to be made aware if one exists.

3.3) Thread Data Consistency

Programmers familiar with the pains of thread deadlocks and thread-related
memory corruption should be well aware of how tedious these problems can be to
debug. By analyzing memory access behavior in conjunction with some
additional variables, it may be possible to make determinations as to whether
or not a memory operation is being made in a thread safe manner. At this
point, the author has not defined a formal approach that could be taken to
achieve this, but a few rough ideas have been identified.

The basic idea behind this approach would be to combine memory access behavior
with information about the thread that the access occurred in and the set of
locks that were acquired when the memory access occurred. Determining which
locks are held can be as simple as inserting instrumentation code into the
routines that are used to acquire and release locks at runtime. When a lock
is acquired, it can be pushed onto a thread-specific stack. When the lock is
released, it can be removed. The nice thing about representing locks as a
stack is that in almost every situation, locks should be acquired and released
in symmetric order. Acquiring and releasing locks asymmetrically can quickly
lead to deadlocks and therefore can be flagged as problematic.

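The lock-stack bookkeeping described above might be sketched as follows,
assuming the acquire and release routines have been instrumented to call
into a per-thread tracker; the class and method names are hypothetical.

```python
class LockTracker:
    """Per-thread lock stack; flags asymmetric (non-LIFO) releases."""

    def __init__(self):
        self.stack = []       # locks currently held, in acquire order
        self.warnings = []    # locks that were released out of order

    def acquired(self, lock):
        # Called by the instrumented acquire routine.
        self.stack.append(lock)

    def released(self, lock):
        # Called by the instrumented release routine. A release that does
        # not match the top of the stack is asymmetric and gets flagged.
        if not self.stack or self.stack[-1] != lock:
            self.warnings.append(lock)
            if lock in self.stack:
                self.stack.remove(lock)
        else:
            self.stack.pop()

t = LockTracker()
t.acquired('A'); t.acquired('B')
t.released('B'); t.released('A')   # symmetric order: no warnings
t.acquired('A'); t.acquired('B')
t.released('A')                    # asymmetric: 'A' is flagged
```

In a real implementation one tracker instance would live in thread-local
storage so that each thread's lock stack is maintained independently.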
Determining data consistency is quite a bit trickier, however. An analysis
library would need some means of historically tracking read and write access
to different locations in memory. Still, determining what might be a data
consistency issue from this historical data is challenging. One example of a
potential data consistency issue might be if two writes occur to a location in
memory from separate threads without a common lock being acquired between the
two threads. This isn't guaranteed to be problematic, but it is at the very
least indicative of a potential problem. Indeed, it's likely that many
other types of data consistency examples exist that may be possible to capture
in relation to memory access, thread context, and lock ownership.

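The two-writes-without-a-common-lock heuristic could be expressed roughly as
shown below; the event format (thread id, address, set of held locks) is an
assumption about what the combined instrumentation would record.

```python
def find_unlocked_write_pairs(events):
    """Flag addresses written by two threads that held no common lock.

    events: sequence of (thread_id, addr, held_locks) tuples describing
    writes, where held_locks is the set of locks held at write time.
    """
    history = {}     # addr -> list of (thread_id, frozenset of locks)
    suspects = set()
    for tid, addr, locks in events:
        for prev_tid, prev_locks in history.get(addr, []):
            # Different threads and a disjoint lock set: potential race.
            if prev_tid != tid and not (prev_locks & locks):
                suspects.add(addr)
        history.setdefault(addr, []).append((tid, frozenset(locks)))
    return suspects

writes = [
    (1, 0x1000, {'L1'}),
    (2, 0x1000, {'L1'}),   # both writers held L1: consistent
    (1, 0x2000, {'L1'}),
    (2, 0x2000, {'L2'}),   # disjoint lock sets: flagged
]
suspects = find_unlocked_write_pairs(writes)
```

As the text notes, a flagged address is only indicative of a potential
problem, not proof of one; the heuristic would need refinement in practice.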
Even if this concept can be made to work, the very fact that it would be a
runtime solution isn't a great thing. It may be the case that code paths that
lead to thread deadlocks or thread-related corruption are only executed rarely
and are hard to coax out. Regardless, the author feels like this represents
an interesting area of future research.

4) Previous Work

The ideas described in this paper benefit greatly from the concepts
demonstrated in previous works. The memory mirroring concept described in 2.2
draws heavily from the PaX team's work relating to their VMA mirroring and
software-based non-executable page implementations[8]. Oded Horovitz provided
an implementation of the paging approach for Windows and applied it to
application security[4]. In addition, there have been other examples that use
concepts similar to those described by PaX to achieve additional results, such
as OllyBone, ShadowWalker, and others[10, 9]. The use of DBI in 2.1 for
memory analysis is facilitated by the excellent work that has gone into
DynamoRIO, Valgrind, and indeed all other DBI frameworks[3, 11].

It should be noted that if one is strictly interested in monitoring writes to
a memory region, Windows provides a built-in feature known as a write watch.
When allocating a region with VirtualAlloc, the MEM_WRITE_WATCH flag can be
set. This flag tells the kernel to track writes that occur to the region.
These writes can be queried at a later point in time using GetWriteWatch[6].

It is also possible to use guard pages and other forms of page protection,
such as PAGE_NOACCESS, to intercept memory access to a page in user-mode.
Pedram Amini's PyDbg supports the concept of memory breakpoints which are
implemented using guard pages[12]. This type of approach has two limitations
that are worth noting. The first limitation involves an inability to pass
addresses to kernel-mode that have had a memory breakpoint set on them (either
guard page or PAGE_NOACCESS). If this occurs, it can lead to unexpected
behavior, such as causing a system call to fail when referencing the
user-mode address. This would not trigger an exception in user-mode.
Instead, the system call would simply return STATUS_ACCESS_VIOLATION. As a
result, an application might crash or otherwise behave improperly. The second
limitation is that memory accesses may be missed in multi-threaded
environments.

5) Conclusion

The ability to analyze the memory access behavior of an application at runtime
can provide additional insight into how an application works. This insight
might include learning more about how data propagates, deducing the code-level
isolation of memory references, identifying potential thread safety issues,
and so on. This paper has described three strategies that can be used to
intercept memory accesses within an application at runtime.

The first approach relies on Dynamic Binary Instrumentation (DBI) to inject
instrumentation code before instructions that access memory locations. This
instrumentation code is then capable of obtaining information about the
address being referenced when instructions are executed.

The second approach relies on hardware paging features supported by the x86
and x64 architectures to intercept memory accesses. This works by restricting
access to a virtual address range to kernel-mode access. When an application
attempts to reference a virtual address that has been marked as such, an
exception is generated that is then passed to the user-mode exception
dispatcher. A custom exception handler can then inspect the exception and
take the steps necessary to allow execution to continue gracefully after
having tracked the memory access.

The third approach uses the segmentation feature of the x86 architecture to
intercept memory accesses. It does this by loading the DS and ES segment
registers with the null selector. This has the effect of causing instructions
which implicitly use these registers to generate a general protection fault
when referencing memory. This fault results in an access violation exception
being generated that can be handled in much the same way as the hardware
paging approach.

It is hoped that these strategies might be useful to future research which
could benefit from collecting memory access information.

References

[1] AMD. AMD64 Architecture Programmer's Manual: Volume 2 System Programming.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf; accessed 5/2/2007.

[2] Bala, Duesterwald, Banerjia. Transparent Dynamic Optimization.
http://www.hpl.hp.com/techreports/1999/HPL-1999-77.pdf; accessed 5/2/2007.

[3] Hewlett-Packard, MIT. DynamoRIO.
http://www.cag.lcs.mit.edu/dynamorio/; accessed 4/30/2007.

[4] Horovitz, Oded. Memory Access Detection.
http://cansecwest.com/core03/mad.zip; accessed 5/7/2007.

[5] Intel. Intel Architecture Software Developer's Manual Volume 3: System Programming.
http://download.intel.com/design/PentiumII/manuals/24319202.pdf; accessed 5/1/2007.

[6] Microsoft Corporation. GetWriteWatch.
http://msdn2.microsoft.com/en-us/library/aa366573.aspx; accessed 5/5/2007.

[7] Nethercote, Nicholas. Dynamic Binary Analysis and Instrumentation.
http://valgrind.org/docs/phd2004.pdf; accessed 5/2/2007.

[8] PaX Team. PAGEEXEC.
http://pax.grsecurity.net/docs/pageexec.txt; accessed 5/1/2007.

[9] Sparks, Butler. Shadow Walker: Raising the Bar for Rootkit Detection.
https://www.blackhat.com/presentations/bh-jp-05/bh-jp-05-sparks-butler.pdf; accessed 5/3/2007.

[10] Stewart, Joe. Ollybone.
http://www.joestewart.org/ollybone/; accessed 5/3/2007.

[11] Valgrind. Valgrind.
http://valgrind.org/; accessed 4/30/2007.

[12] Amini, Pedram. PaiMei.
http://pedram.redhive.com/PaiMei/docs/; accessed 5/10/2007.