Zines/phrack/65/4.txt

                             ==Phrack Inc.==

               Volume 0x0c, Issue 0x41, Phile #0x04 of 0x0f

|=-----------------------------------------------------------------------=|
|=---=[ Stealth hooking : Another way to subvert the Windows kernel ]=---=|
|=-----------------------------------------------------------------------=|
|=--------------------=[ by mxatone and ivanlef0u ]=---------------------=|
|=-----------------------------------------------------------------------=|

                 
1 - Introduction on anti-rookits technologies and bypass
    1.1 - Rookits and anti-rootkits techniques
    1.2 - About kernel level protections
    1.3 - Concept key: use kernel code against itself
    
2 - Introducing stealth hooking on IDT.
    2.1 - How Windows manage hardware interrupts 
        2.1.1 - Hardware interrupts dispatching on Windows     
        2.1.2 - Hooking hardware IT like a ninja
        2.1.3 - Application 1 : Kernel keylogger
        2.1.4 - Application 2 : NDIS incoming packets sniffer
    2.2 - Conclusion about stealth hooking on IDT
        
3 - Owning NonPaged pool using stealth hooking
    3.1 - Kernel allocation layout review
        3.1.1 - Difference between Paged and NonPaged pool
        3.1.2 - NonPaged pool tables
        3.1.3 - Allocation and free algorithms
    3.2 - Getting code execution abusing allocation code
        3.2.1 - Data corruption of MmNonPagedPoolFreeListHead
        3.2.2 - Expend it for every size
    3.3 - Exploit our position
        3.3.1 - Generic stack redirection
        3.3.2 - Userland process code injection

4 - Detection
5 - Conclusion
6 - References


---[ 1 - Introduction on anti-rookits technologies and bypass
    
Nowadays rootkits and anti-rootkits are becoming more and more important
into the IT security landscape. Loved by some, hated by others, rootkits 
can be considered as the holy grail of backdoors : stealthy, little, 
close to hardware, ingenious, vicious... Their control over a computer 
locally or remotely make them the best choice for an attacker. 
Anti-rootkits try to detect and eradicate those malicious programs. 
Rk techniques and complexity are evolving fast and today developing a rk or
anti-rk is a very hard mission. 

This paper deals about rootkits on Windows platform. More precisely about
new kind of hijacking techniques that can be applied to the Windows kernel.
Readers are assumed to be aware about rootkits techniques on Windows.

----[ 1.1 - Rootkits and anti-rootkits technics

A rootkit hijacks an operating system's behavior. In order to achieve this
task, it can simply modify the operating system's binaries but that's not 
very stealthy. Most rk's use hooks on important functions and change theirs
results. A basic hook redirects execution flow by changing function start 
or a function pointer but there is no single way to hook a routine. The 
most common example is the SSDT (System Service Descriptor Table), this 
table contains the syscall list which is a set of functions pointers. If 
you can modify a pointer in this table, you are able to control the 
behavior of one function. That's an example of how rootkits proceed, 
obviously there is a lot of critical areas that can be controlled by an 
attacker.

Anti-rootkits try to check those areas, but the task is very hard. Most of 
the time, anti-rk software makes a comparison between the memory image of 
the program and its binary on the disk or verify some function pointer 
tables to see if something has changed.

That's how the war between rk-makers and anti-rk-junkies began, trying 
to find the best way, the best area, for hooking critical operating 
system features. On Windows those following areas are often used by 
rootkits :

- SSDT (kernel syscalls table) and shadow SSDT (win32k syscall table) are 
the simplest solution.

- MSR (Model Specific Registers) can be modified by a rootkit. On Windows 
the MSR_SYSENTER_EIP is used by the assembly instructions 'sysenter' to 
enter into ring0 mode. Hijacking this MSR allow an attacker to control
the system.

- MajorFunctions are functions used by drivers for I/O processing with 
others devices, hooking those functions can be useful for a rootkit.

- IDT (Interrupt Descriptor Table) is table used by the system for 
handling exceptions and interruptions. 

Another kind of techniques has appeared. By accessing to the kernel 
objects a rootkit can easily change information about processes, threads, 
loaded modules and other stuff. Those techniques are called DKOM (Direct 
Kernel Object Manipulation). For example, the Windows kernel maintains a 
double linked list called PsActiveProcessList (EPROCESS structures) 
containing information about running processes. Unlink one of them and 
your process will disappear from process lists like task manager, whereas 
the process is still running.

To block those kernel objects modifications, anti-rk checks other 
sections. For processes, they used to read the PspCidTable which 
has a table of PID (Process IDentifier) and TID (Thread IDentifier).
A comparison between this table and PsActiveProcessList shows hidden
processes. Against those attacks anti-rk tools have to find others
sections and tricks to detect altered objects.

One of the first paper about Windows stealth was written by Holy Father, 
"Invisibility on NT boxes" [1]. With this paper came one of the first 
public implementations of a rootkit with a ring0 driver, Hacker 
Defender [2], coded by Holy Father and Ratter of the famous VXing mag 29A 
[3]. This driver was able to elevate process rights using token 
manipulation. The rest of the rootkit uses user-land hooks to perform files
and registry hiding, process infection with dll injection. A good example 
of a full ring0 rootkit is NT Rootkit of Greg Hoglund [4], this driver uses
SSDT hooks to perform stealth operation. It registers a Filter Device 
Object above the NTFS file system and above the keyboard device for 
filtering IRP (I/O Request Packets). It also provides a NDIS protocol 
driver to hide communication on the network. Even if this rk was written 
for NT 4.0 and Win2K it's a perfect example for beginners. 

After came more advanced ring0 rk like FU [5], written by Fuzen_op and its
improvement FUto published in the famous technical journal Uninformed [6].
Vista improvement on driver verification introduces new rootkits mostly
based on hardware features. Like BootRoot [7] and Pixie [8] by Eeye 
loaded before any protection. Finally Joanna Rutkowska with her Blue 
Pill [9] used virtualization technology to create layer between the 
operating system and the hardware.

In the wild the rk are used most of the time for lame mail spamming or 
botnets. They often use old techniques but some of them are interesting 
like Rustock [10] series or StormWorm [11] and the MBR rootkit [12]. They 
implement a lot of tricks as ADS (Alternate Data Stream), code obfuscation,
anti-debug, anti-VM or polymorphic code. The goal is not only subverting
the kernel but also slow down their analysis and make them harder to
defeat. 

Even if the technology used by rootkits are more and more sophisticated, 
the underground community is still developing POCs to improve current 
techniques. Unreal [13] and AK992 [14] are both great examples. The first 
uses an ADS and a NTFS MajorFunctions hooking to hide itself, the second 
checks IRP completion when sended to disk's devices. You can find plenty 
examples of rootkit techniques on rootkit.com.

Finally, this part would not be complete if we don't speak about anti-rk. 
The most famous is Rk Unhooker by MP_ART & EP_X0FF and their team UG North.
Others anti-rk are DarkSpy [15] by CardMagic, IceSword [16] by pjf and 
Gmer [17].

----[ 1.2 - About kernel level protections

When we talk about protection, we must notice where the protection takes
place into the system. A protection has an advantage on an attack only 
if it operates from a higher level. Protections like PaX or Exec Shield 
are efficient because they protecting userland from kernel.

Protections like PatchGuard and other HIPS also protect the system 
integrity but as far as an attacker can find a way to attack those 
protections at their own level they will be useless. A protection is 
reliable only if it can't be corrupted by an attacker. Assuming an 
attacker find a way to inject code into the protection and you can 
consider that your b0x is dead.

That's why PatchGuard isn't so efficient [18]. But we know that disabling
or destroying a protection is very noisy. No, the best way is to fly under 
the radar by working with special objects and events that cannot be 
checked because of their volatility.

In June 2006, Greg Hoglund presented the concept of KOH (Kernel Object 
Hooking) [19]. A new way of detouring code execution, you don't have to 
modify static code section but rather you work on dynamic allocated 
structures/codes like DPC (Deferred Procedure Calls). For protections,
it's hard to find and verify those areas due to their instabilities. 

Others cool objects are IRP. They are the object used by the Windows 
kernel I/O manager to communicate with devices. Each I/O operation on 
hardware generates an IRP, sycalls send IRP to a driver through his 
device. In general a driver owns several devices; one of them is used to 
communique with the userland by using IOCTL and others devices are 
managing IRP by filtering them or performing a requested task. 
IRP are sent to a driver using its MajorFunctions table. This table 
includes the different functionalities provided by the driver. You can 
check the result returned by a MajorFunction by installing a completion 
routine on an IRP. They are very volatile objects; controlling and 
checking them is very hard. 

In fact, if you want to check everything you would need to completely 
redesign operating system architecture. So keep in mind that protection 
cannot be everywhere at every time and we will demonstrate it in the 
following parts.

----[ 1.3 - Concept key: use kernel code against itself

The idea behind this paper is exploiting kernel code. Exploitation is 
possible because input defines code behavior. Submitting a crafted input 
to a vulnerable software can leads into code execution. Dangerous input is
of course defined by your target. Kernel space contains more exploitation 
scenarios because you can change its environment. A rootkit can not 
change basic inputs as arguments. But it can change the environment around 
a code. Heap exploitation techniques such as unlinking is a perfect 
example. By changing a memory block structure, you are able to overwrite 
4 bytes. Some techniques can even change next allocated block address [20].
It does work because a program trusts those information. In kernel, you 
have a total control on the environment. Also completely checking the 
kernel is bad for performance and totally impossible.

Changing code environment has been used successfully for the phide2 
rootkit [21] technique. This rootkit can hide threads without hooking 
Windows scheduler which is impressive. As it relies on code behavior, it 
needs strong reverse knowledge. It extends this concept into unknown 
operating system behaviors. Generic protections are based on generic 
assumptions. Such as checking only driver images for code hooks. These 
days operating systems design is against those protections and requires 
advanced software rootkit techniques.

---[ 2 - Introducing stealth hooking on IDT

Let's introduce our concept about stealth hooking with an example based on 
IDT. First we will see what is the IDT and its purpose. Then we will 
discuss about hardware interrupts and how Windows deals with them. 

IDT (Interrupt Descriptor Table) is a CPU specific linear table localized 
in kernel-land. IDT can be read with ring3 privilege level but you must
have ring0 privilege if you want to write into it. IDT is composed of 256 
entries of KIDTENTRY structures and you can use the Kernel Debugger (KD) 
included into the Debugging Tools for Windows [22] to see the definition
of an IDT entry.

kd> dt nt!_KIDTENTRY
   +0x000 Offset           : Uint2B
   +0x002 Selector         : Uint2B
   +0x004 Access           : Uint2B
   +0x006 ExtendedOffset   : Uint2B

Here we don't want to (re)explain the architecture of the IDT so we advise
you to read Kad's paper published in Phrack 59 about IDT and about how it 
works [23].

The first 32 entries of IDT are reserved by the CPU for exceptions. Others
are use to handle hardware interrupts and special system events.

Here is a dump of the first 64 entries of the Windows' IDT.
 
kd> !idt -a
Dumping IDT:

00: 804df350 nt!KiTrap00
01: 804df4cb nt!KiTrap01
02: Task Selector = 0x0058
03: 804df89d nt!KiTrap03
04: 804dfa20 nt!KiTrap04
05: 804dfb81 nt!KiTrap05
06: 804dfd02 nt!KiTrap06
07: 804e036a nt!KiTrap07
08: Task Selector = 0x0050
09: 804e078f nt!KiTrap09
0a: 804e08ac nt!KiTrap0A
0b: 804e09e9 nt!KiTrap0B
0c: 804e0c42 nt!KiTrap0C
0d: 804e0f38 nt!KiTrap0D
0e: 804e164f nt!KiTrap0E
0f: 804e197c nt!KiTrap0F
10: 804e1a99 nt!KiTrap10
11: 804e1bce nt!KiTrap11
12: 804e197c nt!KiTrap0F
13: 804e1d34 nt!KiTrap13
14: 804e197c nt!KiTrap0F
15: 804e197c nt!KiTrap0F
16: 804e197c nt!KiTrap0F
17: 804e197c nt!KiTrap0F
18: 804e197c nt!KiTrap0F
19: 804e197c nt!KiTrap0F
1a: 804e197c nt!KiTrap0F
1b: 804e197c nt!KiTrap0F
1c: 804e197c nt!KiTrap0F
1d: 804e197c nt!KiTrap0F
1e: 804e197c nt!KiTrap0F
1f: 804e197c nt!KiTrap0F

20: 00000000 
21: 00000000 
22: 00000000 
23: 00000000 
24: 00000000 
25: 00000000 
26: 00000000 
27: 00000000 
28: 00000000 
29: 00000000 
2a: 804deb92 nt!KiGetTickCount
2b: 804dec95 nt!KiCallbackReturn
2c: 804dee34 nt!KiSetLowWaitHighThread
2d: 804df77c nt!KiDebugService
2e: 804de631 nt!KiSystemService
2f: 804e197c nt!KiTrap0F
30: 806f3d48 hal!HalpClockInterrupt
31: 80dd816c i8042prt!I8042KeyboardInterruptService (KINTERRUPT 80dd8130)
32: 804ddd04 nt!KiUnexpectedInterrupt2
33: 80dd3224 serial!SerialCIsrSw (KINTERRUPT 80dd31e8)
34: 804ddd18 nt!KiUnexpectedInterrupt4
35: 804ddd22 nt!KiUnexpectedInterrupt5
36: 804ddd2c nt!KiUnexpectedInterrupt6
37: 804ddd36 nt!KiUnexpectedInterrupt7
38: 806edef0 hal!HalpProfileInterrupt
39: 80f0827c ACPI!ACPIInterruptServiceRoutine (KINTERRUPT 80f08240)
3a: 80dc67cc vmsrvc+0x1C16 (KINTERRUPT 80dc6790)
3b: 80df6414 NDIS!ndisMIsr (KINTERRUPT 80df63d8)
3c: 80de040c i8042prt!I8042MouseInterruptService (KINTERRUPT 80de03d0)
3d: 804ddd72 nt!KiUnexpectedInterrupt13
3e: 80ed78a4 atapi!IdePortInterrupt (KINTERRUPT 80ed7868)
3f: 80f01dd4 atapi!IdePortInterrupt (KINTERRUPT 80f01d98)
40: 804ddd90 nt!KiUnexpectedInterrupt16
[...]

This dump represents a typical Windows IDT, you can see the IDT entries 
index followed by the address of the handler and this name. The first 32 
entries are filled by KiTrap* functions that manage exceptions. The rest 
of the table is left to the system, you can see specials system interrupts 
like KiSystemService and KiCallbackReturn and handlers used by drivers 
like I8042KeyboardInterruptService or I8042MouseInterruptService.

----[ 2.1 - How Windows manage hardware interrupts 

When we talk about interrupts we must introduce the concept of IRQL 
(Interrupt ReQuest Level). The kernel represents IRQLs internally as a 
number from 0 through 31 on x86 with higher numbers representing higher 
priority interrupts. Although the kernel defines the standard set of IRQLs 
for software interrupts, the HAL (Hardware Abstraction Layer) maps 
hardware interrupt numbers to the IRQLs.

   +----------------+
31 |    Highests    | \
to |    IRQLs       | | Clock, system failure.
27 |                | /
   +----------------+
26 |                | \
to |  DEVICE_IRQL   | | Hardware interrupts.          
3  |                | /
   +----------------+
2  | DISPATCH_LEVEL | Scheduler, DPC.
   +----------------+ 
1  |    APC_LEVEL   | Used when dispatching APC.
   +----------------+ 
0  |  PASSIVE_LEVEL | Threads run at this IRQL. 
   +----------------+    
   
Each processor has its own IRQL. You can have a core running at an IRQL=
DISPATCH_LEVEL whereas another is running at PASSIVE_LEVEL. In fact IRQL
represents the "mask ability" of the current running code. Interrupts from 
a source with an IRQL above the current level interrupt the processor, 
whereas interrupts from sources with IRQLs equal to or below the current 
level are masked until an executing thread decrease the IRQL.

Some system components are not accessible when code is running at 
IRQL>=DISPATH_LEVEL. Accessing to paged memory (memory which can be 
swapped on disk) is impossible and lots of kernel functions cannot be used.

Hardware interrupts are asynchronous and reached by external peripherals. 
For example when you hit a key, your keyboard device sends an IRQ 
(Interrupt ReQuest) routed by the Southbridge [24] on your interrupt 
controller through the Northbridge [25]. The Southbridge is a chip that can
be described like a I/O controller hub. This chip receives all the I/O 
externals interrupt and send them to the Northbridge. The Northbridge is
directly connected to your memory and high speed graphic bus also to your 
CPU. This chip is also known as the memory controller hub.

On most x86 systems we find a chipset called i82489, Advanced Programmable 
Interrupt Controller (APIC). The APIC is composed by 2 main components, a 
I/O APIC, one per CPU, and a LAPIC (Local APIC) on each core. I/O APIC 
uses a routing algorithm to dispatch an interrupt on the best adapted core.
According to the principle of locality, I/O APIC will deliver the device 
interrupt on the core which handled it the previous time [26].

After this LAPIC translates the IRQ to an 8-bits value, an interrupt 
vector. This interrupt vector represents IDT's entry index associated with 
the handler. When the core is ready to handle the interrupt, its 
instruction flow is redirected on the IDT entry. 

   IDT            IDT            IDT           IDT 
    1              2              3             4 
  +---+          +---+          +---+         +---+         
  |   |          |   |          |   |         |   |
  |---|          |---|          |---|         |---|
  |   |          |   |          |   |         |   |
  |---|          |---|          |---|         |---|
  |   |          |   |          |   |         |   |
  +---+          +---+          +---+         +---+    
    |              |              |             |
+--------+     +--------+     +--------+    +--------+
|        |     |        |     |        |    |        | 
| core 1 |     | core 2 |     | core 3 |    | core 4 |
|        |     |        |     |        |    |        |
+--------+     +--------+     +--------+    +--------+ 
| LAPIC  |     | LAPIC  |     | LAPIC  |    | LAPIC  |
+---+----+     +---+----+     +---+----+    +---+----+
    |              |              |             |   
    |              |              |             |
<---+--------------+------+-------+-------------+-----> 
               Interrupt  |         Processor system bus
                Messages  |
                          |
                          |
    External       +------+------+
    Interrupts     |             |
    --------------->   I/O APIC  |  
                   |             |
                   +-------------+

-----[ 2.3.1 Hardware interrupts dispatching on Windows                 
                   
On Windows, the interrupt handler isn't executed immediately, there is a 
code template first. This template is implemented in the function 
KiInterruptTemplate and does two things. First, it saves the current 
core state in the stack and dispatches code flow to the right "interrupt 
dispatcher".

When a interrupt is raised, after the core status core is saved, code flow 
is transferred to the interrupt handler as defined in the IDT. In fact 
each interrupt handler in the IDT points to a KiInterruptTemplate 
routine [27]. KiInterruptTemplate will call KiInterruptDispatch which 
performs the following operations :

- Acquire the service routine spinlock.

- Raise IRQL to DEVICE_IRQL, the IRQL of a given interrupt vector is 
calculated by subtracting the interrupt vector from 27d.

- Call the interrupt handler, an ISR (Interrupt Service Routine).

- Lower IRQL.

- Release the service routine spinlock.

For example, the keyboard device ISR is I8042KeyboardInterruptService. 
ISR are routines for handling interrupts like top-halves in the linux 
kernel. According to the WDK (Windows Driver Kit), the ISR must do 
whatever is appropriate to the device to dismiss the interrupt. Then, it 
should do only what is necessary to save stage and queue a DPC. It means it
interruption management will take place on a lower IRQL than during ISR
execution. The I/O processing is done into the DPC.

DPC (Deferred Procedure Call) are equivalent of bottom-halves in linux. 
DPC works at IRQL DISPATCH_LEVEL, lower than the ISR's IRQL. In fact the 
ISR will queue a DPC to process the entire interrupt at a lower IRQL in 
order to avoid the core preemption taking too much time. For the keyboard 
the DPC is I8042KeyboardIsrDpc. Here a figure to sum up the interrupt 
processing :

                                              +-------------------------+
Hardware Interrupt                       /----> Here we are at          |  
       |                                 |    | IRQL=DEVICE_LEVEL       |
       |                                 |    | The KiInterruptDispatch |
       /---> IDT ---\                    |    | routine calls the ISR.  |
                    |                    |    |                         |
                    |                    |    | ISR handles interrupt   |
            +-----------------------+    |    | and queue a DPC for     |
            |  KiInterruptTemplate ------/    | later processing        |
            +-----------------------+         |                         |
                                              +-------------------------+
                                              
KiInterruptDispatch receives one main argument from KiInterruptTemplate, 
a pointer to an interrupt object stored in the EDI register. Interrupt
objects are defined by a KINTERRUPT structure :

kd> dt nt!_KINTERRUPT
   +0x000 Type             : Int2B
   +0x002 Size             : Int2B
   +0x004 InterruptListEntry : _LIST_ENTRY
   +0x00c ServiceRoutine   : Ptr32     unsigned char 
   +0x010 ServiceContext   : Ptr32 Void
   +0x014 SpinLock         : Uint4B
   +0x018 TickCount        : Uint4B
   +0x01c ActualLock       : Ptr32 Uint4B
   +0x020 DispatchAddress  : Ptr32     void 
   +0x024 Vector           : Uint4B
   +0x028 Irql             : UChar
   +0x029 SynchronizeIrql  : UChar
   +0x02a FloatingSave     : UChar
   +0x02b Connected        : UChar
   +0x02c Number           : Char
   +0x02d ShareVector      : UChar
   +0x030 Mode             : _KINTERRUPT_MODE
   +0x034 ServiceCount     : Uint4B
   +0x038 DispatchCount    : Uint4B
   +0x03c DispatchCode     : [106] Uint4B

We retrieve in this structure, the SpinLock and the ServiceRoutine. Notice 
that SynchronizeIrql contains the IRQL when the ISR will be executed.

For each entry in the IDT which handles a hardware interrupt, the 
KiInterruptTemplate is contained in the DispatchCode table of the 
KINTERRUPT structure. 
 
For the keyboard device we have this KINTERRUPT :

kd> dt nt!_KINTERRUPT 80dd8130
   +0x000 Type             : 22
   +0x002 Size             : 484
   +0x004 InterruptListEntry : _LIST_ENTRY [ 0x80dd8134 - 0x80dd8134 ]
   +0x00c ServiceRoutine   : 0xfa815495     unsigned char  
   ->i8042prt!I8042KeyboardInterruptService+0
   +0x010 ServiceContext   : 0x80e2ec88 
   +0x014 SpinLock         : 0
   +0x018 TickCount        : 0xffffffff
   +0x01c ActualLock       : 0x80e2ed48  -> 0
   +0x020 DispatchAddress  : 0x804da8d8     void  nt!KiInterruptDispatch+0
   +0x024 Vector           : 0x31
   +0x028 Irql             : 0x1a ''
   +0x029 SynchronizeIrql  : 0x1a ''
   +0x02a FloatingSave     : 0 ''
   +0x02b Connected        : 0x1 ''
   +0x02c Number           : 0 ''
   +0x02d ShareVector      : 0 ''
   +0x030 Mode             : 1 ( Latched )
   +0x034 ServiceCount     : 0
   +0x038 DispatchCount    : 0xffffffff
   +0x03c DispatchCode     : [106] 0x56535554

Let's have a look at the beginning of KiInterruptTemplate :

nt!KiInterruptTemplate:
804da972 54              push    esp
804da973 55              push    ebp
804da974 53              push    ebx
804da975 56              push    esi
804da976 57              push    edi
804da977 83ec54          sub     esp,54h
804da97a 8bec            mov     ebp,esp
804da97c 89442444        mov     dword ptr [esp+44h],eax
804da980 894c2440        mov     dword ptr [esp+40h],ecx
804da984 8954243c        mov     dword ptr [esp+3Ch],edx
804da988 f744247000000200 test    dword ptr [esp+70h],20000h
804da990 0f852a010000    jne     nt!V86_kit_a (804daac0)
804da996 66837c246c08    cmp     word ptr [esp+6Ch],8
804da99c 7423            je      nt!KiInterruptTemplate+0x4f (804da9c1)
804da99e 8c642450        mov     word ptr [esp+50h],fs
804da9a2 8c5c2438        mov     word ptr [esp+38h],ds
804da9a6 8c442434        mov     word ptr [esp+34h],es
804da9aa 8c6c2430        mov     word ptr [esp+30h],gs
804da9ae bb30000000      mov     ebx,30h
804da9b3 b823000000      mov     eax,23h
804da9b8 668ee3          mov     fs,bx
804da9bb 668ed8          mov     ds,ax
804da9be 668ec0          mov     es,ax
804da9c1 648b1d00000000  mov     ebx,dword ptr fs:[0]
804da9c8 64c70500000000ffffffff mov dword ptr fs:[0],0FFFFFFFFh
804da9d3 895c244c        mov     dword ptr [esp+4Ch],ebx
804da9d7 81fc00000100    cmp     esp,10000h
804da9dd 0f82b5000000    jb      nt!Abios_kit_a (804daa98)
804da9e3 c744246400000000 mov     dword ptr [esp+64h],0
804da9eb fc              cld
804da9ec 8b5d60          mov     ebx,dword ptr [ebp+60h]
804da9ef 8b7d68          mov     edi,dword ptr [ebp+68h]
804da9f2 89550c          mov     dword ptr [ebp+0Ch],edx
804da9f5 c74508000ddbba  mov     dword ptr [ebp+8],0BADB0D00h
804da9fc 895d00          mov     dword ptr [ebp],ebx
804da9ff 897d04          mov     dword ptr [ebp+4],edi
804daa02 f60550f0dfffff  test    byte ptr ds:[0FFDFF050h],0FFh
804daa09 750d            jne     nt!Dr_kit_a (804daa18)

nt!KiInterruptTemplate2ndDispatch:
804daa0b bf00000000      mov     edi,0
nt!KiInterruptTemplateObject:
804daa10 e9c3fcffff      jmp     nt!KeSynchronizeExecution+0x2 (804da6d8)
[...]

Remember, this code is unique for each KINTERRUPT. We said before that 
KiInterruptDispatch receives its arguments from the EDI register (a 
pointer to the KINTERRUPT of the interrupt). In the KiInterruptTemplate 
we can see this little code :

[...] 
nt!KiInterruptTemplate2ndDispatch:
804daa0b bf00000000      mov     edi,0
nt!KiInterruptTemplateObject:
804daa10 e9c3fcffff      jmp     nt!KeSynchronizeExecution+0x2 (804da6d8)
[...] 

Here we have a mov "edi, 0" and a jmp, but if we look at the 
KiInterruptTemplate code contained in the keyboard's KINTERRUPT we have :

ffb72525 bf5024b7ff      mov     edi,0FFB72450h ; Keyboard KINTERRUPT
ffb7252a e9a9839680      jmp     nt!KiInterruptDispatch (804da8d8)

Wow, instructions are modified! The kernel will dynamically changes those 
2 instructions in the KiInterruptTemplate code. In EDI we find the 
KINTERRUPT object and the jmp branch on KiInterruptDispatch.

Why this implementation ? Because we can easily change the dispatch 
handler. Even if we often have the KiInterruptDispatch we can find 
KiFloatingDispatch or KiChainDispatch. KiChainedDispatch is for vectors 
shared among multiple interrupt objects and KiFloatingDispatch is like 
KiInterruptDispatch, but it saves the floating core state too. 

Windows provides APIs for connecting interrupts on IDT. IoConnectInterrupt 
and IoConnectInterruptEx, according to the WDK :

NTSTATUS 
  IoConnectInterrupt(
    OUT PKINTERRUPT  *InterruptObject,
    IN PKSERVICE_ROUTINE  ServiceRoutine,
    IN PVOID  ServiceContext,
    IN PKSPIN_LOCK  SpinLock  OPTIONAL,
    IN ULONG  Vector,
    IN KIRQL  Irql,
    IN KIRQL  SynchronizeIrql,
    IN KINTERRUPT_MODE    InterruptMode,
    IN BOOLEAN  ShareVector,
    IN KAFFINITY  ProcessorEnableMask,
    IN BOOLEAN  FloatingSave
    );

As you can see IoConnectInterrupt returns in the InterruptObject parameter 
a KINTERRUPT structure, the same that we retrieve in the IDT. Previously
you have seen in the KiInterruptTemplate two labels, 
KiInterruptTemplateObject and KiInterruptTemplate2ndDispatch. Those two 
labels are used by kernel function to find the two instructions in the 
KiInterruptTemplateRoutine. KeInitializeInterrupt uses the
KiInterruptTemplateObject label to update the "jmp Ki*Dispatch" and the
KiConnectVectorAndInterruptObject function uses 
KiInterruptTemplate2ndDispatch to modify the "mov edi, <&Kinterrupt>".

-----[ 2.3.2 Hooking hardware IDT like a ninja 

Now, think about this. We want to hook the IDT in a stealth way, we know 
that replacing an entry directly is not the best solution. Anti-rooktits 
don't check the dynamically allocated KiInterruptTemplate routine. So we 
can modify this routine as we wish. There are three possible ways :

- Redirect the "jmp Ki*Dispatch" on our dispatch routine, we have to code 
our dispatch routine, not so hard. 

- Change the kinterrupt address passed in EDI by the instruction 
"mov edi, <&Kinterrupt>". The new KINTERRUPT will be the same than the 
previous one, only the ServiceRoutine will be modified by us.    

- Create our own KiInterruptTemplate, hard ...

In this paper, we choosed the simplest way. We change the 
"mov edi, <&kinterrupt>" by a "mov edi, <&OurKinterrupt>" and we implement 
our ServiceRoutine. We know that this instruction is followed by a jmp, so 
with a disassembly engine we can retrieve the instruction before the jmp
nt!KiInterruptDispatch and modify it. We must keep in mind, when the 
ServiceRoutine is running, the interrupt is not handled yet and we are 
running at DEVICE_IRQL IRQL. This is not a fair situation, because a 
lot of Windows kernel functions are not accessible. We know, that most 
ISR queued a DPC, so after the ISR has been executed, the last entry in 
the current core DPC queue should contain the DPC routine of our interrupt.

If we want to access data generated by the interrupt we must proceed like 
the ISR. Replacing the original ISR by our own ISR handler is very hard, 
because it depends too much on the hardware device. But we know that the 
real I/O is done by the DPC, so when KiInterruptTemplate will call our 
ServiceRoutine, first we call the original ServiceRoutine and we modify 
the last DPC entry by our. 

DPC are represented by KDPC structures :

kd> dt nt!_KDPC
   +0x000 Type             : Int2B
   +0x002 Number           : UChar
   +0x003 Importance       : UChar
   +0x004 DpcListEntry     : _LIST_ENTRY
   +0x00c DeferredRoutine  : Ptr32     void 
   +0x010 DeferredContext  : Ptr32 Void
   +0x014 SystemArgument1  : Ptr32 Void
   +0x018 SystemArgument2  : Ptr32 Void
   +0x01c Lock             : Ptr32 Uint4B

DPC list can be found in the KPRCB (Kernel Processor Control Region Block)
structure of the current processor. KPRCB is preceded by a KPCR (Kernel 
Processor Control Block) structure which is located at FS:[0x1C] on the 
current processor. KPRCB is a 0x120 bytes from the beginning of the KPCR 
structure.

dt nt!_KPRCB
[...]
   +0x860 DpcListHead      : _LIST_ENTRY
   +0x868 DpcStack         : Ptr32 Void ; DPC arguments
   +0x86c DpcCount         : Uint4B ; DPC core counter
   +0x870 DpcQueueDepth    : Uint4B ; Numbers of DPC in the list
   +0x874 DpcRoutineActive : Uint4B
   +0x878 DpcInterruptRequested : Uint4B
   +0x87c DpcLastCount     : Uint4B
   +0x880 DpcRequestRate   : Uint4B
   +0x884 MaximumDpcQueueDepth : Uint4B
   +0x888 MinimumDpcRate   : Uint4B

Now we know how to retrieve the DPC of our interrupt, we can easily 
change it to our own and handle the data.
   
For the keyboard the DPC is queued by KeInsertQueueDpc in the 
I8xQueueCurrentKeyboardInput routine called by the keyboard's ISR.
 
kd> dt nt!_KDPC 80e3461c
+0x000 Type             : 19 ; 19=DpcObject
+0x002 Number           : 0 ''
+0x003 Importance       : 0x1 ''
+0x004 DpcListEntry     : _LIST_ENTRY [ 0xffdff980 - 0x80559684 ]
+0x00c DeferredRoutine  : 0xfa815650    void  i8042prt!I8042KeyboardIsrDpc
+0x010 DeferredContext  : 0x80e343b8 
+0x014 SystemArgument1  : (null) 
+0x018 SystemArgument2  : (null) 
+0x01c Lock             : 0xffdff9c0  -> 0

Here is the figure of the attack :

                                            MyKinterrupt structure
                                            +---------------------+
Hardware Interrupt                     /---->  MyServiceRoutine   |
       |                               |    |  Calls the original |
       |                               |    |  ISR             ------\
       \---> IDT ---\                  |    |  And modify the DPC |  |  
                    |                  |    |  queue.             |  |
                    |                  |    +---------------------+  |
            +---------------------+    |                             |
            | KiInterruptTemplate -----/      Original Kinterrupt    |
            +---------------------+         +---------------------+  |
     Core                                   |                     |  |
+------------+                              |   ServiceRoutine <-----/
|            |                              | Queues the ISR's DPC|
|DpcListHead |--\                           +---------------------+       
|            |  |
+------------+  |
                | +-----+     +-----+     +-----+     +-----+
                \-> DPC |---->| DPC |---->| DPC |---->| DPC |-->DpcListHead
   DpcListHead<---|     |<----|     |<----|     |<----|     |
                  +-----+     +-----+     +-----+     +-----+  
                                                        /\     
                                                        ||
                                                  Last DPC entry
                                                  Modified after the call
                                                  to the ServiceRoutine.
 
-----[ 2.3.3 - Application 1 : Kernel keylogger 

It's time to design a POC. In this sample we will see how to sniff 
keyboard keystrokes. As you see previously, we are now able to control the
DPC generated by an interrupt. For the keyboard we will hijack the 
I8042KeyboardIsrDpc routine which is set into the DPC's keyboard 
interruption. With our own DPC handler we will reproduce the behavior of 
the original routine, unfortunately this kind of routine is hard to write 
so we ripped some pieces of codes and used reversing techniques (notice the
lazy hacker style).

In our DPC handler we must call the KeyboardClassServiceCallback [28] 
routine, this routine is provided by the Kbdclass driver. This callback 
transfers input data buffer of a device to the class data queue. A function
keyboard driver must calls this class service callback in its DPC routine.
Here is the KeyboardClassServiceCallback's prototype :

VOID
  KeyboardClassServiceCallback (
    IN PDEVICE_OBJECT  DeviceObject,
    IN PKEYBOARD_INPUT_DATA  InputDataStart,
    IN PKEYBOARD_INPUT_DATA  InputDataEnd,
    IN OUT PULONG  InputDataConsumed
    );

Parameters :
    DeviceObject : Pointer to the class device object. 
    
    InputDataStart : Pointer to the first keyboard input data packet in 
    the input data buffer of the port device.
    
    InputDataEnd : Pointer to the keyboard input data packet that 
    immediately follows the last data packet in the input data buffer of 
    the port device.
 
    InputDataConsumed : Pointer to the number of keyboard input data 
    packets that are transferred by the routine.    
    
KEYBOARD_INPUT_DATA is defined by :

typedef struct _KEYBOARD_INPUT_DATA {
  USHORT  UnitId;
  USHORT  MakeCode;
  USHORT  Flags;
  USHORT  Reserved;
  ULONG  ExtraInformation;
} KEYBOARD_INPUT_DATA, *PKEYBOARD_INPUT_DATA;

So in our DPC handler we just have to check the MakeCode member of the set
of KEYBOARD_INPUT_DATA structures. The MakeCode (or scancode) represents 
the data sent by the keyboard to the system when you hit or release a 
key, each key has it's own scancode and the system usually translates the 
scancode into a character depending on you code page. For example the 
scancode 19d on classical US keyboard is translated into the keycode 'e'.

In order to know if CAPSLOCK is activated we send an IOCTL to the 
functional keyboard device but we can only send IOCTL at a PASSIVE_LEVEL
IRQL. For that we use a system thread which will sent IOCTL with the 
kernel API IoBuildDeviceIoControlRequest. In fact the scancodes are queued 
in a list locked by a spinlock and thread synchronized with a semaphore. 
The thread is listening to incoming keystrokes then converts scancodes into
keycodes. Like the kernel keylogger Klog does [29]. 

-----[ 2.3.4 - Application 2 : NDIS packet sniffer 

In the same way, an interrupt is raised when your network card receives 
a packet. When this kind of interrupt is raised NDIS ISR handler 
(ndisMIsr) routine launches the miniport ISR interrupt handler. The 
ndisMIsr routine is used as a wrapper for miniport ISR and DPC. You can 
see in the IDT the following entry : 

3b: 80df6414 NDIS!ndisMIsr (KINTERRUPT 80df63d8) 

It means, your ISR handler is not called directly when an interrupt 
occurs, it is the ndisMIsr routine. Miniport's ISR is called by ndisMIsr 
and the miniport DPC is also queued in this routine. The DPC queued is 
the ndisMDpc routine which wraps your own DPC miniport handler. Finally 
NDIS wraps all the interrupt process with ndisMIsr and ndisMDpc routines on
Windows XP with NDIS 5.1. We don't know if this implementation is still 
present on Windows Vista with NDIS 6.0.

We know we can hijack the ndisMDpc handler by our own handler. With NDIS
we will proceed in the same way but we will not hook the MiniportDpc 
routine but directly hook the ndisMDpc routine. Why? Because we know that 
ndisMDpc wraps the MiniportDpc routine and in fact MiniportDpc depends too
much on the hardware of the miniport device. Each miniport device is 
represented by an NDIS_MINIPORT_BLOCK [30] structure, in this structure
we find a reference to a NDIS_MINIPORT_INTERRUP structure, which looks 
like :

kd> dt ndis!_NDIS_MINIPORT_INTERRUPT
   +0x000 InterruptObject  : Ptr32 _KINTERRUPT
   +0x004 DpcCountLock     : Uint4B
   +0x008 Reserved         : Ptr32 Void
   +0x00c MiniportIsr      : Ptr32 Void 
   +0x010 MiniportDpc      : Ptr32 Void 
   +0x014 InterruptDpc     : _KDPC
   +0x034 Miniport         : Ptr32 _NDIS_MINIPORT_BLOCK
   +0x038 DpcCount         : UChar
   +0x039 Filler1          : UChar
   +0x03c DpcsCompletedEvent : _KEVENT
   +0x04c SharedInterrupt  : UChar
   +0x04d IsrRequested     : UChar

If we look at the ndisMDpc routine we notice that only the first parameter
is used and this parameter refers to a NDIS_MINIPORT_INTERRUPT structure.
The ndisMDpc function will call the MiniportDpc field of this structure. 
We just have to hijack this pointer by our routine in order to control the 
incoming packets on the system.

The NDIS documentation specifies that a miniport DPC routine must 
notify the bound protocol driver that an that an array of received 
packets is available by calling the NdisMIndicateReceivePacket function
[31].

VOID
  NdisMIndicateReceivePacket(
    IN NDIS_HANDLE  MiniportAdapterHandle,
    IN PPNDIS_PACKET  ReceivePackets,
    IN UINT  NumberOfPackets
    ); 

In the ndis.h header we have :
#define NdisMIndicateReceivePacket(_H, _P, _N)              \
{                                                           \
    (*((PNDIS_MINIPORT_BLOCK)(_H))->PacketIndicateHandler)( \
                        _H,                                 \
                        _P,                                 \
                        _N);                                \
}

So in our MiniportDpc routine we will hihjack the PacketIndicateHandler, 
which is often the ethFilterDprIndicateReceivePacket routine in the 
NDIS_MINIPORT_BLOCK structure, in order to filter the incoming packets on 
the miniport. After we have hijacked this pointer we call the original 
MiniportDpc routine that will process everything. After that, we restore 
the PacketIndicateHandler handler in the NDIS_MINIPORT_BLOCK for stealth
reasons. To sum up we must :

- Hijack the routine into the DPC queued by the ndisMIsr routine.

- Now that we have hijacked the ndisMDpc we modify the 
PacketIndicateHandler into the NDIS_MINIPORT_BLOCK of the miniport.

- We call the ndisMDpc routine. It will call the original MiniportDpc 
handler

- The MiniportDpc routine calls the NdisMIndicateReceivePacket macro. Our
filter function is called and we do our job.

- When the ndisMDpc returns we restore the original PacketIndicateHandler 
into the NDIS_MINIPORT_BLOCK of the miniport.

With this filter, we can monitor or modify the incoming packets. For 
example, our PacketIndicateHandler hook can search in the incoming packets 
for a tag, when this tag his found the rootkit triggers a function.
               
---[ 2.2 - Conclusion about stealth hooking on IDT

In this part we have seen how Windows manages his hardware interrupts by 
using a global template function dedicated to all interrupts. The fact 
that this template routine his forged for each interrupts is the main 
point of this attack, with that we can create a fake template routine that 
cannot be detected directly. The stealth of our attack remains on two 
points :

- We modify only dynamic allocated and forged code

- We hijack highly temporal dynamic allocated structures which when 
running, are always preempting the core. 

So, even if the scope of our attack is restricted, controlling the hardware
is the best way for a rk to reach critical components. Finally, we have 
just cheated the system with its own features and that's the purpose of a 
stealth rootkit.
                
--[ 3 - Owning NonPaged pool using stealth hooking

Rootkit sophistication depends on how it subverts the kernel. More 
complex techniques come out as kernel and hardware understanding evolve. 
Nowadays there is so many ways to subvert the kernel, in consequence 
protections become harder to defeat. We're going to present a different 
means to gain control. Next techniques apply this approach to the kernel 
memory allocator. 

Our goal is getting execution on every NonPaged allocation without using
any hook. It must bypass any hooking verification even those based on code 
page comparison or hashing. It will be done by modifying data used by the 
allocator. We just apply the concept of using code against itself. We do 
believe that this concept can be used on others components and in 
different ways successfully.

We won't try to convince you that this technique is perfect. It evades 
current protections and detection systems. The more important is that they 
would need more than a simple modification to prevent and block an attack
based on kernel code behavior.

---[ 3.1 - Kernel allocation layout review

As every operating system, Windows kernel puts forward some functions in 
order to allocate or free memory. Virtual memory is organized as block of 
memory called pages. In Intel x86 architecture, a page size is 4096 bytes 
and most allocations requests are smaller. Thus, kernel functions like 
ExAllocatePoolWithTag and ExFreePoolWithTag kept unused memory blocks for 
next allocations. Internal functions directly interact with hardware each 
time a page is needed. All those procedures are complex and delicate that's
why drivers trust kernel implementation.

-----[ 3.1.1 - Difference between Paged and NonPaged pool

Kernel system memory is divided in two different kind of pool. It has been 
separated to distinguish most used memory blocks. The system must know 
which pages should be resident and which can be temporarily discarded. The 
page fault handler restores pageable memory only when IRQL is inferior of
DPC or DISPATCH level. Paged pool can be paged in or out of the system. A 
memory block paged out will be saved on the file system and so unused part 
of paged memory will not be resident in memory. NonPaged pool is present 
in every IRQL level and then is put-upon for important tasks.

The file pagefile.sys contains paged out memory. It was attacked to inject 
unsigned code into Vista kernel [32]. Some solutions was discussed as 
disabling kernel memory paging. Joanna Rutkowska defended this solution as
more secure than others but with a small physical memory loss. Microsoft
just denied raw disk access, which may prove that Paged and NonPaged 
layout is an important feature of Windows kernel [33].

This article focuses on NonPaged pool layout as PagedPool handling is 
totally different. NonPaged pool can be more or less considered as 
following a typical heap implementation. Global information about system 
pool can be found in Microsoft Windows Internals [34].

-----[ 3.1.2 - NonPaged pool tables

The allocation algorithm must be fast allocating on the most used sizes. 
That why three different tables exist and each one is devoted to a size 
range. We found this organization in most memory management algorithms. 
Retrieving memory blocks from hardware takes time. Windows balances between
response faster and avoid memory wasting. Response time becomes faster if 
memory blocks are stored for next allocations. In the other hand, if you 
keep too much memory, it can penalize memory demands.

Each table implements a different way to store memory blocks. We will 
present each table and where you can find them.

The NonPaged lookaside is a per-processor table covering size inferior or 
equal to 256 bytes. Each processor has a processor control register (PCR) 
storing data concerning only a single processor like IRQL level, GDT, IDT. 
Its extension called processor control region (PCRB) contains lookasides 
tables. Next windbg dump presents NonPaged lookaside table and its 
structure.

kd> !pcr
KPCR for Processor 0 at ffdff000:
    Major 1 Minor 1
  NtTib.ExceptionList: 805486b0
      NtTib.StackBase: 80548ef0
     NtTib.StackLimit: 80546100
   NtTib.SubSystemTib: 00000000
        NtTib.Version: 00000000
    NtTib.UserPointer: 00000000
        NtTib.SelfTib: 00000000

              SelfPcr: ffdff000
                 Prcb: ffdff120
                 Irql: 00000000
                  IRR: 00000000
                  IDR: ffffffff
        InterruptMode: 00000000
                  IDT: 8003f400
                  GDT: 8003f000
                  TSS: 80042000

        CurrentThread: 80551920
           NextThread: 00000000
           IdleThread: 80551920

            DpcQueue:  0x80551f80 0x804ff29c
kd> dt nt!_KPRCB ffdff120
[...]
   +0x5a0 PPNPagedLookasideList : [32]
      +0x000 P                : 0x819c6000 _GENERAL_LOOKASIDE
      +0x004 L                : 0x8054dd00 _GENERAL_LOOKASIDE
[...]
kd> dt nt!_GENERAL_LOOKASIDE
   +0x000 ListHead         : _SLIST_HEADER
   +0x008 Depth            : Uint2B
   +0x00a MaximumDepth     : Uint2B
   +0x00c TotalAllocates   : Uint4B
   +0x010 AllocateMisses   : Uint4B
   +0x010 AllocateHits     : Uint4B
   +0x014 TotalFrees       : Uint4B
   +0x018 FreeMisses       : Uint4B
   +0x018 FreeHits         : Uint4B
   +0x01c Type             : _POOL_TYPE
   +0x020 Tag              : Uint4B
   +0x024 Size             : Uint4B
   +0x028 Allocate         : Ptr32     void*
   +0x02c Free             : Ptr32     void
   +0x030 ListEntry        : _LIST_ENTRY
   +0x038 LastTotalAllocates : Uint4B
   +0x03c LastAllocateMisses : Uint4B
   +0x03c LastAllocateHits : Uint4B
   +0x040 Future           : [2] Uint4B
   
Lookaside tables permit faster block retrieving than typical double linked 
list. For this optimization lock time is really important and a single 
linked list is a faster mechanism than software locking. 
ExInterlockedPopEntrySList function is used to pop an entry from a single 
linked list using hardware locking instruction "lock".

PPNPagedLookasideList is the lookaside table we were talking about. It 
contains two lookaside lists P and L. Depth field of the GENERAL_LOOKASIDE 
structure defines how many entries can be in ListHead single list. The 
system updates regularly the depth using different counters. The update 
algorithm is based on processor number and is different for P and L. Depth 
of the P list is updated more frequently than L list as it optimizes 
performances on very small blocks.

The second table depends how many processors are used and how system 
managed them. Allocation system walk it if size is inferior or equal to 
4080 bytes or if lookaside research failed. Even if target table can 
change, it always has the same POOL_DESCRIPTOR structure. On single 
processor, a variable called PoolVector is used to retrieve 
NonPagedPoolDescriptor pointer. On multi processor, the 
ExpNonPagedPoolDescriptor table has 16 slots containing pool descriptors. 
Each processor PRCB points on a KNODE structure. A node can be linked on 
more than one processor and contains a color field used as an index in 
ExpNonPagedPoolDescriptor. Next figures illustrate this algorithm.

          PoolVector
        +------------+
        |  NonPaged  | --------------> NonPagedPoolDescriptor
        |------------+
        |   Paged    |
        +------------+

         [ Figure 1 - Single processor pool descriptor ]

         Processor #1
        +------------+
        |            |                  ExpNonPagedPoolDescriptor
        |    PRCB  ------\                +-------------------+
        |            |   |          /---------> SLOT #01      |
        +------------+   |          |     |     SLOT #02      |
               /---------/          |     |     SLOT #03      |
               |         KNODE      |     |     SLOT #04      |
               |---> +------------+ |     |     SLOT #05      |
               |     | Proc mask  | |     |     SLOT #06      |
               |     | color (01) --/     |     SLOT #07      |
               |     | ...        |       |     SLOT #08      |
               |     +------------+       |     SLOT #09      |
               |                          |     SLOT #10      |
               \---------\                |     SLOT #11      |
         Processor #2    |                |     SLOT #12      |
        +------------+   |                |     SLOT #13      |
        |            |   |                |     SLOT #14      |
        |    PRCB  ------/                |     SLOT #15      |
        |            |                    |     SLOT #16      |
        +------------+                    +-------------------+

        [ Figure 2 - Multiple processor pool descriptor ]

A global variable ExpNumberOfNonPagedPools defines if multi processor case 
is used. It should reflect processor number but it can change between 
operating system versions.

The next dump shows POOL_DESCRIPTOR structure from windbg.

kd> dt nt!_POOL_DESCRIPTOR
   +0x000 PoolType         : _POOL_TYPE
   +0x004 PoolIndex        : Uint4B
   +0x008 RunningAllocs    : Uint4B
   +0x00c RunningDeAllocs  : Uint4B
   +0x010 TotalPages       : Uint4B
   +0x014 TotalBigPages    : Uint4B
   +0x018 Threshold        : Uint4B
   +0x01c LockAddress      : Ptr32 Void
   +0x020 PendingFrees     : Ptr32 Void
   +0x024 PendingFreeDepth : Int4B
   +0x028 ListHeads        : [512] _LIST_ENTRY

Queued spinlock synchronization, part of HAL library, is used to restrict 
concurrency on a pool descriptor. It assures that only one thread and one 
processor will access and unlink an entry from a pool descriptor. HAL 
library changes on different architectures and what is a simple IRQL 
raising on single processor becomes a more complex queued system on 
multi-processor. For default pool descriptor, general NonPaged queued 
spinlock is locked (LockQueueNonPagedPoolLock). Else, a custom queued 
spinlock is created.

The third and last table is shared by processors for size superior of 4080 
bytes. MmNonPagedPoolFreeListHead is also used when others tables lack 
memory. It composed by 4 LIST_ENTRY each one representing a page number, 
except for the last one which holds all superiors pages kept by the system.
Access to this table is guarded by general non paged queued spinlock 
also called LockQueueNonPagedPoolLock. During the free procedure of a 
smaller block, ExFreePoolWithTag merges it with previous and next free 
blocks. It can create a block superior or equal to 1 page. In this case, 
the new block is added in the MmNonPagedPoolFreeListHead table.

-----[ 3.1.3 - Allocation and free algorithms

Kernel allocation does not change that much between OS versions but its 
algorithm is as hard as the userland heap one. In this part, we want to 
illustrate basic behavior between tables during allocation or free 
procedures. A lot of details have been thrown away such as synchronization 
mechanisms. Those algorithms will help you for the technique explanation 
but also understanding the basic elements of kernel allocation. Despite 
kernel exploitation is not part of this paper, pool overflow is an 
interesting topic that needs understanding of some part of this algorithm. 

NonPaged pool allocation algorithm (ExAllocatePoolWithTag):

    IF [ Size > 4080 bytes ]
      [
        - Call the MiAllocatePoolPages function
          - Walk MmNonPagedPoolFreeListHead LIST_ENTRY table.
          - Retrieve memory from hardware if necessary.
          - Return memory page aligned (without header).
      ]

    IF [ Size <= 256 bytes ]
      [
        - Pop entry from PPNPagedLookasideList table.
        - If something is found return memory block.
      ]

    IF [ ExpNumberOfNonPagedPools > 1 ]
      - PoolDescriptor from ExpNumberOfNonPagedPools and used index
        comes from PRCB KNODE color.
    ELSE
      - PoolDescriptor is PoolVector first entry, designed by symbol
        as NonPagedPoolDescriptor.

    FOREACH [ >= Size entry of PoolDescriptor.ListHeads ]
      [
        IF [ Entry is not empty ]
          [
            - Unlink entry and split it if needed
            - Return memory block
          ]
      ]

    - Call the MiAllocatePoolPages function
      - Walk MmNonPagedPoolFreeListHead LIST_ENTRY table..
      - Split it correctly to the right size
      - Return new memory block

NonPaged pool free algorithm (ExFreePoolWithTag) :

    IF [ MemoryBlock is page aligned ]
      [
        - Call the MiFreePoolPages function
          - Determine block type (Paged or NonPaged)
          - Depending on how many blocks are kept in
            MmNonPagedPoolFreeListHead, we release it to the hardware.
      ]
    ELSE
      [
        - Merge previous and next block if possible

        IF [ NewMemoryBlock size <= 256 bytes ]
          [
            - Look at PPNPagedLookasideList entry depth and see if we
              should keep it.
            - We return if memory block is pushed into lookaside list
          ]

        IF [ NewMemoryBlock size <= 4080 bytes ]
          [
            - Use POOL_HEADER PoolIndex variable to determine which
              PoolDescriptor must be used.
            - Insert it in the proper LIST_ENTRY array entry
            - If anything goes well, return
          ]

        - Depending on how many blocks are kept in
          MmNonPagedPoolFreeListHead, we release it to the hardware.
      ]

Paged pool algorithm is very different especially for page aligned blocks. 
Smaller size management should be not that far from NonPaged but in 
assembly code we definitely saw that NonPaged and Paged pool are totally 
separated. Once you know a little more about how NonPaged allocation works,
we can now talk about exploitation part.

---[ 3.2 - Getting code execution abusing allocation code

Our main goal is getting code execution on every allocation attempts for  
NonPaged pool only. This result must be done only by changing data used by 
targeted code. Our purpose is proving that kernel code can serve our 
interest only by changing typical data environment. Our work is based
on a new rootkit developed to gain control over NonPaged allocation.

We start with getting code execution for allocation superior or equal
to 1 page. As we saw on previous part, it concerns the third and last 
table.

-----[ 3.2.1 - Data corruption of MmNonPagedPoolFreeListHead

MmNonPagedPoolFreeListHead conserves page aligned memory blocks to speed up
memory allocation. It links held memory block using a LIST_ENTRY structure.
This structure is common and use in Windows heap library for example.

kd> dt nt!_LIST_ENTRY
   +0x000 Flink            : Ptr32 _LIST_ENTRY
   +0x004 Blink            : Ptr32 _LIST_ENTRY

MmNonPagedPoolFreeListHead access is protected by general NonPaged queued 
spinlock LockQueueNonPagedPoolLock. It assures that only one thread and 
processor can look and modify this structure.

So we need a way to get control over allocation and unlinking procedure 
seems perfect. We can poison this linked list with a fake entry, with the 
highest size possible, which unlinking will modify current executed code. 
At kernel level, you can modify code as data without any protection issues.
Unlinking was used when heap exploitation started [20] but modifying code 
was not possible from userland. As spinlock assures us exclusivity, there 
is no risk on some race condition. The created "hook" would be dynamic and 
code restored directly. Page guard protection reverse [18] shows that code 
is only checked every 5 minutes. Whether a modification is found, real code
is just replaced.

This method has plenty assets but also a lot of obstacles. Let start by
enumerating all those obstacles :

 - On a basic implementation of unlinking, list become unwalkable.
   It breaks most utilization of the table.
 - Pass through page cleaning methods and always be the first block on the 
   list otherwise we could miss some call.
 - We break code path and sooner or later we must return as if our
   hijacking has never been there and everything goes fine.
 - Processor prefetch make self code modification dangerous.

Unlinking gives us 4 bytes overwriting to build an opcode and create a 
redirection. In our case, we influenced current context and a register 
should point to the unlinked entry. We said should point without choosing a
single register because kernel changes between versions or service packs. 
As soon as we discuss context, we will stay talking about general 
situations. We choose to make a jmp [reg+XX] which is FF60XX in hex.

This technique effectiveness lies on keeping the MmNonPagedPoolFreeListHead
walkable. A double linked list, as LIST_ENTRY, is walkable if Flink is 
correct. Therefore we can choose an address for Flink as 0xXXXX60FF and 
Blink will point to the code address. The Intel x86 architecture using 
little endian our address is quite easy to found, we must check opcode 
offset and discard too close possibilities. Next figure illustrates a 
poisoned entry.

          MmNonPagedPoolFreeListHead[i]
   /------> +--------------------+
   |        |       Flink        | ---\
   |        |--------------------|    |
   |  <---- |       Blink        |    |
   |        +--------------------+    |
   |        |        ...         |    |
   |        +--------------------+    |
   |  /-------------------------------/
   |  |
   |  |         Poisoned entry
   |  |     +--------------------+
   |  |     | PreviousSize : -   |
   |  |     +--------------------+
   |  |     |   PoolIndex : -    |
   |  |     +--------------------+
   |  |     | PoolType: NonPaged |
   |  |     +--------------------+
   |  |     |   BlockSize : i    |
   |  |     +--------------------+
   |  |     |    PoolTag : -     |
   |  \---> +--------------------+
   |        | Flink : 0xYYXX60FF | <--\
   |        |--------------------|    |
   |   X--- | Blink : 0x80YYYYYY |    |
   |        +--------------------+    |
   |                                  |
   |  /-------------------------------/
   |  |     Fake entry (0xYYXX60FF)
   |  |     +--------------------+
   |  |     | PreviousSize : -   |
   |  |     +--------------------+
   |  |     |   PoolIndex : -    |
   |  |     +--------------------+
   |  |     | PoolType: NonPaged |
   |  |     +--------------------+
   |  |     |   BlockSize : < i  |
   |  |     +--------------------+
   |  |     |    PoolTag : -     |
   |  |---> +--------------------+
   |  |     | Flink : 0x80.....  | ---\
   |  |     |--------------------|    |
   |  \---- | Blink : Poisoned   |    |
   |        +--------------------+    |
   \--------------- [...] ------------/

    Unlinking instruction      : mov [0x80YYYYYY], 0xYYXX60FF
    New Opcode after unlinking : jmp [reg+XX] (FF 60 XX)

   [ Figure 3 - Poisoned double linked list ]
   
This figure shows a MmNonPagedPoolFreeListHead entry layout that assures 
predicted unlinking and then code execution. We must maintain this layout 
or we will lose our position. NonPaged blocks come from two different 
virtual memory ranges. The second memory region start is stored in 
MmNonPagedPoolExpansionStart. A cleaning function is called sometime to 
free blocks from the expansion NonPaged pool. To avoid this cleaning, we 
can use a Paged pool block locked. You can lock a memory block with the
MmProbeAndLockPages function. This lock makes described memory region as 
resident. Another more discreet way is to remap a NonPaged block with the 
function MmMapLockedPagesSpecifyCache. It is more discrete because this 
mapping would be just before expansion NonPaged pool memory range. Using
a locked Paged pool block creates an address totally differently. A quick 
look at those addresses between NonPaged ones show a clear difference.
As virtual memory is very large, it does not take too much time to find 
an address like 0xYYXX60FF. We will not unlock those pages until our
technique is running.

To defeat code path issues we differentiate two different states. The first
state is when our block is selected. The second state is when our block is 
unlinked. If we were able to return to the first step with our next fake 
entry selected, we could continue walking code as normal. We achieve that 
by using a generic approach. At IRQL equal to DISPATCH_LEVEL, we corrupt a 
MmNonPagedPoolFreeListHead entry with some invalid pointers. With a hook on
the page fault handler we are capable to see first and second stages, 
restore the right context each time and save context difference between 
those states. 

Assembly dump from MiAllocatePoolPages :

  lea     eax, [esi+8] ; Stage #1 esi is selected block and esi+8 its size
  cmp     [eax], ebx   ; Check with needed size
  mov     ecx, esi
  jnb     loc_47014B
  [...]
  
loc_47014B:
  sub     [esi+8], ebx
  mov     eax, [esi+8]
  shl     eax, 0Ch
  add     eax, esi
  cmp     _MmProtectFreedNonPagedPool, 0 ; Protected mode, don't care
  mov     [ebp+arg_4], eax
  jnz     short loc_47016E
  mov     eax, [esi]      ; \ Stage #2
  mov     ecx, [esi+4]    ; | Unlinking
  mov     [ecx], eax      ; | procedure
  mov     [eax+4], ecx    ; /
  jmp     short loc_470174

Now let's see how it works during our test technique with interrupt fault
handler (int 0xE) hooked :

  lea     eax, [esi+8]
                       ; Stage #1 - Check with needed size
  cmp     [eax], ebx   ; ----> PAGE FAULT esi = 0xAAAAAAAA | eax = esi + 8
                       ;     - We keep EIP and all registers
                       ;     - Scan all registers for 0xAAAAAAAA +/- 8
                       ;       and correct the current context. Continue.
  mov     ecx, esi
  jnb     loc_47014B
  [...]
  
loc_47014B:
  sub     [esi+8], ebx
  mov     eax, [esi+8]
  shl     eax, 0Ch
  add     eax, esi
  cmp     _MmProtectFreedNonPagedPool, 0 ; Protected mode, don't care
  mov     [ebp+arg_4], eax
  jnz     short loc_47016E
  mov     eax, [esi]      ; \ Stage #2 - Unlinking procedure
  mov     ecx, [esi+4]    ; |
  mov     [ecx], eax      ; | ------> PAGE FAULT ecx = 0xBBBBBBBB
                          ; |                    eax = 0xCCCCCCCC
                          ; |     - Keep EIP and sub this context from
                          ; |       Stage #1 saved context
                          ; |     - Change fault registers and 
                          ; |       structure pointers. Continue.
  mov     [eax+4], ecx    ; /
  jmp     short loc_470174
    
Fault addresses 0xAAAAAAA, 0xBBBBBBBB and 0xCCCCCCCC must point on invalid
addresses to force a caught page fault. This test is made only once and 
when we still have exclusivity on all processors. The int 0xE (page fault)
handler is restored just after.

This generic technique permits us to restore a valid context just before 
selected block size is checked. Once we get code execution, we apply 
context difference, change the current block register and then return at 
first stage address. It works well because our two stages are very close, 
once a selected block size is checked, unlinking is directly made.

Given examples were based on a single LIST_ENTRY of the 
MmNonPagedPoolFreeListHead table but you must poison all entries. If a 
given entry is empty (except for our fake blocks), the algorithm tries 
next the entry. It means we will be called more than one time per 
allocation. We created a mechanism to manage multiple call on a single 
allocation. If the first entry is empty, the second entry is used and so 
on. Then we will be called twice or more. By checking current table, we 
can predict a future code execution on the same allocation and avoid 
executing payload more than one time per allocation request.

Prefetch is a processor feature that retrieves more than a single 
instruction from memory before it executes them. Some processor use a 
complex branch prediction algorithm to fetch as much instruction as 
possible. After some tests, we saw that processors invalidate code cache 
when a modification occurs in cached memory addresses. Our driver supports 
a case where code modification could be right after current instruction. 
To achieve that we created a routine which calculates prefetch cache size 
and consider it in next parts of our technique. We could also search 
specifics instructions which clean prefetch cache like a far jump but it 
can only be used as an option.

This technique gives us code execution for NonPaged allocation superior or 
equal to 1 page. It achieves that with a stealth hook, created by kernel 
code and cleaned by our routine directly after. It's far from being 
perfect as those allocations are not used that much. Next part describes 
how this technique can be extended to gain control over all NonPaged pool 
allocations.

-----[ 3.2.2 - Expend it for every size

Others lists can not be hijacked the same way because synchronization 
mechanisms are not exclusive. Changing some assembly code becomes tricky if
it can be executed by more than one thread at a time. Our method is 
assuring our previous technique execution on any allocation. Once we have 
control, we can find a way to retore ExAllocatePoolWithTag context with a 
correct return value. We must do that without recoding a single line of 
memory allocator. It is possible to create our own allocator but Windows 
one is great and it will perfectly do the job for us.

During allocation, the lookaside list is checked first. It will pop an 
entry and if this entry is not NULL, use it. This entry comes from 
GENERAL_LOOKASIDE ListHeader field. This field structure is SLIST_HEADER.

kd> dt nt!_SLIST_HEADER .
   +0x000 Alignment        : Uint8B
   +0x000 Next             :
      +0x000 Next             : Ptr32 _SINGLE_LIST_ENTRY
   +0x004 Depth            : Uint2B
   +0x006 Sequence         : Uint2B

The ExInterlockedPopEntrySList function pops an entry from a SLIST_HEADER
structure. The Next field is a pointer to the next SLIST node (single 
linked list). The Depth field represents how many entries are kept in the 
list. ExFreePoolWithTag compare GENERAL_LOOKASIDE optimal depth with 
current SLIST_HEADER depth. ExAllocatePoolWithTag does not check this 
field and just looks if some entry can be popped out Next field. To stunt 
allocation and free procedure on NonPaged lookaside table, we set Next 
field to NULL and Depth field to 0xFFFF. This state will be preserved and 
this table will not be used anymore. 

Our technique expansion relies entirely on subverting how the 
ExpNonPagedPoolDescriptor table is used. In the previous part, we explained
global variable ExpNumberOfNonPagedPools involvement in this process. It 
is possible to expand number of NonPaged pools and then play with current 
KNODE color. During allocation, the KNODE color defines which pool 
descriptor is used. Then during free procedure, PoolIndex field of 
POOL_HEADER keep pool descriptor color.

So we can use this nice feature to our advantage. Default KNODE color on 
every processors would point on an empty pool descriptors. It will lead to 
code execution using our base technique. If the function 
MiAllocatePoolPages return address is not the one use for classical page 
rounded allocation, we know that a smaller allocation occur. All we have 
to do is switch PRCB KNODE pointer to a copy with custom color and recall 
ExAllocatePoolWithTag. Everything related to allocation and block 
management will be implemented as it needs to be even if it differs 
between operating system versions. Returned blocks PoolIndex will point to 
our own pool descriptor and free procedure, which will perfectly work. Lets
see how it will look on a single processor.

      ExpNonPagedPoolDescriptor
        +-------------------+  
        | PREVIOUS POOLDESC | <--- Kept for compatibility (0)
        |  EMPTY POOLDESC   | <--- Default KNODE->color (1)
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |        --         |  
        |  CUSTOM POOLDESC  | <--- Used for our allocations (16)
        +-------------------+  

      [ Figure 4 - Corrupted ExpNonPagedPoolDescriptor ]
      [            on single processor                 ]

This setup is just an example and you can manage the arrangement as you 
want. We could transfer previous blocks from older pool descriptors in our 
own and then receive free blocks. It is also possible to use multiple pool 
descriptors and so on. Beware of system pool descriptor recycling as it can
leads to strange behavior specially on multi-processor architecture.

Once we have our fresh allocated block, we must return at 
ExAllocatePoolWithTag return address. MiAllocatePoolPages has been called 
to retrieve a new page and fill the current pool descriptor with it. It's 
obvious that we can't return normally and let page allocation occurs. On 
Intel x86 architecture the stack is used to store local variables, 
arguments and saved registers. The Windows compiler starts by reserving 
local variable and then pushes each register before its modification. 
The next figure shows our stack configuration once we have code execution.

    top
      +--------------------+
      | Our stack elements |      Restore assembly example
      +--------------------+ <------ /---------------\
      |                    |         | pop ecx       |
      |  Saved registers   |         | pop ebx       |
      |                    |         | pop esi       |
      +--------------------+         | leave         |
      |                    |         | retn 0Ch      |
      |                    |         \---------------/
      |                    |                |
      |                    |                |
      |  Stack variables   |                |
      |                    |                |
      |                    |                |
      |                    |                |
      +--------------------+         [new stack level]
      |     Saved EBP      |                |
      +--------------------+                |
      |   Return Address   |                |
      +--------------------+                |
      |                    |                |
      | Function arguments |                |
      |                    |                |
      +--------------------+ <--------------/
    bottom

      [ Figure 5 - Stack context after code execution ]
      [            ~ small blocks case ~              ]

The restore assembly part shows correct assembly in current function which 
perfectly restores the context. It does not correspond of the first series 
of pop instruction before return. There is an important risk that some 
register has not been pushed yet. It is possible to deduce the pushed 
register number by looking at function prologue when stack variables are 
reserved. In the Windows compiler, it's quite simple and we can easily 
calculate the pushed register number. A simple disassembly analysis on 
needed pop register number does the job. It must be done for 
MiAllocatePoolPages and ExAllocatePoolWithTag. We change the return 
address stored in the stack and go to the deduced MiAllocatePoolPages 
address. Last step is setting eax register for the return value. Both 
functions return a value and preserve eax value. Our analyzer is dynamic 
and registers each pop and its register. That why we can restore the 
proper context even if it changes between versions.

The Windows compiler is really easy to predict and does not create too 
strange assembly organization. This technique is theoretically possible 
on every assembly code that follow stdcall specification. The approach 
could differ on others compilers.

---[ 3.3 Exploit our position

This article present a way of subverting the Windows kernel by modifying 
only data. No function pointers, no static hooking or others classical 
technique. It could exempt us of any other explanation. But it would not 
be complete without some concrete examples. I personally believe that the 
only limitation here is imagination.

-----[ 3.3.1 Generic stack redirection

Allocation occurs in so many places that you must rely on known context 
and functions. Once everything is setup and before releasing exclusivity, 
some stack redirection database can be created.

The first way to do this is calling a handler if stack backtracing 
reveals a specific function. Stack backtracing shows only return addresses
and not which function call it. Debuggers resolve those functions by deep
analysis or symbol checking. Implementing those features would take too 
much time. So it's better to target a specific return address on 
ExAllocatePoolWithTag stack frame. It will definitely improve check speed. 
To do that, we indicates to our stack redirection API that we target a 
specific function. Then launch a normal call or procedure that will lead to
our function. Every allocation during this time will show important 
backtrace stacks.

Let say, we target an IRP and we know which function handles it by looking 
at IRP dispatch table. We also know by reversing that it will allocate a 
NonPaged block. Launching an I/O request, our API could register some 
NonPaged call and recognize later.

In the wild, it will call the appropriate handler with sub context 
information. Sometimes getting a context is not enough. The second way 
stays on same principles but modifies the stack to assure our handler is
called once the function end. Efficiency depends on what is your target 
and how you modifying it. 

-----[ 3.3.2 Userland process code injection

This technique can be also used to inject code in userland to subvert 
trusted applications. NonPaged allocation occurs a lot in kernel mode and 
it happens in every process. Some kernel drivers like win32k.sys call 
userland many times. This call is achieve by the function 
KeUserModeCallback [35]. It modifies userland stack to switch temporarily
for a call in userland. Available functions are limited by a table. 

Userland injection from kernel should not be resident and only concern
known trusted application as browsers. Injection can be done on 
explorer.exe as well to launch an hidden instance of a trusted program.
KeUserModeCallback algorithm can be easily remade or copied then 
relocated.Redirection table could be subverted to redirect the call. We 
can also think about exploiting userland calls. It does not make any sense 
to add checks on those available functions.

--[ 4 - Detection

This article does not try to convince you that subverting IDT or 
allocation mechanism using advanced technique is the future. Most detection
tools only indicate if a rookit may or may not be in this computer. It has 
pains identifying which module is responsible. It detects antivirus or 
firewall as rootkits. A protection layout could detect itself as a rootkit 
because it does everything a rootkit does and so does not ask it to block 
or uninstall a rootkit. Rootkit papers demonstrate so many great ways to 
easily bypass those protections. But we don't see much those techniques in 
the wild, simply because rootkits don't need them for the moment. 

Detect software behavior modification could be part of a Verifiable 
Operating System [36]. It will involve basic checks on known memory 
structures. Checks integrity of LIST_ENTRY structures and correct them if
needed. We can blame rootkit protections as much as we want but detecting 
rootkits on a closed operating system is almost impossible. Gives more 
information for kernel components will certainly leads to more 
sofisticates attacks. In the other hand, it could reduce attack surface. 
It is specially true on a defence oriented operating system. Next 
protection improvements should come from the operating system itself.

Now that there are hardware improvements for virtualisation, such as
hypervisors, there will be extensions to hardware to detect and protect
against rootkits. It offers a real control on operating system behavior 
without advanced research on kernel layout. Some protections techniques 
that were impossible to implement in Windows environment like PAX, could 
rely on those hardware features. Our techniques could be detected by 
registering and monitoring some specific events on the processor. It is 
possible today to do that but performance issues are important.

Our attacks could be blocked using targeted protection such as signatures. 
An attack is defined as how many times it takes to create a generic 
protection. In this area, Patchguard is an important improvement.

--[ 5 - Conclusion

This papers techniques were made to show that elegant software hijacking 
can still evades most protections and avoid any performance issues or 
unstable behaviors. Even though, these techniques are hardly reliable and
should be considered only as a technical proof of concept. New protections
are not efficient enough or present. They do not represent a threat for
a rootkit which targets millions of computers. Reversing is an important 
tool in improving software rootkits techniques. Detecting that a rootkit
is present should not be enough. A protection which cannot uninstall a
rootkit or prevent infection is useless. Drivers signatures was a good 
idea as it was designed to stop current infections entries. But infection
prevention includes local kernel exploitation. Generic detection of those 
attacks would need an important improvement in anti-rootkits protections 
and operating system design.

--[ 6 - References

[1] Holy Father, Invisibility on NT boxes, How to become unseen on Windows 
NT (Version: 1.2)
http://vx.netlux.org/lib/vhf00.html

[2] Holy Father, Hacker Defender
https://www.rootkit.com/vault/hf/hxdef100r.zip

[3] 29A
http://vx.netlux.org/29a

[4] Greg Hoglund, NT Rootkit
https://www.rootkit.com/vault/hoglund/rk_044.zip

[5] fuzen_op, FU
http://www.rootkit.com/project.php?id=12

[6] Peter Silberman, C.H.A.O.S, FUto
http://uninformed.org/?v=3&a=7 

[7] Eeye, Bootroot
http://research.eeye.com/html/tools/RT20060801-7.html

[8] Eeye, Pixie
http://research.eeye.com/html/papers/download/
eEyeDigitalSecurity_Pixie%20Presentation.pdf

[9] Joanna Rutkowska and Alexander Tereshkin, Blue Pill project
http://bluepillproject.org/

[10] Frank Boldewin, A Journey to the Center of the Rustock.B Rootkit
http://www.reconstructer.org/papers/
A%20Journey%20to%20the%20Center%20of%20the%20Rustock.B%20Rootkit.zip

[11] Frank Boldewin, Peacomm.C - Cracking the nutshell
http://www.reconstructer.org/papers/
Peacomm.C%20-%20Cracking%20the%20nutshell.zip

[12] Stealth MBR rootkit
http://www2.gmer.net/mbr/

[13] EP_X0FF and MP_ART, Unreal.A, bypassing modern Antirootkits
http://www.rootkit.com/newsread.php?newsid=647

[14] AK922 : Bypassing Disk Low Level Scanning to Hide File
http://rootkit.com/newsread.php?newsid=783

[15] CardMagic and wowocock, DarkSpy
http://www.fyyre.net/~cardmagic/index_en.html

[16] pjf, IceSword
http://pjf.blogone.net

[17] Gmer
http://www.gmer.net/index.php

[18] Pageguard papers (Uniformed) :

- Bypassing PatchGuard on Windows x64 by skape & Skywing
  http://www.uninformed.org/?v=all&a=14&t=sumry
  
- Subverting PatchGuard Version 2 by Skywing
  http://www.uninformed.org/?v=all&a=28&t=sumry
  
- PatchGuard Reloaded: A Brief Analysis of PatchGuard Version 3 by Skywing
  http://www.uninformed.org/?v=all&a=38&t=sumry

[19] Greg Hoglund, Kernel Object Hooking Rootkits (KOH Rootkits)
http://www.rootkit.com/newsread.php?newsid=501
    
[20] Windows Heap Overflows - David Litchfield 
http://www.blackhat.com/presentations/win-usa-04/bh-win-04-litchfield/
bh-win-04-litchfield.ppt

[21] Bypassing Klister 0.4 With No Hooks or Running a Controlled 
     Thread Scheduler by 90210 - 29A
http://vx.netlux.org/29a/magazines/29a-8.rar

[22] Microsoft, Debugging Tools for Windows
http://www.microsoft.com/whdc/devtools/debugging/default.mspx

[23] Kad, Phrack 59, Handling Interrupt Descriptor Table for fun and profit
http://phrack.org/issues.html?issue=59&id=4#article

[24] Wikipedia, Southbridge
http://en.wikipedia.org/wiki/Southbridge_(computing)

[25] Wikipedia, Northbridge
http://en.wikipedia.org/wiki/Northbridge_%28computing%29

[26] The NT Insider, Stop Interrupting Me -- Of PICs and APICs  
http://www.osronline.com/article.cfm?article=211 (login required)

[27] Russinovich, Solomon, Microsoft Windows Internals, Fourth Edition
Chapter 3. System Mechanisms -> Trap Dispatching

[28] MSDN, KeyboardClassServiceCallback
http://msdn2.microsoft.com/en-us/library/ms793303.aspx

[29] Clandestiny, Klog
http://www.rootkit.com/vault/Clandestiny/Klog%201.0.zip

[30] Alexander Tereshkin, Rootkits: Attacking Personal Firewalls
www.blackhat.com/presentations/bh-usa-06/BH-US-06-Tereshkin.pdf 

[31] MSDN, NdisMIndicateReceivePacket
http://msdn2.microsoft.com/en-us/library/aa448038.aspx

[32] Subverting VistaTM Kernel For Fun And Profit by Joanna Rutkowska
http://invisiblethings.org/papers/
joanna%20rutkowska%20-%20subverting%20vista%20kernel.ppt

[33] Vista RC2 vs. pagefile attack by Joanna Rutkowska
http://theinvisiblethings.blogspot.com/2006/10/
vista-rc2-vs-pagefile-attack-and-some.html

[34] Russinovich, Solomon, Microsoft Windows Internals, Fourth Edition
Chapter 7. Memory Management -> System Memory Pools

[35] KeUserModCallback ref - "Ring0 under WinNT/2k/XP" by Ratter - 29A
http://www.illmob.org/files/text/29a7/Articles/29A-7.003 

[36] Joanna Rutkowska - Towards Verifiable Operating Systems
http://theinvisiblethings.blogspot.com/2007/01/
towards-verifiable-operating-systems.htm