PageBuster: stealthily dump all the code ever executed
Ever wanted to dump all the executable pages of a process? Do you crave something capable of dealing with packed processes?
We've got you covered! May I introduce PageBuster, our tool to gather dumps of all executable pages of packed Linux processes. Keep reading to find out its details and what happens under the hood!
First things first: the code on GitHub and the demo.
Introduction
There are plenty of scenarios in which the ability to dump executable pages is highly desirable. Of course, there are many methods, some of which are de facto standards, but it is not always as easy as it seems.
For example, think about the case of packed malware samples. Run-time packers are often used by malware-writers to obfuscate their code and hinder static analysis. Packers can be of growing complexity, and, in many cases, a precise moment in time when the entire original code is completely unpacked in memory doesn't even exist.
Therefore, the goals of PageBuster are:
- to dump all the executable pages, without assuming there is a moment in time where the program is fully unpacked;
- to do this in a stealthy way (no VM, no ptrace).
In particular, given the widespread use of packers and their variety, our objective is to have a single all-encompassing solution, as opposed to packer-specific ones.
Ultimately, PageBuster fits in the context of the rev.ng decompiler. Specifically, it is related to what we call MetaAddress. Among other things, a MetaAddress enables you to represent an absolute value of an address together with a timestamp (epoch), so that it can be used to track how a memory location changes during the execution of a program. Frequently, you can have different code at different moments at the same address during program execution. PageBuster was designed around this simple yet effective data structure.
Packers... what?
The vast majority of malicious samples nowadays use packers to conceal their data and code sections, which are then restored at run-time by a dedicated unpacking routine. Most existing solutions rely on different heuristics to detect the end of the unpacking routine, and therefore the correct moment to dump the memory of the process.
In particular, those "dumping" solutions often assume that:
- there is a moment in time in which the entire original code is wholly unpacked in memory;
- if a sample contains multiple layers of packing, these are unpacked in sequence and the original application code is the one decoded in the last layer;
- the execution of the packer and the original application are not mangled together (i.e., there is a precise point in time in which the packer transfers the control to the original entry point).
According to Ugarte-Pedrero's well-known classification, there exist many kinds of packers, depending on their different behaviour.
As stated earlier, our main goal is to build a solution capable of handling different flavours of packer. Our starting point is the best known "Type I" packer, UPX. In the following you can find a short summary of the various types of packers:
- Type I packers represent the simplest case, in which a single unpacking routine is executed before transferring the control to the unpacked program. UPX is a notable example of this class.
- Type II packers contain multiple unpacking layers, each one executed sequentially to unpack the following routine. Once the original code has been reconstructed, the last transition transfers control back to it.
- Type III are similar to the previous ones, but here the original code may not necessarily be located in the deepest layer. However, a tail transition still exists to separate the packer and the application code.
- Type IV and Type V are multi-layer packers that interleave the unpacking code with the original program. In Type IV there is a hard-to-find moment where the original code is completely unpacked in memory, while Type V ones reveal it one frame at a time.
- Type VI packers are the most complex. They are packers in which only a single fragment of the original program (as little as a single instruction) is unpacked at any given moment in time.
One of the strengths of our method is that it is independent of the complexity of the packer, be it Type I or Type VI.
Intuition
OK, but... how do you do it?
Recalling our main objective, we want to:
- dump all executable pages;
- associate them with an epoch (a timestamp);
- for each executable page, whenever it is modified and the execution jumps into it, we want a separate dump.
In order to do this, we want to prevent the target process from holding pages with both write and execute permission at the same time.
Intuitively, even if only a piece at a time, the actual code of a packed process will eventually reside in main memory. First, (part of) the code has to be written somewhere, and later on the execution has to jump into it. Even if the program never actually jumps into some parts of the unpacked code, those pages will at least be marked as executable.
So, WX pages are the ones that are most interesting to us.
But, as said, we also want to be transparent with respect to the process. Our mechanism employs kernel syscall hijacking to force accesses to such pages, which normally would succeed, to trigger page faults, which in turn, we handle transparently. We handle those page faults in kernel space, without sending a signal to the userspace program being unpacked. Exactly as in a regular run of the program.
At a high level, the system works as follows. We hook the mmap/mprotect syscalls in order to monitor memory allocations and to prevent pages from having both write and execution permissions at once. So, when a WX permission is requested on a page, we tamper with the syscall behaviour by removing the write permission.
Therefore, any subsequent write access to those memory locations triggers a page fault (SIGSEGV - invalid write attempt). We can then intercept memory access faults and study their reason.
When a segmentation fault occurs we first determine if the faulting address is part of any of the monitored regions. In other words, we establish if the fault was caused by our manipulation of page permissions. If not, we immediately defer the fault handling to the original page fault handler. But, if the fault (invalid write attempt) originates from a page we are monitoring, we give back to the process write permission, remove the execution permission, and then resume the execution of the faulting code. The program will now be able to access the page successfully.
At this point, the process "believes" it has both write and execution permission, while it only has write permission.
You might be wondering why we remove the x permission from the faulting page: it's because we want to be able to catch the program execution when it reaches that page. In fact, any jump into -x memory triggers a SIGSEGV (SIGSEGV - bad jump) that we are now able to catch.
The way we handle these "bad jump" faults is similar to how we handle invalid write attempts: if we are sure they are induced by us, we give back to the process the execution permission it needs and remove the write permission flag.
However, before resuming the execution of the faulting code, we dump the content of the pages.
By construction, we are sure that we dump all the executable pages associated with the target process, and, if they have changed during program execution, we dump a copy of each page as it was before being executed.
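The permission flip-flop described above can be modelled as a tiny state machine. What follows is a minimal userspace sketch under stated assumptions, not PageBuster's actual code: the struct tracked_page type and the on_request, on_write_fault and on_exec_fault names are hypothetical, and the dump is reduced to a counter.

```c
#include <assert.h>
#include <sys/mman.h>

/* Hypothetical per-page record: what the process asked for vs. what it really has. */
struct tracked_page {
    int claimed;    /* permissions the process believes it holds */
    int effective;  /* permissions actually installed on the page */
    int dumps;      /* stand-in for the dumps taken of this page */
};

/* Invariant of the whole scheme: W and X are never granted together. */
int is_consistent(const struct tracked_page *p)
{
    return !((p->effective & PROT_WRITE) && (p->effective & PROT_EXEC));
}

/* mmap/mprotect hook: remember the request, strip W from W+X. */
void on_request(struct tracked_page *p, int prot)
{
    p->claimed = prot;
    p->effective = prot;
    if ((prot & PROT_WRITE) && (prot & PROT_EXEC))
        p->effective &= ~PROT_WRITE;
    assert(is_consistent(p));
}

/* Invalid write attempt: give W back, take X away. */
void on_write_fault(struct tracked_page *p)
{
    p->effective = p->claimed & ~PROT_EXEC;
    assert(is_consistent(p));
}

/* Bad jump: dump the page first, then give X back and take W away. */
void on_exec_fault(struct tracked_page *p)
{
    p->dumps++;  /* a real implementation would save the page content here */
    p->effective = p->claimed & ~PROT_WRITE;
    assert(is_consistent(p));
}
```

Every write-then-execute cycle on a page goes through on_write_fault followed by on_exec_fault, so each modified version of the page is dumped once before it runs.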
We produced two proof-of-concept implementations of PageBuster: a userspace-only prototype, and a more robust, kernel-side prototype.
Let's start from the former.
Ordinary userspace world
Rather than immediately getting our hands dirty with the kernel-side implementation, we started with a user-space prototype.
You can take a look at the code on GitHub.
First of all, we need to hook/hijack mmap/mprotect. In order to do this, we leveraged the LD_PRELOAD environment variable. It's a simple way to hook library calls in a program. If you are not familiar with it, check out Rafał Cieślak's blog post on this topic.
The interesting thing is that the libraries listed in that variable have the highest priority. If you set LD_PRELOAD to the path of a shared object, that file will be loaded before any other library (including the C runtime, libc.so). For instance, if you want to run ls with a fancy custom malloc() implementation, you basically launch ls with LD_PRELOAD pointing at a malloc.so file, in which you put a custom implementation of the malloc() function, with the same symbol name.
So, we exploited this fact and wrote a custom library to override what we need.
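As an illustration, a preloaded library could shadow mprotect roughly as follows. This is a sketch under assumptions rather than the prototype's code: the two bookkeeping variables are hypothetical, only mprotect is covered, and the real system call is reached through syscall(SYS_mprotect, ...), whereas a library may just as well look up the original function with dlsym(RTLD_NEXT, ...).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical bookkeeping: the last W+X request we tampered with. */
static void *last_tracked_addr;
static int last_claimed_prot;

/* Same symbol name as the libc function, so a preloaded copy shadows it. */
int mprotect(void *addr, size_t len, int prot)
{
    int effective = prot;

    if ((prot & PROT_WRITE) && (prot & PROT_EXEC)) {
        last_tracked_addr = addr;   /* remember what the process asked for */
        last_claimed_prot = prot;
        effective &= ~PROT_WRITE;   /* never grant W and X together */
    }
    assert(!((effective & PROT_WRITE) && (effective & PROT_EXEC)));

    /* Issue the real system call directly, bypassing the libc wrapper. */
    return (int)syscall(SYS_mprotect, addr, len, effective);
}
```

Built as a shared object (for instance with gcc -shared -fPIC) and listed in LD_PRELOAD, this definition takes precedence over the libc one for the target program.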
Now that we have the ingredients to track the memory pages of a process, here's what we need in our LD_PRELOAD library:
- A struct for each page we monitor, with the related permissions the process believes it has on it.
- A way to install a custom signal handler for the SIGSEGVs induced by us. We do this by overriding the _init() function and setting the handler there.
- The actual signal handler, with:
  - A way to detect if a SIGSEGV is induced by us.
  - A way to recognize if it is an invalid write attempt or a bad jump. In the first case, we give back the original permissions the process believes it has, except for the execution ones. In the second case, we give back the permissions and remove the write one (if the process writes new executable code on that page we must catch it), but before resuming the execution, we dump the page content. In our case, being a prototype, we focused only on Linux x86-64, so the value in the REG_ERR register and how it is handled are architecture-specific.
- A custom mmap and mprotect implementation to use together with LD_PRELOAD to track the pages we care about (perform the permission reductions and collect a new entry for each involved page).
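To make the handler idea concrete, here is a small self-contained demo of the "fix the permissions and let the instruction retry" trick, limited to write faults. It is a sketch under assumptions: the names are hypothetical, a single region is monitored, and the handler decides only on the faulting address (si_addr), without reading REG_ERR. Calling mprotect from a signal handler works on Linux because it is a plain system call, although POSIX does not list it as async-signal-safe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical state: the single region we monitor in this demo. */
static volatile uintptr_t monitored_start;
static volatile size_t monitored_len;
static volatile sig_atomic_t induced_faults;

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;

    if (addr < monitored_start || addr >= monitored_start + monitored_len) {
        /* Not induced by us: restore the default action and let it crash. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    induced_faults++;
    /* Give the write permission back; the faulting store is then retried. */
    mprotect((void *)monitored_start, monitored_len, PROT_READ | PROT_WRITE);
}

/* Writes one byte to a read-only page and returns the byte read back. */
int write_through_fault(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGSEGV, &sa, NULL);
    assert(rc == 0);

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    volatile unsigned char *p = mmap(NULL, (size_t)page, PROT_READ,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);
    monitored_start = (uintptr_t)p;
    monitored_len = (size_t)page;

    p[0] = 0x90;  /* faults once; the handler restores W and the store reruns */
    return p[0];
}
```

From the program's point of view nothing special happened: the store simply succeeded, which is the transparency property the prototype relies on.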
Welcome to the kernel world
Once we were satisfied with the exploration we did in user-space, we moved to the kernel land.
Talk is cheap, so check out the code on GitHub.
Why do we need a kernel implementation? Of course, there are plenty of reasons, among which transparency and resistance against anti-debugging techniques. But the main one is that we want to catch not only the syscalls issued by the target process, but also (and most importantly) those that concern it. For instance, we want to catch the mmap/mprotect syscalls which allocate memory pages for the code of the process itself. Indeed, those calls arise from the kernel and from the loader, not from the target process.
This is crucial because several packers "delegate" to the kernel/loader the allocation of the memory region where the unpacking routine will unpack the executable code. We need those pages! And we need to be able to track those syscalls!
So, after this little userspace warmup, let's now move on to the real PageBuster: the robust and stealthy implementation employing a simple kernel module.
We designed a dynamic analysis framework that works as a controlled environment for processes. It allows us to transparently execute packed processes and gather a dump of all their executable pages.
During the implementation and testing of PageBuster, we heavily exploited Ciro Santilli's emulator:
The perfect emulation setup to study and develop the Linux kernel v5.9.2, kernel modules, QEMU, gem5 and x86_64, ARMv7 and ARMv8 userland and bare-metal assembly, ANSI C, C++ and POSIX. [...] Highly automated. Thoroughly documented. Automated tests. [...]
Among its main advantages, this solution provides very fast and comfortable development/testing/debugging tools and, last but not least, it helps avoid memory corruption on our machine when operating with the kernel source code.
Of course, this is just for development purposes. The real product will be insmod-able on any kernel.
PageBuster core components
All the components live inside a single out-of-tree LKM (Loadable Kernel Module). This means that:
- it does not require a custom kernel to run;
- you can load/unload it according to needs.
Target Selection
The system needs to know which processes to focus on, i.e. which processes are our target. We chose the following strategy:
- we pass the name of the target process to the LKM as an argument;
- when we hook an mmap/mprotect, we add the pages to the set of pages we are tracking only if the issuing process matches the target one.
How can we obtain info about the "issuing" process? Although kernel modules don't execute sequentially as applications do, most actions performed by the kernel are related to a specific process. Kernel code can know the current process driving it by accessing the global item current, a pointer to struct task_struct, which as of version 5.9.2 of the kernel is defined in the asm-generic/current.h header. The current pointer refers to the user process currently executing. During the execution of a system call, such as mmap or mprotect, the current process is the one that invoked the syscall. Kernel code can use process-specific information by using current, if it needs to do so. And so do we!
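One detail worth illustrating is how a name comparison behaves if it is done against the task's comm field, which the kernel truncates to TASK_COMM_LEN (16) bytes including the terminator. The function below is a userspace sketch; whether PageBuster matches on comm is an assumption here, and is_target and COMM_LEN are hypothetical names. In a module, the first argument would be current->comm.

```c
#include <assert.h>
#include <string.h>

#define COMM_LEN 16  /* mirrors the kernel's TASK_COMM_LEN: 15 characters plus NUL */

/* Returns 1 when the (possibly truncated) task name matches the target name. */
int is_target(const char *comm, const char *target_name)
{
    assert(comm != NULL && target_name != NULL);
    /* Only the first 15 characters can ever be present in comm. */
    return strncmp(comm, target_name, COMM_LEN - 1) == 0;
}
```

Comparing at most 15 characters means that a long target name still matches its truncated form, at the price of also matching any other process sharing the same 15-character prefix.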
Hook Module
Another important component of PageBuster is the syscall hooking system. We leveraged ftrace for this purpose.
The ftrace infrastructure enables us to register hooks at the beginning of specific functions. This allows us to hijack function calls as we've done in the userspace implementation, but in a more robust and elusive way. To register a function callback, an ftrace_ops is required. This structure is used to tell ftrace the information it needs to do its magic: the function we want to intercept and a pointer to our callback.
To enable tracing calls we use register_ftrace_function(&ops), and to disable them unregister_ftrace_function(&ops).
As a starting point for the kernel-mode PageBuster, we looked at ilammy/ftrace-hook and customized it for our purposes. At this point, we hijack mmap, mprotect and force_sig_fault. The latter is needed for the custom page fault handler presented later on. Moreover, the module is completely parametric (additions/removals of hooks are very fast) and recursion-free (we want to avoid our callbacks triggering again when we call the original mmap and mprotect syscalls, which would cause an infinite loop).
The flags we use for installing hooks are:
- FTRACE_OPS_FL_SAVE_REGS: if the callback requires reading or modifying the pt_regs passed to the callback, then it must set this flag;
- FTRACE_OPS_FL_RECURSION_SAFE: by default, a wrapper is added around the callback to make sure that recursion of the function does not occur. That is, if a function that is called as a result of the callback's execution is also traced, ftrace will prevent the callback from being called again. Setting this flag tells ftrace that the callback has its own recursion protection, so the wrapper is not added. (It is OK if another callback traces a function that is called by a callback that is marked recursion safe.)
- FTRACE_OPS_FL_IPMODIFY: if the callback is to "hijack" the traced function (have another function called instead of the traced function), it requires setting this flag.
More info can be found here.
Metadata Storage
When we hijack syscalls and modify the flags as explained above, we need to keep track of the real permission flags that the target process believes it has on some memory areas. Since we are working at page granularity, we need to store an entry for each memory page, with its associated permissions. For this purpose, we exploited the kernel's intrusive linked lists.
To sum up, a single entry in the list stores the address of a memory page and the protections the process believes it has on it.
That list is populated by the hooking system: the mmap/mprotect hooks take care of adding an entry to the structure for each page.
On the other hand, when a page fault occurs, our custom handler queries the linked list in order to understand whether it was induced by PageBuster or not.
Finally, when the userspace program implicitly or explicitly sets a monitored page to -x, we delete the corresponding entry and free it.
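The bookkeeping can be pictured with a plain userspace linked list. This is only a sketch: the module uses the kernel's intrusive lists, whereas the code below uses a hand-rolled singly linked list, assumes 4 KiB pages, and all names (page_entry, track_range, find_page, untrack_page) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SZ 4096u  /* assumed page size for this sketch */

struct page_entry {
    uintptr_t addr;           /* page-aligned start address */
    int claimed_prot;         /* permissions the process believes it has */
    struct page_entry *next;
};

static struct page_entry *tracked;  /* head of the list */

/* Find the entry covering an arbitrary (e.g. faulting) address, or NULL. */
struct page_entry *find_page(uintptr_t addr)
{
    uintptr_t page = addr & ~(uintptr_t)(PAGE_SZ - 1);
    for (struct page_entry *e = tracked; e != NULL; e = e->next)
        if (e->addr == page)
            return e;
    return NULL;
}

/* Record one entry per page in [start, start + len).
 * Returns the number of pages covered, or -1 if allocation fails. */
int track_range(uintptr_t start, size_t len, int prot)
{
    assert(start % PAGE_SZ == 0);
    int n = 0;
    for (uintptr_t a = start; a < start + len; a += PAGE_SZ) {
        struct page_entry *e = find_page(a);
        if (e == NULL) {
            e = malloc(sizeof *e);
            if (e == NULL)
                return -1;
            e->addr = a;
            e->next = tracked;
            tracked = e;
        }
        e->claimed_prot = prot;  /* a repeated request updates the entry */
        n++;
    }
    return n;
}

/* Drop the entry for a page that is no longer executable. */
void untrack_page(uintptr_t addr)
{
    uintptr_t page = addr & ~(uintptr_t)(PAGE_SZ - 1);
    for (struct page_entry **pp = &tracked; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->addr == page) {
            struct page_entry *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
    }
}
```

A fault handler would call find_page with the faulting address: a NULL result means the fault was not induced by the permission tampering and must be forwarded untouched.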
Page fault handling mechanism
The page fault handler is the component that is reached when the process tries to write to or jump into memory locations whose permissions we have restricted. Our approach consists in fixing those protections on the fly, before returning to user space and making the process re-execute the faulting instruction. The following example shows what happens before we return to userspace and re-execute the faulting instruction:
Let's suppose, for example, that the process wants to jump into a page (0x8090700) in which it wrote some code earlier. However, as explained above, we removed the x permission in our handler. So, the jmp instruction will trigger a page fault (in this case a bad jump). After fixing the permissions and saving the page content, we give back the execution permission so that the execution flow automagically resumes exactly from the jmp instruction.
OK, but how do we do that?
To understand how this is all possible we have to take a look at the Linux kernel. In particular, we want to see what happens when a segmentation fault is triggered:
early_pf_idts[]
idtentry page_fault
has_error_code
do_page_fault()
__do_page_fault()
do_user_addr_fault()
bad_area_nosemaphore()
__bad_area_nosemaphore()
force_sig_fault(SIGSEGV, si_code, (void __user *)address) (SIGSEGV set here!)
force_sig_fault()
force_sig_fault_to_task()
force_sig_info_to_task()
send_signal()
__send_signal()
We discovered that we can break the call chain immediately before force_sig_fault by simply returning. In particular, that causes the faulting instruction to be executed again.
We can now add a call to an mprotect-ish function in order to fix the permissions for the page where the target process faulted and simply return. After this, the control is given back to the userland process. As mentioned, once we return to the target process the faulting instruction will be executed once again, this time successfully.
This is simply done by hooking force_sig_fault and customizing it accordingly. We first check if the fault is induced by us. If it is not, we dispatch to the real force_sig_fault. Otherwise, we proceed with the custom handling.
As a final step, we need to distinguish between a bad jump and an invalid write attempt. This enables us to know when to perform the dump.
To do that, we rely on the REG_ERR register, whose value can be accessed through the error_code variable. In particular, we are interested in two flags: X86_PF_WRITE and X86_PF_INSTR. This last part is architecture-specific, but can easily be adapted to different platforms.
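The distinction boils down to testing two bits of the x86 page-fault error code. The sketch below redefines them locally with the same values as the kernel's X86_PF_WRITE and X86_PF_INSTR so that it compiles in userspace; the enum and the function name are hypothetical.

```c
#include <assert.h>

/* x86 page-fault error code bits, same values as the kernel's X86_PF_* flags. */
#define PF_WRITE (1UL << 1)  /* the faulting access was a write */
#define PF_INSTR (1UL << 4)  /* the faulting access was an instruction fetch */

enum fault_kind { FAULT_OTHER, FAULT_INVALID_WRITE, FAULT_BAD_JUMP };

enum fault_kind classify_fault(unsigned long error_code)
{
    /* A single access is never both a write and an instruction fetch. */
    assert(!((error_code & PF_WRITE) && (error_code & PF_INSTR)));

    if (error_code & PF_INSTR)
        return FAULT_BAD_JUMP;       /* dump the page, then restore X */
    if (error_code & PF_WRITE)
        return FAULT_INVALID_WRITE;  /* restore W, remove X */
    return FAULT_OTHER;              /* e.g. a read: not our business */
}
```

For example, a user-mode write to a present page yields error code 0x7 (present, write, user), while a user-mode instruction fetch from a present non-executable page yields 0x15.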
Memory Dump
We still need a mechanism to dump the content of the memory pages we are monitoring.
Despite knowing very well that this is a "Thing You Never Should Do in the Kernel", we actually need to dump the unpacked pages somewhere, in order to be able to analyze them later, even if the packer unmaps them at a certain point. Moreover, we need to access those userspace pages to dump their content to file. To do this we must deal with SMAP (Supervisor Mode Access Prevention), a processor feature that prevents the kernel from unintentionally accessing userspace memory. SMAP can be temporarily disabled for explicit memory accesses by setting the EFLAGS.AC (Alignment Check) flag. The stac (Set AC) and clac (Clear AC) instructions can be used to easily set or clear the flag.
Another important aspect, motivated by the forthcoming integration with rev.ng, is being able to place all the memory dumps on a timeline, so that each dumped page belongs to an epoch. We need to associate an epoch with each page we dump so that we can reconstruct the execution flow later. In that way, dumps of the same page performed at different instants remain distinguishable. This is trivially implemented using a global counter.
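A global counter is enough to turn a page address into an (address, epoch) pair. The following sketch shows one way to derive a distinct name for every dump; the naming scheme and the next_dump_name function are hypothetical and only meant to illustrate the idea.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static unsigned long epoch;  /* global counter, bumped on every dump */

/* Format a dump name as "<page address>_<epoch>" and advance the epoch.
 * Returns the epoch assigned to this dump. */
unsigned long next_dump_name(char *buf, size_t size, uintptr_t page_addr)
{
    unsigned long e = epoch++;
    int n = snprintf(buf, size, "%#llx_%lu", (unsigned long long)page_addr, e);
    assert(n > 0 && (size_t)n < size);  /* the name must fit in the buffer */
    return e;
}
```

Two dumps of the same page therefore never collide, and sorting the dumps by epoch gives back the order in which the code was about to be executed.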
Bringing it all together
Check out the demo above! It shows how we dump all executable pages of a process packed with UPX. Here we use a simple vanilla C99 Hello, World! program, compiled statically in order to reach the minimum size required by UPX to perform the compression, and finally packed.
UPX, on its own, features no anti-debug checks, no scrambled code/stolen bytes and no encryption. Despite that, within the malware scene, it is very often used as an "outer" layer. Malware writers like to make reversing harder, so they chain two or more packers to make the analyst's life miserable and make automated unpackers fail. This stops standard unpacking scripts and dumping systems from working.
This widespread use of UPX is one of the main reasons why we decided to use it as a baseline for PageBuster.
UPX works as follows: it first relocates the sections, renames them and then alters the entry point.
The new entry point runs a short unpacking routine, at the end of which the execution jumps to the original entry point. UPX1 contains the stub code, and this code unpacks the real program code that lies in UPX0.
Since PageBuster dumps all the executable pages, it can successfully reconstruct the history of the process.
Future work
Now that we have the main skeleton of PageBuster up and running, we have many plans for its future:
- Integrate PageBuster with rev.ng
- Port PageBuster to Windows (partially implemented)
We also want to:
- Test it with all types of packers
- Handle packers that unpack and run code in the same page
- Create a single ELF out of dumped pages
- Handle the case where the unpacking code and the original code do not run in the same process and establish inter-process communication