Wednesday, May 31, 2006

Support for Linux Kernel Modules

Boomerang can now load Linux kernel modules and find the init_module and cleanup_module entrypoints. Loading ELF sections is a tricky business. I now honour the alignment information, which means kernel modules will load into Boomerang with the same section placement as they are loaded into IDA Pro. There were also some problems with the way we handle R_386_PC32 relocations; checking the type of the symbol indicated by the relocation solves them. I also managed to speed it up significantly by removing an unnecessary check. Hopefully it really is unnecessary.

My globals-at-decode-time code is now checked in. I await howls of disapproval on the mailing list, but hey, I do so every time I check in.

Tuesday, May 30, 2006

More GUI Work and Relocations

Today I got a lot of work done on the GUI. I can now edit signature files in a tab at decode time, and the corresponding signatures and parameters are shown in the Library Procedures table. For the rare times when a decompilation actually makes it to code generation without something going wrong, I can now open the output file and edit it in a tab. I even gave my main window a title.

On the topic of relocations and symbol information: I can now load a Linux .o file and get absolutely no addresses in my RTL. This is because I take a relocation at memory location x as an absolute guarantee that memory location x contains an address. I look up the address in the symbol map and replace the constant that would ordinarily be produced with an a[global] expression. One surprise I had on my test binary was that string constants are not assigned a symbol. I expected at least a "no name" symbol. As such, I speculatively test the memory at the given address and, if a suitable string constant is found, I replace the address constant with the string constant.

When you consider that I'm doing all this at decode time, it leaves the decompile stage to focus on the hard problems of recognising parameters, etc. I've yet to decide if an STT_FUNC symbol implies that the procedure conforms to a known ABI. This is clearly true for some binaries but may not be true for all. Perhaps a user specified option is the way to go there. Then I could recognise parameters at decode time.

Another interesting source of information is the STT_FILE symbols. These are definitely of interest as they tell us how the output of the decompiler should be clustered. "Clustering" is a term used for any grouping of code or data; one could say that OO programming is a clustering philosophy. Boomerang has supported emitting procedures to a tree of files for some time, although all globals currently go into the "main" output file (typically named program.c). Of course, this will be of more interest to me when I actually have output worthy of clustering.

Monday, May 29, 2006

Woe of the Instruction Decoder

Boomerang uses the NJMC toolkit to decode instructions. These are the files in frontend/machine/pentium (or sparc or ppc or whatever you're interested in). We chose to use this technology because it meant we didn't have to write code to implement a new architecture, we could just write a "specification" file. Unfortunately, the NJMC toolkit is slowly rotting. It is hard to build. I've never built it. Mike has built it a couple of times (and failed a lot more times). Every architecture is different and no-one maintains it. We also have some issues with the code it generates. It produces huge cpp files which struggle to compile on some build targets and make the resulting binary much bigger than it could be.

So how much work is it to replace? I considered writing a new tool that would take the same input files as the NJMC toolkit and generate the same output, but that only solves half the problems. Then I came to wonder: what's wrong with just using a simple disassembler and modifying it to return a DecodeResult? There's even some BSD code available for a number of architectures that we could base it on. We could modify PentiumFrontend::decodeInstruction to call both the NJMC toolkit generated code and the new code. If the results differ, we output a warning to the LOG file and use the result from the original code. Once all the warnings have been fixed we can retire the original code.

In the meantime, Mike and I have managed to put together a binary package for x86 Linux users that they can use instead of going through the pain of trying to build the toolkit.

Sunday, May 28, 2006

More relocation mayhem

I couldn't get to sleep last night, as something about relocations was nagging at me. Finally, around 2am, it hit me. I got up and sent a long email to Mike. The problem is, we've been thinking about relocations way too literally. The ElfBinaryFile loader treats relocations as the loader of an operating system would treat them, as numbers to be added to addresses in the image. But a decompiler wants a lot more information than this. The relocations tell us something that is gold and we just ignore it. For example, suppose you have a relocation like this:

00000009 R_386_32 myarray

To a traditional loader it is saying: go look up the symbol myarray and calculate its address in memory, then go to offset 9 in the .text segment and add that address to whatever is there. But to a decompiler, what it is telling us is that we should add a global reference to myarray to the expression we generate for the address at offset 9. So say the instruction that included offset 9 was decoded as something like this:

r24 := m[r27 * 4 + 12]

then we should be processing this relocation to change the expression to something like this:

r24 := m[a[myarray] + r27 * 4 + 12]

which the decompiler can recognise as an array reference and finally produce code like:

local1 := myarray[local2 + 3]

What's funny is that Boomerang does a lot of work to figure out what a constant refers to and produce just these kinds of expressions. If we were to take better note of what the relocation information is telling us, we wouldn't need to do all this work.

But now the hard part. Boomerang is not designed to perform this kind of processing of relocations. The BinaryFile classes are completely separate from the NJMCDecoder classes. When we want to decode an instruction we simply pass in the native address we want to decode and the delta between the native address and the host address. Even if we could modify the decoder to add relocation information to the expressions it generates (and I'm having some difficulty seeing how we can, as dis_Mem appears to hide the addresses of immediate values) we'd have to redesign this to pass in a BinaryFile object. I see a lot of work in my future.

Saturday, May 27, 2006

Overlapping registers

The x86 instruction set is an ugly mess. Often with a desire to make things more flexible, people make things harder to understand. In the case of instruction sets, this makes a decompiler's job more difficult. Consider the following x86 asm code:

mov eax, 302
mov al, 5
mov ebx, eax

What value is in ebx? It makes it easier if we write 302 as 12Eh. Then we can easily say that ebx contains 105h, that is, 261. In Boomerang, the decoder would turn those three instructions into this RTL:

*32* r24 := 302
*8* r8 := 5
*32* r27 := r24

This is clearly wrong, as the information that r8 overlaps with the bottom 8 bits of r24 is completely absent. This is more correct:

*32* r24 := 302
*16* r0 := truncu(32, 16, r24)
*8* r12 := r24@15:8
*8* r8 := truncu(32, 8, r24)
*8* r8 := 5
*32* r24 := r24@31:8 | zfill(8, 32, r8)
*16* r0 := truncu(32, 16, r24)
*32* r27 := r24

But just look at the explosion in the number of statements. I haven't even included statements to define bx, bh, and bl, which should go after the assignment to ebx. Boomerang currently contains code to add these statements, but because of the explosion of statements it is typically disabled. The users of Boomerang would rather have wrong RTL than unreadable RTL.

Can we improve on this? I am currently writing some code that will search the procedure for any use of the 16 bit and 8 bit overlapping registers. If it finds no use of a particular register it will not output statements to define that register. So, for the code above, the RTL would only be:

*32* r24 := 302
*8* r8 := 5
*32* r24 := r24@31:8 | zfill(8, 32, r8)
*32* r27 := r24

This is readable at decode time, and very early in the decompilation stage it reduces to:

*32* r24 := 302
*8* r8 := 5
*32* r24 := 261
*32* r27 := 261

Dead code elimination will then remove the first statement. The second statement will go away when unused statements are removed because inevitably r24 will become an output of the procedure and the decompiler will not add r8 as an output as r24 already covers it.

Better Support for Relocatable ELF Files

Looking at how the Boomerang ELF loader handles symbols and relocations, I noticed that something was clearly wrong for relocatable files (i.e., .o files). The loader was assuming that the VirtualAddress members of the section table were set as they are in executable images. This is not the case. It is the duty of the loader to choose an arbitrary starting address and to load each section at appropriate offsets from that address. I decided that choosing the same default address that IDA Pro uses was probably a good idea. I often switch between Boomerang and IDA Pro to gather information, especially information that Boomerang has gotten wrong. I also decided to delay loading any section that starts with ".rel." until all the other sections are loaded because IDA Pro does so. I don't know why it does it, but I want my addresses to match up with those in IDA Pro, so I have to follow their lead.

After fixing this, I noticed that all the symbols and relocations now point to addresses that are not in any section. That is, all the symbols that point to the .text section now need to have the base address of the section added to their offset. Relocations point to symbols, so they too need to have the base address of the section added to their offset.

Some of this was already in the ELF loader, or at least some attempt had been made to add it.

Relocatable ELF files have no "entrypoints" as such, as they are not typically intended to be executed on their own. Typically the linker combines multiple .o files to produce an executable. So, at first, it would appear that attempting to decompile any .o file that doesn't have the main symbol in it, and where no specific entrypoints have been supplied by the user, would be a bit questionable. However, two examples spring to mind where the decompiler should be able to find an entrypoint. The first is kernel modules, where the initialization and teardown functions are valid entrypoints. The second is X11 modules, which have similar entrypoints. Obviously in both these cases there are interesting targets for decompilation.

A(nother) GUI for Boomerang

Quite a while ago I attempted to write a GUI for Boomerang. In fact, I've done this a couple of times. The stalling point has always been: what good is a GUI? Decompilers are supposed to be automatic. You should be able to give a decompiler an executable and trust it to spit out a C file that meets some definition of program equivalence with that executable. So if the decompiler works perfectly, what is there for the user to do? Surely anything they can offer will be more productively applied to the output, and for that they can just use standard source code manipulation tools. Well, there are two problems with this line of thinking. First, there's the sad fact that no decompiler is perfect. In fact, the state of the art is far, far from adequate, let alone perfect. Secondly, standard source code manipulation tools are woefully underpowered for even the simplest tasks of traditional reverse engineering (where traditional means "starting from source code"). For example, renaming a function or a variable is still a completely manual process for most C programmers. Java programmers might have a few more fancy tools - what they call "refactoring" tools - but for C programmers it's basically a text editor and sometimes something like Cscope.

So there certainly is a need for a GUI for a decompiler. I started by documenting the workflow of Boomerang. I figured we need to present this workflow as a key paradigm to the user, so best to get it right up front. I identified 5 key steps:
  1. Initialize
  2. Load
  3. Decode
  4. Decompile
  5. Generate Code
The sixth step, make the code maintainable, is more freeform than the rest, so I've not presented it explicitly in the workflow tab of the GUI. At each stage, widgets are shown which present information to the user so they can monitor the progress of the decompiler. The user is also given the opportunity to enter or update the information that will be used in the next step.

So far, the user can see and edit the entrypoints of the program and where each section of the program is loaded into memory before proceeding to the decode step. The user is then presented with the procedures of the program as they are decoded, and any library procedures that statements have been found to call. I hope to allow the user to specify information on unknown library procedures, and to remove any unwanted procedures, before proceeding to decompilation.

At decompilation time, the user is shown a truncated call graph built as the decompilation progresses. Procedures that have been decompiled or are currently under decompilation are shown in blue type, whereas procedures that have been delayed awaiting the completion of the decompilation of their children (or as a result of recursion) are left in black type. The procedure currently under work is made the current selection of the tree and all parent nodes are expanded. Double clicking on any procedure in the tree will open a new tab containing the RTL of that procedure. I hope to allow the user to manipulate the RTL before proceeding to the next step.

The code generation stage presents the user with a tree of clusters (directories and files) to which the code of each procedure will be generated. The user can add new clusters and drag procedures from cluster to cluster. Double clicking on a file will open a new tab (or an existing one if already opened) with the contents of the file. Double clicking on a procedure will do the same but scroll the contents such that the selected procedure is at the top of the window. At the moment, the only operations available for manipulating the contents of the file are the usual text editor commands. However, I hope to offer a range of commands, such as renaming procedures and variables, that make improving the quality of the output easier.

An option presented at the initialization stage (and in a menu option) offers the user an opportunity to monitor the decompilation step more closely. After enabling "decompilation debugging", a Step button is added to the status bar of the GUI. Upon completion of the decode step the user can select which procedures' decompilation they are interested in monitoring, or they can select the default, all procedures. Proceeding to the decompilation stage, the progress will now stop, the status bar will read "decompiling proc1: before initialise", and a new tab containing the RTL of proc1 will be opened. When the user has completed their inspection of the RTL of proc1 they can press Step and the decompilation will continue. For each procedure selected in the decode stage, around eight messages will be displayed in the status bar, and each time the user presses Step the RTL of the procedure will be updated. This allows the user to spot bugs in the decompiler. At the moment, I'm using this to fix the bugs, but in the future, users may prefer to just fix the RTL.

I'm writing the GUI in C++ on Windows using the Qt4 toolkit. I will ensure it compiles and works on Linux before I check it into the Boomerang CVS repository. Shortly after I will be making binary packages for both Windows and Linux.

Welcome to my blog

Well, I've finally done it. I've succumbed to the popular movement of spewing thoughts into the void. I've started a blog. My resistance to blogging has always been twofold. First, I've felt that bloggers are innately vain people who believe their random thoughts are somehow more interesting than the random thoughts of others. Second, I've felt that content management software has nothing to offer over a plain HTML web page, especially seeing as you must type into a web form instead of using a proper text editor. But alas, I've come to feel that some of my activities of late may actually be of interest to other people, and besides, I've not felt I was solely indulging the author's ego whilst reading the blogs of others. As for content management systems, I guess it doesn't matter how you get the info out there, so long as it is remotely interesting.

Speaking of interesting, what is this blog about? It's about the kinds of things I find interesting. I'm an applications programmer, I get paid to do this by a major corporation (I'd rather not say which one) and I also write code for fun. Some of that code I write for fun is particularly interesting. In particular, I work on the Boomerang decompiler project, and some of my recent activities revolve around making it easier to use and applying it to objects of interest. I also have an interest in MMORPGs (programming them, not really playing them) and other facets of computer graphics. I am a supporter of Free Software and have contributed to a number of projects, which I may also talk about.

So here we go.