
Showing posts from May, 2006

Support for Linux Kernel Modules

Boomerang can now load Linux kernel modules and find the init_module and cleanup_module entry points. Loading ELF sections is a tricky business. I now honour the alignment information, which means kernel modules load into Boomerang with the same section placement as they do in IDA Pro. There were also some problems with the way we handle R_386_PC32 relocations; checking the type of the symbol indicated by the relocation solves them. I also managed to speed the loader up significantly by removing an unnecessary check. Hopefully it really is unnecessary. My globals-at-decode-time code is now checked in. I await howls of disapproval on the mailing list, but hey, I do so every time I check in.
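For reference, here is a minimal sketch of the kind of symbol-type check described above, applied to an R_386_PC32 relocation (*P = S + A - P) in a relocatable file. The function and parameter names are illustrative, not Boomerang's actual code; only the <elf.h> types are standard.

    #include <cstdint>
    #include <elf.h>   // Elf32_Sym, ELF32_ST_TYPE, STT_SECTION
    #include <vector>

    // Apply one R_386_PC32 relocation: *P = S + A - P.  What S means
    // depends on the type of the symbol the relocation names.
    void applyPc32(uint32_t* word, uint32_t destAddr, const Elf32_Sym& sym,
                   const std::vector<uint32_t>& sectionBases) {
        uint32_t addend = *word;  // REL format: addend lives in the patched word
        uint32_t s;
        if (ELF32_ST_TYPE(sym.st_info) == STT_SECTION) {
            // Section symbols carry no address of their own; the stored
            // addend is an offset into the section the symbol stands for.
            s = sectionBases[sym.st_shndx];
        } else {
            // Named function/object symbols resolve to their own address.
            s = sectionBases[sym.st_shndx] + sym.st_value;
        }
        *word = s + addend - destAddr;
    }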

More GUI Work and Relocations

Today I got a lot of work done on the GUI. I can now edit signature files in a tab at decode time, and the corresponding signatures and parameters are shown in the Library Procedures table. For the rare times when a decompilation actually makes it to code generation without something going wrong, I can now open the output file and edit it in a tab. I even gave my main window a title.

On the topic of relocations and symbol information: I can now load a Linux .o file and get absolutely no raw addresses in my RTL. This is because I take a relocation at memory location x as an absolute guarantee that memory location x contains an address. I look up the address in the symbol map and replace the constant that would ordinarily be produced with an a[global] expression. One surprise in my test binary was that string constants are not assigned a symbol; I expected at least a "no name" symbol. As such, I speculatively test the memory at the given address and, if a suitable string …
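A minimal sketch of that decode-time substitution, with hypothetical names (Exp, makeConstant, and the lookup tables are stand-ins, not Boomerang's real types):

    #include <cctype>
    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    struct Exp { std::string text; };  // stand-in for an RTL expression

    // If the word being decoded sits at a relocated address, it is
    // guaranteed to be an address: emit a[global] or a string literal
    // instead of a plain integer constant.
    Exp makeConstant(uint32_t atAddr, uint32_t value,
                     const std::set<uint32_t>& relocs,
                     const std::map<uint32_t, std::string>& symbols,
                     const uint8_t* image, uint32_t base, uint32_t size) {
        if (relocs.count(atAddr) == 0)
            return {std::to_string(value)};        // ordinary constant
        auto it = symbols.find(value);
        if (it != symbols.end())
            return {"a[" + it->second + "]"};      // address of a global
        // No symbol: speculatively probe for a printable, NUL-terminated
        // string at the target address (string constants often get none).
        std::string s;
        for (uint32_t p = value - base; p < size && image[p] != 0; ++p) {
            if (!std::isprint(image[p]))
                return {std::to_string(value)};    // not a string after all
            s += static_cast<char>(image[p]);
        }
        return {"\"" + s + "\""};
    }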

Woe of the Instruction Decoder

Boomerang uses the NJMC toolkit to decode instructions. These are the files in frontend/machine/pentium (or sparc or ppc or whatever you're interested in). We chose this technology because it meant we didn't have to write code to implement a new architecture; we could just write a "specification" file. Unfortunately, the NJMC toolkit is slowly rotting. It is hard to build: I've never built it, and Mike has built it only a couple of times (and failed a lot more times). Every architecture is different and no one maintains it. We also have some issues with the code it generates. It produces huge .cpp files which struggle to compile on some build targets and make the resulting binary much bigger than it needs to be. So how much work is it to replace? I considered writing a new tool that would take the same input files as the NJMC toolkit and generate the same output, but that only solves half the problems. Then I came to wonder: what's wrong with just using a …

More relocation mayhem

I couldn't get to sleep last night, as something about relocations was nagging at me. Finally, around 2am, it hit me. I got up and sent a long email to Mike. The problem is, we've been thinking about relocations way too literally. The ElfBinaryFile loader treats relocations as the loader of an operating system would: as numbers to be added to addresses in the image. But a decompiler wants a lot more information than this. The relocations tell us something that is gold, and we just ignore it. For example, suppose you have a relocation like this:

    00000009 R_386_32 myarray

To a traditional loader it is saying: go look up the symbol myarray and calculate its address in memory, then go to offset 9 in the .text segment and add that address to whatever is there. But to a decompiler, what it is telling us is that we should add a global reference to myarray to the expression we generate for the address at offset 9. So say the instruction that included offset …
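To make the example concrete (the instruction, offsets, and addend here are invented for illustration, and r24 is eax in Boomerang's pentium register numbering): suppose offset 8 of .text holds

    a1 04 00 00 00        mov eax, [myarray+4]

so the four bytes starting at offset 9 hold the addend 4. A traditional loader computes the address of myarray and adds it to the word at offset 9. A decompiler armed with the same relocation can instead decode the instruction directly to

    *32* r24 := m[a[myarray] + 4]

rather than producing a memory access at whatever raw number ends up at offset 9.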

Overlapping registers

The x86 instruction set is an ugly mess. Often, in a desire to make things more flexible, people make things harder to understand; in the case of instruction sets, this makes a decompiler's job more difficult. Consider the following x86 asm code:

    mov eax, 302
    mov al, 5
    mov ebx, eax

What value is in ebx? It is easier to see if we write 302 as 12Eh. Then we can easily say that ebx contains 105h, that is, 261. In Boomerang, the decoder would turn those three instructions into this RTL:

    *32* r24 := 302
    *8* r8 := 5
    *32* r27 := r24

This is clearly wrong, as the information that r8 overlaps the bottom 8 bits of r24 is completely absent. This is more correct:

    *32* r24 := 302
    *16* r0 := truncu(32, 16, r24)
    *8* r12 := r24@15:8
    *8* r8 := truncu(32, 8, r24)
    *8* r8 := 5
    *32* r24 := r24@31:8 | zfill(8, 32, r8)
    *16* r0 := truncu(32, 16, r24)
    *32* r27 := r24

But just look at the explosion in the number of statements. I haven't even included statements to define bx, bh, and bl, whi…
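The arithmetic above, as a runnable sanity check (plain C++ standing in for the RTL semantics; this is just an illustration of the aliasing, not decompiler code):

    #include <cassert>
    #include <cstdint>

    int main() {
        uint32_t eax = 302;                        // mov eax, 302  (12Eh)
        uint32_t al  = 5;                          // mov al, 5
        // Writing al replaces only the low byte: r24@31:8 | zfill(8, 32, r8)
        eax = (eax & 0xFFFFFF00u) | (al & 0xFFu);
        uint32_t ebx = eax;                        // mov ebx, eax
        assert(ebx == 0x105);                      // 105h = 261
        return 0;
    }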

Better Support for Relocatable ELF Files

Looking at how the Boomerang ELF loader handles symbols and relocations, I noticed that something was clearly wrong for relocatable files (i.e., .o files). The loader was assuming that the VirtualAddress members of the section table were set as they are in executable images. This is not the case: it is the duty of the loader to choose an arbitrary starting address and to load each section at an appropriate offset from that address. I decided that choosing the same default address that IDA Pro uses was probably a good idea, as I often switch between Boomerang and IDA Pro to gather information, especially about things Boomerang has gotten wrong. I also decided to delay loading any section whose name starts with ".rel." until all the other sections are loaded, because IDA Pro does so. I don't know why it does this, but I want my addresses to match up with those in IDA Pro, so I have to follow their lead. After fixing this, I noticed that all the symbols and relocations now point …
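A rough sketch of that placement strategy (the Section struct and names here are hypothetical, not the loader's real data structures):

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Section {
        std::string name;
        uint32_t    size;
        uint32_t    align;     // sh_addralign from the section header
        uint32_t    loadAddr;  // chosen by us for a relocatable .o
    };

    // Place sections at offsets from an arbitrary base, honouring
    // alignment, and defer ".rel." sections to the end of the image.
    void layout(std::vector<Section>& sections, uint32_t base) {
        uint32_t next = base;
        auto place = [&](Section& s) {
            if (s.align > 1)   // round up to the section's alignment
                next = (next + s.align - 1) & ~(s.align - 1);
            s.loadAddr = next;
            next += s.size;
        };
        for (Section& s : sections)   // first pass: ordinary sections
            if (s.name.compare(0, 5, ".rel.") != 0) place(s);
        for (Section& s : sections)   // second pass: .rel.* last
            if (s.name.compare(0, 5, ".rel.") == 0) place(s);
    }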

A(nother) GUI for Boomerang

Quite a while ago I attempted to write a GUI for Boomerang. In fact, I've done this a couple of times. The stalling point has always been: what good is a GUI? Decompilers are supposed to be automatic. You should be able to give a decompiler an executable and trust it to spit out a C file that meets some definition of program equivalence with that executable. So if the decompiler works perfectly, what is there for the user to do? Surely anything they can offer will be more productively applied to the output, and for that they can just use standard source code manipulation tools. Well, there are two problems with this line of thinking. First, there's the sad fact that no decompiler is perfect. In fact, the state of the art is far, far from adequate, let alone perfect. Second, standard source code manipulation tools are woefully underpowered for even the simplest tasks of traditional reverse engineering (where traditional means "starting from source code") …

Welcome to my blog

Well, I've finally done it. I've succumbed to the popular movement of spewing thoughts into the void: I've started a blog. My resistance to blogging has always been twofold. First, I've felt that bloggers are innately vain people who believe their random thoughts are somehow more interesting than the random thoughts of others. Second, I've felt that content management software has nothing to offer over a plain HTML web page, especially seeing as you must type into a web form instead of using a proper text editor. But alas, I've come to feel that some of my activities of late may actually be of interest to other people, and besides, I've not felt I was solely indulging the author's ego whilst reading the blogs of others. As for content management systems, I guess it doesn't matter how you get the info out there, so long as it is remotely interesting. Speaking of interesting, what is this blog about? It's about the kinds of things I find interesting.