Saturday, December 30, 2006

Manual Decompilation

Argh. It's 2006, and I still don't have a good decompiler. All is not lost. Thankfully, there are still interesting things to decompile that are both small and contain lots of stuff that makes decompilation easy (e.g., symbols, relocations). So, let's do it manually using some trustworthy old-fashioned tools: a disassembler, a text editor and some string processing tools.

Let's choose a target. I'm going to go with a Linux kernel module because they are small, contain symbols and relocations, and because there exist GPL ones that I won't get in trouble for reverse engineering publicly. Just choosing something at random from /lib/modules on my Ubuntu Linux box, I come across new_wlan_acl.ko from the madwifi-ng drivers.

Right, now we need a disassembly. No problem. Just do objdump -d --adjust-vma=0x8000000 new_wlan_acl.ko > out.dis. That almost gives me the object as it would look mapped into the Linux kernel. Slight problem though: none of the relocations have been applied. You can see this by looking at some of the calls:

800004e: e8 fc ff ff ff call 800004f

That's clearly wrong. So let's look at the relocations. Just do objdump -r new_wlan_acl.ko > out.rel. Here's the culprit:

0000004f R_386_PC32 printk

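Ahh, so that's a call to printk at 800004e. The relocation offset 4f is the address of the call's 4-byte operand, one byte past the e8 opcode, which is why the unpatched call appears to point into itself. Unfortunately, objdump will not apply these relocations for us. We have to do it manually. This is our first bit of work. I've done it for us. See out.dis-with-rel.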
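If you'd rather not patch every call by eye, here is a minimal sketch of how the step could be automated. It is not the process used to produce out.dis-with-rel; annotate_calls is a hypothetical helper name, and it assumes the relocation listing only covers .text, that calls are the 5-byte e8 form with the usual fc ff ff ff placeholder (an addend of -4), and that the disassembly was produced with the same 0x8000000 offset as above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post):
#   annotate_calls REL_FILE DIS_FILE [VMA]
# Rewrites "call <bogus address>" into "call <symbol>" wherever an
# R_386_PC32 relocation covers the call's 4-byte operand.
annotate_calls() {
    rel_file=$1
    dis_file=$2
    vma=${3:-0x8000000}
    while IFS= read -r line; do
        case $line in
        *e8\ fc\ ff\ ff\ ff*call*)
            # Address of the instruction, e.g. " 800004e" -> "800004e".
            addr=$(printf '%s' "${line%%:*}" | tr -d ' \t')
            # The operand starts one byte after the e8 opcode; subtract the
            # --adjust-vma offset to get back the section-relative offset.
            off=$(printf '%08x' $((0x$addr + 1 - $vma)))
            # Appending "" forces a string comparison, so an offset such as
            # 000001e5 is never treated as a floating point number.
            sym=$(awk -v off="$off" \
                '($1 "") == (off "") && $2 == "R_386_PC32" { print $3; exit }' \
                "$rel_file")
            if [ -n "$sym" ]; then
                printf '%s\n' "$line" | sed "s/call .*/call $sym/"
                continue
            fi
            ;;
        esac
        printf '%s\n' "$line"
    done < "$dis_file"
}
```

Something like annotate_calls out.rel out.dis > out.dis-with-rel would then handle the calls. Relocations against data (R_386_32 and friends) still need to be dealt with separately.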

Well then, now that we have a nice fixed up disassembly, we want to start converting this into something resembling C. Rather than do this completely by hand, we're going to use the Unix equivalent of hand tools: sed. By carefully crafting some regular expressions we can convert AT&T asm for x86 into C statements. Here's a taste:

REG="%\([a-z]*\)"
HEXCONST="\(0x[0-9a-f]*\)"
TAB="&\t\t\t"

sed "s/mov $HEXCONST($REG),$REG/$TAB \3 = M(\2 + \1);/"
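To see what a rule like this does, here is a small self-contained sketch that wraps it in a function and feeds it one line. The translate name and the second rule (a load with no displacement) are assumptions for illustration; the sample input uses single spaces as in the listings on this page, whereas raw objdump output pads with more whitespace.

```shell
#!/bin/sh
REG='%\([a-z]*\)'
HEXCONST='\(0x[0-9a-f]*\)'
# "&" re-emits the matched asm; a literal tab is used here instead of the
# \t escape, which only some sed implementations understand.
TAB=$(printf '&\t\t\t')

# Hypothetical wrapper: reads disassembly on stdin, writes asm + C on stdout.
translate() {
    sed -e "s/mov $HEXCONST($REG),$REG/$TAB \3 = M(\2 + \1);/" \
        -e "s/mov ($REG),$REG/$TAB \2 = M(\1);/"
}

# Example: the asm is kept and the C statement is appended after three tabs.
printf '80005e0: 8b 44 24 04 mov 0x4(%%esp),%%eax\n' | translate
```

Lines that match no rule pass through unchanged, which is what makes the untranslated instructions easy to spot later.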

And here's the whole script. After we run it on out.dis-with-rel we get out.trans. It puts the C side by side with the asm so you can eyeball it and check that the script isn't doing something crazy. You'll note there are a lot of instructions that haven't been translated. These are either control transfer instructions, which we'll handle manually, or rare instructions that we can handle case by case. Often when we don't get a translation it is where I've manually applied a relocation. Mostly though, we now have C that does the same thing as the asm.

Of course, that won't do. The next thing to do is load up out.trans somewhere you can see it and load up a new text editor for out.c. We're going to add a new function declaration for every function label in the disassembly:

void acl_attach()
{
}

void acl_detach()
{
}

void _acl_free()
{
}

etc. Then pick whichever one you feel comfortable starting with. I like to pick the smallest, so let's have a look at acl_getpolicy. It's really small. Copy the contents of out.trans for this function into out.c:

void acl_getpolicy()
{
80005e0: 8b 44 24 04          mov 0x4(%esp),%eax      eax = M(esp + 0x4);
80005e4: 8b 80 48 06 00 00    mov 0x648(%eax),%eax    eax = M(eax + 0x648);
80005ea: 8b 00                mov (%eax),%eax         eax = M(eax);
80005ec: c3                   ret
}

Ok, it doesn't get much simpler than this. Here's the "optimization" process:

  1. find all the parameters. They look like M(esp + 0x4), but this can get complicated if you have any pushes or adds/subtracts to esp.
  2. replace the parameters with param1, param2, param3, etc.
  3. add the required number of parameters to the function.
  4. turn the M()'s into appropriate *(unsigned int*) syntax.
  5. rename any remaining registers to appropriate variable names. eax = a, ebx = b, etc. is fine.
  6. give the best types you can to everything to reduce the number of casts. If you change something into an int* and it is used in pointer arithmetic, remember to divide constants by 4.
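Steps 1 and 2, and the divide-by-4 in step 6, are mechanical enough to sketch in shell. This is only an illustration under stated assumptions: rename_params and byte_to_word_index are hypothetical names, arguments are assumed to be 4 bytes each, and esp is assumed not to have moved since function entry (no pushes or adds/subtracts), so that M(esp + 0x4) is the first parameter.

```shell
#!/bin/sh
# Hypothetical helper: replace every M(esp + 0xN) on stdin with paramK,
# where K = N / 4 (the return address sits at esp + 0, so param1 is at 0x4).
rename_params() {
    while IFS= read -r line; do
        while :; do
            case $line in
            *"M(esp + 0x"*")"*) ;;
            *) break ;;
            esac
            rest=${line#*"M(esp + 0x"}        # text after the first "M(esp + 0x"
            hex=${rest%%")"*}                 # the hex offset itself
            n=$((0x$hex / 4))                 # stack offset -> parameter number
            before=${line%%"M(esp + 0x"*}     # text before the first occurrence
            after=${rest#*")"}                # text after its closing paren
            line=${before}param${n}${after}
        done
        printf '%s\n' "$line"
    done
}

# Hypothetical helper for step 6: turn a byte offset into an int* index.
byte_to_word_index() {
    printf '%d\n' $(( $1 / 4 ))
}

printf 'eax = M(esp + 0x4);\n' | rename_params
byte_to_word_index 0x648
```

Anything that touches esp in the function body invalidates the simple offset-to-parameter mapping, so treat the output as a first pass to be checked by hand.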

Here's what acl_getpolicy looks like now:

int acl_getpolicy(int **param1)
{
    return *param1[402]; // 402 = 0x192 = 0x648 / 4
}

And that's about the best we can do without any external information. If we compile this and somehow replace the original function with it, it will do the same thing. But it really doesn't tell me much, does it? Of course, if I cheat and go google for "madwifi acl_getpolicy" I'll find this page which has this code for acl_getpolicy:

static int acl_getpolicy(struct ieee80211vap *vap)
{
    struct aclstate *as = vap->iv_as;

    return as->as_policy;
}

but we'll just pretend we didn't see that, ok? :)

There are a few more things we have to do. First, there are a number of places where out.trans says we access some constant plus .rodata.str1.1. These are string constants, so we can just replace them with the actual string that we read from the full contents of the object. Other places we access some constant + .rodata. Often these are tables, and we need to remember that there are relocations into the .rodata section, just as there are relocations into the .text section.
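One way to look up those strings, sketched under a few assumptions: the section is first dumped to a raw file (the objcopy line in the comment should do it, though I haven't verified it against this particular module), strings contain no embedded newlines, and string_at is a hypothetical helper name.

```shell
#!/bin/sh
# Dump the string section to a raw file first, e.g.:
#   objcopy -O binary -j .rodata.str1.1 new_wlan_acl.ko strs.bin
# (readelf -p .rodata.str1.1 new_wlan_acl.ko lists the same strings
#  together with their offsets, which is handy for cross-checking.)

# Hypothetical helper: string_at FILE OFFSET
# Prints the NUL-terminated string that starts OFFSET bytes into FILE.
string_at() {
    # tail -c +N is 1-based, hence the + 1; NULs become newlines so that
    # head can cut the output at the end of the first string.
    tail -c "+$(( $2 + 1 ))" "$1" | tr '\0' '\n' | head -n 1
}
```

With the section dumped, string_at strs.bin 0x1c would print whatever string out.trans refers to as 0x1c + .rodata.str1.1 (the offset here is made up for the example).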

Here's my final out.c. You may note I didn't finish it. You get the idea.

3 comments:

  1. Decompiling to this level of C is extremely similar to what we're doing in the Static Binary Translation group at Yahoo to translate binaries between architectures. (We don't care if the intermediate C is maintainable, we just want it to be efficient). You're welcome to drop in and visit us sometime.
