Saturday, December 30, 2006

Manual Decompilation

Argh. It's 2006, and I still don't have a good decompiler. All is not lost. Thankfully, there are still interesting things to decompile that are both small and contain lots of stuff that makes decompilation easy (e.g., symbols, relocations). So, let's do it manually using some trustworthy old fashioned tools: a disassembler, a text editor and some string processing tools.

Let's choose a target. I'm going to go with a linux kernel module because they are small, contain symbols and relocations and because there exist GPL ones that I won't get in trouble for reverse engineering publicly. Just choosing something at random from /lib/modules on my Ubuntu linux box I come across new_wlan_acl.ko from the madwifi-ng drivers.

Right, now we need a disassembly. No problem. Just do objdump -d --adjust-vma=0x8000000 new_wlan_acl.ko > out.dis. That almost gives me the object as it would look mapped into the linux kernel. Slight problem though, none of the relocations have been applied. You can see this by looking at some of the calls:

800004e: e8 fc ff ff ff call 800004f

That's clearly wrong. So let's look at the relocations. Just do objdump -r new_wlan_acl.ko > out.rel. Here's the culprit:

0000004f R_386_PC32 printk

ahh, so that's a call to printk at 800004e. Ok. Unfortunately, objdump will not apply these relocations for us. We have to do it manually. This is our first bit of work. I've done it for us. See out.dis-with-rel.

Well then, now that we have a nice fixed up disassembly, we want to start converting this into something resembling C. Rather than do this completely by hand, we're going to use the unix equivalent of hand tools: sed. By carefully crafting some regular expressions we can replace convert at&t asm for x86 into C statements. Here's a taste:

REG="%\([a-z]*\)"
HEXCONST="\(0x[0-9a-f]*\)"
TAB="&\t\t\t"

sed "s/mov $HEXCONST($REG),$REG/$TAB \3 = M(\2 + \1);/"

and here's the whole script. After we run it on out.dis-with-rel we get out.trans. It puts the C side by side with the asm so you can eyeball it and check that the script isn't doing something crazy. You'll note there's a lot of instructions that havn't been translated. These are either control transfer instructions, which we'll handle manually, or they are rare instructions that we can handle exceptionally. Often when we don't get a translation it is where I've manually applied a relocation. Mostly though, we now have C that does the same thing as the asm.

Of course, that won't do. The next thing to do is load up out.trans somewhere you can see it and load up a new text editor for out.c. We're going to add a new function declaration for every function label in the disassembly:

void acl_attach()
{
}

void acl_detach()
{
}

void _acl_free()
{
}

etc. Then pick whichever one you feel comfortable starting with. I like to pick the smallest.. so let's have a look at acl_getpolicy. It's really small. Copy the contents of out.trans for this function into out.c:

void acl_getpolicy()
{
80005e0: 8b 44 24 04 mov 0x4(%esp),%eax eax = M(esp + 0x4);
80005e4: 8b 80 48 06 00 00 mov 0x648(%eax),%eax eax = M(eax + 0x648);
80005ea: 8b 00 mov (%eax),%eax eax = M(eax);
80005ec: c3 ret
}

Ok, it doesn't get much simpler than this. Here's the "optimization" process..

  1. find all the parameters. They look like M(esp + 0x4), but this can get complicated if you have any pushes or adds/subtracts to esp.
  2. replace the parameters with param1, param2, param3, etc.
  3. add the required number of parameters to the function.
  4. turn the M()'s into appropriate *(unsigned int*) syntax.
  5. rename any remaining registers to appropriate variable names. eax = a, ebx = b, etc is fine.
  6. give the best types you can to everything to reduce the number of casts. If you change something into an int* and it is used in pointer arithmetic, remember to divide constants by 4.

Here's what acl_getpolicy looks like now:

int acl_getpolicy(int **param1)
{
return *param1[402]; // 402 = 0x192 = 0x648 / 4
}

And that's about the best we can do without any external information. And if we compile this somehow replace the original function with this, it will do the same thing. But it really doesn't tell me much does it? Of course, if I cheat and go google for "madwifi acl_getpolicy" I'll find this page which has this code for acl_getpolicy:

static int acl_getpolicy(struct ieee80211vap *vap)
{
struct aclstate *as = vap->iv_as;

return as->as_policy;
}

but we'll just pretend we didn't see that, ok? :)

There's a few more things we have to do. First, there's a number of places where out.trans says we access some constant plus .rodata.str1.1. These are string constants, so we can just replace them with the actual string that we read from the full contents of the object. Other places we access some constant + .rodata. Often these are tables, and we need to remember that there are relocations into the .rodata section, just as there are relocations into the .text section.

Here's my final out.c. You may note I didn't finish it. You get the idea.

Sunday, December 17, 2006

Spring Cleaning

I, like many people, own a robotic vacuum cleaner. It's crap. With all the advanced robotic technology it has on board, it's still lacking in vacuum cleaner technology. Although it is bagless, it's not the good kind of bagless. There's no double vortex mechanism here. In fact, there's very little suction on it at all. That's kind of important for a vacuum cleaner I think. Besides which, the robotic technology isn't all that "advanced" anyway. It still gets stuck in corners or on that same part of the rug. So when I feel the carpet needs a bit of a spruce, I pull out the trusty Vax and put my back into it.

Where's my maidbot? It's been 40 years since the The Jetsons and I am still waiting for my flying car, err, I mean, maidbot. Now, of course, I realise that it would be a bit expensive to get together a crack team of Japanese scientists just to make me a maidbot (and yes, they have to be Japanese scientists) and that no amount of mass production magic is going to make a maidbot affordable any time soon, but say I was a rich man, could it be done? How would it be done?

Making a robotic vacuum cleaner is pretty easy I think. Yes, there is lots of hard problems to solve, but overall, the task of a vacuum cleaner is pretty simple: remove dust from carpet. The task of a maidbot includes that too, but it includes so much more. There's removing dust from glassware. There's removing dust from tabletops and other services. There's the whole damn removing dust thing, yes, we get that, but that's not all! Dust is a bad thing and I'd love my home to be free of it, but it isn't what I call "mess" and that's what I want my maidbot to clean up. Frankly, I'd be happy to have a seperate dust-patrol-bot to take care of the dust, and the maidbot can do all the other stuff.

So what is this "other stuff" I'm talking about? I mean, that's pretty vague. If I was to take that specification to my Japanese scientists they might come up with all sorts of things. Probably things that involve automatic weapons or lasers or something. That's Japanese scientists for ya, and I think the reason why I still don't have a maidbot is precisely that: no-one knows what the hell it is supposed to do. Not to the level of detail needed by Japanese scientists anyway.

I've spent all day (well, some of the day, well, a few hours) cleaning up my dwelling and I think I know what it is all about now. I've had one of those rare moments of clarity where, as an engineer, you step back from the situation and think "I could code something to do this." And here it is folks: cleaning is all about classification.

Last time I looked, the state of the art in computer vision technology was the Scale-Invariant Feature Transform algorithm. Given an arbitary object, it can extract features which it can then recognise at a later date, regardless of how the object is rotated or scaled. There are techniques for quickly recognising a large catalog of objects, but I don't think these algorithms run in real-time, nor has anyone attempted to use these features to build categories of objects. So, some refinement of the state of the art is necessary.

Instead of just recognising objects, our maidbot will need to recognise the common features that objects which are already in close proximity, say, on a shelf, have. Then, recognising an object that is not lying in a grouping of similar objects, i.e., a misplaced object, determine which grouping of objects shares the most number of features and placing the misplaced object with that group of objects.

Of course, the maidbot can't just chuck the object with the others and claim it is now tidy.. Books are stacked on shelves in a particular way. Your way of stacking books is probably slightly different to my way of stacking books. This is procedural knowledge that the maidbot needs to be able to reverse engineer from examples and mimic. I'm not aware of any state of the art in this area.

Oh, and there's that binary classification which every maidbot should place higher priority on than any other possible classification: clean or dirty. Otherwise the Japanese scientists are likely to come up with a maidbot that puts dirty cutlery in the draw and clean cutlery in the sink/dishwasher. However, so long as the maidbot has done a good survey of the various collections of objects in the house before it starts cleaning, I don't think it will require any special case programming. The maidbot will see the collection of knives, forks and spoons in the draw and the collection of knives, forks and spoons in the sink and, needed to seperate the classifications of these two groups of objects, it will determine that one set is clean and the other set is dirty. So the dirty knives, forks and spoons will better match a misplaced dirty knife, fork or spoon and so the maidbot will not mix dirty cutlery with clean cutlery.

Of course, the best way to test this hypothesis is to wait until we have running code and then tweak as required.