Saturday, December 30, 2006

Manual Decompilation

Argh. It's 2006, and I still don't have a good decompiler. All is not lost. Thankfully, there are still interesting things to decompile that are small and contain lots of stuff that makes decompilation easy (e.g., symbols, relocations). So, let's do it manually using some trustworthy, old-fashioned tools: a disassembler, a text editor and some string processing tools.

Let's choose a target. I'm going to go with a linux kernel module, because kernel modules are small, contain symbols and relocations, and because there exist GPL ones that I won't get in trouble for reverse engineering publicly. Choosing something at random from /lib/modules on my Ubuntu linux box, I come across new_wlan_acl.ko from the madwifi-ng drivers.

Right, now we need a disassembly. No problem. Just do objdump -d --adjust-vma=0x8000000 new_wlan_acl.ko > out.dis. That almost gives me the object as it would look mapped into the linux kernel. Slight problem though: none of the relocations have been applied. You can see this by looking at some of the calls:

800004e: e8 fc ff ff ff call 800004f

That's clearly wrong. So let's look at the relocations. Just do objdump -r new_wlan_acl.ko > out.rel. Here's the culprit:

0000004f R_386_PC32 printk

Ahh, so that's a call to printk at 800004e. Ok. Unfortunately, objdump will not apply these relocations for us. We have to do it manually. This is our first bit of work. I've done it for us. See out.dis-with-rel.
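
If you'd rather script the fixup than patch the listing by hand, the arithmetic for R_386_PC32 is just S + A - P: the symbol's address, plus the addend already sitting in the instruction (usually -4 for a call), minus the address of the field being patched. A sketch (the function and argument names are mine):

#include <stdint.h>
#include <string.h>

/* Patch one R_386_PC32 relocation in a raw copy of the section.
   image:     the section bytes, as they would appear at vma
   reloc_off: offset of the 4-byte field (0x4f for the printk call above)
   sym_addr:  where the symbol (e.g. printk) lives */
static void fix_pc32(uint8_t *image, uint32_t vma,
                     uint32_t reloc_off, uint32_t sym_addr)
{
    int32_t addend;
    memcpy(&addend, image + reloc_off, 4);                  /* A, usually -4 */
    int32_t value = sym_addr + addend - (vma + reloc_off);  /* S + A - P */
    memcpy(image + reloc_off, &value, 4);
}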

Well then, now that we have a nice fixed up disassembly, we want to start converting this into something resembling C. Rather than do this completely by hand, we're going to use the unix equivalent of hand tools: sed. By carefully crafting some regular expressions we can convert AT&T x86 asm into C statements. Here's a taste:

REG="%\([a-z]*\)"
HEXCONST="\(0x[0-9a-f]*\)"
TAB="&\t\t\t"

sed "s/mov $HEXCONST($REG),$REG/$TAB \3 = M(\2 + \1);/"

and here's the whole script. After we run it on out.dis-with-rel we get out.trans. It puts the C side by side with the asm so you can eyeball it and check that the script isn't doing something crazy. You'll note there are a lot of instructions that haven't been translated. These are either control transfer instructions, which we'll handle manually, or they are rare instructions that we can handle exceptionally. Often where we don't get a translation is where I've manually applied a relocation. Mostly though, we now have C that does the same thing as the asm.

Of course, that won't do. The next thing to do is load up out.trans somewhere you can see it and load up a new text editor for out.c. We're going to add a new function declaration for every function label in the disassembly:

void acl_attach()
{
}

void acl_detach()
{
}

void _acl_free()
{
}

etc. Then pick whichever one you feel comfortable starting with. I like to pick the smallest.. so let's have a look at acl_getpolicy. It's really small. Copy the contents of out.trans for this function into out.c:

void acl_getpolicy()
{
80005e0: 8b 44 24 04 mov 0x4(%esp),%eax eax = M(esp + 0x4);
80005e4: 8b 80 48 06 00 00 mov 0x648(%eax),%eax eax = M(eax + 0x648);
80005ea: 8b 00 mov (%eax),%eax eax = M(eax);
80005ec: c3 ret
}

Ok, it doesn't get much simpler than this. Here's the "optimization" process..

  1. find all the parameters. They look like M(esp + 0x4), but this can get complicated if you have any pushes or adds/subtracts to esp.
  2. replace the parameters with param1, param2, param3, etc.
  3. add the required number of parameters to the function.
  4. turn the M()'s into appropriate *(unsigned int*) syntax.
  5. rename any remaining registers to appropriate variable names. eax = a, ebx = b, etc is fine.
  6. give the best types you can to everything to reduce the number of casts. If you change something into an int* and it is used in pointer arithmetic, remember to divide constants by 4.

Here's what acl_getpolicy looks like now:

int acl_getpolicy(int **param1)
{
return *param1[402]; // 402 = 0x192 = 0x648 / 4
}

And that's about the best we can do without any external information. And if we compile this and somehow replace the original function with it, it will do the same thing. But it really doesn't tell me much, does it? Of course, if I cheat and go google for "madwifi acl_getpolicy" I'll find this page which has this code for acl_getpolicy:

static int acl_getpolicy(struct ieee80211vap *vap)
{
struct aclstate *as = vap->iv_as;

return as->as_policy;
}

but we'll just pretend we didn't see that, ok? :)

There are a few more things we have to do. First, there are a number of places where out.trans says we access some constant plus .rodata.str1.1. These are string constants, so we can just replace them with the actual string that we read from the full contents of the object. In other places we access some constant + .rodata. Often these are tables, and we need to remember that there are relocations into the .rodata section, just as there are relocations into the .text section.
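
For example (the offset and string here are invented for illustration), where out.trans says something like:

eax = .rodata.str1.1 + 0x4a;

you dump the section with objdump -s -j .rodata.str1.1 new_wlan_acl.ko, read off the bytes at 0x4a, and substitute:

eax = "wlan: %s\n";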

Here's my final out.c. You may note I didn't finish it. You get the idea.

Sunday, December 17, 2006

Spring Cleaning

I, like many people, own a robotic vacuum cleaner. It's crap. With all the advanced robotic technology it has on board, it's still lacking in vacuum cleaner technology. Although it is bagless, it's not the good kind of bagless. There's no double vortex mechanism here. In fact, there's very little suction on it at all. That's kind of important for a vacuum cleaner I think. Besides which, the robotic technology isn't all that "advanced" anyway. It still gets stuck in corners or on that same part of the rug. So when I feel the carpet needs a bit of a spruce, I pull out the trusty Vax and put my back into it.

Where's my maidbot? It's been 40 years since The Jetsons and I am still waiting for my flying car, err, I mean, maidbot. Now, of course, I realise that it would be a bit expensive to get together a crack team of Japanese scientists just to make me a maidbot (and yes, they have to be Japanese scientists) and that no amount of mass production magic is going to make a maidbot affordable any time soon, but say I was a rich man, could it be done? How would it be done?

Making a robotic vacuum cleaner is pretty easy I think. Yes, there are lots of hard problems to solve, but overall, the task of a vacuum cleaner is pretty simple: remove dust from carpet. The task of a maidbot includes that too, but it includes so much more. There's removing dust from glassware. There's removing dust from tabletops and other surfaces. There's the whole damn removing dust thing, yes, we get that, but that's not all! Dust is a bad thing and I'd love my home to be free of it, but it isn't what I call "mess" and that's what I want my maidbot to clean up. Frankly, I'd be happy to have a separate dust-patrol-bot to take care of the dust, and the maidbot can do all the other stuff.

So what is this "other stuff" I'm talking about? I mean, that's pretty vague. If I was to take that specification to my Japanese scientists they might come up with all sorts of things. Probably things that involve automatic weapons or lasers or something. That's Japanese scientists for ya, and I think the reason why I still don't have a maidbot is precisely that: no-one knows what the hell it is supposed to do. Not to the level of detail needed by Japanese scientists anyway.

I've spent all day (well, some of the day, well, a few hours) cleaning up my dwelling and I think I know what it is all about now. I've had one of those rare moments of clarity where, as an engineer, you step back from the situation and think "I could code something to do this." And here it is folks: cleaning is all about classification.

Last time I looked, the state of the art in computer vision technology was the Scale-Invariant Feature Transform algorithm. Given an arbitrary object, it can extract features which it can then recognise at a later date, regardless of how the object is rotated or scaled. There are techniques for quickly recognising a large catalog of objects, but I don't think these algorithms run in real-time, nor has anyone attempted to use these features to build categories of objects. So, some refinement of the state of the art is necessary.

Instead of just recognising objects, our maidbot will need to recognise the common features shared by objects which are already in close proximity, say, on a shelf. Then, on recognising an object that is not lying in a grouping of similar objects, i.e., a misplaced object, it must determine which grouping of objects shares the most features with it and place the misplaced object with that group.

Of course, the maidbot can't just chuck the object with the others and claim it is now tidy.. Books are stacked on shelves in a particular way. Your way of stacking books is probably slightly different to my way of stacking books. This is procedural knowledge that the maidbot needs to be able to reverse engineer from examples and mimic. I'm not aware of any state of the art in this area.

Oh, and there's that binary classification which every maidbot should place higher priority on than any other possible classification: clean or dirty. Otherwise the Japanese scientists are likely to come up with a maidbot that puts dirty cutlery in the drawer and clean cutlery in the sink/dishwasher. However, so long as the maidbot has done a good survey of the various collections of objects in the house before it starts cleaning, I don't think it will require any special case programming. The maidbot will see the collection of knives, forks and spoons in the drawer and the collection of knives, forks and spoons in the sink and, needing to separate the classifications of these two groups of objects, it will determine that one set is clean and the other set is dirty. So the dirty knives, forks and spoons will better match a misplaced dirty knife, fork or spoon, and so the maidbot will not mix dirty cutlery with clean cutlery.

Of course, the best way to test this hypothesis is to wait until we have running code and then tweak as required.

Thursday, November 16, 2006

God Damn I Hate Cygwin

First of all, let's get the rant out of the way. Cygwin is a big pile of junk. It's like everyone who touches Cygwin gets the clue beaten out of them before they're allowed to hack on it. Here's a tip, guys: if you're going to try to do something really hard (like build a POSIX compatible software layer on top of an operating system that holds many parts of the POSIX standard as the antithesis of its design) you have to put a lot of effort into it. Ok, done.

I recently had this problem:
Fatal server error: could not open default font 'fixed'

for which I googled, lots and lots and lots, and found no adequate solution. I found the Cygwin/X Frequently Asked Questions and it had two suggestions as to what the problem could be:
  1. You don't actually have the xorg-x11-fnts package installed (duh, thanks guys, yeah, that wasn't the first thing I checked).
  2. The mount point for /usr/X11R6/lib/X11/fonts was invalid at the time that Cygwin's setup.exe installed the xorg-x11-fnts package.

This advice is, of course, absolutely useless. What mount point are you talking about guys? In what way could it be "invalid"? Why would I have to manually fuck around with this stuff anyways? My /usr/X11R6/lib/X11/fonts happens to be mounted on /, which is mounted to c:\cygwin.. ya know, like everyone else who uses cygwin. What the hell are you guys on about?

Well, turns out what they are trying to say is that you need to make a mount point for /usr/X11R6/lib/X11/fonts. The problem is that stupid "do you want to use unix mode or text mode?" option at the start of the cygwin installer. Probably 50% of people choose unix mode, then discover something doesn't work (like, say, scripts that have been checked into a repository by someone who chose textmode) and then quickly change to textmode. If you are using textmode for /, then you will have this problem. Here's how to fix it:
foo@mymachine ~
$ cd /usr/X11R6/lib/X11
foo@mymachine /usr/X11R6/lib/X11
$ mv fonts fonts_real
foo@mymachine /usr/X11R6/lib/X11
$ mount -b c:\\cygwin\\usr\\X11R6\\lib\\X11\\fonts_real /usr/X11R6/lib/X11/fonts

Your fonts dir should now be mounted in "binmode". But that's not all! You need to edit /usr/X11R6/bin/font-update and look for need_update=""; between those quotes, put a 1. Now do:
foo@mymachine ~
$ /usr/X11R6/bin/font-update

and don't forget to edit /usr/X11R6/bin/font-update and take the 1 out of those quotes.

If you are a Cygwin/X developer, please, please, please add a mount command to the installer! We shouldn't have to do this shit manually!

Tuesday, November 14, 2006

Windows Printer on Linux

I'm an Ubuntu user but I also run a number of Windows machines for work purposes. I was actually shocked recently to discover that CUPS supports my Brother HL-5140 laser printer without me having to download any binary blobs from the Brother website. So that's good.

What's not so great is that gnome-cups-ui has no concept of workgroups or domains for connecting via smb to a windows printer. When you go to "add new printer" or when you view the connection properties after adding the new printer, the fields you are presented with are:

  • Host
  • Printer
  • Username
  • Password

After googling for a bit, I discovered that people have figured out that if you put the workgroup name in the Host field and then enter username:password@host/Printer into the Printer field and nothing into either the username or password fields, you can print!

Curious about this obscurity, I did an apt-get source gnome-cups-ui and hunted around for a bit. Eventually, I found some code that looks basically like this:

if (username != NULL && username[0] && password != NULL && password[0])
    ret = g_strdup_printf("smb://%s:%s@%s/%s", username, password, host, printer);
else
    ret = g_strdup_printf("smb://%s/%s", host, printer);

Do you see what's happening? The hack outlined above uses the second case. An alternative would be to enter workgroup/username into the Username field and your password into the Password field; that would use the first case. The problem with doing that, however, is that if you don't enter a password then the username will be ignored.. this isn't much good if you're trying to use the Guest account, which requires no password. Also, even if you're using a real account, the resulting URI won't parse correctly, so when you refresh the connection properties you'll end up with the workgroup in the Host field and everything else (including your password) in the Printer field.
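
A workgroup-aware version of that code might look something like this (just a sketch of the idea, not the fix that eventually landed in CVS; workgroup is the new field, and smbspool documents the URI form smb://username:password@workgroup/host/printer):

/* sketch: fall back to the old behaviour when no workgroup is given */
if (workgroup == NULL || !workgroup[0])
    ret = g_strdup_printf("smb://%s/%s", host, printer);
else if (username != NULL && username[0] && password != NULL && password[0])
    ret = g_strdup_printf("smb://%s:%s@%s/%s/%s",
                          username, password, workgroup, host, printer);
else
    ret = g_strdup_printf("smb://%s/%s/%s", workgroup, host, printer);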

I wrote myself a nice little patch to ask for a workgroup and parse the uri correctly, etc. If anyone wants it, you can grab it from here:

gnome-cups-manager-0.31-with-workgroup.diff

However, turns out this has already been fixed in the latest CVS, so I won't be contacting the authors. Kinda makes me wonder how long it takes for fixes like this to filter out of CVS and into the distributions. Ubuntu releases are tied to the GNOME releases, so it should be one of the fastest to get fixes like this, but this is still broken in the latest Ubuntu release (Edgy).

Thursday, November 02, 2006

A 5mb binary blob in the kernel?

If you pop over to the NVIDIA web site and download the 3d card drivers for Linux, you'll note that there is a /usr/src/nv directory. In that directory is source code to the "thin layer" to the Linux kernel against which NVIDIA links their binary blob. This, of course, makes no legal difference - NVIDIA are still extending the Linux kernel and therefore it is unlawful to distribute the Linux kernel along with the NVIDIA drivers, but NVIDIA doesn't do that, so it's not a problem - for them. Anyway, that's a side issue.

I was thinking, recently, about the Linux kernel's "tainted" flag. Essentially, whenever you install a kernel module that isn't under some accepted open source license, the kernel sets a flag so that developers know there is a chance that any bugs you report might have been caused by code they can't fix. In general, this is not such a big deal as kernel modules are typically small and easy to isolate. The NVIDIA graphics drivers, on the other hand, are not small; they are actually over 5 meg.. and loading anything that is 5 meg in size into the kernel is typically a bad idea. It's much better to split some parts out into userland and use thunks to connect the kernel part to the userland part.

So back to this /usr/src/nv directory. If you have a look, there's a bunch of source files, header files, make files and this nv-kernel.o file. That's the binary blob. They don't give us source for that bit. For me, it's 5104332 bytes, and most of the symbols have been replaced with stuff like _nv003299rm. This, of course, is to make it harder to understand if you were to try to reverse engineer it.

Now, someone out there, if you build your own kernel, I need your help (done, see comments). Go into the /usr/src/nv directory, do a make and get this thing to build. You may have to screw around with paths and stuff, and this is the reason why I haven't done it myself. Ok, got it to build? Great. Now make clean and remove the nv-kernel.o from the Makefile, then make. You should get some errors.. in particular, we're looking for "undefined symbol" errors. How many symbols are undefined?

This tells us how big the interface is between the source layer and the binary layer. If this interface is small enough we can write thunks for each symbol and move the binary layer into userland. If we do speed tests and discover that it isn't much slower to run the blob in userland then we can create a module that contains no binary code, and maybe the Linux kernel developers will consider this "untainted" enough that it is useful to someone. Knowing that interface will also help reverse engineer the binary part.
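
(For the other half of the interface - the symbols the blob itself expects the kernel side to provide - you don't need a successful build at all: nm -u nv-kernel.o | wc -l will count them straight off the object file.)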

Thursday, October 12, 2006

Skip the Intermediaries

Sometimes copyright law is just stoopid. Sometimes the rules just don't apply. Have you heard the story of Steve McDonald and the White Stripes? Here's a nice flash animation of what happened, or you can keep reading.. Steve McDonald is a veteran member of the band Redd Kross. He likes the White Stripes, but he thought they would sound better if they had a bass guitarist. So he appointed himself. He had the equipment, and the skill, so he made up some bass tracks and added them to his favourite White Stripes songs. He then posted those songs on the Redd Kross website - without permission. Of course, this could land him in hot water, but luckily he bumped into Jack White who gave his assurances that he wasn't going to sue. Before that happened I'm guessing Steve McDonald just didn't give a damn.. after all, he's a rocker, man.

I'm not a rocker, but I am a rebel, ask anyone. I once posted the full c99 standard to my web site so people wouldn't have to pay $10 for a copy so they could double check all the nasty things I was saying about it (don't get me started) or, worse yet, take my word for it. The mean lawyers from the C standards committee sent me a cease and desist email which I just ignored. Well, I didn't completely ignore it, I forwarded it to the dude who ran my web site at the time and asked him if he wanted to do anything about it, and he said he didn't, so then I ignored it. Know what they did? Nothing. Guess they figured they were on shaky ground, or their legal budget extends to sending mean emails and not following through.

In that case I was just trying to make a point. People shouldn't have to pay for something to recommend to others that they don't support it. I currently have a web site where I give out copies of a favourite old game of mine. I'm also involved in an open source project to recreate that game for modern platforms. You can play the game on DosBox, sure, but can you run DosBox on your pocket pc or palm pilot? No, I didn't think so. One day you may be able to play that game on these new platforms, and as mobile phone games have shown, it's not just the nostalgic who benefit when that happens.

What will happen when the owner of the copyright on this game comes knocking on my door? Will I ignore them? Hell no. I will beg them to release the game under a creative commons license so that we can legally distribute it.

Wednesday, October 04, 2006

At last. Some violence!

Hehe, after struggling to place the head, torso and legs of pedestrians together to make the animations work in FreeSynd for the last week or so, I happened to find this information about HELE-0.ANI, HFRA-0.ANI and HSTA-0.ANI. Turns out these three files contain just about everything you need to know to draw objects in the game. There are 1970 animations which are made up of frames taken from a pool of 8949, each of which is made up of elements taken from a pool of 10486, each of which indexes into the available 1180 game sprites. The amount of work required to manually reproduce these animations (which I had intended to do) is phenomenal. A conservative estimate is that this information has knocked a year off the development time. In just one day (today) I have figured out the exact placement of units on the map, fixed the drawing of animations and cleaned up a lot of very ugly code. The result is that agents can now walk around, with and without all the different weapons out, and shoot those weapons. The guards in the first level can do likewise, as well as get shot, die and bleed. I can also shoot cars, making them explode.
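
Structurally, the data is three levels of indirection deep. Something like this (the field names and widths here are invented for illustration; the real .ANI records pack things differently):

#include <stdint.h>

struct element {             /* one of 10486, from HELE-0.ANI         */
    uint16_t sprite;         /* index into the 1180 game sprites      */
    int16_t  off_x, off_y;   /* where to draw it within the frame     */
    uint16_t next;           /* next element of this frame, 0 = none  */
};

struct frame {               /* one of 8949, from HFRA-0.ANI          */
    uint16_t first_element;  /* head of this frame's element list     */
    uint16_t next;           /* next frame of the animation, 0 = none */
};

struct animation {           /* one of 1970, from HSTA-0.ANI          */
    uint16_t first_frame;
};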

So, again, I turn back to that great mystery of the GAME*.DAT files. In particular I still don't know where the information to place things like trees or doors comes from. I also don't know where to obtain the detailed mission objective information - like *who* exactly you have to kill to win an assassinate mission.

Thursday, September 28, 2006

Improving On An Old Chestnut


Breaking my own rule that improvements should be deferred until such time as the original game has been reimplemented, I've added waypoints and path finding to agent movement in FreeSynd. I found this documentation of the A* path finding algorithm to be most enlightening.. although, frankly, the sketch is more than adequate to implement a useful algorithm for agent movement.
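
If you've never seen it, the whole thing fits on a page. Here's a stripped-down sketch on a 4-connected grid (the fixed map size, the blocked[] array and the dumb linear scan of the open set are simplifications for clarity, not what FreeSynd actually does):

#include <stdlib.h>
#include <string.h>

#define W 128
#define H 128

static int heur(int x0, int y0, int x1, int y1) {
    return abs(x0 - x1) + abs(y0 - y1);   /* Manhattan distance */
}

/* Returns the path cost from (sx,sy) to (tx,ty), or -1 if unreachable.
   The route itself is recovered by walking prev[] back from the target. */
int astar(const unsigned char *blocked, int sx, int sy, int tx, int ty, int *prev)
{
    static int g[W * H], f[W * H];
    static unsigned char open[W * H], closed[W * H];
    memset(open, 0, sizeof open);
    memset(closed, 0, sizeof closed);

    int start = sy * W + sx, target = ty * W + tx;
    g[start] = 0;
    f[start] = heur(sx, sy, tx, ty);
    open[start] = 1;
    prev[start] = -1;

    for (;;) {
        int cur = -1;   /* pick the cheapest open node by f = g + h */
        for (int i = 0; i < W * H; i++)
            if (open[i] && (cur < 0 || f[i] < f[cur]))
                cur = i;
        if (cur < 0) return -1;              /* open set empty: no path */
        if (cur == target) return g[cur];
        open[cur] = 0;
        closed[cur] = 1;

        static const int dx[4] = { 1, -1, 0, 0 }, dy[4] = { 0, 0, 1, -1 };
        for (int d = 0; d < 4; d++) {
            int nx = cur % W + dx[d], ny = cur / W + dy[d];
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;
            int n = ny * W + nx;
            if (blocked[n] || closed[n]) continue;
            int ng = g[cur] + 1;
            if (!open[n] || ng < g[n]) {     /* found a cheaper way in */
                g[n] = ng;
                f[n] = ng + heur(nx, ny, tx, ty);
                prev[n] = cur;
                open[n] = 1;
            }
        }
    }
}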

It always annoyed me when I told my agents to go to point X and they got stuck inside buildings or took some inordinately long path. Now you can hold down the ctrl key and enter waypoints manually, or you can let the path finder do its job (this is the default). Pressing the ctrl key without clicking the mouse will display the selected agents' current path(s) as a bright yellow line. At the moment the planning algorithm does not take elevated terrain into account, so trying to send agents onto bridges or up stairs simply won't work just yet.

There's still a hell of a lot to do before the game is playable, including shooting, enemy AI, vehicle controls, sound, etc. But if you'd like to play around with the path finding, you can grab this all-in-one package I've put together. It includes windows binaries, all the required data files from the original game and source code (so you can build it on linux if that's your preferred platform). If you feel the urge, please report bugs on the bug tracker. I'd also love to hear any ideas you have for new features.

Tuesday, September 19, 2006

Into The Unknown..

Back when Mike and I started the Boomerang decompiler project, we had the choice to start from scratch or to use our work on the UQBT project as a base. There were pros and cons to both approaches, but in the end we decided that UQBT had quite a lot of value in it, so we went with that. Unfortunately, at the time Mike and I started working on a decompiler based on UQBT, the copyright was owned by a number of parties. We had to talk to lawyers from the University of Queensland, Sun Microsystems and the individual people who had worked on the codebase over the years. Initially the lawyers were of a single opinion: no way. The individual contributors, however, were of mixed opinions: what for? and which license? As it turned out, answering the second question also answered the first question and started to put the lawyers' minds at ease.

The goal of the Boomerang decompiler project, decided way back in 2001, was to be a guiding force in bringing about a market for decompilation technology. By publicly showing that decompilation was possible, we hoped to encourage people to invest time and money into making tools that they either make available to the public, or use to offer services to the public. This is why we chose a BSD-style license, so anyone who wanted to use our code for commercial (more specifically, proprietary) purposes, would be free to do so.

Over the years, I think we've done a lot to change the prevailing opinion that decompilation is "impossible" or at least economically uninteresting. Our paper on commercial decompilation from a small contractor perspective "Using a Decompiler for Real-World Source Recovery" was well received in 2004, whereas our paper "Computer Security Analysis through Decompilation and High-Level Debugging" was openly bashed at the 2001 WCRE conference. In just 3 years the opinion that decompilation was something someone with good tool support could do in an economical amount of time has changed from wishful thinking to accepted common knowledge.

So how about that market? Is the flood of cheap decompiler tools coming? Well, ever since 2001, Ilfak Guilfanov has been talking about adding commercially supported decompilation to IDA Pro. A large number of plugins exist, but without commercial support I really don't think we're there yet. Decompilation services for source recovery are still pretty much unheard of, but there are some people out there now who claim they can do it.. that's more than there was before. It seems the image problem is sorting itself out. Slowly, people are starting to see decompilation of machine code with the same eyes as they see decompilation of Java.

It's time to stop sitting on the sidelines. Mike and I have accepted roles with a company that is using decompilation technology to provide market valued services. We're now taking a direct role in influencing the creation of this market. Of course, I can't tell you which company it is we'll be working for, or what services they will be providing, but chances are you'll be able to guess. Just don't ask me to confirm or deny your guesses :)

Unfortunately, this means that we must stop working on Boomerang. It's possible that others may be interested in continuing the development of Boomerang, and we'll be around to offer a little help where we can. In this blog I have spoken a lot about Boomerang, but that's not what this blog is about. This blog is about me and, in particular, my participation in the open source community. That won't change. I will continue to write about the open source projects I work on. At the moment that's very game related, but in the future this too may change.

Saturday, September 09, 2006

Other Fun

I've recently been working on some game related projects. First, there's the RPG engine I've started which I call Stallion. I've received just about no interest in this from anyone so I haven't been paying it much attention. Now and then I get the desire to hack on an RPG and none of the available codebases for graphical RPGs interest me as much as the ones for textual RPGs. So why not hack on a textual RPG? Because it's just so dry without players to consider, and with players to consider you've just gotta be so dedicated.

Then there's my other project. I've always been a fan of Bullfrog's Syndicate. My fan page gets a lot of hits (probably because I give out copies of the game on it). If you've never played Syndicate you probably won't understand the attraction. You'll also think we're all crazy for fooling around with DosBox or old hardware to get the game to actually run and never consider doing the like yourself. As such, you may never experience the game, which I consider a tragedy. That's where FreeSynd comes in. A visitor to my website recommended I take a look at it. The goal of the project is to make a reimplementation of the game using the original game data, and later create new missions/graphics/etc from it. I guess that means one day you will get a chance to play the game. Of course, this begs the question of why you would want to.. all I can suggest is that you or your kids might like to experience the games my generation played when we were young, much like I enjoy playing Pac Man or Space Invaders. If we have to rely on DosBox or old hardware that just might not be possible.

In about 14 hours I get on a plane for 13 hours. I then get to sit around in an airport for another 5 hours and then get on another plane for 6 hours. My entertainment on this trip will consist of a copy of The Confusion by Neal Stephenson and a junky Sony VAIO laptop with 192mb of RAM. Assuming I have access to a power socket (I'm flying in cattle class) I expect to spend more time working on FreeSynd than reading Stephenson but, ya know, flying is such a funny thing, maybe I'll just nag the cabin staff.

Wednesday, August 23, 2006

I haven't forgotten you

Haven't been working on Boomerang much lately. I made a few changes to make the exports of DLL files appear as entrypoints in the GUI. Seems to work, will check it in soon. I also made some recent changes that handle gotos inside switches better. This means that magic like Duff's Device now decompiles sensibly.
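
For reference, Duff's Device is the classic interleaving of a switch with a do-while to unroll a copy loop - exactly the kind of goto-into-a-switch control flow that used to trip us up:

void send(short *to, short *from, int count)
{
    /* 'to' is a memory-mapped device register, hence never incremented */
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}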

Tuesday, July 25, 2006

And then, some calls just don't return

A small number of library procedures are "called" but never actually return. Eventually I'd like to have a way to specify these procedures with annotations in the signature files, but for the moment they are hard coded in the frontend. That's acceptable for the moment as there are only five: _exit, exit, ExitProcess, abort and _assert. Thing is, what happens when you have a branch over one of these calls, as you often do? Typically you end up with phi statements that refer to the call or the statements before it because there's no way to show that these definitions are killed by the call but not defined by it. We could add this to the SSA form but a simpler solution is available. Whenever the decompiler determines that the destination of a call is one of these no return procedures then we simply drop the out edge of the containing basic block. Without an out edge the definitions cannot flow beyond the call.
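
A tiny example of the shape that causes the trouble:

#include <stdlib.h>

int f(int err)
{
    int x = 1;
    if (err) {
        x = 2;
        exit(1);   /* never returns, so x = 2 can never reach the return */
    }
    return x;      /* with the out edge of the exit() block dropped, no phi
                      is needed here and x is just 1 */
}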

Using dataflow based type analysis and some of my new changes the decompilation of extract_kyra.exe is currently looking a bit better. In particular, proc9, proc10 and proc2 are showing significant improvement.

Sunday, July 23, 2006

MinGW's tricky prologue code

Continuing with my ongoing test program extract_kyra.exe from the scummvm tools, I've been looking at the very first call in main. It would appear that this exe was compiled with stack checking runtime code enabled. That very first call is to a short little procedure that takes a single parameter in the eax register: the number of bytes to subtract from esp. Here's a disassembly of the procedure:

push ecx
mov ecx, esp
add ecx, 8

loc_405336:
cmp eax, 1000h
jb short loc_40534D
sub ecx, 1000h
or dword ptr [ecx], 0
sub eax, 1000h
jmp short loc_405336

loc_40534D:
sub ecx, eax
or dword ptr [ecx], 0
mov eax, esp
mov esp, ecx
mov ecx, [eax]
mov eax, [eax+4]
jmp eax

It not only subtracts the requested number of bytes from the stack pointer, it also tests the stack every 4k to ensure that a stack overflow hasn't occurred. This is all very amusing but it isn't the kind of stuff we want to see in a decompilation. If you're interested in low level detail like this, a disassembler is the tool to use to discover it.
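
Rendered as C, the procedure does something like this (with sp standing in for ecx and size for eax):

#include <stddef.h>

static char *alloc_stack(char *sp, size_t size)
{
    while (size >= 0x1000) {
        sp -= 0x1000;
        *(volatile char *)sp |= 0;   /* touch each page so the guard page
                                        mechanism can grow the stack */
        size -= 0x1000;
    }
    sp -= size;
    *(volatile char *)sp |= 0;
    return sp;                       /* becomes the new esp */
}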

The first thing we have to do is modify Win32BinaryFile::IsStaticLinkedLibProc to return true when it is passed the address of this procedure. I decided to encapsulate the pattern which recognises this procedure in a function called IsMinGWsAllocStack which just does a memcmp on the literal bytes. If the procedure contained any relocations I'd have to do something more complicated, but thankfully this one doesn't.

Next, I need to modify Win32BinaryFile::SymbolByAddress to return an appropriate name for any procedure that matches the pattern. I chose __mingw_allocstack, but I might modify this later if a more appropriate name (like the one the mingw programmers actually used) becomes known to me.

Finally, I need to modify PentiumFrontEnd::helperFunc to recognise this name and replace the call with a single statement: *32* r28 := r28 - r24.

Here are my changes to Win32BinaryFile.cpp and pentiumfrontend.cpp.

The next two procedure calls in main look like they're part of the runtime initialisation code too. I'll recognise and remove them. In summary, I now recognise five static library procedures in MinGW exes: __mingw_allocstack, __mingw_frame_init, __mingw_frame_end, __mingw_cleanup_setup, and an inline asm implementation of malloc. As a result, I have reduced the number of user procedures to be decompiled from 70+ to 7. Here they are; still atrocious, but there are now fewer of them. Turning on Mike's dataflow based type analysis makes the output different and sometimes better. Seeing as Mike is actively working to make this better I've modified the GUI to use dataflow based type analysis by default. Eventually I've got to add some more options to the init phase of the workflow so you don't have to recompile to fiddle with different options.

Thursday, July 20, 2006

Chronicle of a Decompilation

I've been promising to do this for a while, so here goes. I have downloaded a small open source utility called extract_kyra which is part of the scummvm tools. It doesn't really matter what the utility does. In fact, I prefer not to know, as knowing would give me an unfair advantage compared to, say, decompiling malware. What's important is that this tool is under the GPL, so it is permissible for me to decompile it. I will be writing a number of these posts that describe all the issues I've run into and the steps I've completed to produce the decompilation. I promise that I will not go and download the source code for this program until such time as I believe my decompilation is "complete".

The first problem with decompiling this program is that Boomerang has failed to find main(). I can tell this because the load phase of the workflow does not list main as a found entrypoint, it only lists start. This is because this exe was compiled with MinGW and it would appear this is the first time anyone has tried to decompile an exe compiled with this compiler.

To find main I need to modify the loader to recognise the runtime support code that loads it. We do this with hand crafted patterns in the function GetMainEntrypoint() of each loader. In this case, the loader is Win32BinaryFile. Here's a diff of Win32BinaryFile.cpp from CVSWeb that adds my pattern. Essentially I just start at the program's entrypoint, follow the first relative call to get to __mingw_CRTStartup then main can be found by following the second last relative call before an indirect call to ExitProcess. I invented this pattern by studying the disassembly of the exe produced by IDA Pro, which has an extensive library of patterns for matching runtime code.

Now that we have main we can move on to the decode phase of the workflow. Boomerang automatically follows all control paths from main and finds any procedures that are called. There are two kinds of procedures found: user procedures and library procedures. The former are the ones we're interested in decompiling; the latter are typically dynamically linked to an exe. Library procedures provide a much-needed source of type information. Therefore it is important to note any procedures which are unknown. For this decompilation we have a number of them. It is most effective to define signatures for these procedures before we continue the workflow.

If we double click on one of the <unknown> library procedures in the list, Boomerang will open a new tab with a text editor that is open on an appropriate signature file. It is important to ensure that the signature file is appropriate, otherwise you might end up with the wrong kind of signature, but most of the time Boomerang gets it right. The editor is scrolled to the end of the file and a single line comment is provided to tell you what signature you need to add. In the screenshot I have added a signature for TlsGetValue, a win32 api which is the first unknown library procedure that the exe calls. Once you've added the signature, remember to remove the single line comment and save (CTRL-s). The Library Procedures list on the Workflow tab will immediately be updated. If it isn't, you've made a mistake. Continue adding signatures to the appropriate signature files until there are no <unknown> library procedures remaining.
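
For example, the line I added for TlsGetValue is just the usual win32 prototype, something like this (check an existing signature file for the exact conventions):

LPVOID TlsGetValue(DWORD dwTlsIndex);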

That done, we can now start the decompile phase of the workflow. Boomerang will output each user procedure as it is visited and the procedure will turn blue when processing is complete for that procedure. Library procedures are not shown as they require no processing.

I immediately see a problem. At some point I decided it would be nice if we recognised user procedures that solely contain a jump to a library procedure and give them a descriptive name: __imp_libproc. These days I think we can do even better and just replace the call to the __imp procedure with a direct call to the library procedure. Looking in frontend.cpp I see some code that starts with:

// Is the called function a thunk calling a library function?
// A "thunk" is a function which only consists of: "GOTO library_function"
if( call && call->getDestProc() == NULL && call->getFixedDest() != NO_ADDRESS ) {

This really looks like what I want, but for some reason it isn't working. I think the problem lies in that check for getDestProc() == NULL. When I wrote this I was probably trying to solve a very specific problem and hadn't considered the general case. I think we always want to replace a call to a jump to a library procedure with a call to the library procedure. Removing that check gets rid of the __imp procedures.

The decompilation is now proceeding smoothly. A good 26 procedures have been processed but Boomerang appears to have stalled on proc65. At present we can't interrupt the decompilation to see what is going on.. In the future I intend to implement a nice big STOP button that you can press to interrupt decompilation, but for now you can only stop the decompilation by closing the window.

One way to monitor the decompilation is to turn on verbose mode. This causes vast amounts of information to be dumped to a file called log in the output directory. Wading through all this debugging information can be a pain, and it's the primary reason why I wrote a GUI in the first place.

Another way is to attach an external debugger. Using conditional breakpoints you can step through the source code and see what the problem is. Similarly you can use profiling tools to determine what parts of the code are being executed the most.

Before we resort to that, let's try the internal debugging. Close the Boomerang window and start it again. This time, we'll check the Enable debugging box on the Workflow tab before starting the load phase. In the decode phase we now see a checkbox next to each user procedure that has been found. Click on the header of the Debug column to toggle all the checkboxes to off and then toggle just the checkbox for proc65 to on. Proceed to the decompiling phase. The status bar will update rapidly as each procedure is processed until we get to proc65 when a new tab will open containing the RTL of that procedure. Decompilation is now paused and will remain so until you disable debugging through the Debug -> Enable menu option. You can temporarily restart the decompilation by pressing the Step button or selecting the Debug -> Step menu option. The decompilation will stop again once a single atom of processing has been completed or is ready to be initiated. The status bar will keep you informed of what is being done.

Pressing Step repeatedly we see the RTL updating rapidly. The button will go gray after each press indicating that the decompiler is busy and cannot be interrupted. When the decompiler is done processing a single step the button will be active again. Keep pressing the button until it doesn't become active again. The status bar reads: debugging proc65: after processing types. This gives us a clear indication of how far into the decompilation of proc65 we've got. I find it particularly strange that it is an after message where we've stalled. Before this message I saw a processing constants message and these two messages indicate to me that we're in UserProc::middleDecompile. Looking at this part of the code I see some code that checks for indirect jumps that have been removed by propagation of constants. This is something that happens often in win32 programs. There is a log output message here but no similar message for the GUI. Checking the log indicates that this code has executed. I'll add new code here so the GUI gets a message as well.

Now that I have some idea of where to look I can attach my external debugger and look at what is happening. In this instance, it appears that Boomerang has discovered a switch statement. Decoding each arm of the switch statement is taking a significant amount of time (around a minute for each). There are 13 arms to the switch statement in this procedure, so it only looks like we've stalled. This is another good place to put a GUI message. Question is, why are we taking so long to decode these arms of the switch?

Well, it turns out a function I wrote to add the semantics of overlapping registers (PentiumFrontEnd::processOverlapped) did not take into account the possibility of redecoded procedures. As such, every time the decompiler decoded an arm of the switch statement it was adding more and more overlapped register assignments to the procedure. A simple flag on each BB to indicate that processing has been done and should not be redone is sufficient to fix this bug. We now decode past proc65 but it appears we've stopped on another procedure. On further inspection it's obvious this is just a GUI problem.. I need to show progress of the final decompilation steps as apparently they can take a long time.

Now the decompiler runs for a very long time. It processes every available procedure and begins the final phase of removing all the accumulated cruft from them. Everything appears to be going great until we hit proc48. During the remove redundant parameters analysis we get a stack overflow. On my windows machine this results in the decompiler just disappearing. Attaching a debugger before the stack overflow occurs tells you it is a stack overflow but gives you no indication of where in your source code the stack overflow occurs. Thankfully we have the internal debugging and our log file to guide us. Here's a little-exercised code path:

bool UserProc::checkForGainfulUse(Exp* bparam, ProcSet& visited) {
    visited.insert(this);   // Prevent infinite recursion
    ...
    if (visited.find(dest) == visited.end() &&
            checkForGainfulUse(lloc, visited))
        return true;

What's wrong here? Yes, that's right. We're ensuring dest isn't in the visited set and then we're recursing.. I believe the author of this code intended to call dest->checkForGainfulUse. That'd make more sense.
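
That is, the recursive call presumably should have been:

if (visited.find(dest) == visited.end() &&
        dest->checkForGainfulUse(lloc, visited))
    return true;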

Removing redundant parameters is currently implemented using a fixed point algorithm (repeat until no change). It seems proc48 causes the decompiler to repeat a lot but eventually it does stop. Believe it or not, the decompiling phase actually finished. On my machine it took about 20 solid minutes of processing. The whole time it sat on 100% until the very end and then, ironically, it went back to 98%. Clearly there are some improvements that can be done to the progress bar logic.

We're now ready to enter the code generation phase of the workflow. All code is emitted to a single file in a subdirectory called extract_kyra in the output directory. For some reason the progress bar only went up to 79%. Again, clearly some work I need to do there. Double clicking on any procedure name will open the output file. In the future I hope to make it scroll automatically to the selected procedure and provide all sorts of refactoring transformations to aid the user in making the output maintainable.

So what's the output look like? In a word: atrocious. You can look at it if you like (this page also has all future versions of my decompilation of this program). Most every procedure has too many parameters, too many locals and is missing types or even whole code sequences. This poor output is a result of bugs in the decompiler. Now that I've gotten to this stage I can start attacking each failure and, one bug at a time, improve the quality of the output. Then maybe the next large program I try will require less effort to decompile, and the next program will require less again, and so on. One day we'll have a click and go decompiler for machine code executables, but we're still a long way off.

Wednesday, July 19, 2006

Bi-directional Dataflow Type Inference

After reading this paper I've started implementing a brand new type analysis in Boomerang. The first step is calculating reverse dominance frontiers. A regular dominance frontier is the set of nodes that have a predecessor that is dominated by the given node, but is not itself dominated. These are the nodes where phi statements are to be placed when transforming a program into SSA form. The reverse dominance frontier is much the same, except it is the successors that must be post-dominated. Boomerang already calculates immediate post-dominators for use by the code generation phase, but we've never before had a use for reverse dominance frontiers. The paper describes an extension to SSA form called static single information form (SSI) which introduces a new kind of assignment: the sigma function statement, which is placed on the reverse dominance frontiers. The purpose of this new statement is to split definitions before uses on different code paths. I will be using a more verbose form of this notation where a sigma statement for each use is emitted.
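
To illustrate (in my notation): a sigma statement goes on each out edge of a branch, so each path gets its own version of the variable, and anything learned about one version stays on its path:

        x1 := ...
        BRANCH x1 >= 0 goto L2
L1:     x2 := sigma(x1)    ; uses on this path now refer to x2
        ...
L2:     x3 := sigma(x1)    ; and uses on this path refer to x3
        ...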

For some reason the algorithms used to place sigma functions are more complex than the algorithms we use to place phi functions. This is most probably because we're using the simple dominance frontier based algorithms that don't guarantee O(EV) performance. As such, I'm starting to wonder if sigma functions really are just a performance enhancement and if the same thing can't be achieved with just regular assignments strategically placed on each branch. In any case, it seems to me that reverse dominance frontiers are only necessary so you can calculate minimal SSI form. Reducing the number of sigma functions you introduce is obviously a good thing, but it may not be particularly important for Boomerang as we often detect and remove unused statements anyway.

At this juncture I'm abandoning my implementation. Boomerang needs a lot more retrofitting to support this form than I think is necessary for type inference. Instead, I'm going to try implementing the dataflow algorithm without SSI and see if/how it fails. Then, if necessary, I'll add some adhoc sigma like manipulations to merge type information from uses.

Monday, July 17, 2006

Short Circuit Analysis

I've checked in some code that detects branches to branches and merges them if they meet some basic requirements. As such, Boomerang can now generate code like:

if (a < b && b < c) {
    // x
} else {
    // y
}

instead of generating a goto statement. I've checked in a couple of test programs that exercise this analysis. I haven't looked at how this analysis affects loops or more obscure control structures.. so there could well be bugs here.

Tuesday, July 04, 2006

Unfinished Assortments

I received an email of inspiration from Mike a week ago outlining how similar conjunctions and disjunctions in short-circuited if statements are at the binary level. After looking at the problem myself I found it was a pretty simple problem. If one of the two out edges of a branch node is another branch node which contains only a branch statement, and the destination of that statement is shared with the first branch statement, then you can merge the condition of the second branch into the condition of the first branch. Exactly what combination of fall/taken edges are shared by the two branches is what determines whether an || or a && should be used. This is a pretty easy transformation to do just before code generation (or just after decompilation) and I'm about half way through implementing it.
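
Concretely, the two interesting combinations look something like this (pseudo-asm):

    ; fall-through leads to the second branch, taken edges share T: an ||
    if (c1) goto T
    if (c2) goto T          =>   if (c1 || c2) goto T

    ; taken edge leads to the second branch, fall-throughs share E: an &&
    if (c1) goto B2
    goto E
B2: if (c2) goto T
E:  ...                     =>   if (c1 && c2) goto T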

Unfortunately I got sidetracked. My work, you know, the people who pay me, they have me doing - wait for it - Java development. I moaned about not just being a one trick pony a few too many times, so they've decided to put me to work on something other than C/C++. That's ok. So now I'm deep in the bowels of JBoss and Seam and Java Server Faces and Servlets and xhtml. Sometimes I feel dirty, but it passes.

In between being sidetracked I manage to find time to fool around with mysql. That's a nice big pain in the ass. I hate databases.

I've also been fooling around with Ogre3d again. In many ways it's fuckin' great, but if you want to do something that is not exactly on the beaten path, watch out. I've been trying to draw polygons, with the hope of providing an interface to extrude polygons into recognisable objects. Something with a Google Sketchup style interface would be nice. Unfortunately, to draw polygons you need to break them down into triangles. Why triangles? Cause that's all Ogre can draw. I managed to find some C++ code on flipcode that claims to be able to do it. Guess I'll try it out next time I get a chance. You'd figure something like this would be present in a graphics engine. You'd be wrong, because 99% of the people who use Ogre just load meshes from files, which they export from 3d editing programs.

Saturday, June 24, 2006

What license is that MTA?

A Mail Transfer Agent is that bit of software that listens on port 25 and accepts mail when you send it. There's a lot of them available, but which ones are truly free?

I find that a good moral compass on questions of licensing is to look at the OpenBSD project. What they use is typically the most free you can get. So what do they use?

Sendmail, which has these license terms. They're pretty ass. Basically you can distribute it if you're "open source" in the GPL sense of the term - you have to promise to hand over source code - or if you are "freeware". So yeah, if you want to make a binary only CD of OpenBSD and include Sendmail you're going to have to promise whoever you give it to that you'll give them the source if they ask, or you can't charge them anything more than distribution costs. Seems kind of anti-OpenBSD-philosophy to me. But maybe there's nothing better out there.

What about qmail? Ask anyone and they'll tell you it's more secure than Sendmail, so why doesn't OpenBSD, the poster OS for security, use that? Well, the distribution terms for qmail are about as free as Microsoft's shared source licenses. i.e., they're not.

What about postfix? They've got a happy mouse with a mail sack logo and everyone loves them, so what license have they got? Well, it's a bit hard to find out. If you dig through the web page you won't find a reference to it, but if you download the source code you'll find it is under the IBM Public License. Which is not only very GPL-like, it is also incompatible with the GPL. Something to do with patents.

Exim is pretty popular. It's under the GPL.

There's the Courier MTA, which most people probably thought was just an MDA, I know I did. It's also under the GPL.

The Apache project has an MTA called James which is written in Java. Ironically it's the most free. It's under the Apache License 1.1 which has an obnoxious advertising clause but doesn't require you to provide source code. If you like your Java there's also the JBoss MTA, which is under the LGPL.

So I guess if you want a BSD licensed MTA you need to go dig up an old copy of Sendmail. I was honestly surprised and thought OpenBSD would have forked Sendmail way back when it was BSD. OpenMTA? Maybe, someday.

Sunday, June 18, 2006

Order of Decompilation

Which order you decompile the procedures of a program in can determine how much information you have available to make good analysis. In Boomerang we've developed a system of depth first search of the call graph, which takes into account recursion, to have the most information about parameters available when needed. For example, if a program consists of two procedures, one called by the other, it makes the most sense to decompile the procedure which is the leaf in the call graph so that the procedure that calls it has the most information about what parameters it needs to pass to it.

What happens if the leaf procedure takes a parameter that is a pointer to a compound type? By the parameter's usage the decompiler may be able to determine that it is a pointer, it might even be able to determine that it is a pointer to a compound, but unless every member in the compound is accessed in a way that restricts that member's type sufficiently, the type that the decompiler can determine for the parameter is going to be very vague.

On the other hand, the value that the calling procedure is passing to the leaf procedure may be the result of a library call, or in some other way have a well defined type. As such, from a typing perspective, it would appear to make more sense to do top down processing of procedures in the call graph. However, this can only be done once a bottom up processing has been completed.
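
A contrived example (invented code, but it shows the shape of the problem):

#include <time.h>

static int leaf(struct tm *t)     /* bottom-up, the best we recover is
                                     something like: int leaf(int *param1) */
{
    return t->tm_year;            /* seen only as M(param1 + 0x14)
                                     on a typical struct tm layout */
}

int main(void)
{
    time_t now = time(NULL);
    return leaf(localtime(&now)); /* localtime's known signature pins the
                                     argument type to struct tm* exactly */
}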

As such, I am currently looking at implementing an analysis that determines when additional type information for a called procedure that has been decompiled is discovered and marks that procedure as requiring additional processing. Hopefully no processing in the called procedure will need to be "undone".

Tuesday, June 13, 2006

Binary Release Alpha 0.3

I've added a new binary release of Boomerang for win32 to the sourceforge project. The Linux binaries are also up. If you'd like to try making a release for some other platform, please let me know.

What can the new release do? Well, it crashes less. It supports ELF .o files much better than previous releases. It includes some changes that make for better output in some instances. Overall, just general improvements.

What can't the new release do? This is still an alpha release. That means we don't expect you to be able to do very much work with it. Running it on /bin/ls will still give you horrendous output, but try /bin/arch.

Saturday, June 10, 2006

So Many Types and No (Good) Type Analysis

The type analysis in Boomerang remains a nice big mess. There have been three attempts at a type analysis system: dataflow based, constraint based and adhoc. At the moment the adhoc gives the best results, and the other two crash, a lot. Sometimes there is an absolute orgy of types in an input program, and the type analysis will assign the type void to a local. I've just added some more adhoc type analysis that will recognise when the programmer is assigning the elements of an object of unknown type to an object of known type and copy the known types for the elements to the unknown type. This is very specific but hopefully it occurs in more than just the one input program I was looking at. In C the programmer would have written something like this:

struct foo *p = someFuncThatReturnsAFoo();
p->name = global.name;
p->count = global.count;
p->pos = global.pos;
p->other = 0;

If that call is to a library proc we will have the struct foo, and know that p is a pointer to one, but for global we probably only know that it has a size of 24 bytes, if we know anything. Clearly we need to invent a type for global and clearly that type must be a compound consisting of 3 elements, so why not copy the names and types of those elements from what we are assigning to.
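
So, under this heuristic, the invented type for global would come out something like this (the member types are whatever struct foo says they are - I'm assuming a pointer and two ints here for illustration):

struct global_type {
    char *name;        /* names and types copied from struct foo */
    int   count;
    int   pos;
    char  unknown[12]; /* if the 24-byte size estimate is right, the rest
                          has never been accessed */
};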

The question is, can a generic universal type analysis algorithm do this? Or is this simply a heuristic that works well for some programs but would produce strange and unpredictable results for others.

Friday, June 09, 2006

Conflation of Pointer and Array Types

A common source of confusion for new C programmers is the conflation of pointers and arrays that C does. I often think of the dynamic semantics of the language when I'm thinking deeply about passing arrays to functions. Typically, you can tell an experienced programmer that C always passes arrays by reference, never by value, and they won't go wrong.

Not all languages are like this, so in Boomerang we try to represent pointers and arrays as separate, non-conflated types. In our type system an array is a type used to describe those bytes in memory that contain a finite number of objects of a particular base type. Similarly, a pointer is a type used to describe those bytes in memory that contain an address, which if followed will reveal a single object of a particular base type.

As such, it is necessary to refer to some things explicitly in the Boomerang type system that are typically implied by the C type system. For example, a C string is often written in the C type system as char *, or "a pointer to a character". This is clearly a misnomer. The only string that is accurately described by this type is the empty string: C strings must be zero-terminated, and if you're only pointing to a single character then that character must be zero. In the Boomerang type system we would write a C string as char[] *, or "a pointer to an (unspecified length) array of characters".

When I wrote Boomerang's header file parser I had a choice of which type system to use. Should I assume the signature file was using the C type system and silently translate the contents into the Boomerang type system? Or should I allow the user to specify types exactly as they would appear if we were to call Type::print on the resultant object? I tried both and found that 99% of the time the C type and the Boomerang type were the same, so the signature file expects the types to be in Boomerang format. This means that occasionally you will see something weird in a Boomerang signature file. For example:

typedef struct {
    int token;
    const char* name;
    OptionValueType type;
    ValueUnion value;
    Bool found;
} OptionInfoRec;
typedef OptionInfoRec *OptionInfoPtr;
typedef OptionInfoRec OptionInfoRecs[];

typedef OptionInfoRecs *AvailableOptionsFunc(int chipid, int bustype);

What's that function type returning? It's a pointer to an array of OptionInfoRec structs. In C we'd just use OptionInfoPtr, because we assume the programmer knows that an AvailableOptionsFunc will return more than just one OptionInfoRec. In fact, an AvailableOptionsFunc is supposed to return a pointer to a null terminated array of OptionInfoRec structs, where "null terminated" means most of the members of the last OptionInfoRec are zero. It's pretty hard to define a sensible type for that, but C programmers work with types like that all the time, so we have to try.

This also means that in the code generator for C we have to recognise when certain operations are not needed. For example, this assignment:

344 *32* m[local6][local1].name := "foo"

would cause code like this to be emitted:

OptionInfoRecs *local6;
int local1;
...
(*local6)[local1].name = "foo";

which is correct, but is not very pretty. We're not using the full syntactic sugar of the language. This is much better:

OptionInfoRec *local6;
int local1;
...
local6[local1].name = "foo";

and is probably how the programmer wrote it in the first place.

Thursday, June 08, 2006

Reverse Strength Reduction

Strength reduction is a compiler optimisation that tries to replace expensive operations involving a loop induction variable with less expensive operations. The most common example of strength reduction is the replacement of array indexing with pointer arithmetic. Consider this program fragment:

for (int i = 0; i < 25; i++)
    arr[i] = 9;

where arr is an array of integers. An unoptimised compilation, in an RTL form, might look like this:

1 *32* r25 := 0
2 BRANCH 6
Highlevel: r25 >= 25
3 *32* m[a[arr] + r25 * 4] := 9
4 *32* r25 := r25 + 1
5 GOTO 2

which would actually be some very nice RTL for the decompiler to discover, because it is easy to recognise that arr is an array with stride 4 and replace statement 3 with:

3 *32* arr[r25] := 9

Unfortunately, the compiler will have gotten to the RTL before we do, and almost any compiler will do some strength reduction to get rid of that multiply on the left side of statement 3. So what the decompiler will see is more like this:

1 *32* r25 := 0
2 BRANCH 6
Highlevel: r25 >= 100
3 *32* m[a[arr] + r25] := 9
4 *32* r25 := r25 + 4
5 GOTO 2

What we need to do is some analysis that will reverse the strength reduction and turn this RTL into that preferable RTL. As it turns out, this isn't too hard to do for simple loops. Here's what the RTL looks like once it is in SSA form:

1 *32* r25 := 0
81 *32* r25 := phi(1, 4)
2 BRANCH 6
Highlevel: r25{81} >= 100
3 *32* m[a[arr] + r25{81}] := 9
4 *32* r25 := r25{81} + 4
5 GOTO 81

We can simply do a pass over the statements of the procedure looking for any assignment of the form x := x{p} + c, where c is a constant and p is a PhiAssign that has only two elements: the assignment we're looking at and another assignment of the form x := 0. This is a "high level pattern", and when matched it indicates that we can apply a "transformation". All we have to do to transform this RTL into the array indexing form is replace every occurrence of r25{81}, except for the one in the increment statement, with r25{81} * c, and replace c in the increment statement with 1. Another high level pattern will recognise m[a[arr] + r25{81} * 4] and replace it with arr[r25{81}].
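To make the two forms concrete, here is a small compilable illustration of what the compiler did and what the reversal must recover; the array name and bounds are taken from the example above:

#include <cassert>

int main() {
    int a[25], b[25];

    // The form the programmer wrote, and the form the decompiler wants to emit:
    for (int i = 0; i < 25; i++)
        a[i] = 9;

    // The strength-reduced form the compiler emitted: the induction variable
    // now advances by the stride (4 bytes) instead of by 1.
    char *base = reinterpret_cast<char *>(b);
    for (int i = 0; i < 100; i += 4)
        *reinterpret_cast<int *>(base + i) = 9;

    // Both loops compute the same array.
    for (int i = 0; i < 25; i++)
        assert(a[i] == b[i]);
    return 0;
}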

Wednesday, June 07, 2006

Using Qt4 and the Boehm Garbage Collector on Linux

I had a major issue where the Boehm garbage collector was crashing and spitting errors because of my use of Qt4's QThread class. The problem was simple enough: Qt4 calls pthread_create when it should be calling GC_pthread_create. I could have solved this problem by modifying qthread_private.cpp to do this, but that would mean distributing a modified Qt4, which is just silly for such a small change. So after much annoyance, I managed to come up with a solution that, although not pretty, seems to work. As such, there will be a Linux version of the GUI available to download when I make binary packages sometime in the next week.

Forcing a Signature

A while ago I added a bool to the Signature class that allowed the user to specify that the signature was already fully formed and did not need any processing. This was part of the "symbol file" hack that we used to do procedure-at-a-time decompilation with the command line tool. I noticed today that we were no longer honouring the forced bit; for example, we were removing unused parameters and returns from the signature, so I fixed that. It occurred to me that any procedure we discover via a function pointer is an excellent candidate for setting the forced bit on. The results were pretty spectacular, as locals and globals were soon inheriting type information from the parameters of the signature. Unfortunately, the same could not be said of returns. In particular, consider some code like this:

mystruct *proc5()
12 { *v** r24 } := malloc(343)
13 m[r24{12} + 4] := "foo"
14 RET r24{12}

It's pretty clear that any local we create for r24 should be of type mystruct*, but the first definition of r24 has us assigning it a type of void*. We could use some kind of lattice logic to determine that mystruct* is "better" than void*, but the fact that the signature is forced tells us that this is no debate: r24 must have the type mystruct*. This analysis can be done in the early decompilation stage, but where to put the type? At the moment I'm considering changing the type of the return in the call statement. So line 12 will become:

12 { *mystruct** r24 } := malloc(343)

and immediately this type will flow on to turn statement 13 into something similar to:

13 m[r24{12}].name := "foo"

which is much preferable.
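In C terms, the goal is for proc5 to come out looking something like the sketch below; the layout of mystruct is an assumption, since all we know from the RTL is that the member at offset 4 is a string called name:

#include <stdlib.h>

// Assumed layout: some 4-byte member first, so that name lands at offset 4.
struct mystruct {
    int id;                 // hypothetical first member
    const char *name;       // the member written by statement 13
};

struct mystruct *proc5(void) {
    // The forced signature, not malloc's void*, dictates the type here.
    struct mystruct *p = (struct mystruct *)malloc(343);
    p->name = "foo";        // was: 13 m[r24{12}].name := "foo"
    return p;
}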

Tuesday, June 06, 2006

Types, Globals and Varargs

I have a sample input program that has some code similar to this:

228 { *32* r24, *32* r28 } := CALL knownLibProc( .. arguments .. )
..
307 *32* m[r24{228}] := 232
308 *32* m[r24{228} + 4] := 91
309 *32* m[r24{228} + 8] := "some string"

where knownLibProc returns a pointer to a struct in r24. Early in the decompilation this type will be propagated into statements 307, 308 and 309, producing:

307 *32* m[r24{228}].size := 232
308 *32* m[r24{228}].id := 91
309 *32* m[r24{228}].name := "some string"

Our intermediate representation doesn't have an operator equivalent to C's -> operator; the above is more like writing (*p).size, but the backend takes care of that and will emit a -> instead. Unfortunately I was getting an assertion failure before I even got to that. The problem was that the 228 instance of r24 was being replaced by a local variable, and that local was not inheriting the return type of the call. So the ad hoc type analysis would take a look at an expression like m[local30].size and come to the conclusion that m[local30] has an impossible type, because the type of local30 was int.

Fixing this was not as easy as it should have been, because the ad hoc type analysis is such a big mess. Investigating this bug I found that globals passed as arguments to library procedures were not being typed with the obviously valuable type information in the libproc's signature. I then discovered that even when they were typed correctly, the initial values for those globals were not being calculated correctly. In the case of a struct type (called a compound type in Boomerang) we weren't calculating any initial value at all. This was obviously a terrible oversight.

Finally, I've found a problem with calls that take a variable number of arguments. I was under the impression that the new defcollectors were effectively collecting live statements and adding them as appropriate, but apparently not. For the binary I am working on, it appears that the last argument of a vararg call is always 0, so I should be able to put in a hack that adds arguments to the call until I hit one with that constant.
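Sketched with hypothetical names (the real code would walk argument expressions in the RTL, not raw stack words), the hack amounts to this:

#include <cstddef>
#include <cstdint>
#include <vector>

// Keep adding argument slots to the call until the sentinel constant 0
// turns up; for this binary, that 0 is itself the final argument.
std::vector<uint32_t> collectVarargArguments(const uint32_t *slots, size_t maxSlots) {
    std::vector<uint32_t> args;
    for (size_t i = 0; i < maxSlots; ++i) {
        args.push_back(slots[i]);
        if (slots[i] == 0)
            break;
    }
    return args;
}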

Otherwise, the UI continues to evolve. I can now view struct information for any named type, which is particularly useful for sorting out padding issues. One day I might consider an analysis to determine the padding of a struct automatically from use information, but for the moment it's easiest to just write a sample program with the original header information, calculate byte offsets to members and compare them with the bit offsets in the parsed signature information.

Sunday, June 04, 2006

Memory Leaking and Automatic Collection

I checked in a change to Makefile.in today that lets one disable the garbage collector more easily. I then tried out a few memory leak detection tools. First I tried ccmalloc. I couldn't even get it working; it just crashes, even with the sample program on its web site. Then I gave mpatrol a go. I'd heard good things about mpatrol. Unfortunately it doesn't seem to like dynamic loading of shared libraries, and (for no good reason) we don't link to the loaders statically in Boomerang. So I gave up and installed valgrind. It still rocks. It not only told me how much memory we were leaking and where from, it also told me about some use-before-define errors of which I wasn't aware. I checked in fixes.

Next, I had a look at Boost's shared_ptr class. I'm hoping to figure out a way to easily add something like this to replace the garbage collector. Unfortunately, the use of smart pointers is anything but easy. You'd think that you could define something like this:

typedef boost::shared_ptr<Prog> ProgPtr;

and then replace every occurrence of Prog* with ProgPtr. Of course, that doesn't work. Not only do you have the problem that no assignment from a raw pointer is defined for a shared_ptr, you also have the problem of where to put this declaration. You can't put it after the definition of the Prog class, because how do you forward-declare ProgPtr if your Prog class includes any references to ProgPtr? So you try to put the typedef before the class definition, but then you get errors which indicate that a forward declaration of Prog is not sufficient.
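For what it's worth, the pattern below does compile for me; shared_ptr is specifically designed to tolerate incomplete types, needing the complete type only where objects are created or destroyed. A minimal sketch, not our actual code:

#include <boost/shared_ptr.hpp>

class Prog;                               // forward declaration is enough here
typedef boost::shared_ptr<Prog> ProgPtr;  // no complete type required yet

class Prog {
public:
    ProgPtr next;                         // smart-pointer members are fine too
};

The pain, of course, is everywhere else: every function that takes or returns a raw Prog* has to change at the same time.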

In many ways, this is the greatest strength of the garbage collector: it's a drop-in replacement. I'm starting to think that adding all those thousands of delete statements would be less trouble than using smart pointers. If anyone knows how to go about migrating to smart pointers without so much pain, I'd appreciate an email.

Friday, June 02, 2006

Debugging the Decompiler

One of the most useful features of the new GUI will be the ability to step through a decompilation and inspect the RTL at each step. To date I have implemented a Step button that allows the user to inspect a procedure before each major phase of the decompilation of that procedure. In the future, I intend to add more debugging points, perhaps even down to the resolution of a single RTL change. I expect that some way for the user to specify the desired level of resolution will be required. Whether that is a bunch of menu options, or a spinner, or even multiple Step buttons (step to next change, step to next analysis, step to next phase, step to next procedure, etc.), I haven't decided.

The UI already has a coarse form of breakpoints. At the decoding phase you can specify which procedures you want to inspect, and the decompilation will run without stopping until it gets to one of those procedures. It would be sensible to allow the user to set a breakpoint on a particular line of the RTL and run the decompilation until that line is changed.

Mike has sent me a paper which defines the term "time travel debugging", where the user has not only the ability to step forward, but also the ability to step backwards. This is particularly useful in conjunction with breakpoints. If something has gone wrong, you need only place a breakpoint on that something and run backwards until the breakpoint fires; then you can see what broke. Implementing something similar in Boomerang is definitely possible, but it requires something like the current "memo" system to checkpoint the RTL at appropriate intervals.

Multithreaded Garbage Collection on Linux

I tried compiling the Boomerang GUI on Linux last night. After much effort getting Qt4 to compile, I was hopeful that everything would just work. Unfortunately, the trick I used on WIN32 to get the garbage collector to only collect memory allocated by the decompiler thread doesn't work on Linux. Apparently you're supposed to call GC_pthread_create instead of pthread_create to start new threads. Well, I'm not calling either; I'm getting Qt to create my thread for me. So what to do? I guess I could modify Qt to use GC_pthread_create, but that means any binaries I distribute would have to include my modified Qt. I'm going to look into ways to register a thread with the allocator directly.
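One possibility, sketched below with a hypothetical QThread subclass: newer versions of the Boehm collector expose entry points for registering a thread it didn't create. This assumes a build with GC_THREADS and a collector recent enough to have these calls:

#define GC_THREADS
#include <gc.h>

// Hypothetical sketch: register the Qt-created decompiler thread with the
// collector from inside the thread itself, first thing in run().
void DecompilerThread::run() {            // hypothetical QThread subclass
    struct GC_stack_base sb;
    GC_get_stack_base(&sb);               // find this thread's stack base
    GC_register_my_thread(&sb);           // make the thread known to the GC
    doDecompile();                        // hypothetical: the actual work
    GC_unregister_my_thread();            // deregister before the thread exits
}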

Another alternative is to just stop all this garbage collector nonsense, but modifying Boomerang to delete objects when appropriate is out of the question. I have seriously considered using smart pointers, possibly from Boost, but as yet I've not made a firm decision. It would be nice to remove the libgc dependency, especially as I have been thinking of removing the libexpat dependency recently.

Thursday, June 01, 2006

Displaying Pretty RTL

I did some fixes to the HTML output option in the various print() functions of Boomerang today. This is all so I can display RTL as prettily as possible. I'm thinking that hovering the mouse over a RefExp should highlight the line that it references. That's my first goal, and then I'll work on a context menu. All this is possible because I can hide pointers in the HTML output which refer back to the intermediate representation. Qt4 has the advantage that good HTML display widgets are standard parts of the toolkit. What I don't intend to do is allow freeform manipulation of the RTL. That would require an RTL parser, which I'm simply not in the mood to write, at least this month.

Wednesday, May 31, 2006

Support for Linux Kernel Modules

Boomerang can now load Linux kernel modules and find the init_module and cleanup_module entrypoints. Loading ELF sections is a tricky business. I now honour the alignment information, which means kernel modules will load into Boomerang with the same section placement as they are loaded into IDA Pro. There were also some problems with the way we handle R_386_PC32 relocations; checking the type of the symbol indicated by the relocation solves them. I also managed to speed it up significantly by removing an unnecessary check. Hopefully it really is unnecessary.

My globals-at-decode-time code is now checked in. I await howls of disapproval on the mailing list, but hey, I do so every time I check in.

Tuesday, May 30, 2006

More GUI Work and Relocations

Today I got a lot of work done on the GUI. I can now edit signature files in a tab at decode time and the corresponding signatures and parameters are shown in the Library Procedures table. For the rare times where a decompilation actually makes it to code generation without something going wrong, I can now open the output file and edit it in a tab. I even gave my main window a title.

On the topic of relocations and symbol information: I can now load a linux .o file and get absolutely no addresses in my RTL. This is because I take a relocation at memory location x as an absolute guarantee that memory location x contains an address. I look up the address in the symbol map and replace the constant that would ordinarily be produced with an a[global] expression. One surprise I had on my test binary was that string constants are not assigned a symbol; I expected at least a "no name" symbol. As such, I speculatively test the memory at the given address and, if a suitable string constant is found, I replace the address constant with the string constant.
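The speculative test is nothing fancy; something along these lines, with the acceptance criteria being my own assumptions:

#include <cstddef>

// Does the memory at this address look like a C string? Printable bytes
// (or common whitespace) followed by a terminating zero within maxLen.
bool looksLikeString(const unsigned char *p, size_t maxLen) {
    size_t i = 0;
    while (i < maxLen && p[i] != 0) {
        bool printable = p[i] >= 0x20 && p[i] < 0x7f;
        if (!printable && p[i] != '\t' && p[i] != '\n' && p[i] != '\r')
            return false;                 // unprintable byte: not a string
        ++i;
    }
    return i > 0 && i < maxLen;           // non-empty and zero-terminated
}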

When you consider that I'm doing all this at decode time, it leaves the decompile stage to focus on the hard problems of recognising parameters, etc. I've yet to decide if an STT_FUNC symbol implies that the procedure conforms to a known ABI. This is clearly true for some binaries but may not be true for all. Perhaps a user specified option is the way to go there. Then I could recognise parameters at decode time.

Another interesting source of information is the STT_FILE symbols. These are definitely of interest, as they tell us how the output of the decompiler should be clustered. "Clustering" is a term used for any grouping of code or data; one could say that OO programming is a clustering philosophy. Boomerang has supported emitting procedures to a tree of files for some time, although all globals currently go into the "main" output file (typically named program.c). Of course, this will be of more interest to me when I actually have output worthy of clustering.

Monday, May 29, 2006

Woe of the Instruction Decoder

Boomerang uses the NJMC toolkit to decode instructions. These are the files in frontend/machine/pentium (or sparc or ppc or whatever you're interested in). We chose this technology because it meant we didn't have to write code to support a new architecture; we could just write a "specification" file. Unfortunately, the NJMC toolkit is slowly rotting. It is hard to build. I've never built it. Mike has built it a couple of times (and failed a lot more times). Every architecture is different, and no-one maintains it. We also have some issues with the code it generates. It produces huge cpp files which struggle to compile on some build targets and make the resulting binary much bigger than it could be.

So how much work is it to replace? I considered writing a new tool that would take the same input files as the NJMC toolkit and generate the same output, but that only solves half the problems. Then I came to wonder: what's wrong with just using a simple disassembler and modifying it to return a DecodeResult? There's even BSD-licensed code available for a number of architectures that we could base it on. We could modify PentiumFrontend::decodeInstruction to call both the NJMC toolkit generated code and the new code. If the results differ, we output a warning to the LOG file and use the result from the original code. Once all the warnings have been fixed we can retire the original code.
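In code, the migration plan looks roughly like this; njmcDecode, disasmDecode and equalResults are hypothetical stand-ins for whatever the real entry points end up being:

// Run both decoders, warn on any disagreement, and keep trusting the old
// code until the warnings dry up.
DecodeResult PentiumFrontend::decodeInstruction(ADDRESS pc) {
    DecodeResult oldResult = njmcDecode(pc);    // NJMC-generated decoder
    DecodeResult newResult = disasmDecode(pc);  // new disassembler-based decoder
    if (!equalResults(oldResult, newResult))
        LOG << "decoder mismatch at " << pc << "\n";
    return oldResult;
}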

In the meantime, Mike and I have managed to put together a binary package for x86 linux users that they can use instead of going through the pain of trying to build the toolkit.

Sunday, May 28, 2006

More relocation mayhem

I couldn't get to sleep last night; something about relocations was nagging at me. Finally, around 2am, it hit me. I got up and sent a long email to Mike. The problem is, we've been thinking about relocations way too literally. The ElfBinaryFile loader treats relocations as the loader of an operating system would: as numbers to be added to addresses in the image. But a decompiler wants a lot more information than this. The relocations tell us something that is gold, and we just ignore it. For example, suppose you have a relocation like this:

00000009 R_386_32 myarray

To a traditional loader it says: go look up the symbol myarray and calculate its address in memory, then go to offset 9 in the .text segment and add that address to whatever is there. But to a decompiler, it says that we should add a global reference to myarray to the expression we generate for the address at offset 9. So say the instruction that included offset 9 was decoded as something like this:

r24 := m[r27 * 4 + 12]

then we should be processing this relocation to change the expression to something like this:

r24 := m[a[myarray] + r27 * 4 + 12]

which the decompiler can recognise as an array reference and finally produce code like:

local1 := myarray[local2 + 3]

What's funny is that Boomerang does a lot of work to figure out what a constant refers to and produce just these kinds of expressions. If we were to take better note of what the relocation information is telling us, we wouldn't need to do all this work.

But now the hard part. Boomerang is not designed to perform this kind of processing of relocations. The BinaryFile classes are completely separate from the NJMCDecoder classes. When we want to decode an instruction we simply pass in the native address we want to decode and the delta between the native address and the host address. Even if we could modify the decoder to add relocation information to the expressions it generates (and I'm having some difficulty seeing how we can, as dis_Mem appears to hide the addresses of immediate values), we'd have to redesign this to pass in a BinaryFile object. I see a lot of work in my future.
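Stripped of all the real machinery, though, the idea is just a lookup keyed by the immediate's location. Here is a self-contained toy in which strings stand in for real expression trees:

#include <cstdint>
#include <map>
#include <string>

// Toy sketch: relocations keyed by the offset of the immediate they cover.
// When the decoder is about to emit a constant, it consults the table.
std::map<uint32_t, std::string> relocSymbols = { {0x9, "myarray"} };

std::string immediateToExp(uint32_t offsetOfImmediate, uint32_t value) {
    auto it = relocSymbols.find(offsetOfImmediate);
    if (it != relocSymbols.end())           // a relocation covers this spot
        return "a[" + it->second + "] + " + std::to_string(value);
    return std::to_string(value);           // otherwise, a plain constant
}

For the example above, immediateToExp(0x9, 12) yields "a[myarray] + 12", which slots into the m[...] expression exactly as shown.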

Saturday, May 27, 2006

Overlapping registers

The x86 instruction set is an ugly mess. Often, in a desire to make things more flexible, people make things harder to understand. In the case of instruction sets, this makes a decompiler's job more difficult. Consider the following x86 asm code:

mov eax, 302
mov al, 5
mov ebx, eax


What value is in ebx? It's easier to see if we write 302 as 12Eh. Then we can easily say that ebx contains 105h, that is, 261. In Boomerang, the decoder would turn those three instructions into this RTL:

*32* r24 := 302
*8* r8 := 5
*32* r27 := r24


This is clearly wrong, as the information that r8 overlaps the bottom 8 bits of r24 is completely absent. This is more correct:

*32* r24 := 302
*16* r0 := truncu(32, 16, r24)
*8* r12 := r24@15:8
*8* r8 := truncu(32, 8, r24)
*8* r8 := 5
*32* r24 := r24@31:8 | zfill(8, 32, r8)
*16* r0 := truncu(32, 16, r24)
*32* r27 := r24


But just look at the explosion in the number of statements. I haven't even included the statements to define bx, bh and bl, which should go after the assignment to ebx. Boomerang currently contains code to add these statements, but because of the explosion of statements it is typically disabled. The users of Boomerang would rather have wrong RTL than unreadable RTL.

Can we improve on this? I am currently writing some code that will search the procedure for any use of the 16-bit and 8-bit overlapping registers. If it finds no use of a particular register, it will not output statements to define that register. So, for the code above, the RTL would be only:

*32* r24 := 302
*8* r8 := 5
*32* r24 := r24@31:8 | zfill(8, 32, r8)
*32* r27 := r24


This is readable at decode time, and very early in the decompilation stage it reduces to:

*32* r24 := 302
*8* r8 := 5
*32* r24 := 261
*32* r27 := 261


Dead code elimination will then remove the first statement. The second statement will go away when unused statements are removed, because inevitably r24 will become an output of the procedure, and the decompiler will not add r8 as an output since r24 already covers it.
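The filter itself is simple. A sketch, using the pentium register numbers from the RTL above (r24 = eax, r0 = ax, r12 = ah, r8 = al) and hypothetical names:

#include <set>

// First pass: collect every register number the procedure's RTL mentions.
// Second pass: emit overlap-maintenance defines only for registers in use.
bool shouldEmitOverlapDef(const std::set<int> &usedRegs, int reg) {
    return usedRegs.count(reg) != 0;        // nobody reads it: don't define it
}

// e.g. after "mov al, 5" (a write to r8), the defines of r0 (ax) and r12 (ah)
// are dropped because nothing uses them, while the define of r24 (eax) is
// kept because the final mov reads it.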

Better Support for Relocatable ELF Files

Looking at how the Boomerang ELF loader handles symbols and relocations, I noticed that something was clearly wrong for relocatable files (i.e., .o files). The loader was assuming that the VirtualAddress members of the section table were set as they are in executable images. This is not the case. It is the duty of the loader to choose an arbitrary starting address and to load each section at an appropriate offset from that address. I decided that choosing the same default address that IDA Pro uses was probably a good idea; I often switch between Boomerang and IDA Pro to gather information, especially information that Boomerang has gotten wrong. I also decided to delay loading any section whose name starts with ".rel." until all the other sections are loaded, because IDA Pro does so. I don't know why it does this, but I want my addresses to match up with those in IDA Pro, so I have to follow their lead.

After fixing this, I noticed that all the symbols and relocations now point to addresses that are not in any section. That is, all the symbols that point to the .text section now need to have the base address of the section added to their offset. Relocations point to symbols, so they too need to have the base address of the section added to their offset.

Some of this was already in the ELF loader, or at least some attempt had been made to add it.

Relocatable ELF files have no "entrypoints" as such, as they are not typically intended to be executed on their own. Typically the linker combines multiple .o files to produce an executable. So, at first, it would appear that attempting to decompile any .o file that doesn't have the main symbol in it, and where no specific entrypoints have been supplied by the user, would be a bit questionable. However, two examples spring to mind where the decompiler should be able to find an entrypoint. The first is kernel modules, where the initialization and teardown functions are valid entrypoints. The second is X11 modules, which have similar entrypoints. Obviously in both these cases there are interesting targets for decompilation.

A(nother) GUI for Boomerang

Quite a while ago I attempted to write a GUI for Boomerang. In fact, I've done this a couple of times. The stalling point has always been: what good is a GUI? Decompilers are supposed to be automatic. You should be able to give a decompiler an executable and trust it to spit out a C file that meets some definition of program equivalence with that executable. So if the decompiler works perfectly, what is there for the user to do? Surely anything they can offer will be more productively applied to the output, and for that they can just use standard source code manipulation tools. Well, there are two problems with this line of thinking. First, there's the sad fact that no decompiler is perfect. In fact, the state of the art is far, far from adequate, let alone perfect. Secondly, standard source code manipulation tools are woefully underpowered for even the simplest tasks of traditional reverse engineering (where traditional means "starting from source code"). For example, renaming a function or a variable is still a completely manual process for most C programmers. Java programmers might have a few more fancy tools - what they call "refactoring" tools - but for C programmers it's basically a text editor and sometimes something like Cscope.

So there certainly is a need for a GUI for a decompiler. I started by documenting the workflow of Boomerang. I figured we need to present this workflow as a key paradigm to the user, so best to get it right up front. I identified 5 key steps:
  1. Initialize
  2. Load
  3. Decode
  4. Decompile
  5. Generate Code
The sixth step, make the code maintainable, is more freeform than the rest, so I've not presented it explicitly in the workflow tab of the GUI. At each stage, widgets are shown which present information to the user so they can monitor the progress of the decompiler. The user is also given the opportunity to enter or update the information that will be used in the next step.

So far, the user can see and edit the entrypoints of the program, and where each section of the program is loaded into memory, before proceeding to the decode step. The user is then presented with the procedures of the program as they are decoded, and any library procedures that statements have been found to call. I hope to let the user specify information for unknown library procedures, and to remove any unwanted procedures, before proceeding to decompilation.

At decompilation time, the user is shown a truncated call graph built as the decompilation progresses. Procedures that have been decompiled or are currently under decompilation are shown in blue type, whereas procedures that have been delayed awaiting the completion of the decompilation of their children (or as a result of recursion) are left in black type. The procedure currently under work is made the current selection of the tree and all parent nodes are expanded. Double clicking on any procedure in the tree will open a new tab containing the RTL of that procedure. I hope to allow the user to manipulate the RTL before proceeding to the next step.

The code generation stage presents the user with a tree of clusters (directories and files) to which the code of each procedure will be generated. The user can add new clusters and drag procedures from cluster to cluster. Double clicking on a file will open a new tab (or an existing one if already opened) with the contents of the file. Double clicking on a procedure will do the same but scroll the contents such that the selected procedure is at the top of the window. At the moment, the only operations available for manipulating the contents of the file are the usual text editor commands. However, I hope to offer a range of commands, such as renaming procedures and variables, that make improving the quality of the output easier.

An option presented at the initialization stage (and in a menu) offers the user the opportunity to monitor the decompilation step at a finer grain. After enabling "decompilation debugging", a Step button is added to the status bar of the GUI. Upon completion of the decode step, the user can select which procedures they are interested in monitoring, or they can accept the default: all procedures. Proceeding to the decompilation stage, progress will now stop, the status bar will read "decompiling proc1: before initialise" and a new tab containing the RTL of proc1 will be opened. When the user has finished inspecting the RTL of proc1 they can press Step and the decompilation will continue. For each procedure selected at the decode stage, around eight messages will be displayed in the status bar; each time, the user will need to press Step, and the RTL of the procedure will be updated. This allows the user to spot bugs in the decompiler. At the moment, I'm using this to fix the bugs, but in the future, users may prefer to just fix the RTL.

I'm writing the GUI in C++ on Windows using the Qt4 toolkit. I will ensure it compiles and works on Linux before I check it into the Boomerang CVS repository. Shortly after I will be making binary packages for both Windows and Linux.