Posts

Showing posts from June, 2006

What license is that MTA?

A Mail Transfer Agent is that bit of software that listens on port 25 and accepts mail when you send it. There's a lot of them available, but which ones are truely free? I find that a good moral compass on questions of licensing is to look at the OpenBSD project . What they use is typically the most free you can get. So what do they use? Sendmail , which has these license terms . They're pretty ass. Basically you can distribute it if you're "open source" in the GPL sense of the term; you have to promise to hand over source code, or if you are "freeware". So yeah, if you want to make a binary only CD of OpenBSD and include Sendmail you're going to have to promise whoever you give it to that you'll give them the source if they ask, or you can't charge them anything more than distribution costs. Seems kind of anti-OpenBSD-philosophy to me. But maybe there's nothing better out there. What about qmail ? Ask anyone and they'll te

Order of Decompilation

Which order you decompile the procedures of a program in can determine how much information you have available to make good analysis. In Boomerang we've developed a system of depth first search of the call graph, which takes into account recursion, to have the most information about parameters available when needed. For example, if a program consists of two procedures, one called by the other, it makes the most sense to decompile the procedure which is the leaf in the call graph so that the procedure that calls it has the most information about what parameters it needs to pass to it. What happens if the way the leaf procedure takes a parameter that is a pointer to a compound type? By the parameter's usage the decompiler may be able to determine that it is a pointer, it might even be able to determine that it is a pointer to a compound, but unless every member in the compound is accessed in a way that restricts that member's type sufficiently, the type that the decompiler

Binary Release Alpha 0.3

I've added a new binary release of Boomerang for win32 to the sourceforge project. The Linux binaries are also up. If you'd like to try making a release for some other platform, please let me know. What can the new release do? Well, it crashes less. It supports ELF .o files much better than previous releases. It includes some changes that make for better output in some instances. Overall, just general improvements. What can't the new release do? This is still an alpha release. That means we don't expect you to be able to do very much work with it. Running it on /bin/ls will still give you horrendous output, but try /bin/arch.

So Many Types and No (Good) Type Analysis

The type analysis in Boomerang remains a nice big mess. There have been three attempts at a type analysis system: dataflow based, constraint based and adhoc. At the moment the adhoc gives the best results, and the other two crash, a lot. Sometimes there is an absolute orgy of types in an input program, and the type analysis will assign the type void to a local. I've just added some more adhoc type analysis that will recognise when the programmer is assigning the elements of an object of unknown type to an object of known type and copy the known types for the elements to the unknown type. This is very specific but hopefully it occurs in more than just the one input program I was looking at. In C the programmer would have written something like this: struct foo *p = someFuncThatReturnsAFoo(); p->name = global.name; p->count = global.count; p->pos = global.pos; p->other = 0; If that call is to a library proc we will have the struct foo , and know that p is a pointe

Conflation of Pointer and Array Types

A common source of confusion for new C programmers is the conflation of pointers and arrays that C does. I often think of the dynamic semantics of the language when I'm thinking deeply about passing arrays to functions. Typically, you can tell an experienced programmer that C always passes arrays by reference, never by value, and they won't go wrong. Not all languages are like this, so in Boomerang we try to represent pointers and arrays as seperate non-conflated types. In our type system an array is a type used to describe those bytes in memory that contain a finite number of objects of a particular base type. Similarly, a pointer is a type used to describe those bytes in memory that contain an address, which if followed will reveal a single object of a particular base type. As such, it is necessary to refer to somethings explicitly using the Boomerang type system that are typically implied by the C type system. For example, a C string is often written in the C type syste

Reverse Strength Reduction

Strength reduction is a compiler optimisation that tries to replace expensive operations involving a loop variant with less expensive operations. The most common example of strength reduction is the replacement of array indexing with pointer arithmetic. Consider this program fragment: for (int i = 0; i < 25; i++) arr[i] = 9; where arr is an array of integers. An unoptimised compilation, in an RTL form, might look like this: 1 *32* r25 := 0 2 BRANCH 6 Highlevel: r25 >= 25 3 *32* m[a[arr] + r25 * 4] := 9 4 *32* r25 := r25 + 1 5 GOTO 2 which would actually be some very nice RTL for the decompiler to discover because it is easy to recognise that arr is an array with stride 4 and replace statement 3 with: 3 *32* a[r24] := 9 unfortunately, the compiler will have gotten to the RTL before we have, and most any compiler will do some strength reduction to get rid of that multiply in the middle of the left of statement 3. So what the decompiler will see is more like this: 1 *32* r

Using Qt4 and the Boehm Garbage Collector on Linux

I had an major issue where the Boehm garbage collector was crashing and spitting errors because of my use of Qt4's QThread class. The problem was simple enough, Qt4 calls pthread_create when it should be calling GC_pthread_create. I could have solved this problem by modifying qthread_private.cpp to do this, but that would mean distributing a modified Qt4, which is just silly for such a small change. So after much annoyance, I managed to come up with a solution that, although not pretty, seems to work. As such, there will be a Linux version of the GUI available to download when I make binary packages sometime in the next week.

Forcing a Signature

A while ago I added a bool to the Signature class that allowed the user to specify that the signature was already fully formed and did not need any processing. This was part of the "symbol file" hack that we used to do procedure-at-a-time decompilation using the command line tool. I noticed today that we were not honouring the forced bit anymore, for example, we were removing unused parameters and returns from the signature, so I fixed that. It occured to me that any procedure we discover via a function pointer was an excellent candidate for setting the forced bit on. The results were pretty spectacular as locals and globals were soon inheriting type information from the parameters of the signature. Unfortunately, the same could not be said of returns. In particular, consider some code like this: mystruct *proc5() 12 { *v** r24 } := malloc(343) 13 m[r24{12} + 4] := "foo" 14 RET r24{12} It's pretty clear that any local we create for r24 should be of type m

Types, Globals and Varargs

I have a sample input program that has some code similar to this: 228 { *32* r24, *32* r28 } := CALL knownLibProc( .. arguments .. ) .. 307 *32* m[r24{228}] := 232 308 *32* m[r24{228} + 4] := 91 309 *32* m[r24{228} + 8] := "some string" where knownLibProc returns a pointer to a struct in r24. Early in the decompilation this type will be propogated into the statements in 307, 308 and 309 producing: 307 *32* m[r24{228}].size := 232 308 *32* m[r24{228}].id := 91 309 *32* m[r24{228}].name := "some string" our intermediate representation doesn't have an operator equivalent to C's -> operator, the above is more like writing (*p).size, but the backend takes care of that and will emit a -> instead. Unfortunately I was getting an assert fault before I even get to that. The problem was that the 228 instance of r24 was being assigned a local variable, and that local was not inheriting the return type of the call. So the adhoc type analysis would take a look a

Memory Leaking and Automatic Collection

I checked in a change to Makefile.in today that lets one disable the garbage collector more easily. I then tried out a few memory leak detection tools. First I tried ccmalloc. I couldn't even get this working, it just crashes, even with the sample program on the web site. Then I gave mpatrol a go. I'd heard good things about mpatrol. Unfortunately it doesn't seem to like dynamic loading of shared libraries and (for no good reason) we don't link to the loaders statically in Boomerang. So I gave up and installed valgrind. It still rocks. It not only told me how much memory we were leaking and where from, it also told me some use-before-define errors of which I wasn't aware. I checked in fixes. Next, I had a look at Boost's shared_ptr class. I'm hoping to figure out a way to easily add something like this to replace the garbage collector. Unfortunately, the use of smart pointers is anything but easy. You'd think that you could define something

Debugging the Decompiler

One of the most useful features of the new GUI will be the ability to step through a decompilation and inspect the RTL at each step. To date I have implemented a Step button that allows the user to inspect a procedure before each major phase of the decompilation on that procedure. In the future, I intend to add more debugging points, perhaps even to the resolution of a single RTL change. I expect that some way for the user to specify the desired level of resolution will be required. Whether that is a bunch of menu options, or a spinner or even multiple Step buttons (step to next change, step to next analysis, step to next phase, step to next procedure, etc), I havn't decided. The UI already has a course form of breakpoints. At the decoding phase you can specify which procedures you want to inspect, and the decompilation will run without stopping until it gets to one of those procedures. It would be sensible to allow the user to set a breakpoint on a particular line of the RTL

Multithreaded Garbage Collection on Linux

I tried compiling the Boomerang GUI on Linux last night. After much effort getting Qt4 to compile I was hopeful that everything would just work. Unfortunately the trick I used to get the garbage collector to only collect memory allocated by the decompiler thread on WIN32 doesn't work on Linux. Apparently you're supposed to call GC_pthread_create instead of pthread_create to start new threads, well I'm not calling either, I'm getting Qt to create my thread for me. So what to do? I guess I could modify Qt to use GC_pthread_create, but that means any binaries I distribute will have to include my modified Qt. I'm going to look into ways to register a thread with the allocator directly. Another alternative is to just stop all this garbage collector nonsense, but modifying Boomerang to delete when appropriate is just out of the question. I have seriously considered using smart pointers, possibly from Boost, but as of yet I've not made a firm decision. It would

Displaying Pretty RTL

I did some fixes to the html output option in the various print() functions of Boomerang today. This is all so I can display RTL as pretty as possible. I'm thinking that hovering the mouse over a RefExp should highlight the line that is referenced by it. That's my first goal, and then I'll work on a context menu. All this is possible because I can hide pointers in the html output which refer back to the intermediate representation. Qt4 has the advantage that good html display widgets are standard parts of the toolkit. What I don't intend to do is to allow freeform manipulation of the RTL. That would require an RTL parser, which I'm simply not in the mood to write, at least this month.