Saturday, June 10, 2006

So Many Types and No (Good) Type Analysis

The type analysis in Boomerang remains a nice big mess. There have been three attempts at a type analysis system: dataflow based, constraint based and adhoc. At the moment the adhoc gives the best results, and the other two crash, a lot. Sometimes there is an absolute orgy of types in an input program, and the type analysis will assign the type void to a local. I've just added some more adhoc type analysis that will recognise when the programmer is assigning the elements of an object of unknown type to an object of known type and copy the known types for the elements to the unknown type. This is very specific but hopefully it occurs in more than just the one input program I was looking at. In C the programmer would have written something like this:

struct foo *p = someFuncThatReturnsAFoo();
p->name = global.name;
p->count = global.count;
p->pos = global.pos;
p->other = 0;

If that call is to a library proc we will have the struct foo, and know that p is a pointer to one, but for global we probably only know that is has a size of 24 bytes, if we know anything. Clearly we need to invent a type for global and clearly that type must be a compound consisting of 3 elements, so why not copy the names and types of those elements from what we are assigning to.

The question is, can a generic universal type analysis algorithm do this? Or is this simply a heuristic that works well for some programs but would produce strange and unpredictable results for others.

2 comments:

  1. I just took a quick look at this post, but it seems like you're looking for something similar to the Hindley-Milner type inference algorithm that's found in the ML family of languages, like OCaml. It might be worthwhile to research the type system used in OCaml and Haskell.

    ReplyDelete
  2. The Hindley-Milner algorithm has been suggested for decompilation in

    Type-Based Decompilation, A. Mycroft, in "8th European Symposium on Programming", Springer-Verlag (LNCS 1576) 1999.

    However, it only solves the system of constraints, once they are gathered. The trick is in gathering the constraints in the first place, and then in using the result sensibly; it's not as easy as it looks. Mycroft only ever tested on paper, and was especially superficial on how to detect arrays.

    Boomerang did use constraints earlier, but it was found to be awkward to use; it seems that using OCaml or similar may have helped there.

    Boomerang now uses a data flow based type analysis, with a lattice and meet operator. However, it is currenly not doing as well as a sort of "ad hoc" type propagation system, mainly due to a lack of time to fully implement and test the type system.

    - emmerik

    ReplyDelete