Wednesday, July 19, 2006

Bi-directional Dataflow Type Inference

After reading this paper I've starting implementing a brand new type analysis in Boomerang. The first step is calculating reverse dominance frontiers. A regular dominance frontier is the set of nodes that have a predecessor that is dominated by the given node, but is not itself dominated. These are the nodes where phi statements are to be placed when transforming a program into SSA form. The reverse dominance frontier is much the same, except it is the successors that must be post-dominated. Boomerang already calculates immediate post-dominators for use by the code generation phase, but we've never before had a use for reverse dominator frontiers. The paper describes an extension to SSA form called static single information form (SSI) which introduces a new kind of assignment: the sigma function statement which are placed on the reverse dominator frontiers. The purpose of this new statement is to split definitions before uses on different code paths. I will be using a more verbose form of this notation where a sigma statement for each use is emitted.

For some reason the algorithms used to place sigma functions are more complex than the algorithms we use to place phi functions. This is most probably because we're using the simple dominance frontier based algorithms that don't guarentee O(EV) performance. As such, I'm starting to wonder if sigma functions really are just a performance enhancement and if the same thing can't be achieved with just regular assignments strategically placed on each branch. In any case, it seems to me that reverse dominance frontiers are only necessary so you can calculate minimal SSI form. Reducing the number of sigma functions you introduce is obviously a good thing, but it may not be particularly important for Boomerang as we often detect and remove unused statements anyway.

At this juncture I'm abandoning my implementation. Boomerang needs a lot more retrofitting to support this form than I think is necessary for type inference. Instead, I'm going to try implementing the dataflow alogrithm without SSI and see if/how it fails. Then, if necessary, I'll add some adhoc sigma like manipulations to merge type information from uses.

1 comment:

  1. I would like to go on the record here to say that I believe that sigma functions are not necessary for decompilation.

    Splitting variables at cotrol flow split points might sometimes avoid a union type. But remember that every definition stores a different type in SSA, so it would only be cases where a union is required anyway that this would happen, and you will have to emit a cast anyway. Still, a cast is better than a union, so maybe it has some value.

    Good luck, QuantumG!

    - emmerik