Boomerang 2
I was recently asked what I would do differently if I was writing a decompiler from scratch today. Having worked on decompilers for the past 9 years myself and worked with others who have worked on decompilers for longer still, I like to think we now know enough about it to recognise some of the mistakes we've made and identify some areas of improvement.
The primary difficulty of decompilation is that most analysis requires interprocedural information. Most everyone recognises this need early on so there's an assumption that you need to keep the intermediate representation for the entire program in memory at the same time. For many years we've had a hell of a lot of memory to play with (4 gig will be standard on desktop PCs next year) and with 64 bit architectures and virtual memory you can swap to disk for as much memory as you need. So it would appear that structuring a decompiler like a compiler, with intermediate disk formats for each compilation module is unnecessary. I think this is a mistake.
Having intermediate disk forms is more than just a solution to the memory issue. For a start, there's the ability to start and stop the analysis. In compilation, this is used to great effect when making incremental changes to source files; the files that are not changed do not need to be recompiled to their intermediate disk forms. It's a rare use case where one incrementally changes an input binary to a decompiler, (this does happen though, for example, when doing incremental comparisons of different versions of a program), but a decompilation takes a long time and a single bug can crash the decompiler resulting in hours of lost work. Interruptability alone is justification for intermediate disk forms, but there's another reason.
Because decompilation is still experimental engineering, inspection of intermediate forms is important. The Boomerang decompiler had a number of ways in which we could inspect the intermediate representation. We could stop the decompilation early, print out call graphs, dataflow graphs and register transfer language (RTL). A number of attempts were made at producing a GUI.. the primary purpose being to provide feedback on the progress of the decompilation and to allow inspection at the different stages. Tools to convert intermediate disk forms (which should be stored as efficiently as practical) into a form that is presentable to an engineer are much better seperated from the GUI, which now becomes primarily for feedback of progress.
This suggests another benefit of intermediate disk forms: no longer is it necessary to have a single tool, "the decompiler", to work on the intermediate representation. The decompiler can be seperated into individual tools which do a specific job and do it well. So what kind of tools do we need?
Firstly, a tool which uses a loader library to put the executable image into memory, and an instruction decoder library to turn individual instructions into intemediate representation, would use the variety of techniques that we've developed over the years to find all the code in an executable and, using another library, output it in the intermediate disk forms.
The inspection tools can be used at this point to confirm that all the code has been found and is loaded correctly. Because the intermediate representation is on disk, in a collection of files, the engineer can make a backup of these files and the decoding stage not need be repeated should some later analysis mess up the decompilation.
Following the decoding stage, a lot of analysis and data gathering occurs. Converting the intermediate representation to SSA form. Performing use/def analysis, liveness analysis, parameter analysis, etc. All this information could go into seperate files.. that way the previous intermediate forms can be kept around to do other analysis on, and an invalid analysis result can be recalculated without affecting other analysis phases.
After collecting all this information, the process of decompilation is essentially just transforming the intermediate representation from one form to another. As such, the workhorse of the decompiler is the transformation tool. It should load up the appropriate parts of the intermediate representation, read a script written in a domain specific language and then write back out the intermediate representation.
Often a transformation will require the incremental update of an analysis. I believe this can be acheived using a simple timestamp system on the intermediate disk forms. Yes, what I'm saying is that something similar to a Makefile can be used to drive the entire decompilation process.
The last tool, of course, is the code generation tool. For each designed language we need a tool that will take the intermediate representation and produce "source code" files.
That's enough about architecture. Already I've described a decompiler system which is very different to anything that has been built todate. If you're thinking about writing a decompiler, I hope you'll consider something similar to this.
Another thing I like to think I've learnt over the years of working on decompilers is that it sucks having to come up with an intermediate representation all by yourself. In fact, coming up with an intermediate representation for compilers could be a life time obsession all on its own. There's a lot of schools of thought on the subject. Obviously, something like the tuples of simple compiler is not a lot of good for decompiler. One needs to at least represent all the statements of a high level language as well as the individual instructions you might see in an assembly listing. There's a lot of work in it.. so why not take some advice from people who think about this kind of stuff?
The LLVM Compiler Infrastructure Project has a well thought out "virtual instruction set" which allows such complicated operands that it can express high level languages to such a degree that a simple PERL script can turn a C program into LLVM instructions (not that I recommend that). Perhaps more importantly, it has a mature, efficient on-disk representation and a number of useful analysis phases that work on it. There's an active community where you can report bugs, get help and recruit interested parties. I think you could do worse.
The primary difficulty of decompilation is that most analysis requires interprocedural information. Most everyone recognises this need early on so there's an assumption that you need to keep the intermediate representation for the entire program in memory at the same time. For many years we've had a hell of a lot of memory to play with (4 gig will be standard on desktop PCs next year) and with 64 bit architectures and virtual memory you can swap to disk for as much memory as you need. So it would appear that structuring a decompiler like a compiler, with intermediate disk formats for each compilation module is unnecessary. I think this is a mistake.
Having intermediate disk forms is more than just a solution to the memory issue. For a start, there's the ability to start and stop the analysis. In compilation, this is used to great effect when making incremental changes to source files; the files that are not changed do not need to be recompiled to their intermediate disk forms. It's a rare use case where one incrementally changes an input binary to a decompiler, (this does happen though, for example, when doing incremental comparisons of different versions of a program), but a decompilation takes a long time and a single bug can crash the decompiler resulting in hours of lost work. Interruptability alone is justification for intermediate disk forms, but there's another reason.
Because decompilation is still experimental engineering, inspection of intermediate forms is important. The Boomerang decompiler had a number of ways in which we could inspect the intermediate representation. We could stop the decompilation early, print out call graphs, dataflow graphs and register transfer language (RTL). A number of attempts were made at producing a GUI.. the primary purpose being to provide feedback on the progress of the decompilation and to allow inspection at the different stages. Tools to convert intermediate disk forms (which should be stored as efficiently as practical) into a form that is presentable to an engineer are much better seperated from the GUI, which now becomes primarily for feedback of progress.
This suggests another benefit of intermediate disk forms: no longer is it necessary to have a single tool, "the decompiler", to work on the intermediate representation. The decompiler can be seperated into individual tools which do a specific job and do it well. So what kind of tools do we need?
Firstly, a tool which uses a loader library to put the executable image into memory, and an instruction decoder library to turn individual instructions into intemediate representation, would use the variety of techniques that we've developed over the years to find all the code in an executable and, using another library, output it in the intermediate disk forms.
The inspection tools can be used at this point to confirm that all the code has been found and is loaded correctly. Because the intermediate representation is on disk, in a collection of files, the engineer can make a backup of these files and the decoding stage not need be repeated should some later analysis mess up the decompilation.
Following the decoding stage, a lot of analysis and data gathering occurs. Converting the intermediate representation to SSA form. Performing use/def analysis, liveness analysis, parameter analysis, etc. All this information could go into seperate files.. that way the previous intermediate forms can be kept around to do other analysis on, and an invalid analysis result can be recalculated without affecting other analysis phases.
After collecting all this information, the process of decompilation is essentially just transforming the intermediate representation from one form to another. As such, the workhorse of the decompiler is the transformation tool. It should load up the appropriate parts of the intermediate representation, read a script written in a domain specific language and then write back out the intermediate representation.
Often a transformation will require the incremental update of an analysis. I believe this can be acheived using a simple timestamp system on the intermediate disk forms. Yes, what I'm saying is that something similar to a Makefile can be used to drive the entire decompilation process.
The last tool, of course, is the code generation tool. For each designed language we need a tool that will take the intermediate representation and produce "source code" files.
That's enough about architecture. Already I've described a decompiler system which is very different to anything that has been built todate. If you're thinking about writing a decompiler, I hope you'll consider something similar to this.
Another thing I like to think I've learnt over the years of working on decompilers is that it sucks having to come up with an intermediate representation all by yourself. In fact, coming up with an intermediate representation for compilers could be a life time obsession all on its own. There's a lot of schools of thought on the subject. Obviously, something like the tuples of simple compiler is not a lot of good for decompiler. One needs to at least represent all the statements of a high level language as well as the individual instructions you might see in an assembly listing. There's a lot of work in it.. so why not take some advice from people who think about this kind of stuff?
The LLVM Compiler Infrastructure Project has a well thought out "virtual instruction set" which allows such complicated operands that it can express high level languages to such a degree that a simple PERL script can turn a C program into LLVM instructions (not that I recommend that). Perhaps more importantly, it has a mature, efficient on-disk representation and a number of useful analysis phases that work on it. There's an active community where you can report bugs, get help and recruit interested parties. I think you could do worse.
This is a nice article shared. Thanks for sharing this wonderful job..
ReplyDelete