Portable output for assembler


A straightforward decompilation of the first snippet yields this:

I can't say that the C code is any better than the assembly code:
- A single repne scasb has been replaced by an obscure loop.
- An additional variable to represent the ZF flag has been introduced.
- The result is longer than the initial assembly code.
It would be nice if the decompiler could replace this assembly code with a call to strlen. For a human reader, the difference would be spectacular:

Just one meaningful line, no puzzling x86 instructions, just plain and understandable code!
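To make the contrast concrete, here is a minimal sketch in C. It assumes the snippet is the classic or ecx, -1 / xor eax, eax / repne scasb / not ecx / dec ecx idiom; that is an assumption, and the function names strlen_scasb and strlen_ideal are purely illustrative, not actual decompiler output.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Literal, instruction-by-instruction rendering of the assumed idiom.
   Variable names mirror the x86 registers they stand for. */
size_t strlen_scasb(const char *s)
{
    const char *edi = s;
    uint32_t ecx = 0xFFFFFFFFu;      /* or ecx, -1 */
    int zf = 0;                      /* extra variable standing in for ZF */

    /* repne scasb: repeat while ecx != 0 and the last compare was not equal */
    while (ecx != 0) {
        ecx--;
        zf = (*edi++ == 0);          /* scasb with al == 0, then edi advances */
        if (zf)
            break;
    }
    ecx = ~ecx;                      /* not ecx: bytes scanned, terminator included */
    ecx--;                           /* dec ecx: drop the terminator */
    return ecx;
}

/* What a reader would much rather see. */
size_t strlen_ideal(const char *s)
{
    return strlen(s);
}
```

Under this assumption both functions return the same value for any string shorter than 4 GB. Without the final dec ecx the count would include the terminating zero, so whether strlen is an exact replacement depends on that last instruction.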
Now, the question is, how do I transform the initial assembly code into this ideal decompilation result? I could hardcode the decompiler to check if the first instruction is mov, the second is xor, and so on. You know better than me that this naive approach is severely limited: as soon as the compiler decides to shuffle instructions, use different registers, or replace repne scasb with a loop, our decompiler would be hopelessly confused and lost. Also, different compilers generate different code for built-in functions (just remember the second strlen example).
I cannot hope to hardcode all these variations by hand! What if I could specify the sequence in an abstract form and match it against real assembly code? This idea looked attractive to me: I just need to build the pattern matcher once and specify patterns for built-in functions. Patterns could look like this:

- x86 instructions are gone - they have been replaced by abstract instructions for a virtual machine.
- Registers are gone - they have been replaced by abstract variable names.
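The matching idea can be sketched in a few lines of C. Everything here is hypothetical: the opcode names, the insn_t layout, and the match_pattern function are illustrative assumptions, not the real virtual machine or matcher. In a pattern, the operand fields hold abstract variable indices; in the code being examined, they hold register numbers. The matcher binds each variable to a register on first use and insists on the same register afterwards.

```c
#include <stdbool.h>
#include <stddef.h>

enum { MAX_VARS = 4, UNBOUND = -1 };

/* Hypothetical abstract opcodes for the virtual machine. */
typedef enum { OP_MOV, OP_XOR, OP_SCAN, OP_NOT, OP_DEC } opcode_t;

typedef struct {
    opcode_t op;
    int dst;  /* pattern: variable index; real code: register number */
    int src;  /* same meaning, or -1 when the operand is absent */
} insn_t;

/* Bind pattern variable `var` to register `reg`, or verify an earlier binding. */
static bool bind(int *vars, int var, int reg)
{
    if (var < 0)
        return reg < 0;              /* operand must be absent on both sides */
    if (reg < 0 || var >= MAX_VARS)
        return false;
    if (vars[var] == UNBOUND) {
        vars[var] = reg;             /* first use: remember the register */
        return true;
    }
    return vars[var] == reg;         /* later use: must be the same register */
}

/* Match n pattern instructions against n real instructions, in order.
   On success, vars[] tells which register each abstract variable stands for. */
bool match_pattern(const insn_t *pat, const insn_t *code, size_t n, int *vars)
{
    for (int i = 0; i < MAX_VARS; i++)
        vars[i] = UNBOUND;
    for (size_t i = 0; i < n; i++) {
        if (pat[i].op != code[i].op)
            return false;
        if (!bind(vars, pat[i].dst, code[i].dst))
            return false;
        if (!bind(vars, pat[i].src, code[i].src))
            return false;
    }
    return true;
}
```

This sketch only solves the "different registers" problem. It still compares instructions position by position, so shuffled instructions would defeat it, and it does not stop two variables from binding to the same register; a real matcher would need to handle both.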
Difficulties are not where we expect them - the most laborious part of the task turned out to be the pattern reader utility, which would read the above text representation and produce something binary. And here I stopped and asked myself: what binary representation do I need? The answer was surprising: the pattern reader would generate C text! The main reason is that C text is the most portable output: you just need to compile it. I could generate a binary file, but then I would need to design its format. I could generate another text file, but then I would need another reader. C code already has a reader - the C compiler - and with structure and union declarations it can take any format I want.
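A minimal sketch of the "generate C text" step, under stated assumptions: parsed_insn_t, emit_pattern, and the insn_t type named in the output are all hypothetical, invented here for illustration. The reader utility has already parsed a pattern, and now it writes that pattern out as a C array definition which the decompiler's build would simply compile.

```c
#include <stddef.h>
#include <stdio.h>

/* One parsed pattern instruction (hypothetical layout). */
typedef struct {
    const char *op;  /* spelling of an enum constant, e.g. "OP_NOT" */
    int dst;         /* abstract variable index */
    int src;         /* abstract variable index, or -1 if absent */
} parsed_insn_t;

/* Write the pattern as C source text into `out`.
   Returns the number of characters written, or -1 if the buffer is too small. */
int emit_pattern(char *out, size_t outsz, const char *name,
                 const parsed_insn_t *insns, size_t n)
{
    size_t used;
    int k = snprintf(out, outsz, "static const insn_t %s[] = {\n", name);
    if (k < 0 || (size_t)k >= outsz)
        return -1;
    used = (size_t)k;

    for (size_t i = 0; i < n; i++) {
        k = snprintf(out + used, outsz - used, "    { %s, %d, %d },\n",
                     insns[i].op, insns[i].dst, insns[i].src);
        if (k < 0 || (size_t)k >= outsz - used)
            return -1;
        used += (size_t)k;
    }

    k = snprintf(out + used, outsz - used, "};\n");
    if (k < 0 || (size_t)k >= outsz - used)
        return -1;
    used += (size_t)k;
    return (int)used;
}
```

The output is an ordinary initialized array, so there is no file format to design and no loader to write: the compiler checks the syntax and lays out the data.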
The path to the result turned out to be not as straight as I hoped:

The decompiler would be based on a utility which generates C code from an assembler for a virtual machine. Everything got mixed up.

Comments
A very interesting idea.
Posted by: GDR! | April 24, 2006 11:49 PM
I have been studying/planning a very similar concept for a CPU to virtual machine microcode (uc) translator to both help remove many of the processor specific analysis issues, to deal with structural/logical binary comparison (including perhaps polymorphic code), and possibly binary translation. Once the binary is converted and abstracted into a standard uc stream format, and the nonessential uc code removed, then only one simplified set of tools will be needed to do a variety of high level AST type application analysis and visualization of the form and function independent of the original physical environment. I'm happy to see that I am not alone in thinking in that direction, and even more so, honored that it happens to be you of all people. I hope that one day I will have something meaningful to share. ;)
Posted by: slcoleman | May 12, 2006 11:21 PM
Isn't the first snippet computing the length of the string including the trailing 0? Because if yes, strlen is not its perfect decompiler equivalent.
Posted by: Gabi | July 21, 2006 11:40 PM