
April 24, 2006

Portable output for assembler

Sometimes unexpected detours are necessary to reach the goal. Take this simple assembly code:

This compiler-generated code calculates the length of the input string. If you do not remember the exact definition of repne scasb, here is another snippet which does the same thing:

A straightforward decompilation of the first snippet yields this:

I can't say that the C code is any better than the assembly code:

  • Single repne scasb has been replaced by an obscure loop.
  • An additional variable to represent the ZF flag has been introduced.
  • The result is longer than the initial assembly code.
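The listings from the original post are not reproduced here. As a rough illustration of the three points above, a literal translation of a repne scasb based length computation might look like the following sketch. The function name, the variable names, and the instruction sequence mentioned in the comments are assumptions made for illustration; this is not the decompiler's actual output.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical literal translation of a typical inlined strlen:
//   xor eax, eax / or ecx, -1 / repne scasb / not ecx / dec ecx
// (the exact sequence is an assumption, not the listing from the post).
std::size_t strlen_via_scasb(const char *s)
{
  const char *edi = s;             // EDI: pointer scanned by scasb
  std::uint32_t ecx = 0xFFFFFFFFu; // ECX: counter preset to -1
  const char al = 0;               // AL: the byte we are looking for
  bool zf = false;                 // extra variable standing in for ZF

  // repne scasb: repeat while ECX != 0 and the last compare was "not equal"
  while ( ecx != 0 )
  {
    zf = (*edi == al);             // scasb compares AL with the byte at EDI
    ++edi;                         // ... and advances EDI
    --ecx;                         // rep prefix decrements ECX
    if ( zf )
      break;                       // repne stops once ZF is set
  }
  return ~ecx - 1;                 // not ecx / dec ecx
}
```

For the string "abc" the loop runs four times (three characters plus the terminator), so ecx ends at 0xFFFFFFFB and `~ecx - 1` evaluates to 3.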

It would be nice if the decompiler could replace this assembly code by a call to strlen. For a human reader, the difference would be spectacular:

Just one meaningful line, no puzzling x86 instructions, just plain and understandable code!

Now, the question is, how do I transform the initial assembly code into this ideal decompilation result? I could hardcode the decompiler to check if the first instruction is mov, the second is xor, and so on. You know better than me that this naive approach is severely limited: as soon as the compiler decides to shuffle instructions, use different registers, or replace repne scasb with a loop, our decompiler would be hopelessly confused and lost. Also, different compilers generate different code for built-in functions (just remember the second strlen example).

I cannot hope to hardcode all these variations by hand! What if I could specify the sequence in an abstract form and match it against real assembly code? This idea looked attractive to me: I just need to build the pattern matcher once and specify patterns for built-in functions. Patterns could look like this:

  • x86 instructions are gone - they have been replaced by abstract instructions for a virtual machine.
  • Registers are gone - they have been replaced by abstract variable names.
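A minimal sketch of what matching against abstract variable names could look like is given below. It assumes the real code has already been translated into the same abstract opcodes as the pattern, and it only shows the variable binding step: each pattern variable must map to the same concrete register everywhere it appears. The `Insn` type, the `match` function, and the opcode spellings are hypothetical, and the sketch does not attempt to handle reordered instructions.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical abstract instruction: an opcode plus operand names.
struct Insn
{
  std::string op;
  std::vector<std::string> args;
};

// Try to match 'code' against 'pattern', binding each pattern variable
// to one concrete operand. Returns false on the first inconsistency.
bool match(const std::vector<Insn> &pattern,
           const std::vector<Insn> &code,
           std::map<std::string, std::string> &bind)
{
  if ( pattern.size() != code.size() )
    return false;
  for ( std::size_t i = 0; i < pattern.size(); ++i )
  {
    const Insn &p = pattern[i];
    const Insn &c = code[i];
    if ( p.op != c.op || p.args.size() != c.args.size() )
      return false;
    for ( std::size_t j = 0; j < p.args.size(); ++j )
    {
      std::map<std::string, std::string>::iterator it = bind.find(p.args[j]);
      if ( it == bind.end() )
        bind[p.args[j]] = c.args[j];   // first use: remember the binding
      else if ( it->second != c.args[j] )
        return false;                  // same variable, different register
    }
  }
  return true;
}
```

With this scheme the same pattern matches regardless of which registers the compiler happened to pick, which is the point of replacing registers by variable names.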

Difficulties are not where we expect them - the most laborious part of the task turned out to be the pattern reader utility, which reads the above text representation and produces something binary. And here I stopped and asked myself: what binary representation do I need? The answer was surprising: the pattern reader would generate C text! The main reason is that C text is the most portable output: you just need to compile it. I could generate a binary file, but then I would need to design its format. I could generate another text file, but then I would need another reader. C code already has a reader - a C compiler - and it can have any format I want thanks to structure and union declarations.
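The idea of emitting C text can be sketched in a few lines. The example below assumes a made-up pattern syntax with one instruction per line in the form `op dst src`, and it emits a C array of a hypothetical `pat_insn` structure. The syntax, the structure, and the function name `emit_c` are all assumptions for illustration; the real pattern reader is not shown in the post.

```cpp
#include <sstream>
#include <string>

// Turn a textual pattern ("op dst src" per line, a hypothetical syntax)
// into C source text that declares a null-terminated array of pat_insn.
// Tokens are assumed to be plain identifiers, so no escaping is done.
std::string emit_c(const std::string &name, const std::string &pattern_text)
{
  std::istringstream in(pattern_text);
  std::ostringstream out;
  out << "struct pat_insn { const char *op, *dst, *src; };\n";
  out << "static const struct pat_insn " << name << "[] = {\n";
  std::string line;
  while ( std::getline(in, line) )
  {
    std::istringstream ls(line);
    std::string op, dst, src;
    if ( !(ls >> op >> dst >> src) )
      continue;                        // skip blank or malformed lines
    out << "  { \"" << op << "\", \"" << dst << "\", \"" << src << "\" },\n";
  }
  out << "  { 0, 0, 0 }\n};\n";         // terminator entry
  return out.str();
}
```

The generated text can be compiled straight into the program that uses the patterns, so no separate file format or loader is needed.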

The path to the result turned out to be not as straight as I hoped:

The decompiler would be based on a utility which generates C code from an assembler for a virtual machine. Everything got mixed up.

April 13, 2006

Sainte Ida

Apparently she was someone very pious and spiritual :)

http://nominis.cef.fr/contenus/saints_966.html

Today is her day.

IDA Pro started as a simple abbreviation but we quickly got used to the image of this nice lady (in fact the person depicted on the image is just a certain medieval lady, not a saint, and not named Ida either...).

April 11, 2006

Improving IDA analysis

For a typical MS Windows executable IDA does quite a good job of recognizing code and creating functions, and usually the result is eye-pleasing and easy to decipher. The analysis is quite good but not perfect - there are cases when it takes data for code or wrongly determines the function boundaries.

The good news is that there are easy methods to improve the situation.

It was obvious from the beginning that we cannot make a perfect engine to tell code apart from data. Therefore we prepared several ways to alleviate the problem. First, the user has ultimate control over the listing and can at any time convert data to code and vice versa. Second, we created hooks for plugins. For example, each time IDA creates a function, it calls a hook named processor_t::func_bounds and a plugin has a chance to correct the function boundaries. Before creating any instruction, IDA calls processor_t::make_code and if it yields 0, IDA will refrain from doing anything. The same scenario is used for data items (processor_t::make_data) and names (processor_t::rename).

In addition to these hooks, there are also events happening before and after the analysis. In fact, there are several events - one for each analysis queue. IDA has several of them:

  • code queue - addresses from this queue will be converted to code
  • function queue - addresses from this queue will be converted to functions
  • reanalysis queue - these addresses will be reanalyzed. This queue is used to create stack variables, correct cross references if a segment register gets modified and so on.
  • undefine - addresses from this queue will become unexplored
  • final queue - if an address from this queue is unexplored, IDA will try to convert it to something (data or code). While this queue makes the listing nicer, all decisions for this queue are arbitrary 'best guesses'. If you prefer to work with a more precise yet less explored listing, you might want to turn off the final analysis. The option is available from the kernel options.

There are some other queues (for FLIRT signature files and other stuff) but the ones mentioned above are the most important. When any queue becomes empty, an event (processor_t::auto_queue_empty) is generated. When all queues become empty, a final event (processor_t::auto_empty) is generated, and if no plugin or processor module adds anything to the queues in response to it, then the analysis is declared completed (processor_t::auto_empty_finally). Many processor modules react to these events and fine-tune the listing one way or another.
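The "drain the queues, fire an event, repeat if something was added" loop can be modelled with a toy program. The sketch below is not IDA's implementation and does not use the SDK: the `ToyAnalyzer` type, the queue names, the draining order, and the `on_all_empty` callback (a stand-in for the processor_t::auto_empty event) are all assumptions made to illustrate the control flow.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Toy stand-ins for the analysis queues; the order is an assumption.
enum QueueKind { Q_CODE, Q_FUNC, Q_REANALYZE, Q_UNDEFINE, Q_FINAL, Q_COUNT };

struct ToyAnalyzer
{
  std::deque<unsigned> queues[Q_COUNT];
  std::function<void(ToyAnalyzer &)> on_all_empty; // models auto_empty
  std::vector<unsigned> processed;                 // addresses handled so far

  bool all_empty() const
  {
    for ( int q = 0; q < Q_COUNT; ++q )
      if ( !queues[q].empty() )
        return false;
    return true;
  }

  void run()
  {
    for ( ;; )
    {
      // Drain every queue; "processing" an address just records it here.
      for ( int q = 0; q < Q_COUNT; ++q )
      {
        while ( !queues[q].empty() )
        {
          processed.push_back(queues[q].front());
          queues[q].pop_front();
        }
      }
      // All queues are empty: give the handler a chance to add more work.
      if ( on_all_empty )
        on_all_empty(*this);
      // Nothing was added, so the analysis is complete
      // (this models auto_empty_finally).
      if ( all_empty() )
        break;
    }
  }
};
```

A handler that pushes a new address in response to the "all empty" callback causes one more pass over the queues, which is how a plugin can feed extra information into the analysis before it is declared finished.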

The basic autoanalysis algorithm is quite simple. Guys from Determina guessed it right: http://www.determina.com/security.research/ (btw, check the presentation for more interesting stuff; they also have developed a better pdb plugin).

The answer to this problem is "use events!" You will find dozens of them in idp.hpp. You can completely change analysis outcome by providing more information to the kernel. It is very easy to hook to the events:

hook_to_notification_point(HT_IDP, my_event_handler, your_data);

and the handler will be:

int idaapi my_event_handler(void *your_data,
                            int notification_code,
                            va_list va)
{
  if ( notification_code == processor_t::make_code )
  {
    // take care of instruction creation...
  }
  return 0; // pass on the event further
}

If the analysis is not up to your expectations, just hook to events.

April 02, 2006

IDA graph mode

The new IDA Pro introduces the graph mode. The disassembly of the current function is displayed as a graph: each basic block is represented as a node and cross references are represented as edges. It is easy to zoom, move, and modify the graph using the mouse, and I'm sure you will use the new interface without much difficulty. However, there are some unexpected commands which may make your life easier.

For example, the keyboard arrows can be used to move around the graph. This is something expected. But if you hold the Ctrl key and press the Up or Down arrow, IDA will display the list of all predecessors or successors of the current node.

Double clicking on an edge with the Ctrl key pressed will jump to its destination. Alt will jump to its source.

Pressing '5' on the keypad will center the current node. If you prefer to use the mouse, try to click with the mouse wheel on a node - the clicked node will be centered.

There are many tricks like this. All this is described in minute detail in the help. It won't take long to read the graph-related pages and you will become really fast and comfortable with the graph view. I urge you to spend some 10-15 minutes reading it and playing with graphs.

IDA has more graph layout algorithms than you might think. See some of them in Dennis' blog. You can create your own layouts too (and even your own graphs of absolutely anything). Just take a look at the sample plugin in the SDK.
