
February 25, 2010

Custom data types and formats

Another new feature that will be available in the upcoming version of IDA Pro is the ability to create and render custom data types and formats.


(Embedded instructions disassembled and rendered alongside x86 code)

What are custom types and formats

  • Custom data type: A custom type is basically just a way to tag some bytes for later display with a custom format when the built-in IDA types (dt_byte, dt_word, etc.) are not enough. Examples: an XMM vector, a Pascal string, a half-precision (16-bit) floating-point number, a 16:32 far pointer (fword), a uleb128 number, and so on. To define a custom type, you provide its name, its size (fixed or dynamically calculated), a keyword for the disassembly, and a few other attributes.
  • Custom data format: A custom data format lets you display a custom or built-in data type in any way you like. You can register several formats for each type and switch between representations. For example, you might want to switch the display of the same 16-byte XMM vector between four floats and two doubles. A format definition includes callbacks for printing (to display the value) and scanning (used during debugging to change register values).
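To make the printing/scanning pair more concrete, here is a minimal sketch in plain Python of two formats for the same 16-byte XMM value. It does not use the IDA API: the function names and the XMM_FORMATS table are hypothetical, and little-endian byte order is an assumption.

```python
import struct

# Hypothetical print/scan pairs for a 16-byte XMM value.
# "print" turns raw bytes into text; "scan" turns edited text back into bytes.
# Little-endian layout is assumed.

def print_4floats(raw):
    """Render 16 bytes as four single-precision floats."""
    return ", ".join(repr(v) for v in struct.unpack("<4f", raw))

def scan_4floats(text):
    """Parse four comma-separated floats back into 16 bytes."""
    return struct.pack("<4f", *(float(t) for t in text.split(",")))

def print_2doubles(raw):
    """Render the same 16 bytes as two double-precision floats."""
    return ", ".join(repr(v) for v in struct.unpack("<2d", raw))

def scan_2doubles(text):
    """Parse two comma-separated doubles back into 16 bytes."""
    return struct.pack("<2d", *(float(t) for t in text.split(",")))

# Several formats registered for one type; the user switches between them.
XMM_FORMATS = {
    "4 floats": (print_4floats, scan_4floats),
    "2 doubles": (print_2doubles, scan_2doubles),
}

raw = struct.pack("<4f", 1.0, 2.5, -3.0, 0.0)
for name, (to_text, from_text) in XMM_FORMATS.items():
    text = to_text(raw)
    assert from_text(text) == raw  # scanning reverses printing
    print("%-9s -> %s" % (name, text))
```

The point of keeping the two callbacks as a pair is that whatever the print callback shows, the scan callback should be able to read back, so a value edited during debugging ends up as the right bytes.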
For example, here is a custom MAKE_DWORD format applied to the built-in dword type:

Its implementation is very simple:


Next we illustrate some possible uses of custom types and formats. Other uses are possible too; it is up to your imagination.

Decoding embedded bytecodes

Imagine you are debugging an x86 program that implements its own VM and embeds the VM bytecodes in the program.
The classical solutions to this problem are:
  • Write a dedicated processor module and then load the extracted bytecodes separately
  • Or define the bytecodes as bytes and then use comments to describe the real meaning of those bytecodes

With this new addition, one can just write a custom data type to handle the situation:


And if you happen to have a situation where the bytecodes are operands to instructions (as a means of obfuscation), you can still apply the custom format to those operands:


The previous blog entry showed how to write processor modules using Python. What if one simply uses the "import" statement to import a full-blown processor module script and use it in the custom data types/formats? ;)
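The core of such a data type is a routine that works out how many bytes one bytecode instruction occupies, plus a routine that renders it. A rough sketch in plain Python, assuming a made-up three-instruction bytecode (the OPCODES table is hypothetical, is not the simplevm from this article, and no IDA API is used):

```python
# Hypothetical VM: opcode -> (mnemonic, number of operand bytes).
OPCODES = {
    0x01: ("push", 1),
    0x02: ("add", 0),
    0x03: ("jmp", 2),
}

def calc_size(data, offset):
    """Return the size in bytes of the item at offset, or 0 if it cannot be decoded."""
    if offset >= len(data):
        return 0
    entry = OPCODES.get(data[offset])
    if entry is None:
        return 0
    size = 1 + entry[1]
    # Reject instructions whose operands run past the end of the buffer.
    return size if offset + size <= len(data) else 0

def render(data, offset):
    """Produce the text for one instruction (operands are little-endian)."""
    mnemonic, nargs = OPCODES[data[offset]]
    operand = int.from_bytes(data[offset + 1:offset + 1 + nargs], "little")
    return mnemonic if nargs == 0 else "%s 0x%X" % (mnemonic, operand)

def disassemble(data):
    """Walk the buffer item by item until something cannot be decoded."""
    lines, offset = [], 0
    while True:
        size = calc_size(data, offset)
        if size == 0:
            break
        lines.append(render(data, offset))
        offset += size
    return lines

code = bytes([0x01, 0x2A, 0x01, 0x10, 0x02, 0x03, 0x34, 0x12])
print(disassemble(code))
```

Separating the size calculation from the rendering mirrors the split between a dynamically sized data type and the format that displays it.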

Displaying resource strings

When reversing MS Windows applications, you often encounter string IDs. How can you easily fetch the corresponding strings and display them in the disassembly listing?
Normally, you would have to use a resource editor to extract the string value corresponding to each string ID, and then create an enum in IDA with one member per string ID, each with a repeatable comment:


That works, but what about writing your own custom format instead:


You can then apply it directly: instead of using a resource editor to extract the string value, the custom format does that programmatically for you:


This is what a resource string custom format handler can look like:


To take a closer look at it, you can download the custom data type handler script along with the source code of the simplevm assembler/disassembler and the C program that was used in this article.
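For reference, the lookup that any such handler has to perform can be sketched in plain Python. Windows string tables are stored in blocks of 16 strings; the block resource ID is (string ID >> 4) + 1, and each entry is a 16-bit character count followed by UTF-16LE text. The function names below are hypothetical, this is not the handler from the download, and reading the block out of the PE resource section is left out.

```python
import struct

def locate_string(string_id):
    """Map a string ID to (block resource ID, index inside the block)."""
    return (string_id >> 4) + 1, string_id & 0x0F

def read_block_entry(block, index):
    """Walk the length-prefixed UTF-16LE entries of one 16-string block."""
    offset = 0
    for _ in range(index):
        (length,) = struct.unpack_from("<H", block, offset)
        offset += 2 + length * 2  # skip the count and the characters
    (length,) = struct.unpack_from("<H", block, offset)
    start = offset + 2
    return block[start:start + length * 2].decode("utf-16-le")

def make_block(strings):
    """Helper for the demo: build a block; unused slots are empty strings."""
    padded = list(strings) + [""] * (16 - len(strings))
    return b"".join(
        struct.pack("<H", len(s)) + s.encode("utf-16-le") for s in padded
    )

# Demo: pretend this is block 3, which holds string IDs 0x20..0x2F.
block = make_block(["Open", "Save", "Exit"])
block_id, index = locate_string(0x22)
print(block_id, index, read_block_entry(block, index))
```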

February 16, 2010

Scriptable Processor modules

One of the new features we are preparing for the next version of IDA is the ability to write processor modules using your favorite scripting language.
After realizing how handy it is to write file loaders using scripting languages, we set out to do the same for processor modules. As an exercise for this new feature, we implemented a processor module for the EFI bytecode.


Background

In IDA Pro, a processor module implementation is usually split into four parts:
  1. Processor, assembler, instructions and registers definitions (ins.cpp/.hpp, reg.cpp)
  2. Decoder (ana.cpp): decodes an instruction into an insn_t structure (the 'cmd' global variable)
  3. Emulation (emu.cpp): emulates instructions, creates appropriate cross references, traces the stack, recognizes code patterns, etc...
  4. Output (out.cpp): outputs the result to the screen
The processor module is described using the processor_t structure. It holds pointers to registers, instructions, processor module name and other callbacks (ana, emu, out, notify, ...).
The assembler is described using the asm_t structure. It holds pointers to the assembler syntax and other callbacks.
For more information about structures and functions used in IDA API and processor modules (e.g. insn_t), see this great tutorial by Steve Micallef.

Writing a processor module in Python

To write a processor module in Python, we follow similar logic.
  1. Write the get_idp_desc() function. It simply tells IDA what processors the module can handle.
    def get_idp_desc(): return "EFI Byte code:ebc"

    The return value means that this processor is named "EFI Byte code" and its short name is "ebc". Thus a subsequent call to set_processor_type('ebc') from a file loader will succeed.

    In the case of the pc processor module, which can handle many variations of the x86 architecture, the string looks like this:
    Intel 80x86 processors:8086:80286r:80286p:80386r:80386p:...
  2. Define the registers and instructions:
    # Registers definition
    proc_Registers = [
        # General purpose registers
        "R0",
        "R1",
        ...,
        "R10",
        ...
    ]

    # Instructions definition
    proc_Instructions = [
        {'name': 'INSN1', 'feature': CF_USE1},
        {'name': 'INSN2', 'feature': CF_USE1 | CF_CHG1}
        ...
    ]
  3. Write the get_idp_def() function. It should return a dictionary similar to the processor_t structure with the processor, assembler, instructions and registers definitions.
    # This function returns the processor module definition
    def get_idp_def():
        return {
            'version': IDP_INTERFACE_VERSION,

            # IDP id
            'id' : 0x8000 + 1,

            # Processor features
            'flag' : PR_USE32 | PRN_HEX | PR_RNAMESOK,

            # short processor names
            # Each name should be shorter than 9 characters
            'psnames': ['ebc'],

            # long processor names
            # No restriction on name lengths.
            'plnames': ['EFI Byte code'],

            # number of registers
            'regsNum': len(proc_Registers),

            # register names
            'regNames': proc_Registers,

            # Array of instructions
            'instruc': proc_Instructions,

            ....

            'assembler': \
            {
                # flag
                'flag' : ASH_HEXF3 | AS_UNEQU | AS_COLON | ASB_BINF4 | AS_N2CHR,

                # Assembler name (displayed in menus)
                'name': "EFI bytecode assembler",

                ...

                # byte directive
                'a_byte': "db",

                # word directive
                'a_word': "dw",

                # remove if not allowed
                'a_dword': "dd",

                ...
            } # Assembler
        }
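As an illustration of the descriptor string returned by get_idp_desc(), the following sketch splits it into the long name and the short names and checks the short-name length limit mentioned in the comments above. The helper names are hypothetical, and how IDA itself parses the string is not shown here; the x86 string is used only up to the point where the article elides it.

```python
def parse_idp_desc(desc):
    """Split 'Long name:short1:short2:...' into (long_name, [short names])."""
    long_name, *short_names = desc.split(":")
    return long_name, short_names

def check_short_names(short_names):
    """Each short name should be shorter than 9 characters."""
    return all(0 < len(name) < 9 for name in short_names)

def supports(desc, short_name):
    """True if the module described by desc handles the given processor."""
    return short_name in parse_idp_desc(desc)[1]

ebc = "EFI Byte code:ebc"
pc = "Intel 80x86 processors:8086:80286r:80286p:80386r:80386p"

print(parse_idp_desc(ebc))
print(supports(pc, "80386p"), supports(pc, "ebc"))
print(check_short_names(parse_idp_desc(pc)[1]))
```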
Now that we have finished all the declarations, we can implement the decoder (or analyzer), emulator and output callbacks.
  • The analyzer looks like this:
    def ph_ana():
        """
        Decodes an instruction into the global variable 'cmd'
        Current address is pre-filled in cmd.ea
        """
        cmd = idaapi.cmd

        # take opcode byte
        b = ua_next_byte()

        # decode and fill cmd.Operands etc...
        # ...

        # Return decoded instruction size or zero
        return cmd.size

    And decoding one instruction/filling the 'cmd' variable may look like this:
    def decode_JMP8(opbyte, cmd):
        conditional = (opbyte & 0x80) != 0
        cs = (opbyte & 0x40) != 0
        cmd.Op1.type = o_near
        cmd.Op1.dtyp = dt_byte
        addr = ua_next_byte()
        cmd.Op1.addr = (as_signed(addr, 8) * 2) + cmd.size + cmd.ea
        if conditional:
            cmd.auxpref = FL_CS if cs else FL_NCS
        return True
  • The emulator:
    # Emulate instruction, create cross-references, plan to analyze
    # subsequent instructions, modify flags etc. Upon entrance to this function
    # all information about the instruction is in 'cmd' structure.
    # If zero is returned, the kernel will delete the instruction.
    def ph_emu():
        aux = cmd.auxpref
        Feature = cmd.get_canon_feature()

        if Feature & CF_USE1:
            handle_operand(cmd.Op1, 1)
        if Feature & CF_CHG1:
            handle_operand(cmd.Op1, 0)
        if Feature & CF_USE2:
            handle_operand(cmd.Op2, 1)
        if Feature & CF_CHG2:
            handle_operand(cmd.Op2, 0)
        if Feature & CF_JUMP:
            QueueMark(Q_jumps, cmd.ea)

        # add flow xref
        if Feature & CF_STOP == 0:
            ua_add_cref(0, cmd.ea + cmd.size, fl_F)

        return 1
  • The output callback:
    # Generate text representation of an instruction in 'cmd' structure.
    # This function shouldn't change the database, flags or anything else.
    # All these actions should be performed only by ph_emu() function.
    def ph_out():
        cmd = idaapi.cmd

        # Init output buffer
        buf = idaapi.init_output_buffer(1024)

        # First, output the instruction mnemonic
        OutMnem()

        # Output the first operand if present (this invokes the ph_outop callback)
        out_one_operand( 0 )

        # Output the rest of the operands
        for i in xrange(1, 3):
            op = cmd[i]
            if op.type == o_void:
                break
            out_symbol(',')
            OutChar(' ')
            out_one_operand(i)

        # Terminate the output buffer
        term_output_buffer()

        # Emit the line
        cvar.gl_comm = 1
        MakeLine(buf)
    Note that the previous callbacks are very similar to their C language counterparts.
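The decode_JMP8() example relies on an as_signed() helper that is not shown in this post. One plausible implementation, together with the target computation from that example, could look like the sketch below. The jmp8_target() wrapper is hypothetical, and the instruction size of 2 assumes one opcode byte plus one offset byte.

```python
def as_signed(value, nbits):
    """Interpret the low nbits of value as a two's-complement number."""
    mask = (1 << nbits) - 1
    value &= mask
    sign_bit = 1 << (nbits - 1)
    return value - (1 << nbits) if value & sign_bit else value

def jmp8_target(ea, insn_size, offset_byte):
    """Same arithmetic as cmd.Op1.addr in decode_JMP8():
    the signed byte counts 2-byte units from the end of the instruction."""
    return as_signed(offset_byte, 8) * 2 + insn_size + ea

# Forward jump by 3 units and backward jump by 2 units from 0x1000.
print(hex(jmp8_target(0x1000, 2, 0x03)))
print(hex(jmp8_target(0x1000, 2, 0xFE)))
```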

Although this feature will not work with the current version of IDA Pro, you can download the EBC script sample for a preview of how a module would look.

If you like this feature, make sure to apply for the beta testing of the next version when we announce it!

February 05, 2010

New IDC improvement in IDA Pro 5.6

Scripting with IDA Pro has always been a very handy feature, not only when used in scripts but also in expressions, breakpoint conditions, form fields, etc...
In IDA Pro 5.6 we improved the IDC language and made it more convenient to use by adding objects, exceptions, support for strings with embedded zeroes, string slicing and references.

General language improvements

Local variables can now be declared and initialized anywhere within a function:
    static func1()
    {
      Message("Hello world\n");
      auto s = AskStr("Enter new name", "noname00");
      // ...
      auto i = 0;
      // ....
    }
Global variables can be declared (in a function or in the global scope) with the extern keyword:
    // Global scope
    extern g_count; // Global variables cannot be initialized during declaration

    static main()
    {
      extern g_another_var;
      g_another_var = 123;
      g_count = 1;
    }
Functions can be passed around and used as callbacks:
    static my_func(a,b)
    {
      Message("a=%d, b=%d\n", a, b);
    }

    static main()
    {
      auto f = my_func;
      f(1, 2);
    }
Strings can now contain the zero character thus allowing you to use IDC strings like buffers. This is extremely useful when used with Appcall to call functions that expect buffers:
    auto s = "\x83\xF9\x00\x74\x10";
    Message("len=%d\n", strlen(s));

    // Construct a buffer with strfill()
    s = strfill('!', 100);
    Message("len=%d\n", strlen(s));
Strings can be easily manipulated with slices (Python style):
    #define QASSERT(x) if (!(x)) { Warning("ASSERT: " #x); }

    auto x = "abcdefgh";

    // get string slice
    QASSERT(x[1] == "b");
    QASSERT(x[2:] == "cdefgh");
    QASSERT(x[:3] == "abc");
    QASSERT(x[4:6] == "ef");

    // set string slice
    x[0] = "A";
    QASSERT(x == "Abcdefgh");
    x[1:3] = "BC";
    QASSERT(x == "ABCdefgh");

    // delete part of a string
    x[4:5] = "";
    QASSERT(x == "ABCdfgh");

    // patch part of the string with numbers
    x[0:4] = 0x11223344;
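For readers coming from Python, the same slice operations can be mirrored on a bytearray. This is only a comparison of the slice semantics, not a description of how IDC implements them: Python str objects are immutable, so a mutable bytearray stands in for the IDC string, and the numeric patch at the end of the IDC example is left out.

```python
x = bytearray(b"abcdefgh")

# get slices (indexing a bytearray yields an int, so x[1:2] is used for one byte)
assert x[1:2] == b"b"
assert x[2:] == b"cdefgh"
assert x[:3] == b"abc"
assert x[4:6] == b"ef"

# set slices
x[0:1] = b"A"
assert x == b"Abcdefgh"
x[1:3] = b"BC"
assert x == b"ABCdefgh"

# delete part of the buffer by assigning an empty slice
x[4:5] = b""
assert x == b"ABCdfgh"

print(x.decode("ascii"))
```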
Strings and numbers are always passed by value in IDC, but now it is possible to pass variables by reference (using the ampersand operator):
    static incr(a)
    {
      a++;
    }

    static main()
    {
      auto i = 1;
      incr(&i);
      Message("i=%d\n", i);
    }
Note that objects (described below) are always passed by reference.

IDC classes

Classes can now be declared in IDC. All classes derive from the built-in base class object:
    auto o = object();
    o.ea = here;
    o.flag = 0;
User objects can be defined with the class keyword:
    class testclass
    {
      testclass(name)
      {
        Message("constructing: %s\n", name);
        this.name = name;
      }
      ~testclass()
      {
        Message("destructing: %s\n", this.name);
      }
      set_name(n)
      {
        Message("testclass.set_name -> old=%s new=%s\n", this.name, n);
        this.name = n;
      }
      get_name()
      {
        return this.name;
      }
    }

    static f1(n)
    {
      auto o1 = testclass("object in f1()");
      o1.set_name(n);
    }

    static main()
    {
      auto o2 = testclass("object2 in main()");
      Message("calling f1()\n");
      f1("new object1 name");
      Message("returned from f1()\n");
    }
Which outputs the following when executed:
    constructing: object2 in main()
    calling f1()
    constructing: object in f1()
    testclass.set_name -> old=object in f1() new=new object1 name
    destructing: new object1 name
    returned from f1()
    destructing: object2 in main()
To enumerate all the attributes in an object:
    auto attr_name;
    auto o = object();
    o.attr1 = "value1";
    o.attr2 = "value2";
    for ( attr_name=firstattr(o); attr_name != 0; attr_name=nextattr(o, attr_name) )
      Message("->%s: %s\n", attr_name, getattr(o, attr_name));
If object attribute names are numbers then they can be accessed with the subscript operator:
    auto o = object();
    o[0] = "zero";
    o[1] = "one";
With this knowledge, we can write a simple IDC list class:
    class list
    {
      list()
      {
        this.__count = 0;
      }
      size()
      {
        return this.__count;
      }
      add(e)
      {
        this[this.__count++] = e;
      }
    }

    static main()
    {
      auto a = list();
      a.add("hello");
      a.add("world");
      a.add(5);
      auto i;
      for (i=a.size()-1;i>=0;i--)
        print(a[i]);
    }
IDC classes also support inheritance:
    class testclass_extender: testclass
    {
      testclass_extender(id): testclass('asdf')
      {
        this.id = id;
      }
      // Override a method and then call the base version
      set_name(n)
      {
        Message("testclass_extender-> %s\n", n);
        testclass::set_name(this, n);
      }
    }
They also support getattr/setattr hooking like in Python:
    class attr_hook
    {
      attr_hook()
      {
        this.id = 1;
      }
      // setattr will trigger for every attribute assignment
      __setattr__(attr, value)
      {
        Message("setattr: %s->", attr);
        print(value);
        setattr(this, attr, value);
      }
      // getattr will only trigger for non-existing attributes
      __getattr__(attr)
      {
        Message("getattr: '%s'\n", attr);
        if ( attr == "magic" )
          return 0x5f8103;
        // Of course this will cause an exception since
        // we try to fetch a non-existing attribute
        return getattr(this, attr);
      }
    }

Exceptions

Normally, when a runtime error occurs, the script aborts and the interpreter displays the runtime error message. With exception handling, you can catch runtime errors:
    static test_exceptions()
    {
      // variable to hold the exception information
      auto e;
      try
      {
        auto a = object();
        // Try to read an invalid attribute:
        Message("a.name=%s\n", a.name);
      }
      catch ( e )
      {
        Message("Exception occured. Exception dump follows:\n");
        print(e);
      }
    }
Resulting in the following output:
    Executing function 'main'...
    Exception occured. Exception dump follows:
    object
      description: "No such attribute: object.name"
      file: "C:\\Temp\\ida56.idc"
      func: "test_exceptions"
      line: 91. 5Bh
      pc: 31. 1Fh
      qerrno: 1538. 602h

IDC debugging tips

Last but not least, we would like to mention two useful IDC debugging tips.

The first (we used it previously) involves the print() function:
    // Print variables in the message window
    // This function print text representation of all its arguments to the output window.
    // This function can be used to debug IDC scripts
    void print (...);
This function is very handy for printing a variable of any type, especially objects and all their nested attributes.

And the second tip involves the use of the command window to evaluate commands. The trick is to type an IDC statement without a terminating semicolon.
To illustrate, we first call DecodeInstruction() with a terminating semicolon:

(Screenshot: idc56_semi.gif)

Repeating the same statement without the semicolon automatically invokes print() on the returned result:

(Screenshot: idc56_nosemi.gif)

We promised two debugging tips, but here is a third: you can use the period key (".") to jump from an IDA View to the command window, and the Escape key to return to the IDA View.
The script snippets used in this blog entry can be downloaded from here.