TippingPoint Digital Vaccine Laboratories

MindshaRE: Naming Conventions

It is my belief that reverse engineering is one part patience, one part experience, and a whole lot of organization. OK, maybe that is a bit of an exaggeration, but organization is essential to reversing. Having a decent naming convention that you stick to helps you not only in the short term, but also 6 months down the line when you or your co-workers look at your IDB. There is no "right" naming convention, but everyone should at least have one they use regularly. So today in MindshaRE we will cover what I use to name functions, variables, and other information you might find in a binary.

MindshaRE is our weekly look at some simple reverse engineering tips and tricks. The goal is to keep things small and discuss everyday aspects of reversing. You can view previous entries here by going through our blog history.

There are several reasons to actually use a naming convention. It makes your IDBs easier to read (for everyone), helps you organize functions, variables, and basic blocks, and in general makes your IDBs more professional looking, among other things. If you ever read this blog you know I try really hard to make everything as simple and clear as possible.

Naming convention standards have been debated ad nauseam by developers since programming languages were created. For some reason everyone insists that their opinion is obviously the best. As long as you, and anyone you interact with, can read your labels and glean the intended information from them, you win. Besides readability, one of the most important aspects of a good naming convention is simplicity. If it's too complex you won't use it. A naming convention must become a natural extension of the way you reverse.

My personal naming convention is a mixture of Hungarian, UpperCamelCase, and traditional C style. I do this because I need type information, readability, and flexibility. I tend to make my labels longer and descriptive because it's easier to understand.

I have broken down my naming conventions into their respective categories. Let's jump right in.

Functions:

I label my functions more than anything. You have three different views of functions which you'll look at almost every day. Viewing the top of a function, calling a function, or looking at the function window will display the names you have given them. IDA starts you out with the ambiguous "sub_xxxxxxxx" moniker. This is fine but hardly a description of what the function does.

When I have reversed a function I will give it an UpperCamelCase name, trying to be as descriptive as possible. For instance "sub_7E4D5E88" might become "ReadFromFile". One drawback of this method is that you have to be mindful of any import names that may conflict. IDA will let you use a conflicting name, but might assign prototype information to the function. If I wanted to call this function "ReadFile" I might just call it "MyReadFile".

Another common occurrence is to simply append "Wrapper" to functions. In the above example the caller may be renamed to "ReadFromFileWrapper". This can become a little cumbersome when you get 4 wrappers deep. ReadFromFileWrapperWrapperWrapperWrapper just doesn't have the same ring to it. In that case I will just shorten to "ReadFromFileWWWW".
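The two function-naming rules above can be sketched in a few lines of Python. This is only an illustration: the KNOWN_IMPORTS set is a made-up stand-in for a binary's real import table, both helper names are hypothetical, and I am assuming the "W" shorthand kicks in at two wrappers deep, since the text only pins down the one-deep and four-deep cases.

```python
# Hypothetical subset of import names; a real list would come from the
# binary's import table.
KNOWN_IMPORTS = {"ReadFile", "WriteFile", "CreateFileA"}


def safe_function_name(name, imports=KNOWN_IMPORTS):
    """Prefix "My" when the chosen name collides with an import name."""
    return "My" + name if name in imports else name


def wrapper_name(base, depth):
    """Name a function that sits `depth` wrappers above `base`.

    depth 1 appends "Wrapper"; deeper levels use one "W" per level.
    The cut-over at depth 2 is an assumption, not a rule from the text.
    """
    if depth < 0:
        raise ValueError("depth must be >= 0")
    if depth == 0:
        return base
    if depth == 1:
        return base + "Wrapper"
    return base + "W" * depth


print(safe_function_name("ReadFile"))
print(wrapper_name("ReadFromFile", 1))
print(wrapper_name("ReadFromFile", 4))
```

The numbered style suggested in the comments below (FunctionWrapper1, FunctionWrapper2) would be a one-line change to the last return statement.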

Arguments/Locals:

For arguments and locals I use Hungarian notation for its data type definitions. This seems to be the most descriptive method for associating needed type information with a variable name.

In general, arguments and locals are named in a similar fashion. The only difference is that I will prepend "arg_" to an argument's name in a function. This lets me easily differentiate between the two. If you need position information as well you can add it to the prefix, making a name like "arg0_" or "arg_4_", whichever is more natural to you.

Let's pretend we have a local integer that contains a count. Using Hungarian notation I would call it "dwCount". To me this specifies its size (I'm assuming dword ints of course) and its purpose. If this was a pointer I'd prepend a "p" to the name to get "pdwCount". I realize people may groan at how this looks. That's fine, but looking at this label I can instantly tell we have a pointer to a 32-bit integer being used as a count. If this was an argument we would use "arg_dwCount" or "arg0_dwCount". To satisfy those who may not always be on 32-bit platforms, you could also label this by size, as in "i64Count".

If we also need the signedness of the data type we can add that as well. Sometimes the signed distinction is unnecessary, but I prefer more information to less. Our above example of a dword integer would be "udwCount" or "sdwCount". And, admittedly, the ugliest name is "pudwCount", which denotes a pointer to an unsigned dword.
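To make the ordering of the pieces explicit, here is a small Python sketch that assembles a label the way the examples above are built: optional argument marker, then pointer, then sign, then size, then the descriptive name. The function and its parameter names are hypothetical and exist only for illustration.

```python
def hungarian_name(base, size="dw", sign="", pointer=False,
                   is_arg=False, position=None):
    """Build a label such as "pudwCount" or "arg0_dwCount".

    base     descriptive UpperCamelCase part, e.g. "Count"
    size     size prefix such as "b", "w" or "dw"
    sign     "" (unspecified), "s" (signed) or "u" (unsigned)
    pointer  True to prepend "p"
    is_arg   True to prepend the argument marker
    position optional argument position, giving "arg0_" instead of "arg_"
    """
    if sign not in ("", "s", "u"):
        raise ValueError("sign must be '', 's' or 'u'")
    name = ("p" if pointer else "") + sign + size + base
    if is_arg:
        marker = "arg_" if position is None else "arg{}_".format(position)
        name = marker + name
    return name


print(hungarian_name("Count"))                          # dwCount
print(hungarian_name("Count", pointer=True, sign="u"))  # pudwCount
print(hungarian_name("Count", is_arg=True, position=0)) # arg0_dwCount
```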

Here is a list of the data types I often encounter.

    b   Byte    bCount
    w   Word    wCount
    dw  Dword   dwCount
    p   Pointer pCount  pwCount psdwCount
    sz  String  szName
    a   Array   aNames
    s   Struct  sNames

Alternatively, you could use the C identifiers char, short, or long if you want. Whatever works for you.
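Going the other direction, the prefix table above is regular enough to decode mechanically. The following Python sketch reads a label back into plain English. It is an assumption-laden illustration: the function name and regular expressions are mine, "s" is treated as "signed" only when a byte/word/dword prefix follows it (otherwise it means struct), and size-based names such as "i64Count" are deliberately not handled.

```python
import re

# Prefix letters come from the table above; the descriptions are mine.
TYPE_PREFIXES = {"b": "byte", "w": "word", "dw": "dword",
                 "sz": "string", "a": "array", "s": "struct"}
SIGN_PREFIXES = {"s": "signed", "u": "unsigned"}

# Optional "arg_" / "arg0_" / "arg_4_" marker, lowercase prefix, then the
# UpperCamelCase descriptive part.
_NAME_RE = re.compile(r"^(?P<arg>arg_?\d*_)?(?P<prefix>[a-z]*)(?P<base>[A-Z]\w*)$")

# Any number of "p", an optional sign (only directly before b/w/dw),
# then an optional type from the table.
_PREFIX_RE = re.compile(
    r"(?P<ptr>p*)(?P<sign>[su](?=(?:b|w|dw)$))?(?P<type>b|w|dw|sz|a|s)?")


def describe_label(name):
    """Return (description, base) for a Hungarian-style label."""
    m = _NAME_RE.match(name)
    p = _PREFIX_RE.fullmatch(m.group("prefix")) if m else None
    if p is None:
        raise ValueError("not a label this sketch understands: " + name)
    parts = []
    if m.group("arg"):
        parts.append("argument")
    parts.extend(["pointer to"] * len(p.group("ptr")))
    if p.group("sign"):
        parts.append(SIGN_PREFIXES[p.group("sign")])
    # A bare "p" prefix carries no type, so fall back to a generic word.
    parts.append(TYPE_PREFIXES.get(p.group("type"), "value"))
    return " ".join(parts), m.group("base")


for label in ("dwCount", "psdwCount", "szName", "sNames", "arg0_dwCount"):
    print(label, "->", describe_label(label))
```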

Globals:

Global data varies. It could be a handle, jump table, global variable or hundreds of other things. With that said, you may need to work on a case-by-case basis. Normally, however, I will use the C ALL_UPPERCASE_GLOBAL nomenclature. Since I am used to this style for global variables, it works well for me. If we had a jump table that handled packet processing we could name it "PACKET_HANDLER".
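If you jot down a description first and name the global afterwards, the conversion to ALL_UPPERCASE is easy to script. A minimal Python sketch, with a hypothetical helper name:

```python
import re


def to_global_name(description):
    """Turn free text or UpperCamelCase into an ALL_UPPERCASE label."""
    # Split UpperCamelCase words: "PacketHandler" -> "Packet_Handler"
    split = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", description)
    # Collapse every run of non-alphanumeric characters into one underscore
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", split).strip("_")
    return cleaned.upper()


print(to_global_name("packet handler"))
print(to_global_name("PacketHandler"))
```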

Branches:

Branches are your intraprocedural jumps to other basic blocks in the function. IDA names these "loc_xxxxxxxx". Oftentimes we want to rename them, for instance if we know the branch target is a basic block that returns from the function.

For these instances I stick to the old C style of lower_case_underscore names. It helps me differentiate between functions and basic blocks easily. It also seems to be more readable in certain cases and stands out less. Let's pretend the basic block currently named "loc_7E4D5F56" returns true. I would label this as "return_true". If it returns false I'd go with "return_false". Some other common labels may be "check_null", "check_counter", "begin_for_loop", and "throw_exception". These labels are useful in explaining basic blocks at a single glance.
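One payoff of giving each category its own style is that a label's category can usually be guessed from its shape alone. The Python sketch below shows the idea. The patterns and category names are my own approximation of the conventions described so far, not an exhaustive rule set, and the order of the checks matters.

```python
import re

# Checked top to bottom; the first matching pattern wins.
_RULES = [
    ("unnamed",  re.compile(r"^(sub|loc)_[0-9A-Fa-f]+$")),       # IDA defaults
    ("argument", re.compile(r"^arg_?\d*_")),                     # arg_, arg0_, arg_4_
    ("global",   re.compile(r"^[A-Z][A-Z0-9_]*$")),              # ALL_UPPERCASE
    ("function", re.compile(r"^[A-Z][A-Za-z0-9]*$")),            # UpperCamelCase
    ("branch",   re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")),  # lower_case_underscore
    ("local",    re.compile(r"^[a-z]+[A-Z]\w*$")),               # Hungarian prefix
]


def classify_label(name):
    """Guess what kind of thing a label names, based only on its style."""
    for category, pattern in _RULES:
        if pattern.match(name):
            return category
    return "unknown"


for label in ("sub_7E4D5E88", "ReadFromFile", "PACKET_HANDLER",
              "return_true", "pdwCount", "arg0_dwCount"):
    print(label, "->", classify_label(label))
```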

Marks:

Bookmarks are used to save a particularly interesting location. In general these can be free form and as descriptive as possible. A good label will tell you why the location is important. An example I often use is "read tcp socket data" or "read from configuration file" keeping everything lower case and forgetting punctuation. I also have the ubiquitous "im here" or "here" mark indicating my last position in the IDB.

Comments:

Comments should be readable and generally a single line. To me it's strange to see multiple lines of comments on a single address. You should insert any data you may have, or references to other addresses if need be. Remember any address IDA has a reference for that you put in a comment can be followed in the IDA GUI.

Creating names using a convention you are comfortable with helps everyone. Try to find something you feel is beneficial and it will become second nature. I don't know how many times I've gone back to an IDB and not known what was going on because I didn't name things properly. Forget trying to open someone else's IDB.

I would be very interested in hearing about the conventions you personally use. I certainly do not think my way is bulletproof or the absolute best. Everything can be improved and expanded. Please leave a comment if you have some suggestions. Maybe one day everyone will use a similar style! In the very unlikely case that people actually agree on a naming standard, I'll draft up a document with more detail that can be used by everyone.

Hope you enjoyed this week's MindshaRE.

-Cody
Published On: 2008-10-02 13:44:00

Comments

  1. Dima commented on 2008-10-02 @ 21:18

    Heh, I use Hungarian notation for variables too, got this habit from software development. For the function arguments I usually keep arg_n, var_n prefixes, so I can see a variable location in the stack. For the class names I use ClassName_Method scheme for methods and ClassName_mProperty (that m prefix comes from MFC) for properties. In the case of "nested wrappers" i caught myself using names like FunctionWrapper1, FunctionWrapper2 and so on. Don't know why, used it once and it just stayed.

  2. dennis commented on 2008-10-03 @ 01:37

    nice topic, a uniform naming convention would really be nice.
    here's how i do:

    function names: lowercase, individual parts separated by underscores. example: read_length_field
    function wrappers: i usually append 'wrapper' as well, prepend underscore chars for multiple wrappers of the same function, each underscore represents a 'level' (though I am unhappy with that as it looks ugly and is not descriptive enough, there must be something better ;-) ) example: ___read_length_field_wrapper
    globals: are prepended 'g_'. example: g_number_of_packets
    comments: i sometimes add multiline repeatable comments to functions describing their usage/parameters/types

  3. Cody Pierce commented on 2008-10-03 @ 01:37

    @Dima: Thanks for the comment. I also keep the stack offsets in the name when I think it's needed. I like your function wrapper numbering, maybe I'll switch. I seldom get a couple wrappers deep without changing the name so it's never been a problem adding "W" to the end.

  4. Cody Pierce commented on 2008-10-03 @ 01:44

    @dennis: I think the main difference in naming conventions probably stems from where people first started programming, or have more experience. Most unix programmers follow your naming convention, while programmers on the windows platform use the one I discussed. Both work. Hopefully someone will have a good idea for naming wrappers that all of us can use :)

  5. Ali Rizvi-Santiago commented on 2008-10-08 @ 12:02

    I tend to prefix all local variables with l_, globals with g_, args with a_. I've also been using something sort of related to hungarian notation combined with things i've seen in john carmack's code from a long time ago. plus, suffixing it with the relative address for quick referencing is also useful as opposed to pulling the address out of the opcode.

    -------
    for referencing info:
    p = pointer
    v = generic value

    for typing info:
    b, h, w, s = byte, halfword (16bits), word (32bits), string
    t = function table

    (i use h and w as it can be applied to more archs than just intel and still remain consistent.)
    -------

    so if you have a pointer to a byte that's an argument that's located at 0xc(%esp) from the entry to the function (or arg_8).

    apb_resultOfSomeKind_8
    means - argument; pointer to a byte; - arg_8

    lvw_bitsPerPixel_4c
    means - local; value of a word(32bits); - var_4c

    gpps_blah_57005
    means - global; pointer to a pointer to a string; - $baseaddress + 0x57005

    for wrappers, i really don't care to know if it's a wrapper or not, just about its functionality (which i usually infer from its name anyways..) i actually just use the same scheme that i use for global data. if it's a wrapper to a memcpy, i just suffix it with the relative address. makes it visually easy to know where you set bp if you're working with code that can be mapped to a different base addy. (such as adobe flash). if i ever need to know the level of wrappers, i usually save that for a comment as i classify that thought as more like documenting the specific implementation of a function as opposed to describing the actual meaning of the function.

    example:
    memcpy_1a75105

    i have an .idc i wrote a while ago for automatic application of this convention to a name. it's pretty trivial to write though. especially if you decide to use python as .idc is sorta lacking on string manipulation functions.

    i also stole pedram's idea of using '?' in the suffix when unsure about what some particular data is.
    http://pedram.redhive.com/blog/2005-11-02/

