MindshaRE is our weekly look at some simple reverse engineering tips and tricks. The goal is to keep things small and discuss everyday aspects of reversing. You can view previous entries by going through our blog history.
When I speak of constructors I am describing the code responsible for creating an object in a high-level language like C++. When an object is instantiated, its constructor must first set up all the data needed to use that object, including its properties and methods.
To identify these in a binary we first need to look at an example. Using symbols I have found a test case that is fairly typical.
.text:75464778 Domain__Domain proc near
.text:75464778 mov edi, edi
.text:7546477A push esi
.text:7546477B mov esi, ecx
.text:7546477D push edi
.text:7546477E xor edi, edi
.text:75464780 lea ecx, [esi+3Ch]
.text:75464783 mov [esi], edi
.text:75464785 mov [esi+4], di
.text:75464789 mov [esi+8], edi
.text:7546478C mov [esi+0Ch], edi
.text:7546478F mov [esi+10h], edi
.text:75464792 mov [esi+38h], edi
.text:75464795 call sub_7548A2BC
.text:7546479A push 4
.text:7546479C lea ecx, [esi+5Ch]
.text:7546479F call CQueue__CQueue
.text:754647A4 push 4
.text:754647A6 lea ecx, [esi+7Ch]
.text:754647A9 call CQueue__CQueue
.text:754647AE push 4
.text:754647B0 lea ecx, [esi+9Ch]
.text:754647B6 call sub_754616E7
.text:754647BB push 4
.text:754647BD lea ecx, [esi+0BCh]
.text:754647C3 call sub_754616E7
.text:754647C8 push 4
.text:754647CA lea ecx, [esi+0E0h]
.text:754647D0 mov [esi+0DCh], edi
.text:754647D6 call sub_754616E7
.text:754647DB lea ecx, [esi+100h]
.text:754647E1 call RandomChannelGenerator__RandomChannelGenerator
.text:754647E6 push edi
.text:754647E7 push edi
.text:754647E8 mov ecx, esi
.text:754647EA call Domain__LockDomainParameters
.text:754647EF pop edi
.text:754647F0 mov eax, esi
.text:754647F2 pop esi
.text:754647F3 retn
.text:754647F3 Domain__Domain endp
When looking at constructors we want to see that a structure is being built. At the top of this function a structure is being initialized by zeroing out elements of the object being created. The tip-off is the following assembly.
.text:7546477E xor edi, edi
.text:75464780 lea ecx, [esi+3Ch]
.text:75464783 mov [esi], edi
.text:75464785 mov [esi+4], di
.text:75464789 mov [esi+8], edi
.text:7546478C mov [esi+0Ch], edi
.text:7546478F mov [esi+10h], edi
.text:75464792 mov [esi+38h], edi
Here edi becomes a zero register and is then used to zero out elements of the structure being created. The base pointer is esi, which was loaded from ecx at the top of the function (ecx carries the this pointer under the thiscall convention). To double-check that the object being initialized is new, we can look at this function's caller.
.text:75461900 Controller__ApplicationCreateDomain proc near
...
.text:7546191C push 108h
.text:75461921 call operator_new
.text:75461926 test eax, eax
.text:75461928 pop ecx
.text:75461929 jz short loc_75461936
.text:7546192B mov ecx, eax
.text:7546192D call Domain__Domain
The call to operator_new() allocates a 0x108-byte memory region for our instantiated Domain object, and the returned pointer is passed in ecx to Domain__Domain. At this point we can be confident this is an object constructor.
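The caller-side check can also be tried outside of IDA on a plain-text listing. The following is a minimal sketch, assuming the listing format shown above and an allocator labeled operator_new. The names parse, parse_imm and find_allocated_ctor_calls, and the six-instruction look-ahead window, are hypothetical choices made for illustration.

```python
import re

# Matches the listing format used in this post: "<segment>:<address> <mnemonic> <operands>"
LINE_RE = re.compile(r"^\S+:[0-9A-Fa-f]+\s+(\w+)\s*(.*)$")
# IDA-style immediates start with a digit and may end in "h" (e.g. "108h", "4").
IMM_RE = re.compile(r"^([0-9][0-9A-Fa-f]*)h?$")

LISTING = """\
.text:75461900 Controller__ApplicationCreateDomain proc near
...
.text:7546191C push 108h
.text:75461921 call operator_new
.text:75461926 test eax, eax
.text:75461928 pop ecx
.text:75461929 jz short loc_75461936
.text:7546192B mov ecx, eax
.text:7546192D call Domain__Domain
"""

def parse(listing):
    """Turn a text listing into (mnemonic, [operands]) tuples."""
    insns = []
    for line in listing.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips "..." and blank lines
        ops = [o.strip() for o in m.group(2).split(",")] if m.group(2) else []
        insns.append((m.group(1), ops))
    return insns

def parse_imm(text):
    """Parse an immediate such as '108h'; return None for anything else."""
    m = IMM_RE.match(text)
    return int(m.group(1), 16) if m else None

def find_allocated_ctor_calls(listing, window=6):
    """Return (size, callee) pairs where a pushed size feeds operator_new and
    the result in eax is moved into ecx before the next call."""
    insns = parse(listing)
    hits = []
    for i, (mnem, ops) in enumerate(insns):
        if mnem != "call" or ops != ["operator_new"]:
            continue
        # The allocation size is the immediate pushed just before the call.
        if i == 0 or insns[i - 1][0] != "push" or not insns[i - 1][1]:
            continue
        size = parse_imm(insns[i - 1][1][0])
        if size is None:
            continue
        this_ptr_set = False
        for mnem2, ops2 in insns[i + 1:i + 1 + window]:
            if mnem2 == "mov" and ops2 == ["ecx", "eax"]:
                this_ptr_set = True  # allocated memory becomes the this pointer
            elif mnem2 == "call":
                if this_ptr_set and ops2:
                    hits.append((size, ops2[0]))
                break
    return hits

hits = find_allocated_ctor_calls(LISTING)
```

Each hit pairs an allocation size with the function that receives the fresh memory in ecx, which makes that function a constructor candidate and the size a first estimate of the object's size.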
The previous example is easy to find with symbols. IDA will label constructors ClassName::ClassName which turns into ClassName__ClassName in the UI. But what happens when we are stripped of our precious symbolic information? We will have to resort to pattern recognition and IDAPython.
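While symbols are still available, the naming convention just described can be checked mechanically. This is a small sketch that assumes the names have already been exported from the database as plain strings; looks_like_constructor is a hypothetical helper, and nested or templated class names are not handled.

```python
import re

# "Class__Class" is how this post describes IDA displaying Class::Class.
# The backreference \1 requires both sides of "__" to be the same name.
CTOR_RE = re.compile(r"^(\w+?)__\1$")

def looks_like_constructor(name):
    """True when the function name has the form ClassName__ClassName."""
    return CTOR_RE.match(name) is not None

# Names taken from the Domain__Domain listing earlier in this post.
names = [
    "Domain__Domain",
    "CQueue__CQueue",
    "RandomChannelGenerator__RandomChannelGenerator",
    "Domain__LockDomainParameters",
    "sub_7548A2BC",
]
ctors = [n for n in names if looks_like_constructor(n)]
```

Here Domain__LockDomainParameters is rejected because the method name differs from the class name, and sub_7548A2BC is rejected because it carries no symbol at all.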
In this sample script we will have a few basic requirements. First we need a function that uses a zero register. It will also have to use that zero register to initialize structure variables. You can see in the example above edi will be our zero register and the mov to structure offsets will be our initialization. The code will look like this:
# Older IDAPython exposes these names by default; the imports make that explicit.
from idc import *
from idautils import Functions

def instruction_match(ea, mnem=None, op1=None, op2=None, op3=None):
    # Compare the instruction at ea against the given mnemonic and operand text.
    if mnem and mnem != GetMnem(ea):
        return False
    if op1 and op1 != GetOpnd(ea, 0): return False
    if op2 and op2 != GetOpnd(ea, 1): return False
    if op3 and op3 != GetOpnd(ea, 2): return False
    return True

segbeg = SegByName(".text")
segend = SegEnd(segbeg)
for ea in Functions(segbeg, segend):
    function_name = GetFunctionName(ea)
    beg = ea
    end = FindFuncEnd(beg)
    zero = False
    count = 0
    curea = beg
    while curea <= end and curea != BADADDR:
        mnem = GetMnem(curea)
        if "xor" in mnem or "mov" in mnem:
            if instruction_match(curea, "xor", "edi", "edi"):
                zero = True
            elif zero and "mov" in mnem:
                # mov [esi+4], edi
                optype = GetOpType(curea, 0)
                if optype == 4:
                    op = GetOpnd(curea, 1)
                    if op in ["edi", "di"]:
                        count += 1
        curea = NextHead(curea, end)
    # Report functions that zero more than four fields through the zero register.
    if count > 4:
        print("%x" % beg)
This script essentially implements our requirements. It loops through all functions in the binary, searching each instruction for a zero register and for that register being stored into a structure. Running this script gives us a slew of addresses to investigate, one being our original example at 0x75464778.
There are a couple of issues with this script. First, the compiler decides which registers to use for the zero register and for the structure pointer. For instance, look at the following example.
.text:754748D2 xor ebx, ebx
.text:754748D4 lea ecx, [edi+20h]
.text:754748D7 mov dword ptr [edi], offset CConfDescriptorListContainer___vftable_
.text:754748DD mov [edi+10h], ebx
.text:754748E0 mov [edi+14h], ebx
.text:754748E3 mov [edi+18h], ebx
.text:754748E6 mov [edi+1Ch], ebx
.text:754748E9 call sub_7548A2BC
It is very similar to the first example, and our pattern still applies. However, the compiler has chosen to use ebx for the zero register and edi for the object structure. To make our script better we would want to add these other possible registers.
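One way to add the other registers is to stop hard-coding edi and accept any xor reg, reg as the zeroing idiom. The sketch below works on (mnemonic, operand 1, operand 2) tuples rather than on an IDA database, so it can be tried on its own. The helper names full_reg, zeroed_register and count_zero_stores are hypothetical and not part of IDAPython.

```python
# 16-bit aliases of the general-purpose 32-bit registers.
SUBREG = {"ax": "eax", "bx": "ebx", "cx": "ecx", "dx": "edx",
          "si": "esi", "di": "edi", "bp": "ebp", "sp": "esp"}
REGS32 = set(SUBREG.values())

def full_reg(op):
    """Map 'di' to 'edi'; return a 32-bit name unchanged, anything else as None."""
    if op in REGS32:
        return op
    return SUBREG.get(op)

def zeroed_register(mnem, op1, op2):
    """Return the register cleared by `xor reg, reg`, or None."""
    if mnem == "xor" and op1 == op2 and op1 in REGS32:
        return op1
    return None

def count_zero_stores(insns):
    """Count stores of the current zero register into memory operands."""
    zero = None
    count = 0
    for mnem, op1, op2 in insns:
        cleared = zeroed_register(mnem, op1, op2)
        if cleared:
            zero = cleared
        elif zero and mnem == "mov" and "[" in op1 and full_reg(op2) == zero:
            count += 1
    return count

# The CConfDescriptorListContainer excerpt shown above.
second_example = [
    ("xor", "ebx", "ebx"),
    ("lea", "ecx", "[edi+20h]"),
    ("mov", "dword ptr [edi]", "offset CConfDescriptorListContainer___vftable_"),
    ("mov", "[edi+10h]", "ebx"),
    ("mov", "[edi+14h]", "ebx"),
    ("mov", "[edi+18h]", "ebx"),
    ("mov", "[edi+1Ch]", "ebx"),
    ("call", "sub_7548A2BC", ""),
]
stores = count_zero_stores(second_example)
```

The excerpt yields four stores, so with a count > 4 threshold the rest of that function would need to contribute at least one more store before it is reported.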
A second problem is the lack of robustness in the script. For instance, we do not track the zero register. If it is overwritten, our script will still believe it is being used for initialization. We also do not handle local variables being initialized to zero. In IDA the operand type for stack locals such as [ebp+var_4] and for structure offsets such as [esi+4] is the same (type 4, a base register plus displacement), so the script cannot tell them apart.
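Both robustness problems can be sketched on plain (mnemonic, [operands]) tuples. This sketch makes some simplifying assumptions: any instruction other than cmp, test or push whose first operand is the zero register is treated as overwriting it, a call is assumed to clobber eax, ecx and edx (the caller-saved registers in the usual 32-bit x86 conventions), and stores based on ebp or esp are treated as stack locals. The function name count_object_zero_stores is hypothetical.

```python
import re

SUBREG = {"ax": "eax", "bx": "ebx", "cx": "ecx", "dx": "edx",
          "si": "esi", "di": "edi", "bp": "ebp", "sp": "esp"}
VOLATILE = {"eax", "ecx", "edx"}     # assumed clobbered by any call
READ_ONLY = {"cmp", "test", "push"}  # first operand is read, not written
FRAME_REGS = ("ebp", "esp")          # bases used for stack locals

def full_reg(op):
    """Map a 16-bit alias to its 32-bit register; other text is returned as is."""
    return SUBREG.get(op, op)

def base_register(mem_op):
    """Extract the base register from text such as '[ebp+var_4]'."""
    m = re.search(r"\[(\w+)", mem_op)
    return m.group(1) if m else None

def count_object_zero_stores(insns):
    """Count zero-register stores, dropping the register once it is overwritten."""
    zero = None
    count = 0
    for mnem, ops in insns:
        op1 = ops[0] if ops else ""
        op2 = ops[1] if len(ops) > 1 else ""
        if mnem == "xor" and op1 and op1 == op2:
            zero = op1                       # new zero register
        elif mnem == "call":
            if zero in VOLATILE:
                zero = None                  # callee may have changed it
        elif zero and mnem == "mov" and "[" in op1 and full_reg(op2) == zero:
            if base_register(op1) not in FRAME_REGS:
                count += 1                   # store into a non-stack structure
        elif zero and full_reg(op1) == zero and mnem not in READ_ONLY:
            zero = None                      # zero register was overwritten
    return count

sample = [
    ("xor", ["edi", "edi"]),
    ("mov", ["[esi]", "edi"]),        # counted
    ("mov", ["[ebp+var_4]", "edi"]),  # stack local, skipped
    ("mov", ["[esi+4]", "di"]),       # counted, di is the low half of edi
    ("mov", ["edi", "[esi+8]"]),      # edi no longer holds zero
    ("mov", ["[esi+0Ch]", "edi"]),    # not counted
]
tracked = count_object_zero_stores(sample)
```

Inside IDA the same state machine could be driven by the mnemonic and operand text the original script already reads. Filtering on the base register is only a heuristic, since a function without a frame pointer may use ebp for something else.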
Solving these problems would not be extremely difficult. With some additional text processing, and more robust requirements, our script could provide a reverse engineer with a handy tool for locating those crucial constructors in a binary. I hope this can be of some use in the future. As always if you have some additional ideas, or contributions, please leave a comment.
-Cody
