DVLabs has been collecting a large number of documents and files that are flagged as malicious, and we're trying to reduce how many of them need a full manual analysis. One of the methods we're using to help with this is shellcode detection. If shellcode is detected inside a document, we can reduce the amount of data we have to look at inside the file to find the attack. The majority of our code is in Python, so shellcode detection using a Python module is preferable. The two I'll be looking at in this post are pylibemu and pylibscizzle.
Pylibemu
I prefer pylibemu over the standard libemu Python bindings because it provides more functionality without being any harder to use. The feature that interests me most is the test() function, which profiles the execution against the win32 API and reports the attempted win32 API calls, giving you a better idea of what the shellcode is trying to do once it gains control. If the shellcode attempts any downloads, pylibemu will try to retrieve them. Here's an example of its usage:
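import pylibemu
# shellcode holds the raw bytes to analyze
emulator = pylibemu.Emulator()
offset = emulator.shellcode_getpc_test(shellcode)
emulator.prepare(shellcode, offset)
# emulate the shellcode and profile its win32 API calls
emulator.test()
print emulator.emu_profile_output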
Pylibemu is a viable option for use in detecting shellcode, but I found it to be hit and miss. I started testing Georg Wicherski’s libscizzle when it was initially released and found that the results were better on the shellcodes I was testing it on.
Pylibscizzle
The one issue I had with libscizzle was that it is a C++ library with no Python bindings. That led me to recently write and release Cython-based Python bindings for it. The main challenge in creating pylibscizzle was the constantly evolving C++ support in Cython. A few functions are still not accessible from Python because Cython has no native wrapping for std::string, but the core functionality is implemented. Using pylibscizzle is fairly simple: you create a pyDetector instance. Inside the initialization function of the pyDetector class, a pyScanner instance is created to retrieve candidate shellcode offsets. These offsets are then used in the detectShellcode function, which returns a value. If the value is 0xffffffff, the detection was not successful; if the value is less than zero, there was an error in detection. Here’s a code snippet showing example usage:
from pylibscizzle import pyDetector, pyScanner
detector = pyDetector(shellcode)
offset = detector.detectShellcode()
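The return-value convention described above can be wrapped in a small helper so that calling code does not have to repeat the checks. This is only a sketch: `classify_result` and `NOT_FOUND` are hypothetical names, not part of pylibscizzle, and the logic simply mirrors the convention as stated in this post.

```python
# Hypothetical helper (not part of pylibscizzle): classifies the value
# returned by detectShellcode() using the convention described in this post.
NOT_FOUND = 0xffffffff  # value the post gives for "detection was not successful"

def classify_result(value):
    """Return a (status, offset) tuple; offset is None unless status is 'found'."""
    if value == NOT_FOUND:
        return ("not_found", None)
    if value < 0:
        return ("error", None)
    return ("found", value)

status, offset = classify_result(0x70)
if status == "found":
    print("shellcode candidate at offset %#x" % offset)
```

Checking the 0xffffffff sentinel before the negative test keeps the two failure cases distinct regardless of whether the binding hands the value back as a signed or unsigned integer.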
The scanner module can also be accessed directly to retrieve the candidate offsets and use them for any other analysis.
scanner = pyScanner(shellcode)
offsets = scanner.findCandidates()
Once the offset from pylibscizzle has been obtained, the prepare() and test() functions from pylibemu can be used to look at the emu_profile_output to see libemu’s execution profile of the shellcode. This does not always work, but it can be useful if you are having trouble wrapping your head around what the shellcode is attempting to accomplish. There are also many Python-based disassemblers that can be used.
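A minimal sketch of that hand-off might look like the following. The function name `profile_with_scizzle_offset` is hypothetical, and the detector and emulator classes are passed in as parameters so the sketch does not assume either library is installed. It only uses the calls already shown in this post (detectShellcode(), prepare(), test() and emu_profile_output).

```python
def profile_with_scizzle_offset(shellcode, detector_cls, emulator_cls):
    """Find the offset with a pyDetector-like class, then profile the
    shellcode with a pylibemu.Emulator-like class.

    Returns the profile output, or None if no usable offset was found.
    """
    offset = detector_cls(shellcode).detectShellcode()
    # Convention from this post: 0xffffffff means no detection, < 0 means error.
    if offset == 0xffffffff or offset < 0:
        return None
    emulator = emulator_cls()
    emulator.prepare(shellcode, offset)
    emulator.test()
    return emulator.emu_profile_output

# Intended usage, assuming both libraries are installed:
#   import pylibemu
#   from pylibscizzle import pyDetector
#   print(profile_with_scizzle_offset(data, pyDetector, pylibemu.Emulator))
```

Returning None on a failed detection keeps the caller from feeding a sentinel value to the emulator as if it were a real offset.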
Example
This example will be using the sc1.bin that is provided in the libscizzle distribution but I will only show the first 48 bytes for brevity:
sc = "eb6e33c0648b403085c0780d568b400c"
sc += "8b701cad8b40085ec38b403483c07c8b"
sc += "403cc3608b6c24248b453c8b7c057803"
....
sc = sc.decode('hex')
detector = pyDetector(sc)
offset = detector.detectShellcode()
Here the offset returned will be 0x70, while running the same file through pylibemu yields an offset of 0x0. Both are technically correct, and the reason becomes clear after looking at the disassembly of the first 2 bytes:
0x00000000 eb 6e jmp 0x70
It’s clear that they ultimately agree on where the shellcode is but differ on where it starts. Further analysis of the disassembly can be done with a disassembler such as distorm, or you can use radare2 to inspect the instructions and trace the shellcode manually in a more interactive fashion. Radare2 can also output a Graphviz DOT file that illustrates the call structure.
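The relationship between the two offsets can also be checked by hand. The bytes eb 6e encode an x86 short jump: opcode 0xEB followed by a signed 8-bit displacement that is counted from the end of the 2-byte instruction, so the target is 0x0 + 2 + 0x6e = 0x70. The helper below is a small illustrative sketch (`short_jmp_target` is a hypothetical name, not part of either library) that performs the same calculation:

```python
import binascii

def short_jmp_target(code, offset=0):
    """Return the target of an x86 short jump (opcode 0xEB) at `offset`,
    or None if the byte at that position is not 0xEB."""
    data = bytearray(code)  # bytearray indexes as ints on Python 2 and 3
    if data[offset] != 0xEB:
        return None
    rel = data[offset + 1]
    if rel >= 0x80:  # rel8 is a signed 8-bit displacement
        rel -= 0x100
    # The displacement is relative to the end of the 2-byte instruction.
    return offset + 2 + rel

first_bytes = binascii.unhexlify("eb6e33c0648b4030")
print(hex(short_jmp_target(first_bytes)))  # 0x70
```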
Conclusion
I am currently building a custom JavaScript deobfuscator that targets strings deemed suspicious during the deobfuscation process, in an attempt to find shellcode embedded in the code. Once deobfuscation is done, the plan is to run the decoded strings through pylibscizzle and pylibemu to verify that the data has been extracted correctly. I plan to use both because I have had instances where they detect the shellcode at different offsets, as well as instances where one detects shellcode and the other does not. This will allow us to build a large database of shellcodes that can be further analyzed for interesting techniques, and to more easily identify which of the files that we collect through our harvesting procedures are malicious.
