TippingPoint Digital Vaccine Laboratories

Malicious Content Harvesting with Python, WebKit, and Scapy

 


Harvesting malicious files and websites isn’t a difficult task these days, when sites like MalwareDomainList, jsunpack.jeek.org, etc. let you pull a list of URLs that have been reported as malicious or suspicious. What is more difficult, and most important to us, is obtaining a complete picture of the actions that a malicious site is trying to perform. Tools like cURL, wget, etc. only retrieve an unrendered version of the page, so the exploit code will be missed if it lives in an externally sourced file. Libraries such as BeautifulSoup make it simpler to find all external sources in a page and retrieve them manually, but if a malicious site limits the number of new connections per host, or if these external sources in turn load even more external sources, you end up going down a rabbit hole you may never emerge from. Selenium is also a viable option in some circumstances, especially if you add some custom scripts to auto-dump data from Firebug. In the end I settled on WebKit and Python because I felt they gave me what I needed, with some extra flexibility.

 

WebKit to the rescue

The Python Qt4 WebKit modules give a true rendered version of the webpage because they wrap a full implementation of the WebKit engine. You can also enable and disable the Flash and Java plugins if you want to see what any malicious plugins may be attempting to do. The major drawbacks I found with WebKit are that it can be difficult to locate the data cache and that there is no way to directly access the request and response headers. Solving that issue was fairly simple: tcpdump and tshark. Calling tcpdump / tshark in the background from Python is not difficult, and the Malware Analyst's Cookbook (great book!) provides some Python modules that do that. The relevant module is named analysis.py and it, along with the rest of the code from the book, is available on their Google Code site. Now we are able to obtain a complete picture of the browser's actions and can even leverage QtWebKit to take a screenshot of the rendered site. Unfortunately, getting all of the data out of the pcap proved to be a not-so-simple problem.
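
The background capture described above could be sketched roughly as follows. This is not the analysis.py module from the Malware Analyst's Cookbook, whose interface is not reproduced here; the function names, the interface name and the pcap path are assumptions made for illustration, and only standard tcpdump options are used.

```python
import subprocess


def build_tcpdump_cmd(iface, pcap_path, bpf_filter="tcp"):
    """Build a tcpdump command line that writes full packets to pcap_path."""
    return [
        "tcpdump",
        "-i", iface,      # capture interface (e.g. "eth0"; an assumption)
        "-n",             # do not resolve addresses to names
        "-s", "0",        # snaplen 0: capture whole packets, not just headers
        "-U",             # flush each packet to the file as it is captured
        "-w", pcap_path,  # write raw packets to this file
        bpf_filter,       # BPF capture filter expression
    ]


def start_capture(iface, pcap_path):
    """Start tcpdump in the background; this usually requires root privileges."""
    return subprocess.Popen(
        build_tcpdump_cmd(iface, pcap_path),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def stop_capture(proc, wait_seconds=5):
    """Send SIGTERM so tcpdump closes the pcap file, then reap the process."""
    proc.terminate()
    try:
        proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
```

The idea would be to call start_capture() just before the page load and stop_capture() once the load has finished or timed out, so that each URL gets its own pcap.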

Below is a piece of sample code used to enable plugins, JavaScript and images in the Python Qt4-based WebKit browser. There are many variations on the QtWebKit code for Python that can be found via Google searches. I used some of what I found as the basis for the request piece of the harvester and added on plugin options.

 
 
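    # Imports assume the PyQt4 bindings.
    from PyQt4.QtCore import QUrl, SIGNAL
    from PyQt4.QtNetwork import QNetworkRequest
    from PyQt4.QtWebKit import QWebPage, QWebSettings

    # A QApplication must already exist, and onLoadFinished(ok) is the
    # callback defined elsewhere in the harvester.
    webpage = QWebPage()
    request = QNetworkRequest()

    # Load images and allow Java, JavaScript and plugins (e.g. Flash) to run.
    websettings = webpage.settings()
    websettings.setAttribute(QWebSettings.AutoLoadImages, True)
    websettings.setAttribute(QWebSettings.JavaEnabled, True)
    websettings.setAttribute(QWebSettings.JavascriptEnabled, True)
    websettings.setAttribute(QWebSettings.PluginsEnabled, True)

    # Call onLoadFinished when loading completes, then start the load.
    webpage.connect(webpage, SIGNAL("loadFinished(bool)"), onLoadFinished)
    webpage.mainFrame().load(QUrl(url))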

For reasons I have not been able to fully understand, this code does not always exit cleanly when crawling some sites (I also tried connecting signals to functions that manually calculated timeouts internally). To prevent this from killing automated processing of a large number of URLs, the WebKit code was placed in a separate script that is launched with subprocess.Popen by a parent process, which also manages the timeouts on the script.
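
A minimal sketch of that kind of wrapper is shown here, written for Python 3 (the original harvester predates it) and not taken from the harvester itself. The function name and the timeout value are assumptions.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its exit code, or None if it had to be killed."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # the child hung, so stop it forcibly
        proc.wait()   # reap the process so no zombie is left behind
        return None
```

A crawl loop could then call something like run_with_timeout([sys.executable, "render_page.py", url], 60) for each URL, where render_page.py is a hypothetical name for the separate WebKit script, and simply move on to the next URL when None comes back.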

Carving Files using Scapy

I initially attempted to use tcpxtract to pull data from the pcap, but it does not support all the filetypes I was looking for and does not give any information that helps me correlate the extracted data back to the request URLs. Since I couldn’t correlate data to a request, I was also unable to extract the HTTP request/response headers. Scapy fit my needs because of familiarity with it and the recent addition of the "sessions()" function, which returns a dictionary with the streams separated out. This makes it easy to go stream by stream to find all the HTTP requests, match them up with the corresponding HTTP responses, and then reassemble the files. HTTP request pipelining presented another challenge, but using known HTTP verbs and some regex magic it is not difficult to find all the requests in one stream and then do the same for the responses.
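
The "known verbs plus regex" idea might look something like the sketch below. The patterns and function names are my own assumptions rather than the harvester's code, and because the patterns are not anchored, a body that happens to contain text shaped like a request or status line would produce a false match.

```python
import re

# Request line: a known HTTP verb, a target, and the HTTP version.
REQUEST_LINE = re.compile(
    rb"(GET|POST|HEAD|PUT|DELETE|OPTIONS|TRACE|CONNECT) (\S+) HTTP/1\.[01]\r\n"
)
# Status line that opens each response.
STATUS_LINE = re.compile(rb"HTTP/1\.[01] (\d{3})[^\r\n]*\r\n")


def find_requests(stream):
    """Return (offset, verb, target) for each request line in a byte stream."""
    return [
        (m.start(), m.group(1).decode("ascii"), m.group(2).decode("latin-1"))
        for m in REQUEST_LINE.finditer(stream)
    ]


def find_responses(stream):
    """Return (offset, status_code) for each status line in a byte stream."""
    return [(m.start(), int(m.group(1))) for m in STATUS_LINE.finditer(stream)]
```

The offsets mark where each pipelined message starts, so the bytes between two consecutive offsets can be treated as one request or one response (headers plus body).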

Below is a piece of sample code used to iterate through the sessions list and find the response session for every request session. By default Scapy keys each session with a string of the form “TCP <Source IP>:<Port> > <Dest IP>:<Port>”, so flipping the source and destination portions gives the key of the response stream. Working from the requests made by our WebKit browser, we can then determine whether we have found every HTTP response to our requests.

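        # sess is the dictionary returned by Scapy's sessions()
        for s in sess.keys():
            # Requests come from our host; ":" keeps 10.0.0.1 from matching 10.0.0.10
            if s.startswith("TCP %s:" % self.ip):
                addrs = s.split(" > ")
                # Swap the endpoints ([4:] drops "TCP ") to get the response key
                resp = "TCP " + addrs[1] + " > " + addrs[0][4:]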
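
As a self-contained illustration of the same pairing, the sketch below works on plain key strings of the form "TCP src:port > dst:port", so it runs without Scapy. The function names are assumptions, not part of the harvester.

```python
def response_key(request_key):
    """Swap the endpoints of a 'TCP src:port > dst:port' session key."""
    proto, endpoints = request_key.split(" ", 1)
    src, dst = endpoints.split(" > ")
    return "%s %s > %s" % (proto, dst, src)


def pair_sessions(session_keys, client_ip):
    """Map each client-initiated session key to its response key (or None)."""
    keys = set(session_keys)
    pairs = {}
    for key in keys:
        # The trailing ":" keeps 10.0.0.5 from also matching 10.0.0.50.
        if key.startswith("TCP %s:" % client_ip):
            resp = response_key(key)
            pairs[key] = resp if resp in keys else None
    return pairs
```

With Scapy, the keys would come from something like rdpcap(path).sessions().keys(); any request key that maps to None has no captured response stream.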

 

Scapy does not automatically decompress zlib deflate or gzip compressed streams, so Python’s zlib and gzip modules are the best way to decompress any compressed HTTP streams. I chose the easy path and attempt to zlib-decompress and gzip-decompress every stream instead of parsing the headers and checking for a compression method.
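
A rough sketch of that "try everything" approach is below, written for Python 3. The function name is an assumption, and the raw-deflate attempt is an extra I added because some servers send deflate without the zlib header; it is not something the text above mentions.

```python
import gzip
import zlib


def try_decompress(body):
    """Try gzip, zlib-wrapped deflate, then raw deflate; else return body as-is."""
    attempts = (
        gzip.decompress,                                       # gzip container
        zlib.decompress,                                       # deflate with zlib header
        lambda data: zlib.decompress(data, -zlib.MAX_WBITS),   # raw deflate
    )
    for attempt in attempts:
        try:
            return attempt(body)
        except (OSError, EOFError, zlib.error):
            continue   # not this format (or truncated); try the next one
    return body        # nothing worked, so assume the body was not compressed
```

The trade-off is simplicity over precision: a truncated or corrupted compressed body comes back unchanged rather than raising, so the caller cannot tell it apart from a body that was never compressed.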

Now that all of the requests are collected, we can insert this data into a database schema that will give a better picture of how the behavior of malicious content from sources such as exploit kits evolves as new versions are released.

Future Enhancements

The dangers of using a full rendering engine with real Flash and Java support are not lost on us, especially with WebKit’s increasing popularity. We plan to offload the pcap carving to a “safe” system and add a set of VM controls to manage VM state and reset to clean states. Setting up our harvesting in this manner allows us to use Pyro or a similar RPC module to communicate with a large number of harvesters and collect the data back to a centralized processing server.

Another improvement is checking for TCP retransmits and removing them from the response stream. My current method collects all of the payloads in an HTTP response, concatenates them, and then decompresses the result, so a retransmit within a compressed stream leaves us unable to decompress it.
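
One possible way to drop retransmits before concatenating is to order the segments by TCP sequence number and skip any bytes already seen. The sketch below is a simplified illustration under my own assumptions, not the planned implementation: it takes (seq, payload) tuples for one direction of one stream, ignores sequence-number wraparound, and does not detect missing segments.

```python
def reassemble(segments):
    """Join (seq, payload) TCP segments in order, skipping retransmitted data."""
    data = bytearray()
    next_seq = None
    for seq, payload in sorted(segments, key=lambda item: item[0]):
        if not payload:
            continue                            # pure ACKs carry no data
        if next_seq is None:
            next_seq = seq                      # first data segment sets the base
        if seq + len(payload) <= next_seq:
            continue                            # full retransmit: already have it
        if seq < next_seq:
            payload = payload[next_seq - seq:]  # partial overlap: keep the new tail
        data.extend(payload)
        next_seq = max(seq, next_seq) + len(payload)
    return bytes(data)
```

With Scapy, the tuples could be built from each packet in a session as (pkt[TCP].seq, bytes(pkt[TCP].payload)), and the reassembled bytes then passed on to the decompression step.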

Published On: 2011-11-28 09:11:48

Comments

  1. Christian @xntrik Frichot commented on 2011-11-28 @ 17:18

    Probably doesn't exactly fit your model, but we've had some good luck with extracting HTTP session information out of tcpdump files using Chaosreader. The only downside may be that it's a Perl script (with no Python ports as far as I know) and it hasn't been updated since 2004. Perhaps another thing to look into?

  2. Simon Edwards commented on 2011-11-29 @ 06:43

    Sent the following as an email but it bounced...

    "First of all, excellent work. Secondly, I don’t know if you’ve tried it but we’ve had some success extracting content from pcaps using a combination of tcpreplay and foremost. Maybe you might find this useful: It splits the types of content out into different folders so if you only care about PE files you’ll get them in one place. Similarly with Zip files, JPEGs and so on.

    Apologies if this is not news to you :)

    Cheers,
    Simon

