Harvesting malicious files and websites isn’t a difficult task these days when you have sites like MalwareDomainList, jsunpack.jeek.org, etc. that allow pulling a list of URLs that have been reported as malicious or suspicious. What is more difficult, and what matters most to us, is obtaining a complete picture of the actions a malicious site is trying to perform. Tools like cURL, wget, etc. only retrieve an unrendered version of the page, so exploit code will be missed if it lives in an externally sourced file. Libraries such as BeautifulSoup make it easy to find all external sources in a page and retrieve them manually, but if a malicious site limits the number of new connections per host, or if those external sources in turn load even more external sources, you end up going down a rabbit hole you may never emerge from. Selenium is also a viable option in some circumstances, especially if you add some custom scripts to auto-dump data from Firebug. In the end I settled on using WebKit and Python because I felt they gave me what I needed and I gained some extra flexibility.
WebKit to the rescue
The Python Qt4 WebKit modules give a true rendered version of the webpage because they expose a full WebKit implementation. You also have the ability to enable and disable Flash and Java plugins if you want to see what any malicious plugins may be attempting to do. The major drawback I found with WebKit is that it can be difficult to locate the data cache and there is no way to directly access the request and response headers. Solving that issue was fairly simple: tcpdump and tshark. Calling tcpdump / tshark in the background from Python is not difficult, and the Malware Analyst's Cookbook (great book!) provides some Python modules that do exactly that. The module is named analysis.py and it, along with the rest of the code from the book, is available on their Google Code site. Now we are able to obtain a complete picture of the browser's actions and even have the opportunity to leverage QtWebKit to take a screenshot of the rendered site. Unfortunately, getting all of the data out of the pcap proved to be a not-so-simple problem.
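As a rough illustration of the rendering side, here is a minimal PyQt4/QtWebKit sketch that loads a URL, toggles plugin support, and saves a screenshot once the page finishes loading. The class name, output filename, and timeout-free flow are placeholders of mine, not the actual harvester code.

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication, QImage, QPainter
from PyQt4.QtWebKit import QWebPage, QWebSettings

class Renderer(QWebPage):
    """Load a URL in a headless WebKit page and dump a screenshot."""
    def __init__(self, url, out_png):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.out_png = out_png
        # Flash/Java plugins can be switched on here to watch malicious plugin behavior
        self.settings().setAttribute(QWebSettings.PluginsEnabled, False)
        self.loadFinished.connect(self._render)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _render(self, ok):
        frame = self.mainFrame()
        self.setViewportSize(frame.contentsSize())
        image = QImage(self.viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        image.save(self.out_png)
        self.app.quit()

if __name__ == "__main__":
    Renderer(sys.argv[1], "rendered.png")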
For reasons I have not been able to fully understand, this code does not always exit cleanly when crawling some sites (I also tried using signals on functions that manually calculated timeouts internally). To prevent this from killing automated processing of a large number of URLs, the WebKit code was placed in a separate script that is launched with subprocess.Popen so the parent process can manage the timeouts on the script.
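The wrapper amounts to something like the following sketch; the script name fetch_page.py and the one-minute timeout are placeholders, not the actual values.

import subprocess
import time

def run_harvester(url, timeout=60):
    # Launch the WebKit crawl in its own process so a hung renderer
    # cannot stall the whole batch
    proc = subprocess.Popen(["python", "fetch_page.py", url])
    deadline = time.time() + timeout
    while proc.poll() is None:
        if time.time() > deadline:
            proc.kill()   # give up on this URL and move on to the next
            return None
        time.sleep(1)
    return proc.returncode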
Carving Files using Scapy
I initially attempted to use tcpxtract to pull data from the pcap, but it does not support all the filetypes I was looking for and does not give any information that helps me correlate the extracted data back to the request URLs. Since I couldn't correlate data to a request, I was also not able to extract the HTTP request/response headers. Scapy fit my needs because of familiarity and the recent addition of the sessions() function, which returns a dictionary with the streams separated out. This makes it easy to go stream by stream to find all the HTTP requests, match them up with the corresponding HTTP responses, and then reassemble the files. HTTP request pipelining presented another challenge, but using known HTTP verbs and some regex magic it is not difficult to find all the requests in one stream and then do the same for the responses.
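A rough sketch of that flow is below; the pcap filename and the verb regex are illustrative rather than the exact ones in use.

import re
from scapy.all import rdpcap, Raw

# Illustrative regex: matches the start line of an HTTP request for common verbs
REQ_RE = re.compile(r"(?:GET|POST|HEAD|PUT|DELETE|OPTIONS) \S+ HTTP/1\.[01]\r\n")

packets = rdpcap("capture.pcap")   # the pcap written by tcpdump/tshark
sess = packets.sessions()          # {"TCP <src>:<port> > <dst>:<port>": PacketList, ...}

for key, stream in sess.items():
    # Reassemble this direction's raw TCP payload in capture order
    payload = "".join(str(p[Raw].load) for p in stream if p.haslayer(Raw))
    # Find where each request starts so pipelined requests can be split apart
    starts = [m.start() for m in REQ_RE.finditer(payload)]
    requests = [payload[a:b] for a, b in zip(starts, starts[1:] + [len(payload)])]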
Below is a piece of sample code used to iterate through the sessions dictionary and find the response stream for every request stream. By default Scapy uses a key string of the form “TCP <Source IP>:<Port> > <Dest IP>:<Port>”, so flipping the source and destination portions gives the key of the response stream. Operating off of the requests from our WebKit browser, we can then determine whether we have found an HTTP response for every request.
for s in sess.keys():
    if s.startswith("TCP %s" % self.ip):
        # s looks like "TCP <our IP>:<port> > <server IP>:<port>"
        addrs = s.split(" > ")
        # Swap source and destination to build the key of the response stream
        resp = "TCP " + addrs[1] + " > " + addrs[0][4:]
Scapy does not auto-decompress zlib deflate or gzip compressed streams, so Python's zlib and gzip modules are the best way to decompress any compressed HTTP streams. I chose to take the easy path to decompression and attempt both zlib and gzip decompression on every stream instead of parsing the headers and checking the declared compression method.
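In practice that "try everything" approach looks roughly like the sketch below (not the exact code, and written for Python 2 to match the rest of the tooling).

import gzip
import zlib
from StringIO import StringIO

def try_decompress(body):
    # Attempt gzip first, then raw deflate, then zlib-wrapped deflate;
    # fall back to the original bytes if nothing works
    try:
        return gzip.GzipFile(fileobj=StringIO(body)).read()
    except IOError:
        pass
    for wbits in (-zlib.MAX_WBITS, zlib.MAX_WBITS):
        try:
            return zlib.decompress(body, wbits)
        except zlib.error:
            pass
    return body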
Now that all of the requests are collected, we can insert this data into a database schema that allows a better picture of how malicious content from sources such as exploit kits evolves as new versions are released.
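A minimal sketch of one possible layout, pairing a requests table with a responses table keyed back to it; the table and column names here are illustrative, not the actual schema.

import sqlite3

conn = sqlite3.connect("harvest.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS requests (
    id         INTEGER PRIMARY KEY,
    crawl_time TEXT,
    url        TEXT,
    headers    TEXT
);
CREATE TABLE IF NOT EXISTS responses (
    id         INTEGER PRIMARY KEY,
    request_id INTEGER REFERENCES requests(id),
    status     INTEGER,
    headers    TEXT,
    body_sha1  TEXT,
    body_path  TEXT
);
""")
conn.commit()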
The dangers of using a full rendering engine with real Flash and Java support are not lost on us, especially with WebKit’s increasing popularity. We plan to offload the pcap carving to a “safe” system and add a set of VM controls to manage VM state and reset to clean states. Setting up our harvesting in this manner allows us to use Pyro or a similar RPC module to communicate with a large number of harvesters and collect the data back to a centralized location.
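As an illustration of the RPC side, a harvester node could expose its crawl entry point over Pyro roughly like this (Pyro4 syntax; the class and method names are placeholders, not an existing implementation).

import Pyro4

@Pyro4.expose
class Harvester(object):
    def fetch(self, url):
        # Run the WebKit crawl plus pcap carving for one URL and
        # return the results to the central collector
        return run_harvester(url)

daemon = Pyro4.Daemon(host="0.0.0.0")
uri = daemon.register(Harvester(), objectId="harvester")
print uri   # the collector connects with Pyro4.Proxy(uri).fetch(url)
daemon.requestLoop()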
Another improvement is checking for TCP retransmits and removing them from the response stream. If a retransmit occurs within a compressed stream, the current method of collecting all of the payloads in an HTTP response, concatenating them together, and decompressing the result will fail.
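A simple first pass at that filter might drop segments whose sequence number has already been seen before reassembly, something along these lines.

from scapy.all import TCP

def drop_retransmits(packets):
    # Keep only the first copy of each TCP segment, keyed by sequence
    # number and payload length, before concatenating the payloads
    seen = set()
    unique = []
    for pkt in packets:
        if not pkt.haslayer(TCP):
            continue
        key = (pkt[TCP].seq, len(pkt[TCP].payload))
        if key not in seen:
            seen.add(key)
            unique.append(pkt)
    return unique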