Bubble Gum and Duct Tape

Wireshark, you make me miserable.

I've been working on finding some specific DNS results on my network.

Here's the assignment:
1. Locate DNS queries on a sniffer
2. Find "Suspicious IPs" or "Suspicious" DNS replies (confidential)
3. Report back.

Easy enough, right?

Instead of going too low level, I have the luxury of using tshark. So why not?

First, let's look at a DNS reply.

....
<field name="" show="Additional records" size="16" pos="165" value="c075000100010000e76f00044348150b">
<field name="" show="ns2.panthercdn.com: type A, class IN, addr 67.72.21.11" size="16" pos="165" value="c075000100010000e76f00044348150b">
<field name="dns.resp.name" showname="Name: ns2.panthercdn.com" size="2" pos="165" show="ns2.panthercdn.com" value="c075">
<field name="dns.resp.type" showname="Type: A (Host address)" size="2" pos="167" show="0x0001" value="0001">
<field name="dns.resp.class" showname="Class: IN (0x0001)" size="2" pos="169" show="0x0001" value="0001">
<field name="dns.resp.ttl" showname="Time to live: 16 hours, 27 minutes, 27 seconds" size="4" pos="171" show="59247" value="0000e76f">
<field name="dns.resp.len" showname="Data length: 4" size="2" pos="175" show="4" value="0004">
<field name="" show="Addr: 67.72.21.11" size="4" pos="177" value="4348150b">
</field>
.....
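
For reference, that snippet is tshark's PDML output. You can dump it yourself with something along these lines (dns.pcap standing in for whatever capture you're reading):

~>tshark -r dns.pcap -R dns.flags.response -T pdml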


The problem is that the field for the resolved address has no name! So if you wanted, you could return data with


~>tshark -i eth0 -R dns -T fields -e dns.resp.name
ns2.panthercdn.com


but you can never simply query for the resolved IP! If you'd like a full reference of what you can and can't query, check out http://www.wireshark.org/docs/dfref/d/dns.html.

At least I'm not the only one who thinks this is a feature that should be added. According to the Wireshark wish list:
"PDML output has several fields that do not have a name attribute assigned to them. This creates difficulties in parsing and offline analysis of the XML output from Wireshark. e.g. IP address field in a DNS response does not include the name attribute. (Fields added to the display using proto_tree_add_text() don't have a name related to them. This wish requires the abolishment of this function. Or will an empty name/showname attribute suffice? - Jaap Keuter) (Alternatively, we could just suppress unnamed fields from the PDML output, and turn unnamed fields to named fields as people complain or submit patches. A sweeping change of all unnamed fields would probably only happen over a long period of time. - Guy Harris) "


It looks like I may not get my wish any time soon, so the only solution is a workaround. (I'm getting tired of these kinds of workarounds for simple annoyances.)

I'm just going to use Ruby combined with tshark inline.

tshark -r Desktop/dns.pcap -R dns.flags.response | ruby -e 'while line=STDIN.gets; unless line.scan("response").length.eql?(0); temp=line.split(" "); puts "#{temp[4]} #{temp[10]} #{unless temp[12..temp.length].nil?; temp[12..temp.length].join(" "); end}"; end; end'

This returns

"Caller IP -> Dest Site (or IP) -> resolved IPs (maybe multiples) "

I'm sure that there are sed, awk or bash gurus out there who could write it slimmer, but this seems to work for the time being. Essentially, Ruby is just being abused as a text parser. I'm thinking about dropping tshark altogether and writing the sniffer using pcaplet. However, that doesn't resolve packets up to the application layer, so we're stuck text-parsing the udp.data payload either way. Looks like we're not getting an elegant solution this time.
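
If the one-liner is hard on the eyes, here's roughly the same parsing written out as a standalone script. This is just a readable sketch of the same logic, not a different approach; the column indexes are tied to tshark's default summary output, so they may need adjusting for other versions, and the parse_dns.rb filename is only an example.

#!/usr/bin/env ruby
# Read tshark's default summary lines on STDIN, keep the DNS responses,
# and print "caller IP, queried name, resolved answers".
# Usage (assuming the file is saved as parse_dns.rb):
#   tshark -r dns.pcap -R dns.flags.response | ruby parse_dns.rb
STDIN.each_line do |line|
  next unless line.include?("response")    # only keep the DNS response lines

  fields    = line.split
  caller_ip = fields[4]                    # who asked
  site      = fields[10]                   # what they asked for
  resolved  = fields[12..-1] || []         # everything after: the resolved IPs

  puts "#{caller_ip} #{site} #{resolved.join(' ')}"
end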

Like I said, bubble gum and duct tape man.

Posted at 7:21 PM on Saturday, October 3, 2009 by nick

Introducing Lightbulb

I've turned my script into an open source project! I'd like to formally introduce lightbulb. Lightbulb is a utility that reads in proxy log data and uses entropy-based analysis to identify beacons associated with malware. You can tailor it to do just about anything you'd like, and it may provide network admins with a deeper understanding of their network and the activity happening on it.

For the first official release of lightbulb I wanted to make some significant improvements to the code and the reporting. There are more changes to come, but here's what you'll find new in this version:

  • Reporting back of the actual beacon time intervals
  • Outliers listed according to the standard deviation of all data
It's a nice improvement, and the project has come a long way. The next version has a scheduled change to the backbone to boost performance, as well as a look at different learning algorithms. Stay tuned for more changes. All updates on the project will now be hosted at Google Code as well as here. I'll try to keep the changelog up to date. :)

http://code.google.com/p/lightbulb

Posted at 11:17 AM on Saturday, April 25, 2009 by nick

On the Horizon

Just a quick check in.

Since I last wrote about my entropy beacon script, I've received a flurry of suggestions and ideas on how to expand and build upon this research. Those have been slowly making their way into my main codebase, and some major backend work has begun.

Along with that, I've got two new things coming down the pipeline. One has to do with rebuilding binaries off the wire (it seems there aren't many good programs for this), and the other has to do with hunting malware via user agent strings. Would you believe that on a daily basis I've seen more than 5,000 unique user agent strings? It's unbelievable.

Also, I'm starting to port a good chunk of my work to Google Code so that I don't have to mail out updates to those who've been using my scripts.

Thanks again for all the suggestions. You'll be hearing from me soon.

Posted at 4:38 PM on Saturday, April 4, 2009 by nick

Using Entropy to Locate Automated Behavior

In my last post, I gave sample code to locate a hard-coded beacon using proxy logs as a source of information. I found some results on the small scale, but nothing that blew me away. To be honest, the results were spotty at best.

Well, it's been a while (just about 60 days) since I was last down that route. I've been working on a better way to determine anomalous behavior. I found a couple dead ends and a couple of promising avenues.

Let's start off with a couple of ideas.
1. Human behavior on the Internet is unpredictable at best.
2. Machine behavior is typically not.
3. If randomness is added in, it's typically a 'true' random distribution.
4. Humans are bad at generating random numbers.

Alright, so let's start with the shortcomings of the last script.
1. It couldn't detect beacons with random time signatures.
2. It only compared other values against the first beacon in the list.
3. The tolerance was letting a lot of false positives through the door.
4. It took FOREVER to run.

Using entropy to look for anomalous behavior is not a new idea. Shannon's entropy was introduced in 1948, and some of its first novel uses involved detecting ciphertext and telling encryption algorithms apart. There are tons of applications across all the sciences.

So what does it do? Roughly speaking, it'll tell you how predictable a set of data is. If your set of data consisted of just one number, its entropy would be zero. However, if it were completely random, its entropy would be greater than 0 (how much greater depends on the size of the set). Let's take a look at the mathematics of it, and then how to implement it.

First off, here's what the formula looks like:

H(X) = - Σ [ p(x) * (Log(p(x)) / Log(2)) ]

where the sum runs over every distinct value x in the data set, and p(x) is the probability of that value.

Let's use this against a data set. [50,50,50,50,50,50....50]

First, we need to calculate the probability mass function for all values in the list. For this example it's trivial: 50 appears in every entry, so the probability that 50 will occur is 1/1.
We then take this and compute
-[(1/1)*(Log(1)/Log(2))] (the division by Log(2) is just the change-of-base formula, turning a base-10 Log into Log base 2. Algebra FTW.)
This equals 0. That tells us how 'random' our list is. For this example, it's easy to see that it's not very random at all.

So let's take a more 'random' entry: [30,40,30,40,30,40,30,40,30,40.....]
We'll put it to the test by plugging it into our formula.

30 and 40 are the only two numbers that appear, so the probability that each one happens is 1/2.

- [(1/2)*(Log(1/2)/Log(2)) + (1/2)*(Log(1/2)/Log(2))] = 1

So this is a little more interesting......

One more, just entertain me: [15,65,8,7,99,35,200,3201,5,5]

This last one looks pretty random to me. Let's see what Shannon's entropy has to say about it. First, let's set up our pmf.

15 = 1/10
65 = 1/10
8 = 1/10
7 = 1/10
99 = 1/10
35 = 1/10
200 = 1/10
3201 = 1/10
5 = 1/5

Putting this into Shannon's formula looks a little like this:
- [(1/10)*(Log(1/10)/Log(2))*8 + (1/5)*(Log(1/5)/Log(2))] = 3.12193

At this point it should be evident that the more random the number list is, the higher the level of entropy.
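
And since we'll need to do this in code anyway, here's a quick Ruby sketch of the same calculation, run against the three example lists from above. It's just the formula translated directly; nothing fancy.

# Shannon entropy of a list: figure out the probability of each distinct
# value, then sum -p * (Log(p)/Log(2)) over them, just like the hand
# calculations above.
def entropy(values)
  counts = Hash.new(0)
  values.each { |v| counts[v] += 1 }

  counts.values.inject(0.0) do |sum, count|
    p = count.to_f / values.length
    sum - p * (Math.log(p) / Math.log(2))    # change of base to get Log base 2
  end
end

puts entropy([50] * 10)                                # => 0.0
puts entropy([30, 40] * 5)                             # => 1.0
puts entropy([15, 65, 8, 7, 99, 35, 200, 3201, 5, 5])  # => 3.1219...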

So how does this tie into finding beacons or automated behavior?
Building on the last entry, but without being so rigid about the time intervals, this can give us a good idea of the behavior of the user hitting a particular site. Essentially, this breaks down into a few steps.
First, build a hash of arrays of all IPs and every site that they've talked to:
{192.168.1.100 : evil.com} => [12:05:00, 12:10:00, 12:15:00, 12:20:00...]

Then take the time differences (in seconds)
{192.168.1.100 : evil.com} => [300,300,300,300....]

Next, calculate the entropy for each array and store
{192.168.1.100 : evil.com} => 0

...and that's all! Sort our lists and report on our findings!
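
Strung together, those steps might look something like the sketch below. This isn't the script attached at the end of the post, just a minimal illustration; it assumes the proxy logs have already been parsed down to [client IP, site, timestamp] records (that part depends entirely on your proxy's log format), and it reuses the entropy method from the earlier sketch.

require 'time'

# Example records in [client_ip, site, timestamp] form. In practice these come
# from your proxy logs; the values here are only for illustration.
records = [
  ["192.168.1.100", "evil.com", "2009-02-17 12:05:00"],
  ["192.168.1.100", "evil.com", "2009-02-17 12:10:00"],
  ["192.168.1.100", "evil.com", "2009-02-17 12:15:00"],
  ["192.168.1.100", "evil.com", "2009-02-17 12:20:00"],
]

# Step 1: hash of arrays - every {ip : site} pair maps to its list of hit times.
hits = Hash.new { |h, k| h[k] = [] }
records.each { |ip, site, ts| hits["#{ip} : #{site}"] << Time.parse(ts) }

# Steps 2 and 3: turn each list of times into the deltas between hits (in
# seconds), then score each delta list with the entropy method from above.
scores = {}
hits.each do |pair, times|
  deltas = times.each_cons(2).map { |a, b| (b - a).to_i }
  scores[pair] = entropy(deltas) unless deltas.empty?
end

# Step 4: sort with the most predictable (lowest entropy) pairs first and report.
scores.sort_by { |_, e| e }.each { |pair, e| puts "#{pair} => #{e}" }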

Seems easy enough, right? There are still some shortcomings to this method, though. Let's discuss.

Network latency - Slowness in the network can mess with the beacon times and make them appear not to be perfect. The hope is that they will be "close enough" for enough common values to lower the entropy. Another way around this is to build in a tolerance to allow for a small amount of flexibility. As I found out in the last entry though, the tolerance will allow for a lot of false positives to come through the door.

Flagging regular user activity - It's not uncommon for a user to click a link on a site every second. To a script, this looks like 'predictable' behavior. A way to avoid this problem is to set a minimum tolerance for the beacon interval; this may not eliminate false positives entirely, but it will cut them down a significant amount. The idea is that the user will visit the site again later in a less predictable fashion, which results in a higher entropy.
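
Both of those mitigations boil down to massaging the delta list before it goes into the entropy calculation, something like the sketch below. The 30-second bucket and the 60-second floor are placeholder numbers, not values I've tuned.

# Clean up a list of time deltas (in seconds) before scoring it:
#   - drop deltas under a minimum interval, so rapid-fire clicking by a real
#     user doesn't register as 'predictable' behavior
#   - round what's left to the nearest bucket, so intervals that are close
#     enough despite network latency collapse to the same value
# The defaults are placeholders, not tuned numbers.
def normalize_deltas(deltas, bucket = 30, minimum = 60)
  deltas.reject { |d| d < minimum }
        .map    { |d| (d.to_f / bucket).round * bucket }
end

normalize_deltas([299, 301, 300, 304, 2])   # => [300, 300, 300, 300]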

What about the results?
Simply put, it works. I've used this to locate beacons or just to better understand the network that I work in.

For now, the reporting capabilities are somewhat weak and could use a little improvement. More on that later. One idea is to look at things like the standard deviation of the list and only list the outliers, flagging the rest as regular traffic. I guess that only time and testing will tell.
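
As a rough idea of what that could look like (a sketch of the idea, not something that's in the attached script yet): take the entropy score for every {ip : site} pair, compute the mean and standard deviation of all the scores, and only report the pairs sitting well below the mean.

# Report only the {ip : site} pairs whose entropy falls more than a couple of
# standard deviations below the mean score, flagging everything else as
# regular traffic. 'scores' is the pair => entropy hash from the pipeline
# sketch above, and the 2-sigma cutoff is just a starting point.
def outliers(scores, sigmas = 2)
  values = scores.values
  mean   = values.inject(:+) / values.length
  stddev = Math.sqrt(values.inject(0.0) { |s, v| s + (v - mean) ** 2 } / values.length)

  scores.select { |_, e| e < mean - sigmas * stddev }
end

outliers(scores).each { |pair, e| puts "possible beacon: #{pair} (entropy #{e})" }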

The script that I'm attaching is written in Ruby for ease of reading. My future plans are to rewrite this in something faster like OCaml, or even go as low as C or C++. Speed and memory are difficult to manage when you're dealing with terabytes of information, regardless of what language you're writing in; in these sorts of scenarios, any small difference may go a long way. The results have uncovered a lot of quirks in the network and have already allowed me to profile some machines. It also gives me the ability to break up their activity into user-generated vs. machine-generated traffic, which, to an incident responder, is huge.

Until next time.

Posted at 10:27 PM on Tuesday, February 17, 2009 by nick