Using Entropy to Locate Automated Behavior

In my last post, I gave sample code to locate a hard-coded beacon using proxy logs as a source of information. I found some results on the small scale, but nothing that blew me away. To be honest, the results were spotty at best.

Well, it's been a while (just about 60 days) since I last went down that route. I've been working on a better way to detect anomalous behavior, and I've found a couple of dead ends and a couple of promising avenues.

Let's start off with a couple of ideas.
1. Human behavior on the Internet is unpredictable at best.
2. Machine behavior is typically not.
3. If randomness is added in, it's typically a 'true' random distribution.
4. Humans are bad at generating random numbers.

Alright, so let's start with the shortcomings of the last script.
1. It couldn't detect beacons with random time signatures.
2. It only compared other values against the first beacon in the list.
3. The tolerance was letting a lot of false positives through the door.
4. It took FOREVER to run.

Using entropy to look for anomalous behavior is not a new idea; Shannon introduced his entropy measure back in 1948. Some of its first novel uses involved detecting cipher text and telling different encryption algorithms apart, and there are tons of applications across all the sciences.

So what does it do? Roughly speaking, it'll tell you how predictable a set of data is. If your set of data consisted of just one number repeated over and over, its entropy would be zero. However, if it were completely random, its entropy would be greater than 0 (how much greater depends on the size of the set). Let's take a look at the mathematics of it, and then how to implement it.

First off, here's what it looks like:

H(X) = - Σ [ p(x) * (Log(p(x)) / Log(2)) ]

where the sum runs over every distinct value x in the set, and p(x) is the probability that x occurs.
Let's use this against a data set. [50,50,50,50,50,50....50]

First we need to calculate the probability mass function for all values in the list. For this example it's trivial: 50 appears in every entry, so the probability that 50 will occur is 1/1.
We then take this and compute
-[(1/1)*(Log(1)/Log(2))] (the division by Log(2) is the change of base formula, since we're working with Log base 10. Algebra - FTW)
This equals 0. That tells us how 'random' our list is. For this example, it's easy to see that it's not very random at all.

So let's take a more 'random' entry: [30,40,30,40,30,40,30,40,30,40.....]
We'll put it to the test by plugging it into our formula.

30 and 40 are the only 2 numbers that appear, so the probability that each one occurs is 1/2.

- [(1/2)*(Log(1/2)/Log(2)) + (1/2)*(Log(1/2)/Log(2))] = 1

So this is a little more interesting......

One more, just entertain me: [15,65,8,7,99,35,200,3201,5,5]

This last one looks pretty random to me. Let's see what Shannon's entropy has to say about it. First let's set up our pmf.

15 = 1/10
65 = 1/10
8 = 1/10
7 = 1/10
99 = 1/10
35 = 1/10
200 = 1/10
3201 = 1/10
5 = 1/5 (it appears twice)

Putting this into Shannon's formula looks a little like this
- [(1/10)*(Log(1/10)/Log(2))*8 + (1/5)*(Log(1/5)/Log(2))] = 3.12193

At this point it should be evident that the more random the number list is, the higher the level of entropy.
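
For reference, here's a minimal Ruby version of that calculation. The method name and layout are just mine, not necessarily what's in the script attached below, but plugging in the three lists above reproduces the same numbers.

def entropy(values)
  counts = Hash.new(0)
  values.each { |v| counts[v] += 1 }
  total = values.length.to_f
  counts.values.inject(0.0) do |sum, count|
    p = count / total
    sum - p * (Math.log(p) / Math.log(2))   # change of base, same as above
  end
end

entropy([50] * 10)                               # => 0.0
entropy([30, 40] * 5)                            # => 1.0
entropy([15, 65, 8, 7, 99, 35, 200, 3201, 5, 5]) # => 3.1219...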

So how does this tie into finding beacons or automated behavior?
Building on the last entry, but without being so rigid about the time entries, this can give us a good idea of the behavior of the user hitting a particular site. Essentially, this breaks down into a few steps (there's a quick Ruby sketch of the whole thing after the steps).
First, build a hash of arrays keyed on each IP and every site it has talked to:
{192.168.1.100 : evil.com} => [12:05:00, 12:10:00, 12:15:00, 12:20:00...]

Then take the time differences (in seconds)
{192.168.1.100 : evil.com} => [300,300,300,300....]

Next, calculate the entropy for each array and store
{192.168.1.100 : evil.com} => 0

...and that's all! Sort our lists and report on our findings!
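
Here's a rough Ruby sketch of those steps. The log parsing is hand-waved (parse_line and the 'proxy.log' path are placeholders for whatever your proxy log format looks like), and it reuses the entropy method from the sketch above.

require 'time'

hits = Hash.new { |h, k| h[k] = [] }

# Step 1: hash of arrays - every {ip : site} pair and the times it was hit
File.foreach('proxy.log') do |line|
  ip, site, timestamp = parse_line(line)     # placeholder parser for your log format
  hits[[ip, site]] << Time.parse(timestamp)
end

# Steps 2 and 3: time differences in seconds, then entropy of those differences
scores = {}
hits.each do |key, times|
  deltas = times.sort.each_cons(2).map { |a, b| (b - a).to_i }
  next if deltas.length < 2                  # not enough hits to say anything useful
  scores[key] = entropy(deltas)
end

# Step 4: lowest entropy first - the most 'predictable' ip/site pairs
scores.sort_by { |_, e| e }.each do |(ip, site), e|
  puts "#{ip} -> #{site} : #{e}"
end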

Seems easy enough right? There are still some shortcomings of this method though. Let's discuss.

Network latency - Slowness in the network can mess with the beacon times and make them appear less than perfect. The hope is that they will be "close enough" often enough for common values to lower the entropy. Another way around this is to build in a tolerance to allow for a small amount of flexibility. As I found out in the last entry though, too generous a tolerance will let a lot of false positives through the door.
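
One way to build in that tolerance (my own rough take, not necessarily how the attached script does it) is to round each time difference into a small bucket before computing the entropy, so that a 300-second beacon that jitters to 298 or 302 still collapses to a single common value:

# Round deltas to the nearest bucket; 5 seconds here is an arbitrary choice
def bucket(deltas, size = 5)
  deltas.map { |d| (d.to_f / size).round * size }
end

bucket([299, 300, 302, 301, 298])   # => [300, 300, 300, 300, 300]

The bigger the bucket, the more tolerance you get, and the more false positives come along for the ride.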

Flagging regular user activity - It may not be uncommon for a user to click a link on a site every second. To a script, this looks like 'predictable' behavior. One way to avoid this problem is to set a minimum tolerance for the beacon interval; this should let fewer false positives through, though it also cuts a significant amount of traffic out of consideration. The idea is that the user will visit the site again later on in a less predictable fashion, resulting in a higher entropy.
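
As a sketch of that minimum tolerance (the 60-second floor is just an example, and checking the median delta is my own interpretation), a filter like this could drop a pair before it ever gets scored:

MIN_INTERVAL = 60   # seconds - an example floor, tune it for your own network

def below_floor?(deltas)
  sorted = deltas.sort
  median = sorted[sorted.length / 2]
  median < MIN_INTERVAL               # typical interval too short to be worth scoring
end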

What about the results?
Simply put, it works. I've used this to locate beacons or just to better understand the network that I work in.

For now, the reporting capabilities are somewhat weak and could use a little improvement. More on that later. One idea is to look at things like the standard deviation of the results and only report the outliers, flagging the rest as regular traffic. I guess that only time and testing will tell.
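
If that idea pans out, it might look something like the following, treating an ip/site pair as an outlier when its entropy sits well below the average across all pairs. This is untested, just the shape of the idea:

def outliers(scores, cutoff = 2.0)
  values = scores.values
  mean = values.inject(0.0) { |s, v| s + v } / values.length
  stdev = Math.sqrt(values.inject(0.0) { |s, v| s + (v - mean)**2 } / values.length)
  scores.select { |_, e| e < mean - cutoff * stdev }   # everything else gets treated as regular traffic
end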

The script that I'm attaching is written in Ruby for ease of reading. My future plans are to rewrite this in something faster like OCaml, or even go as low as C or C++. Speed and memory are difficult to manage when you're dealing with terabytes of information, regardless of what language you're writing in; in these sorts of scenarios, any small difference can go a long way. The results have uncovered a lot of quirks in the network and have already allowed me to profile some machines. It also gives me the ability to break up a machine's activity into user-generated vs. machine-generated traffic, which, to an incident responder, is huge.

Until next time.

