September 02, 2002

Statistical Analysis of Spam - Part II

Just after finishing my last piece on this topic (Statistical Analysis of Spam - Part I) I received a list of some 1900 X-Spam-Status lines from Anders. Loading these into Minitab, I expected to get two graphs looking more or less similar to each other.

No such luck. The graph below shows a histogram of the scores of Anders' spam:

score_a.png

Now we can compare this to the scores achieved for my spam the previous day, and we can see a distinct difference between the two sets.

score_l.png

What can be the reason for this distinct difference between what should be quite similar sets of spam? Why does Anders get a different profile than I do? Anders' spam has a lower mean score than mine (16.4 vs. 17.8), and from the histogram one can easily see that a larger portion of his spam lies close to the magic limit of 5.0. The sets shown do not include all the false positives, but from the graph it is easy to imagine that Anders' distribution would have many more messages getting through...
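For anyone who wants to repeat the comparison without Minitab, here is a minimal Python sketch of the same bookkeeping: pull the score out of each X-Spam-Status line, then compute the mean and the share of messages sitting just above the limit. It assumes the header looks like "X-Spam-Status: Yes, hits=17.8 required=5.0 tests=..." (newer SpamAssassin releases write "score=" instead of "hits="). The function names, the 5-point margin and the bin width are my own illustrative choices, not anything taken from the data sets above.

```python
import math
import re
from statistics import mean

# Assumed header format: "X-Spam-Status: Yes, hits=17.8 required=5.0 tests=..."
# Older SpamAssassin releases write "hits=", newer ones "score=".
SCORE_RE = re.compile(r"X-Spam-Status:.*?\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def parse_scores(lines):
    """Return the score from every line that carries an X-Spam-Status header."""
    scores = []
    for line in lines:
        match = SCORE_RE.search(line)
        if match:
            scores.append(float(match.group(1)))
    return scores


def summarize(scores, threshold=5.0, margin=5.0):
    """Mean score and share of messages within `margin` points above the limit.

    `margin` is a hypothetical cut-off for what counts as "close" to the limit.
    """
    near = [s for s in scores if threshold <= s < threshold + margin]
    return {
        "count": len(scores),
        "mean": mean(scores),
        "near_threshold_share": len(near) / len(scores),
    }


def histogram(scores, bin_width=5.0):
    """Count scores per bin, keyed by the left edge of each bin."""
    counts = {}
    for s in scores:
        left = math.floor(s / bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))
```

Running parse_scores over each person's list of header lines and comparing the two summarize results side by side gives the same kind of picture as the histograms: a lower mean together with a higher near_threshold_share means more spam is hovering just above the limit.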

It at least explains why Anders is asking 'Is SpamAssassin helping the spammers too much?'. Anders is also keeping track of my and others' investigations into spam (Statistical Analysis of Spam).

In the next article I will try to analyse the distribution of spam. One hint to give: it ain't normal.

Posted by ludvig at September 2, 2002 05:46 PM
Comments

SpamAssassin seems to work well. What are your results?

Posted by: zip codes at December 2, 2002 08:09 PM

Well, it all works just very well... it gradually lets more and more things through. I have not looked at the statistics yet, but I believe that the mean score is increasing as spammers adapt to the filtering technology.

Posted by: Lars at December 8, 2002 02:28 AM