September 02, 2002
Statistical Analysis of Spam - Part II
Just after finishing my last piece on this topic (Statistical Analysis of Spam - Part I), I received a list of some 1900 X-Spam-Status lines from Anders. After loading them into Minitab, I expected to get two graphs looking more or less similar to each other.
No such luck. The graph below shows a histogram of the scores of Anders' spam:

Now we can compare this to the scores achieved for my spam the previous day, and we can see a distinct difference between the two sets.

Now what can be the reason for this distinct difference between what should be quite similar sets of spam? Why does Anders get a different profile than I do? Anders' spam has a lower mean score than mine (16,4 vs. 17,8), and from the histogram one can easily see that a larger portion of his spam lies close to the magic limit of 5,0. The sets shown do not include the spam that slipped through below the limit (the false negatives), but from the graph it is easy to imagine that Anders' distribution would have many more messages going through...
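For readers who would rather not use Minitab, the comparison above can be roughly sketched in Python. This is only an illustration under stated assumptions: the header layout (`hits=` or `score=` followed by a number), the helper names `parse_scores` and `summarize`, the 5-point "near the limit" band, and the sample lines are all made up for the example and are not the data behind the histograms.

```python
import re
from statistics import mean

# Assumed header layout: "X-Spam-Status: Yes, hits=17.8 required=5.0 tests=..."
# Some versions write "score=" instead of "hits="; both are accepted here.
SCORE_RE = re.compile(r"(?:hits|score)=(-?\d+(?:\.\d+)?)")


def parse_scores(lines):
    """Extract the numeric score from every line that carries one."""
    scores = []
    for line in lines:
        match = SCORE_RE.search(line)
        if match:
            scores.append(float(match.group(1)))
    return scores


def summarize(scores, limit=5.0, margin=5.0):
    """Return the count, the mean, and the share of scores in [limit, limit + margin)."""
    near = [s for s in scores if limit <= s < limit + margin]
    return {
        "n": len(scores),
        "mean": mean(scores),
        "near_limit": len(near) / len(scores),
    }


# Hypothetical sample lines, invented for illustration only.
sample = [
    "X-Spam-Status: Yes, hits=6.2 required=5.0 tests=...",
    "X-Spam-Status: Yes, hits=8.0 required=5.0 tests=...",
    "X-Spam-Status: Yes, hits=17.8 required=5.0 tests=...",
    "X-Spam-Status: Yes, score=24.0 required=5.0 tests=...",
    "Subject: this line has no score and is skipped",
]

stats = summarize(parse_scores(sample))
print(stats)
```

Running the same summary on two mailboxes gives the two numbers discussed above for each: the mean score, and the share of caught spam that only barely cleared the limit.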
It at least explains why Anders is asking 'Is SpamAssassin helping the spammers too much?'. Anders is also keeping track of my and others' investigations into spam (Statistical Analysis of Spam).
In the next article I will try to analyse the distribution of spam. One hint to give: it ain't normal.
Posted by ludvig at September 2, 2002 05:46 PM
Spam assassin seems to work well. What are your results?
Posted by: zip codes at December 2, 2002 08:09 PM
Well, it all works very well... but it gradually lets more and more things through. I have not looked at the statistics yet, but I believe that the score mean is increasing as spammers are adapting to the filtering technology.
Posted by: Lars at December 8, 2002 02:28 AM