August 29, 2002
Statistical Analysis of Spam - Part I
As Anders pointed out earlier today, I have been working on analysing spam statistically. I gave a brief introduction to it in Spam Filtering - To Do or Not To Do, and these are the first results to emerge.
I have been using Minitab to analyse the results gathered from my spam mailbox. The spam mailbox has been tagged with X-Spam-Status by SpamAssassin. This preliminary analysis is based solely on the score that spam receives. No messages that passed without being categorized as spam will be analysed.
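For anyone who wants to collect the same kind of data, here is a minimal sketch of how a score could be pulled out of an X-Spam-Status header value. It assumes the value looks like "Yes, hits=17.1 required=5.0 ..." (newer SpamAssassin releases write score= instead of hits=, so both are accepted). The function name parse_spam_score is made up for this illustration, and this is not necessarily how the numbers in this post were extracted.

```python
import re

# Accepts "hits=17.1" (older SpamAssassin) or "score=17.1" (newer releases).
# Assumption: the score is a plain decimal number, possibly negative.
_SCORE_RE = re.compile(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def parse_spam_score(header_value):
    """Return (is_spam, score) for an X-Spam-Status value, or None if no score is found."""
    match = _SCORE_RE.search(header_value)
    if match is None:
        return None
    # The header value starts with "Yes" for spam and "No" otherwise.
    is_spam = header_value.strip().lower().startswith("yes")
    return is_spam, float(match.group(1))
```

Running this over every message in the spam mailbox and keeping the second element of each result would give the list of scores that a statistics package needs.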
I will try to analyse other aspects of spam detection and filtering statistically later. There is a vast array of different tests that can be run. The possibilities are many; time is the limit.
But for now - just look at the histogram.
Histogram showing frequency for score intervals:
Statistical key figures:
Variable      N    Mean  Median  StDev
Score      1964  17,884  17,100  7,847

Variable  Minimum  Maximum      Q1      Q3
Score       5,000   44,400  11,825  22,800
As the histogram shows, most spam scores far above the magic threshold of 5. Since the score does not appear to be normally distributed, I have not tried to use the properties of the normal distribution to estimate how many spam messages actually got through (i.e. had a score below 5). I could have counted them, but I cannot, as I have been deleting them all summer.
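Just to show what the calculation I decided against would look like, here is a small sketch that plugs the mean and standard deviation from the key figures into a normal model and asks what fraction would fall below 5. The answer comes out at roughly 5 percent, but it should not be trusted: the sample only contains messages that scored 5 or more, and the mean lying above the median hints at a right-skewed distribution. The function name normal_fraction_below is invented for this example.

```python
from statistics import NormalDist

# Key figures from the Minitab output above (decimal commas written as points).
MEAN = 17.884
MEDIAN = 17.100
STDEV = 7.847
THRESHOLD = 5.0


def normal_fraction_below(threshold, mean, stdev):
    """Fraction of a normal(mean, stdev) population that lies below threshold."""
    return NormalDist(mu=mean, sigma=stdev).cdf(threshold)


# Assumption (almost certainly false here): the scores are normally distributed.
fraction = normal_fraction_below(THRESHOLD, MEAN, STDEV)
print(f"Normal model: {fraction:.1%} of scores below {THRESHOLD}")

# A mean above the median is a simple hint of right skew.
print("Mean above median:", MEAN > MEDIAN)
```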
Maybe the distribution will look more normal as N increases. Or maybe it is chi-square? My knowledge of statistics isn't all that good, as I am more of an algebra and cryptography kind of mathematician, so I will have to let someone else look at it...
Expect more to come in the following days/weeks, though.
Posted by ludvig at August 29, 2002 09:10 PM