August 29, 2002
Spam Filtering - To Do or Not To Do
Spam has become an increasing problem. There are many ways of fighting spam. Relay blacklists and filtering out e-mail without the correct To: header were the first generation of spam filtering. It was seldom very successful, at least not for me. The blacklists caught some, and some very naïve filtering rules in procmail caught some more, but spam was still a nuisance.
Just before the summer I installed SpamAssassin, a tool that uses a whole array of different techniques to identify spam, including both RBL checks and more advanced header analysis. The installation was a big success, and since June 10 I have filtered out some 2000 spam messages.
The main success criterion of any spam filtering software package is its ability to filter out spam without a) filtering out non-spam or b) letting spam through. Spam filtering is very much like hypothesis testing in statistics. Set H0 to be 'e-mail is a legitimate message' (i.e. non-spam). Rejecting H0 when it is in fact true is the classic Type I error: a legitimate message is falsely classified as spam. Accepting H0 when it is in fact false (i.e. when the message is spam) is called a Type II error.
In the world of statistics the probability of committing a Type I error is commonly known as alpha, whereas the probability of committing a Type II error is known as beta.
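As a rough sketch of what alpha and beta mean for a mailbox, both can be estimated as simple fractions once some messages have been sorted by hand. The function name, the scores and the trigger level of 5.0 below are all made up for illustration; none of them come from a real mailbox.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Estimate (alpha, beta) for a given trigger level.

    alpha: share of legitimate mail scoring at or above the threshold (Type I).
    beta:  share of spam scoring below the threshold (Type II).
    """
    alpha = sum(s >= threshold for s in ham_scores) / len(ham_scores)
    beta = sum(s < threshold for s in spam_scores) / len(spam_scores)
    return alpha, beta


# Hypothetical, hand-labelled scores (illustration only).
ham_scores = [-1.2, 0.0, 0.4, 1.1, 2.3, 2.9, 3.5, 5.6]
spam_scores = [3.8, 4.9, 6.2, 7.5, 9.1, 12.4, 18.0, 40.3]

alpha, beta = error_rates(ham_scores, spam_scores, threshold=5.0)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")  # alpha = 0.125, beta = 0.250
```

With these invented numbers, one of eight legitimate messages would be lost and two of eight spam messages would get through.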
Of the two, I believe the Type I error is the more serious one. The fear of losing the e-mail from a long-lost friend or the incredible job offer makes it all slightly troublesome. On the other hand, if P(Type II) gets too big you are really back to square one, with flooded mailboxes.
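One way to make this trade-off concrete is to put a price on each kind of error and pick the trigger level with the lowest total cost. The sketch below assumes, purely for illustration, that a lost legitimate message is ten times as bad as a spam message getting through; the weights, the candidate levels and the scores are all hypothetical.

```python
def best_threshold(ham_scores, spam_scores, candidates,
                   type1_cost=10.0, type2_cost=1.0):
    """Return the candidate trigger level with the lowest weighted error cost."""

    def cost(threshold):
        # Type I: legitimate mail at or above the threshold is lost.
        alpha = sum(s >= threshold for s in ham_scores) / len(ham_scores)
        # Type II: spam below the threshold gets through.
        beta = sum(s < threshold for s in spam_scores) / len(spam_scores)
        return type1_cost * alpha + type2_cost * beta

    return min(candidates, key=cost)


# Hypothetical, hand-labelled scores (illustration only).
ham_scores = [-1.2, 0.0, 0.4, 1.1, 2.3, 2.9, 3.5, 5.6]
spam_scores = [3.8, 4.9, 6.2, 7.5, 9.1, 12.4, 18.0, 40.3]

candidates = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(best_threshold(ham_scores, spam_scores, candidates))  # 6.0
```

With a heavy penalty on Type I errors the chosen level moves just above the highest-scoring legitimate message, at the price of letting a little more spam through.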
I have never really studied the results in my spam folder (except for searching for the extremes: spam achieving some 40 points in the SpamAssassin scoring system). The next task would be to see if the scores have a normal distribution, and then maybe calculate the risk of committing the much-feared Type I error.
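If the scores did turn out to be roughly normal, the Type I risk could be read off the tail of a fitted distribution. A minimal sketch, assuming a sample of scores from known-legitimate messages is available (the spam folder alone does not provide this) and that the normal model actually fits, which is not checked here. The scores and the trigger level of 5.0 are invented for illustration.

```python
from statistics import NormalDist


def type1_risk_normal(ham_scores, threshold):
    """P(score >= threshold) for legitimate mail under a fitted normal model."""
    # Fit mean and standard deviation from the sample (needs at least 2 values).
    dist = NormalDist.from_samples(ham_scores)
    return 1.0 - dist.cdf(threshold)


# Hypothetical scores of known-legitimate messages (illustration only).
ham_scores = [-1.2, 0.0, 0.4, 1.1, 2.3, 2.9, 3.5, 5.6]

risk = type1_risk_normal(ham_scores, threshold=5.0)
print(f"Estimated P(Type I) at trigger level 5.0: {risk:.3f}")
```

The same approach applied to a sample of spam scores, using `dist.cdf(threshold)` directly, would give a matching estimate of P(Type II).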
In Anti Filter, Kalsey Consulting raises the possibility of helping 'rightful' spammers avoid being treated as spam. In SpamAssassin terms this means lowering the score of each individual message. In a perfect world this sounds like a good idea, but in the real world one could guarantee only one thing: that spammers would take advantage of the same research to lower the score on illegitimate spam too. (Anders has a rather sharp reply to Kalsey.)
Needless to say, spammers will adapt to software like SpamAssassin. How do they actually manage to send out spam with a total score of 40 anyway? (Read more on the SpamAssassin scoring system.) Come Darwin...
Stay tuned for a statistical study of a mailbox... is there a unique P(Type I) and a P(Type II)? What would be the optimal trigger level?
Posted by ludvig at August 29, 2002 12:35 AM