August 29, 2002

Spam Filtering - To Do or Not To Do

Spam has become an increasing problem, and there are many ways of fighting it. Relay blacklists and filtering out e-mail without the correct To: header were the first generation of spam filtering. It was seldom very successful, at least not for me. The blacklists took some, and some very naive filtering rules in procmail took some more, but still, spam was a nuisance.

Just before the summer I installed SpamAssassin, a tool that uses a whole array of different techniques to identify spam, including both RBL checks and more advanced header analysis. The installation was a big success, and since June 10 I have filtered out some 2000 spam messages.

The main success criterion of any spam filtering software package is its ability to filter out spam without a) filtering out non-spam and b) letting spam through. Spam filtering is very much like hypothesis testing in statistics. Set H0 to be 'e-mail is a legitimate message' (i.e. non-spam). Rejecting H0 when it is in fact true (i.e. classifying a legitimate message as spam) is the classic Type I error. Accepting H0 when it is false (i.e. when the message is spam) is a Type II error.

In the world of statistics the probability of committing a Type I error is commonly known as alpha, whereas the probability of committing a Type II error is known as beta.
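As a rough sketch of how alpha and beta could be estimated from a hand-sorted mailbox: count the legitimate messages that score at or above the trigger level, and the spam messages that score below it. The function name, the scores, and the trigger level of 5.0 are all made up for illustration; nothing here is measured from my own mail.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Estimate (alpha, beta) for a given trigger level.

    alpha: share of legitimate mail flagged as spam (Type I error).
    beta:  share of spam that slips through (Type II error).
    A message is treated as spam when its score >= threshold.
    """
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    alpha = false_positives / len(ham_scores)
    beta = false_negatives / len(spam_scores)
    return alpha, beta


# Hypothetical scores, invented purely for illustration.
ham = [0.1, 1.2, 2.5, 5.5, 0.0]
spam = [4.0, 7.5, 12.3, 40.0, 6.1]

alpha, beta = error_rates(ham, spam, threshold=5.0)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

Raising the trigger level in this sketch pushes alpha down and beta up, which is exactly the trade-off between the two errors.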

Of the two, I believe the Type I error is the more serious one. The fear of losing an e-mail from a long-lost friend or an incredible job offer makes it all slightly troublesome. On the other hand, if P(Type II) gets too big you are really back to square one with flooded mailboxes.

I have never really studied the results in my spam folder (except for searching for the extremes: spam achieving some 40 points in the SpamAssassin scoring system). The next task would be to see if the scores have a normal distribution, and then maybe calculate the risk of committing the much-feared Type I error.
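If the scores of legitimate mail did turn out to be roughly normal (an assumption that still has to be checked against real data), the Type I risk for a given trigger level could be sketched as the upper tail of a fitted normal distribution. The helper name and the sample scores below are hypothetical.

```python
from statistics import NormalDist


def type_i_risk(ham_scores, threshold):
    """Estimate P(Type I): the chance that a legitimate message
    scores at or above the trigger level.

    Assumes the legitimate-mail scores are normally distributed,
    which is only a working assumption in this sketch.
    """
    # Fit a normal distribution using the sample mean and sample stdev.
    fitted = NormalDist.from_samples(ham_scores)
    # Upper tail: probability of a score at or above the threshold.
    return 1.0 - fitted.cdf(threshold)


# Hypothetical scores for legitimate mail, invented for illustration.
ham = [0.0, 1.0, 2.0, 3.0, 4.0]

print(f"P(Type I) at trigger level 5.0: {type_i_risk(ham, 5.0):.3f}")
```

If a histogram of the real scores looks clearly skewed, the normal fit would misjudge the tail, and simply counting misclassified messages would be the safer estimate.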


In Anti Filter, Kalsey Consulting raises the possibility of helping 'rightful' spammers avoid being treated as spam. In SpamAssassin terms this means lowering the score of each single message. In a perfect world this sounds like a good idea, but in the real world one could guarantee only one thing: that spammers would take advantage of the same research to lower the score of illegitimate spam too. (Anders has a rather sharp reply to Kalsey.)

Well, needless to say, spammers will adapt to software like SpamAssassin. How do they actually manage to send out spam with a total score of 40 anyway? (Read more on the SpamAssassin scoring system.) Come Darwin...

Stay tuned for a statistical study of a mailbox... is there a unique P(Type I) and a P(Type II)? What would be the optimal trigger level?

Posted by ludvig at August 29, 2002 12:35 AM

This argument has some parallels to arguments in the security industry. Do you publish exploits so that administrators, who have a legitimate need to know, can be aware of them? Or does the publishing of exploits only help hackers?

There are two classes of hackers: those who understand the systems they are cracking, and those who use tools created by others to hack by numbers. The second group is often referred to as "script kiddies." They couldn't really hack a system without pre-made hacking tools, and most don't have any idea why the hacks actually work. The first group has an understanding of network protocols, memory buffers, and the underbelly of operating systems.

If exploits were not published, the serious hackers would still be able to find and exploit vulnerabilities in systems, but the script kiddies would not.

The same goes for spam. The vast majority of spammers are "spam kiddies" that don't understand the concepts of open relays, filtering algorithms, and regular expressions. Even if you patiently explained to them exactly how to get past your filters, they couldn't do it.

The other spammers are sophisticated and pour money and sweat into research on bypassing spam filters. These people don't need someone like me to tell them the basics about how spam filters work. They could probably teach me a thing or two about it.

So by helping marketers understand spam filters, you aren't likely to provide much help to spammers.

Posted by: Adam Kalsey at August 30, 2002 05:07 AM