September 10, 2002
Statistical Analysis of Spam - Part IV
My series on spam and the statistics behind it continues. Last time (part III) I found that the distribution of spam scores appeared to be Weibull. An interesting discovery. Combine that with some knowledge of the Weibull distribution and some results from Minitab, and one can actually calculate the risk of spam slipping through SpamAssassin undetected.
I found out pretty quickly that the alphas and betas in Minitab and in my textbook were totally different. So the correct formula for the probability density function, when using the coefficients returned by Minitab, is:
![]()
Note that the meaning of alpha and beta has been swapped (how useful...).
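For reference, here is the standard two-parameter Weibull density written with neutral symbols, to sidestep the alpha/beta confusion. This is only a sketch: $k$ (shape) and $\lambda$ (scale) are my own labels, and I am assuming Minitab's fitted coefficients correspond to shape and scale in this form.

$$
f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^{k}}, \qquad x \ge 0,\; k > 0,\; \lambda > 0
$$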
Anyway, plotting the values for Anders and myself we get the following two probability distributions:

Anders' spam is the leftmost graph, whereas mine is the rightmost. Getting the probability that any given spam scores less than 5 (or, in this simplified case, between 0 and 5) should be as easy as integrating the probability density function stated above from 0 to 5.
So it is. I would recommend using some sort of numerical or analytical mathematics package to do this, as the integral is slightly dirty. The answers are straightforward, though: the probability of any spam scoring less than 5 was 5.5% for Anders and some 3.2% for me.
Which, for me at least, is about what I would expect it to be. What I would like to do in the next article is see whether I can detect any change within my own spam. Formulated as a question: does the average spam score increase or decrease, and how does that affect the probability distribution?
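If you don't have Maple at hand, the same integral can be sketched in a few lines of Python. The shape and scale values below are made-up placeholders, not the coefficients Minitab returned for Anders or me, so the printed number will not match the percentages above. The sketch assumes the two-parameter Weibull in shape/scale form, integrates the density from 0 to 5 with a simple midpoint rule, and cross-checks the result against the closed-form Weibull CDF, 1 - exp(-(x/scale)^shape).

```python
import math


def weibull_pdf(x, shape, scale):
    """Two-parameter Weibull density (shape/scale form)."""
    if x < 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def weibull_cdf(x, shape, scale):
    """Closed-form Weibull CDF: P(X <= x)."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))


def integrate_midpoint(f, a, b, n=100_000):
    """Midpoint rule; never evaluates f at the endpoints themselves."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Hypothetical parameters for illustration only (NOT the fitted values).
shape, scale = 3.0, 15.0
threshold = 5.0  # the score cut-off used in this article

p_numeric = integrate_midpoint(
    lambda x: weibull_pdf(x, shape, scale), 0.0, threshold
)
p_closed = weibull_cdf(threshold, shape, scale)

print(f"P(score < {threshold}) numeric:     {p_numeric:.6f}")
print(f"P(score < {threshold}) closed form: {p_closed:.6f}")
```

To reproduce the figures for a real data set, replace `shape` and `scale` with the values from the Minitab fit.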
Files:
- Maple file showing formulas, plots and integrals.
- Minitab file with all statistical data.