Bayes' theorem applied to spam
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


In what follows, A.B means "A and B", A+B means "A or B", and !A
means "not A".

 P(A|B) == P(A.B) / P(B)

        == P(B|A) . P(A) / [ P(B|A) . P(A) + P(B|!A) . P(!A) ]

 P(A+B) == P(A) + P(B) - P(A.B)
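
As a sanity check, the identities above can be verified numerically.
This is only a sketch: the joint probabilities are made-up numbers
and the variable names are illustrative, not part of any real system.

```python
from fractions import Fraction

# Hypothetical joint probabilities over two events A and B (made-up numbers).
p = {
    (True, True): Fraction(1, 10),    # P(A.B)
    (True, False): Fraction(3, 10),   # P(A.!B)
    (False, True): Fraction(2, 10),   # P(!A.B)
    (False, False): Fraction(4, 10),  # P(!A.!B)
}

p_a = p[True, True] + p[True, False]
p_b = p[True, True] + p[False, True]
p_ab = p[True, True]
p_not_a = 1 - p_a

# Conditional probabilities straight from the definition P(X|Y) == P(X.Y) / P(Y).
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a
p_b_given_not_a = p[False, True] / p_not_a

# Right-hand side of Bayes' theorem, with P(B) expanded over A and !A.
bayes = p_b_given_a * p_a / (p_b_given_a * p_a + p_b_given_not_a * p_not_a)

# P(A+B) computed as the complement of "neither A nor B".
p_a_or_b = 1 - p[False, False]
```

With these numbers both sides of each identity should agree exactly,
since Fraction avoids floating-point rounding.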


----------------------------------------


F(X) is the number of emails in the database for which X is true.


P(spam) == F(spam) / [ F(spam) + F(good) ]
P(good) == F(good) / [ F(spam) + F(good) ]
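
A minimal sketch of these two estimates, assuming the database has
already been reduced to two counts. The function name priors and the
example counts are hypothetical.

```python
def priors(f_spam, f_good):
    """Return (P(spam), P(good)) estimated from the counts F(spam) and F(good)."""
    total = f_spam + f_good
    if total == 0:
        raise ValueError("the database is empty")
    return f_spam / total, f_good / total


# Made-up example: 300 spam emails and 700 good emails in the database.
p_spam, p_good = priors(300, 700)
```

The two estimates always sum to 1, because every email in the
database is counted as either spam or good.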


Given the list of words in an email, what is the probability that it is spam?

P(spam|words) == P(words|spam) . P(spam) / [ P(words|spam) . P(spam) + P(words|good) . P(good) ]


Multiply the numerator and the denominator by F(spam) + F(good), which
turns P(spam) into F(spam) and P(good) into F(good):

P(spam|words) == P(words|spam) . F(spam) / [ P(words|spam) . F(spam) + P(words|good) . F(good) ]
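
The sketch below checks that the form using raw counts gives the same
answer as the form using P(spam) and P(good). The function names and
the numbers are hypothetical; the two likelihoods P(words|spam) and
P(words|good) are simply passed in as given values.

```python
def posterior_from_priors(l_spam, l_good, f_spam, f_good):
    """P(spam|words) using P(spam) and P(good) explicitly."""
    total = f_spam + f_good
    p_spam = f_spam / total
    p_good = f_good / total
    return l_spam * p_spam / (l_spam * p_spam + l_good * p_good)


def posterior_from_counts(l_spam, l_good, f_spam, f_good):
    """P(spam|words) after cancelling the common factor F(spam) + F(good)."""
    return l_spam * f_spam / (l_spam * f_spam + l_good * f_good)


# Made-up inputs: P(words|spam) = 0.02, P(words|good) = 0.001,
# F(spam) = 300, F(good) = 700.
via_priors = posterior_from_priors(0.02, 0.001, 300, 700)
via_counts = posterior_from_counts(0.02, 0.001, 300, 700)
```

Both calls should agree up to floating-point rounding, so the total
number of emails never needs to be computed separately.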


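Also, assuming that words occur independently of each other within
each class (the "naive" Bayes assumption),

P(words|spam) == F(word1.spam) / F(spam) . F(word2.spam) / F(spam) . F(word3.spam) / F(spam) ...

where F(wordN.spam) is the number of spam emails in the database that
contain wordN. P(words|good) is defined likewise from the good emails.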
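
Putting the pieces together, here is a minimal sketch of the whole
calculation. It assumes that each per-word count is taken over the
emails of one class only (spam emails containing the word, divided by
F(spam)), and that the database is a list of (set of words, is_spam)
pairs. The toy database and all function names are hypothetical. No
smoothing is applied, so a word never seen in a class drives that
class's product to zero.

```python
from math import prod

# Hypothetical toy database: (set of words in the email, is_spam).
DATABASE = [
    ({"cheap", "pills", "offer"}, True),
    ({"cheap", "offer", "meeting"}, True),
    ({"meeting", "agenda"}, False),
    ({"offer", "agenda", "lunch"}, False),
    ({"lunch", "meeting"}, False),
]


def count(database, is_spam, word=None):
    """F(class) if word is None, else the number of emails of that class containing word."""
    return sum(
        1
        for words, spam in database
        if spam == is_spam and (word is None or word in words)
    )


def likelihood(database, words, is_spam):
    """P(words|class) as a product of per-word frequencies within the class."""
    f_class = count(database, is_spam)
    return prod(count(database, is_spam, w) / f_class for w in words)


def p_spam_given_words(database, words):
    """P(spam|words) in the form that uses F(spam) and F(good) directly."""
    spam_term = likelihood(database, words, True) * count(database, True)
    good_term = likelihood(database, words, False) * count(database, False)
    if spam_term + good_term == 0:
        raise ValueError("both class products are zero; the posterior is undefined")
    return spam_term / (spam_term + good_term)


mixed = p_spam_given_words(DATABASE, ["offer", "meeting"])
spammy = p_spam_given_words(DATABASE, ["cheap"])
```

In this toy database "cheap" appears only in spam, so the second
result is driven entirely by the spam term, which illustrates how
sensitive the unsmoothed product is to words missing from one class.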


----------------------------------------


Major problem: how to collect emails for the database such that the
above formulae for P(spam) etc. are reasonably accurate?


----------------------------------------

$dotat: doc/web/writing/bayes.txt,v 1.1 2002/09/09 10:24:48 fanf2 Exp $