Bayes' theorem applied to spam
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

P(A|B) == P(A.B) / P(B)
       == P(B|A) . P(A) / [ P(B|A) . P(A) + P(B|!A) . P(!A) ]

P(A+B) == P(A) + P(B) - P(A.B)

----------------------------------------

F(X) is the number of emails in the database for which X is true.

P(spam) == F(spam) / [ F(spam) + F(good) ]
P(good) == F(good) / [ F(spam) + F(good) ]

Given the list of words in an email, what is the probability
that it is spam?

P(spam|words) == P(words|spam) . P(spam) /
    [ P(words|spam) . P(spam) + P(words|good) . P(good) ]

Multiply the numerator and denominator by F(spam) + F(good):

P(spam|words) == P(words|spam) . F(spam) /
    [ P(words|spam) . F(spam) + P(words|good) . F(good) ]

Also, assuming the words occur independently of each other,

P(words|spam) == F(word1.spam) / F(spam)
               . F(word2.spam) / F(spam)
               . F(word3.spam) / F(spam) ...

where F(wordN.spam) is the number of spam emails that contain
wordN. P(words|good) is computed the same way from the good emails.

----------------------------------------

Major problem: how to collect emails for the database such that
the above formulae for P(spam) etc. are reasonably accurate?

----------------------------------------

$dotat: doc/web/writing/bayes.txt,v 1.1 2002/09/09 10:24:48 fanf2 Exp $
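
----------------------------------------

A minimal Python sketch of the formulae above, not a definitive
implementation. The function names (train, p_words,
p_spam_given_words) and the representation of an email as a list of
words are assumptions made for illustration. Each word is counted
once per email, matching the definition of F(X) as a number of
emails, and each per-word count is taken within one class (spam or
good). There is no smoothing, so a word never seen in one class
forces that class's term to zero.

```python
from collections import Counter


def train(emails):
    """Return (per-word email counts, number of emails) for one class.

    `emails` is a list of word lists; a word is counted once per email.
    """
    counts = Counter()
    for words in emails:
        counts.update(set(words))
    return counts, len(emails)


def p_words(words, counts, total):
    """P(words|class) as a product of per-word frequencies (independence assumed)."""
    p = 1.0
    for word in set(words):
        p *= counts[word] / total
    return p


def p_spam_given_words(words, spam_counts, f_spam, good_counts, f_good):
    """P(spam|words) using the form already multiplied by F(spam) + F(good)."""
    spam_term = p_words(words, spam_counts, f_spam) * f_spam
    good_term = p_words(words, good_counts, f_good) * f_good
    if spam_term + good_term == 0:
        return None  # the words were never seen together in either class
    return spam_term / (spam_term + good_term)


# Toy database, invented purely for illustration.
spam_counts, f_spam = train([["buy", "pills"], ["buy", "now"]])
good_counts, f_good = train([["meeting", "now"], ["lunch", "now"]])

print(p_spam_given_words(["buy"], spam_counts, f_spam, good_counts, f_good))
print(p_spam_given_words(["now"], spam_counts, f_spam, good_counts, f_good))
```

In the toy database "buy" appears only in spam, so the first result
is 1.0; "now" appears in one of two spam emails and both good
emails, so the second is 1 / (1 + 2), about 0.33.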