Simple application of Bayesian spam mail filtering under Ximian Evolution 1.4
This document (a basic primer) describes one method of using bogofilter with Ximian Evolution 1.4.
The operating system of choice for this demonstration is Debian, due to its easy-to-administer apt-get system.
So, let's get stuck into it.
You will need

[Screenshot: my Evolution with the Bayesian-sorted spam folder showing]
Step 1 Installing bogofilter
Well, this was pretty easy: although my system is on the testing tree of
Debian, I already had this installed.
debian:/home/user# apt-get install bogofilter
Reading Package Lists... Done
Building Dependency Tree... Done
Sorry, bogofilter is already the newest version.
0 packages upgraded, 0 newly installed, 0 to remove and 618 not upgraded.
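If you would rather check for bogofilter from a script than read the apt-get output, something like the following would do. It is only a sketch: have_cmd is a made-up helper name, not part of bogofilter or Debian.

```shell
# have_cmd NAME: succeed if NAME can be found on the PATH (helper name is made up)
have_cmd() {
    command -v "$1" > /dev/null 2>&1
}

if have_cmd bogofilter; then
    echo "bogofilter is installed"
else
    echo "bogofilter is missing, try: apt-get install bogofilter"
fi
```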
Step 2 We'll assume you have Evolution 1.4 installed
I have no idea whether any Evolution version below 1.4 supports checking the
result of piped scripts in its filters; I know for a fact that version 1.0.5 does not.
You will need to create a sub-folder called "spam" and move as many spam
items as you have into it, as this is what we use to "teach" the filter
what spam is. You want as many entries as possible, while still
keeping lots of legitimate emails in your Inbox so the system also
knows what HAM (non-spam) looks like; otherwise it will think all email
looks like spam.
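To get a rough idea of how much spam and ham you are about to teach it, you can count the messages in each mbox file. This is a minimal sketch: count_messages is a hypothetical helper, and it assumes every message starts with a "From " separator line, as is usual for mbox files.

```shell
# count_messages MBOX: print the number of messages in an mbox file.
# Assumes each message begins with a "From " separator line.
count_messages() {
    awk '/^From / { n++ } END { print n + 0 }' "$1"
}

# Example, using the folder paths from the wrapper script in Step 3:
# count_messages /home/user/evolution/local/Inbox/subfolders/spam/mbox
# count_messages /home/user/evolution/local/Inbox/mbox
```

If one count is tiny compared with the other, it is probably worth collecting more mail before trusting the results.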
Step 3 Bogofilter wrapper script
This script is called by the Evolution filter system to get a result
about an item of mail.
If you wanted to use SpamAssassin instead, this is where you would put it.
It provides the interface between Evolution and bogofilter. I like to
place this at /home/user/bin/scanspam; don't forget to chmod +x it.
You could also place all this gear in /usr/local/bin and make it system-wide, maybe even with a system-wide bogofilter pool as well.
SPAMBOX and HAMBOX point to the actual mbox files that Evolution uses:
your main inbox and the spam folder you just created. Maybe
do an 'ls' on these to check that they exist at the paths you think they do.
#!/bin/bash
SPAMBOX="/home/user/evolution/local/Inbox/subfolders/spam/mbox"
HAMBOX="/home/user/evolution/local/Inbox/mbox"
# relearn the spam folder on every run (-M: input is an mbox, -s: register as spam)
bogofilter -M -s < "$SPAMBOX" # this is spam
# unhash this if you want it to relearn your inbox each time as well
#bogofilter -M -n < "$HAMBOX" # this is ham
# -o 0.45 gives us a "tolerance rating"
# change -2 to -3 if you want a third "unsure" rating instead of a plain yes or no
# classify the message that Evolution pipes in on stdin
bogofilter -2 -o 0.45
ret=$? # save the return value
echo $ret
exit $ret
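Before wiring the script into Evolution, the 'ls' check on SPAMBOX and HAMBOX can be scripted as well. The following is a sketch under the assumption that a usable mbox is a readable, non-empty regular file; check_mbox is a name invented for this example.

```shell
# check_mbox PATH: report whether PATH is a readable, non-empty regular file
check_mbox() {
    if [ -f "$1" ] && [ -r "$1" ] && [ -s "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1"
        return 1
    fi
}

# Example, with the paths from the wrapper script:
# check_mbox /home/user/evolution/local/Inbox/subfolders/spam/mbox
# check_mbox /home/user/evolution/local/Inbox/mbox
```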
The 0.45 seems dependent on the ratio of existing spam to ham (in your assigned folders).
Cat a known-spam test message through "|bogofilter -v" to see what it rates as, then set your threshold about there.
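To get a feel for how a candidate threshold would split a few scores, a comparison like the one below can help. It is a sketch only: is_spam is a hypothetical helper, the three scores are made up (substitute the ratings you actually see from "bogofilter -v"), and treating a score exactly equal to the cutoff as spam is an assumption.

```shell
# is_spam SCORE CUTOFF: succeed if SCORE is at or above CUTOFF.
# awk does the comparison because the shell's [ ] only handles integers.
# Counting "equal to the cutoff" as spam is an assumption of this sketch.
is_spam() {
    awk -v s="$1" -v c="$2" 'BEGIN { exit !(s + 0 >= c + 0) }'
}

# Made-up example scores checked against the 0.45 used in the wrapper script
for score in 0.30 0.45 0.97; do
    if is_spam "$score" 0.45; then
        echo "$score: spam"
    else
        echo "$score: not spam"
    fi
done
```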
Step 4 Configuring the Evolution filters
Evolution seems to use some XML to describe the filtering system. You
might want to configure some simple filters to shuffle mail around first (so
you know your basic filter subsystem is functioning OK).
Now, you can either add them manually (sorry for the small pic) by
going to the filters menu and creating a new filter, or add to your filters.xml directly (see below).
You want "Pipe message to shell command" -> [path of scanspam (/home/user/bin/scanspam)] -> returns -> 0
(0 is what the wrapper script returns when bogofilter rates the message as spam).
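The filter only looks at the exit status of the command it pipes the message to, so you can mimic it from a terminal before trusting it with real mail. A minimal sketch follows; filter_status is a name invented here, and the saved message path in the example is hypothetical.

```shell
# filter_status FILE CMD [ARGS...]: run CMD with FILE on its stdin and
# print the exit status, which is the value the Evolution rule compares to 0
filter_status() {
    msg="$1"
    shift
    "$@" < "$msg" > /dev/null 2>&1
    echo $?
}

# Example (the saved message path is hypothetical):
# filter_status /tmp/known-spam-message /home/user/bin/scanspam
```

A known-spam message should print 0 here if the rule above is going to match it.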

If you are adept at dealing with XML, look in ~/evolution/filters.xml;
you can just add the following slice of XML (careful not to break your
existing structure).
<rule grouping="any" source="incoming">
  <title>SPAM!</title>
  <partset>
    <part name="pipe">
      <value name="command" type="command">
        <command>/home/user/bin/scanspam</command>
      </value>
      <value name="retval-type" type="option" value="is"/>
      <value name="retval" type="integer" integer="0"/>
    </part>
  </partset>
  <actionset>
    <part name="move-to-folder">
      <value name="folder" type="folder">
        <folder uri="file:///home/user/evolution/local/Inbox/subfolders/spam"/>
      </value>
    </part>
  </actionset>
</rule>
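Since a broken filters.xml could take your existing filters with it, it seems sensible to keep a copy before editing by hand. A small sketch, assuming a timestamped copy next to the original is good enough; backup_file is a made-up helper name.

```shell
# backup_file PATH: copy PATH to PATH.bak.<timestamp> and print the copy's name
backup_file() {
    copy="$1.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$1" "$copy" && echo "$copy"
}

# Example:
# backup_file ~/evolution/filters.xml
```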
Step 5 Teaching some basics to the system and getting on with life...
Obviously you need to show bogofilter what is, and what is not, spam.
The system will relearn what is spam each time it runs, but won't
relearn what is NOT spam (i.e. ham) unless you tell it to.
I've done it this way because my inbox takes about 30 seconds to process
on this 450 MHz x86 PC. You may wish to place something like...
HAMBOX=/path/to/known/mbox/with/no/spam/and/lots/of/emails
SPAMBOX=/path/to/evolution/folder/full/of/known/spam
bogofilter -M -n < "$HAMBOX" #this is ham
bogofilter -M -s < "$SPAMBOX" #this is spam <-- optional
...into a cron job somewhere, so that maybe once per hour it will relearn what is
not spam.
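As a rough sketch of what such a cron job could call, the snippet below wraps the two training commands so that a missing mbox is skipped rather than fed to bogofilter. The train helper, the BOGOFILTER override and the script name in the crontab line are all assumptions made for this example.

```shell
#!/bin/bash
# Sketch of a relearn script for cron; the paths are the placeholders from above.
HAMBOX=/path/to/known/mbox/with/no/spam/and/lots/of/emails
SPAMBOX=/path/to/evolution/folder/full/of/known/spam
BOGOFILTER="${BOGOFILTER:-bogofilter}"  # overridable, handy for a dry run

# train FLAG MBOX: register MBOX as ham (-n) or spam (-s), skipping unreadable files
train() {
    if [ ! -r "$2" ]; then
        echo "skipping unreadable mbox: $2" >&2
        return 1
    fi
    "$BOGOFILTER" -M "$1" < "$2"
}

# The body of the script would then be:
#   train -n "$HAMBOX"
#   train -s "$SPAMBOX"   # optional
#
# And an hourly crontab entry (the script name is hypothetical):
#   0 * * * * /home/user/bin/relearn
```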
A side effect of having the system relearn what is spam on each filter
event is that if it misses an item of spam, or has a false positive,
you can drag the mail item into the spam folder, right-click on the item
that is now in the spam folder and apply
filters; this will relearn. (Don't forget to expunge your folders
before you do this so you don't have stale data.)
Things to tweak
In the shell script you can tweak the
tolerance rating; this can improve results a lot.
Try adding another filter to move messages to a
"Maybe spam" folder. You will need to change -2 to -3 in the shell
script where it calls bogofilter, and set the filter's return value check to "2".
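With -3 in place, the wrapper can return three interesting values, and each Evolution filter matches one of them. The mapping can be sketched as below. It assumes bogofilter's usual exit codes (0 for spam, 1 for ham, 2 for unsure); folder_for is a hypothetical name, and in practice the routing is done by two filter rules rather than by a script.

```shell
# folder_for STATUS: print the folder a message should end up in, given the
# wrapper's exit status under -3 (0 = spam, 2 = unsure, anything else stays put)
folder_for() {
    case "$1" in
        0) echo "spam" ;;
        2) echo "Maybe spam" ;;
        *) echo "Inbox" ;;
    esac
}
```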
Finally, it is really important
that you don't have any spam in your initial "ham" mbox file (even a couple of
spams out of a thousand seems to be enough to skew the results pretty severely), otherwise
the Bayesian analysis just won't be effective.
Another thing: you might also want to take a look at using SpamAssassin instead of bogofilter
in your shell script/wrapper; however, I've found the results with bogofilter to be more "reliable".
Have fun!!
Any questions? leighm@linuxbandwagon.com
LinuxBandWagon Pty Ltd Contract Linux nerds
MailLaundry Pty ltd Commercial/Domain level mail filtering solution
And btw, John Howard is a gullible tool and needs to be excised from The Lodge as soon as possible.
(Last updated Dec 2003)