Simple application of Bayesian spam mail filtering under Ximian Evolution 1.4

This document (a basic primer) describes one method of using bogofilter with Ximian Evolution 1.4.
The operating system of choice for this demonstration is Debian, due to its easy-to-administer apt-get system.

So, let's get stuck into it.

You will need

bogofilter
evolution

[Screenshot: my Evolution with the Bayesian-sorted spam folder showing]

Step 1    Installing bogofilter

Well, this was pretty easy. My system is on the testing tree of Debian, and I already had bogofilter installed:

debian:/home/user# apt-get install bogofilter
Reading Package Lists... Done
Building Dependency Tree... Done
Sorry, bogofilter is already the newest version.
0 packages upgraded, 0 newly installed, 0 to remove and 618  not upgraded.


Step 2     We'll assume you have evolution 1.4 installed.

I have no idea whether any Evolution version below 1.4 supports checking the result of piped scripts in its filters; I know for a fact that version 1.0.5 does not.
You will need to create a sub-folder called "spam" and move as many spam items as you have into it, as this is how we "teach" the filter what spam is. You want as many entries as possible, while still keeping lots of legitimate emails in your Inbox so the system also knows what HAM (non-spam) looks like; otherwise it will think all email looks like spam.
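
To get a rough idea of how many messages each folder holds, something like the following could be used. It is only a sketch: count_messages is a made-up helper name, and it assumes the folders are stored as mbox files in which every message starts with a "From " separator line.

```shell
#!/bin/bash
# Count the messages in an mbox file by counting "From " separator lines.
# Approximate: assumes body lines starting with "From " have been escaped
# by whatever wrote the mbox (the usual mbox convention).
count_messages() {
    grep -c '^From ' "$1"
}
```

For example, count_messages /home/user/evolution/local/Inbox/mbox prints the number of messages in the main inbox; run it on the spam folder's mbox too and compare the two figures.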

Step 3     Bogofilter wrapper script

This script is called by the Evolution filter system to get a result about an item of mail.
It provides the interface between Evolution and bogofilter. If you wanted to use SpamAssassin instead, this is where you would put it.
I like to place it at /home/user/bin/scanspam; don't forget to chmod +x it.
You could also place all this gear in /usr/local/bin and make it system-wide, maybe even with a system-wide bogofilter pool.

SPAMBOX and HAMBOX point to the actual mbox files that Evolution uses: your main inbox and the spam folder you just created. Do an 'ls' on these to check they exist at the paths you think they do.
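
If you would rather script that check than eyeball an 'ls', a small helper along these lines might do. check_mboxes is a hypothetical name, not part of bogofilter or Evolution.

```shell
#!/bin/bash
# Report whether each given mbox path exists and is readable.
# Returns non-zero if any path is missing or unreadable.
check_mboxes() {
    local status=0 path
    for path in "$@"; do
        if [ -f "$path" ] && [ -r "$path" ]; then
            echo "ok: $path"
        else
            echo "missing or unreadable: $path" >&2
            status=1
        fi
    done
    return "$status"
}
```

Call it with both paths, e.g. check_mboxes /home/user/evolution/local/Inbox/mbox /home/user/evolution/local/Inbox/subfolders/spam/mbox, and fix anything it reports before wiring the wrapper into Evolution.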

#!/bin/bash
SPAMBOX="/home/user/evolution/local/Inbox/subfolders/spam/mbox"
HAMBOX="/home/user/evolution/local/Inbox/mbox"
# register everything in the spam folder as spam (-M = mbox input, -s = spam)
bogofilter -M -s < "$SPAMBOX"
# uncomment this if you want it to relearn your inbox each time as well
#bogofilter -M -n < "$HAMBOX" #this is ham
# -o 0.45 gives us a "tolerance rating"
# -2 gives a plain yes or no spam rating; change it to -3 if you also want an "unsure" result
# classify the message that Evolution pipes in on stdin
bogofilter -2 -o 0.45
ret=$?  # save the return value
echo $ret
exit $ret

The 0.45 seems to depend on the ratio of existing spam to ham (in your assigned folders).
Cat a known-spam test message through "|bogofilter -v" to see what it rates as, then set your threshold about there.
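
As a sketch of that check, the helper below pulls just the score out of the verbose output. spamicity_of is a made-up name, and the pattern assumes that "bogofilter -v" prints an X-Bogosity line containing "spamicity=<number>"; if your version prints something different, adjust the sed expression.

```shell
#!/bin/bash
# Print the score bogofilter reports for a single message file.
# Assumption: "bogofilter -v" prints a line containing "spamicity=<number>".
spamicity_of() {
    bogofilter -v < "$1" | sed -n 's/.*spamicity=\([0-9.]*\).*/\1/p'
}
```

Run it on a few saved spam and ham messages (one message per file) and pick a value for -o that sits between the two groups.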

Step 4     Configuring the Evolution filters

Evolution uses XML to describe its filters. You might want to configure some simple filters to shuffle mail around first, so you know your basic filter subsystem is functioning OK.
You can either add the filter manually, by going to the filters menu and creating a new filter, or add it to your filters.xml directly (see below).

For the manual route you want: Pipe message to shell command -> [path of scanspam (/home/user/bin/scanspam)] -> returns -> 0
[Screenshot: example filter]
If you are adept at dealing with XML, look in ~/evolution/filters.xml; you can just add the following slice of XML (careful not to break your existing structure).

    <rule grouping="any" source="incoming">
      <title>SPAM!</title>
      <partset>
        <part name="pipe">
          <value name="command" type="command">
            <command>/home/user/bin/scanspam</command>
          </value>
          <value name="retval-type" type="option" value="is"/>
          <value name="retval" type="integer" integer="0"/>
        </part>
      </partset>
      <actionset>
        <part name="move-to-folder">
          <value name="folder" type="folder">
            <folder uri="file:///home/user/evolution/local/Inbox/subfolders/spam"/>
          </value>
        </part>
      </actionset>
    </rule>
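
Before relying on the filter, it may be worth running the wrapper by hand on a saved message and looking at its exit status, since that is what the rule above compares against 0. The sketch below uses a hypothetical helper name, try_wrapper, and relies on bogofilter's documented exit codes (0 spam, 1 non-spam, 2 unsure, 3 error), which the wrapper passes straight through.

```shell
#!/bin/bash
# Feed one message file to the wrapper script and describe its exit status.
# The wrapper also echoes the code on stdout, so that output is discarded.
try_wrapper() {
    local wrapper=$1 message=$2 ret=0
    "$wrapper" < "$message" > /dev/null || ret=$?
    case "$ret" in
        0) echo "spam" ;;
        1) echo "ham" ;;
        2) echo "unsure" ;;   # only possible when the wrapper uses -3
        *) echo "error (exit $ret)" ;;
    esac
}
```

For example, try_wrapper /home/user/bin/scanspam /path/to/one/saved/message. Bear in mind that each run of the wrapper also relearns the spam folder, so it is no quicker than a real filter event.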

Step 5     Teaching some basics to the system and getting on with life...

Obviously you need to show bogofilter what is, and what is not, spam. The system will relearn what is spam each time it runs, but won't relearn what is NOT spam (i.e. ham) unless you tell it to.
I've done it this way because my inbox takes about 30 seconds to process on this 450MHz x86 PC. You may wish to place something like...
HAMBOX=/path/to/known/mbox/with/no/spam/and/lots/of/emails
SPAMBOX=/path/to/evolution/folder/full/of/known/spam

cat "$HAMBOX"|bogofilter -M -n #this is ham
cat "$SPAMBOX"|bogofilter -M -s #this is spam <-- optional
...into a cron job somewhere, so that maybe once per hour it will relearn what is not spam.
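
One way this could look as a standalone script is sketched below. The function name relearn_ham, the script path /home/user/bin/relearn-ham and the schedule are all illustrative assumptions; only the bogofilter flags come from the commands above.

```shell
#!/bin/bash
# Hypothetical helper, e.g. saved as /home/user/bin/relearn-ham.
# Registers every message in a known-clean mbox as ham (-M mbox, -n non-spam).
relearn_ham() {
    local hambox=$1
    [ -r "$hambox" ] || { echo "cannot read $hambox" >&2; return 1; }
    bogofilter -M -n < "$hambox"
}

# Example crontab line (add it with "crontab -e") to run the script hourly:
#   0 * * * * /home/user/bin/relearn-ham
```

To use it as a script, add a call at the bottom with your own ham mbox path, for example relearn_ham /home/user/evolution/local/Inbox/mbox, and chmod +x the file.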

A side effect of having the system relearn what is spam on each filter event is that corrections are easy: if it misses an item of spam, or has a false positive, move the mail item into the folder it belongs in (the spam folder for missed spam), then right-click on the moved item and apply filters; this will relearn. (Don't forget to expunge your folders before you do this, so you don't have stale data.)

Things to tweak

    In the shell script you can tweak the tolerance rating; this can improve results a lot.
    Try adding another filter to move messages to a "Maybe spam" folder. You will need to change -2 to -3 in the shell script where it calls bogofilter, and set the filter's return value check to "2".

    Finally, it is really important that you don't have any spam in your initial "ham" mbox file (even a couple of spams out of a thousand seems to be enough to skew the results pretty severely); otherwise the Bayesian analysis just won't be effective.
Another thing: you might also want to take a look at using SpamAssassin instead of bogofilter in your shell script/wrapper; however, I've found the results with bogofilter to be more "reliable".

Have fun!!

Any questions? leighm@linuxbandwagon.com

LinuxBandWagon Pty Ltd Contract Linux nerds
MailLaundry Pty ltd Commercial/Domain level mail filtering solution









And btw, John Howard is a gullible tool and needs to be excised from The Lodge as soon as possible.
(Last updated Dec 2003)