I had heard a lot in magazines about Bayesian filtering, which learns to identify spam by analyzing the messages you receive. As a developer of learning software, I thought this sounded like a really common-sense approach to the problem of stopping spam. After trying a couple of client-side versions, I settled on bogofilter, which is available from Hobbes (http://hobbes.nmsu.edu/pub/os2/apps/internet/mail/server/bogofilter-0.14.5.4b.zip).
I am running Weasel 1.50 as my mail server. Newer versions use a different style of REXX filter, so you'll need to adapt the concepts used in the SPAMPROCESS.CMD script if you are running one of those versions. When you specify the SPAMPROCESS.CMD filter in Weasel, make sure to have Weasel serialize the use of the filter: the script uses a fixed file for its processing and won't work reliably otherwise. You'll also have to adjust the script to use the folders where your mail accounts are set up.
The second section takes any mail to the postmaster account, a must for mail servers, and routes it all to a folder for spam so I can feed it to bogofilter later. I have a nice cron job which deletes all postmaster mail every night. I know that's some sort of computer geek sin, but that account gets twice as much junk as any other account and I really just don't have time to deal with it. By the way, don't just delete your postmaster account; blackhole lists really don't like servers without postmaster accounts.
The next section actually runs bogofilter.exe and makes a copy of the source message in the file temp.msg, with the bogofilter header tags added that mark the note as spam or not. Note the -3 command line parameter, which is supposed to make bogofilter mark each message as spam, not spam, or unsure. Unfortunately I've never seen a note marked as unsure, so I don't think this actually works.
The next three sections put a message on the Weasel console which indicates what bogofilter thought of the message, move the marked-up copy to the proper destination folder, and make a copy of the original note in one of three folders I set up for training: SPAM, NOSPAM or UNSURE. That's about it for the filter script.
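The routing decision described above can be sketched in a few lines. This is a Python illustration, not the REXX used in SPAMPROCESS.CMD. It assumes the verdict is the first comma-separated token of the X-Bogosity header; only "Yes" is confirmed by this article, so the "No" and "Unsure" tokens and the FOLDERS table are assumptions made for the example.

```python
from email import message_from_string

# Hypothetical mapping from the header verdict to the three training folders
# named in the article. Only "Yes" is confirmed; "No" and "Unsure" are assumed.
FOLDERS = {"yes": "SPAM", "no": "NOSPAM", "unsure": "UNSURE"}


def classify(raw_message):
    """Return the training folder suggested by the X-Bogosity header."""
    msg = message_from_string(raw_message)
    # Header names are matched case-insensitively by the email package.
    header = msg.get("X-Bogosity", "")
    # Take the first comma-separated token, e.g. "Yes" from "Yes, tests=bogus".
    verdict = header.split(",", 1)[0].strip().lower()
    # A missing or unrecognised verdict goes to UNSURE for review by hand.
    return FOLDERS.get(verdict, "UNSURE")
```

Sending anything unrecognised to UNSURE keeps a malformed header from silently landing in either training set.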
bogofilter comes with two REXX scripts: train-no-spam.cmd and train-spam.cmd. To train bogofilter, you just run the scripts against the directories where your spam and non-spam .MSG files are, like this:
train-spam e:\spammsgs
train-no-spam e:\notspammsgs
The big trick of course is separating spam from non-spam so that you have nice clean samples for training. According to the documentation it is really important not to misidentify messages while training.
The spamprocess.cmd script puts messages into what it thinks are the right folders. When I want to do training, it is important to be able to pick any known-good messages out of the spam and known-bad messages out of the good. To do this, I wrote an additional script called presort.cmd. It has a bunch of grep lines, which search all the MSG files for terms I know mean they are good messages and put the names of those MSG files into a file called good.cmd. It does the same for junk and puts those names into bad.cmd. You can build a similar list of terms for the sort of mail you get the first time you go through the .MSG files to sort them.
grep -i -l openal *.msg >good.cmd
grep -i -l dxr3 *.msg >>good.cmd
grep -i -l os2ddprog *.msg >>good.cmd
grep -i -l "postmaster@aurora-systems.com" *.msg >bad.cmd
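For readers without grep handy, the same presort idea can be sketched in Python. This is an illustration rather than the author's presort.cmd. The function names are made up for the example and the term lists are copied from the grep lines above. Unlike the appended grep output, a file that matches several good terms is listed only once.

```python
from pathlib import Path

# Search terms taken from the grep lines in the article.
GOOD_TERMS = ["openal", "dxr3", "os2ddprog"]
BAD_TERMS = ["postmaster@aurora-systems.com"]


def contains_any(text, terms):
    """Case-insensitive substring test, like grep -i with a fixed string."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def presort(folder):
    """Return (good, bad) lists of .msg filenames found in folder."""
    good, bad = [], []
    for path in sorted(Path(folder).iterdir()):
        # Accept .msg and .MSG alike, whatever the file system's case rules.
        if path.suffix.lower() != ".msg":
            continue
        text = path.read_text(errors="replace")
        if contains_any(text, GOOD_TERMS):
            good.append(path.name)
        if contains_any(text, BAD_TERMS):
            bad.append(path.name)
    return good, bad
```

As with the grep version, a message can appear in both lists, so the results still deserve a quick look before anything is moved.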
Once these .CMD files are created, they are just a list of filenames, so I use EPM's search and replace to insert MOVE commands to move the good and bad files to the proper folders ready for training. The rest of the messages I then scan through quickly with EPM (I love the way it loads multiple files) to make sure everything is where it should be. This is still time-consuming, but at least most of the work is done for you. I can go through 500 messages in about 10 minutes after they are presorted.
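The search-and-replace step can also be scripted. The sketch below is a Python illustration with a hypothetical helper name. It turns a list of filenames, such as the contents of good.cmd, into MOVE command lines, using the e:\notspammsgs directory from the training example as the destination.

```python
def move_commands(filenames, destination):
    """Turn a list of filenames into MOVE command lines, one per file."""
    commands = []
    for name in filenames:
        name = name.strip()
        if name:  # skip blank lines in the list
            commands.append(f"MOVE {name} {destination}")
    return commands


# Example: the lines would normally be read from good.cmd.
for line in move_commands(["0001.msg\n", "0002.msg\n"], r"e:\notspammsgs"):
    print(line)
```

Writing the resulting lines back to good.cmd gives the same kind of command file that the EPM edit produces.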
I just have the email client look for the X-bogosity: Yes part of the header; the rest of the header is interesting, but not of much practical use. I have the filter move the messages into a spam folder which I look at once a week or so just to make sure no mistakes have been made.
I still run JunkSpy 2.0 on the client because it catches notes that bogofilter doesn't. After I fed it about 2000 spam messages for training, bogofilter catches about 85% of the spam I get. JunkSpy catches nearly all of what bogofilter misses, and typically about 10 messages out of the 500 or so spam I get each week make it through to my inbox.
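To see what those rough figures imply for each stage, here is a small Python calculation. It takes the article's approximate numbers at face value (500 spam a week, 85% caught by bogofilter, 10 reaching the inbox), so the results are estimates, not measurements. The function name is hypothetical.

```python
def combined_catch(total_spam, first_rate, leaked):
    """Break a two-stage filter's results down from rough weekly figures."""
    caught_first = round(total_spam * first_rate)   # caught by the first filter
    missed_first = total_spam - caught_first        # passed on to the second filter
    caught_second = missed_first - leaked           # caught by the second filter
    return {
        "caught_first": caught_first,
        "missed_first": missed_first,
        "caught_second": caught_second,
        # Guard against division by zero when the first filter misses nothing.
        "second_rate": caught_second / missed_first if missed_first else 0.0,
        "overall_rate": (total_spam - leaked) / total_spam,
    }


stats = combined_catch(total_spam=500, first_rate=0.85, leaked=10)
print(stats)
```

With these inputs, bogofilter accounts for about 425 messages, JunkSpy for about 65 of the remaining 75, and the two together stop roughly 98% of the weekly spam.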
Shortly after I implemented this solution, the folks at JunkSpy (http://www.junkspy.com/) came out with a new version 3.0, which I haven't tried yet, but they claim it does a lot better job than previous versions.
This article is courtesy of www.os2ezine.com. You can view it online at http://www.os2ezine.com/20040316/page_2.html.