Lars Wirzenius describes his mail processing setup, especially how he trains and then uses his spam filter. Since I have about the same set of problems to tackle, and I too have my own imperfect solution, his article got me thinking. So I’ll first describe how I currently do my mail filtering, and then I hope I can propose something that’s an enhancement to both our setups. General setup: lots of email addresses all end up on one server, with one procmail and one maildir hierarchy that I access through IMAP.
I too have a just-in-case backup mailbox as my first procmail rule -- I purge it from time to time, but I'm happy to announce I have never needed to extract anything from it. Knock on wood. Then there's a clamscan invocation through a bit of Perl, which adds (or not) an X-CLAMAV: header. The following rule stores such marked messages into /home/roland/Maildir/.virus/, so as not to overload SpamAssassin (which used to eat my server before I put more RAM into it). And then, a pipe through spamc. Depending on the score obtained by the message, it'll end up in INBOX.spam (the "normal" spam folder that I check from time to time for false positives), INBOX.ultraspam (the folder for spam with really high scores, which I never check except when I'm really bored), or the normal thematic folders (filtered on various headers such as To:, From:, X-Mailing-List:, X-BeenThere: and so on). One of these days I'll add an INBOX.spam.highscores, with an ever-rising threshold score, just to keep an eye on how much spamminess can creep into a single message.
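To make the routing order concrete, here is a minimal sketch of the decision described above, written in Python rather than as the actual procmail rules. The two thresholds are hypothetical (the real values aren't given here), the function names are made up for illustration, and it assumes spamc leaves an X-Spam-Status: header carrying a score= (or, in older versions, hits=) field.

```python
import re
from email import message_from_string

# Hypothetical thresholds, for illustration only.
SPAM_THRESHOLD = 5.0
ULTRASPAM_THRESHOLD = 15.0

# Matches "score=7.5" (or the older "hits=7.5") inside X-Spam-Status.
_SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def spam_score(msg):
    """Return the score found in X-Spam-Status, or None if there is none."""
    match = _SCORE_RE.search(msg.get("X-Spam-Status", ""))
    return float(match.group(1)) if match else None


def choose_folder(msg):
    """Return the folder a message is diverted to, or None to fall through
    to the thematic rules (To:, From:, X-Mailing-List:, ...)."""
    # Virus-marked mail is diverted first, before any spam check.
    if msg.get("X-CLAMAV") is not None:
        return ".virus"
    score = spam_score(msg)
    if score is None or score < SPAM_THRESHOLD:
        return None
    if score >= ULTRASPAM_THRESHOLD:
        return "INBOX.ultraspam"
    return "INBOX.spam"


def route_raw(raw_text):
    """Convenience wrapper: parse a raw message string, then route it."""
    return choose_folder(message_from_string(raw_text))
```

The only point of the sketch is the ordering: virus check first, then the two spam tiers, then everything else.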
Okay, so that's it for the filtering part. Now for the teaching/learning part. I have a script that copies all messages from the spam folders into a temporary directory, and all messages from the non-spam folders into another directory, and runs sa-learn over these directories. Actually, I keep a timestamp of when the script was last run successfully, so as to only learn new messages. Of course, that only works if I know (or trust) that all messages are in their appropriate folder. So I periodically (weekly or so) check all folders for messages to move into, or out of, INBOX.spam, then run that script. The server being rather old and slow (yes, it's still my Pentium MMX running at 200 MHz), that script takes anywhere from a few minutes to half an hour to run, so I have been looking for a way to have cron run it unattended.
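The learning script boils down to something like the following sketch. It is not the actual script: the function names and the stamp-file mechanism are assumptions for illustration, and it assumes standard maildir folders (with cur/ and new/ subdirectories) and an sa-learn binary in the PATH, called with its --spam and --ham options.

```python
import os
import shutil
import subprocess
import tempfile
import time


def new_messages(folder, since):
    """Yield paths of messages in one maildir folder modified after `since`."""
    for sub in ("cur", "new"):
        directory = os.path.join(folder, sub)
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and os.path.getmtime(path) > since:
                yield path


def learn(folders, kind, since):
    """Copy new messages into a temporary directory and run sa-learn on it.
    `kind` is "spam" or "ham". Returns the number of messages copied."""
    count = 0
    with tempfile.TemporaryDirectory() as tmp:
        for folder in folders:
            for path in new_messages(folder, since):
                shutil.copy(path, os.path.join(tmp, f"{count:08d}"))
                count += 1
        if count:
            subprocess.run(["sa-learn", f"--{kind}", tmp], check=True)
    return count


def run(spam_folders, ham_folders, stamp_path):
    """Learn everything newer than the stamp file, then update the stamp."""
    since = os.path.getmtime(stamp_path) if os.path.exists(stamp_path) else 0.0
    started = time.time()
    learn(spam_folders, "spam", since)
    learn(ham_folders, "ham", since)
    # Only reached if both sa-learn runs succeeded (check=True raises otherwise).
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, (started, started))
```

The stamp is set to the time the run started rather than the time it finished, so that mail arriving during a half-hour run is picked up next time instead of being skipped.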
Lars uses crm114, which only needs to be trained on its past errors. If you trust SpamAssassin enough to let it auto-learn spam (and non-spam) beyond certain score thresholds, and accept that a few messages will never be learnt from, then I suppose you can use a similar pattern. In which case, my most recent idea would remove all need for manual triggering of the learning script: one can assume that messages marked as spam by SpamAssassin (or crm114) yet residing in a non-spam folder rightfully belong in their non-spam folder, and conversely that messages not marked as spam yet residing in a spam folder really are spam. Those are exactly the filter's errors, and they are the only messages it needs to learn from. So one could take the big learning script, add a couple of greps to it, and let it run every night without fear of encouraging the filter in its errors.
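The "couple of greps" amount to comparing the filter's verdict with the folder the message now lives in. A minimal sketch, assuming SpamAssassin's X-Spam-Flag: YES header marks what it considered spam (crm114 would need a different header test), with made-up function names:

```python
from email import message_from_string


def flagged_as_spam(msg):
    """True if the filter marked this message as spam when it arrived."""
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"


def correction(msg, in_spam_folder):
    """Return "spam" or "ham" if the filter got this message wrong,
    or None if the verdict agrees with the folder (nothing to learn)."""
    if flagged_as_spam(msg) == in_spam_folder:
        return None
    # The folder is taken as the truth, since only a human moves
    # a message against the filter's verdict.
    return "spam" if in_spam_folder else "ham"


def correction_for_raw(raw_text, in_spam_folder):
    """Convenience wrapper: parse a raw message string first."""
    return correction(message_from_string(raw_text), in_spam_folder)
```

A message the filter got right is simply skipped, so a nightly run can never reinforce a mistake that hasn't been corrected by hand yet; it just ignores it until the message is moved.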
The advantage over my current setup: no need to spend time making sure there are no misclassified messages in my maildir, then waiting for the script to be done, since it all happens crontabically while I'm peacefully sleeping. The advantage I can see over Lars's described setup (okay, it's a minor one): no need for a script in mutt or in Evolution, and nothing to do on the client but put messages into their appropriate folder; all the magic happens on the server.
Someday, I'll even implement that.